
Python Random Sample Selection Based On Multiple Conditions

I want to make a random sample selection in python from the following df such that at least 65% of the resulting sample should have color yellow and cumulative sum of the quantitie

Solution 1:

  1. Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size

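    import math
    import random
    yellow_size = random.randint(65, 100) / 100
    # sample() needs a whole number of rows; rounding up keeps the yellow share at least 65%
    df_yellow = df3[df3['color'] == 'yellow'].sample(n=math.ceil(yellow_size * sample_size))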
    
  2. Filter rows with other colors and select a random sample for the remainder of your sample size.

    # Fill whatever is left of the sample with non-yellow rows
    others_count = sample_size - len(df_yellow)
    df_others = df3[df3['color'] != 'yellow'].sample(n=others_count)
    
  3. Combine them both and shuffle the rows.

    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
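
The three steps above can be put together as one function. This is only a minimal sketch: the function name `sample_min_yellow`, its parameters, and the `seed` argument are assumptions made for illustration, and the demo frame reuses the data from Solution 3's example. It picks the yellow count as an integer up front so the share cannot fall below the minimum after rounding.

```python
import math
import random

import pandas as pd


def sample_min_yellow(df, sample_size, min_share=0.65, seed=None):
    """Draw sample_size rows with at least min_share of them yellow."""
    rng = random.Random(seed)
    yellow = df[df["color"] == "yellow"]
    others = df[df["color"] != "yellow"]

    # Smallest yellow count that satisfies the share; the tiny offset
    # guards against float noise in the multiplication.
    lowest = math.ceil(min_share * sample_size - 1e-9)
    # Yellow must also cover whatever the other rows cannot fill.
    lowest = max(lowest, sample_size - len(others))
    highest = min(len(yellow), sample_size)
    if lowest > highest:
        raise ValueError("not enough rows to build the requested sample")

    n_yellow = rng.randint(lowest, highest)
    state = rng.randrange(2**32)  # integer seed passed on to pandas
    parts = [yellow.sample(n=n_yellow, random_state=state)]
    if sample_size > n_yellow:
        parts.append(others.sample(n=sample_size - n_yellow, random_state=state))
    # Shuffle so the yellow rows are not grouped at the top.
    return pd.concat(parts).sample(frac=1, random_state=state)


# Demo data with the same columns as the example frame in Solution 3.
demo = pd.DataFrame({
    "id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})
picked = sample_min_yellow(demo, sample_size=5, seed=0)
print(picked)
```

With a sample size of 5, the function has to pick 4 or 5 yellow rows, since 3 of 5 would only be 60%.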
    

UPDATE:

If you want to check for both conditions simultaneously, this could be one way to do it:

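import math
import random

# Start from the full frame; the loop redraws until the total quantity is at most 18
df_sample = df
while df_sample['qty'].sum() > 18:
    yellow_size = random.randint(65, 100) / 100
    # sample() needs a whole number of rows; rounding up keeps the yellow share at least 65%
    df_yellow = df[df['color'] == 'yellow'].sample(n=math.ceil(yellow_size * sample_size))
    df_others = df[df['color'] != 'yellow'].sample(n=sample_size - len(df_yellow))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)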
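
A simpler variant of the same idea is plain rejection sampling: draw an ordinary random sample and keep it only if both conditions hold. The sketch below is an illustration, not the answer's exact method. The function name `sample_until_valid`, the `max_tries` cap, and the `seed` argument are assumptions, and the limit of 18 on the total quantity is taken from the loop condition above.

```python
import random

import pandas as pd


def sample_until_valid(df, sample_size, min_share=0.65, max_qty=18,
                       max_tries=1000, seed=None):
    """Redraw random samples until both conditions are met."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = df.sample(n=sample_size, random_state=rng.randrange(2**32))
        yellow_share = (candidate["color"] == "yellow").mean()
        if yellow_share >= min_share and candidate["qty"].sum() <= max_qty:
            return candidate
    # Stop instead of looping forever when the conditions cannot be met.
    raise RuntimeError("no valid sample found within max_tries")


# Demo data with the same columns as the example frame in Solution 3.
demo = pd.DataFrame({
    "id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})
picked = sample_until_valid(demo, sample_size=3, seed=1)
print(picked)
```

The cap on attempts matters because some combinations of sample size and quantity limit have no valid sample at all, in which case an uncapped loop would never end.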

Solution 2:

I would use this package to oversample your yellows into a new sample that has the balance you want:

https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html

From there just randomly select items and check sum until you have the set you want.

A less time-consuming approach would be to binary search over a range the length of your data frame, using each search value as your sample size, until you get the cumulative sum you want. This assumes the feature is symmetrically distributed.
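
One possible reading of the binary search idea is sketched below: shuffle the rows once, take the running total of `qty`, and binary search that running total for the largest sample size that stays within the limit. This is an interpretation rather than the answer's exact procedure. The function name `largest_prefix_within` and the limit of 18 are assumptions, the quantities are assumed to be non-negative so that the running total never decreases, and the yellow share is not checked here.

```python
import numpy as np
import pandas as pd


def largest_prefix_within(df, max_qty=18, qty_col="qty"):
    """Return the longest leading run of rows whose total qty <= max_qty."""
    running = df[qty_col].cumsum().to_numpy()
    # Binary search on the non-decreasing running total: the result is
    # the number of leading rows whose cumulative sum is <= max_qty.
    k = int(np.searchsorted(running, max_qty, side="right"))
    return df.iloc[:k]


# Demo data with the same columns as the example frame in Solution 3.
demo = pd.DataFrame({
    "id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})
shuffled = demo.sample(frac=1, random_state=1)
prefix = largest_prefix_within(shuffled)
print(prefix)
```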

Solution 3:

I think this example will help you. I add a column df2['yellow_rate'] and calculate the rate. You only need to check the value of df2.iloc[df2.shape[0] - 1]['yellow_rate'].

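import pandas as pd
df1 = pd.DataFrame({'id': ['A','B','C','D','E','G','H','I','J'], 'color': ['red','blue','green','yellow','yellow','yellow','orange','yellow','yellow'], 'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
# 1 for yellow, 0 otherwise, NaN for rows past the cumulative qty limit of 18
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply(lambda x: 1 if x == 'yellow' else 0)
# Append a row of column means; DataFrame.append was removed in pandas 2.0, so use pd.concat
summary = df2.sum(numeric_only=True) / df2.count(numeric_only=True)
df2 = pd.concat([df2.dropna(), summary.to_frame().T], ignore_index=True)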
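
The same check can also be done without the helper column or the extra summary row. The following is a minimal sketch under the same assumptions as the example above (a limit of 18 on the cumulative quantity, non-negative quantities); the variable names and the fixed `random_state` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["A", "B", "C", "D", "E", "G", "H", "I", "J"],
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

# Shuffle all rows, then keep the leading rows whose running qty is <= 18.
shuffled = df1.sample(frac=1, random_state=0)
selected = shuffled[shuffled["qty"].cumsum() <= 18]

# The mean of a boolean column is the fraction of True values.
yellow_share = (selected["color"] == "yellow").mean()
print(f"{len(selected)} rows, qty={selected['qty'].sum()}, "
      f"yellow share={yellow_share:.2f}")
```

If the printed share is below 0.65, the shuffle can be repeated with a different `random_state` until the share is high enough.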
