Python Random Sample Selection Based On Multiple Conditions
Solution 1:
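Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size.
import math
import random
yellow_size = random.randint(65, 100) / 100
# sample() needs an integer row count; rounding up keeps the yellow share at or above yellow_size
df_yellow = df3[df3['color'] == 'yellow'].sample(n=math.ceil(yellow_size * sample_size))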
Filter rows with other colors and select a random sample for the remainder of your sample size.
# Take whatever is left after the yellow rows, so both parts add up to sample_size exactly
df_others = df3[df3['color'] != 'yellow'].sample(n=sample_size - len(df_yellow))
Combine them both and shuffle the rows.
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
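Put together, the three steps above can be sketched as one small helper. The function name `sample_with_yellow_share`, the `min_pct` and `seed` parameters, and the example data are assumptions made for illustration and are not part of the original answer. The sketch also assumes each color group has enough rows for the requested counts; otherwise `sample()` raises a `ValueError`.

```python
import math
import random

import pandas as pd


def sample_with_yellow_share(df, sample_size, min_pct=65, seed=None):
    """Hypothetical helper: draw sample_size rows, at least min_pct percent yellow."""
    rng = random.Random(seed)
    # Pick a yellow share between min_pct and 100 percent.
    share = rng.randint(min_pct, 100) / 100
    # Round up so the realised share never falls below the chosen one.
    n_yellow = min(sample_size, math.ceil(share * sample_size))
    n_others = sample_size - n_yellow
    yellow = df[df['color'] == 'yellow'].sample(
        n=n_yellow, random_state=rng.randint(0, 2**31 - 1))
    others = df[df['color'] != 'yellow'].sample(
        n=n_others, random_state=rng.randint(0, 2**31 - 1))
    # Combine both parts and shuffle the rows.
    return pd.concat([yellow, others]).sample(
        frac=1, random_state=rng.randint(0, 2**31 - 1))


# Made-up example data: six yellow rows and four rows of other colors.
df3 = pd.DataFrame({
    'id': list('ABCDEFGHIJ'),
    'color': ['red', 'blue', 'green', 'orange'] + ['yellow'] * 6,
    'qty': [5, 2, 3, 8, 4, 7, 6, 1, 5, 2],
})
df_sample = sample_with_yellow_share(df3, sample_size=5, seed=0)
print(df_sample)
```

Passing a `seed` makes the draw repeatable, which helps when checking the result by hand.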
UPDATE:
If you want to check for both conditions simultaneously, this could be one way to do it:
import math
import random
df_sample = df
while df_sample['qty'].sum() > 18:
    yellow_size = random.randint(65, 100) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(n=math.ceil(yellow_size * sample_size))
    df_others = df[df['color'] != 'yellow'].sample(n=sample_size - len(df_yellow))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
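The loop above keeps drawing until the quantity condition is met, so it never ends if no combination of rows qualifies. A bounded variant might look like the sketch below. The function name `sample_until_valid`, the `max_tries` cap, and the choice to return `None` on failure are assumptions for illustration, not part of the original answer.

```python
import math
import random

import pandas as pd


def sample_until_valid(df, sample_size, max_qty=18, min_pct=65,
                       max_tries=1000, seed=None):
    """Hypothetical helper: retry random draws until both conditions hold."""
    rng = random.Random(seed)
    yellow = df[df['color'] == 'yellow']
    others = df[df['color'] != 'yellow']
    for _ in range(max_tries):
        share = rng.randint(min_pct, 100) / 100
        n_yellow = min(sample_size, math.ceil(share * sample_size))
        n_others = sample_size - n_yellow
        # Skip splits that ask for more rows than a group contains.
        if n_yellow > len(yellow) or n_others > len(others):
            continue
        picked = pd.concat([
            yellow.sample(n=n_yellow, random_state=rng.randint(0, 2**31 - 1)),
            others.sample(n=n_others, random_state=rng.randint(0, 2**31 - 1)),
        ])
        # Accept the draw only if the total quantity stays within the limit.
        if picked['qty'].sum() <= max_qty:
            return picked.sample(frac=1, random_state=rng.randint(0, 2**31 - 1))
    return None  # no valid sample found within max_tries


# Made-up example data for the sketch.
df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J'],
    'color': ['red', 'blue', 'green', 'yellow', 'yellow', 'yellow',
              'orange', 'yellow', 'yellow'],
    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5],
})
result = sample_until_valid(df, sample_size=4, seed=1)
impossible = sample_until_valid(df, sample_size=4, max_qty=0, seed=1)
print(result)
```

Returning `None` leaves it to the caller to decide whether to relax a condition or report that no sample exists.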
Solution 2:
I would use this package to oversample your yellows into a new sample that has the balance you want:
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
From there, just randomly select items and check the sum until you have the set you want.
A less time-consuming approach would be to binary search over a range the length of your data frame, using the binary search term as your sample size, until you get the cumsum you want. This assumes the feature is symmetrically distributed.
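One possible reading of the binary search idea: after shuffling, the running total of `qty` never decreases (assuming no negative quantities), so the largest sample size that stays within the limit can be found with a binary search over that running total. The sketch below uses `numpy.searchsorted` for the search. The function name `largest_prefix_within` and the example data are assumptions for illustration, and the sketch covers only the quantity condition; the color share would still need a separate check.

```python
import numpy as np
import pandas as pd


def largest_prefix_within(df, max_qty=18, seed=None):
    """Hypothetical helper: shuffle, then keep the longest prefix with sum <= max_qty."""
    shuffled = df.sample(frac=1, random_state=seed)
    running = shuffled['qty'].cumsum().to_numpy()
    # With side='right', searchsorted returns how many running totals are <= max_qty.
    k = int(np.searchsorted(running, max_qty, side='right'))
    return shuffled.iloc[:k]


# Made-up example data for the sketch.
df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J'],
    'color': ['red', 'blue', 'green', 'yellow', 'yellow', 'yellow',
              'orange', 'yellow', 'yellow'],
    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5],
})
prefix = largest_prefix_within(df, max_qty=18, seed=1)
print(prefix)
```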
Solution 3:
I think this example will help you. I add a column df2['yellow_rate'] and calculate the rate in a summary row appended at the end. You then only need to check the value of df2.iloc[df2.shape[0] - 1]['yellow_rate'].
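import pandas as pd
df1 = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J'], 'color': ['red', 'blue', 'green', 'yellow', 'yellow', 'yellow', 'orange', 'yellow', 'yellow'], 'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
# Rows whose running qty total stays within 18 get 1 (yellow) or 0; later rows get NaN
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply(lambda x: 1 if x == 'yellow' else 0)
# DataFrame.append was removed in pandas 2.0, so the summary row is added with pd.concat
summary = df2.sum(numeric_only=True) / df2.count(numeric_only=True)
df2 = pd.concat([df2.dropna(), summary.to_frame().T], ignore_index=True)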
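The same check can also be sketched without the helper column and the summary row, by selecting the rows within the limit and taking the mean of a boolean mask. The variable names `shuffled`, `selected`, and `yellow_rate`, and the fixed `random_state`, are assumptions for illustration rather than part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J'],
    'color': ['red', 'blue', 'green', 'yellow', 'yellow', 'yellow',
              'orange', 'yellow', 'yellow'],
    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

# Shuffle all rows; the fixed random_state only makes the sketch repeatable.
shuffled = df1.sample(frac=1, random_state=42)
# Keep the leading rows whose running qty total stays within 18.
selected = shuffled[shuffled['qty'].cumsum() <= 18]
# The mean of a boolean mask is the fraction of True values, i.e. the yellow share.
yellow_rate = (selected['color'] == 'yellow').mean()

print(selected)
print(f"yellow share: {yellow_rate:.2f}")
```

If the share comes out below 0.65, the shuffle can simply be repeated with a different `random_state`.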