Skip to content Skip to sidebar Skip to footer

Use Agg In Python For Pd.dataframe Wiht Customized Function Whose Inputs Are Multiple Dataframe Columns

I have a data frame like this. mydf = pd.DataFrame({'a':[1,1,3,3],'b':[np.nan,2,3,6],'c':[1,3,3,9]}) a b c 0 1 NaN 1 1 1 2.0 3 2 3 3.0 3 3 3 6.0 9 I would like

Solution 1:

One way of doing it

def func(x):
    C= (x['b']/x['c']).max()
    D= (x['c']/x['b']).max()
    return pd.Series([C, D], index=['b_c','c_b'])
mydf.groupby('a').apply(func).reset_index()

Output

    a        b_c    c_b
0   1   0.666667    1.5
1   3   1.000000    1.5

Solution 2:

Prepend new temporary columns to the dataframe via assign, then do your groupby and max functions. This method should provide significant performance benefits.

>>> (mydf
     .assign(b_c=df['b'].div(df['c']), c_b=df['c'].div(df['b']))
     .groupby('a')[['b_c', 'c_b']]
     .max()
    )
        b_c  c_b
a               
1  0.666667  1.5
3  1.000000  1.5

Timings

# Sample data.
n = 1000  # Sample data number of rows = 4 * n.
data = {
    'a':  list(range(n)) * 4, 
    'b': [np.nan, 2, 3, 6] * n,
    'c': [1, 3, 3, 9] * n
}
df = pd.DataFrame(data)

# Solution 1.
%timeit df.assign(b_c=df['b'].div(df['c']), c_b=df['c'].div(df['b'])).groupby('a')[['b_c', 'c_b']].max()
# 3.96 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Solution 2.
def func(x):
    C= (x['b']/x['c']).max()
    D= (x['c']/x['b']).max()
    return pd.Series([C, D], index=['b_c','c_b'])

%timeit df.groupby('a').apply(func)
# 1.09 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Both solutions give the same result.


Post a Comment for "Use Agg In Python For Pd.dataframe Wiht Customized Function Whose Inputs Are Multiple Dataframe Columns"