Pandas Split Column Name

I have a test dataframe that looks something like this: data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=['id', 'sex', 'split'

Solution 1:

Here is another way. It assumes that every value ends with the word Low or High, so that we can use .str.endswith() to identify which rows are Low and which are High.

Here is the sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df

    group_level
0     group0Low
1    group0High
2     group1Low
3    group1High
4      routeLow
5     routeHigh
6   landmarkLow
7  landmarkHigh

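Using np.where, we can do the following:

# Label each row by its suffix; anything that does not end in 'Low' is treated as 'High'
df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
# Strip the suffix: 3 characters for 'Low', 4 characters for 'High'
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])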

df

    group_level level     group
0     group0Low   Low    group0
1    group0High  High    group0
2     group1Low   Low    group1
3    group1High  High    group1
4      routeLow   Low     route
5     routeHigh  High     route
6   landmarkLow   Low  landmark
7  landmarkHigh  High  landmark
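
The np.where version labels every row that does not end in Low as High, so it relies on the suffix assumption holding for all rows. A minimal sketch of one way to check that assumption first and leave unexpected rows empty; the 'routeMid' row and the name `unmatched` are made up for illustration:

```python
import pandas as pd

# Small sample with one row ('routeMid') that breaks the Low/High assumption.
df = pd.DataFrame({'group_level': ['group0Low', 'group0High', 'routeLow', 'routeMid']})

is_low = df['group_level'].str.endswith('Low')
is_high = df['group_level'].str.endswith('High')

# Rows matching neither suffix; np.where alone would label these 'High'.
unmatched = df.loc[~(is_low | is_high), 'group_level'].tolist()

# Assign only where a suffix was actually found; other rows stay NaN.
df.loc[is_low, 'level'] = 'Low'
df.loc[is_high, 'level'] = 'High'
df.loc[is_low, 'group'] = df.loc[is_low, 'group_level'].str[:-len('Low')]
df.loc[is_high, 'group'] = df.loc[is_high, 'group_level'].str[:-len('High')]
```

If `unmatched` is empty, this gives the same result as the np.where version.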

Solution 2:

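I suppose it depends on how general the strings you're working with are. Assuming the level is always delimited by a capital letter, you can do the following. Because the first group is greedy, the split happens at the last capital letter in the string.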

In [30]:
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh',
               'landmarkLow', 'routeMid', 'group0Level'])
# Raw string avoids invalid-escape warnings; \w already covers digits
s.str.extract(r'(\w*)([A-Z]\w*)')

Out[30]:
          0      1
0     route   High
1     route    Low
2  landmark   High
3  landmark    Low
4     route    Mid
5    group0  Level

You can even name the columns of the result in the same line by doing

s.str.extract(r'(?P<group>\w*)(?P<Level>[A-Z]\w*)')

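So in your use case you can do:

# .str is required to reach extract on a Series
group_level_df = stacked.group_level.str.extract(r'(?P<group>\w*)(?P<Level>[A-Z]\w*)')
# axis=1 adds the new columns side by side instead of appending rows
stacked = pd.concat([stacked, group_level_df], axis=1)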
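
A self-contained sketch of that step, assuming a made-up `stacked` frame (the question's real data is not shown in full, so the `value` column here is hypothetical). It calls `.str.extract` on the column and passes `axis=1` to `pd.concat` so the extracted columns are added side by side rather than appended as rows:

```python
import pandas as pd

# Hypothetical stand-in for the question's `stacked` frame.
stacked = pd.DataFrame({
    'group_level': ['routeHigh', 'routeLow', 'landmarkHigh', 'group0Level'],
    'value': [6, 5, 6, 1],
})

# Named groups become the column names of the extracted frame.
parts = stacked['group_level'].str.extract(r'(?P<group>\w*)(?P<Level>[A-Z]\w*)')

# Both frames share the same index, so axis=1 lines the rows up one to one.
stacked = pd.concat([stacked, parts], axis=1)
```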

Here's another approach, which only assumes that you know the level names in advance. Suppose you have three levels:

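lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:
    # Match on the lowercased copy so 'Low', 'low' and 'LOW' are all found
    rows_in = lower.str.contains(level)
    stacked.loc[rows_in, 'level'] = level.capitalize()
    # level is lowercase here, so the replacement must ignore case to remove e.g. 'High'
    stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '', case=False, regex=True)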

This should work as long as the level name doesn't also appear in the group name, e.g. 'highballHigh'. Rows where group_level doesn't contain any of these levels end up with null values in the new columns.
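
One possible way around the 'highballHigh' caveat is to anchor the level name to the end of the string, so that only a trailing level is split off. This is a sketch rather than part of the original answer; the sample values and the name `out` are made up for illustration:

```python
import re

import pandas as pd

s = pd.Series(['highballHigh', 'routeLow', 'other'])

# '$' ties the level to the end of the string; the lazy '.*?' leaves
# everything before that trailing level in the group name.
pattern = r'^(?P<group>.*?)(?P<level>low|mid|high)$'
out = s.str.extract(pattern, flags=re.IGNORECASE)
```

As with the loop, rows that do not end in one of the known levels (here 'other') come back as null in both columns.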
