Python: Split A String Field Into 3 Separate Fields Using Lambda

May 29, 2023

I have a Python data frame which includes a column called 'SEGMENT'. I want to break the column up into three columns. Please see my desired output highlighted in yellow. Belo

Solution 1:

Setup

import pandas as pd

df = pd.DataFrame({'SEGMENT': {0: 'Hight:33-48', 1: 'Hight:33-48', 2: 'Very Hight:80-88'}})

df
Out[17]: 
            SEGMENT
0       Hight:33-48
1       Hight:33-48
2  Very Hight:80-88

Solution

Use str.split with the regex ':|-' to break the column into 3 parts; expand=True returns the parts as a new DataFrame, whose columns are then renamed.

df.SEGMENT.str.split(':|-',expand=True)\
  .rename(columns=dict(zip(range(3),\
  ['SEGMENT','SEGMENT RANGE LOW','SEGMENT RANGE HIGH'])))
Out[13]: 
      SEGMENT SEGMENT RANGE LOW SEGMENT RANGE HIGH
0       Hight                33                 48
1       Hight                33                 48
2  Very Hight                80                 88
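
Note that the split produces strings, not numbers. As a minimal sketch, assuming every range bound is a whole number (an assumption, since the question only shows three rows), the two range columns could be converted afterwards with astype(int). The names parts and range_cols are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'SEGMENT': ['Hight:33-48', 'Hight:33-48', 'Very Hight:80-88']})

# Split on ':' or '-' and spread the pieces over three columns
parts = df['SEGMENT'].str.split(':|-', expand=True)
parts.columns = ['SEGMENT', 'SEGMENT RANGE LOW', 'SEGMENT RANGE HIGH']

# The pieces are strings; convert the two bounds to integers
# (assumes every bound is a whole number)
range_cols = ['SEGMENT RANGE LOW', 'SEGMENT RANGE HIGH']
parts[range_cols] = parts[range_cols].astype(int)

print(parts.dtypes)
```

With integer columns, comparisons and arithmetic on the bounds behave numerically rather than as string operations.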

Solution 2:

Use str.split with a regex that splits on either ':\s*' or '\s*-\s*' (| means "or" and \s* means zero or more whitespace characters):

df = pd.DataFrame({'SEGMENT': ['Hight: 33 - 48', 'Hight: 33 - 48', 'Very Hight: 80 - 88']})

cols = ['SEGMENT','SEGMENT RANGE LOW','SEGMENT RANGE HIGH']
df[cols] = df['SEGMENT'].str.split(r':\s*|\s*-\s*', expand=True)
print(df)
      SEGMENT SEGMENT RANGE LOW SEGMENT RANGE HIGH
0       Hight                33                 48
1       Hight                33                 48
2  Very Hight                80                 88

Solution with str.extract:

cols = ['SEGMENT','SEGMENT RANGE LOW','SEGMENT RANGE HIGH']
df[cols] = df['SEGMENT'].str.extract(r'([A-Za-z\s]+):\s*(\d+)\s*-\s*(\d+)', expand=True)
print(df)
      SEGMENT SEGMENT RANGE LOW SEGMENT RANGE HIGH
0       Hight                33                 48
1       Hight                33                 48
2  Very Hight                80                 88
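
One difference between the two approaches may be worth illustrating: str.extract returns NaN in every group when a row does not match the pattern, which makes malformed rows easy to spot. The sketch below assumes a made-up malformed value ('Unknown') that is not in the original data; the names pattern, extracted and bad_rows are illustrative.

```python
import pandas as pd

# 'Unknown' is a made-up malformed row, added only for illustration
df = pd.DataFrame({'SEGMENT': ['Hight: 33 - 48', 'Unknown', 'Very Hight: 80 - 88']})

pattern = r'([A-Za-z\s]+):\s*(\d+)\s*-\s*(\d+)'
extracted = df['SEGMENT'].str.extract(pattern, expand=True)
extracted.columns = ['SEGMENT', 'SEGMENT RANGE LOW', 'SEGMENT RANGE HIGH']

# Rows that did not match the pattern have NaN in every column
bad_rows = extracted[extracted.isna().all(axis=1)].index.tolist()
print(extracted)
print(bad_rows)
```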

Solution 3:

Because I like naming the columns directly in the str.extract regex, using named groups:

regex = r'\s*(?P<SEGMENT>[^:]+?)\s*:\s*(?P<SEGMENT_RANGE_LOW>\S+)\s*-\s*(?P<SEGMENT_RANGE_HIGH>\S+)\s*'
df.SEGMENT.str.extract(regex, expand=True)

     SEGMENT SEGMENT_RANGE_LOW SEGMENT_RANGE_HIGH
0       High                33                 48
1       High                33                 48
2  Very High                80                 88

Setup

df = pd.DataFrame({'SEGMENT': ['High: 33 - 48', 'High: 33 - 48', 'Very High: 80 - 88']})
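
To see where the column names come from, the same kind of named groups can be tried on a single string with Python's re module. This is only a sketch: the label group is written as [^:]+? here (this sketch's choice) so that a multi-word label such as 'Very High' is captured whole, and the names pattern, match and fields are illustrative.

```python
import re

# Pattern with named groups; [^:]+? is this sketch's choice so that
# multi-word labels such as 'Very High' are captured whole
pattern = re.compile(
    r'\s*(?P<SEGMENT>[^:]+?)\s*:'
    r'\s*(?P<SEGMENT_RANGE_LOW>\S+)\s*-'
    r'\s*(?P<SEGMENT_RANGE_HIGH>\S+)'
)

match = pattern.search('Very High: 80 - 88')
fields = match.groupdict()  # maps each group name to its captured text
print(fields)
```

str.extract uses these group names as the column labels of the DataFrame it returns.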

Solution 4:

columns = ['SEGMENT', 'SEGMENT RANGE LOW', 'SEGMENT RANGE HIGH']
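# Turn the colon into a hyphen so that one split on '-' yields all three parts
df['temp'] = df['SEGMENT'].str.replace(':', '-', regex=False).str.split('-')
for i, c in enumerate(columns):
    # Take the i-th part of each list and strip any surrounding whitespace
    df[c] = df['temp'].apply(lambda x: x[i].strip())
del df['temp']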

Replace the colon with a hyphen, then split on the hyphen to get a list of the three values for each row. Assign each value to its own column with a lambda, then delete the temporary column.
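
The same replace-then-split idea can also be sketched in a single pass, with a lambda that calls re.split on each value. This is an alternative sketch rather than the answer's own code; it assumes the spaced input format from the setup above, and the names parts and result are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'SEGMENT': ['High: 33 - 48', 'High: 33 - 48', 'Very High: 80 - 88']})
columns = ['SEGMENT', 'SEGMENT RANGE LOW', 'SEGMENT RANGE HIGH']

# Each row becomes a list of three strings; whitespace around ':' and '-' is dropped
parts = df['SEGMENT'].apply(lambda s: re.split(r'\s*[:-]\s*', s.strip()))

# Build the three columns from the lists, keeping the original index
result = pd.DataFrame(parts.tolist(), columns=columns, index=df.index)
print(result)
```

Building a new frame from the lists avoids the temporary column, at the cost of a row-by-row Python call that is likely slower than the vectorized str methods on large data.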

Solution 5:

I would do this with str.extract and a regex:

df.SEGMENT.str.extract(r'([A-Za-z ]+):(\d+)-(\d+)', expand=True).rename(
    columns={0: 'SEGMENT', 1: 'SEGMENT RANGE LOW', 2: 'SEGMENT RANGE HIGH'})

     SEGMENT SEGMENT RANGE LOW SEGMENT RANGE HIGH
0       High                33                 48
1       High                33                 48
2  Very High                80                 88
