Sorting Pandas Dataframe With German Umlaute
Solution 1:
You could use sorted
with a locale aware sorting function (in my example, setlocale
returned 'German_Germany.1252'
) to sort the column values. The tricky part is to sort all the other columns accordingly. A somewhat hacky solution would be to temporarily set the index to the column to be sorted and then reindex on the properly sorted index values and reset the index.
import functools
import locale
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern'],'code':['ö','z','b']})
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
Output of print(df):
location code0 Bern b1 Österreich ö
2 Zürich z
Update for mixed type columns If the column to be sorted is of mixed types (e.g. strings and integers), then you have two possibilities:
a) convert the column to string and then sort as written above (result column will be all strings):
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location=df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345''Bern''Österreich''Zürich']
b) sort on a copy of the column converted to string (result column will retain mixed types)
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']
Solution 2:
you can use unicode NFD normal form
>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3 N
1 Ost
0 Österreich
2 S
dtype: object# use result to rearrange a dataframe>>> df[names.str.normalize('NFD').sort_values().index]
It's not quite what you wanted, but for proper ordering you need language knowladge (like locale you mentioned).
NFD employs two symbols for umlauts e.g. Ö
becomes O\xcc\x88
(you can see the difference with names.str.normalize('NFD').encode('utf-8')
)
Post a Comment for "Sorting Pandas Dataframe With German Umlaute"