
Use External Library In Pandas_udf In Pyspark

Is it possible to use an external library like textdistance inside a pandas_udf? I have tried, and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a

Solution 1:

You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the resulting package with the --py-files option when you submit the Spark job, so that the library is available on the executors as well as on the driver.
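As an alternative to an egg, --py-files also accepts a plain .zip archive. The following is a minimal sketch of building one with the standard library, assuming the dependency is pure Python with no compiled extensions. The helper name zip_package is hypothetical and not part of Spark or textdistance.

```python
import os
import zipfile


def zip_package(package_dir: str, zip_path: str) -> str:
    """Zip the .py files of a package directory for use with --py-files."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Keep the package folder at the archive root so that
                    # "import <package>" resolves once the zip is on sys.path.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path
```

The resulting archive could then be passed as spark-submit --py-files deps.zip your_job.py (your_job.py being a placeholder), or added at runtime with spark.sparkContext.addPyFile("deps.zip").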

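By the way, the error message does not appear to be related to textdistance at all. pandas raises it whenever a Series is used where a single True/False value is expected.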
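The ValueError most likely arises because a pandas_udf receives whole pandas Series rather than single strings, so a function written for two strings ends up evaluating a Series in a boolean test. A minimal sketch of the usual workaround is to apply the scalar function to each pair of values. Here apply_pairwise and difflib_ratio are hypothetical names, and difflib from the standard library stands in for textdistance so the example runs without extra installs.

```python
from difflib import SequenceMatcher

import pandas as pd


def apply_pairwise(scorer, left: pd.Series, right: pd.Series) -> pd.Series:
    """Call a scalar two-string function on each pair of values."""
    scores = [float(scorer(a, b)) for a, b in zip(left, right)]
    return pd.Series(scores, index=left.index, dtype="float64")


def difflib_ratio(s1: str, s2: str) -> float:
    # Stand-in scorer (an assumption for this sketch); any function that
    # takes two strings and returns a number could be passed instead.
    return SequenceMatcher(None, s1, s2).ratio()


left = pd.Series(["spark", "pandas"])
right = pd.Series(["spark", "panda"])
result = apply_pairwise(difflib_ratio, left, right)
```

Inside a function decorated with @pandas_udf("double"), the body could then be a single call such as apply_pairwise(textdistance.ratcliff_obershelp, word1, word2), assuming textdistance is installed on the executors.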

Solution 2:

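You can use a regular Spark UDF instead, which is called with one pair of string values at a time rather than with whole Series. For example, to compute the Ratcliff-Obershelp similarity: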

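import textdistance
from pyspark.sql.types import FloatType

def my_ro(s1, s2):
    # Ratcliff-Obershelp similarity of two plain Python strings;
    # cast to float so the result always matches the declared FloatType
    return float(textdistance.ratcliff_obershelp(s1, s2))

# Register the function so it can be called from Spark SQL
spark.udf.register("my_ro", my_ro, FloatType())

# Assumes the DataFrame is visible to SQL under the name spark_df,
# e.g. via spark_df.createOrReplaceTempView("spark_df")
spark.sql("SELECT word1, word2, my_ro(word1, word2) AS ro FROM spark_df") \
    .show(100, False)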
