Use External Library In Pandas_udf In Pyspark
Is it possible to use an external library like textdistance inside a pandas_udf? I have tried and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a
Solution 1:
You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file), and pass the resulting package with the --py-files option when you run Spark.
By the way, the error message doesn't seem to be related to textdistance at all.
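That error generally means a pandas Series was used where a single True/False value was expected. A pandas_udf passes whole pandas Series to the function rather than individual values, so one plausible cause (not confirmed here, since the question doesn't show the code) is that a function expecting two strings received two Series. Below is a minimal sketch of the element-wise workaround. The names similarity and similarity_series are hypothetical, and difflib from the standard library stands in for textdistance so the snippet runs without extra installs.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> float:
    # Stand-in scalar function (assumption): replace the body with
    # textdistance.ratcliff_obershelp(a, b) if textdistance is
    # installed on every executor.
    return SequenceMatcher(None, a, b).ratio()


def similarity_series(s1: pd.Series, s2: pd.Series) -> pd.Series:
    # Each argument is a whole batch of rows, so call the scalar
    # function once per pair instead of passing the Series directly.
    scores = [similarity(a, b) for a, b in zip(s1, s2)]
    return pd.Series(scores, index=s1.index, dtype="float64")


# Wrapping it for Spark (assumes pyspark and pyarrow are available and
# that df has string columns word1 and word2):
# from pyspark.sql.functions import pandas_udf
# ro_udf = pandas_udf(similarity_series, returnType="double")
# df.withColumn("ro", ro_udf("word1", "word2")).show()
```

Null values are not handled in this sketch; rows where either column is null would need to be filtered out or given a default before the scalar function is called.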
Solution 2:
You can use a Spark UDF, for example to implement the Ratcliff-Obershelp function:
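import textdistance
from pyspark.sql.types import FloatType

def my_ro(s1, s2):
    # Called once per row with two plain Python strings
    return textdistance.ratcliff_obershelp(s1, s2)

# Register the function so it can be called from Spark SQL
spark.udf.register("my_ro", my_ro, FloatType())
# Assumes spark_df is registered as a table or temporary view
spark.sql("SELECT word1, word2, my_ro(word1, word2) AS ro FROM spark_df") \
    .show(100, False)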