What Hashing Function Does Spark Use for HashingTF and How Do I Duplicate It?
Solution 1:
If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:
def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures
As you can see, it is just a plain old hash modulo the number of buckets.
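To duplicate just that index outside of Spark, you can evaluate the same expression in plain Python. A minimal sketch, assuming the default of 2**20 features and the same interpreter's built-in hash as the PySpark workers (in Python 3 you would also need to pin PYTHONHASHSEED, since string hashes are randomized per process):

NUM_FEATURES = 1 << 20  # HashingTF's default number of features

def term_index(term, num_features=NUM_FEATURES):
    # Same expression as indexOf above, just outside of Spark.
    return hash(term) % num_features

print(term_index("spark"))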
The final hash is just a vector of counts per bucket (I've omitted the docstring and RDD case for brevity):
def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
If you want to ignore frequencies, you can use set(document) as the input, but I doubt there is much to gain here. To create a set you'll have to compute the hash for each element anyway.
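Putting those pieces together, a rough standalone re-creation of the transform above might look like the following. It returns a plain dict of bucket counts instead of a SparseVector, the binary flag mimics the set(document) idea, and it rests on the same hash assumptions as the sketch above; it is not Spark's own code:

def hashing_tf(document, num_features=1 << 20, binary=False):
    # Map each term to a bucket and count occurrences per bucket.
    terms = set(document) if binary else document
    freq = {}
    for term in terms:
        i = hash(term) % num_features
        freq[i] = freq.get(i, 0) + 1.0
    return freq

print(hashing_tf(["a", "b", "a"]))                # keeps frequencies
print(hashing_tf(["a", "b", "a"], binary=True))   # ignores frequencies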
Solution 2:
It seems to me that there is something else going on under the hood beyond what is in the source that zero323 linked. I found that hashing and then taking the modulus as the source code does wouldn't give me the same indices that HashingTF generates. At least for single characters, what I had to do was convert the char to its ASCII code, like so (Python 2.7):
index = ord('a') # 97
This corresponds to what HashingTF outputs for the index. If I did the same thing that HashingTF appears to do, which is:
index = hash('a') % (1 << 20)  # 897504
I would very clearly get the wrong index.
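One explanation that would be consistent with this observation, though it is an assumption on my part rather than something taken from the PySpark source quoted above, is that the index is computed on the JVM side using Java's String.hashCode, which for a single-character string equals its ASCII code. If that is what your Spark version does, the index could be reproduced from Python with a sketch like this:

def java_string_hashcode(s):
    # Assumed port of Java's String.hashCode: h = 31*h + char, with
    # 32-bit signed overflow. For 'a' this gives 97 == ord('a').
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    if h >= 1 << 31:   # reinterpret as a signed 32-bit int
        h -= 1 << 32
    return h

# Python's % already returns a non-negative result for a negative hash.
index = java_string_hashcode('a') % (1 << 20)
print(index)  # 97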