
How To Preserve Number Of Records In Word2vec?

I have 45,000 text records in my dataframe. I wanted to convert those 45,000 records into word vectors so that I can train a classifier on the word vectors. I am not tokenizing the sentences.

Solution 1:

If you are splitting each entry into a list of words, that's essentially 'tokenization'.

Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve', no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 vectors at the end.

Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.

If you only have word-vectors, one simplistic way to create a vector for a larger text is to just add all the individual word vectors together. Further options include: whether to use unit-normed word-vectors or the raw vectors (which vary in magnitude); whether to then unit-norm the sum; and whether to weight the words by some other importance factor (such as TF-IDF).
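Those options can be sketched with plain numpy; the `text_vector` helper and the toy word-vector table below are illustrative, not part of any library:

```python
import numpy as np

# Toy word-vector lookup standing in for a trained model's vectors.
word_vecs = {
    "machine":  np.array([1.0, 0.0, 0.0]),
    "learning": np.array([0.0, 2.0, 0.0]),
    "fun":      np.array([0.0, 0.0, 0.5]),
}

def text_vector(tokens, word_vecs, norm_words=False, norm_sum=True, weights=None):
    """Sum word vectors into one text vector, with the options above."""
    parts = []
    for tok in tokens:
        if tok not in word_vecs:        # skip out-of-vocabulary words
            continue
        v = word_vecs[tok]
        if norm_words:                  # option: unit-norm each word vector
            v = v / np.linalg.norm(v)
        if weights:                     # option: importance weighting (e.g. TF-IDF)
            v = weights.get(tok, 1.0) * v
        parts.append(v)
    if not parts:
        return np.zeros(len(next(iter(word_vecs.values()))))
    total = np.sum(parts, axis=0)
    if norm_sum:                        # option: unit-norm the final sum
        total = total / np.linalg.norm(total)
    return total

vec = text_vector(["machine", "learning", "fun"], word_vecs)
```

With `norm_sum=True` every text vector has length 1, which makes cosine-style comparisons between texts straightforward.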

Note that unless your documents are very long, this is a quite small training set for either Word2Vec or Doc2Vec.
