How To Preserve Number Of Records In Word2vec?
Solution 1:
If you are splitting each entry into a list of words, that's essentially 'tokenization'.
Word2Vec just learns vectors for each word, not for each text example ('record'), so there's nothing to 'preserve': no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 vectors at the end.
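To make the counting concrete, here is a minimal sketch in plain Python (no gensim involved) of how the number of word-vectors follows from the unique words that survive a frequency cutoff, not from the number of records. The toy `records` list and the helper `surviving_vocabulary` are made up for illustration, and the sketch assumes min_count keeps words whose total frequency is at least that value.

```python
from collections import Counter

# Hypothetical toy corpus: each record is already tokenized into a list of words.
records = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "bird", "flew"],
]

def surviving_vocabulary(tokenized_records, min_count):
    """Return the set of words whose total frequency is at least min_count."""
    counts = Counter(word for record in tokenized_records for word in record)
    return {word for word, n in counts.items() if n >= min_count}

vocab_all = surviving_vocabulary(records, min_count=1)
vocab_frequent = surviving_vocabulary(records, min_count=2)

# Number of records, then vocabulary size at each cutoff.
print(len(records), len(vocab_all), len(vocab_frequent))  # prints: 3 7 2
```

With 3 records there are 7 unique words, and only 2 of them appear at least twice, so a word-vector model trained with that cutoff would hold 2 vectors, whatever the record count.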
Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
If you only have word-vectors, one simple way to create a vector for a larger text is to just add all the individual word vectors together. Further options include choosing between the unit-normed word-vectors and the raw word-vectors, whose magnitudes vary; whether to then unit-norm the sum; and whether to otherwise weight the words by some other importance factor (such as TF/IDF).
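The options just described can be sketched in plain Python. Everything here is hypothetical: the 2-dimensional `word_vectors`, the helper names, and the `weights` dictionary (which stands in for something like TF/IDF scores) are invented for illustration, and real word-vectors would come from a trained model.

```python
import math

# Hypothetical 2-dimensional word vectors; real ones would come from a trained model.
word_vectors = {
    "cat": [3.0, 4.0],
    "sat": [1.0, 0.0],
    "mat": [0.0, 2.0],
}

def unit_norm(vec):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length > 0 else list(vec)

def text_vector(tokens, vectors, weights=None, norm_words=False, norm_sum=False):
    """Sum the vectors of known tokens, with optional weighting and normalization."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        if token not in vectors:
            continue  # words without a vector are skipped
        vec = unit_norm(vectors[token]) if norm_words else vectors[token]
        w = 1.0 if weights is None else weights.get(token, 1.0)
        total = [t + w * x for t, x in zip(total, vec)]
    return unit_norm(total) if norm_sum else total

tokens = ["cat", "sat", "mat", "unknown"]
raw_sum = text_vector(tokens, word_vectors)                            # raw vectors, plain sum
normed_sum = text_vector(tokens, word_vectors, norm_words=True)        # unit-norm each word first
weighted_sum = text_vector(tokens, word_vectors, weights={"cat": 2.0}) # per-word importance weights
final_unit = text_vector(tokens, word_vectors, norm_sum=True)          # unit-norm the result
```

With raw vectors, a high-magnitude word such as "cat" dominates the sum; unit-norming each word first gives every word an equal say. Which variant works best depends on the downstream task, so it is worth comparing them on your own data.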
Note that unless your documents are very long, 45,000 records is quite a small training set for either Word2Vec or Doc2Vec.