
Load Pickled Classifier Data : Vocabulary Not Fitted Error

I have read all the related questions here but couldn't find a working solution. My classifier creation: class StemmedTfidfVectorizer(TfidfVectorizer): def build_analyzer(self):

Solution 1:

OK, I solved the issue by using a pipeline, so that my fitted vectorizer gets saved inside the .pkl together with the classifier.

Here's how it looks (it is also much simpler):

import csv

import joblib  # sklearn.externals.joblib no longer exists in recent scikit-learn releases
import Stemmer  # PyStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

english_stemmer = Stemmer.Stemmer('en')


class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Wrap the stock analyzer so that every token it produces is stemmed.
        analyzer = super().build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


def create_tfidf(f):
    # Read a ';'-delimited CSV: column 0 holds the label, column 1 the text.
    docs = []
    targets = []
    with open(f, "r", newline="") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=';')
        next(reader)  # skip the first row (presumably a header)
        for row in reader:
            docs.append(row[1])
            targets.append(row[0])
    return docs, targets


docs, y = create_tfidf("l1.csv")
# min_df=1 filters nothing, like the original min_df=0, which recent scikit-learn rejects.
tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, max_features=200000, stop_words='english')
clf = LinearSVC()

# The pipeline keeps the fitted vectorizer and the classifier in a single object.
vec_clf = Pipeline([('tfvec', tf), ('svm', clf)])

vec_clf.fit(docs, y)

_ = joblib.dump(vec_clf, 'linearL0_3gram_100K.pkl', compress=9)

And on the other side, where the model is loaded:

import joblib  # sklearn.externals.joblib no longer exists in recent scikit-learn releases
import Stemmer  # PyStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

english_stemmer = Stemmer.Stemmer('en')


# The class has to be defined here as well: the pickle only stores a reference to it.
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


clf = joblib.load('linearL0_3gram_100K.pkl')
test = ["My super elaborate test string to test predictions"]
# predict() takes the raw strings; the pipeline vectorizes them itself.
print(test[0], "->", clf.predict(test)[0])

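Important things to mention:

The vectorizer (tf) is a step of the pipeline, so it is pickled together with the vocabulary it learned from the training data. On the loading side there is therefore no need to declare a new vectorizer (which was the failing point earlier, since a fresh vectorizer has no fitted vocabulary), nor to call .transform() on the test string: the pipeline does that inside predict(). The StemmedTfidfVectorizer class itself still has to be defined before loading, because the pickle refers to the class by name rather than storing its code.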
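To make the point above concrete, here is a minimal sketch, not the code from the question. It assumes a recent scikit-learn with the standalone joblib package, uses the plain TfidfVectorizer instead of the stemmed subclass, and trains on a few invented sentences. It shows that a brand-new vectorizer cannot transform text, while a pipeline that went through joblib keeps its fitted vocabulary.

```python
import io

import joblib
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for this illustration only.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["a good film"]

# A vectorizer that was never fitted has no vocabulary, so transform() fails.
fresh_vectorizer = TfidfVectorizer()
try:
    fresh_vectorizer.transform(test_docs)
    fresh_failed = False
except NotFittedError:
    fresh_failed = True

# Fitting the pipeline fits both the vectorizer and the classifier.
pipeline = Pipeline([("tfvec", TfidfVectorizer()), ("svm", LinearSVC())])
pipeline.fit(train_docs, train_labels)

# Round-trip through joblib; an in-memory buffer stands in for the .pkl file.
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

# The restored pipeline still carries the vocabulary learned during fit().
same_vocabulary = (
    restored.named_steps["tfvec"].vocabulary_
    == pipeline.named_steps["tfvec"].vocabulary_
)
same_prediction = list(restored.predict(test_docs)) == list(pipeline.predict(test_docs))

print("fresh vectorizer failed:", fresh_failed)
print("vocabulary preserved:", same_vocabulary)
print("same prediction after reload:", same_prediction)
```

With a custom subclass such as StemmedTfidfVectorizer the same round trip should work, provided the class is defined under the same module path in the script that loads the file.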
