Load Pickled Classifier Data : Vocabulary Not Fitted Error

May 30, 2024

I have read all the related questions here but couldn't find a working solution. My classifier creation: class StemmedTfidfVectorizer(TfidfVectorizer): def build_analyzer(self):

Solution 1:

OK, I solved the issue by using a pipeline, so that my vectorizer gets saved inside the .pkl file along with the classifier.

Here's how it looks (it's also much simpler):

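import csv

import joblib  # formerly sklearn.externals.joblib, which recent scikit-learn no longer ships
import Stemmer  # provided by the PyStemmer package
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

english_stemmer = Stemmer.Stemmer('en')


class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Wrap the stock analyzer so that every token it yields is stemmed.
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


def create_tfidf(f):
    # Read a ';'-delimited CSV: column 0 holds the label, column 1 the text.
    docs = []
    targets = []
    with open(f, "r", newline="") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=';')
        next(reader)  # skip the header row
        for row in reader:
            docs.append(row[1])
            targets.append(row[0])
    return docs, targets


docs, y = create_tfidf("l1.csv")
# min_df=1 keeps every term (the original min_df=0 had the same effect,
# but recent scikit-learn rejects an integer min_df below 1).
tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1,
                            max_features=200000, stop_words='english')
clf = LinearSVC()

# Bundle vectorizer and classifier so they are fitted and saved together.
vec_clf = Pipeline([('tfvec', tf), ('svm', clf)])

vec_clf.fit(docs, y)

_ = joblib.dump(vec_clf, 'linearL0_3gram_100K.pkl', compress=9)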

And on the loading side:

import joblib
import Stemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

english_stemmer = Stemmer.Stemmer('en')


# The custom class is declared again here so the pickle can resolve it on load.
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


clf = joblib.load('linearL0_3gram_100K.pkl')
test = ["My super elaborate test string to test predictions"]
# predict() takes the raw strings; the pipeline vectorizes them internally.
print(test[0], clf.predict(test)[0])

Important things to mention:

The vectorizer (tf) is part of the pipeline, just like the classifier, so there's no need to declare a new vectorizer (which was the failing point earlier, since a fresh one lacks the vocabulary learned from the training data) or to call .transform() on the test string.
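
To make that point concrete, here is a minimal sketch of the same round trip. It rests on a few assumptions that are not in the original: the toy documents and labels are invented, a plain TfidfVectorizer stands in for the stemmed subclass so the snippet runs without PyStemmer, and an in-memory buffer replaces the .pkl file on disk.

```python
import io

import joblib
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented purely for this illustration.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorizer and classifier are fitted together as one object.
pipeline = Pipeline([("tfvec", TfidfVectorizer()), ("svm", LinearSVC())])
pipeline.fit(docs, labels)

# Round-trip through joblib; the BytesIO buffer stands in for the .pkl file.
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

# The restored pipeline still holds the fitted vocabulary,
# so raw text can be passed straight to predict().
prediction = restored.predict(["buy cheap pills"])[0]

# A freshly constructed vectorizer has no vocabulary, so transform() fails.
try:
    TfidfVectorizer().transform(["buy cheap pills"])
    fresh_vectorizer_failed = False
except NotFittedError:
    fresh_vectorizer_failed = True
```

The last few lines reproduce the failure from the question: only a vectorizer that has been fitted, or one restored together with its fitted state, can transform new text.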
