Python Decision Tree Classification Of Complex Objects
Solution 1:
You can use scikit-learn, but you need to preprocess your data. Other implementations of decision trees can deal with categorical data directly, that will not solve your problems however. You still need to preprocess the data.
First, I would leave out the images, as using them is somewhat complex. For all the other variables, you need to encode them in a way that is sensible for machine learning. For example the available sizes could be encoded as a 0 or 1 depending on whether a given size is available. The colors could be encoded as a categorical if they come from a fixed set of strings. If this is a free text field, using a categorical might not be great (for example people might be using gray and grey, which would be two completely unrelated values, or have typos, etc.)
The descriptions and names are probably unique to each product, so using categorical variables there doesn't make sense, as each one will only be seen once. For these it would probably be best to encode them using a bag of word approach.
You can find a tutorial on text classification in the tutorials section of the scikit-learn documentation. You might want to have a look a the other tutorials, too.
Finally, I would suggest starting with a linear classifier, like Naive Bayes or LinearSVC. Single trees are mostly useful if you want to extract the actual rules, and are rarely used in text processing afaik (there are often tens or hundreds of thousands of features / word, so extracting meaningful rules is hard). If you want to use a tree-based method, using an ensemble like a random forest or gradient boosting will most likely yield better results.
Post a Comment for "Python Decision Tree Classification Of Complex Objects"