
Optimal Feature Selection Technique After PCA?

I'm implementing a classification task with a binary outcome using RandomForestClassifier, and I know the importance of data preprocessing for improving the accuracy score. In particular ...

Solution 1:

Let's begin with: when should we use PCA?

PCA is most useful when you are not sure which components of your data affect the accuracy.

Think about a face-recognition task. Can we say at a glance which pixels are the most crucial?

For instance: the Olivetti faces. 40 people, a dark homogeneous background, varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

So if we look at the correlations between the pixels:

from sklearn.datasets import fetch_olivetti_faces
from numpy import corrcoef
from numpy import zeros_like
from numpy import triu_indices_from
from matplotlib.pyplot import figure
from matplotlib.pyplot import get_cmap
from matplotlib.pyplot import plot
from matplotlib.pyplot import colorbar
from matplotlib.pyplot import subplots
from matplotlib.pyplot import suptitle
from matplotlib.pyplot import imshow
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import savefig
from matplotlib.image import imread
import seaborn


olivetti = fetch_olivetti_faces()

X = olivetti.images  # Image data, shape (400, 64, 64)
y = olivetti.target  # Labels (person id, 0-39)

X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))

seaborn.set(font_scale=1.2)
seaborn.set_style("darkgrid")
mask = zeros_like(corrcoef(X))
mask[triu_indices_from(mask)] = True

with seaborn.axes_style("white"):
    f, ax = subplots(figsize=(20, 15))
    ax = seaborn.heatmap(corrcoef(X), 
                         annot=True, 
                         mask=mask, 
                         vmax=1,
                         vmin=0,
                         square=True, 
                         cmap="YlGnBu",
                         annot_kws={"size": 1})
    
savefig('heatmap.png')

[Figure: correlation heatmap of the Olivetti face data]

From the heatmap above, can you tell which pixels are the most important for the classification?

However, if I ask you, "Could you please tell me the most important features for chronic kidney disease?"

You can tell me at a glance:

[Figure: chronic kidney disease features, where the important ones stand out at a glance]
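For tabular data like that, the important features can often be read straight off the model. This is not part of the original answer, just a minimal sketch that uses the scikit-learn breast-cancer dataset as a stand-in for a medical tabular dataset:

from numpy import argsort
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# A tabular stand-in dataset with named features (not the CKD data from the figure)
data = load_breast_cancer()

# Fit a forest and rank the features by impurity-based importance
forest = RandomForestClassifier(random_state=42).fit(data.data, data.target)
ranking = argsort(forest.feature_importances_)[::-1]

for idx in ranking[:10]:
    print("{:<25s} {:.3f}".format(data.feature_names[idx], forest.feature_importances_[idx]))

With a handful of named columns you can inspect the ranking directly; with 4096 anonymous pixels you cannot, which is where PCA comes in.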

Returning to the face-recognition task: do we really need all the pixels for the classification?

No, we don't.

[Figure: a face rendered with only 63 pixels]

Above you can see that only 63 pixels suffice for a human to recognize a face.

Please note that 63 pixels suffice to recognize something as a face, not for face recognition. You need more pixels to discriminate between faces.

So what we do is reduce the dimensionality. You might want to read more about the curse of dimensionality.

OK, so we decide to use PCA, since we don't need every pixel of the face image. We have to reduce the dimension.

To make it visually understandable, I'll use 2 dimensions.

def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
    """Scatter-plot the first two transformed components, colored by class label."""
    x_obj = obj.transform(x)
    idx_range = class_num * sample_num
    fig = figure(figsize=(6, 3), dpi=dpi)
    ax = fig.add_subplot(1, 1, 1)
    c_map = get_cmap(name='jet', lut=class_num)
    scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                         s=10, cmap=c_map)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title.format(class_num))
    colorbar(mappable=scatter)
    


from sklearn.decomposition import PCA

pca_obj = PCA(n_components=2).fit(X)
x_label = "First Principal Component"
y_label = "Second Principal Component"
title = "PCA Projection of {} people"
projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)

[Figure: 2-component PCA projection of the 40 people]

As you can see, PCA with 2 components isn't sufficient to discriminate between the 40 people.

So how many components do you need?

def display_n_components(obj):
    figure(1, figsize=(6,3), dpi=300)
    plot(obj.explained_variance_, linewidth=2)
    xlabel('Components')
    ylabel('Explained Variance')


pca_obj2 = PCA().fit(X)
display_n_components(pca_obj2)

[Figure: explained variance per principal component]

You need around 100 components for good discrimination.
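If you'd rather not read the number off the plot, you can pick it from the cumulative explained-variance ratio. A minimal sketch, assuming X is the flattened Olivetti matrix from above and a 95% variance threshold (the threshold itself is my assumption, not something from the original answer):

from numpy import cumsum, argmax
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)

# Fraction of total variance kept by the first k components
cum_var = cumsum(pca_full.explained_variance_ratio_)
n_95 = argmax(cum_var >= 0.95) + 1
print("{} components keep 95% of the variance".format(n_95))

# Equivalently, scikit-learn accepts a float threshold directly
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)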

Now we need to split the train and test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# X was already flattened above, so no further reshaping is needed.

# Fit PCA on the training set only, to avoid leaking test information.
pca = PCA(n_components=100).fit(X_train)
X_pca_tr = pca.transform(X_train)
X_pca_te = pca.transform(X_test)

forest1 = RandomForestClassifier(random_state=42)
forest1.fit(X_pca_tr, y_train)
y_pred = forest1.predict(X_pca_te)
print("\nAccuracy: {:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred) * 100))

The accuracy will be:

[Image: accuracy output with PCA]

You might wonder: does PCA improve the accuracy?

The answer is yes.

Without PCA: [Image: accuracy output without PCA]
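Finally, since the question is about the optimal number of components, one common approach (not part of the original answer, just a sketch) is to tune n_components together with the classifier in a Pipeline, so that PCA is refit inside every cross-validation fold:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Chain PCA and the forest so each CV fold fits PCA on its own training part
pipe = Pipeline([
    ("pca", PCA()),
    ("forest", RandomForestClassifier(random_state=42)),
])

# Candidate component counts; adjust the grid to your data
param_grid = {"pca__n_components": [50, 100, 150]}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print("Test accuracy: {:,.2f}%".format(search.score(X_test, y_test) * 100))

This keeps the component count an honest hyperparameter instead of a number read off a single plot.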
