
Wednesday, 29 April 2020

StandardScaler: Why fit_transform for train data and only transform for test data?

Below are the steps to perform feature scaling after the train/test split:
# Perform Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
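
What do the two calls actually compute? Here is a minimal, runnable sketch (the small arrays are made up for illustration): fit_transform first learns the column mean and standard deviation from the data it receives and stores them on the scaler, while transform only applies statistics that were already learned.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)  # learns mean_ and scale_ from X_train, then scales it
X_test_scaled = sc_X.transform(X_test)        # reuses the stored train mean_ and scale_

print(sc_X.mean_)   # [2.5]      -> mean of the train column
print(sc_X.scale_)  # [1.118...] -> standard deviation of the train column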

We can see that a different function is used for the train set than for the test set. The question arises: why use different functions, and would it be correct to use fit_transform for both the train and the test set?

Let's look at the manual steps for computing the scaling:

First Case: fit_transform for the train set and transform for the test set (different functions, as in the code above):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) =  (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) =  (X_test - X_train_mean) / X_train_std_dev


Second Case: fit_transform for both the train and the test set (the same function for both):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) =  (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) =  (X_test - X_test_mean) / X_test_std_dev
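
Both cases can be checked by hand with NumPy. The sketch below uses made-up numbers and the population standard deviation (ddof = 0), which is what StandardScaler uses internally; sklearn's transform on the test set reproduces the first case.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [6.0]])

train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)  # ddof = 0, like StandardScaler
test_mean, test_std = X_test.mean(axis=0), X_test.std(axis=0)

X_test_first = (X_test - train_mean) / train_std   # first case: train statistics
X_test_second = (X_test - test_mean) / test_std    # second case: the test set's own statistics

sc = StandardScaler().fit(X_train)
print(np.allclose(sc.transform(X_test), X_test_first))  # True -> sklearn follows the first case
print(X_test_first.ravel())   # [-0.447  3.130]
print(X_test_second.ravel())  # [-1.  1.]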


So the difference between the first and the second case lies in whether X_train_mean and X_train_std_dev or X_test_mean and X_test_std_dev are used to compute X_test(scaled).

Which approach is correct? In a real scenario we want to know how the model we built performs on new data. What happens if the new data contains only a few samples? Will we get meaningful mean and standard deviation values from it? What happens if there is only a single reading? This is why we use the first case: the test data is scaled with the statistics learned from the training data. We take the same approach with other transformers when building models with libraries, for example PCA:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
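
To illustrate the "only one reading" argument with a small, made-up example: a single new sample has no usable statistics of its own (its standard deviation is zero, so the second case would divide by zero), but it can still be scaled with the mean and standard deviation already learned from the training data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
new_sample = np.array([[3.5]])  # a single new observation

sc = StandardScaler().fit(X_train)
print(sc.transform(new_sample))  # scaled with the train mean and std: [[0.894...]]

print(new_sample.std(axis=0))    # [0.] -> the second case would not even be computable here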

