
Monday 10 May 2021

Quiz: Data PreProcessing

1. What are some examples of data quality problems?

A. Duplicate Data

B. Correlation between features

C. Missing values

D. All of the Above  


2. Which method is used for encoding categorical variables?

A. LabelEncoder

B. OneHotEncoder

C. None of the Above

D. All of the Above 


3. Which of the below is valid for imputation?

A. Imputation with mean/median

B. Imputing with random numbers

C. Imputing with one

D. All of the above


4. What is the purpose of feature scaling?

A. Accelerating the training time

B. Getting better accuracy

C. Both A and B

D. None


5. In standardization, the features will be rescaled with

A. Mean 0 and Variance 0

B. Mean 0 and Variance 1

C. Mean 1 and Variance 0

D. Mean 1 and Variance 1 


6. What is a Dummy Variable Trap?

A. Multicollinearity among the dummy variables

B. One variable predicts the value of other

C. Both A and B

D. None of the Above


7. Which of the following are feature scaling techniques?

A. Standardization

B. Normalization

C. Min-Max Scaling

D. All of the Above 


8. What is the best way to handle missing values in a dataset?

A. Dropping the missing rows or columns

B. Imputation with mean/median/mode value

C. Taking missing values into a new row or column

D. All of the above 

Solution:

1. D, 2. A, 3. A, 4. C, 5. B, 6. C, 7. D, 8. B


Hint:


What is Standardization?

In standardization, values are centered around the mean with a unit standard deviation, which means the mean of the attribute becomes 0 and the resulting distribution has a unit standard deviation.


What is Normalization?

In normalization, values are shifted and rescaled so that they fall between 0 and 1. It is also called Min-Max scaling.

There is no hard and fast rule for deciding which one to use on the data. The best way is to try each on the dataset and compare the results, as in the sketch below.
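A minimal sketch of both techniques using scikit-learn's StandardScaler and MinMaxScaler (the small array is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature column (illustrative values only)
X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardization: result has mean 0 and unit standard deviation
print(StandardScaler().fit_transform(X).ravel())

# Normalization (Min-Max scaling): values rescaled to the range [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())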


Tuesday 5 January 2021

Installing Multiple Python

sudo mkdir -p /opt
cd /opt
sudo yum install tk-devel gdbm-devel
sudo mkdir python
cd python
export http_proxy=http://www-proxy-idc.in.oracle.com:80
wget http://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz
tar xvzf Python-2.7.9.tgz
echo $PWD
cd Python-2.7.9
./configure --prefix=/opt/python2.7
echo $PWD
make
sudo make install
sudo ln -s /opt/python2.7/bin/python2.7 /usr/bin/python27
sudo ln -s /opt/python2.7/bin/idle2.7 /usr/bin/idle-python27
sudo ln -s /opt/python2.7/bin/pip2.7 /usr/bin/pip27
 

python27



curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

python27 get-pip.py
pip27 install pandas
pip27 install cx_oracle




Install cx_Oracle (Oracle Instant Client):

mkdir  /opt/oracle
cd /opt/oracle/
wget https://download.oracle.com/otn_software/linux/instantclient/185000/instantclient-basic-linux.x64-18.5.0.0.0dbru.zip
unzip instantclient-basic-linux.x64-18.5.0.0.0dbru.zip
cd instantclient_18_5/

The Instant Client is now in /opt/oracle/instantclient_18_5; add it to the library path:

export LD_LIBRARY_PATH=/opt/oracle/instantclient_18_5:$LD_LIBRARY_PATH



For Python 3.6:
wget http://www.python.org/ftp/python/3.6.5/Python-3.6.5.tgz
tar xvzf Python-3.6.5.tgz
echo $PWD
cd Python-3.6.5
./configure --prefix=/opt/python3.6
echo $PWD
make
sudo make install
sudo ln -s /opt/python3.6/bin/python3.6 /usr/bin/python36
sudo ln -s /opt/python3.6/bin/idle3.6 /usr/bin/idle-python36
sudo ln -s /opt/python3.6/bin/pip3 /usr/bin/pip36
which pip36
pip36 install django
pip36 install pandas
pip36 install cx_oracle
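Once pip36 install cx_oracle finishes, a quick connectivity check confirms that the Instant Client is picked up. This is only a sketch: the host, service name, and credentials below are placeholders, not real values.

import cx_Oracle

# Placeholder connection details -- replace with a real host, service name, and credentials
dsn = cx_Oracle.makedsn("dbhost.example.com", 1521, service_name="ORCLPDB1")
conn = cx_Oracle.connect(user="scott", password="tiger", dsn=dsn)

cur = conn.cursor()
cur.execute("SELECT sysdate FROM dual")   # trivial query to prove the round trip
print(cur.fetchone())
conn.close()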

Saturday 25 July 2020

Install Tensorflow Keras

Platform Windows 10
1> Create virtual env
#conda create -n tensorflow pip python=3.5

2> activate env
#activate tensorflow
#conda info --envs

3> Install tensorflow
#conda install -c conda-forge tensorflow
This will install TensorFlow 1.10.0.
#python -m pip install --upgrade pip
#pip install setuptools==39.1.0

4> Install Keras
#pip install keras==2.2.2

5> Install other packages

#pip install matplotlib
#pip install scikit-learn
#pip install pydot

6> Install Spyder separately so that you can launch it without activating your virtual env
#conda install spyder
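A minimal sketch to confirm the install works inside the activated env (the layer sizes are arbitrary, just enough to prove Keras can build a model on top of this TensorFlow):

import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense

print(tf.__version__, keras.__version__)   # expect roughly 1.10.0 and 2.2.2

# Tiny throwaway model
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=4))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()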

Monday 1 June 2020

Natural Language Processing: Q & A

Question: What is tokenization? What is its purpose?
Answer: Breaking a sentence into its constituent words is called tokenization. It is the basic first step for stopword removal, stemming, lemmatization, parsing, text mining, etc.
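A small sketch with NLTK (assuming nltk and its 'punkt' tokenizer data are installed):

from nltk.tokenize import word_tokenize

sentence = "Google is the most popular search engine"
tokens = word_tokenize(sentence)   # split the sentence into individual words
print(tokens)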




Question: Why do we need to remove stop words?
Answer: Stopwords are words that don't contribute much to the meaning of a sentence, such as "a", "the", "and", etc. In tasks like text classification, where we classify text into different categories, we remove stopwords to give more weight to the words that do contribute to the meaning of the text.
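A minimal sketch of stopword removal with NLTK (assuming the 'stopwords' corpus and 'punkt' data have been downloaded):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("Google is the most popular search engine")
filtered = [w for w in tokens if w.lower() not in stop_words]   # drops 'is', 'the', 'most'
print(filtered)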


Question: Why do we do stemming?
Answer: It is the process of reducing inflected words to their root word. Retrieving, searching, and identifying more forms of a word returns more results; in the absence of stemming, many results might be missed. That is why stemming is an integral part of search queries and information retrieval.


Question: What's the difference between stemming and lemmatization?
Answer:
Stemming is a crude heuristic that chops off the ends of words and sometimes removes derivational affixes.
Lemmatization refers to the use of a vocabulary and morphological analysis of words; its main aim is to remove inflectional endings only and return the base or dictionary form (lemma) of a word.
E.g. word: SAW
Stemming: S
Lemmatization: see or saw, depending on the form (noun or verb) in which it is used.
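A small NLTK sketch contrasting the two (assuming nltk and its WordNet data are available; note that a real stemmer will not always reproduce the textbook SAW -> S example above):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                # crude chop: 'studi'
print(lemmatizer.lemmatize("studies"))        # dictionary form: 'study'
print(lemmatizer.lemmatize("saw", pos="v"))   # treated as a verb: 'see'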


Question: Give one use case of stemming.
Answer: Search engines (e.g., Google, Bing)


Question: An N-gram is a combination of N consecutive words. How many bigrams can be formed from the sentence below?
"Google is the most popular search engine"
 Answer: 6
Bigrams: "Google is, is the, the most, most popular, popular search, search engine"
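A quick sketch that generates the same bigrams using NLTK's bigrams helper:

from nltk import bigrams

tokens = "Google is the most popular search engine".split()
print(list(bigrams(tokens)))   # 6 pairs: ('Google', 'is'), ('is', 'the'), ...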


Question: What is the difference between CountVectorizer and HashingVectorizer?
Answer:
HashingVectorizer:
- Use it if the dataset is large and there is no need for the resulting dictionary of tokens.
- Use it if you have maxed out your computing resources and it's time to optimize.
CountVectorizer:
- Use it if you need access to the actual tokens.
- Use it if you are worried about hash collisions (when the matrix size is small).
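A minimal side-by-side sketch with scikit-learn (the two-sentence corpus is made up for illustration; get_feature_names_out assumes a recent scikit-learn, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

corpus = ["Google is the most popular search engine",
          "Bing is a search engine"]

cv = CountVectorizer()
X_count = cv.fit_transform(corpus)
print(cv.get_feature_names_out())   # the actual tokens are recoverable

hv = HashingVectorizer(n_features=16)   # fixed-size output, no vocabulary is stored
X_hash = hv.transform(corpus)
print(X_hash.shape)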

Question: Arrange the components below of a text classification model into the right sequence.

Gradient Descent, Text Predictors, Text cleaning, Text annotation, Model tuning

Answer: Correct sequence: text cleaning to remove noise, text annotation to create more features, converting the text to numerical form (predictors), using gradient descent to fit the model, and then tuning the model.
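A compact sketch of that sequence with scikit-learn (the texts and labels are made up, the annotation step is omitted, SGDClassifier stands in for the gradient descent step, and GridSearchCV for model tuning):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

texts = ["great phone, loved it", "terrible battery, awful", "excellent camera", "awful screen"]
labels = [1, 0, 1, 0]   # toy sentiment labels

pipe = Pipeline([
    ("predictors", TfidfVectorizer(lowercase=True, stop_words="english")),  # cleaning + numerical form
    ("model", SGDClassifier(max_iter=1000)),                                # gradient-descent-based classifier
])

grid = GridSearchCV(pipe, {"model__alpha": [1e-4, 1e-3]}, cv=2)   # model tuning over a tiny grid
grid.fit(texts, labels)
print(grid.best_params_)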

Tuesday 12 May 2020

Difference between Ordinary Least Squares (OLS) and Gradient Descent for finding the best fit line


Ordinary Least Squares (OLS):

- A non-iterative method to find the best fit line such that the sum of squared differences between observed and predicted values is minimized.

Error (SSE) = Σ (y_act_i - y_pred_i)^2
Line: y = b0 + b1*x
where y_act_i is the actual value and y_pred_i is the predicted value for data point i.

- The formula above is for the univariate (one variable) case.
- For the multivariate case, with many variables, the formula becomes complicated and requires heavy calculation when implemented in software.
- Fails for collinear predictors (correlation between features).
- Can be run in parallel, but it is still complicated and expensive.

Gradient Descent:

- Finds the linear model parameters iteratively.
- Applies to non-linear models as well.
- Works well for collinear predictors.
- Saves a lot of calculation time, as it can run in parallel and distribute the load across multiple processors.

- Hypothesis: y_pred = c + m*x
- Cost Function: J(m, c) = Σ (y_pred - y_act)^2 / n, where n is the number of data points (see the sketch below)
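Below is a minimal NumPy sketch of gradient descent for this hypothesis and cost function; the data points and learning rate are made up for illustration:

import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

m, c = 0.0, 0.0          # initial parameters
lr, n = 0.01, len(x)     # learning rate and number of data points

for _ in range(5000):
    y_pred = c + m * x
    # Gradients of J(m, c) = sum((y_pred - y_act)^2) / n
    dm = (2.0 / n) * np.sum((y_pred - y) * x)
    dc = (2.0 / n) * np.sum(y_pred - y)
    m -= lr * dm
    c -= lr * dc

print(m, c)   # should end close to 2 and 1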

Saturday 9 May 2020

Deep Learning: Create Neural Network to Recognize plus and cross image

Problem:
1> Consider images of size 3x3 for plus and cross.
2> In terms of input, plus = [0,1,0,1,1,1,0,1,0] and cross = [1,0,1,0,1,0,1,0,1].


Algorithm:
1> Error at output layer = (Out_exp - Out_pred) * Out_pred * (1 - Out_pred)
2> Recalculated weight = weight + learning_rate * error * input
3> Error at hidden layer = (w_i1 * error1 + w_i2 * error2) * h_i * (1 - h_i)

Code:
import numpy as np
import os
import random
import cv2
import matplotlib.pyplot as plt
import math

lr = 1 #learning rate
#Initialize Bias and Weights with random values
bias = [random.random() for x in range(4)]       #biases for the 4 hidden neurons
bias_out = [random.random() for x in range(2)]   #biases for the 2 output neurons
h, w = 10, 5
weight_hid = [[random.random() for x in range(w)] for y in range(h)]  #input-to-hidden weights: row 0 for bias, rows 1-9 for the 9 pixels (column 0 unused)
h, w = 5, 3
weight_out = [[random.random() for x in range(w)] for y in range(h)]  #hidden-to-output weights: rows 1-4 for the 4 hidden neurons (row 0 and column 0 unused)

def flatten_image(image_path):
    #Read the image, convert to grayscale, resize to 3x3 and threshold each pixel to 0/1
    image = cv2.imread(image_path)
    image_grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY )
    size=(3,3)
    resized_image = cv2.resize(image_grayscale, size)
    image_flatten = resized_image.flatten()
    for i in range(len(image_flatten)):
        if image_flatten[i] > 150:
            image_flatten[i] = 1
        else:
            image_flatten[i] = 0
    return image_flatten    

def sigmoid(x):
    return 1/(1+math.exp(-x))

def hidden_layer(x):
    hid_outputP = [0,0,0,0]
    #hid_error = [0,0,0,0]

    for i in range(0,4):
        hid_outputP[i] = bias[i]*weight_hid[0][i+1]
        for j in range(0,9):
            hid_outputP[i] =  hid_outputP[i] + x[j]*weight_hid[j+1][i+1]
         
        hid_outputP[i] = sigmoid(hid_outputP[i])
    return hid_outputP

def output_layer(h, result, plus):  #forward pass of the output layer, then weight updates; 'plus' is the flattened input image (plus or cross)
    outputP = [0,0]
    error = [0,0]
    for i in range(0,2):
        outputP[i] = bias_out[i]
        for j in range(0,4):
            outputP[i] =  outputP[i] + h[j]*weight_out[j+1][i+1]

    outputP[0] = sigmoid(outputP[0])
    outputP[1] = sigmoid(outputP[1])
    #Calculate Error at Output Layer
    error[0] = (result[0] - outputP[0])*outputP[0]*(1 - outputP[0])
    error[1] = (result[1] - outputP[1])*outputP[1]*(1 - outputP[1])

    #Calculate Error at Hidden Layer
    for i in range(1,5):
        error_hid[i] = (error[0]*weight_out[i][1] + error[1]*weight_out[i][2])*h[i-1]*(1 - h[i-1])
    #Recalculate Weight to Output Layer
    for i in range(1,5):
        weight_out[i][1] =  weight_out[i][1] + error[0]*h[i-1]*lr
        weight_out[i][2] =  weight_out[i][2] + error[1]*h[i-1]*lr
       
    #Recalculate Weight to Hidden Layer
    for i in range(1,5):
        for j in range(1,10):
            weight_hid[j][i] = weight_hid[j][i] + error_hid[i]*plus[j-1]

def predict(input):
    input_hid = hidden_layer(input)
    outputP = [0,0]
    for i in range(0,2):
        outputP[i] = bias_out[i]
        for j in range(0,4):
            outputP[i] =  outputP[i] + input_hid[j]*weight_out[j+1][i+1]

    outputP[0] = sigmoid(outputP[0])
    outputP[1] = sigmoid(outputP[1])   
    print(outputP[0],outputP[1]) 
 
result_plus = [0.9, 0.1]
result_cross = [0.1, 0.9]
error_hid = [0,0,0,0,0]

#Image can be taken from : https://github.com/bansalrishi/MachineLearningWithPython_UD
#E.g: Path in GitHub: Data/NN/plus.jpg

plus_path = 'Data/NN/plus.jpg'
plus = flatten_image(plus_path)
cross_path = 'Data/NN/cross.jpg'
cross = flatten_image(cross_path)

for i in range(200) :
    #print("Running loop i=%s"%i)   
    H = hidden_layer(plus)
    output_layer(H, result_plus, plus)
    H = hidden_layer(cross)
    output_layer(H, result_cross,cross)
   
print("Giving input as cross")
predict(cross)
print("Giving input as plus")
predict(plus)



Output:

Giving input as cross
0.12235658465850822 0.8819751642398935
Giving input as plus
0.881445602118824 0.11705488953094145

Wednesday 29 April 2020

StandardScaler: Why fit_transform for train data and only transform for test data?


Below are the steps to perform feature scaling after the train/test split:
# Perform Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

We can observe the use of different functions for the train and test sets. The question arises: why do we use different functions, and would it be right to use fit_transform for both the train and test sets?

Let's look at the manual steps for computing the scaling:

First Case: using different functions (fit_transform on the train set, transform on the test set):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) = (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) = (X_test - X_train_mean) / X_train_std_dev


Second Case: using the same fit_transform function on both sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) = (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) = (X_test - X_test_mean) / X_test_std_dev


The difference between the first and second case is whether X_train_mean or X_test_mean is used when computing X_test(scaled).
The question is which approach is better/correct. In a real scenario we want to know how the model we built performs on new data. What happens if the new data has very few samples? Will we get good mean and standard deviation values from it? What happens if there is only one reading? That is why we use the first case: the test data is always scaled with the statistics learned from the training data. We take the same approach when we build models using libraries, for example with PCA:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)
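A small sketch (with a made-up array) showing that transform on the test set reproduces the first-case formula, i.e., it reuses the mean and standard deviation learned from the training data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])   # a single new reading

sc = StandardScaler()
sc.fit(X_train)   # learn mean and std from the training data only

manual = (X_test - X_train.mean()) / X_train.std()   # first-case formula
print(sc.transform(X_test), manual)                  # both give the same scaled value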