
Saturday, 25 July 2020

Install TensorFlow and Keras

Platform: Windows 10
1> Create a virtual environment
#conda create -n tensorflow pip python=3.5

2> Activate the environment
#activate tensorflow
#conda info --envs

3> Install TensorFlow
#conda install -c conda-forge tensorflow
This will install TensorFlow 1.10.0 (a quick version check is shown at the end of these steps).
#python -m pip install --upgrade pip
#pip install setuptools==39.1.0

4> Install Keras
#pip install keras==2.2.2

5> Install other packages

#pip install matplotlib
#pip install scikit-learn
#pip install pydot

6> Install Spyder separately so that you can launch it without activating your virtual environment
#conda install spyder
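
To confirm the installation, start Python inside the activated environment and run the minimal check below (assuming the versions installed above):

import tensorflow as tf
import keras

print(tf.__version__)    # expected 1.10.0 with the conda-forge package above
print(keras.__version__) # expected 2.2.2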

Monday, 1 June 2020

Natural Language Processing: Q & A

Question: What is Tokenization? What is the purpose of it?
Answer: Breaking a sentence into its constituent words is called tokenization. It is the first step before stopword removal, stemming, lemmatization, parsing, text mining, etc.
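A minimal sketch with NLTK (assuming the 'punkt' tokenizer data has been downloaded):

from nltk.tokenize import word_tokenize
# nltk.download('punkt') may be needed the first time
sentence = "Google is the most popular search engine"
print(word_tokenize(sentence))
# ['Google', 'is', 'the', 'most', 'popular', 'search', 'engine']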




Question: Why do we need to remove stop words?
Answer: Stopwords are words that don't contribute much to the meaning of a sentence, such as "a", "the", "and", etc. In tasks like text classification, where we classify text into different categories, we remove stopwords so that more focus goes to the words that carry the meaning of the text.
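A small NLTK sketch of stopword removal (assuming the stopwords corpus has been downloaded):

from nltk.corpus import stopwords
# nltk.download('stopwords') may be needed the first time
stop_words = set(stopwords.words('english'))
tokens = ['Google', 'is', 'the', 'most', 'popular', 'search', 'engine']
print([w for w in tokens if w.lower() not in stop_words])
# roughly ['Google', 'popular', 'search', 'engine']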


Question: Why do we perform stemming?
Answer: Stemming is the process of reducing inflected words to their root form. Retrieving, searching and identifying more forms of a word returns more results; without it, many results would be missed. That is why stemming is an integral part of search queries and information retrieval.


Question: What's the difference between stemming and lemmatization?
Answer:
Stemming is a crude heuristic that chops off the ends of words and sometimes removes derivational affixes.
Lemmatization uses a vocabulary and morphological analysis of words; its aim is to remove inflectional endings only and return the base or dictionary form (lemma) of a word.
E.g. word: "saw"
Stemming: might return just "s"
Lemmatization: returns "see" or "saw" depending on whether the word is used as a verb or a noun.
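A small illustration with NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus may need to be downloaded first):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'), stemmer.stem('studying'))  # crude stems such as 'studi'
print(lemmatizer.lemmatize('studies'))                    # 'study'
print(lemmatizer.lemmatize('saw', pos='v'))               # 'see' (verb)
print(lemmatizer.lemmatize('saw', pos='n'))               # 'saw' (noun)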


Question: Give one use case of stemming.
Answer: Search Engine (E.g: Google, Bing)


Question: An n-gram is a sequence of n consecutive words. How many bigrams can be formed from the sentence below?
"Google is the most popular search engine"
Answer: 6
Bigrams: "Google is", "is the", "the most", "most popular", "popular search", "search engine"


Question: What is the difference between CountVectorizer and HashingVectorizer?
Answer:
Use HashingVectorizer when:
- the dataset is large and you have no use for the resulting dictionary of tokens
- you have maxed out your computing resources and it's time to optimize
Use CountVectorizer when:
- you need access to the actual tokens
- you are worried about hash collisions (e.g. when the matrix size is small)
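A short comparison sketch with scikit-learn (toy documents; older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["Google is the most popular search engine",
        "Bing is another search engine"]

cv = CountVectorizer()
X_count = cv.fit_transform(docs)
print(cv.get_feature_names_out())      # the actual tokens are available

hv = HashingVectorizer(n_features=16)  # stateless: no vocabulary is stored
X_hash = hv.transform(docs)
print(X_count.shape, X_hash.shape)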

Question: Arrange the components of a text classification model below into the right sequence.

Gradient descent, text predictors, text cleaning, text annotation, model tuning

Answer: Correct sequence: text cleaning to remove noise, creating more features using annotation, converting the text to numerical form (predictors), using gradient descent to train the model, and then tuning the model. A toy pipeline along these lines is sketched below.
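As an illustration only (toy documents and hypothetical labels), a scikit-learn pipeline roughly following that sequence:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy corpus with hypothetical labels (1 = spam, 0 = not spam)
docs = ["free money offer now", "meeting scheduled for monday",
        "win a free prize today", "project update and agenda"]
labels = [1, 0, 1, 0]

pipe = Pipeline([
    # cleaning + predictors: lowercasing, stopword removal, numeric features
    ('predictors', TfidfVectorizer(lowercase=True, stop_words='english')),
    # linear model trained with stochastic gradient descent
    ('model', SGDClassifier(random_state=0)),
])

# model tuning over a tiny illustrative grid
grid = GridSearchCV(pipe, {'model__alpha': [1e-4, 1e-3]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)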

Tuesday, 12 May 2020

Difference between Ordinary Least Square (OLS) and Gradient Descent to find best fit line


Ordinary Least Square(OLS):

- A non-iterative method to find the best-fit line such that the sum of squared differences between the observed and predicted values is minimized.

Error = Σ (y_pred_i – y_act_i)^2
Line: y = b0 + b1·x

where y_act_i is the actual value and y_pred_i the predicted value

- The above formula is for the univariate (one-variable) case.
- For the multivariate case, when we have many variables, the closed-form solution becomes complicated and requires a lot of calculation when implemented in software.
- It fails for collinear predictors (correlation between features makes the normal equations ill-conditioned).
- It can be run in parallel, but it is still complicated and expensive.

Gradient Descent:

- Finds the linear model parameters iteratively.
- Applies to non-linear models as well.
- Works well for collinear predictors.
- Saves a lot of calculation time because it can be run in parallel and the load can be distributed across multiple processors.

• Cost function: J(m, c) = Σ (y_pred_i – y_act_i)^2 / n, where n is the number of data points
• Hypothesis: y_pred = c + m·x
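
A minimal NumPy sketch of gradient descent for this hypothesis (illustrative data, learning rate and iteration count):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # generated from y = 1 + 2x

m, c = 0.0, 0.0
lr = 0.01
n = len(x)
for _ in range(5000):
    y_pred = c + m * x
    # gradients of J(m, c) = sum((y_pred - y_act)^2) / n
    dm = (2.0 / n) * np.sum((y_pred - y) * x)
    dc = (2.0 / n) * np.sum(y_pred - y)
    m = m - lr * dm
    c = c - lr * dc

print(m, c)   # converges close to m = 2, c = 1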

Saturday, 9 May 2020

Deep Learning: Create Neural Network to Recognize plus and cross image

Problem:
1> Consider images of size 3*3 for plus and cross
2> As input, plus = [0,1,0,1,1,1,0,1,0] and cross = [1,0,1,0,1,0,1,0,1]


Algorithm:
1> Error at output layer = (Out_exp - Out_pred) * Out_pred * (1 - Out_pred)
2> Recalculate weight: weight = weight + learning_rate * error * input
3> Error at hidden layer = (w11*error1 + w12*error2), scaled by the derivative of the hidden activation

Code:
import random
import math

import cv2

lr = 1 #learning rate
#Initialize bias and weights with random values
bias = [random.random() for x in range(4)]      #bias terms for the 4 hidden units
bias_out = [random.random() for x in range(2)]  #bias terms for the 2 output units
h, w = 10, 5
weight_hid = [[random.random() for x in range(w)] for y in range(h)] #input -> hidden weights (columns 1-4 used; row 0 is multiplied with the bias terms)
h, w = 5, 3
weight_out = [[random.random() for x in range(w)] for y in range(h)] #hidden -> output weights (rows 1-4 and columns 1-2 used)

def flatten_image(image_path):
    image = cv2.imread(image_path)
    image_grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY )
    size=(3,3)
    resized_image = cv2.resize(image_grayscale, size)
    image_flatten = resized_image.flatten()
    for i in range(len(image_flatten)):
        if image_flatten[i] > 150:
            image_flatten[i] = 1
        else:
            image_flatten[i] = 0
    return image_flatten    

def sigmoid(x):
    return 1/(1+math.exp(-x))

def hidden_layer(x):
    hid_outputP = [0,0,0,0]
    #hid_error = [0,0,0,0]

    for i in range(0,4):
        hid_outputP[i] = bias[i]*weight_hid[0][i+1]
        for j in range(0,9):
            hid_outputP[i] =  hid_outputP[i] + x[j]*weight_hid[j+1][i+1]
         
        hid_outputP[i] = sigmoid(hid_outputP[i])
    return hid_outputP

def output_layer(h, result, inputs): #h: hidden activations, result: target outputs, inputs: flattened input image
    outputP = [0,0]
    error = [0,0]
    for i in range(0,2):
        outputP[i] = bias_out[i]
        for j in range(0,4):
            outputP[i] =  outputP[i] + h[j]*weight_out[j+1][i+1]

    outputP[0] = sigmoid(outputP[0])
    outputP[1] = sigmoid(outputP[1])
    #Calculate Error at Output Layer
    error[0] = (result[0] - outputP[0])*outputP[0]*(1 - outputP[0])
    error[1] = (result[1] - outputP[1])*outputP[1]*(1 - outputP[1])

    #Calculate Error at Hidden Layer
    for i in range(1,5):
        error_hid[i] = (error[0]*weight_out[i][1] + error[1]*weight_out[i][2])*h[i-1]*(1 - h[i-1])
    #Recalculate Weight to Output Layer
    for i in range(1,5):
        weight_out[i][1] =  weight_out[i][1] + error[0]*h[i-1]*lr
        weight_out[i][2] =  weight_out[i][2] + error[1]*h[i-1]*lr
       
    #Recalculate Weight to Hidden Layer
    for i in range(1,5):
        for j in range(1,10):
            weight_hid[j][i] = weight_hid[j][i] + error_hid[i]*inputs[j-1]

def predict(x):
    input_hid = hidden_layer(x)
    outputP = [0,0]
    for i in range(0,2):
        outputP[i] = bias_out[i]
        for j in range(0,4):
            outputP[i] =  outputP[i] + input_hid[j]*weight_out[j+1][i+1]

    outputP[0] = sigmoid(outputP[0])
    outputP[1] = sigmoid(outputP[1])   
    print(outputP[0],outputP[1]) 
 
result_plus = [0.9, 0.1]
result_cross = [0.1, 0.9]
error_hid = [0,0,0,0,0]

#Image can be taken from : https://github.com/bansalrishi/MachineLearningWithPython_UD
#E.g: Path in GitHub: Data/NN/plus.jpg

plus_path = 'Data/NN/plus.jpg'
plus = flatten_image(plus_path)
cross_path = 'Data/NN/cross.jpg'
cross = flatten_image(cross_path)

for i in range(200) :
    #print("Running loop i=%s"%i)   
    H = hidden_layer(plus)
    output_layer(H, result_plus, plus)
    H = hidden_layer(cross)
    output_layer(H, result_cross,cross)
   
print("Giving input as cross")
predict(cross)
print("Giving input as plus")
predict(plus)



Output:

Giving input as cross
0.12235658465850822 0.8819751642398935
Giving input as plus
0.881445602118824 0.11705488953094145

Wednesday, 29 April 2020

StandardScaler: Why fit_transform for train data and only transform for test data?


Below are the steps to perform Feature Scaling after the Train and Test Split
# Perform Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

We can observe the use of different functions for the train and test sets. The question arises: why do we use different functions, and would it be right to use fit_transform for both the train and the test set?

Let's look at the manual steps to compute the scaling:

First Case: Using Same Function:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) = (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) = (X_test - X_train_mean) / X_train_std_dev


Second Case: Using Different Function:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

X_train(scaled) = (X_train - X_train_mean) / X_train_std_dev

X_test(scaled) = (X_test - X_test_mean) / X_test_std_dev


So, if we compare the first and second cases, the difference is the use of X_train_mean versus X_test_mean when computing X_test(scaled).
The question is which approach is better/correct. In a real scenario we want to know how the model we built performs on new data. What happens if the new data has very few samples? Will we get good mean and standard deviation values? What happens if there is only one reading? That is why we use the first case: scale the test set with the statistics learned from the training set. We take the same approach when we build models using libraries, e.g. PCA:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)
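
As a small illustration on toy numbers (hypothetical values): the scaler's statistics come from the training set only and are reused for the test set.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_toy = np.array([[10.0]])

sc = StandardScaler()
sc.fit_transform(X_train_toy)
print(sc.mean_, sc.scale_)        # learned from X_train_toy only
print(sc.transform(X_test_toy))   # X_test_toy scaled with the training mean and std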


Tuesday, 4 February 2020

Machine Learning Interview Questions & Answers(Complex)


Question: What's the difference between black-box and white-box models?
Ans:
  • accurate and ‘black-box’:
    Black-box models such as neural networks, gradient boosting models or complicated ensembles often provide great accuracy. The inner workings of these models are harder to understand and they don’t provide an estimate of the importance of each feature on the model predictions, nor is it easy to understand how the different features interact.
  • weaker and ‘white-box’:
    Simpler models such as linear regression and decision trees on the other hand provide less predictive capacity and are not always capable of modelling the inherent complexity of the dataset (i.e. feature interactions). They are however significantly easier to explain and interpret.
More Read

Question: What is the difference between the surrogate and LIME interpretability techniques?
Ans: Surrogate models are (generally simpler) models that are used to explain a more complex model. Linear models and decision trees are often used because of their simple interpretation. The surrogate model is created to represent the decision-making process of the complex model (the response function): it is trained on the inputs and the complex model's predictions, rather than on the inputs and the true targets. Surrogate models provide a layer of global interpretability on top of non-linear and non-monotonic models, but they should not be relied on exclusively. The general idea behind LIME is the same as for surrogate models; however, LIME does not build a global surrogate model that represents the entire dataset, it only builds local surrogate models (linear models) that explain the predictions in local regions. A sketch of a global surrogate follows below.

More Read
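
A minimal sketch of a global surrogate using scikit-learn only (illustrative data and models, not LIME itself): a shallow decision tree is fit on the inputs and the black-box model's predictions.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)

black_box = RandomForestRegressor(random_state=0).fit(X, y)
black_box_preds = black_box.predict(X)

# the surrogate is trained on (inputs, model predictions), not (inputs, targets)
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, black_box_preds)
print(surrogate.score(X, black_box_preds))   # how well the surrogate mimics the black box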

Question: Explain the LIME interpretability technique in detail.
Ans: 
Link

Question: What are random forest feature importances? Explain in detail.
Ans:
Link

Saturday, 25 January 2020

Python Django Interview Questions & Answers


Question: What are Django shortcut functions?
Ans: The package django.shortcuts collects helper functions and classes that “span” multiple levels of MVC. These functions/classes introduce controlled coupling for convenience’s sake.
More Read 

Question: What is the function of render()?
Ans: render() combines a given template with a given context dictionary and returns an HttpResponse object with that rendered text. Django does not provide a shortcut function which returns a TemplateResponse, because the constructor of TemplateResponse offers the same level of convenience as render().
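A minimal sketch of a view using render() (the template name and context keys here are hypothetical):

from django.shortcuts import render

def book_list(request):
    context = {'books': ['Book A', 'Book B']}
    # combines books/book_list.html with the context and returns an HttpResponse
    return render(request, 'books/book_list.html', context)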

Question: What is the difference between {{ }} and {% %}?
Ans: There are three categories of template syntax in Django:
1> Template variables {{ }}: to render (display) variables in the template.
2> Template tags {% %}: for statements such as if and for, or to call tags such as load, static, etc.
3> Template filters: {{ variable|filter:arg }}

Question: What does CSRF token mean?
Ans: Cross-site request forgery, also known as a one-click attack or session riding and abbreviated as CSRF (sometimes pronounced sea-surf) or XSRF, is a type of malicious exploit of a website where unauthorized commands are transmitted from a user that the web application trusts.

Question: How does Django CSRF work?
Ans: When receiving the form submission, Django checks that the alphanumeric string value from the hidden form field matches the csrftoken cookie received from the browser. A CSRF attack might come in the form of a malicious website that includes an iframe; the iframe includes a POST form and some JavaScript.

Question: How long is a CSRF token?
Ans: Assuming an attacker can do 100,000 requests per second, it should take around 2.93 million years on average to brute-force a 64-bit CSRF token. (And there shouldn't be more than one token in the whole token space, unlike with session IDs.) So, maybe 64 bits is enough.

Friday, 10 January 2020

Simple Linear Regression


# Simple Linear Regression Code Python


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

********************************************************************

# Simple Linear Regression Code R


# Importing the dataset
dataset = read.csv('Salary_Data.csv')

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)

# Fitting Simple Linear Regression to the Training set
regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

# Visualising the Training set results
#install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

# Visualising the Test set results
library(ggplot2)
ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

***************************************************************

Github:
https://github.com/bansalrishi/MLData

Data Preprocessing


# Data Preprocessing Code Python


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
# ':' means we take all rows, ':-1' means we take all columns except the last one
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in newer scikit-learn versions
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# categorical_features was removed from OneHotEncoder; apply it to column 0 via ColumnTransformer instead
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X), dtype = float)
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

***********************************************************

# Data Preprocessing Code R


# Importing the dataset
dataset = read.csv('Data.csv')

# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

*******************************************************************

Github:
https://github.com/bansalrishi/MLData

Tuesday, 7 January 2020

Machine Learning Interview Questions & Answers - 2



Question: What is pruning of a decision tree and why do we do it?
Ans: Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
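As a hedged illustration, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of a decision tree (the dataset and alpha value below are only for demonstration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print(unpruned.tree_.node_count, pruned.tree_.node_count)   # the pruned tree is smaller
print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))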

Question: If we have 100 GB of data, how will you manage to build a model on your machine?
Ans: Refer to this link.


Question: What is Central Limit Theorem?
Ans: The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
More Read
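A small NumPy simulation of the idea (arbitrary distribution and sample sizes): means of repeated samples from a skewed population look approximately normal.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100000)   # skewed, far from normal

sample_means = [rng.choice(population, size=50, replace=True).mean() for _ in range(2000)]

print(np.mean(sample_means))   # close to the population mean (about 2)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2 / sqrt(50)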
 
Question: How does parallel processing in XGBoost work? (Remember, it is boosting, so each tree depends on the previous trees.)
Ans: XGBoost doesn't run multiple trees in parallel, as you need the predictions after each tree to update the gradients. Rather, it parallelizes within a single tree by using OpenMP to create branches independently. To observe this, build a giant dataset and run with n_rounds=1: you can see that all your cores are being used on one tree. This is why it's so fast.


Question: Why is LightGBM faster than XGBoost?
Ans: Light Gradient Boosting is based on decision tree algorithms, but it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when growing on the same leaf, the leaf-wise algorithm in LightGBM can reduce more loss than the level-wise algorithm, and hence results in better accuracy that can rarely be achieved by the existing boosting algorithms. It is also surprisingly fast, hence the word 'Light'.
More read

Question 19: What are the different hyperparameters in all the above algorithms?
Question 20: How do you find the best hyper-parameters?
Question 21: What is the bias-variance tradeoff?
Question 22: What are overfitting, underfitting and a good fit?
Question 23: How do you identify whether a model fits well, overfits or underfits?
Question 24: What is the difference between objective and evaluation functions?

Monday, 6 January 2020

Machine Learning Interview Questions & Answers


Question: Write equation for Linear Regression?
Ans: y = a + bx

Question: Write equation for Logistic Regression?
Ans: Sigmoid Function 1 / (1 + e^-value)

Question: How will you calculate the AUC-ROC (area under the ROC curve) value manually?
Ans: Using the formula for the area of a trapezoid, summed over the segments of the ROC curve.
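A small sketch of that calculation (illustrative labels and scores): sum trapezoid areas under the ROC points and compare with scikit-learn.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)
manual_auc = np.trapz(tpr, fpr)                    # trapezoidal rule over the ROC curve
print(manual_auc, roc_auc_score(y_true, y_score))  # the two values agree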

Question: What performance metrics have you used in model building?
Ans:
    Confusion Matrix
    F1 Score
    Gain and Lift Charts
    Kolmogorov Smirnov Chart
    AUC – ROC
    Log Loss
    Gini Coefficient
    Concordant – Discordant Ratio
    Root Mean Squared Error
    Cross Validation (Not a metric though!)

Question: Assumptions of Linear Regression?
Ans: 5 key assumptions:
    Linear relationship (Outliers need to be checked)
    Multivariate normality (can be checked with a histogram or a Q-Q-Plot)
    No or little multicollinearity (can be tested with 3 criteria: correlation matrix, tolerance, Variance Inflation Factor (VIF); see the sketch below)
    No auto-correlation
    Homoscedasticity (can be checked using scatter plot)
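
For the multicollinearity check, a hedged sketch with statsmodels' variance_inflation_factor (synthetic predictors, where x2 is deliberately almost a copy of x1):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   # strongly correlated with x1
x3 = rng.normal(size=100)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))   # a VIF above ~10 usually signals strong multicollinearity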

Question: What are different ways you have used to treat missing values and outliers?
Ans:
For missing values, the options below can be used (but are not limited to these):

  1. If a feature has too many missing values, then drop the whole feature (column).
  2. If the feature is too important to drop, then introduce another binary "is null" feature and impute the null values of the existing feature with the median/mean.
  3. If there are very few missing values in a feature and removing those rows doesn't hurt the sample size, then remove the rows.
  4. If removing rows with missing values in either of the features reduces the sample size drastically, then go for imputation. There are multiple ways to do that:
    • impute with the mean/median of the column
    • impute with the mean/median of that column for the N nearest neighbours
    • if it's a time-series dataset, use a Markov chain to predict the missing values
    • if each row is a time series and your algorithm doesn't require the rows to be the same length, then leave it as is (one example is dynamic time warping distance between time series)
For outliers:
  1. Remove the outliers (trimming).
  2. Replace the values of outliers or reduce the influence of outliers through outlier weight adjustments (winsorization).
  3. Estimate the values of outliers using robust techniques.
(A short pandas/NumPy sketch of a few of these options is shown below.)
More Read: Source
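
A few of these options sketched with pandas/NumPy (column names and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 35, np.nan, 120],
                   'salary': [30000, 45000, np.nan, 52000, 61000, 999999]})

# impute missing values with the column median
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].median())

# winsorize outliers: clip values to the 5th and 95th percentiles
low, high = df['salary'].quantile([0.05, 0.95])
df['salary'] = df['salary'].clip(lower=low, upper=high)
print(df)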

Question: What is feature selection?
Ans:

“Feature selection is the process of selecting a subset of relevant features (variables or predictors) from all available features, to be used in model building.”
With a high-dimensional set of N features, data analysis is challenging for engineers in the fields of machine learning and data mining. Feature selection gives an effective way to solve this problem by removing irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and facilitate a better understanding of the learning model and the data.

Question: How many features should the model have?
Ans: One important thing is that we have to take into consideration the trade-off between predictive accuracy and model interpretability: if we use a large number of features, predictive accuracy is likely to go up while model interpretability goes down.
If we have a small number of features, the model is easy to interpret and less likely to overfit, but it will give lower prediction accuracy.
If we have a large number of features, the model is harder to interpret and more likely to overfit, but it can give higher prediction accuracy.

Question: What are the types of feature selection?
Ans: A high number of features in the data increases the risk of overfitting the model.
Feature selection methods help to reduce the dimensionality of the feature space without much loss of information.
Below are some methods used for feature selection (a small example follows after the link below):
a> Filter methods, b> Wrapper methods (subset selection, forward stepwise selection, backward stepwise selection), c> Embedded methods (shrinkage: LASSO regression, ridge regression)

More read on Feature Selection: Source
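
A short filter-method sketch with scikit-learn (illustrative dataset and k):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, X_selected.shape)   # 30 features reduced to the 10 best-scoring ones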

Question: Explain a decision tree.
Ans: A decision tree is one of the most powerful and popular tools for classification and prediction. It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
More Read: Source

Question: Difference between K-Means and KNN?
Ans:

These are completely different methods.
K-means is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.
K-Nearest Neighbors (K-NN) is a classification (or regression) algorithm that, in order to determine the classification of a point, combines the classifications of the K nearest points. It is supervised because you are trying to classify a point based on the known classifications of other points.

Question: What is K in KNN?
Ans: K is just the number of neighbors "voting" to classify the point.
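A minimal side-by-side sketch in scikit-learn (toy data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# K-Means: unsupervised, K = number of clusters (the labels y are not used)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# KNN: supervised, K = number of neighbors that vote for the class
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [8, 7]]))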


Question: What is a confusion matrix? Explain.

Ans: A confusion matrix is a table that describes the performance of a classifier/classification model. It contains information about the actual and predicted classifications done by the classifier, and this information is used to evaluate the performance of the classifier.
The confusion matrix is only used for classification tasks, and as such cannot be used for regression models or other non-classification models.
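A short illustration with scikit-learn (hypothetical labels):

from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are actual classes, columns are predicted classes;
# for binary labels 0/1 this is [[TN, FP], [FN, TP]] in scikit-learn's ordering
print(confusion_matrix(y_actual, y_predicted))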


Question: What is cross-validation and what is its purpose?
Ans: Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it on the complementary subset of the dataset. The purpose of cross-validation is model checking, not model building.
Now, suppose we have two models, one a linear regression model and the other a neural network. To find out which model is better at predicting the test-set points, we can do K-fold cross-validation. But once we have used cross-validation to select the better-performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross-validation for our final predictive model.
Note that there is a technique called bootstrap aggregation (usually shortened to 'bagging') that does, in a way, use model instances produced in a similar fashion to cross-validation to build up an ensemble model.
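A small sketch of using K-fold cross-validation for model checking and then refitting the chosen model on all of the data (illustrative dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5)   # used only to assess the model
print(scores.mean())

final_model = model.fit(X, y)                 # the final model is trained on all the data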

Question: What's the difference between a parameter and a hyper-parameter?
Ans: A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
  • They are often used in processes to help estimate model parameters.
  • They are often specified by the practitioner.
  • They can often be set using heuristics.
  • They are often tuned for a given predictive modeling problem.

E.g: The learning rate for training a neural network, the C and sigma hyperparameters for support vector machines, the k in k-nearest neighbors.

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.
  • They are required by the model when making predictions.
  • Their values define the skill of the model on your problem.
  • They are estimated or learned from data.
  • They are often not set manually by the practitioner.
  • They are often saved as part of the learned model.
E.g: The weights in an artificial neural network, the support vectors in a support vector machine, the coefficients in a linear regression or logistic regression.
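
A small illustration with scikit-learn: C is a hyperparameter set by the practitioner before training, while coef_ and intercept_ are parameters learned from the data.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

model = LogisticRegression(C=0.5, max_iter=5000)   # C is chosen, not learned (hyperparameter)
model.fit(X, y)
print(model.coef_.shape, model.intercept_)         # learned from the data (parameters)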

Question: What is the difference between supervised and unsupervised machine learning?
Ans: Supervised learning requires labeled training data: you should know which data point belongs to which class or has what label. Unsupervised learning, on the other hand, does not require labeled data.

Question: What is the difference between L1 and L2 regularization?
Ans: L1 regularization is more binary/sparse, with many variables being assigned a weight of essentially 1 or exactly 0; it is like setting a Laplacian prior on the terms. L2 regularization, on the other hand, tends to spread the error among all the terms and corresponds to a Gaussian prior.
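A hedged sketch of the practical effect, using Lasso (L1) and Ridge (L2) on synthetic data:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print((lasso.coef_ == 0).sum())   # L1 tends to set some coefficients exactly to zero
print((ridge.coef_ == 0).sum())   # L2 shrinks coefficients but usually keeps them all non-zero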

Question: What are the assumptions of Naive Bayes?
Ans:
Naive Bayes assumes that the features are conditionally independent of each other given the class, and it treats each feature as equally important. This also means we generally don't apply PCA before Naive Bayes, since we consider the weightage of each feature equal.
