Question: Write equation for Linear Regression?
Ans: y = a + bx, where a is the intercept and b is the slope of the single predictor x. With several predictors this generalizes to y = b0 + b1x1 + ... + bnxn.
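As a quick, purely illustrative sketch (not part of the original answer), the intercept a and slope b can be estimated from data, here with scikit-learn on made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2 + 3x plus a little noise (numbers made up)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

model = LinearRegression().fit(X, y)
print("a (intercept):", model.intercept_)
print("b (slope):", model.coef_[0])
```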
Question: Write equation for Logistic Regression?
Ans: Logistic regression passes a linear combination of the inputs through the sigmoid (logistic) function, 1 / (1 + e^-value), where value = a + bx. The sigmoid maps any real number to a probability between 0 and 1.
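A tiny sketch of the sigmoid, with hypothetical coefficients a and b chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued input to the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients a (intercept) and b (slope), made up for this example
a, b = -1.0, 0.5
x = np.array([-2.0, 0.0, 2.0, 6.0])
print(sigmoid(a + b * x))  # predicted probabilities of the positive class
```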
Question: How will You calculate AUCROC (Area Under Curve ROC) value manually?
Ans: Plot the ROC curve (true positive rate vs. false positive rate at each threshold) and sum the areas of the trapezoids formed by consecutive points, i.e. apply the trapezoidal rule.
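A minimal sketch of the idea, assuming you already have the ROC points (the FPR/TPR values below are hypothetical):

```python
import numpy as np

# Hypothetical ROC points (FPR, TPR), ordered from (0, 0) to (1, 1)
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.7, 0.9, 1.0])

# Each slice contributes width (delta FPR) times average height (mean of adjacent TPRs)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
print("AUC:", auc)  # equivalent to np.trapz(tpr, fpr)
```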
Question: What performance metrics have you used in model building?
Ans:
Confusion Matrix
F1 Score
Gain and Lift Charts
Kolmogorov Smirnov Chart
AUC – ROC
Log Loss
Gini Coefficient
Concordant – Discordant Ratio
Root Mean Squared Error
Cross Validation (Not a metric though!)
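A hedged sketch showing how a few of the metrics above can be computed with scikit-learn; the label vectors are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, log_loss,
                             mean_squared_error, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual classes (made up)
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions
y_proba = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted P(class = 1)

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_proba))
print("Log loss:", log_loss(y_true, y_proba))
# The Gini coefficient is commonly derived from AUC: Gini = 2 * AUC - 1
print("Gini:", 2 * roc_auc_score(y_true, y_proba) - 1)
# RMSE (a regression metric), shown on toy continuous values for completeness
print("RMSE:", np.sqrt(mean_squared_error([3.0, 5.0], [2.5, 5.5])))
```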
Question: Assumptions of Linear Regression?
Ans: 5 key assumptions:
Linear relationship (Outliers need to be checked)
Multivariate normality (can be checked with a histogram or a Q-Q-Plot)
No or little multicollinearity (can be tested with 3 criteria: correlation matrix, tolerance, Variance Inflation Factor (VIF))
No auto-correlation
Homoscedasticity (can be checked using a scatter plot of residuals vs. fitted values)
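As an illustrative sketch (not part of the original answer, and assuming statsmodels is available), the multicollinearity check via VIF could look like this; values above roughly 5-10 are usually read as a warning sign:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy feature matrix; x3 is deliberately close to a combination of x1 and x2
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + 0.5 * X["x2"] + rng.normal(scale=0.1, size=100)

# Add an intercept column so the VIFs are computed against a proper regression
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 should show clearly inflated values
```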
Question: What are the different ways you have used to treat missing values and outliers?
Ans:
For missing values, the approaches below can be used (but are not limited to these):
- If a feature has too many missing values, then drop the whole feature (column).
- If the feature is too important to drop, then introduce another binary feature as isnull of this feature and impute the null values of the existing feature with median/mean.
- If there are very few missing values in a feature and removing those rows doesn't hurt the sample size then remove the rows.
- If removing rows with missing values in either of the features reduces the sample size drastically, then go for imputation. There are multiple ways for that:
- impute with mean/median of the column.
- impute with the mean/median of that column computed over the N nearest neighbors
- If it's a time series data set, then use a Markov chain to predict the missing values
- If each row is a time series and your algorithm doesn't demand the rows to be of the same size, then leave it as is. One example would be dynamic time warping distance between time series.
For Outliers:
- Remove the outliers. (Trimming)
- Replacing the values of outliers or reducing the influence of outliers through outlier weight adjustments. (Winsorization)
- Estimate the values of outliers using robust techniques.
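A short pandas sketch of two of the ideas above, median imputation with an is-null indicator and winsorization; the column name, values, and percentile cut-offs are made up:

```python
import numpy as np
import pandas as pd

# Toy column with missing values and one extreme value (all numbers made up)
df = pd.DataFrame({"income": [40.0, 55.0, np.nan, 62.0, 300.0, np.nan, 48.0]})

# Missing values: add an is-null indicator, then impute with the median
df["income_isnull"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: winsorize by clipping to the 5th and 95th percentiles
lo, hi = df["income"].quantile([0.05, 0.95])
df["income_winsorized"] = df["income"].clip(lower=lo, upper=hi)
print(df)
```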
Question: What is Feature Selection?
Ans:
“Feature selection is the process of selecting a subset of relevant features (variables or predictors) from all available features, to be used in model building.”
With a high number of features (high dimension), data analysis becomes challenging for engineers in the fields of Machine Learning and Data Mining. Feature selection gives an effective way to solve this problem by removing irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and facilitate a better understanding of the learning model or the data.
Question: How many Features to have in the Model?
Ans: One important consideration is the trade-off between predictive accuracy and model interpretability: if we use a large number of features, predictive accuracy is likely to go up while model interpretability goes down.
If we have a small number of features, the model is easy to interpret and less likely to overfit, but it will tend to give lower prediction accuracy.
If we have a large number of features, the model is harder to interpret and more likely to overfit, but it can give higher prediction accuracy.
Question: Types of Feature Selection?
Ans: A high number of features in the data increases the risk of overfitting the model.
Feature selection methods help reduce the dimensionality of the feature space without much loss of information.
Below are some methods used for feature selection:
a) Filter Methods
b) Wrapper Methods (subset selection, forward stepwise selection, backward stepwise selection)
c) Embedded Methods (shrinkage: LASSO regression, Ridge regression)
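As an example of the embedded (shrinkage) approach, here is a hedged sketch using LASSO, where features whose coefficients are driven to zero are dropped; the synthetic dataset and alpha value are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # features with a non-zero coefficient survive
print("Selected feature indices:", selected)
```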
Question: Explain Decision Tree?
Ans: A decision tree is one of the most powerful and popular tools for classification and prediction. It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
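A minimal scikit-learn sketch on the Iris data, shown only to make the structure concrete:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed line is either an internal-node test on an attribute or a leaf class
print(export_text(tree, feature_names=list(iris.feature_names)))
```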
Question: Difference between K-Means and KNN?
Ans:
These are completely different methods.
K-means is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.
K-Nearest Neighbors (K-NN) is a classification (or regression) algorithm that, in order to determine the classification of a point, combines the classification of the K nearest points. It is supervised because you are trying to classify a point based on the known classification of other points.
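To make the contrast concrete, a small sketch on made-up points that runs both: K-Means sees only the points, while K-NN also needs the labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])  # toy points
y = np.array([0, 0, 1, 1])                                      # labels (only K-NN uses them)

# Unsupervised: partition the points into K = 2 clusters, no labels involved
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", kmeans.labels_)

# Supervised: classify a new point from the labels of its K = 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("Predicted class:", knn.predict([[4.5, 5.2]]))
```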
Question: What is K in KNN?
Ans: K is just the number of neighbors "voting" to classify the point.
Question: What is a confusion matrix? Explain.
Ans: A confusion matrix is a table that describes the performance of a classifier/classification model. It contains information about the actual and predicted classifications made by the classifier, and this information is used to evaluate the performance of the classifier.
The confusion matrix is only used for classification tasks, and as such cannot be used for regression models or other non-classification models.
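A minimal sketch with scikit-learn on made-up labels; rows are actual classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up true labels
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up classifier output

# Rows are actual classes, columns are predicted classes (label order 0, 1)
print(confusion_matrix(y_actual, y_predicted))

# For the binary case the four cells can be unpacked directly
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```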
Question: What is cross-validation and what is its purpose?
Ans: Cross-validation is a technique in which we train our model using a subset of the data set and then evaluate it using the complementary subset of the data set. The purpose of cross-validation is model checking, not model building.
Now, suppose we have two models: a linear regression model and a neural network. To find out which model is better at predicting the test set points, we can do K-fold cross-validation. But once we have used cross-validation to select the better-performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross-validation for our final predictive model.
Note that there is a technique called bootstrap aggregation (usually shortened to 'bagging') that does, in a way, use model instances produced similarly to cross-validation to build up an ensemble model.
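A hedged sketch of the workflow described above: use K-fold cross-validation to compare two candidate models, then refit the winner on all of the data (the specific models and dataset are illustrative only):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = load_diabetes(return_X_y=True)

candidates = {
    "linear regression": LinearRegression(),
    "neural network": MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                                   random_state=0),
}

# 5-fold cross-validation is used only to compare the models (model checking)
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
print(scores)

# The chosen model is then retrained on ALL the data for the final predictor
best_name = max(scores, key=scores.get)
final_model = candidates[best_name].fit(X, y)
```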
Question: Whats the difference between Parameter and Hyper-Parameter?
Ans: A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner.
- They can often be set using heuristics.
- They are often tuned for a given predictive modeling problem.
E.g: The learning rate for training a neural network, the C and sigma hyperparameters for support vector machines, the k in k-nearest neighbors.
A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.
- They are required by the model when making predictions.
- Their values define the skill of the model on your problem.
- They are estimated or learned from data.
- They are often not set manually by the practitioner.
- They are often saved as part of the learned model.
E.g: The weights in an artificial neural network, the support vectors in a support vector machine, the coefficients in a linear regression or logistic regression.
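A small sketch to make the distinction concrete: k (a hyperparameter) is specified by the practitioner, often tuned via grid search, while the coefficients of a linear regression (parameters) are estimated from the data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)

# Hyperparameter: k is set from outside the model, here tuned with grid search
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [3, 5, 11]}, cv=5)
search.fit(X, y)
print("Best k (hyperparameter):", search.best_params_)

# Parameters: the coefficients are estimated from the data during fitting
linreg = LinearRegression().fit(X, y)
print("Learned coefficients (parameters):", linreg.coef_)
```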
Question: What is the difference between supervised and unsupervised machine learning?
Ans: Supervised learning requires labeled training data: you need to know which data point belongs to which class or carries which label. Unsupervised learning, on the other hand, does not require labeled data.
Question: What is the difference between L1 and L2 regularization?
Ans: L1 regularization tends to produce sparse solutions: many coefficients are driven to exactly zero, so variables are effectively either kept or dropped. It corresponds to placing a Laplacian prior on the terms. L2 regularization, on the other hand, tends to spread the penalty among all the terms, shrinking every coefficient towards zero without eliminating any, and corresponds to a Gaussian prior.
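An illustrative comparison (the synthetic dataset and alpha value are chosen arbitrarily): L1 (Lasso) zeroes out many coefficients, while L2 (Ridge) shrinks them without eliminating them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 15 features, only 4 of which actually carry signal
X, y = make_regression(n_samples=200, n_features=15, n_informative=4,
                       noise=5.0, random_state=1)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("L1 coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
print("L2 coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)))
```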
Question: What are the Assumptions of Naive Bayes?
Ans: The key assumption of Naive Bayes is conditional independence: each feature is treated as independent of the others given the class, and each feature is given equal weight (importance). Because the features are modelled individually and weighted equally, transformations that mix features together (such as PCA) are typically not applied before Naive Bayes.
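A minimal Gaussian Naive Bayes sketch on the Iris data; each feature's likelihood is modelled independently given the class, which is exactly the conditional-independence assumption described above:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Each feature gets its own per-class mean and variance, estimated independently
# of the other features -- this is the "naive" assumption
nb = GaussianNB().fit(X, y)
print(nb.predict(X[:5]))
```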