Question: What is tokenization? What is its purpose?
Answer: Breaking a sentence into its constituent words (tokens) is called tokenization. It is the basic first step for stop-word removal, stemming, lemmatization, parsing, text mining, etc.
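A minimal sketch of word tokenization, assuming NLTK is installed and its 'punkt' tokenizer models have been downloaded:

```python
# Tokenize a sentence into words with NLTK
# (requires: nltk.download('punkt') beforehand).
from nltk.tokenize import word_tokenize

sentence = "Google is the most popular search engine."
tokens = word_tokenize(sentence)
print(tokens)  # ['Google', 'is', 'the', 'most', 'popular', 'search', 'engine', '.']
```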
Question: Why do we need to remove stop words?
Answer: Stop words are words that contribute little to the meaning of a sentence, such as "a", "the", "and", etc. In tasks like text classification, where we classify text into different categories, we remove stop words so that more weight is given to the words that actually carry the meaning of the text.
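A minimal sketch of stop-word removal using NLTK's English stop-word list (assumes the 'stopwords' corpus has been downloaded via nltk.download('stopwords')):

```python
# Filter stop words out of a token list.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["google", "is", "the", "most", "popular", "search", "engine"]

filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['google', 'popular', 'search', 'engine']
```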
Question: Why do we perform stemming?
Answer: Stemming is the process of reducing inflected words to their root form. Retrieving, searching, and indexing on the root matches more forms of a word and therefore returns more results; without it, many relevant results would be missed. That is why stemming is an integral part of search queries and information retrieval.
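A small sketch with NLTK's Porter stemmer showing how several inflected forms collapse to one root, so a query for any of them can match the others:

```python
# Several forms of "retrieve" reduce to the same stem.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["retrieve", "retrieval", "retrieving", "retrieved"]:
    print(word, "->", stemmer.stem(word))  # all four map to 'retriev'
```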
Question: What's the difference between stemming and lemmatization?
Answer:
Stemming is a crude heuristic that chops the ends off words and sometimes removes derivational affixes.
Lemmatization uses a vocabulary and morphological analysis of words; its aim is to remove inflectional endings only and return the base or dictionary form (lemma) of a word.
E.g.: Word: SAW
Stemming: s
Lemmatization: see or saw, depending on whether it is used as a verb or a noun.
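A minimal sketch contrasting the two with NLTK (assumes the 'wordnet' corpus has been downloaded; note that the exact stemmer output depends on the algorithm used, and Porter will not reduce "saw" all the way to "s"):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically and may produce non-words.
print(stemmer.stem("studies"))                   # 'studi' (not a dictionary word)

# Lemmatization uses the WordNet vocabulary plus the part of speech.
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
print(lemmatizer.lemmatize("saw", pos="v"))      # 'see'  (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))      # 'saw'  (noun reading)
```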
Question: Give one use case of stemming.
Answer: Search engines (e.g., Google, Bing).
Question: An n-gram is a contiguous sequence of N tokens. How many bigrams can be formed from the sentence below?
"Google is the most popular search engine"
Answer: 6
Bigrams: "Google is", "is the", "the most", "most popular", "popular search", "search engine"
Question: What is the difference between CountVectorizer and HashingVectorizer?
Answer:
Use HashingVectorizer when:
- the dataset is large and there is no use for the resulting dictionary of tokens;
- you have maxed out your computing resources and it's time to optimize.
Use CountVectorizer when:
- you need access to the actual tokens;
- you are worried about hash collisions (when the matrix size is small).
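A minimal sketch contrasting the two scikit-learn vectorizers (assumes scikit-learn is installed; get_feature_names_out is available in recent versions):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["Google is the most popular search engine",
        "Bing is another search engine"]

# CountVectorizer builds and stores a vocabulary, so the tokens are recoverable.
cv = CountVectorizer()
X_count = cv.fit_transform(docs)
print(X_count.shape)               # (2, vocabulary size)
print(cv.get_feature_names_out())  # the actual tokens

# HashingVectorizer hashes tokens into a fixed number of columns: no vocabulary
# is kept (low memory, streamable), but tokens cannot be recovered and
# collisions are possible if n_features is small.
hv = HashingVectorizer(n_features=2**10)
X_hash = hv.transform(docs)
print(X_hash.shape)                # (2, 1024)
```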
Question: Arrange the components of a text classification model below into the right sequence.
Gradient descent, text predictors, text cleaning, text annotation, model tuning
Answer: Correct sequence: text cleaning to remove noise, text annotation to create more features, converting text to numerical form (predictors), using gradient descent to fit the model, and then tuning the model.
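A hedged sketch of that sequence with scikit-learn: cleaning, turning text into numerical predictors, fitting a linear model with stochastic gradient descent, then tuning. The texts, labels, and hyperparameter grid here are toy placeholders, not part of the original question.

```python
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

texts = ["Great phone, loved it!!", "Terrible battery :(", "Works fine", "Awful screen"]
labels = [1, 0, 1, 0]

# 1. Text cleaning: strip noise such as punctuation and casing.
cleaned = [re.sub(r"[^a-z\s]", " ", t.lower()).strip() for t in texts]

# 2-4. Features / numerical predictors (TF-IDF) and a model trained with
# stochastic gradient descent.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# 5. Model tuning: search over a small hyperparameter grid.
grid = GridSearchCV(pipe, {"clf__alpha": [1e-4, 1e-3]}, cv=2)
grid.fit(cleaned, labels)
print(grid.best_params_)
```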