Before looking at the bag of words, we need to know a little about natural language processing.
What is natural language processing?
Natural language processing, or NLP, is a sub-branch of artificial intelligence. It helps humans interact with machines in natural language, much like talking to a friend.
The main objective of NLP is to read, convert, understand and make sense of human language, which carries a lot of valuable information. Many NLP techniques are used in machine learning to extract that information from text written by humans.
What is bag of words in NLP machine learning?
Bag of words is a technique in NLP machine learning that extracts features from a text document into a matrix. A machine can’t understand human language directly, so NLP turns the text into numbers that it can work with.
Let’s understand how the bag of words works in NLP machine learning.
We tokenize (split) each sentence in a text document into words and then count how often each word occurs. To make this concrete, let’s take a few similar sample sentences.
‘Devpyjp is a good blog’,
‘Devpyjp is an online platform’,
‘jp is a machine learning NLP tutor’,
‘jp is a technology enthusiast’
Now, split each sentence into words as below. This is called “tokenization” in NLP. Then collect all the unique words.
'Devpyjp','is','a','good','blog'
'Devpyjp','is','an','online','platform'
# the other two sentences are split the same way

# Unique words
'Devpyjp','is','a','an','good','blog','online','platform','jp','machine','learning','NLP','tutor','technology','enthusiast'
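The tokenization step can be sketched in a few lines of plain Python. This is only a minimal illustration: it assumes whitespace splitting is good enough for these sentences, and the names `sentences`, `tokenized` and `vocabulary` are made up for the example.

```python
# Sample sentences from the text above
sentences = [
    "Devpyjp is a good blog",
    "Devpyjp is an online platform",
    "jp is a machine learning NLP tutor",
    "jp is a technology enthusiast",
]

# Tokenize: split each sentence on whitespace
tokenized = [sentence.split() for sentence in sentences]

# Collect the unique words in order of first appearance
vocabulary = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocabulary:
            vocabulary.append(word)

print(tokenized[0])
print(vocabulary)
```

Real text usually needs more care (punctuation, upper and lower case), which is one reason to use a library tokenizer later on.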
Now count the occurrences of the words in each sentence and store them in a matrix with one column per unique word.
We have 4 sentences, so we get four rows in the matrix. Each of our sentences is considered a row, or observation.
Each cell holds the number of times the word occurs in that sentence. In our sample sentences no word repeats, so a cell is 1 if the word occurs in the sentence and 0 otherwise.
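To see the counting step by hand, here is a small sketch in plain Python. It assumes simple whitespace tokenization and keeps the columns in order of first appearance; the variable names are only for illustration.

```python
from collections import Counter

sentences = [
    "Devpyjp is a good blog",
    "Devpyjp is an online platform",
    "jp is a machine learning NLP tutor",
    "jp is a technology enthusiast",
]

tokenized = [sentence.split() for sentence in sentences]

# Unique words, in order of first appearance (these become the columns)
vocabulary = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# One row per sentence, one count per vocabulary word
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row is the bag of words vector for one sentence.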
This matrix of features can help us build a model in machine learning. The bag of words has different options for producing numerical features. For example, we can apply n-grams, such as bi-grams, to extract features from text documents in NLP machine learning.
In n-grams, “n” is the number of consecutive words that are grouped into a single token. We can specify “n” in the bag of words algorithm.
“bi” means two, so we take every pair of neighbouring words in a sentence as one token and collect the unique pairs. It is as follows.
# unique bi-grams
'Devpyjp is','is a','a good','good blog','is an','an online','online platform','jp is' # list goes on..
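Bi-grams can also be built by hand to see what is going on. The sketch below assumes whitespace tokenization, and the helper name `bigrams` is made up for this example.

```python
def bigrams(sentence):
    """Return the pairs of neighbouring words in a sentence."""
    tokens = sentence.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


sentences = [
    "Devpyjp is a good blog",
    "Devpyjp is an online platform",
    "jp is a machine learning NLP tutor",
    "jp is a technology enthusiast",
]

# Collect the unique bi-grams in order of first appearance
unique_bigrams = list(
    dict.fromkeys(pair for sentence in sentences for pair in bigrams(sentence))
)

print(bigrams(sentences[0]))
print(unique_bigrams)
```

In scikit-learn, the `ngram_range` parameter of CountVectorizer plays the same role; for example, `ngram_range=(2, 2)` asks for bi-grams only.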
There are many other methods to extract features from text documents, such as TF-IDF and word2vec models. We will discuss them in later tutorials in our NLP section.
Now we will see how to implement the bag of words algorithm in NLP machine learning.
Bag of words implementation in Python
Open your Jupyter notebook or a Python file and write the lines below. If you don’t have Jupyter, download it here.
import pandas as pd
import numpy as np
These are the libraries we need to work with data in machine learning. Now import CountVectorizer and create an instance of it.
from sklearn.feature_extraction.text import CountVectorizer
countvect = CountVectorizer()
Sklearn (scikit-learn) provides ready-made feature extraction methods, so we don’t have to write them by hand. CountVectorizer is used to extract word-count features from text documents.
doc = np.array(['Devpyjp is a good blog','Devpyjp is an online platform','jp is a machine learning NLP tutor','jp is a technology enthusiast'])
This is our text document. We have 4 sentences in our numpy array.
bag_words = countvect.fit_transform(doc)
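Here, we extract the features from the text document using the “countvect” object and store them in “bag_words”.
When we run the above statement, we get a sparse matrix of word counts. Calling bag_words.toarray() converts it to a regular array, which for these sentences contains only 0’s and 1’s because no word repeats within a sentence.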
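A quick way to inspect what fit_transform gives back is sketched below. It assumes a reasonably recent scikit-learn is installed and reuses the names `doc`, `countvect` and `bag_words` from the steps above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

doc = np.array([
    "Devpyjp is a good blog",
    "Devpyjp is an online platform",
    "jp is a machine learning NLP tutor",
    "jp is a technology enthusiast",
])

countvect = CountVectorizer()
bag_words = countvect.fit_transform(doc)

# fit_transform returns a SciPy sparse matrix, not a plain array
print(type(bag_words))

# Rows are sentences, columns are the words CountVectorizer kept
print(bag_words.shape)

# toarray() converts the sparse matrix to a dense NumPy array of counts
print(bag_words.toarray())
```

The sparse format saves memory, because in a real corpus most words do not occur in most documents.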
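# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
doc_features = countvect.get_feature_names_out()
print(doc_features)
countvect.get_feature_names_out() gives the feature names. (Older scikit-learn versions used get_feature_names(), which has since been removed.) We assign the feature names to the doc_features variable.
df = pd.DataFrame(bag_words.toarray(), columns=doc_features)
df
It will give us a data frame as output. We passed data = bag_words.toarray() and columns = doc_features. The column names are lowercase because CountVectorizer lowercases the text by default.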
That’s it, guys! We have successfully extracted features from the text documents using the bag of words algorithm.
Full code – Natural language processing – what is Bag of words
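import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# here we can take different sentences
text = np.array(['I love Biryani.', ' Biryani is the best dish i had ever eat .', "India's famous dish is Biryani."])

# create the vectorizer and build the bag of words
countvect = CountVectorizer()
bag_words = countvect.fit_transform(text)

# get_feature_names_out() replaces the removed get_feature_names()
feature_names = countvect.get_feature_names_out()
print(feature_names)

df = pd.DataFrame(bag_words.toarray(), columns=feature_names)
print(df)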
If you liked our article on Natural language processing – what is Bag of words?, let us know. If you have any queries regarding this article, please feel free to comment below. Thank you.