Understanding the Bag of Words in NLP

Category: IT Technology · Published: 6 years ago

Natural language processing (NLP) is an important branch of artificial intelligence in which a lot of interesting and important research is going on. As a machine learning enthusiast, it is useful to understand the sub-processes of NLP.

Here I would like to share one of the buzzwords used in NLP, with code examples (a link to the notebook is attached at the bottom): “BAG OF WORDS!”

What is a bag of words in NLP?

Bag of words is a method that represents a text by how many times each word occurs in it, ignoring word order. Those counts can be used to find the important topics in a text (paragraph). What are topics? Let’s say you are reading the paragraph below:

As a pet, cat is a very useful animal and helps in protecting or saving our rashan from rats. The offspring of a cat is called as kitten, it is a smaller and a cuter version of a cat. Cat has got four thin, short and sturdy limbs that helps it in walking, running and jumping for long distances.It’s bright eyes help it in seeing long distances and also help during the dark. Cats are found all over the world. There is no place without a cat. Sometimes a cat can be mistaken for a tiger cub, because of its extreme similarities with it.A cat’s body is completely covered with soft and beautiful fur. Cats make meaw meaw sound. God has provided cats with soft shoes or pads, which help a cat in walking without making a sound.

[Text Credits: https://www.atozessays.com/animals/essay-on-cat/ ]

As a human reading this, you know the paragraph is about cats. “Cat” is an important topic in the paragraph above. But,

  • How is a machine supposed to figure this out?
  • How can you tell your model that “cat” is an important topic in this paragraph?

"The more frequent a word, the more important it might be!"

This is where the bag of words comes into play!
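
As a rough illustration of the "more frequent, more important" idea, here is a minimal sketch that counts words with Python's standard `collections.Counter`. The sentence is made up for this example, and splitting on whitespace is a simplifying assumption, not a real tokenizer.

```python
from collections import Counter

# Made-up example sentence (not taken from the paragraph above).
text = "the cat sat on the mat and the cat slept"

# Naive tokenization: split on whitespace only.
tokens = text.split()

# A bag of words is a mapping from each token to its number of occurrences.
counts = Counter(tokens)

print(counts.most_common(2))
```

This prints `[('the', 3), ('cat', 2)]`. The word "the" comes out on top even though it says nothing about the topic, which is why raw counts are usually combined with some filtering of the tokens.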

How to get the bag of words of a paragraph?

  1. First, split the paragraph into tokens using tokenization. A token can be any part of the text: a word, a digit, a punctuation mark, or a special character.
  2. Apply the necessary text preprocessing steps to keep only the useful tokens: lowercasing words, lemmatization or stemming (reducing words to their root form), removing stop words and punctuation, and so on.
  3. Count the occurrences of each token to find the most common words.

# imports used by this example (NLTK plus the standard library);
# the NLTK tokenizer, stop word and WordNet data must be downloaded first with nltk.download()
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# declare your text here
paragraph = "As a pet, cat is a very useful animal and helps in protecting or saving our rashan from rats........................."
# tokenize the paragraph using word_tokenize, which returns a list of tokens
tokens = word_tokenize(paragraph)
# change the tokens to lower case
lower_tokens = [token.lower() for token in tokens]
# retain alphabetic words only: this eliminates punctuation and special characters
alpha_only = [t for t in lower_tokens if t.isalpha()]
# remove all stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in alpha_only if token not in stop_words]
# lemmatize the words to bring them to their root form
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [wordnet_lemmatizer.lemmatize(token) for token in filtered_tokens]
# create the bag of words
bag_of_words = Counter(lemmatized_tokens)
# print the top 5 most common words
print(bag_of_words.most_common(5))

Output:
[('cat', 11), ('help', 5), ('walking', 2), ('long', 2), ('without', 2)]

Looking at the output, you can tell that this is a text about cats, since “cat” is the most frequent word in it. You can now proceed to the next steps of your NLP work, using this bag of words as features.
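
To make "bag of words as features" a bit more concrete, here is one possible sketch of turning word counts into fixed-length count vectors over a shared vocabulary. The two token lists and the `vectorize` helper are hypothetical and exist only for illustration; they stand in for the output of whatever preprocessing you use.

```python
from collections import Counter

# Two hypothetical, already-preprocessed token lists (assumed for illustration).
doc_tokens = [
    ["cat", "help", "cat", "walking"],
    ["dog", "help", "running"],
]

# One bag of words per document.
bags = [Counter(tokens) for tokens in doc_tokens]

# Shared vocabulary: every distinct token, sorted for a stable column order.
vocabulary = sorted(set().union(*bags))

def vectorize(bag, vocabulary):
    """Return the count of each vocabulary word in the bag (0 if absent)."""
    return [bag[word] for word in vocabulary]

vectors = [vectorize(bag, vocabulary) for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of numbers of the same length, one position per vocabulary word, which is a shape that most machine learning models can take as input.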

This is just one way of doing it; I have used only a few basic preprocessing steps for simplicity. The preprocessing steps will vary from case to case, so read similar work and fine-tune them for your case.

You can have a look at the notebook for this example at the link below. I hope this article is helpful to you!
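
To see how much the preprocessing choices matter, the sketch below counts the same text twice, with and without two of the steps. It uses only the standard library; the `count_words` function, the tiny `STOP_WORDS` set and the sample sentence are assumptions made for this example, not part of the notebook.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list (an assumption for this sketch;
# real projects usually rely on a library list such as NLTK's).
STOP_WORDS = {"a", "the", "is", "of", "and", "in", "it"}

def count_words(text, lowercase=True, remove_stop_words=True):
    """Count words in text, with two optional preprocessing steps."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters only; this drops digits and punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(tokens)

sample = "The cat is a pet. The offspring of a cat is a kitten."

raw = count_words(sample, lowercase=False, remove_stop_words=False)
cleaned = count_words(sample)

print(raw.most_common(3))
print(cleaned.most_common(3))
```

Without preprocessing, the most frequent token is "a"; with lowercasing and stop word removal, "cat" moves to the top. Which steps to switch on depends on the text and the task.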

https://github.com/Mathanraj-Sharma/sample-for-medium-article/blob/master/bag-of-words-nltk.ipynb

