Hotel Review NLP Classifier for The Hilton

Category: IT Technology · Published: 3 years ago

Detailed project report following a CRISP-DM framework. Working Demo App Included.

In this blog I will go over how I improved my classifier by implementing a neural network model for my NLP project on Hilton Hotel reviews. The model uses sentiment analysis to classify a review with a score between 1 and 5.

Here is the GitHub link for my project: www.github.com/awesomeahi95/Hotel_Review_NLP

I will go over the goal, the processes, and the result of my project.

Business case for my project:

In the modern day, public discussion and critique of products and services happens beyond dedicated review sites, and now also takes place on social media.

Online Hilton Hotel reviews are currently found on Tripadvisor, Trustpilot, and Expedia. The majority of reviewers gave a score between 3 and 5, so a new customer browsing reviews on any of these sites may well consider booking a room at the Hilton.

What if they had already made up their mind from hearing what a friend had to say? A potential customer’s hotel choice could be influenced by a tweet. Opinions are shared constantly on social media platforms and are read by the authors’ followers. Knowing what these followers think about our hotel after reading these online posts could help us better understand the general public’s perception of our hotel.

By applying sentiment analysis to existing hotel reviews from Tripadvisor.com, I created a model that can quantify, on a scale of 1–5, how the author of a tweet on Twitter or a post in a Reddit thread feels about our hotel and, as a result, how their readers may come to think about us. If a post or tweet is classified with a score below 3, it can be looked into to find out why the author had a negative opinion of our hotel, so that the problem can be fixed.

Example of a review on Tripadvisor:

[Figure: 1 star review for Hilton Hotel London Gatwick]

A human could relatively easily classify the score (to some degree) of the review above, by just reading the text. We are accustomed to understanding how another person feels about a topic, from the words they use, and the context around it.

For a computer to interpret a human’s opinion and then understand the sentiment, there are a few stages:

  • Breaking down words to their root form:

[Figure: A review (original / stemmed / lemmatised)]

Techniques like stemming and lemmatisation break down words like disgusting and disgusted to a root word, disgust.
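To make the idea of a root form concrete, here is a toy suffix-stripping stemmer. It is only an illustration, not the project's actual code and not a real stemming algorithm: the suffix list and the `toy_stem` helper are hypothetical, and in practice a library such as NLTK (`PorterStemmer`, `WordNetLemmatizer`) would normally do this job.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["disgusting", "disgusted", "disgust"]]
print(stems)  # all three words collapse to the same root
```

A rule list this small will mangle plenty of real words, which is exactly why proper stemmers and lemmatisers carry many more rules or a dictionary.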

  • Tokenisation:

Using regular expressions to break the sentence down into words only, with no punctuation.
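A minimal sketch of regex tokenisation using only Python's standard `re` module. The pattern and the sample sentence are assumptions for illustration; the exact expression used in the project may differ.

```python
import re


def tokenise(text):
    """Lower-case the text and keep runs of letters, allowing inner apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


# Hypothetical review sentence, not taken from the dataset.
tokens = tokenise("The room wasn't clean, and the staff were rude!!")
print(tokens)
```

Keeping the inner apostrophe means a word like "wasn't" survives as one token instead of being split into "wasn" and "t".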

  • Removing Stopwords

Words like ‘I’, ‘he’, and ‘and’ are among the most frequent words and could dilute the value of other words, so we remove them. As my project was oriented around hotels, I also removed frequent domain words such as ‘hotel’, ‘room’, and ‘airport’.
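Stopword removal can be sketched as a simple set lookup. The general stopword set below is a tiny illustrative subset (a full list would normally come from a library), while the three domain words are the ones named above; `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative subset of general English stopwords (assumption).
GENERAL_STOPWORDS = {"i", "he", "and", "the", "a", "was", "to"}
# Domain-specific words mentioned in this post.
DOMAIN_STOPWORDS = {"hotel", "room", "airport"}
STOPWORDS = GENERAL_STOPWORDS | DOMAIN_STOPWORDS


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the combined stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = ["the", "hotel", "room", "was", "dirty", "and", "noisy"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['dirty', 'noisy']
```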

  • Vectorisation

(THIS IS THE LEAST HUMAN STEP)

Prior to the initial phase of modelling, I had 2 choices: count vectorisation (bag of words) and TF-IDF vectorisation. Both methods use word frequency as the underlying metric, but TF-IDF also weighs a word’s frequency in a review against how common that word is across the entire corpus, which gives a more meaningful metric.

I decided to use TF-IDF vectorisation for my project. So my lemmatised review columns changed from this:

[Figure: Lemmatised review column before vectorisation]

To this:

[Figure: Review columns after TF-IDF vectorisation]

So now each review was represented by a row of 138 values, one for each of the 138 most frequent words in my review corpus.

I wasn’t too happy about the number of zeros I saw, despite it making sense.
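To see the difference between count and TF-IDF vectors, and where all the zeros come from, here is a hand-rolled sketch. It uses the plain textbook weighting tf × log(N / df), which is an assumption: libraries typically apply smoothed and normalised variants, so the numbers will not match a library's output. The three mini reviews are invented.

```python
import math
from collections import Counter

# Three invented, already-preprocessed reviews.
docs = [
    ["staff", "friendly", "clean"],
    ["staff", "rude", "dirty"],
    ["clean", "comfortable", "bed"],
]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)
# Document frequency: in how many reviews does each word appear?
df = Counter(w for d in docs for w in set(d))


def count_vector(doc):
    """Bag of words: raw count of each vocabulary word in the review."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tfidf_vector(doc):
    """Term frequency scaled by log(N / df), so rarer words weigh more."""
    counts = Counter(doc)
    return [(counts[w] / len(doc)) * math.log(n_docs / df[w]) for w in vocab]


count_row = count_vector(docs[0])
tfidf_row = tfidf_vector(docs[0])
zeros = sum(1 for v in tfidf_row if v == 0)
print(vocab)
print(count_row)
print([round(v, 3) for v in tfidf_row])
print(f"{zeros} of {len(vocab)} entries are zero")
```

In the first review, "friendly" appears in only one document, so it receives a higher weight than "staff" or "clean", which each appear in two. Every vocabulary word that is absent from a review yields a zero, so short reviews against a large vocabulary are naturally sparse.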

  • Modelling and Testing

The machine learning phase. Here I experimented with 5 classification algorithms and 5 ensemble methods, all with some hyperparameter tuning. For further detail, please look at the 3rd notebook in the GitHub repo I linked at the start of the blog.

These were my results:

[Figure: Train and validation accuracies in the model training phase]

[Figure: Train and test (in the validation accuracy column) accuracies in the model testing phase, plus an extra metric for each score]

I chose the Stacking model (an ensemble of AdaBoost of Logistic Regression and Logistic Regression), as it had a decent training accuracy and a reasonable validation accuracy. You might be thinking that accuracies in the 0.5 to 0.6 range are surely not great. Well, considering this was a 5-way multiclass classification, the odds of picking a score at random and getting it right were 0.2. Also, these are subjective scores; it can be hard even for a human to pick exactly the right score. This is better demonstrated with a confusion matrix.

[Figure: Confusion matrix for the Stacking model]

You can see that most of the time the model does predict the correct score, illustrated by the diagonal line. The majority of the error (accuracy being in the 50–60% range) comes from adjacent scores, e.g. predicting a score of 1 when the true score was 2. I was happy with this, as the model would still be good enough to distinguish between great reviews, average reviews, and bad reviews.

At this point the computer could interpret the inputted text and somewhat understand the sentiment in it.
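The point about adjacent scores can be quantified by comparing exact accuracy with "within one score" accuracy. The sketch below builds a confusion matrix by hand; the labels are invented for illustration and are not the project's results.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true scores, columns are predicted scores (scores start at 1)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


# Invented labels, for illustration only.
y_true = [1, 2, 3, 4, 5, 5, 4, 2, 1, 3]
y_pred = [1, 1, 3, 5, 5, 4, 4, 2, 2, 1]

matrix = confusion_matrix(y_true, y_pred)
total = len(y_true)

# Exact hits sit on the main diagonal.
exact = sum(matrix[i][i] for i in range(5)) / total
# Hits within one score include the two neighbouring diagonals as well.
within_one = sum(
    matrix[i][j] for i in range(5) for j in range(5) if abs(i - j) <= 1
) / total

for row in matrix:
    print(row)
print(f"exact accuracy: {exact:.2f}, within one score: {within_one:.2f}")
```

A large gap between the two numbers means most mistakes land on a neighbouring score, which matters much less when the goal is to separate bad, average, and great reviews.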

I wanted better.

Why not make it more human? Neural networks are loosely modelled on how the neurons in our brains work, so that was probably the change I could make to better my model.

  • Neural Network

The preprocessing was a bit different before creating my neural network model.

I created a dictionary whose keys were all the unique words in the corpus and whose values were a number associated with each unique word. I also added 4 special keys for padding, start of review, unknown words, and unused words. In total I had 17317 word entries in the dictionary, drawn from 9405 reviews.

word_index_dict['<PAD>'] = 0
word_index_dict['<START>'] = 1
word_index_dict['<UNK>'] = 2
word_index_dict['<UNUSED>'] = 3

Pre-Indexing:

[Figure: Reviews prior to indexing]

Post-Indexing:

[Figure: Reviews after indexing]

As a final preprocessing step, I added a padding layer, with a max length of 250 words. Then I trained the model.
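The indexing and padding steps can be sketched in plain Python. The special tokens and the max length of 250 come from this post; everything else is an assumption for illustration, including the helper names, prepending `<START>` to each review, and padding at the end of the sequence rather than the start.

```python
PAD, START, UNK, UNUSED = "<PAD>", "<START>", "<UNK>", "<UNUSED>"


def build_word_index(tokenised_reviews):
    """Reserve indices 0-3 for special tokens, then number words as they appear."""
    word_index = {PAD: 0, START: 1, UNK: 2, UNUSED: 3}
    for review in tokenised_reviews:
        for word in review:
            if word not in word_index:
                word_index[word] = len(word_index)
    return word_index


def encode(review, word_index):
    """Map words to indices; words never seen in training become <UNK>."""
    return [word_index[START]] + [word_index.get(w, word_index[UNK]) for w in review]


def pad(sequence, max_len=250, pad_value=0):
    """Right-pad with <PAD> (assumption) and truncate to exactly max_len entries."""
    return (sequence + [pad_value] * max_len)[:max_len]


# Invented mini corpus.
reviews = [["staff", "friendly"], ["staff", "rude", "dirty"]]
word_index = build_word_index(reviews)

encoded = encode(["staff", "filthy"], word_index)  # "filthy" is not in the corpus
padded = pad(encoded)
print(encoded, len(padded))
```

Fixing every review at the same length is what lets a batch of reviews be stacked into one rectangular array for the network.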

  • Neural Network Architecture:

[Figure: Architecture of the neural network model used]

The special layer for NLP here is the Embedding Layer.

The words are mapped to vectors in a vector space, in my case 16-dimensional vectors. Each word’s vector is learned during training, so words that are used in similar contexts end up with similar vectors. This is different to the TF-IDF vectorisation from earlier: we aren’t just looking at frequency-based metrics, but at the impact of each word given its context.
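Mechanically, an embedding layer is a lookup table with one row per word index. The sketch below shows only that lookup, with random starting values and a tiny hypothetical vocabulary; the 16 dimensions come from this post, but in a real model the rows are adjusted during training, which this sketch does not do.

```python
import random

EMBEDDING_DIM = 16  # dimension stated in the post
VOCAB_SIZE = 8      # tiny hypothetical vocabulary; the real one is far larger

rng = random.Random(0)
# One row of EMBEDDING_DIM numbers per word index, randomly initialised.
embedding_matrix = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(sequence):
    """Replace each word index with its row of the embedding matrix."""
    return [embedding_matrix[i] for i in sequence]


review = [1, 4, 5, 0, 0]  # <START>, two words, then two <PAD> entries
vectors = embed(review)
print(len(vectors), "tokens x", len(vectors[0]), "dimensions")
```

The same index always maps to the same row, so the two padding entries share one vector; the meaning comes from training nudging the rows until useful words sit in useful places.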

This is starting to feel more human.

Now words like good, great, bad, and worse have more meaningful numbers (vectors) associated with them. New reviews that the model is tested on won’t just be judged on whether they contain some of these words, but also on the words that surround them, which paint a better picture of what the writer of the review is trying to say. This picture could be sharper with more data, but the current 9405 reviews will do a fine job.

  • Testing Neural Network Model

The testing accuracy of the model came to 0.5710, which is better than our previous model’s accuracy of 0.5077. That is an improvement of about 6 percentage points (0.0633), or roughly 12% in relative terms, which is a meaningful gain. Again, though, the best way to assess this 5-way multi-class classification is to look at a confusion matrix.

[Figure: Neural network model test data confusion matrix]

As you can see, the model never predicted a review with a score of 5 as a score of 1, or vice versa. The other misclassified scores have improved too, and the majority of the predictions are closer to the main diagonal.

  • Application

I have built a demo application of the model using Streamlit and Heroku, which you can try out here: www.hilton-hotel-app.herokuapp.com/

Improvements to be made:

  • Use a bigger training dataset
  • Try a deeper neural network
  • Reduce complexity of classification to binary classification
  • Implement other pre-made vectorisation methods, such as word2vec or GloVe
