- What is sentiment analysis?
- Sentiment Analysis Datasets: Countless good samples are half the battle
- Do it yourself: when accuracy is top priority
- Use an annotated corpus: when time and money are tight
- Anyway, how many data samples are enough?
- Word embedding: making natural language understandable to machines
- TF-IDF: measurement of importance
- Word2Vec: Neighborhood survey
- GloVe: counts co-occurrences of words
- Sentiment Analysis Algorithms: Evaluating Guest Reviews
- Score Ratings: Excellent to Terrible
- Location of facilities: How good are the bar and food?
- Visualization: all comments at a glance
Reading time: 10 minutes
How long does it take the average traveler to choose a hotel? To our knowledge, no scientific studies have been conducted to answer this question. But real-world experience clearly shows that people spend hours or even days sifting through dozens, if not hundreds, of options.
The number of things to consider and the variety of reviews from past guests are astounding. This article describes our experience using sentiment analysis to create instant feedback snapshots so travelers can quickly compare different options and make the best choice in no time.
Such solutions can benefit hotel owners, online travel agencies, booking sites, metasearch engines and travel review platforms looking for ways to put their customers in a more relaxed mood.
What is sentiment analysis?
Sentiment analysis is a technique for capturing the emotional color of a text. It uses natural language processing (NLP) and machine learning to discover, extract and study how customers perceive a product or service. That's why this type of research is often referred to as opinion mining or emotion AI.
The purpose of opinion mining is to identify the polarity of a text, meaning it should be classified as positive, negative or neutral. For example, we can say that a comment like
"We spent five days at this hotel" is neutral,
"I enjoyed staying here" is positive, and
"I didn't like this hotel" is negative.
You can learn much more about the types, tools and usage scenarios of sentiment analysis in our dedicated blog post. This time, we will focus on how exactly we taught the machine to recognize emotions in reviews and what lessons we learned from creating an NLP-based tool called Choicy. So let's get started!
The use of machine learning to analyze sentiment.
Sentiment Analysis Datasets: Countless good samples are half the battle
The first step in sentiment analysis is to get a training data set with annotations that tell the algorithm what is positive and what is negative in it. Here you have two options: do it yourself or use publicly available corpora. To find out more about data preparation in general, read our article or watch our explanation on YouTube:
Explanation of data preparation
Do it yourself: when accuracy is top priority
You don't need the power of machine learning to predict that a custom data set will produce the best results. Greater efficiency and accuracy come at a price, however, as preparing data for sentiment analysis is a time-consuming and labor-intensive process involving three key steps.
Step 1 - data collection. First, collect real reviews from hotel guests. The best way to achieve this is to use feedback from your own website. If this option is not available, try to work with sources that have such data. The usual data collection method - scraping - is not recommended as it can lead to legal problems. Under GDPR and CCPA principles, this technique cannot be applied to personal data. You may also inadvertently infringe on the proprietary rights of website owners.
Step 2 - sentiment annotation. For sentiments hidden in a review to be visible to machines, you must manually assign labels (positive, neutral or negative) to words and phrases. Sentiment data labeling is considered reliable if more than one judge scored the data set. The general rule is to include three annotators.
Step 3 - text cleaning. Raw hotel reviews contain a lot of irrelevant or just plain junk data that can negatively affect the model's accuracy. So we need to clean them up, which includes:
- noise removal – getting rid of things like special characters, hyperlinks, tags, numbers, extra spaces and punctuation marks;
- stop word removal – dropping articles, pronouns, conjunctions, prepositions, etc. One of the most popular NLP libraries, NLTK (short for Natural Language Toolkit), contains 179 stop words for the English language;
- lowercasing – to avoid case differences between words with the same meaning;
- normalization – i.e., converting words into their canonical form. For example, the normalized form of 2mrw is tomorrow;
- stemming – or shortening each word to its stem by cutting off the endings (prefixes and suffixes). This technique often produces grammatically incorrect results: for example, having is truncated to hav; and
- lemmatization – meaning the word is returned to its dictionary form. Say, the lemma for swims, swimming and swam is swim.
Comparison of stemming and lemmatization. Source: Kaggle
Stemming and lemmatization are interchangeable techniques, as they solve the same task: filtering variations of a word down to a base unit. But when choosing between the two, keep in mind that stemming is simpler and faster, while lemmatization gives more accurate results.
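To make the cleaning steps concrete, here is a minimal sketch using only Python's standard library. The stop-word list is a toy subset (a real pipeline would use NLTK's 179-word list and a proper lemmatizer), and the regular expressions are simplified assumptions rather than our production code:

```python
import re

# Toy stop-word list; NLTK's English list has 179 entries.
STOP_WORDS = {"a", "an", "the", "i", "we", "at", "in", "of", "and", "this"}

def clean_review(text: str) -> list[str]:
    """Apply the cleaning steps described above to one review."""
    # 1. Noise removal: strip hyperlinks and HTML tags first,
    #    then everything that is not a letter or whitespace.
    text = re.sub(r"https?://\S+|<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # 2. Lowercasing, so "Hotel" and "hotel" are treated the same.
    tokens = text.lower().split()
    # 3. Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_review("We spent 5 days at this Hotel! <br> http://x.co"))
```

Note the ordering: URLs and tags must be removed before punctuation, or the punctuation pass would shred them into meaningless fragments.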
It's worth noting that all of these routine tasks are typically done by freelancers or interns, not data scientists themselves. The latter simply monitor the process and provide instructions on what to collect and how to pre-process and annotate the raw data to make it suitable for machine learning.
Use an annotated corpus: when time and money are tight
So far, we've walked you through the canonical way of creating a dataset, and it really is the best option, given enough time and resources. In our case, though, we had to deal with strict deadlines and budget constraints. So we took the second approach and picked, from the available labeled corpora, the one that best suited our requirements. Fortunately, sentiment labeling is common practice, and many datasets have been developed over the years.
There are two main factors to consider when choosing an annotated dataset:
- the length of the texts. If sentiment labeling was applied to long reads (articles or blog posts), such a dataset won't suit short texts like tweets or comments, and vice versa.
- subject or domain. A corpus of tagged political tweets may be of an acceptable size, but it will still perform poorly when training a model that analyzes hotel reviews. Samples of annotated restaurant or airline reviews are a better option and can provide a satisfactory level of accuracy for the hospitality industry. However, the best match for hotel reviews is… well, a dataset generated from hotel reviews.
Below are some free downloadable datasets for training machine learning models for sentiment analysis. We experimented with a few of them.
The Stanford Sentiment Treebank contains almost 12,000 sentences from film reviews on Rotten Tomatoes. Each sentence is represented by a parse tree with annotated words and phrases that capture the sentiment behind the statement. In total, the dataset contains over 215,000 unique phrases, each scored by three judges.
Example of a parse tree with annotated words and phrases from the Stanford Treebank.
Sentiment140 includes over 1.6 million tweets collected via the Twitter API. All tweets are marked as positive or negative and can be used to detect sentiment related to a brand, product or topic on Twitter.
The Restaurant Review Dataset stores a total of 52,077 reviews with ratings and lists of pros and cons.
The TripAdvisor Hotel Reviews dataset collects almost 20,000 pre-processed hotel reviews with ratings.
Anyway, how many data samples are enough?
Whether you're building a dataset yourself or looking for a ready-made corpus, the question is: how large should it be to train a machine learning model?
"The more data you have, the more complex models you can use," says Alexander Konduforov, Data Science competency manager at AltexSoft. "Deep learning models that achieve the highest level of accuracy require tens of thousands or even hundreds of thousands of samples. It just doesn't make sense to train them on small datasets."
For simpler algorithms, fewer samples are sufficient. But even then, we're talking about thousands, not hundreds, of annotated reviews. "For example, the first thousand will give you 70 percent accuracy," says Alexander, explaining how quantity affects quality. "Each additional thousand will keep increasing the accuracy, but at a slower rate. Say, with 15,000 samples you can achieve 90 percent, while 150,000 will give you 95 percent. At some point, the growth curve flattens out, and from then on it makes no sense to add new samples."
To train the Choicy model, we collected a dataset of over 100,000 review samples from public sources.
Word embedding: making natural language understandable to machines
No machine learning model – not even the smartest – can understand natural languages. So before we feed data into the ML algorithm, we need to convert words and phrases into numerical or vector representations.
This process is called word embedding. After researching several techniques, we finally settled on one of the most advanced ones. But to put our decision into context, let's look at three popular options.
TF-IDF: measurement of importance
Term Frequency Inverse Document Frequency (TF-IDF)calculates the frequency of words in a text dataset and assigns higher weight to rarer words or phrases. In other words, it measures the weight of a given word and acts as a simple keyword extraction method.
While TF-IDF is still used in sentiment analysis, it is a fairly old technique that misses a lot of valuable information, such as the context of nearby words or their order.
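To make the idea concrete, here is a minimal TF-IDF sketch in plain Python. It uses the classic tf × log(N/df) weighting without the smoothing that production libraries such as scikit-learn apply, so treat it as an illustration rather than a reference implementation:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term in each tokenized document by tf * idf."""
    n = len(docs)
    # df[t] = number of documents containing term t
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            t: (count / len(doc)) * math.log(n / df[t])
            for t, count in tf.items()
        })
    return scores

docs = [["great", "location", "great", "staff"],
        ["noisy", "location"],
        ["great", "breakfast"]]
weights = tf_idf(docs)
```

Notice that a word appearing in every document gets an idf of log(1) = 0, which is exactly the "measure of importance" intuition: ubiquitous words carry no distinguishing weight.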
Word2Vec: Neighborhood survey
Word2Vecis a shallow, two-layer neural network that converts words to vectors (hence the name - word to vector). Created by Google in 2013 to work with large corpora of text data, Word2Vec places words used in the same context close together.
Tweets visualized with Word2Vec. Source: Fryderyk Godings
The model is pretrained on approximately 100 billion words from Google News and is available for download here. It learns from data (i.e., vectors) and improves its ability to predict a word from its context over time.
GloVe: counts co-occurrences of words
Similar to Word2Vec, GloVe (Global Vectors) creates vector representations of text data. The difference is that Word2Vec focuses on adjacent words, while GloVe counts the co-occurrence of words across the entire text corpus. Simply put, it first builds a huge context matrix that shows how often a given word appears in a given context. It then creates word vectors that reflect those co-occurrence frequencies.
Co-occurrence of comparative and superlative adjectives, measured by the GloVe model. Source:NLP group from Stanford
You can download GloVe word vectors pre-trained on tweets, Wikipedia texts and the Common Crawl corpus from GitHub. The Stanford NLP Group, which developed this model, also helps data scientists train custom GloVe vectors.
Both Word2Vec and GloVe can capture semantic and syntactic relations between words. However, the latter outperforms the former in accuracy and training time. That's why our data scientists ultimately chose GloVe to "translate" hotel reviews into machine-readable form for further analysis.
Sentiment Analysis Algorithms: Evaluating Guest Reviews
From a machine learning point of view, sentiment analysis is a supervised learning problem. This means that the training dataset already contains the correct answers. After training and evaluating the results, the model is ready to classify sentiments in new, unlabeled hotel reviews.
Score Ratings: Excellent to Terrible
Various models are used for sentiment analysis tasks; our data scientists ultimately chose a one-dimensional convolutional neural network (1D CNN) as one of the most effective options.
CNNs with two or three dimensions are especially good at image recognition thanks to their ability to detect specific patterns and take the spatial relationships between them into account. These properties also make them effective for sentiment classification, except that for sequential data such as text, one-dimensional models work better than the 2D and 3D alternatives.
Our CNN model is trained to produce a sentiment score that takes into account all the negative and positive polarity words found in a review. Based on the resulting value, we classify each review as "Excellent", "Very Good", "Average", "Poor" or "Terrible". After processing all the opinions about a given hotel, it becomes clear which of the five ratings prevails and whether the hotel is even worth considering. The deed is done!
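The final step, mapping the model's score to one of the five labels, can be as simple as thresholding. The cut-off values below are illustrative assumptions, not the ones our model actually uses:

```python
def score_to_label(score: float) -> str:
    """Map a sentiment score in [0, 1] to one of five review labels.
    Thresholds are hypothetical, chosen only to illustrate the idea."""
    bands = [(0.8, "Excellent"), (0.6, "Very Good"),
             (0.4, "Average"), (0.2, "Poor")]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Terrible"

print(score_to_label(0.85))
print(score_to_label(0.15))
```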
However, we decided to go further.
Location of facilities: How good are the bar and food?
Most often, feedback contains a mixture of positive and negative comments. Suppose a guest appreciates the convenient location, praises the restaurant, but complains about the noise at night and the lack of air conditioning – all in one review. In this case, the model's overall assessment will be almost neutral. Moreover, it doesn't specify exactly why people like or dislike a particular place.
In order not to miss key information, we developed a mechanism that captures the emotions behind nearly thirty individual hotel facilities, such as bar/lounge, food, surroundings, cleanliness, tranquility, comfort and more. To this end, we split each review into sentences and each complex sentence into simpler ones. Then we performed two classification tasks.
1. Classification by facility. Our first task was to place each sentence, or part of it, into one or more facility categories. Ideally, we would have trained the model on a specially annotated dataset to detect entities like air conditioning in real reviews.
Unfortunately, there are no ready-made datasets labeled with thirty facilities. So, due to time and budget constraints, we took a shortcut. Instead of using machine learning, we created a vocabulary of keywords for each facility to automatically find them in sentences and categorize them accordingly. For example, if a certain sentence contains the words bath, bathroom, or shower, it is classified as bathroom. Simple but good enough!
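A keyword-based classifier like this boils down to dictionary lookups. The categories and keywords below are a small, made-up sample of the roughly thirty facilities mentioned above:

```python
# Hypothetical keyword vocabulary; the real one covers ~30 facilities.
FACILITY_KEYWORDS = {
    "bathroom": {"bath", "bathroom", "shower"},
    "food":     {"breakfast", "dinner", "restaurant", "food"},
    "quiet":    {"noise", "noisy", "quiet"},
}

def classify_facilities(sentence: str) -> set[str]:
    """Return every facility category whose keywords appear in the sentence."""
    words = set(sentence.lower().split())
    return {cat for cat, keys in FACILITY_KEYWORDS.items() if words & keys}

print(classify_facilities("The shower was great but breakfast was cold"))
```

A sentence can land in several categories at once, which matches how guests actually write: one clause about the bathroom, the next about breakfast.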
2. Sentiment classification. To find the feelings behind the sentences, we used a hierarchical, attention-based, position-aware network, or HAPN – another advanced model that can capture the context and relationships of words. So it gives a clear idea of what a positive or negative word refers to. As a result, each categorized sentence was assigned an individual sentiment score.
After HAPN scored each sentence independently, we summarized the sentiment ratings by category and produced a score for each facility.
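Aggregating the per-sentence scores into per-facility ratings is then a matter of grouping and averaging, as this minimal sketch shows (the category names and scores are invented for illustration):

```python
from collections import defaultdict

def facility_scores(scored_sentences: list[tuple[str, float]]) -> dict[str, float]:
    """Average per-sentence sentiment scores by facility category."""
    by_category = defaultdict(list)
    for category, score in scored_sentences:
        by_category[category].append(score)
    return {cat: sum(s) / len(s) for cat, s in by_category.items()}

# Each tuple: (facility category, sentiment score from the model)
reviews = [("food", 0.9), ("food", 0.7), ("quiet", 0.2)]
print(facility_scores(reviews))
```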
Quite simple, but as they say: "A picture is worth a thousand words." So let's move on to the visualization.
Visualization: all comments at a glance
And here is what we came up with: a simple but clean interface that condenses hundreds of reviews into just a few visuals.
The Choicy interface with overall and facility scores.
Reading reviews when choosing a hotel can take hours – or days – depending on a person's anxiety level. Sentiment analysis reduces the time it takes to weigh the pros and cons to minutes. Good enough, although we still see room for improvement. No hotel is perfect, no model is perfect, but it's worth working on.
One caveat: sentiment analysis cannot understand sarcasm. Every time a traveler leaves a comment like "The best family hotel, yeah!", meaning they will never stay there again, chances are the ML model will classify this feedback as positive. Clearly, the technology is on the hotels' side – at least for now.