In Part 1 of this study project I explored the motivation and data of deceptive hotel reviews provided by researchers at Cornell University.
In Part 2, I will attempt to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing.
rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together
# Text Mining Packages
library(tidytext)
library(tokenizers)
library("e1071")
library(tm)
library(RTextTools)
# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr) # for dynamic reporting
library(kableExtra) # create a nicely formated HTML table
library(formattable) # for the color_tile function
publication_theme <- function() {
theme_economist() +
theme(text=element_text(family="Rockwell"),
plot.title = element_text(family="Rockwell", size=12)
)
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'
Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)
First let’s restore our recently tagged dataset: Here’s a quick refresher:
reviews.df <- as.tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors = FALSE))
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"
reviews.df <- reviews.df %>%
filter(word != '#NAME?') %>%
select(-entity)
tag_pos <- function(anIndex) {
return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep =''))}
reviews.df$word <- sapply(1:length(reviews.df$word),tag_pos)
kable(reviews.df[1:15,], format = "markdown")
X | doc_id | truthfulness | polarity | word | lemma | pos |
---|---|---|---|---|---|---|
1 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | stay/VERB | stay | VERB |
2 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | schicago/PROPN | schicago | PROPN |
3 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
4 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | day/NOUN | day | NOUN |
5 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | night/NOUN | night | NOUN |
6 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | conference/NOUN | conference | NOUN |
7 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | easy/ADJ | easy | ADJ |
8 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | amenity/NOUN | amenity | NOUN |
9 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | cleanliness/NOUN | cleanliness | NOUN |
10 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | experience/NOUN | experience | NOUN |
11 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
12 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | awful/ADJ | awful | ADJ |
13 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | take/VERB | take | VERB |
14 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | time/NOUN | time | NOUN |
15 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | write/VERB | write | VERB |
As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.
The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf)
Apologies: There’s an occasional Google Chrome bug with latex/equation rendering - see equations in Safari if you want to understand them
(Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.
Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is log(natural) of the number of docs divided by the number of docs in which this term appears. \[
Idf(\omega) = ln \left( \frac{ \# docs}{\# docs.containing(\omega) } \right)
\]
(Tf°Idf):TERM SPECIFICITY Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the TF°IDF score (weight), the rarer the term and vice versa. The intuition for this measure is this:
If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.
Tf°idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google’s search engine, use Tf°Idf. Introduced, as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.
Here’s a peek at our datasets Tf°Idf:
reviews.tfidf <- reviews.df %>%
count(doc_id, word, sort = TRUE)
total_words.tfidf <- reviews.tfidf %>%
group_by(doc_id) %>%
summarize(total = sum(n))
reviews.tfidf <- left_join(reviews.tfidf, total_words.tfidf)
reviews.tfidf <- reviews.tfidf %>%
bind_tf_idf(word, doc_id, n)
kable(reviews.tfidf[1:15,], format = "markdown")
doc_id | word | n | total | tf | idf | tf_idf |
---|---|---|---|---|---|---|
negative/MTurk/f4/d_homewood_15 | hotel/NOUN | 11 | 49 | 0.2244898 | 0.2076394 | 0.0466129 |
negative/MTurk/f5/d_intercontinental_17 | hotel/NOUN | 10 | 75 | 0.1333333 | 0.2076394 | 0.0276852 |
negative/Web/f4/t_homewood_19 | credit/NOUN | 10 | 233 | 0.0429185 | 3.8513984 | 0.1652961 |
negative/Web/f4/t_knickerbocker_17 | hotel/NOUN | 10 | 62 | 0.1612903 | 0.2076394 | 0.0334902 |
positive/MTurk/f3/d_omni_5 | omni/PROPN | 9 | 157 | 0.0573248 | 3.2033716 | 0.1836328 |
negative/MTurk/f2/d_affinia_6 | pillow/NOUN | 8 | 112 | 0.0714286 | 2.9234116 | 0.2088151 |
negative/MTurk/f4/d_swissotel_12 | hotel/NOUN | 8 | 73 | 0.1095890 | 0.2076394 | 0.0227550 |
negative/Web/f1/t_james_5 | spoon/NOUN | 8 | 124 | 0.0645161 | 6.6846117 | 0.4312653 |
negative/Web/f1/t_monaco_10 | garage/NOUN | 8 | 130 | 0.0615385 | 5.0751738 | 0.3123184 |
negative/Web/f2/t_affinia_7 | elevator/NOUN | 8 | 179 | 0.0446927 | 2.8559703 | 0.1276411 |
negative/Web/f2/t_ambassador_14 | hotel/NOUN | 8 | 36 | 0.2222222 | 0.2076394 | 0.0461421 |
negative/Web/f2/t_talbott_11 | hotel/NOUN | 8 | 95 | 0.0842105 | 0.2076394 | 0.0174854 |
negative/Web/f2/t_talbott_17 | service/NOUN | 8 | 252 | 0.0317460 | 0.9976364 | 0.0316710 |
negative/Web/f3/t_hyatt_20 | night/NOUN | 8 | 158 | 0.0506329 | 1.1067705 | 0.0560390 |
negative/Web/f4/t_homewood_19 | hotel/NOUN | 8 | 233 | 0.0343348 | 0.2076394 | 0.0071292 |
As you can see words like hotel, that are common, and common to a lot of documents will not be weighted highly for the machine learning algorithm. The tf_idf range for “hotel” is [0.00 - 0.05], whereas, words like “bedbug”, “concierge”, or “stinky”, have a range of [0.40 - 0.50] Let’s look at the top 30 tf°idf’s for deceptive and truthful reviews.
reviews.tfidf.df <- left_join(reviews.df, reviews.tfidf)
deceptive.tfidf <- reviews.tfidf.df %>%
filter(truthfulness=="deceptive") %>%
filter(pos!='PROPN')
deceptive.tfidf <- distinct(deceptive.tfidf, word, .keep_all = TRUE)
deceptive.tfidf <- deceptive.tfidf[order(-deceptive.tfidf$tf_idf),]
kable(deceptive.tfidf[1:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
---|---|---|---|---|---|---|---|
positive/MTurk/f4/d_swissotel_9 | deceptive | positive | beat/NOUN | beat | 11 | 0.0909091 | 7.377759 |
positive/MTurk/f2/d_talbott_4 | deceptive | positive | ritzy/ADJ | ritzy | 24 | 0.0833333 | 7.377759 |
positive/MTurk/f3/d_conrad_8 | deceptive | positive | brightly/ADV | brightly | 12 | 0.0833333 | 7.377759 |
positive/MTurk/f3/d_conrad_8 | deceptive | positive | garishness/NOUN | garishness | 12 | 0.0833333 | 7.377759 |
positive/MTurk/f2/d_talbott_7 | deceptive | positive | bedbug/ADV | bedbug | 13 | 0.0769231 | 7.377759 |
positive/MTurk/f1/d_monaco_12 | deceptive | positive | tidy/ADJ | tidy | 10 | 0.1000000 | 5.431849 |
positive/MTurk/f4/d_sheraton_14 | deceptive | positive | -bobby/INTJ | -bobby | 14 | 0.0714286 | 7.377759 |
positive/MTurk/f2/d_talbott_10 | deceptive | positive | talbott/NOUN | talbott | 59 | 0.0677966 | 7.377759 |
positive/MTurk/f5/d_amalfi_18 | deceptive | positive | complimentartry/NOUN | complimentartry | 15 | 0.0666667 | 7.377759 |
positive/MTurk/f2/d_talbott_7 | deceptive | positive | above/ADP | above | 13 | 0.0769231 | 6.279147 |
positive/MTurk/f1/d_james_9 | deceptive | positive | go/VERB | go | 11 | 0.0909091 | 5.298317 |
negative/MTurk/f1/d_james_19 | deceptive | negative | deceive/VERB | deceive | 12 | 0.0833333 | 5.768321 |
negative/MTurk/f1/d_sofitel_11 | deceptive | negative | garish/ADJ | garish | 14 | 0.0714286 | 6.684612 |
negative/MTurk/f1/d_monaco_7 | deceptive | negative | fin/NOUN | fin | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f2/d_affinia_8 | deceptive | positive | nearby/ADP | nearby | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f4/d_sheraton_6 | deceptive | positive | coffeepot/NOUN | coffeepot | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f4/d_swissotel_20 | deceptive | positive | clientel/NOUN | clientel | 16 | 0.0625000 | 7.377759 |
negative/MTurk/f5/d_amalfi_6 | deceptive | negative | planner/NOUN | planner | 33 | 0.0606061 | 7.377759 |
negative/MTurk/f5/d_allegro_17 | deceptive | negative | prolem/NOUN | prolem | 17 | 0.0588235 | 7.377759 |
Deceptive writers use the word deceive in deceptive reviews.
truthful.tfidf <- reviews.tfidf.df %>%
filter(truthfulness=="truthful") %>%
filter(pos!='PROPN')
truthful.tfidf <- distinct(truthful.tfidf, word, .keep_all = TRUE)
truthful.tfidf <- truthful.tfidf[order(-truthful.tfidf$tf_idf),]
kable(truthful.tfidf[2:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
---|---|---|---|---|---|---|---|
negative/Web/f3/t_fairmont_9 | truthful | negative | sully/ADV | sully | 12 | 0.0833333 | 7.377759 |
negative/Web/f1/t_james_14 | truthful | negative | yeah/INTJ | yeah | 29 | 0.1034483 | 5.180534 |
negative/Web/f2/t_ambassador_13 | truthful | negative | carpeting/NOUN | carpeting | 24 | 0.0833333 | 5.991465 |
negative/Web/f3/t_fairmont_9 | truthful | negative | lousy/ADJ | lousy | 12 | 0.0833333 | 5.991465 |
negative/Web/f1/t_sofitel_9 | truthful | negative | princess/NOUN | princess | 15 | 0.0666667 | 7.377759 |
negative/Web/f3/t_omni_4 | truthful | negative | stinky/ADJ | stinky | 15 | 0.0666667 | 7.377759 |
negative/Web/f3/t_hyatt_9 | truthful | negative | strike/NOUN | strike | 26 | 0.0769231 | 6.279147 |
negative/Web/f3/t_fairmont_9 | truthful | negative | boy/NOUN | boy | 12 | 0.0833333 | 5.768321 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | lime/NOUN | lime | 16 | 0.0625000 | 7.377759 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | punch/NOUN | punch | 16 | 0.0625000 | 7.377759 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | extradionary/ADJ | extradionary | 16 | 0.0625000 | 7.377759 |
negative/Web/f1/t_sofitel_13 | truthful | negative | blanket/NOUN | blanket | 31 | 0.0967742 | 4.605170 |
negative/Web/f3/t_omni_4 | truthful | negative | hurt/VERB | hurt | 15 | 0.0666667 | 6.684612 |
negative/Web/f1/t_sofitel_10 | truthful | negative | lexus/NOUN | lexus | 17 | 0.0588235 | 7.377759 |
negative/Web/f1/t_sofitel_10 | truthful | negative | smash/VERB | smash | 17 | 0.0588235 | 7.377759 |
positive/TripAdvisor/f5/t_intercontinental_2 | truthful | positive | doller/NOUN | doller | 17 | 0.0588235 | 7.377759 |
negative/Web/f1/t_james_5 | truthful | negative | spoon/NOUN | spoon | 124 | 0.0645161 | 6.684612 |
positive/TripAdvisor/f1/t_hilton_5 | truthful | positive | place.amazing/NOUN | place.amazing | 18 | 0.0555556 | 7.377759 |
Truthful people don’t say smelly, they say stinky.