
by Ashok Khosla, Mendocino County, California


Using Machine Learning techniques


In Part 1 of this study project I explored the motivation behind, and the dataset of, deceptive hotel reviews provided by researchers at Cornell University.

In Part 2, I will attempt to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing.

rm(list=ls())
library(tidyverse)
library(gridExtra) # viewing multiple plots together

# Text Mining Packages
library(tidytext)
library(tokenizers)
library(e1071)       # SVM and Naive Bayes classifiers
library(tm)
library(RTextTools)

# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr)       # for dynamic reporting
library(kableExtra)  # create a nicely formatted HTML table
library(formattable) # for the color_tile function

publication_theme <- function() {
    theme_economist() +
    theme(text=element_text(family="Rockwell"),
          plot.title = element_text(family="Rockwell", size=12)
    )
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'


Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)

Restore the previously tagged dataset

First, let's restore the dataset we tagged in Part 1. Here's a quick refresher:

reviews.df <- as_tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors = FALSE))

names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"

reviews.df <- reviews.df %>%
    filter(word != '#NAME?') %>%
    select(-entity)
# Combine each word's lemma and part-of-speech tag into a single "lemma/POS" token
tag_pos <- function(anIndex) {
    return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep = ''))}
reviews.df$word <- sapply(seq_along(reviews.df$word), tag_pos)

kable(reviews.df[1:15,], format = "markdown")
| X | doc_id | truthfulness | polarity | word | lemma | pos |
|---|--------|--------------|----------|------|-------|-----|
| 1 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | stay/VERB | stay | VERB |
| 2 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | schicago/PROPN | schicago | PROPN |
| 3 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 4 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | day/NOUN | day | NOUN |
| 5 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | night/NOUN | night | NOUN |
| 6 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | conference/NOUN | conference | NOUN |
| 7 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | easy/ADJ | easy | ADJ |
| 8 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | amenity/NOUN | amenity | NOUN |
| 9 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | cleanliness/NOUN | cleanliness | NOUN |
| 10 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | experience/NOUN | experience | NOUN |
| 11 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 12 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | awful/ADJ | awful | ADJ |
| 13 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | take/VERB | take | VERB |
| 14 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | time/NOUN | time | NOUN |
| 15 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | write/VERB | write | VERB |

Term Frequency, Inverse Document Frequency, and Tf°Idf

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf).

  • (Tf): TERM FREQUENCY - How probable it is that this particular word appears in a SINGLE document. \[ Tf(\omega) = \left( \frac{\#\ occurrences\ of\ \omega\ in\ document}{\#\ words\ in\ document} \right) \]
  • (Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

    Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common choice is the natural log of the number of documents divided by the number of documents in which the term appears. \[ Idf(\omega) = ln \left( \frac{\#\ docs}{\#\ docs.containing(\omega)} \right) \]

  • (Tf°Idf): TERM SPECIFICITY - Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the Tf°Idf score (weight), the rarer the term, and vice versa. The intuition for this measure is this:

    If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.
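The three formulas above can be sketched in a few lines of base R. The three-document corpus below is invented purely for illustration:

```r
# Toy corpus: three tiny "documents", each a vector of words
docs <- list(
    d1 = c("hotel", "room", "clean", "hotel"),
    d2 = c("hotel", "bedbug", "awful"),
    d3 = c("hotel", "staff", "friendly")
)

# Tf: share of a document's words that are w
tf  <- function(w, doc) sum(doc == w) / length(doc)
# Idf: natural log of (# docs / # docs containing w)
idf <- function(w, docs) log(length(docs) / sum(sapply(docs, function(d) w %in% d)))
# Tf°Idf: the product of the two
tf_idf <- function(w, doc, docs) tf(w, doc) * idf(w, docs)

tf_idf("hotel", docs$d1, docs)   # 0: "hotel" occurs in every document, so idf = ln(3/3) = 0
tf_idf("bedbug", docs$d2, docs)  # rare term: (1/3) * ln(3/1) ≈ 0.366
```

With only three documents the numbers are easy to check by hand: “bedbug” appears once in a three-word document (Tf = 1/3) and in one of three documents (Idf = ln 3), giving Tf°Idf ≈ 0.366, while the ubiquitous “hotel” scores exactly zero.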

Tf°Idf is one of the most popular term-weighting schemes today; over 80% of text-based recommender systems, including Google's search engine, use it. Introduced as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic, but its theoretical foundations remained troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it.

Here’s a peek at our dataset’s Tf°Idf values:

reviews.tfidf <- reviews.df %>%
    count(doc_id, word, sort = TRUE)

# Total number of words in each document
total_words.tfidf <- reviews.tfidf %>%
    group_by(doc_id) %>%
    summarize(total = sum(n))

reviews.tfidf <- left_join(reviews.tfidf, total_words.tfidf, by = "doc_id")
reviews.tfidf <- reviews.tfidf %>%
    bind_tf_idf(word, doc_id, n)
kable(reviews.tfidf[1:15,], format = "markdown")
| doc_id | word | n | total | tf | idf | tf_idf |
|--------|------|---|-------|----|-----|--------|
| negative/MTurk/f4/d_homewood_15 | hotel/NOUN | 11 | 49 | 0.2244898 | 0.2076394 | 0.0466129 |
| negative/MTurk/f5/d_intercontinental_17 | hotel/NOUN | 10 | 75 | 0.1333333 | 0.2076394 | 0.0276852 |
| negative/Web/f4/t_homewood_19 | credit/NOUN | 10 | 233 | 0.0429185 | 3.8513984 | 0.1652961 |
| negative/Web/f4/t_knickerbocker_17 | hotel/NOUN | 10 | 62 | 0.1612903 | 0.2076394 | 0.0334902 |
| positive/MTurk/f3/d_omni_5 | omni/PROPN | 9 | 157 | 0.0573248 | 3.2033716 | 0.1836328 |
| negative/MTurk/f2/d_affinia_6 | pillow/NOUN | 8 | 112 | 0.0714286 | 2.9234116 | 0.2088151 |
| negative/MTurk/f4/d_swissotel_12 | hotel/NOUN | 8 | 73 | 0.1095890 | 0.2076394 | 0.0227550 |
| negative/Web/f1/t_james_5 | spoon/NOUN | 8 | 124 | 0.0645161 | 6.6846117 | 0.4312653 |
| negative/Web/f1/t_monaco_10 | garage/NOUN | 8 | 130 | 0.0615385 | 5.0751738 | 0.3123184 |
| negative/Web/f2/t_affinia_7 | elevator/NOUN | 8 | 179 | 0.0446927 | 2.8559703 | 0.1276411 |
| negative/Web/f2/t_ambassador_14 | hotel/NOUN | 8 | 36 | 0.2222222 | 0.2076394 | 0.0461421 |
| negative/Web/f2/t_talbott_11 | hotel/NOUN | 8 | 95 | 0.0842105 | 0.2076394 | 0.0174854 |
| negative/Web/f2/t_talbott_17 | service/NOUN | 8 | 252 | 0.0317460 | 0.9976364 | 0.0316710 |
| negative/Web/f3/t_hyatt_20 | night/NOUN | 8 | 158 | 0.0506329 | 1.1067705 | 0.0560390 |
| negative/Web/f4/t_homewood_19 | hotel/NOUN | 8 | 233 | 0.0343348 | 0.2076394 | 0.0071292 |
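As a sanity check on `bind_tf_idf()` itself, here is a minimal sketch on an invented two-document count table (one row per document/word pair, with a count column `n`); tidytext applies the same Tf and natural-log Idf defined above:

```r
library(dplyr)
library(tidytext)

# Invented counts: doc_a contains "hotel" twice and "bedbug" once; doc_b contains "hotel" once
toy.counts <- tibble(
    doc_id = c("doc_a", "doc_a", "doc_b"),
    word   = c("hotel", "bedbug", "hotel"),
    n      = c(2L, 1L, 1L)
)

toy.counts %>% bind_tf_idf(word, doc_id, n)
# "hotel" occurs in both documents, so idf = ln(2/2) = 0 and its tf_idf is 0;
# "bedbug" gets tf = 1/3 and idf = ln(2/1), so tf_idf = ln(2)/3 ≈ 0.231
```

Note that `bind_tf_idf()` works out each document's total word count internally, which is why the explicit `total` column above is only needed for display.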

As you can see, words like hotel, which are common and shared across many documents, will not be weighted highly for the machine learning algorithm. The tf_idf range for “hotel” is [0.00 - 0.05], whereas words like “bedbug”, “concierge”, or “stinky” have a range of [0.40 - 0.50]. Let's look at the highest-scoring tf°idf terms for deceptive and truthful reviews.

For deceptive reviews, these are the terms with high specificity

reviews.tfidf.df <- left_join(reviews.df, reviews.tfidf, by = c("doc_id", "word"))
deceptive.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness == "deceptive") %>%
    filter(pos != 'PROPN')
deceptive.tfidf <- distinct(deceptive.tfidf, word, .keep_all = TRUE)
deceptive.tfidf <- deceptive.tfidf[order(-deceptive.tfidf$tf_idf),]
kable(deceptive.tfidf[1:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| positive/MTurk/f4/d_swissotel_9 | deceptive | positive | beat/NOUN | beat | 11 | 0.0909091 | 7.377759 |
| positive/MTurk/f2/d_talbott_4 | deceptive | positive | ritzy/ADJ | ritzy | 24 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | brightly/ADV | brightly | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | garishness/NOUN | garishness | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | bedbug/ADV | bedbug | 13 | 0.0769231 | 7.377759 |
| positive/MTurk/f1/d_monaco_12 | deceptive | positive | tidy/ADJ | tidy | 10 | 0.1000000 | 5.431849 |
| positive/MTurk/f4/d_sheraton_14 | deceptive | positive | -bobby/INTJ | -bobby | 14 | 0.0714286 | 7.377759 |
| positive/MTurk/f2/d_talbott_10 | deceptive | positive | talbott/NOUN | talbott | 59 | 0.0677966 | 7.377759 |
| positive/MTurk/f5/d_amalfi_18 | deceptive | positive | complimentartry/NOUN | complimentartry | 15 | 0.0666667 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | above/ADP | above | 13 | 0.0769231 | 6.279147 |
| positive/MTurk/f1/d_james_9 | deceptive | positive | go/VERB | go | 11 | 0.0909091 | 5.298317 |
| negative/MTurk/f1/d_james_19 | deceptive | negative | deceive/VERB | deceive | 12 | 0.0833333 | 5.768321 |
| negative/MTurk/f1/d_sofitel_11 | deceptive | negative | garish/ADJ | garish | 14 | 0.0714286 | 6.684612 |
| negative/MTurk/f1/d_monaco_7 | deceptive | negative | fin/NOUN | fin | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f2/d_affinia_8 | deceptive | positive | nearby/ADP | nearby | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_sheraton_6 | deceptive | positive | coffeepot/NOUN | coffeepot | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_swissotel_20 | deceptive | positive | clientel/NOUN | clientel | 16 | 0.0625000 | 7.377759 |
| negative/MTurk/f5/d_amalfi_6 | deceptive | negative | planner/NOUN | planner | 33 | 0.0606061 | 7.377759 |
| negative/MTurk/f5/d_allegro_17 | deceptive | negative | prolem/NOUN | prolem | 17 | 0.0588235 | 7.377759 |

Deceptive writers use the word “deceive” in deceptive reviews.



For truthful reviews, these are the terms with high specificity

truthful.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness == "truthful") %>%
    filter(pos != 'PROPN')
truthful.tfidf <- distinct(truthful.tfidf, word, .keep_all = TRUE)
truthful.tfidf <- truthful.tfidf[order(-truthful.tfidf$tf_idf),]
kable(truthful.tfidf[2:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| negative/Web/f3/t_fairmont_9 | truthful | negative | sully/ADV | sully | 12 | 0.0833333 | 7.377759 |
| negative/Web/f1/t_james_14 | truthful | negative | yeah/INTJ | yeah | 29 | 0.1034483 | 5.180534 |
| negative/Web/f2/t_ambassador_13 | truthful | negative | carpeting/NOUN | carpeting | 24 | 0.0833333 | 5.991465 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | lousy/ADJ | lousy | 12 | 0.0833333 | 5.991465 |
| negative/Web/f1/t_sofitel_9 | truthful | negative | princess/NOUN | princess | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_omni_4 | truthful | negative | stinky/ADJ | stinky | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_hyatt_9 | truthful | negative | strike/NOUN | strike | 26 | 0.0769231 | 6.279147 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | boy/NOUN | boy | 12 | 0.0833333 | 5.768321 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | lime/NOUN | lime | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | punch/NOUN | punch | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | extradionary/ADJ | extradionary | 16 | 0.0625000 | 7.377759 |
| negative/Web/f1/t_sofitel_13 | truthful | negative | blanket/NOUN | blanket | 31 | 0.0967742 | 4.605170 |
| negative/Web/f3/t_omni_4 | truthful | negative | hurt/VERB | hurt | 15 | 0.0666667 | 6.684612 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | lexus/NOUN | lexus | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | smash/VERB | smash | 17 | 0.0588235 | 7.377759 |
| positive/TripAdvisor/f5/t_intercontinental_2 | truthful | positive | doller/NOUN | doller | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_james_5 | truthful | negative | spoon/NOUN | spoon | 124 | 0.0645161 | 6.684612 |
| positive/TripAdvisor/f1/t_hilton_5 | truthful | positive | place.amazing/NOUN | place.amazing | 18 | 0.0555556 | 7.377759 |

Truthful people don't say “smelly”; they say “stinky”.