
by Ashok Khosla, Mendocino County, California


Using Machine Learning techniques


In Part 1 of this study project, I explored the motivation for studying deceptive hotel reviews and the dataset provided by researchers at Cornell University.

In Part 2, I will attempt to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing.

rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together

# Text Mining Packages
library(tidytext)
library(tokenizers)
library(e1071)
library(tm)
library(RTextTools)

# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr)       # for dynamic reporting
library(kableExtra)  # create a nicely formatted HTML table
library(formattable) # for the color_tile function

publication_theme <- function() {
    theme_economist() +
    theme(text=element_text(family="Rockwell"),
          plot.title = element_text(family="Rockwell", size=12)
    )
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'


Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)

Restore the previously tagged dataset

First, let's restore the dataset we tagged previously. Here's a quick refresher:

# Read the POS-tagged reviews produced in Part 1
reviews.df <- as.tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors = FALSE)) 

# Give the columns friendlier names
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"

# Drop spurious '#NAME?' tokens and the unused entity column
reviews.df <- reviews.df %>%
    filter(word != '#NAME?') %>%
    select(-entity)

# Combine each word's lemma and part-of-speech tag into a single token, e.g. "stay/VERB"
tag_pos <- function(anIndex) { 
    return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep =''))}
reviews.df$word <- sapply(seq_along(reviews.df$word), tag_pos)

kable(reviews.df[1:15,], format = "markdown")
| X | doc_id | truthfulness | polarity | word | lemma | pos |
|---|--------|--------------|----------|------|-------|-----|
| 1 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | stay/VERB | stay | VERB |
| 2 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | schicago/PROPN | schicago | PROPN |
| 3 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 4 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | day/NOUN | day | NOUN |
| 5 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | night/NOUN | night | NOUN |
| 6 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | conference/NOUN | conference | NOUN |
| 7 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | easy/ADJ | easy | ADJ |
| 8 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | amenity/NOUN | amenity | NOUN |
| 9 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | cleanliness/NOUN | cleanliness | NOUN |
| 10 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | experience/NOUN | experience | NOUN |
| 11 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 12 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | awful/ADJ | awful | ADJ |
| 13 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | take/VERB | take | VERB |
| 14 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | time/NOUN | time | NOUN |
| 15 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | write/VERB | write | VERB |

Term Frequency, Inverse Document Frequency, and Tf°Idf

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf).
Apologies: there's an occasional Google Chrome bug with LaTeX equation rendering - if the equations below don't display properly, view them in Safari.

  • (Tf): TERM FREQUENCY - How probable is it that this particular word appears in a SINGLE document? \[ Tf(\omega) = \left( \frac{ \#\ occurrences\ of\ \omega\ in\ document}{\#\ words\ in\ document} \right) \]
  • (Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

    Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is the natural log of the number of docs divided by the number of docs in which this term appears. \[ Idf(\omega) = ln \left( \frac{ \#\ docs}{\#\ docs.containing(\omega) } \right) \]

  • (Tf°Idf): TERM SPECIFICITY - Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the Tf°Idf score (weight), the rarer the term, and vice versa. The intuition for this measure is this (a small worked example follows this list):

    If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it's probably not a unique identifier, so we should assign it a lower score.
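
To make the arithmetic concrete, here is a minimal sketch in R for a single hypothetical term (“pool”) in a toy two-document corpus; the counts are invented purely for illustration:

# Hypothetical counts: "pool" occurs 2 times in an 8-word review,
# and appears in 1 of the 2 documents in the corpus.
tf_pool    <- 2 / 8               # Tf(pool)  = 0.25
idf_pool   <- log(2 / 1)          # Idf(pool) = ln(2), about 0.693 (natural log)
tfidf_pool <- tf_pool * idf_pool  # Tf°Idf is about 0.173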

Tf°Idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google's search engine, use Tf°Idf. Introduced as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations remained troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it.

Here's a peek at our dataset's Tf°Idf values:

# Count how often each word appears in each document
reviews.tfidf <- reviews.df  %>%
  count(doc_id, word, sort = TRUE)

# Total number of words in each document
total_words.tfidf <- reviews.tfidf %>% 
    group_by(doc_id) %>% 
    summarize(total = sum(n))

# Attach the totals and let tidytext compute tf, idf, and tf_idf
reviews.tfidf <- left_join(reviews.tfidf, total_words.tfidf)
reviews.tfidf <- reviews.tfidf %>%
    bind_tf_idf(word, doc_id, n)
kable(reviews.tfidf[1:15,], format = "markdown")
| doc_id | word | n | total | tf | idf | tf_idf |
|--------|------|---|-------|----|-----|--------|
| negative/MTurk/f4/d_homewood_15 | hotel/NOUN | 11 | 49 | 0.2244898 | 0.2076394 | 0.0466129 |
| negative/MTurk/f5/d_intercontinental_17 | hotel/NOUN | 10 | 75 | 0.1333333 | 0.2076394 | 0.0276852 |
| negative/Web/f4/t_homewood_19 | credit/NOUN | 10 | 233 | 0.0429185 | 3.8513984 | 0.1652961 |
| negative/Web/f4/t_knickerbocker_17 | hotel/NOUN | 10 | 62 | 0.1612903 | 0.2076394 | 0.0334902 |
| positive/MTurk/f3/d_omni_5 | omni/PROPN | 9 | 157 | 0.0573248 | 3.2033716 | 0.1836328 |
| negative/MTurk/f2/d_affinia_6 | pillow/NOUN | 8 | 112 | 0.0714286 | 2.9234116 | 0.2088151 |
| negative/MTurk/f4/d_swissotel_12 | hotel/NOUN | 8 | 73 | 0.1095890 | 0.2076394 | 0.0227550 |
| negative/Web/f1/t_james_5 | spoon/NOUN | 8 | 124 | 0.0645161 | 6.6846117 | 0.4312653 |
| negative/Web/f1/t_monaco_10 | garage/NOUN | 8 | 130 | 0.0615385 | 5.0751738 | 0.3123184 |
| negative/Web/f2/t_affinia_7 | elevator/NOUN | 8 | 179 | 0.0446927 | 2.8559703 | 0.1276411 |
| negative/Web/f2/t_ambassador_14 | hotel/NOUN | 8 | 36 | 0.2222222 | 0.2076394 | 0.0461421 |
| negative/Web/f2/t_talbott_11 | hotel/NOUN | 8 | 95 | 0.0842105 | 0.2076394 | 0.0174854 |
| negative/Web/f2/t_talbott_17 | service/NOUN | 8 | 252 | 0.0317460 | 0.9976364 | 0.0316710 |
| negative/Web/f3/t_hyatt_20 | night/NOUN | 8 | 158 | 0.0506329 | 1.1067705 | 0.0560390 |
| negative/Web/f4/t_homewood_19 | hotel/NOUN | 8 | 233 | 0.0343348 | 0.2076394 | 0.0071292 |

As you can see, words like "hotel" that are common, and common to a lot of documents, will not be weighted highly for the machine learning algorithm. The tf_idf range for "hotel" is roughly [0.00 - 0.05], whereas words like "bedbug", "concierge", or "stinky" fall in a range of roughly [0.40 - 0.50]. Let's look at the top-ranked tf°idf terms for deceptive and truthful reviews - right after a quick check of those ranges.
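
A minimal sketch of that check, using dplyr from the already-loaded tidyverse (the quoted ranges above are rounded; running this returns the exact values for our dataset):

# Spread of tf_idf weights for the very common token "hotel/NOUN"
reviews.tfidf %>%
    filter(word == "hotel/NOUN") %>%
    summarize(min_tf_idf = min(tf_idf), max_tf_idf = max(tf_idf))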

For Deceptive Reviews, these are terms with high specificity

reviews.tfidf.df <- left_join(reviews.df, reviews.tfidf)

# Deceptive reviews only, proper nouns dropped, one row per distinct word,
# sorted by descending tf_idf
deceptive.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness=="deceptive") %>%
    filter(pos!='PROPN')
deceptive.tfidf <- distinct(deceptive.tfidf, word, .keep_all = TRUE)
deceptive.tfidf <- deceptive.tfidf[order(-deceptive.tfidf$tf_idf),] 
kable(deceptive.tfidf[1:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| positive/MTurk/f4/d_swissotel_9 | deceptive | positive | beat/NOUN | beat | 11 | 0.0909091 | 7.377759 |
| positive/MTurk/f2/d_talbott_4 | deceptive | positive | ritzy/ADJ | ritzy | 24 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | brightly/ADV | brightly | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | garishness/NOUN | garishness | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | bedbug/ADV | bedbug | 13 | 0.0769231 | 7.377759 |
| positive/MTurk/f1/d_monaco_12 | deceptive | positive | tidy/ADJ | tidy | 10 | 0.1000000 | 5.431849 |
| positive/MTurk/f4/d_sheraton_14 | deceptive | positive | -bobby/INTJ | -bobby | 14 | 0.0714286 | 7.377759 |
| positive/MTurk/f2/d_talbott_10 | deceptive | positive | talbott/NOUN | talbott | 59 | 0.0677966 | 7.377759 |
| positive/MTurk/f5/d_amalfi_18 | deceptive | positive | complimentartry/NOUN | complimentartry | 15 | 0.0666667 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | above/ADP | above | 13 | 0.0769231 | 6.279147 |
| positive/MTurk/f1/d_james_9 | deceptive | positive | go/VERB | go | 11 | 0.0909091 | 5.298317 |
| negative/MTurk/f1/d_james_19 | deceptive | negative | deceive/VERB | deceive | 12 | 0.0833333 | 5.768321 |
| negative/MTurk/f1/d_sofitel_11 | deceptive | negative | garish/ADJ | garish | 14 | 0.0714286 | 6.684612 |
| negative/MTurk/f1/d_monaco_7 | deceptive | negative | fin/NOUN | fin | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f2/d_affinia_8 | deceptive | positive | nearby/ADP | nearby | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_sheraton_6 | deceptive | positive | coffeepot/NOUN | coffeepot | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_swissotel_20 | deceptive | positive | clientel/NOUN | clientel | 16 | 0.0625000 | 7.377759 |
| negative/MTurk/f5/d_amalfi_6 | deceptive | negative | planner/NOUN | planner | 33 | 0.0606061 | 7.377759 |
| negative/MTurk/f5/d_allegro_17 | deceptive | negative | prolem/NOUN | prolem | 17 | 0.0588235 | 7.377759 |

Deceptive writers use the word "deceive" in deceptive reviews.



For Truthful Reviews, these are terms with high specificity

# Same ranking, this time for truthful reviews
truthful.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness=="truthful") %>%
    filter(pos!='PROPN') 
truthful.tfidf <- distinct(truthful.tfidf, word, .keep_all = TRUE)
truthful.tfidf <- truthful.tfidf[order(-truthful.tfidf$tf_idf),] 
kable(truthful.tfidf[2:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| negative/Web/f3/t_fairmont_9 | truthful | negative | sully/ADV | sully | 12 | 0.0833333 | 7.377759 |
| negative/Web/f1/t_james_14 | truthful | negative | yeah/INTJ | yeah | 29 | 0.1034483 | 5.180534 |
| negative/Web/f2/t_ambassador_13 | truthful | negative | carpeting/NOUN | carpeting | 24 | 0.0833333 | 5.991465 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | lousy/ADJ | lousy | 12 | 0.0833333 | 5.991465 |
| negative/Web/f1/t_sofitel_9 | truthful | negative | princess/NOUN | princess | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_omni_4 | truthful | negative | stinky/ADJ | stinky | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_hyatt_9 | truthful | negative | strike/NOUN | strike | 26 | 0.0769231 | 6.279147 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | boy/NOUN | boy | 12 | 0.0833333 | 5.768321 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | lime/NOUN | lime | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | punch/NOUN | punch | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | extradionary/ADJ | extradionary | 16 | 0.0625000 | 7.377759 |
| negative/Web/f1/t_sofitel_13 | truthful | negative | blanket/NOUN | blanket | 31 | 0.0967742 | 4.605170 |
| negative/Web/f3/t_omni_4 | truthful | negative | hurt/VERB | hurt | 15 | 0.0666667 | 6.684612 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | lexus/NOUN | lexus | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | smash/VERB | smash | 17 | 0.0588235 | 7.377759 |
| positive/TripAdvisor/f5/t_intercontinental_2 | truthful | positive | doller/NOUN | doller | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_james_5 | truthful | negative | spoon/NOUN | spoon | 124 | 0.0645161 | 6.684612 |
| positive/TripAdvisor/f1/t_hilton_5 | truthful | positive | place.amazing/NOUN | place.amazing | 18 | 0.0555556 | 7.377759 |

Truthful people don't say "smelly", they say "stinky".



Alright! Most of the heavy database lifting is done. Let's start teaching the machine.

In a classic machine learning process we randomly divide the data into a training set and a testing set - here 80% of the 1600 files are randomly selected for training, and the other 20% are used to test the trained machine's accuracy. Taking 80% of an already small 1600-review dataset leaves an even smaller training set of 1280 files. You don't go to war with the army you wished for, you go to war with the army you have. Poor justification, I know LOL - it's stinky.

Naive SVM:

Naive Setup

This is my first pass at training. Hopefully we will do better than average human performance, which is no better than random guessing (about 50% accuracy). Let's start with a naive approach, using the filtered, lemmatized words that we've processed so far, each labeled with its part of speech.

First let's build a document/text database consisting of the existing filtered text. Then we build a document-term matrix, where each row is a document in the training set, each column is a term, and each cell holds that term's weight in that document (plain term frequency here; Tf°Idf weighting is possible but, as noted below, it hurt accuracy).

# input_data_X is the input training data
# output_data_Y is the output prediction (i.e. the classification result)

# Re-assemble each document's filtered tokens into a single review string
gather_review_string <- function(aDocID) { paste(reviews.tfidf.df$word[reviews.tfidf.df$doc_id == aDocID], collapse = " ") }
reviews.docs <-  distinct(reviews.tfidf.df, doc_id, .keep_all = TRUE) 
reviews.docs$review <- sapply(reviews.docs$doc_id, gather_review_string)
reviews.docs <-  select(reviews.docs, doc_id, truthfulness, review)
write_csv(reviews.docs, './files/filtered_reviews.csv')

# Randomly split the documents 80/20 into training and test sets
full_range <- 1:length(reviews.docs$doc_id)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)

# Build the document-term matrix for the training documents
# (Weighting by Tf°Idf drops accuracy by about 10 percentage points!)
doc_term_matrix <- create_matrix(reviews.docs[training_range,]$review)#, weighting=weightTfIdf)
input_data_X <- doc_term_matrix 
output_data_Y <- reviews.docs[training_range,]$truthfulness

Naive SVM Training:

We then create the SVM model and ask it to train itself (time for a cup of coffee).

# We can run prediction on the model to determine execution time

# Set up an SVM model
# Configure the training data
container <- create_container(input_data_X, output_data_Y, trainSize=1:length(training_range), virgin=FALSE)
 
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear", cost=1)

Naive SVM Performance:

Once the model is trained, we pull out the test set of data and use the model to predict the truth - the machine playing the role of King Solomon's sword. Typically the first step in analyzing and improving performance is to produce a "confusion matrix", which tells us how well (or poorly) we are doing on a 2x2 binary classification problem.

## If you error out:
# trace("create_matrix", edit=T)
# Edit it: line 42 has a misspelling of the word "acronym". 
# Change the "A" to an "a" and hit "Save" - it should work fine after that.
predictionData <- reviews.docs[test_range,]$review
# create a prediction document term matrix 
predMatrix <- create_matrix(predictionData, originalMatrix=doc_term_matrix) 

# create the corresponding container
predSize = length(predictionData);
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE) 

# predict
results <- classify_model(predictionContainer, model)
predicted_results <- as.character(results$SVM_LABEL)
actual_results <- reviews.docs[test_range,]$truthfulness
confusion.df <- as.tibble(data.frame(predicted_results, actual_results, stringsAsFactors = FALSE))
Prediction.Incorrect <- 100* sum(confusion.df$predicted_results != confusion.df$actual_results)/length(test_range)
Prediction.Correct <- 100* sum(confusion.df$predicted_results == confusion.df$actual_results)/length(test_range)

classification.truth_truth <- sum((confusion.df$predicted_results == 'truthful') &
                                  (confusion.df$actual_results == 'truthful'))
classification.truth_false <- sum((confusion.df$predicted_results == 'truthful') &
                                  (confusion.df$actual_results == 'deceptive'))
classification.false_truth <- sum((confusion.df$predicted_results == 'deceptive') &
                                  (confusion.df$actual_results == 'truthful'))
classification.false_false <- sum((confusion.df$predicted_results == 'deceptive') &
                                  (confusion.df$actual_results == 'deceptive'))
Actual.type <- c('Deceptive', 'Truthful')
Prediction.Deceptive <- c(classification.false_false, classification.false_truth)
Prediction.Truthful <- c(classification.truth_false, classification.truth_truth)
conf.mat <- data.frame(Actual.type, Prediction.Deceptive, Prediction.Truthful)

#print(paste("Accuracy is", Prediction.Correct, "%"))
kable(conf.mat, format = "markdown")
| Actual.type | Prediction.Deceptive | Prediction.Truthful |
|-------------|----------------------|---------------------|
| Deceptive   | 128                  | 21                  |
| Truthful    | 28                   | 143                 |

Ok. Well, 85% accuracy won't get me a job on Wall St. But this is a naive bag-of-words/parts-of-speech approach, and with some additional feature engineering I know I can get to 95%.
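
For the record, that headline number (and per-class precision and recall, if you want them) can be read straight off the confusion matrix above. A quick sketch using the counts from this particular run - they will change with each random split:

# Accuracy, plus precision and recall for the "deceptive" class,
# computed from the confusion matrix above
correct  <- 128 + 143                     # correctly classified test reviews
total    <- 128 + 21 + 28 + 143           # all 320 test reviews
accuracy <- correct / total               # about 0.847, i.e. roughly 85%

precision_deceptive <- 128 / (128 + 28)   # of reviews predicted deceptive, fraction truly deceptive (about 0.82)
recall_deceptive    <- 128 / (128 + 21)   # of actually deceptive reviews, fraction caught (about 0.86)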

Oh, by the way - even the naive machine did MUCH BETTER than human intelligence.

Let's improve and tune the model. King Solomon's sword needs to be sharper. The authors of the original paper noted that bigrams worked better than unigrams. Also, what about a different type of learning model? Although our training set is small, as is our dictionary size, I'd like to try a neural net to see what happens with overtrained models…
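
As a pointer toward that next iteration, here is a hedged sketch (not run in this post) of how bigram features could be built with the already-loaded tidytext package, starting from the reviews.docs tibble assembled above:

# Tokenize each assembled review into bigrams, count them per document,
# and weight them with tf-idf, just as we did for the unigrams
reviews.bigrams <- reviews.docs %>%
    unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
    count(doc_id, bigram, sort = TRUE) %>%
    bind_tf_idf(bigram, doc_id, n)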

Next Step: Try an alternate approach

What if we use a neural net instead?

Let’s try that:

Part 3: Using a Neural Net with the Keras package