
by Ashok Khosla, Mendocino County, California


Using Machine Learning techniques


In Part 1 of this study project, I explored the motivation for studying deceptive hotel reviews and the dataset provided by researchers at Cornell University.

In Part 2, I will attempt to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing.

rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together

# Text Mining Packages
library(tidytext)
library(tokenizers)
library(e1071)
library(tm)
library(RTextTools)

# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr)       # for dynamic reporting
library(kableExtra)  # create a nicely formatted HTML table
library(formattable) # for the color_tile function

publication_theme <- function() {
    theme_economist() +
    theme(text=element_text(family="Rockwell"),
          plot.title = element_text(family="Rockwell", size=12)
    )
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'


Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)

Restore the previously tagged dataset

First, let's restore the dataset we tagged previously. Here's a quick refresher:

# Read the POS-tagged reviews produced in Part 1
reviews.df <- as.tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors = FALSE)) 

# Give the columns friendlier names
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"

# Drop spurious '#NAME?' tokens and the unused entity column
reviews.df <- reviews.df %>%
    filter(word != '#NAME?') %>%
    select(-entity)

# Combine each word's lemma and part-of-speech tag into a single token, e.g. "stay/VERB"
tag_pos <- function(anIndex) { 
    return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep =''))}
reviews.df$word <- sapply(seq_along(reviews.df$word), tag_pos)

kable(reviews.df[1:15,], format = "markdown")
| X | doc_id | truthfulness | polarity | word | lemma | pos |
|---|--------|--------------|----------|------|-------|-----|
| 1 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | stay/VERB | stay | VERB |
| 2 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | schicago/PROPN | schicago | PROPN |
| 3 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 4 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | day/NOUN | day | NOUN |
| 5 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | night/NOUN | night | NOUN |
| 6 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | conference/NOUN | conference | NOUN |
| 7 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | easy/ADJ | easy | ADJ |
| 8 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | amenity/NOUN | amenity | NOUN |
| 9 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | cleanliness/NOUN | cleanliness | NOUN |
| 10 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | experience/NOUN | experience | NOUN |
| 11 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
| 12 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | awful/ADJ | awful | ADJ |
| 13 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | take/VERB | take | VERB |
| 14 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | time/NOUN | time | NOUN |
| 15 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | write/VERB | write | VERB |

Term Frequency, Inverse Document Frequency, and Tf°Idf

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf).
Apologies: there's an occasional Google Chrome bug with LaTeX equation rendering - if the equations below don't display properly, view them in Safari.

  • (Tf): TERM FREQUENCY - How probable is it that this particular word appears in a SINGLE document? \[ Tf(\omega) = \left( \frac{ \#\ occurrences\ of\ \omega\ in\ document}{\#\ words\ in\ document} \right) \]
  • (Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

    Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is the natural log of the number of docs divided by the number of docs in which this term appears. \[ Idf(\omega) = ln \left( \frac{ \#\ docs}{\#\ docs.containing(\omega) } \right) \]

  • (Tf°Idf): TERM SPECIFICITY - Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the Tf°Idf score (weight), the rarer the term, and vice versa. The intuition for this measure is this (a small worked example follows this list):

    If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it's probably not a unique identifier, so we should assign it a lower score.
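
To make the arithmetic concrete, here is a minimal sketch in R for a single hypothetical term (“pool”) in a toy two-document corpus; the counts are invented purely for illustration:

# Hypothetical counts: "pool" occurs 2 times in an 8-word review,
# and appears in 1 of the 2 documents in the corpus.
tf_pool    <- 2 / 8               # Tf(pool)  = 0.25
idf_pool   <- log(2 / 1)          # Idf(pool) = ln(2), about 0.693 (natural log)
tfidf_pool <- tf_pool * idf_pool  # Tf°Idf is about 0.173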

Tf°Idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google's search engine, use Tf°Idf. Introduced as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations remained troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it.

Here's a peek at our dataset's Tf°Idf values:

# Count how often each word appears in each document
reviews.tfidf <- reviews.df  %>%
  count(doc_id, word, sort = TRUE)

# Total number of words in each document
total_words.tfidf <- reviews.tfidf %>% 
    group_by(doc_id) %>% 
    summarize(total = sum(n))

# Attach the totals and let tidytext compute tf, idf, and tf_idf
reviews.tfidf <- left_join(reviews.tfidf, total_words.tfidf)
reviews.tfidf <- reviews.tfidf %>%
    bind_tf_idf(word, doc_id, n)
kable(reviews.tfidf[1:15,], format = "markdown")
| doc_id | word | n | total | tf | idf | tf_idf |
|--------|------|---|-------|----|-----|--------|
| negative/MTurk/f4/d_homewood_15 | hotel/NOUN | 11 | 49 | 0.2244898 | 0.2076394 | 0.0466129 |
| negative/MTurk/f5/d_intercontinental_17 | hotel/NOUN | 10 | 75 | 0.1333333 | 0.2076394 | 0.0276852 |
| negative/Web/f4/t_homewood_19 | credit/NOUN | 10 | 233 | 0.0429185 | 3.8513984 | 0.1652961 |
| negative/Web/f4/t_knickerbocker_17 | hotel/NOUN | 10 | 62 | 0.1612903 | 0.2076394 | 0.0334902 |
| positive/MTurk/f3/d_omni_5 | omni/PROPN | 9 | 157 | 0.0573248 | 3.2033716 | 0.1836328 |
| negative/MTurk/f2/d_affinia_6 | pillow/NOUN | 8 | 112 | 0.0714286 | 2.9234116 | 0.2088151 |
| negative/MTurk/f4/d_swissotel_12 | hotel/NOUN | 8 | 73 | 0.1095890 | 0.2076394 | 0.0227550 |
| negative/Web/f1/t_james_5 | spoon/NOUN | 8 | 124 | 0.0645161 | 6.6846117 | 0.4312653 |
| negative/Web/f1/t_monaco_10 | garage/NOUN | 8 | 130 | 0.0615385 | 5.0751738 | 0.3123184 |
| negative/Web/f2/t_affinia_7 | elevator/NOUN | 8 | 179 | 0.0446927 | 2.8559703 | 0.1276411 |
| negative/Web/f2/t_ambassador_14 | hotel/NOUN | 8 | 36 | 0.2222222 | 0.2076394 | 0.0461421 |
| negative/Web/f2/t_talbott_11 | hotel/NOUN | 8 | 95 | 0.0842105 | 0.2076394 | 0.0174854 |
| negative/Web/f2/t_talbott_17 | service/NOUN | 8 | 252 | 0.0317460 | 0.9976364 | 0.0316710 |
| negative/Web/f3/t_hyatt_20 | night/NOUN | 8 | 158 | 0.0506329 | 1.1067705 | 0.0560390 |
| negative/Web/f4/t_homewood_19 | hotel/NOUN | 8 | 233 | 0.0343348 | 0.2076394 | 0.0071292 |

As you can see, words like "hotel" that are common, and common to a lot of documents, will not be weighted highly for the machine learning algorithm. The tf_idf range for "hotel" is roughly [0.00 - 0.05], whereas words like "bedbug", "concierge", or "stinky" fall in a range of roughly [0.40 - 0.50]. Let's look at the top-ranked tf°idf terms for deceptive and truthful reviews - right after a quick check of those ranges.
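
A minimal sketch of that check, using dplyr from the already-loaded tidyverse (the quoted ranges above are rounded; running this returns the exact values for our dataset):

# Spread of tf_idf weights for the very common token "hotel/NOUN"
reviews.tfidf %>%
    filter(word == "hotel/NOUN") %>%
    summarize(min_tf_idf = min(tf_idf), max_tf_idf = max(tf_idf))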

For Deceptive Reviews, these are terms with high specificity

reviews.tfidf.df <- left_join(reviews.df, reviews.tfidf)

# Deceptive reviews only, proper nouns dropped, one row per distinct word,
# sorted by descending tf_idf
deceptive.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness=="deceptive") %>%
    filter(pos!='PROPN')
deceptive.tfidf <- distinct(deceptive.tfidf, word, .keep_all = TRUE)
deceptive.tfidf <- deceptive.tfidf[order(-deceptive.tfidf$tf_idf),] 
kable(deceptive.tfidf[1:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| positive/MTurk/f4/d_swissotel_9 | deceptive | positive | beat/NOUN | beat | 11 | 0.0909091 | 7.377759 |
| positive/MTurk/f2/d_talbott_4 | deceptive | positive | ritzy/ADJ | ritzy | 24 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | brightly/ADV | brightly | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f3/d_conrad_8 | deceptive | positive | garishness/NOUN | garishness | 12 | 0.0833333 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | bedbug/ADV | bedbug | 13 | 0.0769231 | 7.377759 |
| positive/MTurk/f1/d_monaco_12 | deceptive | positive | tidy/ADJ | tidy | 10 | 0.1000000 | 5.431849 |
| positive/MTurk/f4/d_sheraton_14 | deceptive | positive | -bobby/INTJ | -bobby | 14 | 0.0714286 | 7.377759 |
| positive/MTurk/f2/d_talbott_10 | deceptive | positive | talbott/NOUN | talbott | 59 | 0.0677966 | 7.377759 |
| positive/MTurk/f5/d_amalfi_18 | deceptive | positive | complimentartry/NOUN | complimentartry | 15 | 0.0666667 | 7.377759 |
| positive/MTurk/f2/d_talbott_7 | deceptive | positive | above/ADP | above | 13 | 0.0769231 | 6.279147 |
| positive/MTurk/f1/d_james_9 | deceptive | positive | go/VERB | go | 11 | 0.0909091 | 5.298317 |
| negative/MTurk/f1/d_james_19 | deceptive | negative | deceive/VERB | deceive | 12 | 0.0833333 | 5.768321 |
| negative/MTurk/f1/d_sofitel_11 | deceptive | negative | garish/ADJ | garish | 14 | 0.0714286 | 6.684612 |
| negative/MTurk/f1/d_monaco_7 | deceptive | negative | fin/NOUN | fin | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f2/d_affinia_8 | deceptive | positive | nearby/ADP | nearby | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_sheraton_6 | deceptive | positive | coffeepot/NOUN | coffeepot | 16 | 0.0625000 | 7.377759 |
| positive/MTurk/f4/d_swissotel_20 | deceptive | positive | clientel/NOUN | clientel | 16 | 0.0625000 | 7.377759 |
| negative/MTurk/f5/d_amalfi_6 | deceptive | negative | planner/NOUN | planner | 33 | 0.0606061 | 7.377759 |
| negative/MTurk/f5/d_allegro_17 | deceptive | negative | prolem/NOUN | prolem | 17 | 0.0588235 | 7.377759 |

Deceptive writers use the word "deceive" in deceptive reviews.



For Truthful Reviews, these are terms with high specificity

# Same ranking, this time for truthful reviews
truthful.tfidf <- reviews.tfidf.df %>%
    filter(truthfulness=="truthful") %>%
    filter(pos!='PROPN') 
truthful.tfidf <- distinct(truthful.tfidf, word, .keep_all = TRUE)
truthful.tfidf <- truthful.tfidf[order(-truthful.tfidf$tf_idf),] 
kable(truthful.tfidf[2:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
| doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
|--------|--------------|----------|------|-------|-------|----|-----|
| negative/Web/f3/t_fairmont_9 | truthful | negative | sully/ADV | sully | 12 | 0.0833333 | 7.377759 |
| negative/Web/f1/t_james_14 | truthful | negative | yeah/INTJ | yeah | 29 | 0.1034483 | 5.180534 |
| negative/Web/f2/t_ambassador_13 | truthful | negative | carpeting/NOUN | carpeting | 24 | 0.0833333 | 5.991465 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | lousy/ADJ | lousy | 12 | 0.0833333 | 5.991465 |
| negative/Web/f1/t_sofitel_9 | truthful | negative | princess/NOUN | princess | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_omni_4 | truthful | negative | stinky/ADJ | stinky | 15 | 0.0666667 | 7.377759 |
| negative/Web/f3/t_hyatt_9 | truthful | negative | strike/NOUN | strike | 26 | 0.0769231 | 6.279147 |
| negative/Web/f3/t_fairmont_9 | truthful | negative | boy/NOUN | boy | 12 | 0.0833333 | 5.768321 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | lime/NOUN | lime | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | punch/NOUN | punch | 16 | 0.0625000 | 7.377759 |
| positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | extradionary/ADJ | extradionary | 16 | 0.0625000 | 7.377759 |
| negative/Web/f1/t_sofitel_13 | truthful | negative | blanket/NOUN | blanket | 31 | 0.0967742 | 4.605170 |
| negative/Web/f3/t_omni_4 | truthful | negative | hurt/VERB | hurt | 15 | 0.0666667 | 6.684612 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | lexus/NOUN | lexus | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_sofitel_10 | truthful | negative | smash/VERB | smash | 17 | 0.0588235 | 7.377759 |
| positive/TripAdvisor/f5/t_intercontinental_2 | truthful | positive | doller/NOUN | doller | 17 | 0.0588235 | 7.377759 |
| negative/Web/f1/t_james_5 | truthful | negative | spoon/NOUN | spoon | 124 | 0.0645161 | 6.684612 |
| positive/TripAdvisor/f1/t_hilton_5 | truthful | positive | place.amazing/NOUN | place.amazing | 18 | 0.0555556 | 7.377759 |

Truthful people don't say "smelly", they say "stinky".



Alright! Most of the heavy database lifting is done. Let's start teaching the machine.

In a classic machine learning process we randomly divide the data into a training set and a testing set - here 80% of the 1600 files are randomly selected for training, and the other 20% are used to test the trained machine's accuracy. Taking 80% of an already small 1600-review dataset leaves an even smaller training set of 1280 files. You don't go to war with the army you wished for, you go to war with the army you have. Poor justification, I know LOL - it's stinky.

Naive SVM:

Naive Setup

This is my first pass at training. Hopefully we will do better than average human performance, which is no better than random guessing (about 50% accuracy). Let's start with a naive approach, using the filtered, lemmatized words that we've processed so far, each labeled with its part of speech.

First let's build a document/text database consisting of the existing filtered text. Then we build a document-term matrix, where each row is a document in the training set, each column is a term, and each cell holds that term's weight in that document (plain term frequency here; Tf°Idf weighting is possible but, as noted below, it hurt accuracy).

# input_data_X is the input training data
# output_data_Y is the output prediction (i.e. the classification result)

# Re-assemble each document's filtered tokens into a single review string
gather_review_string <- function(aDocID) { paste(reviews.tfidf.df$word[reviews.tfidf.df$doc_id == aDocID], collapse = " ") }
reviews.docs <-  distinct(reviews.tfidf.df, doc_id, .keep_all = TRUE) 
reviews.docs$review <- sapply(reviews.docs$doc_id, gather_review_string)
reviews.docs <-  select(reviews.docs, doc_id, truthfulness, review)
write_csv(reviews.docs, './files/filtered_reviews.csv')

# Randomly split the documents 80/20 into training and test sets
full_range <- 1:length(reviews.docs$doc_id)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)

# Build the document-term matrix for the training documents
# (Weighting by Tf°Idf drops accuracy by about 10 percentage points!)
doc_term_matrix <- create_matrix(reviews.docs[training_range,]$review)#, weighting=weightTfIdf)
input_data_X <- doc_term_matrix 
output_data_Y <- reviews.docs[training_range,]$truthfulness

Naive SVM Training:

We then create the SVM model and ask it to train itself (time for a cup of coffee).

# We can run prediction on the model to determine execution time

# Set up an SVM model
# Configure the training data
container <- create_container(input_data_X, output_data_Y, trainSize=1:length(training_range), virgin=FALSE)
 
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear", cost=1)

Naive SVM Performance:

Once the model is trained, we pull out the test set of data and use the model to predict the truth - the machine playing the role of King Solomon's sword. Typically the first step in analyzing and improving performance is to produce a "confusion matrix", which tells us how well (or poorly) we are doing on a 2x2 binary classification problem.

## If you error out:
# trace("create_matrix", edit=T)
# Edit it: line 42 has a misspelling of the word "acronym". 
# Change the "A" to an "a" and hit "Save" - it should work fine after that.
predictionData <- reviews.docs[test_range,]$review
# create a prediction document term matrix 
predMatrix <- create_matrix(predictionData, originalMatrix=doc_term_matrix) 

# create the corresponding container
predSize = length(predictionData);
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE) 

# predict
results <- classify_model(predictionContainer, model)
predicted_results <- as.character(results$SVM_LABEL)
actual_results <- reviews.docs[test_range,]$truthfulness
confusion.df <- as.tibble(data.frame(predicted_results, actual_results, stringsAsFactors = FALSE))
Prediction.Incorrect <- 100* sum(confusion.df$predicted_results != confusion.df$actual_results)/length(test_range)
Prediction.Correct <- 100* sum(confusion.df$predicted_results == confusion.df$actual_results)/length(test_range)

classification.truth_truth <- sum((confusion.df$predicted_results == 'truthful') &
                                  (confusion.df$actual_results == 'truthful'))
classification.truth_false <- sum((confusion.df$predicted_results == 'truthful') &
                                  (confusion.df$actual_results == 'deceptive'))
classification.false_truth <- sum((confusion.df$predicted_results == 'deceptive') &
                                  (confusion.df$actual_results == 'truthful'))
classification.false_false <- sum((confusion.df$predicted_results == 'deceptive') &
                                  (confusion.df$actual_results == 'deceptive'))
Actual.type <- c('Deceptive', 'Truthful')
Prediction.Deceptive <- c(classification.false_false, classification.false_truth)
Prediction.Truthful <- c(classification.truth_false, classification.truth_truth)
conf.mat <- data.frame(Actual.type, Prediction.Deceptive, Prediction.Truthful)

#print(paste("Accuracy is", Prediction.Correct, "%"))
kable(conf.mat, format = "markdown")
| Actual.type | Prediction.Deceptive | Prediction.Truthful |
|-------------|----------------------|---------------------|
| Deceptive   | 128                  | 21                  |
| Truthful    | 28                   | 143                 |

Ok. Well, 85% accuracy won't get me a job on Wall St. But this is a naive bag-of-words/parts-of-speech approach, and with some additional feature engineering I know I can get to 95%.
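
For the record, that headline number (and per-class precision and recall, if you want them) can be read straight off the confusion matrix above. A quick sketch using the counts from this particular run - they will change with each random split:

# Accuracy, plus precision and recall for the "deceptive" class,
# computed from the confusion matrix above
correct  <- 128 + 143                     # correctly classified test reviews
total    <- 128 + 21 + 28 + 143           # all 320 test reviews
accuracy <- correct / total               # about 0.847, i.e. roughly 85%

precision_deceptive <- 128 / (128 + 28)   # of reviews predicted deceptive, fraction truly deceptive (about 0.82)
recall_deceptive    <- 128 / (128 + 21)   # of actually deceptive reviews, fraction caught (about 0.86)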

Oh, by the way - even the naive machine did MUCH BETTER than human intelligence.

Let's improve and tune the model. King Solomon's sword needs to be sharper. The authors of the original paper noted that bigrams worked better than unigrams. Also, what about a different type of learning model? Although our training set is small, as is our dictionary size, I'd like to try a neural net to see what happens with overtrained models…
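
As a pointer toward that next iteration, here is a hedged sketch (not run in this post) of how bigram features could be built with the already-loaded tidytext package, starting from the reviews.docs tibble assembled above:

# Tokenize each assembled review into bigrams, count them per document,
# and weight them with tf-idf, just as we did for the unigrams
reviews.bigrams <- reviews.docs %>%
    unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
    count(doc_id, bigram, sort = TRUE) %>%
    bind_tf_idf(bigram, doc_id, n)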

Next Step: Try an alternate approach

What if we use a neural net instead?

Let’s try that:

Part 3: Using a Neural Net with the Keras package