</>

by Ashok Khosla, Mendocino County, California


Using A Neural Net



In Part 1 of this study project I explored the motivation and data of deceptive hotel reviews provided by researchers at Cornell University.

In Part 2, I attempted to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing. The machine learning algorithm we used was the Support Vector Model. Running the training algorithm several times generated accuracies ranging from 82% to 87%. Since the sample size was small, the random selection of training data influenced the training accuracy.

Will a change in technology help? Unlikely, but just for fun (mostly mine LOL) let’s see how a neural net behaves.

I’ll use the very popular neural net package called Keras, written by Francois Chollet, part of the Google DeepMind team. The following is adapted from adapted from A tutorial on text classification using Keras

First we’ll reload our tagged database of docs, words, and parts of speech.

reviews.df <- as.tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors=FALSE))

names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"
reviews.df <- reviews.df %>%
    filter(word != '#NAME?') %>%
    select(-entity)
tag_pos <- function(anIndex) { 
    return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep =''))}
reviews.df$word <- sapply(1:length(reviews.df$word),tag_pos)

# Make a dictionary
wordIndex.dict <- count(reviews.df, word, sort = TRUE)
wordIndex.dict$word_index <- 1:length(wordIndex.dict$word)
write.csv(wordIndex.dict, file="./files/word_index_dict.csv")

# And update reviews.df to have a word_index :-) with one line!
reviews.df <- left_join(reviews.df,wordIndex.dict,by="word")

document.term.df <- reviews.df %>%
    group_by(doc_id) %>%
    mutate(terms_list = c(word_index))
document_ids <- as.vector(distinct(reviews.df, doc_id)$doc_id)

getTermList <- function(aDocId) { 
    return(reviews.df[reviews.df$doc_id==aDocId,]$word_index) }
getOutputCode <- function(aDocID) {
    if (reviews.df[reviews.df$doc_id==aDocID,]$truthfulness[1] == 'deceptive')
        { return(0) }
    else { return(1) }
}

get_document_set <- function(someDocs) {
    document_x_data <- unname(sapply(someDocs, getTermList))
    document_y_data <- unname(sapply(someDocs, getOutputCode))
    return(list( x =document_x_data, y=document_y_data))
}

full_range <- 1:length(document_ids)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)

nn_container <- list(train = get_document_set(document_ids[training_range]),
                     test  = get_document_set(document_ids[test_range]))

Like any data science project much of the ingenuity is in trying to munge the data into the right form in the fewest lines.

c(train_data, train_labels) %<-% nn_container$train
c(test_data, test_labels) %<-% nn_container$test

### Dictionary of index to words
word_index <- list()
for (i in 1:length(wordIndex.dict$word)) {
    word_index[wordIndex.dict[i,]$word] = wordIndex.dict[i,]$word_index 
}
word_index_df <- data.frame(
    word = names(word_index),
    idx = unlist(word_index, use.names = FALSE),
    stringsAsFactors = FALSE
)

# The first indices are reserved  
word_index_df <- word_index_df %>% mutate(idx = idx + 3)
word_index_df <- word_index_df %>%
    add_row(word = "<PAD>", idx = 0)%>%
    add_row(word = "<START>", idx = 1)%>%
    add_row(word = "<UNK>", idx = 2)%>%
    add_row(word = "<UNUSED>", idx = 3)

word_index_df <- word_index_df %>% arrange(idx)
gNumWords_in_Dict <- length(word_index_df$word)

## Convert from numbers to words
decode_review <- function(text){
    paste(map(text, function(number) word_index_df %>%
                  filter(idx == number) %>%
                  select(word) %>% 
                  pull()),
          collapse = " ")
}
# decode_review(train_data[[1]])

### Prepare the data

# The reviews — the arrays of integers — must be converted to tensors before fed into the neural network. This conversion can be done a couple of ways:
# One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then, make this the first layer in our network — a dense layer — that can handle floating point vector data. This approach is memory intensive, though, requiring a num_words * num_reviews size matrix.
# Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape num_examples * max_length. We can use an embedding layer capable of handling this shape as the first layer in our network.
# In this tutorial, we will use the second approach.
# Since the reviews must be the same length, we will use the pad_sequences function to standardize the lengths:
train_data <- pad_sequences(
    train_data,
    value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
    padding = "post",
    maxlen = 256
)
test_data <- pad_sequences(
    test_data,
    value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
    padding = "post",
    maxlen = 256
)    

We build a neural net with an input layer of about 9000 words, coded by integer index, and an output layer that says the result is 0 or 1. In between we have a few hidden layers. Let’s run and test it.

### Build the model

# input shape is the vocabulary count used for the hotel reviews ( approximately 10,000 words)
model <- keras_model_sequential() 
model %>% 
    layer_embedding(input_dim = gNumWords_in_Dict, output_dim = 96) %>%
    layer_global_average_pooling_1d() %>%
    layer_dense(units = 96, activation = "relu") %>% 
    layer_dense(units = 96, activation = "relu") %>% 
    layer_dense(units = 96, activation = "relu") %>% 
    layer_dense(units = 1, activation = "sigmoid")

model %>% summary()
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## embedding_1 (Embedding)          (None, None, 96)              839424      
## ___________________________________________________________________________
## global_average_pooling1d_1 (Glob (None, 96)                    0           
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 96)                    9312        
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 96)                    9312        
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 96)                    9312        
## ___________________________________________________________________________
## dense_4 (Dense)                  (None, 1)                     97          
## ===========================================================================
## Total params: 867,457
## Trainable params: 867,457
## Non-trainable params: 0
## ___________________________________________________________________________
# Loss function and optimizer
model %>% compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = list('accuracy')
)

# Loss function and optimizer
x_val <- train_data[1:150, ]
partial_x_train <- train_data[150:nrow(train_data), ]

y_val <- train_labels[1:150]
partial_y_train <- train_labels[150:length(train_labels)]

# Train the model
history <- model %>% fit(
    partial_x_train,
    partial_y_train,
    epochs = 20,
    batch_size = 50,
    validation_data = list(x_val, y_val),
    verbose=1
)


# EVALUATE the model
results <- model %>% evaluate(test_data, test_labels)
Percent_Accuracy <- format(100*results$acc,digits=2)
results
## $loss
## [1] 0.5000462
## 
## $acc
## [1] 0.8875
plot(history)

RESULTS

Sigh. A different suit of clothes didn’t help. The neural net has an accuracy of 89% - on average a few percent better (but with a much narrower standard deviation), than the Support Vector Machine. This implies… that the neural net may be less sensitive to the sampling error introduced by the small sample size.