Kaggle Data Science Competition Recap: Yelp.com Prediction of Useful Votes for Reviews

Completed my first Kaggle data science competition!  It was a great experience, albeit a somewhat rushed and last-minute one.  I entered with only a week or so left in the contest and only had time to create one submission before the deadline hit, but it scored a competitive 0.53384 RMSE, which landed me 135th out of 350 teams.  Not bad considering that it was a last-minute submission and my first foray into machine learning.

Here’s a summary of my approach and workflow as I learn my way around machine learning:

Contest Overview

http://www.kaggle.com/c/yelp-recruiting

Goal: To predict the number of useful votes that a review posted to Yelp.com will receive, using a predictive model built with supervised machine learning.  This was a regression problem (numerical target) as opposed to classification.

Data set: ~227,000 reviews from the Phoenix, AZ area.  Included in the data set was information about the reviewer (average review stars, total useful votes received for all reviews given, etc.) as well as the business receiving the review (business name, average review stars, etc.) and information about the review itself (date/time created, review text, etc.).  Each set of data (business, user, review, checkin) came formatted as large JSON files.

Error metric: RMSE (Root Mean Squared Error).  This is a pretty standard and straightforward regression error metric, which was nice to have for a first competition: it kept the math behind everything fairly intuitive and required no special transformations to make it compatible with the learning algorithms.
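For anyone new to the metric, RMSE is just the square root of the average squared prediction error.  A quick NumPy version (nothing contest-specific here):

    import numpy as np

    def rmse(y_true, y_pred):
        # Root Mean Squared Error: sqrt(mean((prediction - actual)^2))
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    print(rmse([1, 2, 3], [1.5, 2.0, 2.5]))  # ~0.408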

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-yelp-useful-vote-prediction

Tools: My weapon of choice was the popular Python stack: PANDAS/NumPy for loading and munging the data, SKLearn for learning algorithms and cross-validation, and the statsmodels module for visualization.

Loading and munging: The data was given in JSON format, so I used the handy JSON import functionality in PANDAS to pull the JSON files directly into PANDAS data frames.  After loading, I did quite a bit of data munging and cleaning, including converting the Unicode date strings into usable date formats, flattening nested columns, and extracting zip codes from full business addresses.  I also removed quite a few unused columns from the dataframes and converted some data into smaller formats (e.g., int64 to int16) to reduce the memory footprint of the large dataframes.  Overall, Python is amazing for rapidly performing munging/cleaning tasks like this.
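As a minimal sketch of that loading step (the file and column names below are from the contest data as I recall them; treat them as illustrative rather than the exact code in my repo):

    import json
    import pandas as pd

    # Each Yelp file is line-delimited JSON: one object per line
    with open('yelp_training_set_review.json') as f:
        reviews = pd.DataFrame([json.loads(line) for line in f])

    # Flatten the nested 'votes' column (dicts like {'useful': 3, 'funny': 0})
    reviews = reviews.join(pd.DataFrame(list(reviews.pop('votes'))))

    # Convert date strings and shrink dtypes to cut the memory footprint
    reviews['date'] = pd.to_datetime(reviews['date'])
    reviews['stars'] = reviews['stars'].astype('int16')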

Feature Creation:  After loading/munging, I set about identifying and creating features that could supply signal to the predictive model.  There were quite a few potentially useful features to be derived from the data, among them:

  • Review stars (numeric) — The # of stars given to the review
  • Review Length (numeric) — Character count of review text
  • Total Check-ins (numeric) — Total check-ins for the business being reviewed
  • Business Category (categorical vector) — One-hot encoded vector of the business’ category (a quick encoding sketch follows this list)
  • Zip code (categorical vector) — One-hot encoded vector of the business’ zipcode
  • User average stars (numeric) — User’s average review stars
  • Business average stars (numeric) — Business’ average review stars
  • Business open (boolean) — Whether the business is open (1) or closed (0)
  • Review age (numeric) — Calculated age of the review based on the review date
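As a rough illustration of the one-hot encoding step, here is a sketch using SKLearn’s DictVectorizer on toy stand-in data (the real categories and zip codes came from the business file):

    import numpy as np
    from scipy import sparse
    from sklearn.feature_extraction import DictVectorizer

    # Toy stand-ins for the real business columns
    rows = [{'category': 'Restaurants', 'zip': '85004'},
            {'category': 'Bars',        'zip': '85016'},
            {'category': 'Restaurants', 'zip': '85016'}]
    numeric = np.array([[4.0, 120], [2.0, 560], [5.0, 80]])  # e.g. stars, length

    # One-hot encode the categoricals into a sparse matrix
    vec = DictVectorizer()
    cats = vec.fit_transform(rows)

    # Stack the categorical and numeric blocks into one wide sparse matrix
    X = sparse.hstack([cats, sparse.csr_matrix(numeric)]).tocsr()
    print(X.shape)  # (3, 6): 4 one-hot columns + 2 numeric columns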

Then there were the numerous features that could be derived from the review_text using NLP (natural language processing) methods:

  • Sentiment score (numeric) — Total sentiment score of the review (the positive or negative view/emotion it expresses), derived by looking up the average positive or negative sentiment of each word in a pre-calculated list.  The word list can be built from your own training text with some Python code, or you can use pre-calculated lists available online, for example the AFINN-111 list, which contains sentiment scores for roughly 2,500 English words and phrases (a bare-bones lookup sketch follows this list).
  • Other NLP methods using the review text, such as n-gram TF-IDF vectors
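A bare-bones version of that sentiment lookup, assuming you have the AFINN-111 file on hand (tab-separated term/score pairs):

    # Each AFINN-111 line is 'term<TAB>score'; some terms are multi-word phrases
    afinn = {}
    with open('AFINN-111.txt') as f:
        for line in f:
            term, score = line.strip().split('\t')
            afinn[term] = int(score)

    def sentiment_score(text):
        # Sum the sentiment of every known word in the review text
        return sum(afinn.get(word, 0) for word in text.lower().split())

    print(sentiment_score('the food was amazing but the service was awful'))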

Unfortunately I ran out of time to explore the heavier review-text features such as the TF-IDF n-grams, so beyond the simple sentiment score I focused on creating a quick feature matrix from the other features.

One thing of note: pre-processing the numerical features with standardization.  I ran SKLearn’s StandardScaler on the numerical features to normalize them prior to training.  This transforms all values to zero mean and unit variance (by subtracting the mean from every value and then dividing by the standard deviation).  It is an important pre-processing step for numerical features because it ensures that equal weight is given to all features regardless of their original range of values, and it reduces learning time during training, particularly for SVMs and SGD.  Other pre-processing options include simple re-scaling to [0,1] or [-1,1], or applying other mathematical transformations such as a log transform.
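The scaling step itself is only a couple of lines with SKLearn; the key detail is fitting the scaler on the training data and reusing it on the test data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[3.0, 450], [5.0, 1200], [1.0, 90]])  # e.g. stars, length

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # zero mean, unit variance per column

    # Apply the same (already-fitted) transform to the test features
    X_test_std = scaler.transform(np.array([[4.0, 300]]))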

Cross-Validation:  For this data set, I used standard randomized k-fold cross-validation.  For the number of folds, I used 20 for models with fast training times (Ridge, SGD) and 10 for learning algorithms with slow training times (Lasso, Logistic).  SKLearn has a great set of cross-validation functions that are easy to use, plus a nice set of error metrics that are easy to tap into for scoring the CV.
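Sketching the CV loop (SKLearn’s API has moved around since this contest, so this uses the current model_selection module and random stand-in data):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    X = np.random.rand(1000, 5)  # stand-in feature matrix
    y = np.random.rand(1000)     # stand-in target (useful votes)

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))  # fold RMSE
    print('CV RMSE: %.5f +/- %.5f' % (np.mean(scores), np.std(scores)))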

I combined all available features into a large sparse matrix and recorded RMSE CV scores for all the major regression learning algorithms compatible with large sparse-matrix problems like this.  Ensemble algorithms such as GBM, RF, and ExtraTrees were out of the question due to the large K (feature count, blown up by the one-hot vectors) and the large N (review samples).

CV scores showed that the best performance came from SGDRegressor, so I went with that.  After choosing the learning algorithm, I began trimming off features one by one to see the impact on the CV score.  Surprisingly, I found that removing features never decreased the RMSE and in fact led to small-to-large increases, so it seems all features carried some degree of signal.  The highest-signal features, based on coefficient weights, were review_length, user_average_stars, business_average_stars, and text_sentiment_score.
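The trimming was conceptually just a drop-one-feature-and-rescore loop, something like this (feature names and data here are placeholders, not my actual columns):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(500, 4)  # placeholder feature matrix
    y = np.random.rand(500)
    names = ['review_length', 'user_avg_stars', 'biz_avg_stars', 'sentiment']

    def cv_rmse(X, y):
        mse = -cross_val_score(SGDRegressor(max_iter=1000), X, y,
                               scoring='neg_mean_squared_error', cv=10)
        return np.sqrt(mse).mean()

    baseline = cv_rmse(X, y)
    for i, name in enumerate(names):
        keep = [j for j in range(X.shape[1]) if j != i]
        print('%s removed: RMSE %.5f (baseline %.5f)'
              % (name, cv_rmse(X[:, keep], y), baseline))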

Testing and submission:  Having identified the optimal learning algorithm (SGDRegressor) and feature set (all features), I was ready to train a model on the entire training set and then use that model to make test predictions.  After running some sanity checks on the predictions (mean, min, max, std. dev.), I discovered I had missed a post-processing step: converting all predictions < 0 to 0.  So I applied that and then exported the processed predictions to a CSV to create a contest submission.  All of this was fairly straightforward, and I created some helper functions to make these steps faster in future competitions.
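Roughly, that final train-and-submit step looked like this (a sketch with placeholder data and file names, not the exact code in my repo):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    # Placeholders; the real X_train/X_test were the sparse feature matrices
    X_train, y_train = np.random.rand(1000, 5), np.random.rand(1000)
    X_test = np.random.rand(200, 5)
    review_ids = ['r%d' % i for i in range(200)]  # placeholder review IDs

    model = SGDRegressor(max_iter=1000).fit(X_train, y_train)
    preds = model.predict(X_test)

    # Post-processing: a review can't receive negative useful votes
    preds = np.clip(preds, 0, None)

    pd.DataFrame({'id': review_ids, 'votes': preds}).to_csv(
        'submission.csv', index=False)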

Performance

I only submitted one last-minute entry, and its RMSE on the final leaderboard was 0.53384, which placed in the top 40% of contestants (135th of 350).  It also beat the benchmarks by quite a bit, so overall I’m happy with my first (and only) submission.

Conclusion and Lessons Learned

Overall this was a great first experience in practically applying some of the machine learning techniques I’ve learned through reading and class (Intro Data Science online course at Univ of Washington).  I love Python and SKLearn!  Everything is very efficient and intuitive, allowing for very rapid development of models.

I also now have a good code base and workflow to follow for predictive modeling problems like this.  I spent much of my time creating and identifying various Python helper functions, and I have now organized them into modules (munge, features, train, etc.) so that future contests should “flow” much more rapidly.

I also learned some important new steps in the machine learning workflow that my reading had not covered, such as pre- and post-processing and data cleaning.  It seems data science is all about the details.

Looking forward to my next Kaggle competition!
