Kaggle Data Science Competition Recap: RecSys2013 – Yelp Business Rating Prediction

I just completed another data science competition, and this one was really a doozy!  It was part of the ACM conference on recommender systems (RecSys2013) and was sponsored by Yelp.com, so the dataset came in a format very similar to the one used for the previous Yelp.com contest I competed in 2 months ago (Recap: Yelp.com Prediction of Useful Votes for Reviews).  In this competition, however, the goal was to predict the star rating of each review (review stars) rather than its useful votes.  Same dataset format, different goal, and of course a different set of samples: there was no overlap between the reviews in this dataset and those in the previous one.

This contest progressed very well.  I entered early, finally had a significant amount of time to dedicate to developing the model, and had some experience under my belt from previous competitions, so I really felt good about everything.  The result was quite satisfying: my best model scored an RMSE of 1.23107, earning me a 7th place finish!  Finally, a coveted Kaggle Top 10 badge :)

Here’s a recap:

Contest Overview


Goal: The objective of this competition was to build a predictive model to accurately predict the review stars which a reviewer would assign to a business.  A very interesting contest indeed!

Like the other Yelp contest, this one involved a mix of cold start, warm start, and full start recommender problems.  This environment was created by including some review records with no review history data for either the user or the business (warm start), and others with no review history data for both the user AND the business (cold start).  The most challenging problem, of course, was handling the cold start case using only the non-review history attributes of the user and business.

Data set: The data consisted of 229,907 training reviews and 22,956 test reviews from the Phoenix, AZ metropolitan area.  Included in the data set was information about the reviewer (average review stars, total useful votes received for all reviews given, etc.) as well as the business receiving the review (business name, average review stars, etc.) and information about the review itself (date/time created, review text, etc.).  As mentioned, some businesses and users had their average review stars removed to simulate warm and cold start problems.

Each set of data (business, user, review, checkin) came formatted as large JSON files.

Error metric:   RMSE (Root Mean Squared Error).  This is a common and straightforward regression error metric, which was nice to have: it made the math behind everything fairly intuitive and did not require any special transformations to be compatible with the learning algorithms.  With RMSE, lower is better, and 0 indicates a perfect model.
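For reference, here is a minimal sketch of the metric in plain Python.  The star ratings in it are made up for illustration and are not contest data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical review stars and predictions, for illustration only.
actual_stars = [5, 3, 4, 1]
predicted_stars = [4, 3, 4, 3]
score = rmse(actual_stars, predicted_stars)
```

Because the errors are squared before averaging, a single prediction that is off by 2 stars costs as much as four predictions that are each off by 1 star.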

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-yelp-business-rating-prediction

Tools: I used Python/PANDAS/NumPy for loading and munging the data, SKLearn for models and CV, and Excel for tracking and organizing the different models and submissions.

Loading and munging: The data was given in JSON format so I used the JSON import function found within PANDAS to directly import the JSON files into PANDAS data frames.  Once in dataframe format, it was easy to use the Python console to join together and analyze the data across various dimensions.
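As a rough illustration of this step, the sketch below loads and joins the tables using only the Python standard library rather than PANDAS, so that it runs anywhere.  It assumes each file holds one JSON object per line, and the field names (user_id, business_id, stars, average_stars) and the helper names are assumptions made for the example, not taken from the contest files.

```python
import io
import json

def load_json_lines(handle):
    """Parse a file-like object containing one JSON object per line."""
    return [json.loads(line) for line in handle if line.strip()]

def join_reviews(reviews, users, businesses):
    """Attach user and business averages to each review (None when missing)."""
    users_by_id = {u["user_id"]: u for u in users}
    businesses_by_id = {b["business_id"]: b for b in businesses}
    joined = []
    for review in reviews:
        user = users_by_id.get(review["user_id"], {})
        business = businesses_by_id.get(review["business_id"], {})
        joined.append({
            **review,
            "usr_avg": user.get("average_stars"),
            "bus_avg": business.get("stars"),
        })
    return joined

# Tiny in-memory stand-ins for the real files (hypothetical records).
reviews_file = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 4}\n'
    '{"user_id": "u2", "business_id": "b1", "stars": 5}\n'
)
users_file = io.StringIO('{"user_id": "u1", "average_stars": 3.9}\n')
businesses_file = io.StringIO('{"business_id": "b1", "stars": 4.5}\n')

joined = join_reviews(
    load_json_lines(reviews_file),
    load_json_lines(users_file),
    load_json_lines(businesses_file),
)
```

In this toy data, the second review has no matching user record, so its usr_avg comes back as None, which is the kind of record the warm start models have to handle.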

After loading, I did quite a bit of data munging and cleaning, including converting the unicode dates into usable date formats, flattening nested columns, and extracting zip codes from full business addresses.  I also performed quite a bit of handling of null and missing values.  Lastly, there were some nuances to the data that had to be teased out.  For example, there were 1107 records in the test set which were missing a user average (no user_id found in the training set user table), but did contain matching user_ids in the training set’s review table.  In other words, while we did not have an explicit user average for these users, we could calculate one from their reviews found in the training set.  This being a sample mean (based only on sample records in this set vs. a true historical average), it was obviously a weaker signal than a true user average, so I had to weight it less in my models.  It did, however, still significantly improve my RMSE over having no user average at all for those records.
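The sample-mean idea can be sketched as follows.  This is only an illustration with made-up records and assumed field names (user_id, stars).  It returns the review count alongside each mean so that a later model can weight small samples less, which is one possible way to do the down-weighting, not necessarily the one I used.

```python
from collections import defaultdict

def sample_user_averages(train_reviews, known_user_ids):
    """Mean training-review stars for users absent from the user table.

    Returns {user_id: (sample_mean, n_reviews)}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # user_id -> [star sum, review count]
    for review in train_reviews:
        user_id = review["user_id"]
        if user_id in known_user_ids:
            continue  # an explicit user average already exists
        totals[user_id][0] += review["stars"]
        totals[user_id][1] += 1
    return {uid: (total / n, n) for uid, (total, n) in totals.items()}

# Hypothetical training reviews, for illustration only.
train_reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u2", "stars": 4},
    {"user_id": "u2", "stars": 2},
    {"user_id": "u3", "stars": 1},
]
known_user_ids = {"u1"}
fallback_averages = sample_user_averages(train_reviews, known_user_ids)
```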

During the course of munging and analyzing the data, I realized that there was a distinct difference in the weight that could be given to review history for a new user or business with only a few reviews versus a user or business with many reviews, so a different model would be ideal for each group.  A great way to handle this problem is to parse the data into subsets (segments), thereby breaking the problem into smaller and more defined problems.  A separate model can be trained for each subset, and the combined predictions of all models can then be used as a submission.  I found out later that this technique is known in machine learning as a segmentation ensemble.

I broke the data down into 15 data subsets in total for which I used slightly different models depending on the features of the data.  The subsets were split according to whether any user or business review history existed and if so how many reviews existed.  For example, for records with both user and business history I broke the training and testing data into subsets of:

  • Usr review count >= 20, bus review count >= 20
  • Usr review count >= 13, bus review count >= 8 (and usr review count < 20 and bus review count < 20)
  • Usr review count >= 8, bus review count >= 5 (and usr review count < 13 and bus review count < 8)
  • Usr review count >= 0, bus review count >= 0 (and usr review count < 8 and bus review count < 5)

These cutoffs were derived by manually testing various cutoff points and then performing cross validation.
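The four tiers listed above can be sketched as a simple top-to-bottom lookup.  One assumption to flag: the sketch treats the tiers as "first match wins", so a record that misses a higher tier on either count falls through to the next one.  The labels and function name are made up for the example.

```python
# (min user reviews, min business reviews, label), checked top to bottom.
TIERS = [
    (20, 20, "usr>=20, bus>=20"),
    (13, 8, "usr>=13, bus>=8"),
    (8, 5, "usr>=8, bus>=5"),
    (0, 0, "usr>=0, bus>=0"),
]

def assign_segment(usr_count, bus_count):
    """Return the label of the first tier whose minimums are both met."""
    for min_usr, min_bus, label in TIERS:
        if usr_count >= min_usr and bus_count >= min_bus:
            return label
    raise ValueError("review counts must be non-negative")

# Hypothetical (user review count, business review count) pairs.
records = [(25, 30), (15, 9), (8, 5), (3, 2), (25, 3)]
segments = {}
for usr_count, bus_count in records:
    label = assign_segment(usr_count, bus_count)
    segments.setdefault(label, []).append((usr_count, bus_count))
```

Each entry in segments would then get its own model, and the per-segment predictions are stitched back together for the submission.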

This segmentation allowed me to derive more accurate coefficients/weights for the features in each subset of data.  For example, business review history appeared to have a stronger signal than user review history as review counts became lower and lower.  This makes sense intuitively: a new Yelp user who has only submitted a handful of reviews has really revealed no distinct pattern yet, whereas a business with 5-10 reviews of 4 and 5 stars has already begun to show a pattern of excellence.

Feature Creation:  Some of the features I created and used in addition to the basics such as user review history, business review history, and zipcodes included:

  • Business name averages (derived by finding businesses with the same name and calculating the average)
  • Text bias derived from words found in the business name (if a matching bus_name was not found)
  • Grouped category averages (finding businesses with exact same group of categories and calculating the average)
  • Mixed category averages (breaking all categories apart and calculating the averages for each, then averaging those categories together if the test business contains more than 1)
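To make the difference between the two category features concrete, here is a small sketch.  The business records, the "categories" and "stars" field names, and the choice to average business-level stars are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def grouped_category_averages(businesses):
    """Average stars per exact category set (order-insensitive)."""
    groups = defaultdict(list)
    for business in businesses:
        groups[frozenset(business["categories"])].append(business["stars"])
    return {group: mean(stars) for group, stars in groups.items()}

def single_category_averages(businesses):
    """Average stars per individual category."""
    per_category = defaultdict(list)
    for business in businesses:
        for category in business["categories"]:
            per_category[category].append(business["stars"])
    return {category: mean(stars) for category, stars in per_category.items()}

def mixed_category_average(categories, category_averages):
    """Average the per-category averages for one business's categories."""
    known = [category_averages[c] for c in categories if c in category_averages]
    return mean(known) if known else None

# Hypothetical businesses, for illustration only.
businesses = [
    {"categories": ["Pizza", "Italian"], "stars": 4.0},
    {"categories": ["Italian", "Pizza"], "stars": 3.0},
    {"categories": ["Pizza"], "stars": 5.0},
    {"categories": ["Bars"], "stars": 2.0},
]
grouped = grouped_category_averages(businesses)
per_category = single_category_averages(businesses)
mixed = mixed_category_average(["Pizza", "Bars"], per_category)
```

The grouped average only helps when another business shares the exact same set of categories, while the mixed average still produces a value for a combination (here Pizza plus Bars) that never appears together in the training data.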

The strongest signals came from bus_name averages, then grouped category averages, then mixed category averages.  So I used bus_name averages if there were sufficient matching businesses for comparison (>3), then used grouped category averages if there were sufficient matching categories for comparison (>3), then defaulted to mixed category averages if that was all that was available.  It’s for this reason that I had so many different models to train.
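The fallback order described above can be sketched as a small helper.  The function and argument names are hypothetical; only the ordering and the ">3 matches" rule come from the description.

```python
MIN_MATCHES = 3  # a source is used only when it has more than 3 matches

def pick_business_prior(name_avg, name_matches,
                        group_avg, group_matches, mixed_avg):
    """Return (value, source) using the strongest sufficiently supported signal."""
    if name_avg is not None and name_matches > MIN_MATCHES:
        return name_avg, "bus_name"
    if group_avg is not None and group_matches > MIN_MATCHES:
        return group_avg, "grouped_category"
    return mixed_avg, "mixed_category"

# Hypothetical inputs: the name matches are too few, so the grouped average wins.
prior, source = pick_business_prior(
    name_avg=4.2, name_matches=2,
    group_avg=3.8, group_matches=12,
    mixed_avg=3.6,
)
```

Recording which source was used for each record is what splits the data into the additional subsets, since each source gets its own model.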

The bus_name text analysis gave some of the most surprising finds.  For example, when I ran it and began looking at the results, the highest positive bias word for business names in the entire training set was... (drum roll please)...  “Yelp”!  So I looked into it, and sure enough there are many events that Yelp holds for its elite members, and each event is reviewed just like a business.  And of course, intuitively, what type of reviews are elite Yelp members going to give to a Yelp event?  Reviews that are certain to be read by the Yelp admins?  Glowing 5-star reviews!  So, for all records in the test set whose business name contained the word “Yelp”, I overrode the review prediction with a simple 5, and sure enough my RMSE improved by .00087 just from that simple change.
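A word-bias table of this kind can be sketched as below.  It is a simplified illustration on made-up data: bias is taken here as the mean stars of reviews whose business name contains the word, minus the global mean, and words seen fewer than min_count times are dropped.  Both of those choices, and the substring test in the override, are assumptions for the example.

```python
from collections import defaultdict

def word_bias(records, min_count=2):
    """Map each business-name word to (mean stars with word) - (global mean)."""
    global_mean = sum(stars for _, stars in records) / len(records)
    stars_by_word = defaultdict(list)
    for name, stars in records:
        for word in set(name.lower().split()):  # count each word once per record
            stars_by_word[word].append(stars)
    return {
        word: sum(values) / len(values) - global_mean
        for word, values in stars_by_word.items()
        if len(values) >= min_count
    }

def apply_yelp_override(business_name, prediction):
    """Force a 5-star prediction when the business name mentions Yelp."""
    return 5.0 if "yelp" in business_name.lower() else prediction

# Hypothetical (business name, review stars) pairs.
records = [
    ("Yelp Spring Party", 5),
    ("Yelp Elite Event", 5),
    ("Joe's Auto Repair", 2),
    ("Auto Glass Shop", 2),
]
bias = word_bias(records)
most_positive = max(bias, key=bias.get)
```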

Other words were not so extremely biased, but I did take some of the more heavily negative and positive bias word combinations (“ice cream”, “chiropractor”, etc.) and used them to weight the predictions for reviews where a comparable business name and comparable grouped categories were missing.  It would have been very interesting to see if there is a temporal effect on the word bias.  For example, in the winter, are businesses with “ice cream” in their name still receiving such glowing reviews?  When the Diamondbacks perform better next season, do businesses with “Diamondbacks” in their name begin receiving a high positive bias?  Sadly, as has already been discussed much in the forums, temporal data was not available for this contest.

I used a few other features with marginal success, such as business review count and total check-ins.  These seemed to have very weak signals, but did improve my score marginally when added into the weaker models (usr avg only, bus avg only, etc.).  One important thing to note was that they were only effective once I cleaned the training data of outliers, meaning businesses that had extremely high check-ins or bus review counts.

Cross-Validation:  For this data set, the standard randomized k-fold cross validation was ideal.  There was not a temporal element to the data and the data was evenly distributed, so this seemed the most effective choice.  I mostly used 20 folds for linear models with fast training time and 10 for models with slower training times.
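For readers new to the idea, here is a standard-library sketch of randomized k-fold cross validation around a one-feature least-squares line.  It is a stand-in for illustration only: my actual runs used SKLearn, and the data, fold count, and function names below are made up.

```python
import math
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # constant feature: predict the mean
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def cross_val_rmse(xs, ys, k, seed=0):
    """Mean held-out RMSE across k randomized folds."""
    fold_scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in fold]
        fold_scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_scores) / len(fold_scores)

# Hypothetical business averages (feature) and review stars (target).
bus_avgs = [3.0, 3.5, 4.0, 4.5, 5.0, 2.5, 3.0, 4.0, 3.5, 4.5]
stars = [3, 4, 4, 5, 5, 2, 3, 4, 3, 5]
cv_score = cross_val_rmse(bus_avgs, stars, k=5)
```

More folds mean each model trains on more of the data but also mean more models to fit, which is why a larger fold count is more practical for fast linear models than for slower ones.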

Learning algorithms used:  Nearly all of my models used SKLearn’s basic linear regression model, as I found other models did not perform any better on the leaderboard (although they improved my CV scores considerably...).  A few of my models that didn’t perform well with linear regression were actually just built in Excel, where I used simple factorization with weighting up to a certain threshold.  For example, in the UsrAvg+BusAvg group with review counts of <5 BusCount and <8 UsrCount, I simply used the formula =A+(B/10)*(C-A)+(D/20)*(E-F), where A is the category average for the business (the starting point), B is the business review count, C is the business average, D is the user review count, E is the user average, and F is the global mean (3.766723066).  The thresholds (10 for bus and 20 for usr) were developed through experimentation based on leaderboard feedback.  I tried linear regression on this group with low review counts for usr or bus, but it was outperformed by the simple formula above.  I used a similar basic factorization model for a few other small subsets that didn’t perform well when run through linear regression (for example, in the usr avg only group, when there was no similar business name to be found).
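Here is the Excel formula above translated into a small Python function, with a worked example on made-up inputs.  The min(..., 1.0) caps are my reading of "weighting up to a certain threshold" and are an assumption.  They never trigger inside this particular group, since the counts there are below 5 and 8.

```python
GLOBAL_MEAN = 3.766723066  # F in the formula

def blended_prediction(cat_avg, bus_count, bus_avg, usr_count, usr_avg,
                       bus_threshold=10, usr_threshold=20,
                       global_mean=GLOBAL_MEAN):
    """A + (B/10)*(C-A) + (D/20)*(E-F), with weights capped at 1 (assumed)."""
    bus_weight = min(bus_count / bus_threshold, 1.0)
    usr_weight = min(usr_count / usr_threshold, 1.0)
    return (cat_avg
            + bus_weight * (bus_avg - cat_avg)
            + usr_weight * (usr_avg - global_mean))

# Hypothetical record: category average 3.5, business with 4 reviews
# averaging 4.5, user with 6 reviews averaging 4.0.
prediction = blended_prediction(
    cat_avg=3.5, bus_count=4, bus_avg=4.5, usr_count=6, usr_avg=4.0,
)
# 3.5 + 0.4 * (4.5 - 3.5) + 0.3 * (4.0 - 3.766723066)
```

The prediction starts at the category average, moves 40% of the way toward the business's own average, and then shifts by 30% of how far the user sits above the global mean.  With zero reviews on both sides it falls back to the category average alone.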

Conclusion and Lessons Learned

This competition was by far my best learning experience so far.  I felt like I really spent the time needed to fully analyze the data set and develop a robust model.  I learned MUCH about model development and feature creation, particularly the value of segmentation ensembles and the need for attention to detail on real-world data sets like this one.

In contrast to some other contests, this was not a “canned” set of data which had been fully sanitized and cleaned prior to being given to competitors.  Instead, it was a raw dump of Yelp’s real-world review data.  I think the amount of time I spent finding patterns and details in the data paid off in the end and gave me favorable positioning over many of the other competitors.

Looking forward to the next one!
