Kaggle Data Science Competition Recap: Hackathon — SeeClickFix.com — Predicting Views, Votes, and Comments

Just finished my first ever “Hackathon”!  Not familiar with the concept of a Hackathon?  Neither was I before now.

A Hackathon is basically a very short and very intense session of code development focused on one project.  This particular hackathon was hosted by Kaggle as a data science competition, with the goal of developing a predictive model for the popular SeeClickFix.com website.  SeeClickFix.com helps city governments crowd-source their non-emergency (3-1-1) issues by providing a platform where citizens can post them.  Users can also view, vote, and comment on issues posted by others, which brings us to the target of this contest: create a model that predicts the number of views, votes, and comments an issue will receive.

Being a hackathon, the competition lasted only 25 hours, starting at 7:00pm (CST) on Friday and running until 8:00pm (CST) on Saturday.  So rather than weeks or even months to develop a robust model, contestants had only a matter of hours.  This sounded like a fun challenge, so I eagerly cleared my schedule, stocked up on snacks and caffeine, and got ready to code!  Twenty-five hours (and a few naps) later, the final leaderboard was released and I had landed at #15 out of 80 teams.

Here’s a recap:

Contest Overview

http://www.kaggle.com/c/the-seeclickfix-311-challenge

Goal: The objective of this competition was to create a model that predicts the number of views, votes, and comments an issue will receive, using supervised learning on historical records.  So rather than a single target variable as in past contests, this contest had three target variables.

Data set: The data consisted of CSV files containing 3-1-1 issues from four cities (Oakland, Richmond, New Haven, Chicago) covering the period from 1-1-2012 to 4-1-2013.  The attributes provided included information about each 3-1-1 issue such as the latitude and longitude of the issue, the source from which the issue was created (iPhone, Android, web, etc.), the text of the issue, and the creation date.

Error metric: RMSLE (Root Mean Squared Logarithmic Error).  This is a variant of the common RMSE metric in which a log transform is applied to the predicted and actual values before the square, mean, and root operations.  This reduces the impact that a few wildly off predictions have on the overall score, placing more emphasis on getting the bulk of predictions roughly right.  It is very useful for problems in which some unpredictably large misses are expected from the model, and that certainly applies to this contest: a handful of issues receive thousands of views and dozens of votes, while most receive only a fraction of that.
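For reference, here is a minimal sketch of the metric as it is commonly defined (with a +1 offset inside the log so that zero counts are handled):

    import numpy as np

    def rmsle(actual, predicted):
        """Root Mean Squared Logarithmic Error, with the usual +1 offset inside the log."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

    # A large miss on one very popular issue (1200 views predicted as 400) moves the
    # score far less than it would under plain RMSE.
    print(rmsle([2, 3, 1, 1200], [2, 3, 1, 400]))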

My Model and Workflow

Tools: I again went with a Python stack of tools: pandas and NumPy for loading/munging the data and scikit-learn for the machine learning aspects.

Loading and munging: There were some null and missing values to deal with.  I replaced some rare values with a “__MISSING__” flag and cleaned up the text attributes (converting everything to lowercase, removing odd characters, etc.).  But overall I skipped much of the more detailed munging process due to time constraints.
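As a rough sketch of what that clean-up looked like (column names like summary and source come from the contest data; the rarity threshold and regex here are just illustrative):

    import pandas as pd

    train = pd.read_csv("train.csv")

    # Replace nulls and very rare categories with an explicit missing flag
    train["source"] = train["source"].fillna("__MISSING__")
    counts = train["source"].value_counts()
    rare = counts[counts < 10].index
    train.loc[train["source"].isin(rare), "source"] = "__MISSING__"

    # Basic clean-up of the free-text fields: lowercase and strip odd characters
    train["summary"] = (train["summary"].fillna("")
                        .str.lower()
                        .str.replace(r"[^a-z0-9\s]", " ", regex=True))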

Features:

  • TF-IDF on the summary field only, using the word analyzer and a 1-2 n-gram range.  I also tried summary + description, but CV and leaderboard scores both dropped.
  • One-hot encoding of source and tag type
  • One-hot encoding of location (latitude + longitude)
  • One-hot encoding of time-of-day range (night, morning, afternoon, evening)
  • One-hot encoding of day of week and month of year
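A rough sketch of how features like the ones above can be wired together with pandas and scikit-learn; column names such as created_time, tag_type, latitude, and longitude reflect the contest data, but the specific parameters and bin sizes are assumptions, not my exact settings:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import OneHotEncoder
    from scipy.sparse import hstack

    # TF-IDF on the summary text: word analyzer, 1-2 n-grams
    tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=3)
    X_text = tfidf.fit_transform(train["summary"].fillna(""))

    # Derive the date-based categoricals from the creation timestamp
    created = pd.to_datetime(train["created_time"])
    train["day_of_week"] = created.dt.dayofweek.astype(str)
    train["month"] = created.dt.month.astype(str)
    train["time_of_day"] = pd.cut(created.dt.hour, bins=[0, 6, 12, 18, 24], right=False,
                                  labels=["night", "morning", "afternoon", "evening"]).astype(str)

    # Crude location feature: round lat/long and encode the pair as a category
    train["latlong"] = (train["latitude"].round(2).astype(str) + "_" +
                        train["longitude"].round(2).astype(str))

    cat_cols = ["source", "tag_type", "latlong", "time_of_day", "day_of_week", "month"]
    ohe = OneHotEncoder(handle_unknown="ignore")
    X_cat = ohe.fit_transform(train[cat_cols].fillna("__MISSING__").astype(str))

    # Final sparse feature matrix
    X = hstack([X_text, X_cat]).tocsr()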

Cross-Validation: I stuck to basic k-fold validation in the interest of time, but in retrospect this was NOT ideal for this problem.  There is a significant temporal element to the data: SeeClickFix.com is changing rapidly over time, and there is also seasonality (cold-weather issues vs. warm-weather issues, etc.), so the random sampling used in k-fold validation is NOT a good fit.  A much better method is temporal cross-validation, in which you split off the last section of the training data and use it as a hold-out set.  I’m going to try that CV method on the second version of this contest.
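A minimal sketch of what I mean by a temporal hold-out, continuing from the feature sketch above (the 80/20 split point is arbitrary):

    import numpy as np
    import pandas as pd

    # Sort rows by creation date and hold out the most recent ~20% for validation
    order = np.argsort(pd.to_datetime(train["created_time"]).values)
    cutoff = int(len(order) * 0.8)
    fit_rows, val_rows = order[:cutoff], order[cutoff:]

    X_fit, X_val = X[fit_rows], X[val_rows]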

Learning algorithms used: Due to time restrictions and the large number of samples, I stuck with linear models only, because they train much faster.  Ridge Regression and an optimized SGD (high iterations, low alpha) gave the best scores; however, I ended up going with Ridge for everything because it was faster to train and I was running out of time near the end.
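In scikit-learn terms, the setup amounted to roughly this, one linear model per target, continuing from the sketches above (the target column names and alpha values are assumptions, not my tuned settings):

    from sklearn.linear_model import Ridge, SGDRegressor

    models = {}
    for target in ["num_views", "num_votes", "num_comments"]:
        y_fit = train[target].values[fit_rows]
        model = Ridge(alpha=1.0)                            # what I shipped: fast to train
        # model = SGDRegressor(alpha=1e-5, max_iter=2000)   # the "high iterations, low alpha" SGD
        model.fit(X_fit, y_fit)
        models[target] = model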

It was on my to-do list to try a new neural network library I’ve been researching, SparseNN, which I think would have worked well on the non-text data, and then to combine it in an ensemble with the linear model on the text data.  But alas, there was no time, so I never got to it.

Conclusion and Lessons Learned

My final place was #15 overall with a score of 0.48026.  Overall I was pleased with how rapidly I was able to get a working and accurate model out into the field; it definitely shows that I have come a long way in becoming familiar with machine learning concepts and tools in just a few months’ time.

Other valuable lessons learned in this contest: how to deal with multi-label regression problems (previously I had only worked with single-label regression), how to work with text data (I applied n-grams using both TF-IDF and word vectors), and the trick of transforming target variables before training on them.  The last one is a trick that I did NOT use in this contest; however, it became clear on the forums after the contest ended that it was essential to landing in the top 10.  It involved taking the natural log of all target variables prior to training the model, then transforming the model output back to derive the final predictions.  This let the learning algorithms work on an ordinary least-squares problem without the impact of the log component of the RMSLE error metric.  Essentially it factored the “L” out of RMSLE and turned the problem into a standard RMSE one.  Very cool.
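For what it’s worth, the trick is only a few lines; sketched here with log1p/expm1 (the usual way to handle zero counts), continuing from the sketches above, and again, not something I actually ran during the contest:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Using the views target as an example
    y_fit = train["num_views"].values[fit_rows]

    model = Ridge(alpha=1.0)
    model.fit(X_fit, np.log1p(y_fit))          # train against log(1 + y)

    preds = np.expm1(model.predict(X_val))     # invert the transform for the final predictions
    preds = np.clip(preds, 0, None)            # view/vote/comment counts can't be negative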

