WOOHOO! Excitement, relief, and exhaustion. That’s perhaps the best way to summarize my latest data science competition experience. After quite a bit of hard work, I placed 2nd in the SeeClickFix.com extended competition, a continuation of the 24-hour Hackathon competition I participated in a few months ago. The prize win for our team was $1,000, which is a nice payoff for the hard work!
This was also different from past competitions in that this time around I formed a team with another Kaggler, Miroslaw Horbal. At the start, I entered the contest solo, applying much of the same code that I used in the Hackathon competition, and everything went great. I immediately entered the top 10 and stayed there, slowly working my way into the top position midway through the contest. But with only a few weeks left, others were gaining and I decided to team up with Miroslaw, who was at that time in command of 4th place. Teaming up with him was a great experience and not only helped us achieve 2nd place (I doubt I could have done it on my own), but also helped me learn a great deal about collaborating on a data science project.
We were invited by Kaggle to write up our methodology and approach to the competition; you can find the article here: http://blog.kaggle.com/2014/01/07/qa-with-bryan-miroslaw-2nd-place-in-the-see-click-predict-fix-competition
We also wrote up a detailed description of our code: http://bryangregory.com/Kaggle/DocumentationforSeeClickFix.pdf
I also briefly touched on my approach in the contest forums: http://www.kaggle.com/c/see-click-predict-fix/forums/t/6464/congratulations-to-the-winners
Contest Overview
http://www.kaggle.com/c/see-click-predict-fix
Goal: The objective of this competition was to create a model that predicts the number of views, votes, and comments an issue will receive, using supervised learning on historical records. So rather than a single target variable as in past contests, this contest had three target variables.
Data set: The training data consisted of CSV files containing 3-1-1 issues from four cities (Oakland, Richmond, New Haven, Chicago) covering the period from 1-1-2012 to 4-30-2013, and a test set covering 5-1-2013 to 9-17-2013. The data attributes included information about each 3-1-1 issue such as the latitude and longitude of the issue, the source from which the issue was created (iPhone, Android, web, etc.), the text of the issue, and the creation date.
Error metric: RMSLE (Root Mean Squared Logarithmic Error). This is a variant of the common RMSE metric in which a log transform is applied to the predictions and actuals before the square, mean, and root operations. This reduces the impact that large-magnitude misses have on the overall score, placing more emphasis on getting the majority of low-count predictions right. It is very useful for problems in which some unpredictable large misses are expected from the model, and that certainly applies to this contest, where a handful of issues receive thousands of views and dozens of votes while most receive only a fraction of that.
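For concreteness, here is a minimal sketch of the metric in NumPy (just an illustration, not the competition's official scoring code):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error.

    log1p(x) = log(1 + x), so zero-valued targets (views/comments) stay defined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A miss of 1000 views on a viral issue contributes far less here than it
# would under plain RMSE, shifting emphasis to the many small counts.
```

A common consequence of this metric is that models are often fit on log1p-transformed targets, so that minimizing ordinary squared error lines up with the evaluation.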
My Model and Workflow
Tools: I piggybacked on my existing code developed for the Hackathon, which was built on the scikit-learn/pandas/NumPy stack.
Loading and munging: As with the Hackathon contest, there were some null and missing values to deal with. I did some replacement of rare values with a “__MISSING__” flag, and I performed some cleaning up of the text attributes (converting all to lowercase, removing odd characters, etc.).
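A rough sketch of what that cleanup step can look like in pandas; the file name and column names ('source', 'tag_type', 'summary', 'description') are assumptions for illustration, not our exact code:

```python
import re
import pandas as pd

def clean_text(text):
    """Lowercase a text field and strip odd characters."""
    text = str(text).lower()
    return re.sub(r"[^a-z0-9 ]+", " ", text)

train = pd.read_csv("train.csv")  # hypothetical file name

# Replace nulls in the categorical fields with an explicit missing flag
for col in ["source", "tag_type"]:        # assumed column names
    train[col] = train[col].fillna("__MISSING__")

# Basic cleanup of the free-text fields
for col in ["summary", "description"]:    # assumed column names
    train[col] = train[col].fillna("").map(clean_text)
```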
I also spent some time getting to know the data (something I didn’t have the luxury of doing in the Hackathon contest), and learned some interesting things. One very valuable business rule learned from running max/min/mean/mode analysis on subsets of the data was that all issues default with a vote of 1 (assuming auto vote by the issue creator), which meant that a lower bound of 1 needed to be included in the model when predicting votes, whereas both views and comments had a lower bound of 0.
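That business rule is easy to enforce as a final clipping step on the predictions; a minimal sketch with illustrative target names:

```python
import numpy as np

# Every issue starts with an automatic vote from its creator, so votes are
# floored at 1; views and comments can legitimately be 0.
LOWER_BOUNDS = {"num_votes": 1, "num_views": 0, "num_comments": 0}

def apply_floor(raw_predictions, target):
    """Clip model output so it never falls below the target's lower bound."""
    return np.clip(raw_predictions, LOWER_BOUNDS[target], None)
```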
Features:
- TFIDF bi-gram vector on summary+description, using a word count analyzer. I used a high min_df threshold of 6 to prevent overfitting and to keep the k dimension low relative to each segment; I settled on min_df=6 and the word count analyzer after extensive CV testing of the various TFIDF parameters. (A sketch of this and several of the other features appears after this list.)
- one hot encoded vector of the summary field — This performed better than the TFIDF bi-gram feature on the remote_api segment, so it was used only for that segment. It consisted of a simple one hot vector of all words found in summary.
- one hot encoded hours range – morning, afternoon, evening, late night
- one hot encoded city — used for segmenting and as a feature for the remote_api segment
- one hot encoded latitude and longitude – lat+long rounded to 2 digits
- one hot encoded day of week
- description length using character count — transformed so that a length of 0 was adjusted to -70, giving issues with missing descriptions a stronger negative signal
- log description length using word count — transformed using log (this came from Miroslaw and gave better CV scores than linear description length on a few of the segments, but interestingly not all)
- one hot encoded boolean description flag — worked better for the remote_api segment than using length, for reasons described in my other post about the correlations
- one hot encoded boolean tagtype flag
- one hot encoded boolean weekend flag
- neighborhood blended with zipcode — This one is interesting. Like Miroslaw, I had used a free service to reverse geocode the longitude/latitude into neighborhoods and zipcodes. I didn’t have much luck using the zipcodes on their own, but neighborhoods gave a nice bump, so I was using that feature as a standalone. An even stronger bump came from blending zipcodes with neighborhoods by replacing low-count and missing neighborhoods with zip codes. Then Miroslaw improved that further after we teamed up, using a genius bit of code to replace low counts with the nearest matching neighborhood/zip via a KNN model.
- Total income — I derived this from IRS 2008 tax data by zip code (http://federalgovernmentzipcodes.us/free-zipcode-database.csv). That CSV file contains the number of tax returns and the average income per return for each zip code, which can be used as proxies for population size and average income. Unfortunately, neither of those had much effect on CV by themselves, but when multiplied together to derive total income for a zip code, I got a small boost on some of the segments (~.00060 total gain).
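As promised above, here is a sketch of how a few of these features can be built with scikit-learn and pandas. This is illustrative rather than our exact code: the column names ('summary', 'description', 'latitude', 'longitude', 'created_time') are assumptions, and "bi-gram" is read here as unigrams plus bigrams.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF over summary + description with a word analyzer and min_df=6
tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=6)
text_features = tfidf.fit_transform(
    train["summary"].fillna("") + " " + train["description"].fillna(""))

# One-hot encoded latitude/longitude rounded to 2 decimal places
latlong = (train["latitude"].round(2).astype(str) + "_" +
           train["longitude"].round(2).astype(str))
latlong_features = pd.get_dummies(latlong)

# Description length by character count, with missing descriptions set to -70
desc_len = train["description"].str.len().fillna(0)
desc_len = desc_len.where(desc_len > 0, -70)

# One-hot encoded day of week and weekend flag from the creation timestamp
created = pd.to_datetime(train["created_time"])   # assumed column name
day_of_week = pd.get_dummies(created.dt.dayofweek, prefix="dow")
is_weekend = (created.dt.dayofweek >= 5).astype(int)
```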
Cross-Validation: I started off with basic k-fold validation and quickly realized that my CV results were not matching my submission results. At that point I moved to temporal validation, creating a hold-out set consisting of the last 10% of the issues posted in the training set, roughly equivalent to all the issues posted in March and April. This proved much more effective because there is a significant temporal element to the data: SeeClickFix.com is changing rapidly over time, and there is also seasonality in the data (cold-weather issues vs. warm-weather issues, etc.), which means the random sampling used in k-fold validation is NOT a good method here. After moving to this, CV improvements began to correlate with submission score improvements.
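The switch itself is simple; a minimal sketch of a temporal hold-out split, assuming a creation-timestamp column and the `train` DataFrame from the loading step above:

```python
# Sort by creation time and hold out the most recent 10% of issues,
# instead of the random shuffling that k-fold validation would use.
train = train.sort_values("created_time")   # assumed timestamp column
cutoff = int(len(train) * 0.9)
train_fold = train.iloc[:cutoff]
holdout = train.iloc[cutoff:]
```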
Learning algorithms used: Because I broke the data into so many small subsets, most sets had a small k (features) and n (samples), so training time was generally not an issue, which allowed me to experiment with a wide variety of learning algorithms. The majority of the sub-models were GBMs, as they seemed to perform very well on this data set, probably due to their ability to focus on a small number of important variables and their resistance to overfitting. A few of the subsets ended up using linear models (mostly SGD regressors, plus one SVR with a linear kernel) because they achieved a better CV score on that particular segment.
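As a sketch of that per-segment model selection, here is the kind of comparison loop involved. The data, hyperparameters, and variable names are stand-ins, and the targets are assumed to be modeled on the log1p scale so that squared error lines up with RMSLE:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import SGDRegressor

# Stand-in data for one segment; in practice these come from the segment's
# feature matrices and the temporal hold-out split described above.
rng = np.random.RandomState(0)
X_tr, X_val = rng.rand(500, 20), rng.rand(100, 20)
y_tr, y_val = rng.poisson(3, size=500), rng.poisson(3, size=100)

# Model counts on the log1p scale
y_tr_log, y_val_log = np.log1p(y_tr), np.log1p(y_val)

candidates = {
    "gbm": GradientBoostingRegressor(n_estimators=200, learning_rate=0.05),
    "sgd": SGDRegressor(penalty="l2", alpha=1e-4),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr_log)
    # RMSLE reduces to plain RMSE once both sides are log1p-transformed
    scores[name] = np.sqrt(np.mean((model.predict(X_val) - y_val_log) ** 2))

best_name = min(scores, key=scores.get)  # keep the better model for this segment
```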
Conclusion and Lessons Learned
Our final place was 2nd out of 532 teams with a score of .28976. A great accomplishment, and we were really happy to see the final result once the dust settled after the private data set was released. Our fears of overfitting the leaderboard turned out to be unfounded, as our score shifted very little.
Undoubtedly, our primary advantage and biggest lessons learned in this contest came from two techniques: segment ensembles and ensemble averaging of distinct models.
Segment ensembles are a technique I first learned in the Yelp competition: the data set is broken down into small subsets, models are trained separately on each subset, and the models’ outputs are then combined into a total set of predictions. My particular segmentation ensemble in this contest consisted of 15 base models, one for each combination of target variable and segment, where the segments were the four cities plus remote_api_created issues ((4 cities + remote_api) x 3 targets = 15 total models). I believe this was effective because it allowed the models to account for interactions between variables that differ across the segments. In statistical terms, segmented ensembles are effective at capturing interactions between variables (for a quick reference, see the section on segmentation here: http://www.fico.com/en/Communities/Analytic-Technologies/Pages/EnsembleModeling.aspx). This data set clearly contains such interactions, as the cities are in many ways distinct from each other (different means and standard deviations for the target variables, different composition of sources, etc.).
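A schematic of that segment-ensemble structure is below. This is not our exact code: `build_model` and `make_features` are hypothetical segment-specific helpers, the city label is assumed to have already been derived (e.g. from latitude/longitude), and the log1p transform on the targets is an assumption consistent with the RMSLE metric.

```python
import numpy as np
import pandas as pd

SEGMENTS = ["Oakland", "Richmond", "New Haven", "Chicago", "remote_api"]
TARGETS = ["num_views", "num_votes", "num_comments"]

def segment_of(df):
    """Remote API issues form their own segment; everything else splits by city."""
    return np.where(df["source"] == "remote_api_created", "remote_api", df["city"])

predictions = {}
for seg in SEGMENTS:
    seg_train = train[segment_of(train) == seg]
    seg_test = test[segment_of(test) == seg]
    for target in TARGETS:
        model = build_model(seg, target)                  # hypothetical helper
        model.fit(make_features(seg_train, seg, target),  # hypothetical helper
                  np.log1p(seg_train[target]))
        preds = np.expm1(model.predict(make_features(seg_test, seg, target)))
        predictions[(seg, target)] = pd.Series(preds, index=seg_test.index)

# 5 segments x 3 targets = 15 models; stitch each target's pieces back together
submission = {t: pd.concat([predictions[(s, t)] for s in SEGMENTS]).sort_index()
              for t in TARGETS}
```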
And in the end, while my segment ensemble performed very well on its own, it performed especially well when combined via ensemble averaging with Miroslaw’s distinct and independent model. The reason is that our models were quite different in approach and therefore had a high degree of diversity (variance in their errors). Ensemble averaging, the technique of combining the predictions from multiple models on the same set of test data (usually with some type of weighted average), was something I had often read about in past competitions but had not put into use until now.
For this contest, I developed a small code base in Python for trying multiple methods of deriving the weights to use when averaging, from a simple 50/50 split or other heuristically chosen weights to weights derived from linear models. I definitely plan on using this technique in all contests going forward, as a group of diverse models will clearly combine to produce more accurate overall predictions than a single strong model and, most importantly, will be more resistant to overfitting.
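As a rough sketch of the two ends of that spectrum, here is a grid-searched blending weight and a linear-model fit on hold-out predictions. The data and variable names are stand-ins, and everything is assumed to be on the log1p scale used by the metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)              # stand-in hold-out data
y_hold = rng.rand(200)                      # true targets (log1p scale)
p_a = y_hold + rng.normal(0, 0.3, 200)      # model A's hold-out predictions
p_b = y_hold + rng.normal(0, 0.3, 200)      # model B's hold-out predictions

# 1) Heuristic / grid-searched weight: try every blend from 0/100 to 100/0
weights = np.linspace(0.0, 1.0, 101)
errors = [np.sqrt(np.mean((w * p_a + (1 - w) * p_b - y_hold) ** 2))
          for w in weights]
best_w = weights[int(np.argmin(errors))]

# 2) Weights derived from a linear model fit on the two sets of predictions
stacker = LinearRegression().fit(np.column_stack([p_a, p_b]), y_hold)
blend = stacker.predict(np.column_stack([p_a, p_b]))
```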