Bryan Gregory | Data Science, Business Intelligence, Analytics, Visualization, Agile Development

First Place in the WSDM Cup 2018!

Posted on 02/25/2018 04/30/2018 by Bryan Gregory

Big news! I won a second Kaggle competition (apparently lightning does strike twice), including a $2,500 prize and an invitation to be a guest speaker at the WSDM 2018 Conference.

Over the Thanksgiving and Christmas breaks I decided to compete in another Kaggle competition. This time the challenge was to build a subscription churn model for the Asian music subscription company KKBOX. I quickly rose to the top 5 and held on until the end to place 1st out of 575 teams.

My solution consisted of Microsoft T-SQL for ETL/munging and some feature creation (the datasets were quite large, ~30GB) and Python/PANDAS/Sklearn for modelling with XGBoost/LightGBM being the primary learning algorithms used. Feature engineering dominated this competition and creativity paid off!

———Tools Used———

Microsoft SQL Server 2016, Linux mode (Azure VM)
LightGBM Python Library – 2.0.11
XGBoost Python Library – 0.6
SKLearn – 0.19.1
Pandas – 0.22.0
NumPy – 1.14.0 rc1

———–Links————

Recorded Presentation (video): https://www.youtube.com/watch?v=OEDUzVH1aDI
My Medium post detailing my solution: https://medium.com/@bryan.gregory1/predicting-customer-churn-extreme-gradient-boosting-with-temporal-data-332c0d9f32bf
WSDM 2018 Cup Challenge: https://wsdm-cup-2018.kkbox.events/
Kaggle Competition Overview: https://www.kaggle.com/c/kkbox-churn-prediction-challenge
Final standings: https://www.kaggle.com/c/kkbox-churn-prediction-challenge/leaderboard

Heading to SQL PASS 2017!

Posted on 10/30/2017 04/30/2018 by Bryan Gregory

PRIZE WIN! Competition Recap: SeeClickFix.com — Predicting Views, Votes, and Comments

Posted on 02/19/2014 04/30/2018 by Bryan Gregory

WOOHOO! Excitement, relief, and exhaustion. That’s perhaps the best way to summarize my latest data science competition experience. After quite a bit of hard work, I placed 2nd in the SeeClickFix.com extended competition, a continuation of the 24-hour Hackathon competition I participated in a few months ago. The prize win for our team was $1,000, which is a nice payoff for the hard work!

This was also different from past competitions in that this time around I formed a team with another Kaggler, Miroslaw Horbal. At the start, I entered the contest solo, applying much of the same code that I used in the Hackathon competition, and everything went great. I immediately entered the top 10 and stayed there, slowly working my way into the top position midway through the contest. But with only a few weeks left, others were gaining and I decided to team up with Miroslaw, who was at that time in command of 4th place. Teaming up with him was a great experience and not only helped us achieve 2nd place (I doubt I could have done it on my own), but also helped me learn a great deal about collaborating on a data science project.

We were invited by Kaggle to write up our methodology and approach to the competition. You can find the article here: http://blog.kaggle.com/2014/01/07/qa-with-bryan-miroslaw-2nd-place-in-the-see-click-predict-fix-competition

Also we wrote up a detailed description of our code: http://bryangregory.com/Kaggle/DocumentationforSeeClickFix.pdf

I also briefly touched on my approach in the contest forums: http://www.kaggle.com/c/see-click-predict-fix/forums/t/6464/congratulations-to-the-winners

Contest Overview

http://www.kaggle.com/c/see-click-predict-fix

Goal: The objective of this competition was to create a model that will predict the number of views, votes, and comments that an issue will receive using supervised learning based on historical records. So rather than one target variable like in past contests, this contest had 3 target variables.

Data set: The training data consisted of CSV files containing 3-1-1 issues from four cities (Oakland, Richmond, New Haven, Chicago) covering the time period from 1-1-2012 to 4-30-2013, and a test set covering 5-1-2013 to 9-17-2013. The data attributes included information about each 3-1-1 issue such as its latitude and longitude, the source from which it was created (iPhone, Android, web, etc.), the text of the issue, and the creation date.

Error metric: RMSLE (Root Mean Squared Logarithmic Error). This is a variant of the common RMSE error metric in which a log operation is applied to the predicted and actual values (after adding 1) before the square, mean, and root operations are performed. This reduces the impact that large misses on high-count issues have on the overall score, placing more emphasis on getting the majority of low-count predictions correct. It is very useful for problems in which some unpredictable large misses are expected, and that certainly applies to this contest, in which a handful of issues receive thousands of views and dozens of votes while most receive only a fraction of that.
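To make the metric concrete, here is a minimal sketch of RMSLE next to plain RMSE using NumPy. The function names and the toy numbers are illustrative only and are not taken from the contest code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error.

    log1p(x) = log(1 + x), so counts of zero are handled safely.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Toy example (made-up numbers): one issue with 1000 views is badly missed.
actual = [0, 1, 2, 1000]
predicted = [0, 1, 2, 500]
print("RMSE :", rmse(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))
```

On the toy data the single large miss dominates RMSE, while RMSLE only sees the ratio between (prediction + 1) and (actual + 1), so the penalty is far smaller.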

My Model and Workflow

Tools: I piggybacked on my existing code developed for the Hackathon, which used the SKLearn/PANDAS/NumPy stack.

Loading and munging: As with the Hackathon contest, there were some null and missing values to deal with. I did some replacement of rare values with a “__MISSING__” flag, and I performed some cleaning up of the text attributes (converting all to lowercase, removing odd characters, etc.).

I also spent some time getting to know the data (something I didn’t have the luxury of doing in the Hackathon contest), and learned some interesting things. One very valuable business rule learned from running max/min/mean/mode analysis on subsets of the data was that all issues start with 1 vote (presumably an automatic vote by the issue creator), which meant that a lower bound of 1 needed to be included in the model when predicting votes, whereas both views and comments had a lower bound of 0.
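One simple way to enforce those lower bounds is to clip the predictions after the model runs. This is only a sketch: the target names num_views, num_votes, and num_comments are assumed here for illustration, and the original code may have applied the bound differently.

```python
import numpy as np

# Assumed target names; the bounds come from the business rule described above.
LOWER_BOUNDS = {"num_views": 0.0, "num_votes": 1.0, "num_comments": 0.0}


def apply_lower_bound(preds, target):
    """Raise any prediction below the target's known minimum up to that minimum."""
    preds = np.asarray(preds, dtype=float)
    return np.maximum(preds, LOWER_BOUNDS[target])


# Hypothetical raw model output for votes, including impossible values below 1.
raw_votes = [0.2, 3.0, -1.0]
print(apply_lower_bound(raw_votes, "num_votes"))
```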

Features:

TFIDF Bi-gram Vector on summary+description, using word count analyzer. Used a high min_df count threshold of 6 to prevent overfitting and keep the k dimension low relative to each segment. I arrived at a min_df of 6 and using word count analyzer after extensive CV testing of the various TFIDF parameters.
one hot encoded vector of the summary field — This performed better than the TFIDF bi-gram feature on the remote_api segment, so was used only for that. It consisted of a simple one hot vector of all words found in summary.
one hot encoded hours range – morning, afternoon, evening, late night
one hot encoded city — used for segmenting and as a feature for the remote_api segment
one hot encoded latitude and longitude – lat+long rounded to 2 digits
one hot encoded day of week
description length using character count — transformed so that length of 0 was adjusted to -70, thereby giving issues missing descriptions a higher negative impact
log description length using word count — transformed using log (this came from Miroslaw and gave better CV scores than linear description length on a few of the segments, but interestingly not all)
one hot encoded boolean description flag — worked better for the remote_api segment than using length, for reasons described in my other post about the correlations
one hot encoded boolean tagtype flag
one hot encoded boolean weekend flag
neighborhood blended with zipcode — This one is an interesting one. Like Miroslaw, I had used a free service to reverse geocode the longitude/latitude into neighborhoods and zipcodes. I didn’t have much luck using the zipcodes on their own, but neighborhoods gave a nice bump so I was using that feature as a standalone. An even stronger bump came from blending zipcodes with neighborhoods by replacing low count and missing neighborhoods with zip codes. Then Miroslaw improved that further after we teamed up by using a genius bit of code to replace low counts with the nearest matching neighborhood/zip using a KNN model.
Total income — I derived this from IRS 2008 tax data by zip code (http://federalgovernmentzipcodes.us/free-zipcode-database.csv). That CSV file contains the # of tax returns and the average income of the tax return for each zip code, which can be used as proxies for population size and avg income. Unfortunately neither of those had much effect on CV by themselves, but when multiplied together to derive total income for a zip code, I got a small boost on some of the segments (~.00060 total gain)
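The last two features in the list can be sketched in a few lines of pandas. This is a hedged illustration, not the contest code: the column names (neighborhood, zipcode, num_returns, avg_income) and the min_count threshold are assumptions, and the KNN refinement Miroslaw added is not shown.

```python
import pandas as pd


def blend_neighborhood_zip(df, min_count=5,
                           nbhd_col="neighborhood", zip_col="zipcode"):
    """Replace missing or low-count neighborhoods with the issue's zip code."""
    counts = df[nbhd_col].value_counts()  # NaN neighborhoods are not counted
    seen = df[nbhd_col].map(counts).fillna(0)
    rare_or_missing = df[nbhd_col].isna() | (seen < min_count)
    # Keep the neighborhood where it is common enough, else fall back to the zip.
    return df[nbhd_col].where(~rare_or_missing, df[zip_col].astype(str))


def add_total_income(zip_df, returns_col="num_returns", income_col="avg_income"):
    """Total income per zip = number of tax returns * average income per return."""
    out = zip_df.copy()
    out["total_income"] = out[returns_col] * out[income_col]
    return out


# Toy data (made up) to show the behavior.
issues = pd.DataFrame({
    "neighborhood": ["Downtown", "Downtown", "Hilltop", None],
    "zipcode": ["94607", "94607", "94611", "94612"],
})
issues["nbhd_zip"] = blend_neighborhood_zip(issues, min_count=2)
print(issues)
```

The blended column can then be one hot encoded like the other categorical features.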

Cross-Validation: I started off with basic k-fold validation and quickly realized that my CV results were not matching with my submission results. At that point I moved to a temporal validation by creating a hold out set consisting of the last 10% of the issues posted in the training set, which was roughly equivalent to all the issues posted in March and April. This proved much more effective as there is a significant temporal element to the data because SeeClickFix.com is rapidly changing over time and there is also seasonality to the data (cold-weather issues vs warm-weather issues, etc.), which meant that the random sampling used in k-fold validation is NOT a good method. After moving to this, CV improvement began to correlate with submission score improvements.
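A temporal hold-out like the one described above can be sketched as follows. The column name created_time and the helper name are assumptions for illustration; the 10% fraction comes from the text.

```python
import pandas as pd


def temporal_holdout(df, time_col="created_time", holdout_frac=0.10):
    """Split so that the most recent `holdout_frac` of rows form the hold-out set."""
    ordered = df.sort_values(time_col, kind="mergesort")  # stable sort
    n_holdout = int(round(len(ordered) * holdout_frac))
    split = len(ordered) - n_holdout
    return ordered.iloc[:split], ordered.iloc[split:]


# Toy example with made-up creation dates.
issues = pd.DataFrame({
    "created_time": pd.to_datetime(
        ["2013-04-20", "2012-01-05", "2012-06-01", "2013-03-15", "2012-09-09"]
    ),
    "num_votes": [3, 1, 1, 2, 1],
})
train_part, holdout_part = temporal_holdout(issues, holdout_frac=0.2)
print(holdout_part)
```

Unlike k-fold, every hold-out row is newer than every training row, which mirrors how the test set relates to the training set in this contest.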

Learning algorithms used: Because I broke the data into so many small subsets, most sets had a small k (features) and n (samples), so training time was generally not an issue, which allowed a wide variety of learning algorithms to be experimented with. The majority of the sub models were GBMs as they seemed to perform very well on this data set, probably due to their ability to focus on a small number of important variables, and their ability to resist overfitting. A few of the subsets ended up using linear models (mostly SGD regressors, 1 SVR with linear kernel) because of a better performing CV score for that particular segment.

Conclusion and Lessons Learned

Our final place was #2 out of 532 teams with a score of .28976. It was a great accomplishment, and we were really happy to see the final result once the dust settled after the private data set was released. Our fears of overfitting the leaderboard turned out to be unfounded, as our score changed very little.

Undoubtedly our primary advantage and biggest lessons learned on this contest came from 2 techniques: segment ensembles and ensemble averaging of distinct models.

Segment ensembles are a technique I first learned in the Yelp competition: break the data set into small subsets, train models separately on each subset, then combine the models’ output into a total set of predictions. My particular segment ensemble in this contest consisted of 15 base models trained on distinct sub-segments defined by each target variable, each city, and remote_api_created issues ((4 cities + remote_api) x 3 targets = 15 total models). I believe this was effective because it allowed the models to take into account interactions between variables that differ between segments. In statistical terms, segmented ensembles are effective at capturing interactions between variables (for a quick reference, see the section on segmentation here: http://www.fico.com/en/Communities/Analytic-Technologies/Pages/EnsembleModeling.aspx). This data set clearly contains some interaction, as each of the cities is in many ways distinct from the others (different means and standard deviations for the target variables, different composition of sources, etc.).
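The segmentation idea can be sketched with scikit-learn as one model per (segment, target) pair. This is a simplified illustration under stated assumptions: the column names city and source, the helper names, and the use of a default GradientBoostingRegressor for every segment are all hypothetical, whereas the real solution tuned features and algorithms per segment.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def segment_key(df):
    """remote_api_created issues form their own segment; the rest split by city."""
    return df["city"].where(df["source"] != "remote_api_created", "remote_api")


def fit_segment_models(train, feature_cols, targets,
                       make_model=lambda: GradientBoostingRegressor(random_state=0)):
    """Train one model for every (segment, target) combination."""
    models = {}
    for seg, seg_df in train.groupby(segment_key(train)):
        for target in targets:
            model = make_model()
            model.fit(seg_df[feature_cols], seg_df[target])
            models[(seg, target)] = model
    return models


def predict_segments(models, test, feature_cols, targets):
    """Route each test row to its segment's models and stitch the output together."""
    preds = pd.DataFrame(index=test.index, columns=targets, dtype=float)
    for seg, seg_df in test.groupby(segment_key(test)):
        for target in targets:
            preds.loc[seg_df.index, target] = (
                models[(seg, target)].predict(seg_df[feature_cols])
            )
    return preds
```

With four cities plus the remote_api segment and three targets, fit_segment_models would return the 15 models described above. Every segment present at prediction time must also have been present during training.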

And in the end, while my approach of using a segment ensemble performed very well on its own, it performed especially well when combined using ensemble averaging with Miroslaw’s distinct and independent model. The reason is that our models were quite different in approach and therefore had a high degree of diversity (variance in errors). Ensemble averaging is a technique I had often read about in past competitions but had not put into use until now: combine the predictions from multiple models on the same set of test data, usually with some type of averaging.

For this contest, I developed a small code base in Python for trying multiple methods of deriving the weights to use when averaging, from simple 50/50 or other heuristically developed weights to weights derived from linear models. I definitely plan on using this technique in all contests going forward, as clearly a group of diverse models will combine to create more accurate overall predictions than a single strong model, and, most importantly, will be more resistant to overfitting.
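Both weighting approaches can be sketched with NumPy alone. This is not the code base mentioned above: the function names are hypothetical, and an unconstrained least-squares fit on hold-out predictions stands in here for "linear model derived weights".

```python
import numpy as np


def blend(preds, weights):
    """Weighted combination of several models' predictions for the same rows."""
    return np.column_stack(preds) @ np.asarray(weights, dtype=float)


def fit_blend_weights(holdout_preds, y_holdout):
    """Derive weights by least squares on hold-out predictions (no intercept)."""
    X = np.column_stack(holdout_preds)
    y = np.asarray(y_holdout, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))


# Synthetic demo: two models with independent errors around the same truth.
rng = np.random.RandomState(42)
truth = rng.uniform(0, 10, 2000)
model_a = truth + rng.normal(0, 1, 2000)
model_b = truth + rng.normal(0, 1, 2000)

print("model A   :", rmse(truth, model_a))
print("model B   :", rmse(truth, model_b))
print("50/50     :", rmse(truth, blend([model_a, model_b], [0.5, 0.5])))
learned = fit_blend_weights([model_a, model_b], truth)
print("fit (lstsq):", rmse(truth, blend([model_a, model_b], learned)))
```

The gain from averaging depends on the errors being at least partly independent, which is why diversity between the two models mattered. Weights should be fit on hold-out predictions rather than training predictions to avoid rewarding the model that overfits most.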

Python code for performing reverse address lookups using longitude and latitude (uses free Nominatim OSM/Mapquest databases)

Posted on 12/17/2013 03/13/2014 by Bryan Gregory

In my last Kaggle contest (read about it here), I created a small utility in Python for performing reverse lookup of address data (zip code, neighborhood, street, etc.) using longitude and latitude fed to the free Nominatim OSM/Mapquest databases. This is a better option than the Google Maps API for bulk geo lookups because it has no daily limit on calls, whereas the Google Maps API (free version) has a 5000 call daily limit.

With the contest completed, I’ve had a little free time, so I thought I would clean up the code a bit and release it publicly in case it might be of use to anyone else on a future Kaggle contest or other data science project. So here it is: https://github.com/theusual/reverse_geocoding_nominatim

It’s easy to use: the input can be any flat file with longitude and latitude fields, and it returns street address, zip code, neighborhood, and city/township. It could easily be changed to also pull country, state, and country code.
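For readers who want the general shape of such a lookup without opening the repository, here is a minimal standard-library sketch. It is not the code from the repository above. The endpoint URL, the User-Agent string, and the helper names are assumptions for illustration, and the address keys (road, postcode, neighbourhood, suburb, city, town, village) are the ones Nominatim commonly returns but vary by location.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the public OSM-hosted Nominatim instance.
NOMINATIM_REVERSE_URL = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon, base_url=NOMINATIM_REVERSE_URL):
    """Build a reverse-geocoding request URL for one coordinate pair."""
    query = urlencode({"format": "json", "lat": lat, "lon": lon, "addressdetails": 1})
    return f"{base_url}?{query}"


def parse_address(response):
    """Pull the fields of interest out of a decoded Nominatim JSON response."""
    address = response.get("address", {})
    return {
        "street": address.get("road"),
        "zipcode": address.get("postcode"),
        "neighborhood": address.get("neighbourhood") or address.get("suburb"),
        "city": address.get("city") or address.get("town") or address.get("village"),
    }


def reverse_geocode(lat, lon, pause_seconds=1.0):
    """Look up one coordinate pair over the network (be polite: pause between calls)."""
    # Hypothetical User-Agent; identify your own application here.
    request = Request(build_reverse_url(lat, lon),
                      headers={"User-Agent": "example-reverse-geocoder"})
    with urlopen(request) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    time.sleep(pause_seconds)
    return parse_address(payload)
```

Looping reverse_geocode over the rows of a flat file gives the per-issue address columns. Check the usage policy of whichever Nominatim host you call before running it in bulk.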

Nothing fancy, but hopefully this may be of help to someone in the future!

Data Visualization: Kaggle Contest Entry for SeeClickFix.com – Community Activity by Zipcode

Posted on 12/13/2013 03/13/2014 by Bryan Gregory

Just wanted to make a quick post highlighting my last minute entry for the visualization portion of the Kaggle SeeClickFix.com competition.

Overview of my entry:

My visualization entry consists of a dashboard summary and a time-series video showing the SeeClickFix.com community activity, both historic and forecasted, by zip code, as determined using reverse lookup of the issue’s longitude and latitude. Activity level is measured by average community votes that an issue receives. The models are based on both the training (01/2012-04/2013) and test (05/2013-09/2013) data sets. The average votes for issues in the test data (05/2013 -09/2013) have been populated using our team’s #2 ranked prediction model.

Average votes was the chosen metric for community activity levels because it is stable over time with a lower variance, and it best represents a community’s interest in fixing the issues. Views and comments are noisy with very high variance and are possibly influenced by users outside of a community. Also, city_initiated and remote_api_created issues were filtered from the data set to make comparison across cities more reasonable.

The dashboard summary encapsulates all the data into an aggregate activity level for each zip code in each city and is visualized using a combination of heat mapping for each city and a tree-view of overall most active zip codes. Fully interactive dashboard summary available here: http://public.tableausoftware.com/views/Kaggle-SeeClickFix-ActivityByZipcode/Dashboard1?:embed=y&:display_count=no#1

And here is a non-interactive snapshot of the summary:

The video consists of quarterly time-series heat maps of each city using the same scale as the dashboard summary. In contrast to the dashboard aggregate summary, the time series model illustrates changing community activity levels over time, including the forecasted activity levels for Q2 and Q3 2013. Video available on YouTube: https://www.youtube.com/watch?v=DlE2uMZ44QQ

Don’t Break The Chain Mobile App — Development Progress

Posted on 11/26/2013 03/13/2014 by Bryan Gregory

Over the Thanksgiving holiday, I’ve taken a short break from the data science competitions and picked back up on developing my mobile productivity app I started earlier this year. It’s coming along nicely and I hope to have it released by the end of Q2 2014.

It’s made in HTML5/JS/CSS3, so it’s cross-platform and will run on any mobile device or workstation. I’m also working on developing a server back-end for it that will store goal progress data in a centralized database, thereby allowing all versions of the app to synchronize via user login. For example, a user can update goals on an Android phone, an iPad tablet, and a laptop using the web browser, and all 3 apps will be synchronized. Even cooler, it will store the data locally using HTML5 local storage in addition to storing it on the remote server, which means that a user can go offline (no web access) and still update goals and have them synchronize when the device comes online again.

The backend currently uses NodeJS and CouchDB, and I’m loving NodeJS.

Here are some screenshots:

The Login Menu

The Calendar

The Goals Menu

Kaggle Data Science Competition Recap: Hackathon — SeeClickFix.com — Predicting Views, Votes, and Comments

Posted on 10/01/2013 03/04/2014 by Bryan Gregory

Just finished my first ever “Hackathon”! Not familiar with the concept of a Hackathon? Neither was I before now.

A Hackathon is basically a very short and very intense session of code development focused on one project. This particular hackathon was hosted by Kaggle and was a data science competition with the goal of developing a predictive model for the popular SeeClickFix.com website. SeeClickFix.com helps city governments crowd-source their non-emergency issues (3-1-1 issues) by providing a platform for citizens to post city issues. Users can also view, vote, and comment on issues posted by others, which brings us to the target of this contest: create a model that will predict the number of views, votes, and comments that an issue will receive.

Being a hackathon, the competition lasted only 25 hours, starting at 7:00pm (CST) on Friday and going until 8:00pm (CST) on Saturday. So rather than weeks or even months to develop a robust model, contestants had only a matter of hours. This was a fun challenge, so I eagerly cleared my schedule, stocked up on snacks and caffeine, and got ready to code! 24 hours (and a few naps) later, the final leaderboard was released and I had landed at #15 out of 80 teams.

Here’s a recap: