First Place in the WSDM Cup 2018!

Big news! I won a second Kaggle competition (apparently lightning does strike twice), including a $2,500 prize and an invitation to be a guest speaker at the WSDM 2018 Conference.

Over the Thanksgiving and Christmas breaks I decided to compete in another Kaggle competition. This time the challenge was to build a subscription churn model for the Asian music subscription service KKBOX. I quickly rose to the top 5 and held on until the end to place 1st out of 575 teams.

My solution used Microsoft T-SQL for ETL/munging and some of the feature creation (the datasets were quite large, ~30GB), and Python/PANDAS/SKLearn for modeling, with XGBoost and LightGBM as the primary learning algorithms. Feature engineering dominated this competition, and creativity paid off!
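For reference, here is a minimal sketch of the kind of LightGBM setup described above, assuming the SQL-engineered features have already been exported to a flat file with an id column and an is_churn label (illustrative file and column names, not the actual competition pipeline):

    import lightgbm as lgb
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Illustrative only: assumes features were engineered in SQL and dumped
    # to a flat file with an `msno` id column and `is_churn` label.
    df = pd.read_csv("train_features.csv")
    X, y = df.drop(columns=["msno", "is_churn"]), df["is_churn"]

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = lgb.LGBMClassifier(
        n_estimators=2000,   # rely on early stopping to pick the round count
        learning_rate=0.02,
        num_leaves=63,
    )
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        eval_metric="binary_logloss",
        early_stopping_rounds=100,
    )
    val_preds = model.predict_proba(X_val)[:, 1]

The hyperparameters here are placeholders; the point is the gradient-boosting workflow, not the exact settings.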

———Tools Used———

  • Microsoft SQL Server 2016, Linux mode (Azure VM)
  • LightGBM Python Library – 2.0.11
  • XGBoost Python Library – 0.6
  • SKLearn – 0.19.1
  • Pandas – 0.22.0
  • NumPy – 1.14.0 rc1

———–Links————

PRIZE WIN! Competition Recap: SeeClickFix.com — Predicting Views, Votes, and Comments

WOOHOO!  Excitement, relief, and exhaustion.  That’s perhaps the best way to summarize my latest data science competition experience.  After quite a bit of hard work, I placed 2nd in the SeeClickFix.com extended competition, a continuation of the 24-hour Hackathon competition I participated in a few months ago. The prize win for our team was $1,000, which is a nice payoff for the hard work!

This was also different from past competitions in that this time around I formed a team with another Kaggler, Miroslaw Horbal.  At the start,  I entered the contest solo, applying much of the same code that I used in the Hackathon competition, and everything went great.  I immediately entered the top 10 and stayed there, slowly working my way into the top position midway through the contest.  But with only a few weeks left, others were gaining and I decided to team up with Miroslaw, who was at that time in command of 4th place.  Teaming up with him was a great experience and not only helped us achieve 2nd place (I doubt I could have done it on my own), but also helped me learn a great deal about collaborating on a data science project.

We were invited by Kaggle to write up our methodology and approach to the competition; you can find the article here:  http://blog.kaggle.com/2014/01/07/qa-with-bryan-miroslaw-2nd-place-in-the-see-click-predict-fix-competition

We also wrote up a detailed description of our code:  http://bryangregory.com/Kaggle/DocumentationforSeeClickFix.pdf

I also briefly touched on my approach in the contest forums: http://www.kaggle.com/c/see-click-predict-fix/forums/t/6464/congratulations-to-the-winners

Contest Overview

http://www.kaggle.com/c/see-click-predict-fix

Goal: The objective of this competition was to create a model that will predict the number of views, votes, and comments that an issue will receive using supervised learning based on historical records.  So rather than one target variable like in past contests, this contest had 3 target variables.

Data set: The training data consisted of CSV files containing 3-1-1 issues from four cities (Oakland, Richmond, New Haven, Chicago) covering the time period from 1-1-2012 to 4-30-2013, and a test set covering from 5-1-2013 to 9-17-2013. The data attributes given included information about the 3-1-1 issue such as latitude and longitude of the issue, source from which the issue was created (iPhone, android, web, etc.), text of the issue, and creation date.

Error metric:   RMSLE (Root Mean Squared Logarithmic Error).  This is a variant of the common RMSE metric in which a log transform is applied to the predictions and actuals before the square, mean, and root operations.  This reduces the impact that a few very large misses have on the overall score, placing more emphasis on getting the majority of low-valued predictions correct.  It is very useful for problems in which some unpredictable large misses are expected from the model, and that certainly applies to this contest, in which a handful of issues receive thousands of views and dozens of votes while most receive only a fraction of that.
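As a quick reference, here is a minimal sketch of the metric (my own illustration, not competition-provided code):

    import numpy as np

    def rmsle(y_true, y_pred):
        """Root Mean Squared Logarithmic Error.

        log1p is used so zero-valued targets (e.g. 0 views or comments)
        are handled without taking log(0).
        """
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

    # One big miss barely moves RMSLE compared to what it would do to RMSE.
    print(rmsle([1, 2, 3, 1000], [1, 2, 3, 100]))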

My Model and Workflow

Tools: I piggybacked on my existing code developed for the Hackathon, which consisted of the SKLearn/PANDAS/NumPy stack.

Loading and munging: As with the Hackathon contest, there were some null and missing values to deal with.  I did some replacement of rare values with a “__MISSING__” flag, and I performed some cleaning up of the text attributes (converting all to lowercase, removing odd characters, etc.).

I also spent some time getting to know the data (something I didn’t have the luxury of doing in the Hackathon contest), and learned some interesting things.  One very valuable business rule learned from running max/min/mean/mode analysis on subsets of the data was that all issues default with a vote of 1 (assuming auto vote by the issue creator), which meant that a lower bound of 1 needed to be included in the model when predicting votes, whereas both views and comments had a lower bound of 0.
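A minimal sketch of how such lower bounds might be enforced as a post-processing step (my illustration; the arrays and variable names are hypothetical):

    import numpy as np

    # Hypothetical raw model output for each target.
    votes_pred = np.array([0.2, 1.7, -0.3])
    views_pred = np.array([5.1, -0.4, 2.0])
    comments_pred = np.array([-0.1, 0.0, 0.8])

    # Every issue starts with one vote from its creator, so votes are floored
    # at 1; views and comments are floored at 0.
    votes_pred = np.clip(votes_pred, 1, None)
    views_pred = np.clip(views_pred, 0, None)
    comments_pred = np.clip(comments_pred, 0, None)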

Features:

  • TFIDF Bi-gram Vector on summary+description, using word count analyzer.  Used a high min_df count threshold of 6 to prevent overfitting and keep the k dimension low relative to each segment.  I arrived at a min_df of 6 and the word count analyzer after extensive CV testing of the various TFIDF parameters (see the sketch after this list).
  • one hot encoded vector of the summary field — This performed better than the TFIDF bi-gram feature on the remote_api segment, so was used only for that.  It consisted of a simple one hot vector of all words found in summary.
  • one hot encoded hours range  – morning, afternoon, evening, late night
  • one hot encoded city — used for segmenting and as a feature for the remote_api segment
  • one hot encoded latitude and longitude  – lat+long rounded to 2 digits
  • one hot encoded day of week
  • description length using character count — transformed so that length of 0 was adjusted to -70, thereby giving issues missing descriptions a higher negative impact 
  • log description length using word count —  transformed using log (this came from Miroslaw and gave better CV scores than linear description length on a few of the segments, but interestingly not all)
  • one hot encoded boolean description flag — worked better for the remote_api segment than using length, for reasons described in my other post about the correlations
  • one hot encoded boolean tagtype flag
  • one hot encoded boolean weekend flag
  • neighborhood blended with zipcode —  This one is an interesting one.  Like Miroslaw, I had used a free service to reverse geocode the longitude/latitude into neighborhoods and zipcodes.  I didn’t have much luck using the zipcodes on their own, but neighborhoods gave a nice bump, so I was using that feature as a standalone.  An even stronger bump came from blending zipcodes with neighborhoods by replacing low-count and missing neighborhoods with zip codes.  Then Miroslaw improved that further after we teamed up by using a genius bit of code to replace low counts with the nearest matching neighborhood/zip using a KNN model.
  • Total income — I derived this from IRS 2008 tax data by zip code (http://federalgovernmentzipcodes.us/free-zipcode-database.csv).  That CSV file contains the # of tax returns and the average income per return for each zip code, which can be used as proxies for population size and average income.  Unfortunately neither of those had much effect on CV by themselves, but when multiplied together to derive total income for a zip code, I got a small boost on some of the segments (~.00060 total gain)
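For the text feature in the first bullet, here is a minimal sketch of that kind of vectorizer configuration (a toy corpus stands in for one segment's summary+description text; the parameters match those described above, but this is not the exact competition code):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus standing in for the summary + description text of one segment.
    text = ["pothole on main street"] * 10 + ["graffiti removal needed"] * 10

    vectorizer = TfidfVectorizer(
        analyzer="word",      # word-count analyzer
        ngram_range=(1, 2),   # uni- and bi-grams
        min_df=6,             # drop terms seen in fewer than 6 documents
    )
    X_text = vectorizer.fit_transform(text)   # sparse matrix, one row per issue
    print(X_text.shape, sorted(vectorizer.vocabulary_))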

Cross-Validation: I started off with basic k-fold validation and quickly realized that my CV results were not matching my submission results.  At that point I moved to temporal validation by creating a hold-out set from the last 10% of the issues posted in the training set, roughly equivalent to all the issues posted in March and April.  This proved much more effective: there is a significant temporal element to the data because SeeClickFix.com is rapidly changing over time, and there is also seasonality to the data (cold-weather issues vs. warm-weather issues, etc.), which means the random sampling used in k-fold validation is NOT a good method here.  After moving to the temporal hold-out, CV improvements began to correlate with submission score improvements.
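A minimal sketch of that kind of time-based split (assuming a created_time column; illustrative, not the exact code used):

    import pandas as pd

    def temporal_holdout(df, time_col="created_time", holdout_frac=0.10):
        """Use the most recent `holdout_frac` of rows (by timestamp) as the
        validation set and everything before that as the training set."""
        df = df.sort_values(time_col)
        cutoff = int(len(df) * (1 - holdout_frac))
        return df.iloc[:cutoff], df.iloc[cutoff:]

    # Example with a tiny hypothetical frame.
    issues = pd.DataFrame({
        "created_time": pd.date_range("2012-01-01", periods=100, freq="D"),
        "num_votes": range(100),
    })
    train, valid = temporal_holdout(issues)
    print(len(train), len(valid))  # 90 10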

Learning algorithms used:  Because I broke the data into so many small subsets, most sets had a small k (features) and n (samples), so training time was generally not an issue, which allowed me to experiment with a wide variety of learning algorithms.  The majority of the sub-models were GBMs, as they seemed to perform very well on this data set, probably due to their ability to focus on a small number of important variables and their resistance to overfitting.  A few of the subsets ended up using linear models (mostly SGD regressors, plus one SVR with a linear kernel) because they gave a better CV score for that particular segment.

Conclusion and Lessons Learned

Our final place was #2 out of 532 teams with a score of .28976.  A great accomplishment, and we were really happy with the final result once the dust settled on the private leaderboard.  Our fear that we had overfit the public leaderboard turned out to be unfounded, as our score saw very little adjustment.

Undoubtedly our primary advantage and biggest lessons learned on this contest came from 2 techniques:  segment ensembles and ensemble averaging of distinct models.

Segment ensembles are a technique I first used in the Yelp competition: break the data set into smaller subsets, train a separate model on each subset, and then combine the models’ outputs into one total set of predictions.  My segmentation ensemble in this contest consisted of 15 base models trained on distinct sub-segments, one for each combination of segment and target variable ((4 cities + remote_api) x 3 targets = 15 total models).  I believe this was effective because it allowed the models to account for interactions between variables that differ across segments.  In statistical terms, segmented ensembles are effective at capturing interactions between variables (for a quick reference, see the section on segmentation here: http://www.fico.com/en/Communities/Analytic-Technologies/Pages/EnsembleModeling.aspx). This data set clearly contains such interactions, as the cities are in many ways distinct from each other (different means and standard deviations for the target variables, different composition of sources, etc.)
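A minimal sketch of the segment-ensemble idea (hypothetical column and model choices; the real segments were the four cities plus remote_api, times the three targets):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    def fit_segment_models(train, segment_col, target_col, feature_cols):
        """Train one model per segment value and return them keyed by segment."""
        models = {}
        for seg, seg_df in train.groupby(segment_col):
            model = GradientBoostingRegressor()
            model.fit(seg_df[feature_cols], seg_df[target_col])
            models[seg] = model
        return models

    def predict_by_segment(models, test, segment_col, feature_cols):
        """Route each test row to its segment's model and stitch the
        predictions back together in the original row order."""
        preds = pd.Series(index=test.index, dtype=float)
        for seg, seg_df in test.groupby(segment_col):
            preds.loc[seg_df.index] = models[seg].predict(seg_df[feature_cols])
        return preds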

And in the end, while my segment-ensemble approach performed very well on its own, it performed especially well when combined, via ensemble averaging, with Miroslaw’s distinct and independent model.  The reason is that our models were quite different in approach and therefore had a high degree of diversity (variance in errors).  Ensemble averaging, combining the predictions from multiple models on the same test data using some type of weighted average, was a technique I had often read about in past competitions but had not put into use until now.

For this contest, I developed a small Python code base for trying multiple methods of deriving the averaging weights, from a simple 50/50 split and other heuristically chosen weights to weights derived from linear models.  I definitely plan on using this technique in all contests going forward: a group of diverse models, combined, will clearly produce more accurate overall predictions than a single strong model and, most importantly, will be more resistant to overfitting.
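A minimal sketch of searching for a blend weight on a hold-out set (my illustration, using made-up numbers; a simple grid search stands in for the various weighting methods mentioned above):

    import numpy as np

    def rmsle(y_true, y_pred):
        return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

    def best_blend_weight(y_valid, preds_a, preds_b, steps=101):
        """Grid-search the weight w that minimizes RMSLE of w*A + (1-w)*B
        on a hold-out set."""
        best_w, best_score = 0.5, np.inf
        for w in np.linspace(0, 1, steps):
            score = rmsle(y_valid, w * preds_a + (1 - w) * preds_b)
            if score < best_score:
                best_w, best_score = w, score
        return best_w, best_score

    # Hypothetical hold-out targets and two models' predictions.
    y = np.array([3, 1, 8, 2, 5], dtype=float)
    a = np.array([2.5, 1.2, 7.0, 2.2, 6.0])
    b = np.array([4.0, 0.8, 9.5, 1.5, 4.5])
    print(best_blend_weight(y, a, b))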

 

Python code for performing reverse address lookups using longitude and latitude (uses free Nominatim OSM/Mapquest databases)

In my last Kaggle contest (read about it here), I created a small utility in Python for performing reverse lookups of address data (zipcode, neighborhood, street, etc.) by feeding longitude and latitude to the free Nominatim OSM/MapQuest databases.  This is a better option than the Google Maps API for bulk geo lookups because it has no daily limit on calls, whereas the free version of the Google Maps API has a 5,000-call daily limit.

With the contest completed, I’ve had a little free time, so I thought I would clean up the code a bit and release it publicly in case it might be of use to anyone else on a future Kaggle contest or other data science project.  So here it is:  https://github.com/theusual/reverse_geocoding_nominatim

It’s easy to use: the input can be any flatfile with longitude and latitude fields, and it returns street address, zip code, neighborhood, and city/township.  It could easily be changed to also pull county, state, country, and country code.
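For illustration, here is a minimal reverse-geocoding call against the public Nominatim endpoint (a sketch, not the code from the repo above; it assumes the requests library and that you set a descriptive User-Agent as Nominatim asks):

    import requests

    def reverse_geocode(lat, lon):
        """Look up address details for a latitude/longitude pair using the
        free Nominatim (OpenStreetMap) reverse-geocoding endpoint."""
        resp = requests.get(
            "https://nominatim.openstreetmap.org/reverse",
            params={"lat": lat, "lon": lon, "format": "json", "zoom": 18},
            headers={"User-Agent": "reverse-geocode-example"},
            timeout=10,
        )
        resp.raise_for_status()
        address = resp.json().get("address", {})
        return {
            "zipcode": address.get("postcode"),
            "neighborhood": address.get("neighbourhood") or address.get("suburb"),
            "city": address.get("city") or address.get("town"),
            "street": address.get("road"),
        }

    # Example (New Haven, CT area):
    # print(reverse_geocode(41.308, -72.928))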

Nothing fancy, but hopefully this may be of help to someone in the future!

Data Visualization: Kaggle Contest Entry for SeeClickFix.com – Community Activity by Zipcode

Just wanted to make a quick post highlighting my last minute entry for the visualization portion of the Kaggle SeeClickFix.com competition.

Overview of my entry:

My visualization entry consists of a dashboard summary and a time-series video showing the SeeClickFix.com community activity, both historic and forecasted, by zip code, as determined using reverse lookup of the issue’s longitude and latitude.  Activity level is measured by average community votes that an issue receives. The models are based on both the training (01/2012-04/2013) and test (05/2013-09/2013) data sets. The average votes for issues in the test data (05/2013 -09/2013) have been populated using our team’s #2 ranked prediction model.

Average votes was chosen as the metric for community activity because it is stable over time with lower variance, and it best represents a community’s interest in fixing the issues. Views and comments are noisy, with very high variance, and are possibly influenced by users outside a community.  Also, city_initiated and remote_api_created issues were filtered from the data set to make comparison across cities more reasonable.

The dashboard summary encapsulates all the data into an aggregate activity level for each zip code in each city and is visualized using a combination of heat mapping for each city and a tree-view of overall most active zip codes.   Fully interactive dashboard summary available here:  http://public.tableausoftware.com/views/Kaggle-SeeClickFix-ActivityByZipcode/Dashboard1?:embed=y&:display_count=no#1

And here is a non-interactive snapshot of the summary:

[Image: Activity By Zip - Summary]

The video consists of quarterly time-series heat maps of each city using the same scale as the dashboard summary. In contrast to the dashboard aggregate summary, the time series model illustrates changing community activity levels over time, including the forecasted activity levels for Q2 and Q3 2013.  Video available on YouTube : https://www.youtube.com/watch?v=DlE2uMZ44QQ


Kaggle Data Science Competition Recap: Hackathon — SeeClickFix.com — Predicting Views, Votes, and Comments

Just finished my first ever “Hackathon”!  Not familiar with the concept of a Hackathon?  Neither was I before now.

A Hackathon is basically a very short and very intense session of code development focused on one project.  This particular hackathon was hosted by Kaggle and was a data science competition with the goal of developing a predictive model for the popular SeeClickFix.com website.   SeeClickFix.com helps city governments crowd-source their non-emergency issues (3-1-1 issues) by providing a platform for citizens to post city issues.  Users can also view, vote, and comment on issues posted by others, which brings us to the target of this contest: create a model that will predict the number of views, votes, and comments that an issue will receive.

Being a hackathon, the competition lasted only 25 hours, starting at 7:00pm (CST) on Friday and going until 8:00pm (CST) on Saturday.  So rather than weeks or even months to develop a robust model, contestants had only a matter of hours.  This was a fun challenge, so I eagerly cleared my schedule, stocked up on snacks and caffeine, and got ready to code!  Twenty-five hours (and a few naps) later, the final leaderboard was released and I had landed at #15 out of 80 teams.

Here’s a recap:

Contest Overview

http://www.kaggle.com/c/the-seeclickfix-311-challenge

Goal: The objective of this competition was to create a model that will predict the number of views, votes, and comments that an issue will receive using supervised learning based on historical records.  So rather than one target variable like in past contests, this contest had 3 target variables.

Data set: The data consisted of CSV files containing 3-1-1 issues from four cities (Oakland, Richmond, New Haven, Chicago) covering the time period from 1-1-2012 to 4-1-2013. The data attributes given included information about the 3-1-1 issue such as latitude and longitude of the issue, source from which the issue was created (iPhone, android, web, etc.), text of the issue, and creation date.

Error metric:   RMSLE (Root Mean Squared Logarithmic Error).  This is a variant of the common RMSE metric in which a log transform is applied to the predictions and actuals before the square, mean, and root operations.  This reduces the impact that a few very large misses have on the overall score, placing more emphasis on getting the majority of low-valued predictions correct.  It is very useful for problems in which some unpredictable large misses are expected from the model, and that certainly applies to this contest, in which a handful of issues receive thousands of views and dozens of votes while most receive only a fraction of that.

My Model and Workflow

Tools:  I again went with a Python stack of tools: PANDAS and NumPy for loading/munging the data and SKLearn for the machine learning aspects.

Loading and munging: There were some null and missing values to deal with.  I did some replacement of rare values with a “__MISSING__” flag, and I performed some cleaning up of the text attributes (converting all to lowercase, removing odd characters, etc.).  But overall I skipped over much of the more detailed munging process due to time constraints.

Features:

  • Tfidf on summary only using the word analyzer and 1,2 n-gram range.  Tried on summary+description but CV and leaderboard scores dropped.
  • One hot encoder on source and tag type (see the one-hot sketch after this list)
  • One hot encoder on location (long+lat)
  • One hot encoder on time of day range (night, morning, afternoon, and evening)
  • One hot encoder on day of week and month of year
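A minimal sketch of building the one-hot features listed above with pandas (hypothetical column names; get_dummies stands in for whatever encoder is actually used):

    import pandas as pd

    # Hypothetical slice of the issues data.
    issues = pd.DataFrame({
        "source": ["iphone", "web", "android"],
        "tag_type": ["pothole", "graffiti", None],
        "created_time": pd.to_datetime(
            ["2013-01-05 08:30", "2013-01-05 19:10", "2013-01-06 02:45"]),
    })

    # Derive a time-of-day range and day-of-week, then one-hot encode everything.
    hour = issues["created_time"].dt.hour
    issues["tod_range"] = pd.cut(
        hour, bins=[-1, 5, 11, 17, 23],
        labels=["night", "morning", "afternoon", "evening"])
    issues["day_of_week"] = issues["created_time"].dt.dayofweek

    features = pd.get_dummies(
        issues[["source", "tag_type", "tod_range", "day_of_week"]].astype(str),
        prefix=["source", "tag", "tod", "dow"])
    print(features.columns.tolist())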

Cross-Validation:  I stuck to basic k-fold validation in the interest of time, but in retrospect this was NOT ideal for this problem.  There is a significant temporal element to the data because SeeClickFix.com is rapidly changing over time, and there is also seasonality to the data (cold-weather issues vs. warm-weather issues, etc.), which means the random sampling used in k-fold validation is NOT a good method.  A much better method is temporal cross-validation, in which you split off the last section of the training data and create a hold-out set from it.  I’m going to try that CV method in the second version of this contest.

Learning algorithms used:  Due to time restrictions and the large number of samples, I stuck with linear models only because they train much faster. Ridge Regression and an optimized SGD (high iterations and low alpha) gave the best scores; however, I ended up going with Ridge for everything because it was faster to train and I ran out of time near the end.

It was on my to-do list to try a new neural network library I’ve been researching, SparseNN, which I think would have worked well on the non-text data, then combine it into an ensemble with the linear model on the text data.  But alas there was no time so I did not get to that.

Conclusion and Lessons Learned

My final place was #15 overall with a score of .48026.  I was pleased with how rapidly I was able to get a working and accurate model out into the field; it definitely shows that I have come a long way in becoming familiar with machine learning concepts and tools in just a few months’ time.

Other valuable lessons learned from this contest were: how to deal with multi-label regression problems (previously I had only worked with single-label regression), how to work with text data (I applied n-grams using both TF-IDF and word-count vectors), and the trick of transforming target variables prior to using them.  The last one is a trick that I did NOT use in this contest; however, it became clear on the forums after the contest completed that it was essential to landing in the top 10.  It involves taking the natural log of all target variables prior to training the model, then transforming the model output back to derive the final predictions.  This allows the learning algorithms to work with ordinary least squares without the impact of the log component of the RMSLE error metric.  Essentially it factors the “L” out of RMSLE and turns the problem into a standard RMSE one.  Very cool.
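A minimal sketch of that target transformation (I use log1p/expm1 here so zero-valued targets are handled; the forum posts may have used log(y+1) equivalently, and the data below is random filler):

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical feature matrix and raw target (e.g. number of views).
    X = np.random.rand(100, 5)
    y = np.random.poisson(lam=3, size=100).astype(float)

    # Train in log space so squared error there lines up with RMSLE.
    model = Ridge()
    model.fit(X, np.log1p(y))

    # Transform predictions back to the original scale, flooring at 0.
    preds = np.clip(np.expm1(model.predict(X)), 0, None)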


Kaggle Data Science Competition Recap: RecSys2013 – Yelp Business Rating Prediction

Completed another data science competition, and this one was really a doozy!  This was part of the ACM conference on recommender systems (RecSys2013) and it was sponsored by Yelp.com, so the dataset was in a very similar format to the one used for the previous Yelp.com contest in which I competed 2 months ago (Recap: Yelp.com Prediction of Useful Votes for Reviews).  However, in this competition the goal was to predict the business ratings of reviews (review stars) rather than useful votes for reviews.  Same dataset format, different goal (and of course a different set of samples; there was no overlap between the reviews in this dataset and those in the previous one).

This contest progressed very well: I entered early, finally had a significant amount of time to dedicate to developing the model, and had some experience under my belt from previous competitions, so I really felt good about everything.  The result was quite satisfying.  I finished the competition with my best model scoring an RMSE of 1.23107, earning me a 7th place finish!  Finally a coveted Kaggle Top 10 badge :)

Here’s a recap:

Contest Overview

http://www.kaggle.com/c/yelp-recsys-2013

Goal: The objective of this competition was to build a predictive model to accurately predict the review stars which a reviewer would assign to a business.  A very interesting contest indeed!

Like the other Yelp contest, this one involved a mix of cold start, warm start, and full start recommender problems.  This environment was created by including review records that were missing review history for either the user or the business (warm start) or for both the user AND the business (cold start).  The most challenging problem, of course, was how to handle the cold start cases using only the non-review-history attributes of the user and business.

Data set: The data consisted of 229,907 training reviews and 22,956 test reviews from the Phoenix, AZ metropolitan area.  Included in the data set was information about the reviewer (average review stars, total useful votes received for all reviews given, etc.), the business receiving the review (business name, average review stars, etc.), and the review itself (date/time created, review text, etc.).  As mentioned, some businesses and users had their average review stars removed to simulate warm and cold start problems.

Each set of data (business, user, review, checkin) came formatted as large JSON files.

Error metric:   RMSE (Root Mean Squared Error).  This is a common and straightforward regression error metric: the math behind it is intuitive, and it did not require any special transformations to make it compatible with the learning algorithms. In RMSE, lower is better, with 0 being a perfect model.

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-yelp-business-rating-prediction

Tools: I used Python/PANDAS/NumPy for loading and munging the data, SKLearn for models and CV, and Excel for tracking and organizing the different models and submissions.

Loading and munging: The data was given in JSON format so I used the JSON import function found within PANDAS to directly import the JSON files into PANDAS data frames.  Once in dataframe format, it was easy to use the Python console to join together and analyze the data across various dimensions.

After loading, I did quite a bit of data munging and cleaning, including converting the unicode dates into usable date formats, flattening nested columns, and extracting zip codes from full business addresses.  I also performed quite a bit of handling of null and missing values.  Lastly, there were some nuances to the data that had to be teased out.  For example, there were 1107 records in the test set which were missing a user average (no user_id found in the training set user table) but did contain matching user_id’s in the training set’s review table.  In other words, while we did not have an explicit user average for these users, we could calculate one based on their reviews found in the training set.  This being a sample mean (based only on the reviews in this set vs. a true historical average), it was obviously a weaker signal than the true user average, so I had to weight it less in my models.  It did, however, still significantly improve my RMSE over having no user average at all for those records.
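A minimal sketch of backfilling those missing user averages from the training reviews (hypothetical frame and column names):

    import pandas as pd

    # Hypothetical training reviews and test rows missing an explicit user average.
    train_reviews = pd.DataFrame({
        "user_id": ["u1", "u1", "u2", "u3"],
        "stars": [4, 5, 3, 2],
    })
    test = pd.DataFrame({"user_id": ["u1", "u2", "u9"]})

    # Sample mean of each user's training reviews, used as a (weaker) stand-in
    # for the missing historical user average.
    user_sample_avg = train_reviews.groupby("user_id")["stars"].mean()
    test["user_avg_est"] = test["user_id"].map(user_sample_avg)  # u9 stays NaN
    print(test)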

During the course of munging and analyzing the data, I realized that there was a distinct difference in the weight that could be given to review history for new user or business with only a few reviews and a reviewer or business with many reviews, so a different model would be ideal for each group.  A great way to handle this problem is to parse the data into subsets (segments), thereby breaking the problem into smaller and more defined problems.  A separate model for each subset can be trained and used, and then the combined predictions of all models can be used as a submission.  I found out later that this is a technique in machine learning called a segmentation ensemble.

I broke the data down into 15 data subsets in total for which I used slightly different models depending on the features of the data.  The subsets were split according to whether any user or business review history existed and if so how many reviews existed.  For example, for records with both user and business history I broke the training and testing data into subsets of:

  • Usr review count >= 20, Bus review count >= 20
  • Usr review count >=  13, Bus review count >= 8 (and usr review count <20 and bus review count <20)
  • Usr review count >= 8, bus review count >=5 (and usr review count <13 and bus review count <8)
  • Usr review count >=0, bus review count >=0 (and usr review count <8 and bus review count <5)

These cutoffs were derived by manually testing various cutoff points and then performing cross validation.

This segmentation allowed me to derive more accurate coefficients/weights for the features of each subset of data.  For example, business review history appeared to carry a stronger signal than user review history as review counts got lower and lower.  That makes sense intuitively: a new user to Yelp who has only submitted a handful of reviews has really revealed no distinct pattern yet, whereas a business with five to ten 4- and 5-star reviews has already begun to show a pattern of excellence.

Feature Creation:  Some of the features I created and used in addition to the basics such as user review history, business review history, and zipcodes included:

  • Business name averages (derived by finding businesses with the same name and calculating the average)
  • Text bias derived from words found in the business name (if a matching bus_name was not found)
  • Grouped category averages (finding businesses with exact same group of categories and calculating the average)
  • Mixed category averages (breaking all categories apart and calculating the averages for each, then averaging those categories together if the test business contains more than one)

The strongest signals came from bus_name averages, then grouped category averages, then mixed category averages.  So I used bus_name averages if there were sufficient matching businesses for comparison (>3), then grouped category averages if there were sufficient matching categories for comparison (>3), then defaulted to mixed category averages if that was all that was available.  It’s for this reason that I had so many different models to train.

The bus_name text analysis produced some of the most surprising finds. For example, when I ran it and began looking at the results, the highest positive-bias word for business names in the entire training set was…. (drum roll please)…   “Yelp”!  So I looked into it, and sure enough there are many Yelp events that Yelp holds for its elite members, and each event is reviewed just like a business.  And of course, intuitively, what type of reviews are elite Yelp members going to give to a Yelp event, reviews that are certain to be read by the Yelp admins?  Glowing 5-star reviews!   So, for all records in the test set whose business name contained the word “Yelp”, I overrode the prediction with a simple 5, and sure enough my RMSE score improved by .00087 just from that simple change.

Other words were not so extremely biased, but I did take some of the more heavily negative- and positive-biased word combinations (“ice cream”, “chiropractor”, etc.) and used them to weight the reviews for which a comparable business name and comparable grouped categories were missing.  It would have been very interesting to see whether there is a temporal effect on the word bias; for example, in the winter, are businesses with “ice cream” in their name still receiving such glowing reviews?  When the Diamondbacks perform better next season, do businesses with “Diamondbacks” in their name begin receiving a high positive bias?  Sadly, as has already been discussed at length in the forums, temporal data was not available for this contest.

I used a few other features with marginal success, such as business review count and total check-ins.  These seemed to have very weak signals, but did improve my score marginally when added into the weaker models (usr avg only, bus avg only, etc.).  One important thing to note was that they were only effective once I cleaned the training data of outliers, businesses that had extremely high check-ins or bus review counts.

Cross-Validation:  For this data set, the standard randomized k-fold cross validation was ideal.  There was not a temporal element to the data and the data was evenly distributed, so this seemed the most effective choice.  I mostly used 20 folds for linear models with fast training time and 10 for models with slower training times.

Learning algorithms used:  Nearly all of my models used SKLearn’s basic linear regression model, as I found other models did not perform any better on the leaderboard (even though my CV scores improved considerably…).  A few of my models that didn’t perform well with linear regression were actually just built in Excel, where I used simple factorization with weighting up to a certain threshold.  For example, in the UsrAvg+BusAvg group with review counts of <5 BusCount and <8 UsrCount, I simply had a formula of =A+(B/10)*(C-A)+(D/20)*(E-F), where A is the category average for the business (the starting point), B is the business review count, C is the business average, D is the user review count, E is the user average, and F is the global mean (3.766723066).  The thresholds to use (10 for bus and 20 for usr) were developed through experimentation based on leaderboard feedback.   I tried linear regression on this group with low review counts for usr or bus, but it was outperformed by the simple formula above.  I used a similar basic factorization model for a few other small subsets that didn’t perform well when run through linear regression (for example, in the usr-avg-only group when there was no similar business name to be found).
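Translated into Python for clarity, that weighting formula looks roughly like this (my re-expression of the Excel formula above, with example numbers that are made up):

    GLOBAL_MEAN = 3.766723066  # global mean review stars quoted above

    def blended_rating(cat_avg, bus_count, bus_avg, usr_count, usr_avg,
                       bus_threshold=10, usr_threshold=20):
        """Start from the business's category average, then pull toward the
        business average and the user's offset from the global mean in
        proportion to their review counts."""
        return (cat_avg
                + (bus_count / bus_threshold) * (bus_avg - cat_avg)
                + (usr_count / usr_threshold) * (usr_avg - GLOBAL_MEAN))

    # Example: a business with 3 reviews averaging 4.2 in a 3.8 category,
    # reviewed by a user with 5 reviews averaging 4.5.
    print(blended_rating(cat_avg=3.8, bus_count=3, bus_avg=4.2,
                         usr_count=5, usr_avg=4.5))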

Conclusion and Lessons Learned

This competition was by far my best learning experience so far; I felt like I really spent the time needed to fully analyze the data set and develop a robust model.  I learned MUCH about model development and feature creation, particularly the value of segmentation ensembles and the need for attention to detail on real-world data sets like this one.

In contrast to some other contests, this was not a “canned” set of data which had been fully sanitized and cleaned prior to being given to competitors, instead it was a raw dump of Yelp’s real-world review data. I think the amount of time I spent finding patterns and details in the data paid off in the end and gave me favorable positioning over many of the other competitors.

Looking forward to the next one!

Kaggle Data Science Competition Recap: Amazon.com Employee Access Challenge

Completed my second Kaggle data science competition!  Coming off the high of successfully completing my first competition a few weeks ago (Recap: Yelp.com Prediction of Useful Votes for Reviews), I decided to join another competition already in progress.  This competition had a HUGE number of participants (over 1,500 teams when I joined) with a very active forum community, so it held a lot of appeal for me as a learning experience.  Much like the last competition, I was a little rushed and only managed to get in a few submissions, but the result was still quite satisfying.  I finished the competition with a .89996 AUC, which landed me in the top 11% of teams (#174/#1,687), just barely missing the coveted top 10% badge :)

I was very happy with this finish given the limited amount of time and submissions I had to work with, and I really enjoyed learning from the forum community and combing through the data.  This competition was quite different from my first in that this was a classification problem instead of a regression, and therefore it used a common classification error metric: AUC (area under the curve) which was intuitively very different from the previous contest’s RMSE.

Here’s a recap:

Contest Overview

http://www.kaggle.com/c/amazon-employee-access-challenge

Goal: The objective of this competition was to build a model, using supervised learning on historical data provided by Amazon, that predicts an employee’s access needs, such that manual access transactions (grants and revokes) are minimized as the employee’s attributes change over time. The model takes as input an employee’s role information and a resource code and outputs a prediction of whether or not access should be granted.  This was a binary classification problem (predict approval or disapproval).

Data set: The data consists of real historical data collected during 2010 and 2011: manual approvals and denials of resource access over time, along with each employee’s role information.  The data came formatted as standard CSV files.

Error metric:   AUC (Area Under the ROC Curve).  This is a popular classification metric in machine learning because it allows an entire model to be measured with a single score.  It basically consists of first graphing a ROC curve with the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis, then measuring the area under the curve (e.g., using trapezoidal areas).  Intuitively, a more accurate model will have more true positives and fewer false positives, so the area under its curve will be higher.  Concretely, a perfect model has 100% TPR with no false positives, giving an AUC of 1, while a random-guessing model’s curve is the diagonal line where TPR equals FPR at every threshold, giving an AUC of .5.  So technically an AUC < .5 on a binary classification problem is worse than random guessing.
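A minimal illustration using scikit-learn's built-in scorer (the numbers below are made up):

    from sklearn.metrics import roc_auc_score

    # Hypothetical ground-truth labels and predicted probabilities of "access granted".
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.92, 0.30, 0.75, 0.64, 0.48, 0.10, 0.55, 0.81]

    print(roc_auc_score(y_true, y_score))  # 1.0 is perfect, 0.5 is random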

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-amazon-employee-access

Tools: For my weapon of choice I again used the popular Python and SKLearn combo.

Loading and munging: The data was given in CSV format, so loading was easy.  From there, no cleaning was needed as there were not any N/A’s or null values.  Overall the loading and munging took very little effort for this contest as the data was very clean and compact (sanitized).  It quickly became apparent this contest was all about effective feature creation.

Feature Creation:  After loading/munging, work began on the features.  This project was particularly light on features, as only a handful of categorical features were given.  This led to quite a bit of discussion on the forums about how to best derive more signal from the limited data we were working with.  One great suggestion that arose (thanks Miroslaw!) was to create higher-order pairs of features, thereby deriving a huge number of new features and harnessing any signal that existed in the interactions between the various features.
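A minimal sketch of building those second-order pairs from categorical columns (the column names here are for illustration, not the contest's exact feature code):

    from itertools import combinations
    import pandas as pd

    # Hypothetical slice of the categorical role data.
    df = pd.DataFrame({
        "RESOURCE": [101, 101, 205],
        "MGR_ID": [7, 9, 7],
        "ROLE_DEPTNAME": [4, 4, 11],
    })

    # Every pair of original columns becomes a new combined categorical feature.
    cols = list(df.columns)
    for a, b in combinations(cols, 2):
        df[f"{a}__{b}"] = df[a].astype(str) + "_" + df[b].astype(str)

    print(df.columns.tolist())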

Of course, with these new higher-order features came a huge increase in K, and a method was needed to prune off the useless higher-order features that contain no signal. The pruning method of choice for this was greedy feature selection, basically a stepwise regression method in which the input space is iteratively pruned by deleting the worst feature in each iteration, as determined by cross-validation.  Nothing too complicated, although it did add hours to the model training.
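A minimal sketch of greedy backward elimination driven by CV AUC (an illustration of the general idea on random filler data, not the exact routine used):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def greedy_backward_selection(X, y, cv=8):
        """In each round, find the feature whose removal yields the best CV AUC
        and delete it; stop when no removal improves on the current score."""
        features = list(range(X.shape[1]))
        best = cross_val_score(LogisticRegression(), X[:, features], y,
                               cv=cv, scoring="roc_auc").mean()
        while len(features) > 1:
            scores = {}
            for f in features:
                trial = [c for c in features if c != f]
                scores[f] = cross_val_score(LogisticRegression(), X[:, trial], y,
                                            cv=cv, scoring="roc_auc").mean()
            worst = max(scores, key=scores.get)
            if scores[worst] <= best:
                break
            best = scores[worst]
            features.remove(worst)
        return features, best

    # Example on random data (purely illustrative).
    rng = np.random.RandomState(0)
    X = rng.rand(200, 6)
    y = (X[:, 0] + 0.1 * rng.rand(200) > 0.5).astype(int)
    print(greedy_backward_selection(X, y, cv=4))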

The addition of higher order features provided minor improvement and took additional hours to train, however there was some improvement and it allowed me to jump up a few hundred spots over my initial submission using only first order features.  It was an interesting trick to add to the tool chest for future projects in which speed is not a factor or which have very small K.

Cross-Validation:  For this data set, the standard randomized k-fold cross validation was ideal.  There was not a temporal element to the data and the data was evenly distributed, so this seemed the most effective choice.   For both the greedy feature selection loop and the hyperparameter selection 8 folds were used.

CV scores showed logistic regression outperformed other linear learning algorithms.  Ensemble classifiers and most SVM kernels were not options because of the huge feature space (K) that the higher-order data set contained.

Testing and submission:  After deriving the ideal hyperparameters for the logistic regression and pruning the feature set, test predictions were made using the trained model.  Submission was straight-forward, a simple classification prediction of each record (approve or disapprove).

Performance

The approach of using higher order pairs and greedy feature selection with the added sauce of hyperparameter selection gave a slight leg up on standard logistic regression approaches, which landed a score in the top 11% or so of teams.  It also soundly beat the benchmarks.

Conclusion and Lessons Learned

Great experience again; I learned a lot from this contest and from the forum community that evolved around it.

This was very different from the Yelp contest in that the data was very limited and sanitized, so there was little to no gain to be made in correcting and cleaning the data.  All signal gain came from unique feature creation methods.  A great lesson on how important feature creation is, and I’ll definitely use higher order pairs and greedy feature selection on future projects.

Once again looking forward to my next Kaggle competition!

Kaggle Data Science Competition Recap: Yelp.com Prediction of Useful Votes for Reviews

Completed my first Kaggle data science competition!  It was a great experience, albeit a somewhat rushed and last minute one.  I entered with only a week or so left in the contest and only had time to create one submission before the deadline hit, but it scored a competitive  0.53384 RMSE error metric, which landed me 135th out of 350 teams.  Not bad considering that it was a last minute submission and my first foray into machine learning.  

Here’s a summary of my approach and workflow as I learn my way around machine learning:

Contest Overview

http://www.kaggle.com/c/yelp-recruiting

Goal: To predict the useful votes that a review posted to Yelp.com will receive, using a predictive model developed with supervised machine learning.  This was a regression problem (numerical target) as opposed to classification.

Data set: ~227,000 reviews from the Phoenix, AZ area.  Included in the data set was information about the reviewer (average review stars, total useful votes received for all reviews given, etc.) as well as the business receiving the review (business name, average review stars, etc.) and information about the review itself (date/time created, review text, etc.).  Each set of data (business, user, review, checkin) came formatted as large JSON files.

Error metric: RMSE (Root Mean Squared Error).  This is a pretty standard and straightforward regression error metric, which was nice to have given that it was my first competition: it made the math behind everything fairly intuitive and did not require any special transformations to make it compatible with the learning algorithms.

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-yelp-useful-vote-prediction

Tools: For my weapon of choice I used the popular Python stack of: PANDAS/NumPy for loading and munging the data, SKLearn for learning algorithms and cross-validation, and the statsmodels module for visualization.

Loading and munging: The data was given in JSON format, so I used the handy JSON import function found within PANDAS to directly import the JSON files into PANDAS data frames.  After loading, I did quite a bit of data munging and cleaning, including converting the unicode dates into usable date formats, flattening nested columns, and extracting zip codes from full business addresses.  In addition, I removed quite a few unused columns from the dataframes and converted some data into smaller formats (e.g., int64 to int16) in order to reduce the memory footprint of the large dataframes.  Overall, Python is amazing for rapidly performing munging/cleaning tasks like this.
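A minimal sketch of that loading, flattening, and downcasting step (the file name is a placeholder, and I'm assuming a line-delimited JSON dump with a nested votes field):

    import pandas as pd

    # Placeholder file name; assumes one JSON object per line.
    reviews = pd.read_json("yelp_training_set_review.json", lines=True)

    # Flatten the nested `votes` dict into its own columns.
    votes = pd.DataFrame(reviews["votes"].tolist())
    reviews = pd.concat([reviews.drop(columns=["votes"]), votes], axis=1)

    # Shrink integer columns to cut the memory footprint.
    reviews["stars"] = reviews["stars"].astype("int16")
    print(reviews.dtypes)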

Feature Creation:  After loading/munging, I set about identifying and creating features from the data for use in deriving signals from the data to be used in the predictive model.   There were quite a few potentially useful features to be derived from the data, among them:

  • Review stars (numeric) — The # of stars given to the review
  • Review Length (numeric) — Character count of review text
  • Total Check-ins (numeric) — Total check-ins for the business being reviewed
  • Business Category (categorical vector) — One-hot encoded vector of the business’ category
  • Zip code (categorical vector) — One-hot encoded vector of the business’ zipcode
  • User average stars (numeric) — User’s average review stars
  • Business average stars (numeric) — Business’ average review stars
  • Business open (binary boolean) — Boolean of whether business is open (1) or closed (0)
  • Review age (numerical) — Calculated age of review based on review date

Then there were the numerous features that could be derived from the review_text using NLP (natural language processing) methods:

  • Sentiment score (numeric) — Total sentiment (positive or negative view/emotion of the review) score of the review, derived by looking up the positive or negative sentiment of each word in a pre-calculated list.  The sentiment word list can be derived from your training text using some Python code, or you can use pre-calculated lists available online, for example the AFINN-111 list, which contains sentiment scores for roughly 2,500 English words and phrases (see the sketch after this list).
  • Other NLP methods using the review text such n-gram TFID vectors
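A minimal sketch of that word-list sentiment score, as mentioned in the first bullet above (it assumes the AFINN-111 file of tab-separated word/score pairs has been downloaded; the path is a placeholder):

    def load_afinn(path="AFINN-111.txt"):
        """Load the AFINN word list: one 'word<TAB>score' pair per line."""
        scores = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, score = line.rsplit("\t", 1)
                scores[word] = int(score)
        return scores

    def sentiment_score(text, afinn):
        """Sum the sentiment scores of the known words in a review's text."""
        return sum(afinn.get(w, 0) for w in text.lower().split())

    # afinn = load_afinn()
    # print(sentiment_score("great food but terrible service", afinn))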

Unfortunately I ran out of time and was not able to use review text features at all, so I focused on creating a quick features matrix using the other features.

One thing of note: pre-processing the numerical features using standardization.   I ran SKLearn’s StandardScaler on the numerical features in order to normalize them prior to training.  What this does is transform all values to zero mean and unit variance (by subtracting the mean from all values and then dividing by the standard deviation).  This is an important pre-processing step for numerical features because it ensures that equal weight is given to all features regardless of their original range of values, and it reduces training time, particularly for SVMs and SGD. Other options for pre-processing transformations include simple re-scaling to [0,1] or [-1,1], or transforming using other mathematical operations such as applying a log transformation.
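A quick sketch of that step (the column values are illustrative, standing in for features like review_length and total check-ins):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical numeric feature columns (e.g. review_length, total_checkins).
    X_num = np.array([[120, 3],
                      [450, 0],
                      [2300, 17]], dtype=float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_num)   # zero mean, unit variance per column
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))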

Cross-Validation:  For this data set, I used standard randomized k-fold cross validation.  For the number of folds, I used 20 for models with fast training times (Ridge, SGD), and 10 folds for learning algorithms with slow training times (Lasso, Logistic).  SKLearn has a great set of cross validation functions that are easy to use and a nice set of error metrics as well that are easy to tap into for scoring the CV.

I combined all available features into a large sparse matrix and recorded RMSE CV scores using all the major regression learning algorithms that are compatible with large sparse matrix problems like this.  Ensemble algorithms such as GBM, RF, and ExtraTrees were all out of the question due to the large K (features) caused by the one-hot vectors (sparse matrices) and the large N (review samples).

CV scores showed that the best performance came from SGDRegressor, so I went with that.  After choosing the learning algorithm, I began trimming off features one by one to see what the impact was on the CV score.  Surprisingly, I found that removing features did not lead to any decrease in RMSE and in fact led to small to large increases, so it seems all features had some degree of signal.  The highest-signal features, based on coefficient weights, were review_length, user_average_stars, business_average_stars, and text_sentiment_score.

Testing and submission:  Having identified the optimal learning algorithm (SGDRegressor) and feature set (all features), I was ready to train a model on the entire training set and then use that model to make test predictions.  After running some sanity checks on the predictions (mean, min, max, std. dev.), I discovered I had missed a post-processing step: converting all predictions < 0 to 0.  So I applied that and then exported the processed predictions to a CSV to create a contest submission.  All of this was fairly straightforward, and I created some helper functions to make these steps faster in future competitions.

Performance

I only submitted one last-minute entry and the RMSE score for that entry on the final leaderboard was .53384, which placed in the top 40% of contestants (#135/#350).  It also beat the benchmarks by quite a bit, so overall I’m happy with my first (and only) submission.

Conclusion and Lessons Learned

Overall this was a great first experience in practically applying some of the machine learning techniques I’ve learned through reading and class (Intro Data Science online course at Univ of Washington).  I love Python and SKLearn!  Everything is very efficient and intuitive, allowing for very rapid development of models.

I also now have a good code base and work flow to follow for predictive modeling problems such as this.  I spent much of the time creating and identifying various Python helper functions to achieve what I needed and have now organized them into modules (munge, features, train, etc.) so that going forward future contests should “flow” much more rapidly.

I also learned some important new steps in the machine learning workflow that I had not previously covered in my reading, such as pre- and post-processing and data cleaning.  It seems data science is all about the details.

Looking forward to my next Kaggle competition!