Kaggle Data Science Competition Recap: Amazon.com Employee Access Challenge

Completed my second Kaggle data science competition!  Coming off the high of successfully completing my first competition a few weeks ago (Recap: Yelp.com Prediction of Useful Votes for Reviews), I decided to join another competition already in progress.  This competition had a HUGE number of participants (over 1,500 teams when I joined) and a very active forum community, so it held a lot of appeal for me as a learning experience.  Much like the last competition, I was a little rushed and only managed to get in a few submissions, but the result was still quite satisfying.  I finished the competition with a 0.89996 AUC, which landed me in the top 11% of teams (174th of 1,687), just barely missing the coveted top-10% badge :)

I was very happy with this finish given the limited amount of time and submissions I had to work with, and I really enjoyed learning from the forum community and combing through the data.  This competition was quite different from my first in that it was a classification problem instead of a regression problem, so it used a common classification error metric: AUC (area under the curve), which is intuitively very different from the previous contest’s RMSE.

Here’s a recap:

Contest Overview


Goal: The objective of this competition was to build a model, using supervised learning on historical data provided by Amazon, that predicts an employee’s access needs, such that manual access transactions (grants and revokes) are minimized as the employee’s attributes change over time. The model takes as input an employee’s role information and a resource code and outputs a prediction of whether or not access should be granted.  This was a binary classification problem (predict approval or disapproval).

Data set: The data consists of real historical records collected from 2010 and 2011: manual approvals and denials of access to resources over time, along with each employee’s role information.  The data came formatted as standard CSV files.

Error metric: AUC (Area Under the ROC Curve).  This is a popular classification error metric in machine learning because it allows an entire model to be measured with a single score.  It basically consists of first graphing a ROC curve with the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis, then measuring the area under the curve using trapezoidal areas.  Intuitively, a more accurate model produces more true positives and fewer false positives, so the area under its curve is higher.  Concretely, a perfect model has 100% TPR with 0% FPR, giving an AUC of 1.  Conversely, a random-noise model’s TPR equals its FPR at every threshold (its ROC curve is the diagonal), giving an AUC of 0.5.  So technically an AUC < 0.5 on a binary classification problem is worse than random guessing.
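To make the metric concrete, here’s a minimal sketch of computing AUC in SKLearn (the labels and scores below are toy values for illustration, not contest data; SKLearn computes the area via the trapezoidal rule described above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and model scores (hypothetical values for illustration)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, computed via trapezoidal integration
auc = roc_auc_score(y_true, y_score)
```

Shuffling the scores randomly would drive this toward 0.5, the random-guessing baseline.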

My Model and Workflow

Model code on GitHub:  https://github.com/theusual/kaggle-amazon-employee-access

Tools: For my weapon of choice I again used the popular Python and SKLearn combo.

Loading and munging: The data was given in CSV format, so loading was easy.  From there, no cleaning was needed, as there were no N/A or null values.  Overall, the loading and munging took very little effort for this contest because the data was very clean and compact (sanitized).  It quickly became apparent this contest was all about effective feature creation.

Feature Creation:  After loading/munging, work began on the features.  This project was particularly light on features, as only a handful of categorical features were given.  This led to quite a bit of discussion on the forums about how best to derive more signal from the limited data we were working with.  One great suggestion that arose (thanks, Miroslaw!) was to create higher-order pairs of features, thereby deriving a huge number of new features and harnessing any signal that existed in the interactions among the original features.
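A minimal sketch of the pairing idea (my own illustration, not the exact forum code): for every pair of original categorical columns, emit a new column whose value encodes the joint pair, so a model can pick up interaction signal the individual columns don’t carry.

```python
from itertools import combinations

import numpy as np


def add_pairwise_features(X):
    """Return X with one extra categorical column per pair of original
    columns; each new column encodes the joint (value_i, value_j) pair
    as a single integer code."""
    n_rows, n_cols = X.shape
    extra = []
    for i, j in combinations(range(n_cols), 2):
        pairs = list(zip(X[:, i], X[:, j]))
        # Assign a distinct code to each distinct (value_i, value_j) pair
        codes = {p: k for k, p in enumerate(dict.fromkeys(pairs))}
        extra.append(np.array([codes[p] for p in pairs]))
    return np.column_stack([X] + extra)
```

With d original columns this adds d·(d−1)/2 new ones, which is exactly the explosion in K discussed next.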

Of course, with these new higher-order features came a huge increase in K, so a method was needed to prune away the useless higher-order features that contained no signal.  The pruning method of choice was greedy feature selection, which is basically a stepwise regression method in which the input space is iteratively pruned by deleting the worst feature in each iteration, as determined by cross-validation.  Nothing too complicated, although it did add hours to the model training.
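The pruning loop can be sketched like so (a simplified illustration under my own assumptions, not my actual contest code): each round, try deleting every remaining feature, keep the deletion that most improves mean cross-validated AUC, and stop when no deletion helps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def greedy_backward_selection(X, y, cv=8):
    """Greedy stepwise pruning: repeatedly drop the feature whose removal
    most improves mean cross-validated AUC; stop when nothing helps."""
    features = list(range(X.shape[1]))
    model = LogisticRegression(max_iter=1000)
    best = cross_val_score(model, X[:, features], y,
                           cv=cv, scoring="roc_auc").mean()
    improved = True
    while improved and len(features) > 1:
        improved = False
        trials = []
        for f in features:
            keep = [g for g in features if g != f]
            score = cross_val_score(model, X[:, keep], y,
                                    cv=cv, scoring="roc_auc").mean()
            trials.append((score, f))
        top_score, worst = max(trials)
        if top_score > best:
            best = top_score
            features.remove(worst)  # pruning this feature helped
            improved = True
    return features, best
```

Each round costs one CV run per surviving feature, which is exactly why this added hours on the huge higher-order feature set.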

The addition of higher-order features provided only a minor improvement and added hours to training, but it was enough to jump a few hundred spots over my initial submission, which used only first-order features.  It was an interesting trick to add to the tool chest for future projects in which speed is not a factor or which have very small K.

Cross-Validation:  For this data set, standard randomized k-fold cross-validation was ideal.  There was no temporal element to the data and the data was evenly distributed, so this seemed the most effective choice.  For both the greedy feature selection loop and hyperparameter selection, 8 folds were used.

CV scores showed logistic regression outperforming the other linear learning algorithms.  Ensemble classifiers and most SVM kernels were not options because of the huge feature space (K) of the higher-order data set.
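This kind of comparison boils down to a one-hot encoding of the categorical codes followed by logistic regression, scored with 8-fold CV on AUC. A minimal sketch with toy stand-in data (the real pipeline and data differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical matrix standing in for the Amazon role columns
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 3))
y = rng.integers(0, 2, size=200)

# One-hot encode the categorical codes, then fit logistic regression;
# the sparse encoded matrix keeps the huge feature space tractable
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=8, scoring="roc_auc")
```

Swapping the `LogisticRegression` step for another estimator is all it takes to compare algorithms on the same folds; the sparse one-hot output is also why dense-hungry ensemble methods struggled here.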

Testing and submission:  After deriving the ideal hyperparameters for the logistic regression and pruning the feature set, test predictions were made using the trained model.  Submission was straightforward: a simple classification prediction for each record (approve or disapprove).
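One note worth adding: since AUC scores a ranking, submitting the model’s predicted probability of approval (via `predict_proba`) typically scores at least as well as hard 0/1 labels. A minimal sketch of the prediction-to-CSV step, using toy stand-in data and a hypothetical submission layout:

```python
import csv

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real train/test matrices (hypothetical data)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 4))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# P(approve) for each test record; AUC only cares about the ranking
preds = model.predict_proba(X_test)[:, 1]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Action"])  # hypothetical column names
    for i, p in enumerate(preds, start=1):
        writer.writerow([i, p])
```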


The approach of higher-order pairs and greedy feature selection, with the added sauce of hyperparameter selection, gave a slight leg up over standard logistic regression approaches and landed a score in the top 11% of teams.  It also soundly beat the benchmarks.

Conclusion and Lessons Learned

Another great experience; I learned much from this contest and from the forum community that evolved around it.

This was very different from the Yelp contest in that the data was very limited and sanitized, so there was little to no gain to be made in correcting and cleaning the data.  All signal gain came from creative feature creation methods.  A great lesson in how important feature creation is, and I’ll definitely use higher-order pairs and greedy feature selection on future projects.

Once again looking forward to my next Kaggle competition!
