Data Visualization: Kaggle Contest Entry for – Community Activity by Zipcode

Just wanted to make a quick post highlighting my last-minute entry for the visualization portion of the Kaggle competition.

Overview of my entry:

My visualization entry consists of a dashboard summary and a time-series video showing community activity, both historic and forecasted, by zip code. Each issue’s zip code was determined by reverse lookup of its longitude and latitude. Activity level is measured by the average number of community votes that an issue receives. The models are based on both the training (01/2012-04/2013) and test (05/2013-09/2013) data sets. The average votes for issues in the test data have been populated using our team’s #2 ranked prediction model.

Average votes was chosen as the metric for community activity because it is stable over time, with lower variance than views or comments, and it best represents a community’s interest in fixing the issues. Views and comments are noisy, with very high variance, and are possibly influenced by users outside of a community. Also, city_initiated and remote_api_created issues were filtered from the data set to make comparisons across cities more reasonable.
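For anyone who wants to reproduce the metric, here is a minimal sketch of the filter-then-average step described above. It is not the code I used for the entry. The field names (`zipcode`, `source`, `num_votes`), the exact spelling of the excluded source labels, and the sample rows are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Assumed labels for the issue sources that were filtered out.
EXCLUDED_SOURCES = {"city_initiated", "remote_api_created"}


def average_votes_by_zip(issues):
    """Return {zipcode: mean votes}, skipping issues from excluded sources."""
    votes = defaultdict(list)
    for issue in issues:
        if issue["source"] in EXCLUDED_SOURCES:
            continue  # drop city-initiated and API-created issues
        votes[issue["zipcode"]].append(issue["num_votes"])
    return {zipcode: mean(v) for zipcode, v in votes.items()}


# Made-up sample rows; zip codes and vote counts are purely illustrative.
sample_issues = [
    {"zipcode": "11111", "source": "iphone", "num_votes": 3},
    {"zipcode": "11111", "source": "web", "num_votes": 1},
    {"zipcode": "11111", "source": "city_initiated", "num_votes": 9},
    {"zipcode": "22222", "source": "remote_api_created", "num_votes": 1},
    {"zipcode": "22222", "source": "android", "num_votes": 4},
]

activity = average_votes_by_zip(sample_issues)
```

In this toy data the filtered rows never reach the average, so a single heavily voted city-initiated issue cannot inflate a zip code's activity level.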

The dashboard summary rolls all the data up into an aggregate activity level for each zip code in each city, visualized using a heat map for each city and a tree-view of the most active zip codes overall. A fully interactive dashboard summary is available here:

And here is a non-interactive snapshot of the summary:

Activity By Zip-Summary

The video consists of quarterly time-series heat maps of each city, using the same scale as the dashboard summary. In contrast to the dashboard's aggregate summary, the time-series model illustrates how community activity levels change over time, including the forecasted activity levels for Q2 and Q3 2013. Video available on YouTube: