Kaggle Otto Challenge — Introduction

25 May 2015

python • machine learning • kaggle • deep learning • random forests • [archive]

I recently participated in Kaggle’s Otto Challenge and finished in the top 10% of the leaderboard. It was a fairly standard classification challenge: the task was to predict a product’s class from 93 numerical features. I will gradually write a couple of posts about the methods I used, but here is a brief summary:

Mainly, I used a combination of scikit-learn’s Gradient Boosting Classifier (GBC) and a deep neural network built with nolearn. In addition to a standard grid search for hyper-parameter tuning, I engineered new features. At first I experimented with Principal Component Analysis, but found that it did not improve the algorithms’ performance. Instead, I built new features from interactions between the most important predictors in the GBC’s feature importance list. Next, I did some semi-supervised clustering with the DBSCAN algorithm, assigning data points from both the training and test sets to a number of automatically generated clusters. This improved the algorithms’ performance quite a bit.
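To make the interaction-feature idea concrete, here is a minimal sketch in plain Python. It assumes that an "interaction" means the product of two columns and that the importance scores are already available as a list; the function name, the toy data and the choice of products are illustrative assumptions, not the exact features used in the competition.

```python
from itertools import combinations


def add_interaction_features(rows, importances, top_k=3):
    """Append pairwise products of the top_k most important columns to each row.

    Hypothetical helper: `rows` is a list of feature lists and `importances`
    holds one score per column (e.g. taken from a fitted tree ensemble).
    """
    # Rank column indices by importance, highest first, and keep the top_k.
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    top = sorted(ranked[:top_k])
    # Every unordered pair of top columns yields one new product feature.
    pairs = list(combinations(top, 2))
    augmented = [list(row) + [row[i] * row[j] for i, j in pairs]
                 for row in rows]
    return augmented, pairs


# Toy data: two samples with four features each (made-up values).
rows = [[1, 0, 2, 3], [0, 5, 1, 2]]
importances = [0.1, 0.4, 0.2, 0.3]
augmented, pairs = add_interaction_features(rows, importances, top_k=3)
print(pairs)
print(augmented)
```

Restricting the pairs to the top few predictors keeps the number of new columns small; with 93 features, all pairwise products would add more than 4,000 columns.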

The final step was combining the results from GBC and nolearn. After printing out a confusion matrix for each algorithm, I weighted the contribution of each algorithm’s prediction accordingly.
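As a rough illustration of this blending step, the sketch below averages two sets of class probabilities with a single global weight and scores the result with multiclass log-loss. The single weight, the function names, the clipping constant and the toy probabilities are all assumptions made for the example; the actual weights were chosen from the confusion matrices and are not reproduced here.

```python
import math


def blend(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' class-probability rows, renormalised."""
    blended = []
    for row_a, row_b in zip(probs_a, probs_b):
        mixed = [weight_a * a + (1.0 - weight_a) * b
                 for a, b in zip(row_a, row_b)]
        total = sum(mixed)
        blended.append([p / total for p in mixed])
    return blended


def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log-loss: mean negative log probability of the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0); the clipping constant is an assumption.
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Toy predictions for two samples and two classes (made-up numbers).
y_true = [0, 1]
gbc_probs = [[0.7, 0.3], [0.4, 0.6]]
nn_probs = [[0.9, 0.1], [0.2, 0.8]]

blended = blend(gbc_probs, nn_probs, weight_a=0.5)
print(log_loss(y_true, gbc_probs), log_loss(y_true, blended))
```

In practice the weight would be tuned on held-out data rather than on the training set, so that the blend is not fitted to noise.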

My final score (log-loss) was 0.42671 on the public leaderboard and 0.42912 on the private one. Throughout the competition, I moved from being in the bottom 25% to top 10% of the leaderboard (at one point I was in the top 3%, but in the final few days a rush of new submissions pushed me down a bit).

I learned quite a bit from this experience, and I recommend Kaggle challenges to anyone interested in both practicing machine learning on real-world datasets and learning the underlying theory. The obvious caveats about data cleaning, and about whether marginal improvements matter, still apply. I will also post some links to the open-source resources that helped me throughout this competition.

As I mentioned, I plan to talk in some detail about the methods I used over the next few posts, but if you’re interested in an advance preview, take a look at the IPython notebooks.

In terms of Kaggle competitions, my next plan is to form a local team and try our collective hand at the CrowdFlower Search Relevance challenge.

I would like to thank the Chicago Python (ChiPy) User Group for allowing me to participate in its mentorship program, where I received a lot of advice and guidance from Eric Meschke, a local machine learning enthusiast working at the Chicago Mercantile Exchange. Eric and I are forming a machine learning discussion group and will be working on Kaggle challenges as a team with a few additional members, so if you’re in the Chicago area and would like to join, let us know.

Double-checking NPR's income data

21 Feb 2015

python • pandas • data exploration • income • economics • [archive]

According to NPR.org, “After 1980, only the top 1% saw their incomes rise.” Flowing Data quoted this figure:

You can see the trend more clearly here (note that the y-axis shows growth in year X vs. 1917; so, it’s clear that income growth stagnated for the bottom 90% in the last 40 years):

Their data came from the World Top Incomes Database.

I decided to double-check this claim using a dataset from census.gov.

How unlikely is the recent Boston snowfall history?

16 Feb 2015

julia • monte carlo • probability • [archive]

This morning, a discussion on Facebook:

JC: Do you notice how 6 out of the 10 snowiest Boston winters are from the past 25 years? This might be a symptom of climate change.
AF: Your intuition is that if high snowfalls are randomly distributed, it’s unlikely that 6/10 highest would be concentrated in the last 25 out of 115 years. I ask: how unlikely?
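One way to put a number on this question is sketched below in Python. It assumes the null hypothesis that winters are exchangeable, so the 10 snowiest winters form a uniformly random subset of the 115 years, and it reads the claim as "at least 6 of the top 10 fall in the last 25 years". Both readings, along with the function names and the trial count, are assumptions made for this sketch. The simulation is checked against the exact hypergeometric tail probability.

```python
import random
from math import comb


def simulate(n_years=115, recent=25, top=10, threshold=6,
             trials=100_000, seed=0):
    """Monte Carlo estimate of P(at least `threshold` of the `top` snowiest
    winters fall in the last `recent` years) under a random ordering."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null, the top winters are a uniformly random set of years.
        top_years = rng.sample(range(n_years), top)
        # Indices n_years - recent .. n_years - 1 are the most recent years.
        n_recent = sum(1 for y in top_years if y >= n_years - recent)
        if n_recent >= threshold:
            hits += 1
    return hits / trials


def exact(n_years=115, recent=25, top=10, threshold=6):
    """Exact hypergeometric tail probability for the same event."""
    favourable = sum(comb(recent, k) * comb(n_years - recent, top - k)
                     for k in range(threshold, min(top, recent) + 1))
    return favourable / comb(n_years, top)


estimate = simulate()
exact_p = exact()
print(f"simulated: {estimate:.4f}, exact: {exact_p:.4f}")
```

The exact formula makes the simulation unnecessary for this simple model, but the Monte Carlo version is easier to extend, for example to year-to-year correlation in snowfall.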

Create a heatmap of your Google Maps trips

15 Feb 2015

javascript • google • plotting • maps • dataviz • [archive]

In this post, I am not writing code in Python or Julia (even though it was inspired by these instructions that use Python code). I will show how to use an online JavaScript-based tool to plot your Google Maps locations as a heatmap. You can use it even if you don’t know how to code at all.

It is called Location History Visualizer; the instructions are on the page once you open it. You can use either:

Google Takeout to download your Google Maps location history (uncheck the other boxes), or:
Google Location History KML API endpoint (much faster but still a bit experimental) — the file downloads automatically

After obtaining either file, just drag it onto the Visualizer page.

The heatmap of your travels (as recorded in your Google Maps history) will be displayed. I have no sense of direction at all, so I use GPS on my phone even to get to the local grocery store. As a result, I have a lot of Google Maps data.

Boston:

San Antonio:

It also shows my recent trip from Texas to Chicago:

The controls are at the bottom left. I think this example shows that JS is still the king of data visualization.

Linguistic geolocating with Twitter and Python

04 Feb 2015

python • twitter • plotting • bokeh • maps • dataviz • [archive]

That is a fancy title. What I actually did was try to map the incidence of specific words used in tweets. For fun, I chose y’all and you guys, expecting to see southerners in the US y’alling and English speakers elsewhere youguying.

{data dendrites} exploring data science with Python et al.
