…to find an algorithm that predicts whether (and for what reason) a question will be closed. The idea is simple: we’ve prepared a dataset with all the questions on Stack Overflow, including everything we knew about them right before they were posted, and whether they finally ended up closed or not. You grab the data, build your brilliant classifier, run it against some leaderboard data and submit your results. Rinse and repeat until the contest ends, when we’ll grab the most promising classifiers and run them against fresh data to choose winners.
There were four reasons for closing questions (off topic, not constructive, not a real question, or too localized) and about 4 GB of training data to learn from. The announcement immediately caught my eye: it's a challenging problem, and I knew next to nothing about text analysis and classification. The Kaggle concept was also new to me, and I think it's a great idea. So ever since the announcement I had a tab open in Chrome with the blog post to remind me to check it out.
In the meantime, the back of my mind was already working on a grand plan to tackle this problem. I was going to dust off my machine learning books and go through some Coursera lectures on natural language processing. Then, given that I have access to a 12000+ core machine, I was going to take the opportunity to learn some Hadoop and maybe even write it all in Clojure (which I am learning anyway). Throw in some SOMs, deep belief nets, random forests, ensembles, et voilà, and walk away with first prize.
No points for guessing that the grand plan didn't exactly work out. I had hoped to get a partner in crime on board but found no willing subjects. Add to that a full-time job, family, and other ongoing madness, and it quickly turned out there was only a week left before the end of the competition. So alas, I did not get very far.
I imported all the data into a MySQL database and cleaned up some of the inconsistencies. Getting this far is already half the battle in most cases, but luckily the SO data was already quite clean and well vetted. I then started from the basic_benchmark.py script, delving into the scikit-learn docs where needed. Feature-wise, the plan was to start with the usual kitchen-sink approach: come up with a large set of features and see what sticks. I quickly knocked together a number of custom features, as well as a TF-IDF matrix based on bi/tri-grams, which I then filtered down to a more manageable set with a chi-squared test. On the classifier side I planned to keep it simple (experience shows that complex, adaptive classifiers rarely pay off anyway) and started off with the usual suspects: random forest, logistic regression, etc.
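The shape of that approach (TF-IDF over bi/tri-grams, chi-squared filtering, then a simple classifier) can be sketched in scikit-learn roughly like this. The toy questions, labels, and parameter values below are mine for illustration, not taken from the actual competition code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for question text and open/closed labels.
texts = [
    "how do i sort a list in python",
    "what is your favourite programming language",
    "why does my java code throw a null pointer exception",
    "which ide is the best for web development",
    "how to parse json in javascript",
    "is php a dying language discuss",
]
labels = [0, 1, 0, 1, 0, 1]  # 0 = stays open, 1 = closed

pipeline = Pipeline([
    # Bi- and tri-gram TF-IDF features over the raw text.
    ("tfidf", TfidfVectorizer(ngram_range=(2, 3))),
    # Chi-squared test keeps only the k features most associated
    # with the label; k is tiny here to match the toy data.
    ("chi2", SelectKBest(chi2, k=10)),
    # One of the "usual suspects" as a baseline classifier.
    ("clf", LogisticRegression()),
])

pipeline.fit(texts, labels)
preds = pipeline.predict(texts)
print(preds)
```

Bundling the steps into a Pipeline means the same vectorizer and feature selector fitted on the training data get reused at prediction time, which avoids leaking test-set vocabulary into the features.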
Unfortunately, just as I got the whole pipeline working and could start thinking about doing something sensible, the competition ended, and I was left with a measly top-100 leaderboard score…
However, while it was a failure from the standpoint of the competition, I did learn a couple of things about handling text and about the awesome libraries that are scikit-learn and NLTK. I'm really interested to see what the winning teams came up with, especially what kind of semantic analysis was performed (if any). Next time around I should be up and running more quickly. As I said, the Kaggle concept is great, and there are lots of resources online to learn about this stuff. So I hope to repeat this exercise again some time, and I encourage you to do the same. To quote Mike Harsh of General Electric:
Be curious, fail often, fail small