It's 8:44 am and after 5 hours of sleep on my trusty Thermarest I feel quite refreshed, which is more than I can say for the people around me. Some have capitulated and lie scattered around the brightly lit room under their coats in front of their MacBook Airs. Others are in exactly the same position I left them in 5 hours ago, but the intensity has gone and eyes have glazed over. At least one person confirmed the geek stereotype and didn't manage to hold his beer.
Update: Apparently Parlycloud won a special mention during the judging, thanks!
I'm at the 2012 Rewired State Parliament hack weekend. Not with a concrete problem in mind, but to observe and learn, much like at the Datadive two months ago. I was surprised that the format was even more free-form than the Datadive: there was no problem facilitation, just a collection of links to various data sources, and it was up to the participants to come up with ideas and coalesce into groups.
After poking around at the data somewhat, and inspired by the free Parliament tour, I thought it would be cool to do automatic and dynamic topic mapping of the debate transcripts (or Hansard data, as it is referred to). The ideal goal was some kind of dynamically changing word cloud or topic map that would let one visualize how time was allocated across different topics (health, immigration, science and technology, …) and how that allocation varied over time.
Unfortunately there was no predefined taxonomy, transcripts were not labelled in any way, and I couldn't find a different dataset to cross-reference with (though I wouldn't be surprised if one exists). So I spent some time looking into the very difficult problem of automatic topic extraction. Methods such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) seemed the most popular, so I toyed around with various libraries but wasn't quite happy with the results. With a deadline looming I didn't want to waste too much time understanding libraries or tuning parameters, so eventually I decided to switch to a hosted solution based around Zemanta. I could always swap a custom solution in later, but at least I had something working. Other APIs I tried were Yahoo's content analysis, Alchemy API, and the Open Calais API.
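For the curious, the idea behind those LDA libraries can be sketched in plain Python. This is a toy collapsed Gibbs sampler on a made-up mini-corpus, not the code I ran at the hack and nothing like a production implementation, but it shows the mechanics: each word token is assigned a topic, and assignments are resampled in proportion to how popular the topic is in the document and how popular the word is in the topic.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over pre-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic of each token
    for di, doc in enumerate(docs):                    # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                          # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional: p(topic k) ∝ (doc affinity) * (word affinity)
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t                          # record new assignment
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # report the most frequent words per topic
    return [sorted(nkw[k], key=nkw[k].get, reverse=True)[:3] for k in range(n_topics)]

docs = [["health", "hospital", "nhs", "doctor"],
        ["immigration", "border", "visa", "asylum"],
        ["nhs", "doctor", "hospital", "health"],
        ["visa", "border", "asylum", "immigration"]]
print(lda_gibbs(docs))
```

On real transcripts you would of course use a proper library with smarter inference and vocabulary handling; this sketch is just to demystify what those libraries are doing.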
I went with Zemanta as it was quick and easy to obtain a key, and the API was well documented and easy to understand. Essentially the API provides a suggest capability that, when queried with some text, returns what it thinks are relevant keywords. Particularly useful for me was that it also returns relevant images, articles, and categories, which saved me from piping the keywords through a search engine to get more context. Unfortunately, though, I found little technical background information on how the Zemanta service actually works, so it's a bit of a black box, but the results seem sensible.
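The call itself was a simple POST. Here is roughly what it looked like using only the standard library; note that the endpoint URL, parameter names, and response keys below are from my recollection of the then-current Zemanta docs and may not match the live API exactly.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://api.zemanta.com/services/rest/0.0/"  # endpoint as I recall it

def build_params(api_key, text):
    # parameter names assumed from the 2012-era Zemanta docs
    return urllib.parse.urlencode({
        "method": "zemanta.suggest",
        "api_key": api_key,
        "text": text,
        "format": "json",
    }).encode("utf-8")

def suggest(api_key, text):
    """POST the debate text and return the parsed JSON response."""
    with urllib.request.urlopen(API_URL, data=build_params(api_key, text)) as resp:
        return json.load(resp)

def summarize(response):
    """Pull out the fields I used. The key names ("name", "title", "url_l")
    reflect my memory of the response shape, not a verified schema."""
    return {
        "keywords": [k.get("name") for k in response.get("keywords", [])],
        "categories": [c.get("name") for c in response.get("categories", [])],
        "images": [i.get("url_l") for i in response.get("images", [])],
    }
```

A debate's worth of transcript text goes in, and a dict of keywords, categories, and image URLs comes out, ready to hand to the frontend.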
With Zemanta doing the heavy lifting it was then straightforward to tie it together with the XML Hansard data and provide a simple web frontend with Flask and Bootstrap. The result is a page that shows you the topics, categories, and images for a particular parliamentary debate:
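The "tying together" step mostly amounts to flattening each debate's XML into a blob of text to send off for keyword suggestion. A minimal standard-library sketch of that step is below; the element names are invented for illustration, as the real Hansard XML schema is considerably richer.

```python
import xml.etree.ElementTree as ET

# Illustrative only: not the real Hansard schema.
SAMPLE = """\
<debate title="Health Questions">
  <speech speaker="Member A">The NHS budget was discussed at length.</speech>
  <speech speaker="Member B">Waiting times in hospitals remain a concern.</speech>
</debate>"""

def debate_text(xml_string):
    """Return (title, concatenated speech text) for one debate."""
    root = ET.fromstring(xml_string)
    text = " ".join(s.text.strip() for s in root.iter("speech") if s.text)
    return root.get("title"), text

title, text = debate_text(SAMPLE)
```

From there, a Flask route per debate renders the title alongside whatever keywords, categories, and images come back for that text.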
Code is on GitHub and a demo application (seeded with a couple of months' worth of data) is running on Heroku. The topic extraction has more trouble with some debates than others, but overall the themes are there. It's still a long way from what I originally had in mind, but I thought it was a good place to stop. Also, during this process I connected with another team, led by Mark Smitham, who had been thinking along similar lines and had done similar things. Unfortunately I can't stay for the final presentations, but I'm sure they will come up with something cool. Hopefully I will find some time to continue working on this in the future.
Thanks to the Rewired State and Tech Hub crew for organizing! Next stop: #rhoksoton!