Rewired State Parliament hack weekend 2012

It’s 8:44 am and after 5 hours of sleep on my trusty Thermarest I feel quite refreshed, which is more than I can say for the people around me. Some have capitulated and lie scattered around the brightly lit room under their coats in front of their MacBook Airs. Others are still in exactly the same position I left them in 5 hours ago, but the intensity has gone and eyes have glazed over. At least one person confirmed the geek stereotype and didn’t manage to hold his beer.

Update: Apparently Parlycloud won a special mention during the judging, thanks!

I’m at the 2012 Rewired State Parliament hack weekend, not with a concrete problem in mind but to observe and learn, similar to the Datadive of two months ago. I was surprised that the format was even more free-form than the Datadive. There was no problem facilitation, just a collection of links to various data sources, and it was up to the participants to come up with ideas and coalesce into groups.

After poking around in the data for a while, and inspired by the free Parliament tour, I thought it would be cool to do automatic and dynamic topic mapping of the debate transcripts (or Hansard data, as it is referred to). The ideal goal was to have some kind of dynamically changing word cloud or topic map which would allow one to visualize how time was allocated across different topics (health, immigration, science and technology, …) and how that allocation varied over time.

Unfortunately there was no predefined taxonomy, transcripts were not labelled in any way, and I couldn’t find a different dataset to cross-reference with (though I wouldn’t be surprised if there was one). Thus I spent some time looking into the very difficult problem of automatic topic extraction. Topic-modelling methods such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) seemed the most popular, so I toyed around with various libraries but wasn’t quite happy with the results. With a deadline looming I didn’t want to waste too much time understanding libraries or playing with parameters, so eventually I decided to switch to a hosted solution based around Zemanta. I could always swap a custom solution in later, but at least I had something working. Other APIs I tried were Yahoo’s content analysis, Alchemy API, and the Open Calais API.
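To give a feel for what unsupervised keyword extraction involves, here is a minimal sketch of a much simpler baseline than LDA or LSA: plain TF-IDF scoring using only the Python standard library. It is not what Parlycloud uses, and the stop-word list, function names and sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny stop-word list for illustration only; a real run would need a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on",
              "that", "we", "this", "be", "will", "has", "have", "it", "are"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(documents, n=3):
    """Return the n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total_docs = len(documents)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:n])
    return results


# Invented sample text, standing in for three debate transcripts.
debates = [
    "The hospital budget and hospital waiting times for nurses",
    "Border checks and visa rules for immigration, with visa delays",
    "Research funding for universities and research councils",
]
print(top_keywords(debates, n=1))
```

A baseline like this only surfaces words that are frequent in one debate and rare in the others. It has no notion of broader categories, which is part of what makes the hosted services attractive.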

I went with Zemanta as it was quick and easy to obtain a key, and the API is well documented and easy to understand. Essentially the API provides a suggest capability that, when queried with some text, will return what it thinks are relevant keywords. Particularly useful for me was that it also returns relevant images, articles and categories. That saved me from piping the keywords through a search engine in order to get some more context information. Unfortunately I found little technical background information on how the Zemanta service actually works, so it’s a bit of a black box, but the results seem sensible.
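As a rough sketch of how a client for such a suggest-style service might be wrapped in Python: the form field names (`method`, `api_key`, `text`, `format`) and the JSON response shape below are assumptions for illustration, not taken from Zemanta’s documentation. The HTTP call is passed in as a `transport` function, so the hosted backend can be swapped for a custom one later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Suggestions:
    """Context returned for one piece of text."""
    keywords: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)


def build_request(api_key: str, text: str) -> Dict[str, str]:
    """Assemble the form fields for a suggest call (field names are assumed)."""
    return {"method": "suggest", "api_key": api_key, "text": text, "format": "json"}


def parse_response(payload: dict) -> Suggestions:
    """Pull keywords, categories and image URLs out of an assumed JSON shape."""
    return Suggestions(
        keywords=[k["name"] for k in payload.get("keywords", [])],
        categories=[c["name"] for c in payload.get("categories", [])],
        images=[i["url"] for i in payload.get("images", [])],
    )


def suggest(text: str, api_key: str,
            transport: Callable[[Dict[str, str]], dict]) -> Suggestions:
    """Query the service via `transport`, which performs the actual HTTP call."""
    return parse_response(transport(build_request(api_key, text)))


def fake_transport(fields: Dict[str, str]) -> dict:
    """Stand-in for the network call; returns a canned, invented response."""
    return {
        "keywords": [{"name": "National Health Service"}],
        "categories": [{"name": "Top/Health"}],
        "images": [{"url": "http://example.org/nhs.jpg"}],
    }


result = suggest("Debate on hospital funding", "MY_KEY", fake_transport)
print(result.keywords, result.categories, result.images)
```

In a real setup `transport` would POST the fields to the service endpoint and decode the JSON reply. Keeping it separate also makes the parsing testable without a key.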

With Zemanta doing the heavy lifting it was then straightforward to tie it together with the XML Hansard data and provide a simple web frontend with Flask and Bootstrap. The result is a page that shows you the topics, categories, and images for a particular parliamentary debate.
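The XML side of this can be sketched with the standard library alone. The element names below (`debates`, `heading`, `speech`, `p`) and the sample document are a hypothetical layout chosen for illustration, since the real Hansard schema is not reproduced here. The idea is simply to collect the speech text under each debate heading so it can be sent off for keyword extraction.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified transcript layout; not the real Hansard schema.
SAMPLE = """<debates>
  <heading>Health</heading>
  <speech speaker="A. Member"><p>First paragraph.</p><p>Second paragraph.</p></speech>
  <heading>Transport</heading>
  <speech speaker="B. Member"><p>Rail fares.</p></speech>
</debates>"""


def debates_from_xml(xml_text):
    """Group speech text under the most recent heading, in document order."""
    debates = {}
    current = None
    for element in ET.fromstring(xml_text):
        if element.tag == "heading":
            current = (element.text or "").strip()
            debates.setdefault(current, [])
        elif element.tag == "speech" and current is not None:
            # itertext() walks the text of the nested <p> elements.
            text = " ".join(part.strip() for part in element.itertext() if part.strip())
            debates[current].append(text)
    # One block of text per debate, ready to pass to a keyword service.
    return {title: " ".join(speeches) for title, speeches in debates.items()}


for title, text in debates_from_xml(SAMPLE).items():
    print(title, "->", text)
```

A Flask view would then look up one debate by title or date, fetch the suggestions for its text, and hand both to a template.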

Code is on GitHub and a demo application (seeded with a couple of months’ worth of data) is running on Heroku. The topic extraction has more trouble with some debates than others, but overall the themes are there. It’s still a long way from what I originally had in mind, but I thought it was a good place to stop. Also, during this process I connected with another team, led by Mark Smitham, who had been thinking along similar lines and had done similar things. Unfortunately I can’t stay for the final presentations, but I’m sure they will come up with something cool. Hopefully I will find some time to continue working on this in the future.

Thanks to the Rewired State and Tech Hub crew for organizing! Next stop: #rhoksoton!

–Dirk

3 thoughts on “Rewired State Parliament hack weekend 2012”

  1. Hi Dirk,
    I was lucky enough to run into Parlycloud.
    May I ask a few questions?

    1 – What’s the difference and / or relationship between “Topics” highlighted and “Categories”?
    Are “Categories” supposed to populate a taxonomy?
    2 – Do you intend to go on with promising young Parlycloud?
    Namely I figure out:
    a- linking to the debates transcripts from “Topics”! (key)
    b- giving the pictures’ keys
    c- what is the content on DMOZ?
    Happy to take part on the functional side.
    Véronique

    • Hi Véronique,

      Thanks for your comment, replies inline:

      >1 – What’s the difference and / or relationship between “Topics”
      > highlighted and “Categories”? Are “Categories” supposed to populate a taxonomy?

      Topics are (hopefully) the main keywords of what the debate was about. Categories are higher level classifications of the subjects talked about. A kind of taxonomy yes.

      >2 – Do you intend to go on with promising young Parlycloud?

      Hopefully, though I am in the process of job hunting (moving away from academia) so that has priority. This was developed over the space of a day as a proof of principle. There are many ideas I still have for making it better and smarter. Now Zemanta does the heavy lifting but this is something I want to change. But time is very limited at the moment unfortunately. I agree your suggestions are useful though.

      DMOZ is simply a crowdsourced directory (http://www.dmoz.org/docs/en/about.html) which Zemanta provides links to.

      What’s your interest in all of this / back story? (feel free to reply by email).

      Cheers
      Dirk

  2. Pingback: UAVs meet Node.js – NodeCopter comes to Southampton on 10 Aug | Dirk's Page
