It’s 23:06 as I type this on the train on the way back from the second UK DataDive. I attended the first one in October last year after hearing about DataKind on rce-cast. In a nutshell, DataKind are a non-profit who work together with charities, NGOs, and related organizations to help them collect, manage, and analyze data so they can be more effective. So it’s like RHoK but focused on data analysis and with a stronger sustainability angle.
I have good memories from the last one, and the turnout and organization were very similar this time around. Like last time, I could only make it for the Saturday, and my employer was kind enough to cover travel costs. Charities selected to participate were:
The attendee demographics were as you would expect at a data hackathon: folk from the London startup scene, lots of data analysts (Marks & Spencer, Ocado, Ordnance Survey, …), some students, and the generally curious. I was surprised to find that a number of attendees had come quite a way to attend (e.g., Switzerland, France, Italy). It was also nice to bump into some familiar faces like Paul Lam (of Clojure dojo, Cascalog, and Data Science London fame) and fellow Taarifian and serial hackathoner Florian Rathgeber. DataKind(UK) founders Jake, Craig, and Duncan were present as well and were as positive and encouraging as always.
I was particularly happy to bump into Kaitlin Thaney, director of the recently launched Mozilla Science Lab. Incidentally, I had recently suggested her as a speaker at an upcoming workshop at Oxford University, for which I am on the steering committee. Following on from our position paper, the aim of the workshop is to go into depth about the role of software development in research and the need for changes in the recognition model. More about that in a later post. Given Kaitlin’s extensive experience at Creative Commons, Digital Science, and now Mozilla, she would be a perfect speaker. It’s really great what Mozilla is doing in the Open Science space, and much more is in the pipeline by the looks of it.
Anyway, after a brief stint at the Hampshire group poking at Meteor, I joined the Oxfam group to look at spatial data visualization after they put out a call for somebody with d3 knowledge. Having just used d3 quite intensively at work, it seemed like a good fit, and Florian joined me as well. After some debate we ended up using Leaflet, and some geocoding later we had a dynamic map of the Kenyan districts working. Getting a decent heatmap proved more problematic though, and the remainder of the time was spent debugging buggy plugins while Florian was fighting the Mercurial branching model. Unfortunately we never got it all completely working and were forced to put pens down when we were kicked out at 10pm as the last attendees for that day. So not the most productive hackathon code-wise, but that’s how it goes sometimes.
Having been to the previous DataDive, I knew what to expect and it was generally all as before. However, some things that I noticed and that stuck with me are listed below:
It’s about the data, stupid
As before, the dive perfectly illustrated that 80% of the effort (if not more) goes into cleaning and understanding the data: where it came from, why it was collected, its limitations and reliability, and pinning down which questions you want to ask and in which order. That’s the hard part; the rest is buzzword bingo.
This does make the hackathon format more difficult though, as most people will start with little or no prior knowledge of the data.
Triggering the spark
Related to the previous point is that the final analyses are usually not particularly sophisticated. Most of what is showcased are simple line plots or choropleths, and I can imagine social scientists raising an eyebrow or two at some of the conclusions drawn from them. However, that is of course perfectly fine, as the whole objective of the dive is to network, spark enthusiasm, raise awareness, and have fun (at which it succeeds with flying colours), not statistical rigor. It is important, though, to manage the expectations of the charity and keep this in mind when presenting results.
Chatting to Jake about this and sharing my own experiences with RHoK has me convinced they are taking the right approach with their focus on DataCorps.
It was also interesting talking to some of the charities about the value of data. Everybody agrees there is value in collecting and analyzing data, but all the charities I spoke to agreed there were issues with implementing this in practice. Proper data collection is hard, and people are lazy and resistant to change. This is compounded by the fact that the benefit of data collection is not immediate and may not directly benefit the person collecting it: “What’s in it for me?” Even worse, what if the collected data is taken out of context, or shows that your approach is wrong and that the method you passionately believe in is actually detrimental? For me this goes right to the heart of what it is to be human, and you find the same problem in many fields (e.g., open science, design rationale collection in engineering).
On the tool front, Excel was present as always. There was lots of Python and R, and I also saw some Clojure here and there. It’s been a while since I toyed with Clojure, and I really like the language. I am hoping to find a way to get more into it and delve into Cascalog. D3.js was also very present, which was nice for me as I have spent quite a lot of time with it recently. Though not an exploration tool, it is very useful if you have a clear picture of what you want to show and how.
There will always be a couple of people using Tableau, SpotFire, or a similar commercial tool, usually because the company they work for has the necessary licences. These are very nice exploration tools but not useful in a sustainable way, as the licence cost is prohibitive for most charities.
What stood out for me most was Shiny, a web application framework that plugs into RStudio and allows you to quickly create interactive, browser-based UIs for your R code and plots, complete with a built-in Bootstrap theme. I had never heard of it before and was quite impressed to see Peter whip up something nice very quickly, despite being new to the API himself. Another thing for my long list of “stuff to learn”.
As there are only a handful of charities but 60-80 participants, you get very large groups. I found keeping everybody engaged and avoiding duplication of effort within a team to be an issue. Maybe it is worth thinking about how to reduce team size in the future?
All in all, another great event and time well spent. Thanks to Jake, Craig, Duncan, the whole DataKind(UK) team, and associated volunteers for pulling this off. I will definitely keep in touch with the community and look forward to the next one.