For about half a year now, my podcast subscriptions have included RCE-Cast, an interesting show on HPC-related topics run by very knowledgeable hosts. One of their more recent episodes was on the DataDive project by DataKind.
From the website:
DataKind brings together leading data scientists with high impact social organizations through a comprehensive, collaborative approach that leads to shared insights, greater understanding, and positive action through data in the service of humanity.
I liked the idea and, while I have plenty of scuba diving experience, data diving was not something I was very familiar with. The problems I have worked on so far have pretty much always been big CPU rather than big data. So I followed their Twitter feed and signed up for the London event (the first in the UK/Europe?) when I heard about it.
National Rail had me arriving a little late and I was starting to get frustrated looking for the correct building, but luckily I bumped into Sara Farmer (a crisis mapper of UN Global Pulse and Standby Task Force fame, whom I know via Taarifa & WaterMe), who pointed me in the right direction. The DataDive had already kicked off with three main projects, each with quite a large team of 10-15 people. I joined the KeyFund project:
Keyfund invests in ideas. Groups plan, design and deliver their own projects, building confidence, aspirations, self-esteem and our 12 Keyfund skills. Through Keyfund, young people learn by doing. They experience first hand how effort can reap rewards and what it feels like to succeed.
The group was divided into two subgroups, and our task was to extract any trends from the 20-odd CSV files on KeyFund projects & their participants. For example: which age groups were gaining the most from the different project stages, which facilitators (police, social services, community, …) were most successful in helping participants progress, etc.
As I arrived quite late I missed the initial briefings, so it took me a while to understand how the data fitted together, but I contributed where I could with some scripting & data visualization. In the process I learnt a lot from seeing how others tackled the problem and the tools they used.
In general my take home messages were:
- First and foremost, make sure you understand the problem domain and how the data was collected. Without this, everything else is just a waste of time.
- Save yourself a lot of fiddling by ensuring you have a proper data dev environment set up, with the necessary analysis libraries, database server/admin tools, CSV processing tools, visualization tools, etc.
- R is really useful and worth learning, together with a good foundation in stats. Time to start watching those Coursera classes.
- Some Excel-fu is always useful to have.
- Tableau is a really nice data visualization tool, though the fact that you can’t export to an image in the free version is just silly.
- CSV is the lingua franca of data analysis, but save yourself a headache and try to use the same formatting rules & conventions throughout the analysis.
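As a minimal sketch of that last point, here is what agreeing on conventions up front can look like in pandas. The file contents, column names and NA marker are entirely made up for illustration; the point is to normalise dates and missing-value markers once, at load time, rather than re-cleaning at every step:

```python
import io
import pandas as pd

# A hypothetical CSV extract with a mixed bag of conventions:
# ISO dates in one column, a stray "N/A" marker for missing data.
projects_csv = io.StringIO(
    "project_id,start_date,participants\n"
    "1,2012-01-15,12\n"
    "2,2012-02-03,N/A\n"
)

# Declare the conventions when reading, so every downstream step
# sees proper datetimes and real NaNs instead of magic strings.
df = pd.read_csv(
    projects_csv,
    parse_dates=["start_date"],
    na_values=["N/A"],
)

print(df.dtypes)
```

The same `parse_dates`/`na_values` arguments can then be reused verbatim across all 20-odd files, which is most of the battle.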
Pretty obvious really, though actually experiencing why they are important adds an extra dimension. Putting them all together then allows you to produce some really nice plots & PowerPoint choreography.
For example, using a box plot we can see that as the participants move through the stages they rate themselves higher across all skills. Progress!
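For readers who want to reproduce that kind of plot, a sketch with matplotlib follows. The stage names, the 1-10 rating scale and the upward drift are synthetic assumptions made for illustration, not the KeyFund data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic self-rating scores for three hypothetical project stages,
# drifting upward to mimic the trend the real box plot showed.
stages = {
    "Stage 1": rng.normal(5.0, 1.5, 200).clip(1, 10),
    "Stage 2": rng.normal(6.0, 1.5, 200).clip(1, 10),
    "Stage 3": rng.normal(7.0, 1.5, 200).clip(1, 10),
}

fig, ax = plt.subplots()
ax.boxplot(list(stages.values()))  # boxes at positions 1, 2, 3
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(list(stages.keys()))
ax.set_xlabel("Project stage")
ax.set_ylabel("Self-rated skill score")
fig.savefig("stage_boxplot.png")
```

A box plot is a good fit here because it shows the median shift and the spread per stage at once, rather than collapsing everything to a single average.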
We can also look at the distribution of stages across age groups and facilitators:
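The table behind that kind of breakdown is a cross-tabulation. A sketch with pandas, on invented participant records (the column names and categories are assumptions, not the real schema):

```python
import pandas as pd

# Hypothetical participant records; the real files had many more columns.
records = pd.DataFrame({
    "age_group": ["11-13", "14-16", "14-16", "17-19", "11-13", "17-19"],
    "stage": ["Stage 1", "Stage 2", "Stage 1", "Stage 3", "Stage 1", "Stage 2"],
})

# Count how many participants in each age group reached each stage;
# this is the table a stacked bar chart would be drawn from.
counts = pd.crosstab(records["age_group"], records["stage"])
print(counts)
```

Swapping `age_group` for a facilitator column gives the facilitator breakdown the same way.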
…and link it to geographic deprivation:
Even simple Wordles like this one (from team names) can be useful:
So overall a great event; a shame I could not make it to the second day to see the final presentations. It was great to chat with the diverse mix of attendees, especially with DataKind founder Jake Porway himself after hearing him on the RCE podcast.
Lots more added to my “to-learn” list. Hope to get some of that done before the next data analysis event I will be attending: Rewired State: Parliament 2012.