RHOK Water Hackathon – Team Tarifa Won!

The past weekend I finally got the chance to attend one of the RHOK hackathon events. These events concentrate on using technology to solve humanitarian problems such as those related to draught, overpopulation, earthquakes, etc.  The current hackathon was the first to concentrate solely on water and sanitation related problems.  Held in 9 cities across the world, I attended the London event, organized by fellow compatriate Julian Harou.

I was one of the first to arrive, and although there weren’t too many people we quickly had a good mix of water and tech people.  The event started with pizza and calling in to various countries around the globe so the problem submitters could briefly describe the motivation behind their problem statement.  There were quite a few problems, some with curious names such as “Map the crap”.  Initially my interest went out to a problem that involved writing a Google Earth plugin that would enable water management people & researchers to quickly and automatically identify catchment areas for a given location.  A challenging algorithmic problem I figured.  Unfortunately I was one of the few (or only?) person interested in the problem & the problem submitter was not physically available.  So progress would have been difficult & slow.

Instead I decided to join fellow attendee and problem owner Mark Iliffe.  Mark had just started working at the World Bank and his problem revolved around extending the Ushahidi platform to improve the reporting and feedback process of complaints.  If you have never heard of Ushahidi, dont worry, neither had I.  Turned out is quite a useful and successful CMS system that allows users to file reports about problems in their area (broken pipes, blocked public toilets, etc.).

From the start it was great to see how enthusiastic Mark was about his team (about 4 of us).  Quickly we were sketching out use case diagrams, sequence diagrams, flow charts, TODO lists, etc.  I think we pretty much filled up all the white boards available.  We were only about 20 minutes into this and our progress was already caught on the akvo.org blog by Mark Charmer.  Great stuff.

As people were working stuff out Mark would bring on beer, biscuits, wine, and gently coerce other hackers in joining us (at peak we had about 12 people!).  Star of the team was probably Caz.  Ushahidi is written in php and my php foo is … well … not stellar :)  But no worries for Caz who did a great job of coding stuff up.  The hardest part about the whole exercise was understanding the code structure (Ushadhidi uses a custom framework based on Codeigniter and Kohana).  I helped around with various bits and pieces and added support for geocoding sms messages by using a custom location code (Tanzania lacks an official postcode system).  We worked right up to the deadline, with our tweaked Android application only finally working during our presentation (props to Gitmeister Florian!).

Anyways, long story short.  We won :)  The other teams had some good stuff too but I think we had the edge with Marks mad selling skills and a very clear route to sustainability.  So what did we win?  Not quite sure yet.  Some gadgets from Google and a teleconference with some IBM guys next week.  Lets see what happens.  After a great social gathering at the local pub it was time to head home.

As I undertook the 2 hour journey back to Southampton it struck me again how important these kinds of events are.  First of all for meeting new and interesting people, secondly for learning new stuff and learning to get something together quickly.  I still have a long way to go but thats a great excuse for attending a next event :)

More pictures here.

-Dirk

Pycel: Compiling Excel spreadsheets to Python and making pretty pictures

Background

As described in my previous post, my job is to assemble a software suite that is used by aerospace engineers to design civillian UAVs.  Our main use case being search and rescue.

Every aircraft design starts its life in a concept tool where the configuration choice (V-tail, canard, flying wing, …) is made and where the principal geometrical parameters (wing area, CoG, tail volume, …)  are decided.  This then allows the calculation of the approximate weight and performance of the aircraft (e.g., maximum speed, range, endurance, …).

While this concept stage is only a rough approximation it already involves (in our case) some 250 variables which are interlinked in a complex web of equations.  These sizing equations capture a great deal of knowledge about the physics of flying an aircraft as well as empirical relationships drawn from historical & experimental data.  Solving these equations is really a Multidisciplinary Design Optimization (MDO) problem, a whole field of research in its own right.

After working with the Pacelab Suite for a year we have (for various reasons) switched to spreadsheets for implementing & experimenting with the sizing equations.   This brings great flexibility for the designer but makes things more tricky to parallelize & optimize for me.  For example, I cant just throw Excel on the 8000+ core linux cluster we have here.  Also, while the inbuilt solver plugin is easy and quick for solving small scale systems.  As our equation system got more complex the process just breaks down.

It was also clear that our sizing complexity had outgrown the spreadsheet UI and had become hard to understand for a newcomer.

Early experiments

Being a programmer I feel more at home manipulating code than manipulating spreadsheets.   Early on I had already experimented with visualizing the sizing logic by customizing and extending the Trace addin.  This resulted in images like this:

With some more graphviz magic you could add some structure by grouping related parameters:

And trace the impact of a particular parameter:

However, I soon ran into limitations of the static graphviz model and VBA (the language Trace is written in) is not the most happy place to be (to put it politely).

Python to the rescue

This work then lay dormant as I was sidetracked by other stuff.  However, with Pacelab being abandoned and the spreadsheet itch still being there I could not resist having a go at rewriting my own excel visualization library and throwing in some compilation for good measure.  The goal being to:

  • Allow better visualization and exploration of the sizing equations, to understand how variables influence each other
  • Compile the spreadsheet into code that can be manipulated, transformed and executed independently from excel.

There are already a number of commercial tools for compiling spreadsheets but I did not find an open source library which did what I needed.  And anyways, I did not want to look too hard since it would be something fun to do myself.  Why python? Its the language I was using at the time and one I particularly enjoy.

Piecing things together

A core problem in making all of this work is tokenizing an excel formula properly. Luckily this had already been solved and after a small patch to the python port that was working. Throw in shunting yard, the conversion to an AST, and I was ready to emit code.  For example:

Formula:  =3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3
RPN:  3|4|2|*|1|5|-|2|^|3|^|/|+
Python code:  3+((4*2)/(((1-5)**2)**3))

or

Formula:  =If(SUM(A1:A5) > 0, sin(0.3), ln(2.5))
RPN:  A1:A5|SUM|0|>|0.3|sin|2.5|ln|If
Python code:  sin(0.3) if xsum(eval_range("A1:A5"))>0 else xlog(2.5)

As you can see this means mapping excel functions/operators onto python ones and adding python equivalents where necessary.

Dependency tracking

Besides the issue of function mapping there is the issue of dependency tracking when updating cells.  So if A1 contains the formula B1 + 5 then you need to be aware of this dependency and update A1 whenever B1 changes.  For this I used a graph as the base data model as it is most intuitive.  Thanks to pythons networkx this was very easy to do.

The final result is a small python library that:

  • takes an excel spreadsheet & one or more seed cells
  • follows the dependency graph, starting from the seeds
  • generates a graph where each node is a cell or range and contains the python equivalent of the excel formula

This graph can then be serialized to disk and run independently of Excel (e.g., on linux).  Caching and lazy evaluation ensures the calculation and update process is relatively fast.  I have tested extensively with a spreadsheet with multiple sheets and over 1000 formulae.  Full calculation times are around 50ms which is plenty fast for our application.  Also, I can now run multiple instances in parallel, something that was virtually impossible to do by orchestrating Excel via Pthon over COM (yes I know there is solution from Microsoft for running Excel on a cluster but I did just not want to go there).

Pretty pictures

This means its now time for an intermezzo with some pretty pictures, courtesy of Gephi. The node color is determined by the worksheet the node comes from and the size of a node is determined by how many cells it depends on.  Ever think an aircraft design problem could look pretty? :)

Using some layout algorithms you can start pulling things apart:

And look at the macroscopic structure:

Or zoom in for the microscopic structure:

Here the colors represent which variable group the node belongs to. Stripping out things further you can look at one particular variable, for example the wing area, and see how it influences others:

Finally, optimization

Ok, so I could now serialize an excel spreadsheet to a python graph.  However, the ultimate goal is to build an aircraft that can actually fly, or in aero speak, it is balanced (lift equals the weight, center of gravity is in the right place, etc.)  This still involves an optimization process which now had to be done in Python.  For this I added an extra layer on top of the compilation process using OpenOpt.  After some bugfixing and testing I am happy to report that, without doing anything fancy, it solves in 98% of the time and takes 30 seconds to 1 minute to solve.  Compared with the original 65% and 10 minutes.

As a sanity check I then did some parameter sweeps to generate surface plots such as these:

Roughly this plot tells you that as you increase the payload weight and the importance of a low landing speed, your dry weight (the weight without fuel) goes up.  Which is exactly as you would expect (ignore the ridge at the bottom left).

Where is the code?

My employer has been kind enough to let me release the code which is now available on github!  An example is included and the readme lists the pro’s and con’s of the current functionality.  May the patches roll in :)

Thoughts and Future work

So far this has been an interesting little side project and been very useful in the wider design system.  Recall from my previous post that this is just one small piece in the puzzle.  In the wider system this is linked up with CFD simulations, costing codes, an operational model and what have you.  One disadvantage of spreadsheets, though, is that designers tend to use large tables to interpolate out equations instead of writing a few lines of code.  This can add large numbers of nodes/edges to the graph, thus increasing the visual complexity, without adding much to the true problem complexity.

I think there is potential to explore this topic further, but this would really need an application specific layer on top of a graph-only tool like gephi.  However, it does not look like this is where my priorities will lay, but maybe it will inspire somebody else :)

There is a good chance I will be revisiting this type of work when we look at visualizing design rationale data and how it evolves during the aircraft design process.  But that is a separate topic for a separate post :)

Thoughts, comments, suggestions, constructive criticism, patches? Dont be shy!

–Dirk