Sunday, February 8, 2009

On the serendipity of failure

My post on the relative difficulty of sensor nets research versus "traditional" systems research alluded to a comment David Culler once made about the handicap associated with field work. David notes that in a field deployment, unlike a simulation or a lab experiment, if you get data you don't like, you can't just throw it away and run the experiment again, hoping for a better result. All systems researchers tend to tweak and tune their systems until they get the graphs they expect; it's one of the luxuries of running experiments in a controlled setting. With sensor network field deployments, however, there are no "do-overs". You get whatever the network gives you, and if the network is broken or producing bogus data, that's what you have to go with when it's time to write it up. One of the most remarkable examples I have seen of this is the Berkeley paper on "A Macroscope in the Redwoods." In that case, the data was actually retrieved from the sensor nodes manually, in part because the multihop routing did not function as well as expected in the field. They got lucky by logging all of the sensor data to flash on each mote, which made it possible to write that paper despite the failure.

We learned this lesson the hard way with our deployment at Reventador Volcano in 2005: the FTSP time sync protocol broke (badly), causing much of the data to have bogus timestamps and rendering the signals useless from a geophysical monitoring point of view. Note that we had tested our system extensively in the lab, on much larger networks than the one we deployed in the field, but we never noticed this problem until we got to the volcano. What went wrong? Many possibilities: the sparse topology of the field network; the fact that the nodes were powered by real alkaline batteries rather than USB hubs; the fact that the time sync bug only seemed to turn up after the network had been operational for several hours. (We had done many hours of testing in the lab, but never continuously for that long.)

In our case, we managed to turn lemons into lemonade by designing a scheme to fix the broken timestamps and then doing a fairly rigorous study of its accuracy. That got us into OSDI! It's possible that if the FTSP protocol had worked perfectly, we would have had a harder time getting that paper accepted.
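For a flavor of what post-hoc timestamp rectification can look like, here is a minimal sketch (hypothetical code, not the actual scheme from our paper). It assumes each data record carries the node's raw local clock value, and that for each node we have occasional (local clock, global time) reference pairs, say from a GPS-synchronized base station recording when it heard each status message. The sketch fits a per-node linear clock model, discards obvious outliers, and re-timestamps the data; all names and thresholds are illustrative.

```python
# Illustrative sketch of post-hoc timestamp rectification (hypothetical,
# not the scheme from the paper). Assumptions: each node's data records
# carry the node's raw local clock value, and for each node we have
# occasional (local_clock, global_time) reference pairs, e.g. from a
# GPS-synchronized base station logging when it received status messages.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClockModel:
    slope: float   # global seconds per local clock tick
    offset: float  # global time at local clock value zero

def fit_clock_model(refs: List[Tuple[float, float]]) -> ClockModel:
    """Least-squares fit of global_time ~= slope * local_clock + offset."""
    n = len(refs)
    if n < 2:
        raise ValueError("need at least two reference pairs")
    mean_l = sum(l for l, _ in refs) / n
    mean_g = sum(g for _, g in refs) / n
    var_l = sum((l - mean_l) ** 2 for l, _ in refs)
    cov = sum((l - mean_l) * (g - mean_g) for l, g in refs)
    slope = cov / var_l
    return ClockModel(slope=slope, offset=mean_g - slope * mean_l)

def reject_outliers(refs: List[Tuple[float, float]],
                    model: ClockModel,
                    max_error_s: float = 1.0) -> List[Tuple[float, float]]:
    """Drop reference pairs whose residual exceeds max_error_s seconds."""
    return [(l, g) for l, g in refs
            if abs(model.slope * l + model.offset - g) <= max_error_s]

def rectify(local_clock: float, model: ClockModel) -> float:
    """Map a raw local clock reading to an estimated global timestamp."""
    return model.slope * local_clock + model.offset

# Usage: fit, prune bad references, refit, then re-timestamp the node's data.
# refs = [(local_clock, global_time), ...]          # per node
# model = fit_clock_model(refs)
# model = fit_clock_model(reject_outliers(refs, model))
# fixed = [rectify(lc, model) for lc in raw_local_clocks]
```

In practice you would also have to handle clock resets from node reboots (piecewise fits rather than a single line) and validate the rectified timestamps against an independent reference, which is roughly what a rigorous accuracy study entails.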

I often find that the parts of these application papers describing what didn't work as expected are more enlightening than the rest of the paper. Lots of things sound like good ideas on paper; often it's not until you try them in the field that you gain a real understanding of the real-world forces at work.

Later on I'll blog about why sensor net application papers face such an uphill battle at most conferences.

