On MobileActive's SMS Delivery Results in Egypt

A few days ago, MobileActive posted the results of some interesting work they did in Egypt, trying to quantify SMS latencies and whether they might indicate that filtering is taking place.  Essentially, they wrote an Android application that allowed them to easily measure SMS latencies between networks.  That application sent a variety of messages, some with 'safe' content and others with 'political' content, the idea being to help quickly identify whether networks are filtering or blocking messages based on their content.

While I applaud the effort, and I think it is an interesting project, I feel like there are some flaws in the methodology, significant enough flaws that MobileActive should have resisted even hinting at any conclusions before fixing them.

We do a lot of work with SMS here in Rwanda; we live and breathe the stuff really, so we are pretty familiar with just how mysterious latencies can be.  We've done work in Ethiopia with Android phones and seen instant deliveries right after massive delays, all with just boring old numerical messages.  And we've done a lot of work here in Rwanda talking straight to the carrier's SMSC and seen similar things.  To put it simply, even under the best circumstances, finding rhyme or reason in delivery failures, much less latency, is a really hard task.

And that puts an enormous burden of proof on any experiment which claims to correlate message latencies with message content, and here I feel MobileActive should have known better and resisted any reporting of possible filtering, even tentative and disclaimed.

It is hard to know exactly what the data showed, since the data isn't publicly available, but we do know that the study involved roughly 270 messages across a variety of networks.  One of the hypotheses they give is that Etisalat may be filtering messages based on the content of the message.  From the graphs provided, we can guess this is based on roughly 70 results, which seems to fit in with their totals.  Here's their graph of those results:

Now it is certainly tempting to look at this and start forming hypotheses that political messages are being filtered, but we have to keep in mind our previous caveats about the reliability of SMS deliveries.  Specifically, especially when measuring something as unpredictable as SMS, we need to start with a valid hypothesis and then work out an experiment that makes sense.  Again, we don't know enough about MobileActive's methodology here to draw any conclusions, but here are some things I'd do:

1) Network latency is often cyclical; the network sometimes just acts up for a bit.  So when testing various messages it is important both to randomize the order in which the messages are sent and to do multiple passes of the test at various times of the day.  If we skimp on either of these we may just be seeing the artifacts of a network under load.

2) Are delays consistent across the same message?  From what I can tell this is the biggest flaw in the tests as a whole.  If every time I send a message saying "revolt now" it takes 20 seconds to deliver, then perhaps we have a case.  But if it is inconsistent, then we really need to start looking at what else can explain that latency.

3) If the hypothesis is that filtering is taking place, what do we think is the mechanism?  Clearly, any filtering that is automated would be completely lost in the noise of normal SMS delivery times.  Even the most sophisticated algorithm would take mere milliseconds to evaluate whether 160 characters should pass or not, so you wouldn't be able to detect it by measuring the latency.  If the hypothesis is that some kind of manual filtering is taking place, such that actual humans are looking at the messages, then we should design our experiment to capture that.  For example, perhaps we can try to overload these mechanical turks by sending a large number of political messages in a short time period.  If the delays increase even further, then that's probably an indication that there is some human intervention taking place.  I find it very, very hard to believe that any carrier has such a system in place that would still result in sub-minute deliveries, but if that is indeed our hypothesis, we could create an experiment to test it.
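
To make points 1 and 2 a bit more concrete, here is a rough Python sketch of how I might set up such a test.  This is not MobileActive's code, which I haven't seen; the function names (build_schedule, latency_spread) and the message labels are made up for illustration, and actually sending the messages and timing their delivery is left out entirely.  The first function shuffles the message order independently for each pass, and the second groups measured latencies by message text so we can see whether the same message is delayed consistently.

```python
import random
from statistics import median


def build_schedule(messages, passes, seed=None):
    """Build a send order: every message once per pass, shuffled within each pass.

    messages: list of (label, text) tuples, e.g. ("political", "revolt now").
    Returns a list of (pass_index, label, text) tuples.
    """
    rng = random.Random(seed)
    schedule = []
    for pass_index in range(passes):
        order = list(messages)
        rng.shuffle(order)  # fresh random order for each pass
        schedule.extend((pass_index, label, text) for label, text in order)
    return schedule


def latency_spread(results):
    """Summarize latency per message text.

    results: list of (text, latency_seconds) measurements.
    Returns {text: (min, median, max)} so inconsistent delays stand out.
    """
    by_text = {}
    for text, latency in results:
        by_text.setdefault(text, []).append(latency)
    return {text: (min(v), median(v), max(v)) for text, v in by_text.items()}
```

Each pass would then be run at a different time of day.  If "revolt now" shows a spread of 4 to 22 seconds while a safe message shows the same kind of spread, the content is probably not what is driving the delay.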

Perhaps my biggest complaint here is the lack of openness.  In our fast-paced age of Twitter and Facebook and flashy headlines, we need to resist the temptation for sensationalism and be rigorous in our methodologies, especially on topics this important.  Not publishing the raw data that these results were based on just isn't acceptable, and if the excuse is one of not enough time, then the original article should have waited until that part was ready.  And just as you should always be skeptical of any benchmark which you can't run yourself, you should also be skeptical of any test using software you can't examine.  Again, lack of time is not an excuse: if you don't have time to make the process transparent, then you should just delay publishing your results.

MobileActive and others all do important work, but we must remember to maintain our standards, our scientific training, whenever sharing important information such as this.  It is all too easy to be drawn to the headline, to be taken over by the excitement of your results, and most importantly, all too easy to see patterns where none really exist.  We have all fallen victim to this, but it is our job as a community to call each other on it, to remind each other to be rigorous in our conclusions, to be peer reviewed and to never forget about that little thing called statistical significance.
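
For what it's worth, checking significance on this kind of data doesn't need much machinery.  Below is a minimal sketch of a one-sided permutation test in Python, asking how often a random relabeling of the messages would produce a gap in mean latency at least as large as the one observed.  The function name and parameters are my own invention, and this is one reasonable choice of test among several, not the analysis MobileActive ran.

```python
import random


def permutation_p_value(safe, political, rounds=10000, seed=0):
    """One-sided permutation test on the difference in mean latency.

    safe, political: lists of latencies in seconds.
    Returns the estimated probability of seeing a gap
    (mean(political) - mean(safe)) at least this large by chance
    if message content had no effect on latency.
    """
    rng = random.Random(seed)
    n = len(political)
    observed = sum(political) / n - sum(safe) / len(safe)
    pooled = list(safe) + list(political)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly reassign the 'political' label
        fake_political = pooled[:n]
        fake_safe = pooled[n:]
        diff = sum(fake_political) / n - sum(fake_safe) / len(fake_safe)
        if diff >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (rounds + 1)
```

A small value suggests the gap is unlikely to be pure chance, but it says nothing about why the gap exists; a network acting up during the political messages would produce the same number, which is exactly why the randomized ordering matters.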

I hope MobileActive stays true to their word and releases their Android client.  No need to clean it up guys, we won't judge you!  Let's all work together to get to the bottom of this most intriguing mystery.  We have lots of experience building Android apps and would be happy to lend a hand, as well as give our results on the Rwandan carriers.