CompLearn is a free/libre/open source software application that uses mathematical compression techniques to spot obscure patterns in a wide variety of data sources, from languages and music to biology. One of the authors of CompLearn, Rudi Cilibrasi, has applied this tool to a data set of 30 different H5N1 avian flu strains, and was able to build a tree graph of the relationships between the different versions of the disease. The goal?
...to track which strains are going where and when new strains pop up we can match them to the nearest previously known strain in the hope that this can shed light on the epidemiology of the situation.
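For readers curious how compression can measure similarity at all, here is a minimal sketch of the normalized compression distance (NCD), the measure CompLearn is built around. It is not CompLearn's own code: zlib stands in for the compressor, the function names (compressed_size, ncd, nearest) are hypothetical, and the sequences in the demo are made-up toy data, not real H5N1 genomes.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest(query: bytes, known: dict[str, bytes]) -> str:
    """Name of the known sequence with the smallest NCD to the query."""
    return min(known, key=lambda name: ncd(query, known[name]))


if __name__ == "__main__":
    # Toy data only: two random "genomes" and a lightly mutated copy of the first.
    rng = random.Random(0)
    strain_a = bytes(rng.choice(b"ACGT") for _ in range(2000))
    strain_b = bytes(rng.choice(b"ACGT") for _ in range(2000))
    new_strain = bytearray(strain_a)
    for i in range(0, len(new_strain), 100):
        new_strain[i] = ord("N")  # crude point mutation every 100 bases

    known = {"strain_a": strain_a, "strain_b": strain_b}
    print(nearest(bytes(new_strain), known))
```

The idea is that if two sequences share a lot of structure, compressing them together costs little more than compressing one alone. A tree like the one described above would be built from the full table of pairwise distances; that step is not shown here.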
Mathematical compression algorithms are turning into profoundly powerful tools (we blogged recently about the use of compression to help radiologists detect cancer, for example). And, as WorldChanging alumnus Taran Rampersad notes, this is a prime example of the utility of open source tools outside of the corporate computing setting.
As a technical note, let me comment that given any data set, you can always construct a tree relating those objects according to some distance or parsimony metric. The question one needs to ask here is whether that tree accurately reflects the biological evolution of the disease. There exists an extensive and sophisticated body of literature trying to answer this question, along with computer programs to do so (like PHYLIP). The reliance of these methods on biological mutation mechanisms suggests that compression will not improve accuracy (though in some cases, it may improve speed).