Binning on a grand scale

In the latest round of filing stuff into bins (see previous post), we’re working with an affymetrix array platform, which produces data across the whole genome. Previously, I’ve been binning data from a different kind of microarray, which was around 17,000 points per array. In this case, there are around 6 million points per array, with 7 arrays per treatment; around 40 million points per treatment. There are 28 files, each of which is about 130MB.

In one sense, Affy data is a bit easier because our previous arrays were indexed by a primer ID. You have to read in a different file to look up the genome coordinates of that ID. Affy data is based on oligonucleotides, which are expressed in the data files as a chromosome and a centre along with the data point. We assume each oligo is a 20-mer, so we need to extend the centre 10 bases in each direction to find the tile coordinates.

The first problem I had with this data was reading it in. If you write a naive parser that reads in each line and builds a coordinate, 10 either side of the centre, as a tuple it starts eating memory at a rate of 100MB a second and it doesn’t stop until it’s all gone. My machine has 8GB of memory, but it was less than half way through reading the files when it ran out.

The problem is that Python lists are doubly linked, so each new item you add, despite being only one 64 bit float, is also using several other 64 bit variables as the forward and reverse pointers. As well as some of the suggestions from the forum thread, I also tried Python arrays and a suggestion from a local genius about cStringIO.

In the end, I used a map from a single character to the chromosome name, so I could store all the chromosomes in a single enormous cStringIO string, with one character for the chromosome of each data point. I used a numpy float array for the data point itself and a numpy integer array for the oligo centre. This squeezed each individual file into less than 100MB memory. I stuck with the Python csv module for I/O because it has more flexibility than the numpy loadtxt routines and I found it to be at least as fast. I’m very impressed with the raw speed of Python csv, which can chew through 10s of MB in a matter of seconds. It actually seems to be around 10 times faster than Perl CSV, which I find quite surprising despite my general disapproval of Perl.

The next problem was the bin filing itself. Our whole genome dataset uses about 10,000 genes which resolve into about 130,000 bins, depending on the type of analysis. My previous solution was to look for overlaps between a tile and a gene, and then look for overlaps in the bins within that gene. I tried a few methods to tweak this method, but it didn’t seem to make too much difference. Instead, another forum came to the rescue with a reference to interval trees. A bit of googling led me here, a very compact and easy to understand implementation. On a test set, I found these to be up to 80 times faster than my original implementation. Booyah.

Integrating this into my existing code was pretty simple. I was able to add a further optimisation by filing each treatment together, because they all share the same coordinates. The previous microarray data had bad spots in some treatments which meant it would have been more complicated to do this (and not worth it, when it already ran in only a few minutes).

A run that files the full 160 million points and plots the resulting bins completes in about 90 minutes, whereas before it did not finish in an overnight run. A good result, and a pleasing use of a proper piece of computer science. I do feel slightly ashamed that I didn’t think of the tree implementation myself, but I guess I’m a software engineer now and not a theoretician – it’s somebody else’s job to come up with the clever method and my job to make use of it.

I did have one further problem because I was producing my plots using parallel python. Some of the plots are quite complex and it saves a bit of time if we plot them concurrently. However, PP dumps the data it needs into a pickle so it can be read by the separate threads. Dumping the vast gigabytes of data from the binning not only took forever but it consumed all the memory, leaving the machine thrashing like crazy. Not time saving at all. It took about 20 seconds to redraw the screen when I got to work this morning.

I still haven’t figured out how best to use parallel processing for this work, because the bulk of the time is spend operating on the same set of lists (the bins) and these can’t be shared in a PP style implementation. One suggestion was to store the data in a database. I’m just not sure if anything file based like this, even using many cores, can ever be quicker than working in memory with a single core. It might make sense if I have 10 or 20 cores, but I suspect this database approach may be too specialised to operate on any of my local clusters.

I need to hook up with some more scientific programming experts to work out how to deal with this problem. I just have to hope the answer is not to rewrite my code in C++.


2 Responses to “Binning on a grand scale”

  1. psaffrey Says:

    A number of people have suggested I should use a generator instead of reading the whole lot into memory in one go. The problem is that I need to z-score the data after it has come in, and a z-score needs the means and standard deviations of each point – I need to read them all in before I can compute this.

  2. Overlapping Genomic Intervals Revisited « Peter Saffrey's Weblog Says:

    […] posted before on finding the right approach to overlapping genomic intervals. Broadly speaking, the […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: