I’m building an application that makes composites of microarray data. Basically, each data point from the microarray represents a range of coordinates on the genome (e.g. 10000–11234) and the data at that range – how much that piece of DNA gets expressed based on the antibody we used in this experiment. Each microarray experiment yields about 8000 of these data points.

We want to see how this data looks on the genome, so we take all the genes that overlap the range of our points. Then we divide each gene into a number of equally sized buckets. If a data point overlaps a bucket, the point goes in there – a point can be filed in many buckets. Eventually, when all the data has been processed and each bucket contains a number of values, we take medians and things to work out how much this part of the gene is expressed when treated with this antibody.

The problem is that if we say a bucket is, say, of size 2, what does this really mean? If a bucket has the lower bound (lb) 100 and the upper bound (ub) 102, that would seem to indicate that the bucket is of size (ub – lb) = (102 – 100) = 2. However, this bucket contains 3 positions – 100, 101 and 102. So which is it – 2 or 3? It matters, because if a data point is, say, from 102–110, does it get filed in that bucket or not? It sort of overlaps – on 102. But also sort of not.
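To make the 2-or-3 question concrete, here is a minimal Python sketch of the two readings. The function names are my own, made up for illustration; they are not from the application itself.

```python
def size_as_length(lb, ub):
    """Bounds are marks on a ruler: count the gaps between them."""
    return ub - lb


def size_as_positions(lb, ub):
    """Bounds are inclusive positions: count the positions themselves."""
    return ub - lb + 1


def overlaps_as_length(a_lb, a_ub, b_lb, b_ub):
    """Ruler reading: merely touching at a single mark is not an overlap."""
    return a_lb < b_ub and b_lb < a_ub


def overlaps_as_positions(a_lb, a_ub, b_lb, b_ub):
    """Inclusive reading: sharing even one position is an overlap."""
    return a_lb <= b_ub and b_lb <= a_ub


print(size_as_length(100, 102))                    # 2
print(size_as_positions(100, 102))                 # 3
print(overlaps_as_length(100, 102, 102, 110))      # False
print(overlaps_as_positions(100, 102, 102, 110))   # True
```

The same bucket and the same data point give different answers depending on which reading you pick, which is exactly the "sort of overlaps, sort of not" problem.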

More important than the 2 or 3 quandary is how we should build the coordinates of each bucket. Should contiguous buckets share their bounds (100–102, 102–104 and so on) or should they sit side by side (100–102, 103–105 and so on)? This has an impact on the coordinates of each bucket, and therefore on how we locate a bucket to file each point.
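As a rough sketch of the two layouts (again with hypothetical function names, and assuming every bucket is the same size), the coordinates could be generated like this:

```python
def shared_bounds(start, width, count):
    """Buckets whose upper bound is the next bucket's lower bound."""
    return [(start + i * width, start + (i + 1) * width) for i in range(count)]


def side_by_side(start, n_positions, count):
    """Buckets with inclusive bounds that never share a position."""
    return [
        (start + i * n_positions, start + (i + 1) * n_positions - 1)
        for i in range(count)
    ]


def bucket_index(position, start, n_positions):
    """Locate the side-by-side bucket that holds a single position."""
    return (position - start) // n_positions


print(shared_bounds(100, 2, 3))    # [(100, 102), (102, 104), (104, 106)]
print(side_by_side(100, 3, 3))     # [(100, 102), (103, 105), (106, 108)]
print(bucket_index(102, 100, 3))   # 0
print(bucket_index(103, 100, 3))   # 1
```

With side-by-side buckets every position belongs to exactly one bucket, so the lookup is a single integer division. With shared bounds, position 102 sits on the edge of two buckets and the lookup needs an extra rule to decide between them.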

I spent the weekend mulling this over. My conclusion was that it actually depends on the semantics of the data. If you’re using a tape measure to take the dimensions of the table, what you get is something like this:

This table is clearly 8 whatevers long (big table). The fact that it contains 9 numbers (100, 101,…, 108) is irrelevant, because it is the spaces between the numbers that we are interested in.

However, my data is a sequence of bases:

The number 102 represents a coordinate – a physical base in the sequence. A bucket with lb/ub of 100/102 contains 3 bases. This means that sharing a base between buckets does not make sense, so I must use buckets with non-shared bounds. Once I understood this, it was much easier to get my application to actually work.
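Put together, the approach looks roughly like the sketch below. This is an illustration under my own assumptions rather than the application's actual code: the names are invented, the expression values are made up, and when the gene length does not divide evenly the buckets are only nearly equal in size.

```python
from statistics import median


def make_bins(gene_start, gene_end, n_bins):
    """Split an inclusive base range into n_bins non-overlapping inclusive bins."""
    total = gene_end - gene_start + 1  # number of bases, not ub - lb
    return [
        (gene_start + i * total // n_bins,
         gene_start + (i + 1) * total // n_bins - 1)
        for i in range(n_bins)
    ]


def file_points(bins, points):
    """File each (lb, ub, value) point into every bin it shares a base with."""
    values = [[] for _ in bins]
    for lb, ub, value in points:
        for i, (bin_lb, bin_ub) in enumerate(bins):
            if lb <= bin_ub and bin_lb <= ub:  # inclusive overlap test
                values[i].append(value)
    return values


def bin_medians(values):
    """Median per bin, or None for a bin that received no points."""
    return [median(v) if v else None for v in values]


bins = make_bins(100, 108, 3)
points = [(102, 110, 4.0), (100, 101, 2.0), (104, 104, 1.0)]  # made-up values
filed = file_points(bins, points)

print(bins)                 # [(100, 102), (103, 105), (106, 108)]
print(filed)                # [[4.0, 2.0], [4.0, 1.0], [4.0]]
print(bin_medians(filed))   # [3.0, 2.5, 4.0]
```

The point from 102–110 lands in the first bucket because base 102 belongs to it, and it lands in the other two as well, which is the "filed in many buckets" behaviour described earlier.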

When do we teach this in computer science? Somewhere, I guess.

Oh, and apparently I’m not supposed to call them buckets – the accepted term is “bins”. I prefer buckets, but he who pays the piper…
