Archive for September, 2008

NaN == NaN?

September 29, 2008

I’m currently building an application to analyse about 40 microarray experiments. I’ve already built some small prototypes that work OK, but I’ve been told that eventually the analysis will take in a lot more data, so I want to make sure it’s super fast. We do lots of simple mathematical and statistical analysis (medians, z-scores) on big arrays of numbers so I’ve started learning numpy and scipy.

Numpy arrays are lightning fast for things like medians and means – about 100 times faster in my profiling. However, numpy and scipy are sometimes a little quirky in behaviour, largely to support people who are more used to Matlab. Many functions have a kind of implicit map (as in the functional programming map) over numpy arrays, so given a numpy array a:


a / 3

divides everything in a by 3 and


zs(a)

gives us the z score for everything in a. This takes a bit of getting used to, but is actually very useful. You can also do cool things with array access, like this:


a[a > 0.8]

which gives an array containing only the elements greater than 0.8. You can even assign into this:


a[a > 0.8] = 1

assigning 1 into everything in a greater than 0.8. And it does it really fast, which is useful.
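
Putting those snippets together – with values I’ve made up, and assuming zs is the z-score helper from scipy.stats (scipy.stats.zscore in current versions) – the whole thing looks something like this:

import numpy as np
from scipy import stats

a = np.array([0.2, 0.5, 0.9, 1.3])

a / 3              # element-wise: array([0.0667, 0.1667, 0.3, 0.4333])
stats.zscore(a)    # z-score of every element
a[a > 0.8]         # boolean indexing: array([0.9, 1.3])
a[a > 0.8] = 1     # assign into just those elements
a                  # array([0.2, 0.5, 1., 1.])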

So, obviously I want to store the data spots from my microarray experiments in these arrays so that I can do medians and things nice and quickly. Each microarray could have up to three duplicate spots for each primer in the array, but as with many things in experimental biology, all three spots could be bad. Each experiment is also duplicated three times, but again it could be the case that in none of the three experiments are the spots readable.

My idea was to take the number of primers and build a numpy array, with a lookup from each primer to the index that stores its data. Each point starts with a blank value and is filled in later. Of course, I can’t use 0 (or any other number) as the start value, because that is a possible signal from the array, and then it wouldn’t look like a “blank” point in the analysis.

So I chose to use the special floating-point value “NaN”, meaning “not a number”. This is the result of invalid mathematical operations, like dividing zero by zero.
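
A minimal sketch of what that blank-initialised store might look like (the shape and the primer names here are made up for illustration):

import numpy as np

n_primers = 8000                     # roughly one entry per primer
data = np.empty((n_primers, 3, 3))   # 3 replicate experiments x 3 duplicate spots
data.fill(np.nan)                    # every cell starts out "blank"

primer_index = {"primer_A": 0}       # hypothetical lookup from primer to row
data[primer_index["primer_A"], 0, 0] = 0.73   # a real signal gets filled in later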

The problem is that numpy’s handling of NaN is also a little quirky. You can take the median of an array that contains a NaN and it sorts the NaNs to the end – it doesn’t crash, but it gives you completely the wrong result. Instead, you have to use NaN-aware functions. This would be fine, except that for my application they take a significant performance hit – 36 times slower than not using numpy arrays at all.
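
For reference, the NaN-aware call looks something like this – numpy.nanmedian in newer numpy versions; at the time the equivalents lived in scipy.stats:

import numpy as np

a = np.array([1.0, 2.0, 3.0, np.nan])

np.median(a)      # polluted by the NaN - not the median of the real values
np.nanmedian(a)   # ignores the NaN and returns 2.0, but it is noticeably slower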

I posted to the numpy lists and some users suggested that you’re not really supposed to use NaNs in this way – you should use a masked array. Again there are masked array functions that know what to do. Of course, the version of numpy that comes in the Ubuntu package repository does not include these.
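
For completeness, the masked-array route looks roughly like this (masked arrays live in numpy.ma):

import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])

masked = np.ma.masked_invalid(a)   # masks out NaN (and inf) entries
np.ma.median(masked)               # 2.0 - the median of the unmasked values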

I was left fighting my way through various manual methods to get around my NaN problems, which led me to some weird properties of NaNs (they’re called “nan” in numpy):


>>> from numpy import nan
>>> nan == nan
False
>>> nan in [ nan ]
False
>>> nan is nan
True

So nan does not equal itself. A list containing only a nan does not contain a nan (presumably because nan does not equal nan). Finally, the Python “is” operator checks object identity (pointer equality) – and even this is apparently not guaranteed, because numpy may create more than one nan object internally. All this means that you can’t do:


[nan].remove(nan)

because it doesn’t remove anything. I ended up with something like this:


nonan = a[~isnan(a)]

which gives me all the non-nan values in the array. Then I can reimplement median and so on with these filtered arrays. I had to do the same for the scipy.stats z-score function, which gives all nans if there are any nans in the input.
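
The replacements boiled down to something like this (just a sketch – the function names are for illustration):

import numpy as np

def nonan_median(a):
    # median of the non-nan values only; nan if nothing is left
    vals = a[~np.isnan(a)]
    return np.median(vals) if vals.size else np.nan

def nonan_zscores(a):
    # z-scores computed from the non-nan values, nan everywhere else
    mask = ~np.isnan(a)
    vals = a[mask]
    out = np.full(a.shape, np.nan)
    out[mask] = (vals - vals.mean()) / vals.std()
    return out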

All in all, it took me a week to 10 days to pick through this mess. I have learned quite a bit about numpy, which has many nice features. However, by the end, of course, I was made to eat humble pie – almost all the time in my application is spent reading data from the microarray files and writing out wiggle files that can be used by the UCSC genome browser.

This just shows that you should always profile before you start optimising…


How big is the bucket

September 15, 2008

I’m building an application that makes composites of microarray data. Basically, each data point from the microarray represents a range of coordinates on the genome (eg 10000–11234) and the data at that range – how much that piece of DNA gets expressed based on the antibody we used in this experiment. Each microarray experiment represents about 8000 of these data points.

We want to see how this data looks on the genome, so we take all the genes that overlap the range of our points. Then we divide each gene into a number of equally sized buckets. If a data point overlaps a bucket, the point goes in there – points can be filed in many buckets. Eventually, when all the data has been processed and each bucket contains a number of values, we take medians and things to work out how much this part of the gene is expressed when treated with this antibody.

The problem is that if we say a bucket is, say, of size 2, what does this really mean? If a bucket has the lower bound (lb) 100 and the upper bound (ub) 102, that would seem to indicate that the bucket is of size (ub – lb) = (102 – 100) = 2. However, this bucket contains 3 positions – 100, 101 and 102. So which is it – 2 or 3? It matters, because if a data point runs from, say, 102–110, does it get filed in that bucket or not? It sort of overlaps – on 102. But also sort of not.

More important than the 2-or-3 quandary is how we should build the coordinates of each bucket. Should contiguous buckets share their bounds (100–102, 102–104 and so on) or should they be side by side (100–102, 103–105 and so on)? This has an impact on the coordinates of each bucket, and therefore on how we locate a bucket to file each point.

I spent the weekend mulling this over. My conclusion was that it actually depends on the semantics of the data. If you’re using a tape measure to take the dimensions of the table, what you get is something like this:

Buckets: using a tape measure

This table is clearly 8 whatevers long (big table). The fact that it contains 9 numbers (100, 101,…, 108) is irrelevant, because it is the spaces between the numbers that we are interested in.

However, my data is a sequence of bases:

Buckets: sequence data

The number 102 represents a coordinate – a physical base in the sequence. A bucket with lb/ub of 100/102 contains 3 bases. This means that sharing a base between buckets does not make sense, so I must use buckets with non-shared bounds. Once I understood this, it was much easier to get my application to actually work.
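
To make this concrete, here’s a rough sketch of what non-shared, inclusive bucket bounds and the matching overlap test can look like (the numbers and names are illustrative, and bucket_size here counts bases, so a 100–102 bucket has size 3):

def make_buckets(gene_start, gene_end, bucket_size):
    # non-shared bounds: e.g. 100-108 in threes -> (100,102), (103,105), (106,108)
    buckets = []
    lb = gene_start
    while lb <= gene_end:
        ub = min(lb + bucket_size - 1, gene_end)
        buckets.append((lb, ub))
        lb = ub + 1
    return buckets

def overlaps(point_start, point_end, lb, ub):
    # inclusive coordinates: the point is filed if the two ranges share any base
    return point_start <= ub and lb <= point_end

make_buckets(100, 108, 3)      # [(100, 102), (103, 105), (106, 108)]
overlaps(102, 110, 100, 102)   # True - the point covers base 102
overlaps(103, 110, 100, 102)   # False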

When do we teach this in computer science? Somewhere, I guess.

Oh, and apparently I’m not supposed to call them buckets – the accepted term is “bins”. I prefer buckets, but he who pays the piper…

Unbreaking Linux Sound

September 5, 2008

So I got it working. I switched on the onboard sound and used that crazy OSS stuff. There was a degree of poking involved to remove a lot of background hiss, but I can now watch YouTube videos and use Amarok.

(aside: it’s very annoying that because Amarok is a KDE app, I need all the KDE gumpf to get it working. In particular, I have to install kcontrol in order to set a proxy server to make the Amarok lyric engine work)

Then I spent a Linux hacking hour getting the volume controls on my keyboard working. Time was, I just would have expected them to not work, but I seem to have some spare mental capacity for this meaningless tinkering.

Anyway, I started following these instructions, which worked fine with a bit of shell hackery. I love the way that I have to string together a series of awk commands to get my volume control to work. It makes the geek part of me feel all warm and fuzzy. The control doesn’t work properly with the gnome graphical widget despite further efforts. Still, I’m happy with it.

All this tinkering leaves me with the conclusion that I haven’t got enough hacking to do. I’m thinking of getting involved in one of the Python graph drawing toolkits, such as NetworkX. It looks like working on the EDGE project has turned graph drawing into something of a hobby. Well, we’ll see how I get on.