I’m currently building an application to analyse about 40 microarray experiments. I’ve already built some small prototypes that work OK, but I’ve been told that eventually the analysis will take in a lot more data, so I want to make sure it’s super fast. We do lots of simple mathematical and statistical analysis (medians, z-scores) on big arrays of numbers so I’ve started learning numpy and scipy.

Numpy arrays are lightning fast for things like medians and means – about 100 times faster in my profiling. However, numpy and scipy are sometimes a little quirky in behaviour, largely to support people who are more used to Matlab. Many functions do a kind of implicit map (as in the functional-programming map) over the numpy arrays, so given a numpy array a:

a / 3

divides everything in a by 3 and

zs(a)

gives us the z score for everything in a. This takes a bit of getting used to, but is actually very useful. You can also do cool things with array access, like this:

a[a > 0.8]

which gives an array with only the elements greater than 0.8. You can even assign into this:

a[a > 0.8] = 1

assigning 1 to every element of a greater than 0.8. And it does it really fast, which is useful.
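A minimal sketch of the behaviour described above (the array values are just illustrative):

```python
import numpy as np

a = np.array([0.2, 0.5, 0.9, 1.2])

# Arithmetic is implicitly mapped over every element
print(a / 3)

# Boolean indexing selects only the matching elements
print(a[a > 0.8])        # [0.9 1.2]

# Assigning through a boolean index modifies the array in place
a[a > 0.8] = 1
print(a)                 # [0.2 0.5 1.  1. ]
```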

So, obviously I want to store the data spots from my microarray experiments in these arrays so that I can do medians and things nice and quickly. Each microarray could have up to three duplicate spots for each primer in the array, but as with many things in experimental biology, all three spots could be bad. Each experiment is also duplicated three times, but again it could be the case that in none of the three experiments are the spots readable.

My idea was to take the number of primers and build a numpy array with a lookup for which index stores each primer’s data. Each point starts with a blank value and is filled in later. Of course, I can’t use 0 (or any other number) as a start value, because that is a possible signal from the array, and then it wouldn’t look like a “blank” point in the analysis.

So I chose to use the floating-point value “NaN”, meaning “not a number”. This is the result of invalid mathematical operations, like dividing zero by zero.

The problem is that numpy’s handling of NaN is also a little quirky. You can take the median of an array containing a NaN and it sorts the NaNs to the end – it doesn’t crash, but it gives you a completely wrong result. Instead, you have to use NaN-aware functions. That would be fine, except that for my application they take a significant performance hit – 36 times slower than not using numpy arrays at all.
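For what it’s worth, the NaN-aware variants now ship in core numpy as np.nanmedian, np.nanmean and friends (back then, as far as I can tell, they lived in scipy.stats). A quick sketch of what they do:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, 2.0])

# The plain median/mean do not skip NaNs, so they cannot give you
# the statistic over just the valid values.
# The NaN-aware variants ignore the NaNs entirely:
print(np.nanmedian(a))   # 2.0
print(np.nanmean(a))     # 2.0
```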

I posted to the numpy lists and some users suggested that you’re not really supposed to use NaNs in this way – you should use a masked array instead. Again, there are masked-array functions that know what to do. Of course, the version of numpy that comes in the Ubuntu package repository does not include these.
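For reference, a minimal sketch of the masked-array route they suggested (np.ma is part of mainline numpy these days):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, 2.0])

# Build a masked array whose mask hides the NaN entries
m = np.ma.masked_invalid(a)

# Masked-aware reductions simply ignore the masked values
print(np.ma.median(m))   # 2.0
print(m.mean())          # 2.0
```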

I was left fighting my way through various manual methods to get around my NaN problems, which led me to some weird properties of NaNs (they’re called “nan” in numpy):

>>> nan == nan
False
>>> nan in [float('nan')]
False
>>> nan is nan
True

So nan does not equal itself, and a list holding one nan need not contain another. The subtlety is that Python’s “in” (and list.remove) check object identity first and equality second, so they do find the very same nan object – but identity is not guaranteed here, because numpy can hand you distinct nan objects internally (a value read back out of an array need not be the nan singleton). All this means that you can’t rely on:

[nan].remove(nan)

because it may not remove anything. I ended up with something like this:

nonan = a[~isnan(a)]

which gives me all the non-nan values in the array (note that the inversion has to be the logical ~, not numpy’s arithmetic negative). Then I can reimplement median and so on over these filtered arrays. I had to do the same with the scipy.stats z-score function, which gives all nans if there are any nans in the input.
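A sketch of what those reimplementations might look like (the function names are mine, and the z-score is computed by hand with the population standard deviation, since the scipy.stats version returns all nans when any input is nan):

```python
import numpy as np

def nonan_median(a):
    """Median over the non-nan values only."""
    return np.median(a[~np.isnan(a)])

def nonan_zscores(a):
    """Z-scores over the non-nan values only."""
    nonan = a[~np.isnan(a)]
    return (nonan - nonan.mean()) / nonan.std()

a = np.array([1.0, np.nan, 3.0, 2.0])
print(nonan_median(a))   # 2.0
```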

All in all, it took me a week to ten days to pick through this mess. I have learned quite a bit about numpy, which has many nice features. By the end, though, I was made to eat humble pie – almost all the time in my application is spent reading data from the microarray files and writing out wiggle files for the UCSC genome browser.

This just shows that you should always profile before you start optimising…

September 30, 2008 at 3:30 am

Thanks for the useful information.

July 5, 2009 at 8:16 pm

Hmm. Makes you wonder if it really does make sense to use the Fortran convention of using -999.99 (or whatever extremal value happens to fit in the fixed-width field) for Data Not Present values.

July 7, 2009 at 11:18 am

I’d still need code to remove the -999.99 before taking means/medians and doing other housekeeping. Also, when debugging it’s not as easy to spot a “blank” float in a big pile of data-carrying floats as it is to see the nans.
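The two kinds of housekeeping side by side, roughly (the -999.99 sentinel being the hypothetical Fortran-style convention from the comment above):

```python
import numpy as np

SENTINEL = -999.99

s = np.array([1.0, SENTINEL, 3.0])
n = np.array([1.0, np.nan, 3.0])

# Sentinel: filter by equality, and hope no real signal ever hits -999.99
valid_s = s[s != SENTINEL]

# NaN: filter with isnan, which can never collide with a real measurement
valid_n = n[~np.isnan(n)]

print(valid_s.mean(), valid_n.mean())   # 2.0 2.0
```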

Of course, there are those who think that using nans in this way is evil:

http://mail.scipy.org/pipermail/numpy-discussion/2008-September/037503.html

July 7, 2009 at 6:24 pm

Oh yeah. I don’t think it makes the housekeeping any less. But I think it might be clearer _that_ you need to do it, and it might be clearer _how_ to do it.