numpy and NaN again

As a continuation of my heatmap work, I’ve been normalising the interaction frequencies against the density of other biological features. In particular, we are more likely to detect interactions around the restriction enzyme we have used to separate the interacting partners after cross-linking. In our case, this is HindIII (which my biological colleagues pronounce “hindi – 3”). Therefore, we’d like to normalise against this density – if we pick up interactions in a region of low hindIII density, we should consider this more significant than in an area of high hindIII density.

First of all, we compute the density of hindIII sites at the same resolution as our heatmaps – 1mb. We then divide each density by the median hindIII density; this gives us the relative density of each 1mb region to the median. Each interaction is between two 1mb regions and is more likely to occur based on the density of the hindIII sites in each of these regions. Therefore, we want to normalise each spot on the heatmap by the product of the relative density of hindIII from the two regions.

We can do this conveniently in numpy, because numpy gives us native matrix operations. Our heatmap has one chromosome on the x axis and another on the y axis. We take the relative hindIII densities for the y axis chromosome and turn this into a column vector by doing:

ypoints.reshape(-1, 1)

Then we can generate a matrix of products with the x axis densities by doing:

normmatrix = xpoints * ypoints.reshape(-1, 1)

To satisfy myself that this produced sensible output, I did this:

>>> from numpy import arange
>>> arange(10).reshape(-1,1) * arange(10)
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
[ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27],
[ 0, 4, 8, 12, 16, 20, 24, 28, 32, 36],
[ 0, 5, 10, 15, 20, 25, 30, 35, 40, 45],
[ 0, 6, 12, 18, 24, 30, 36, 42, 48, 54],
[ 0, 7, 14, 21, 28, 35, 42, 49, 56, 63],
[ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72],
[ 0, 9, 18, 27, 36, 45, 54, 63, 72, 81]])

Then normalise the heatmap matrix by doing:

normalised = matrix / normmatrix

Which is very concise and intuitive. However, the title of this post is about tangling with NaNs again.

At various points, the density of hindIII sites drops to 0. This means that we’ll be dividing by zero for some of the matrix cells. Exactly what numpy does about this is subject to some slightly odd variations, in particular when our heatmap matrix also contains 0 at that position, so we’ll be dividing zero by zero.

In regular Python, here’s what happens when you divide zero by zero:

>>> 0 / 0
Traceback (most recent call last):
File "", line 1, in
ZeroDivisionError: integer division or modulo by zero
>>> 0. / 0
Traceback (most recent call last):
File "", line 1, in
ZeroDivisionError: float division
>>> 0 / 0.
Traceback (most recent call last):
File "", line 1, in
ZeroDivisionError: float division

Note that in each case, we get a ZeroDivisionError exception, but when either the denominator or numerator is a float, the error message is slightly different. If we do the same inside a numpy array:

>>> from numpy import array
>>> array([0]) / array([0])
>>> array([0.]) / array([0])
array([ NaN])
>>> array([0]) / array([0.])
array([ NaN])

This is particularly odd because:

>>> array([2]) / array([0])
>>> array([2.]) / array([0])
array([ Inf])
>>> array([2.]) / array([0.])
array([ Inf])
>>> array([0.]) / array([0.])
array([ NaN])

So integers divided by zero in numpy give 0, floating point numbers give infinite and 0 (as a float) gives a NaN. This seems pretty odd to me, but then I though the other nan behaviour was weird too, and that turned out to be based on the standard.

To get sensible heatmaps from this, I have to filter out both infs and nans and replace them with 0s. Filtering inf is easy, using numpy’s selective assignment:

normalised = [normalised==numpy.inf] = 0

but you can’t do this with nan, because numpy.nan != numpy.nan. Fortunately, numpy provides a function to do just what I’m after:

normalised = numpy.nan_to_num(normalised)


Tags: ,

5 Responses to “numpy and NaN again”

  1. Alex Simoes Says:

    You have no idea how much time this just saved me! Thank you so much for documenting this.

  2. William Says:

    Thank you so much! I was pulling my hair out over filtering NaNs from a numpy array. They weren’t equal to anything! Yay for numpy.isnan.

  3. Ben Root Says:

    It should be expected that the integer division would not yield a NaN or an Inf, because integers do not support those concepts. Only IEEE Floating point numbers do.

    Any non-zero number divided by zero is defined to be infinity. However, because it makes no sense for nothing to go into nothing (could be done infinite number of times, or once, or none), that is a special case that is undefined in mathematics. Therefore, it is a NaN.

    Nothing you experienced here is out of the ordinary for any scientific computing tool such as Matlab, R, and Octave.

  4. Mr E Says:

    Just a small issue – the line

    normalised = [normalised==numpy.inf] = 0

    Should be

    normalised[normalised==numpy.inf] = 0

  5. watson ix Says:

    yes Mr E!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: