Archive for July, 2010

numpy and NaN again

July 30, 2010

As a continuation of my heatmap work, I’ve been normalising the interaction frequencies against the density of other biological features. In particular, we are more likely to detect interactions around the sites of the restriction enzyme we have used to separate the interacting partners after cross-linking. In our case, this is HindIII (which my biological colleagues pronounce “hindi – 3”). Therefore, we’d like to normalise against this density – if we pick up interactions in a region of low HindIII density, we should consider them more significant than interactions in an area of high HindIII density.

First of all, we compute the density of HindIII sites at the same resolution as our heatmaps – 1Mb. We then divide each density by the median HindIII density; this gives us the density of each 1Mb region relative to the median. Each interaction is between two 1Mb regions, and its likelihood depends on the density of HindIII sites in each of those regions. Therefore, we want to normalise each spot on the heatmap by the product of the relative HindIII densities of the two regions.
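Computing the relative densities is a one-liner in numpy. Here’s a minimal sketch, assuming density is a numpy array holding the HindIII site count for each 1Mb region (the numbers and variable names are made up for illustration):


import numpy

density = numpy.array([12., 8., 3., 15., 9.])  # HindIII sites per 1Mb region
relative = density / numpy.median(density)     # density relative to the median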

We can do this conveniently in numpy, because numpy gives us native matrix operations. Our heatmap has one chromosome on the x axis and another on the y axis. We take the relative hindIII densities for the y axis chromosome and turn this into a column vector by doing:


ypoints.reshape(-1, 1)

Then we can generate a matrix of products with the x axis densities by doing:


normmatrix = xpoints * ypoints.reshape(-1, 1)

To satisfy myself that this produced sensible output, I did this:


>>> from numpy import arange
>>> arange(10).reshape(-1,1) * arange(10)
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18],
       [ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27],
       [ 0,  4,  8, 12, 16, 20, 24, 28, 32, 36],
       [ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45],
       [ 0,  6, 12, 18, 24, 30, 36, 42, 48, 54],
       [ 0,  7, 14, 21, 28, 35, 42, 49, 56, 63],
       [ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72],
       [ 0,  9, 18, 27, 36, 45, 54, 63, 72, 81]])
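Incidentally, this column-times-row product is just the outer product of the two density vectors, so numpy.outer gives the same matrix:


>>> from numpy import arange, outer
>>> (outer(arange(10), arange(10)) == arange(10).reshape(-1,1) * arange(10)).all()
True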

Then normalise the heatmap matrix by doing:


normalised = matrix / normmatrix

This is very concise and intuitive. However, the title of this post is about tangling with NaNs again.

At various points, the density of hindIII sites drops to 0. This means that we’ll be dividing by zero for some of the matrix cells. Exactly what numpy does about this is subject to some slightly odd variations, in particular when our heatmap matrix also contains 0 at that position, so we’ll be dividing zero by zero.

In regular Python, here’s what happens when you divide zero by zero:


>>> 0 / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> 0. / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
>>> 0 / 0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
>>>

Note that in each case, we get a ZeroDivisionError exception, but when either the denominator or numerator is a float, the error message is slightly different. If we do the same inside a numpy array:


>>> from numpy import array
>>> array([0]) / array([0])
array([0])
>>> array([0.]) / array([0])
array([ NaN])
>>> array([0]) / array([0.])
array([ NaN])

This is particularly odd because:


>>> array([2]) / array([0])
array([0])
>>> array([2.]) / array([0])
array([ Inf])
>>> array([2.]) / array([0.])
array([ Inf])
>>> array([0.]) / array([0.])
array([ NaN])

So in numpy, integers divided by zero give 0, non-zero floats divided by zero give inf, and a float zero divided by zero gives NaN. This seems pretty odd to me, but then I thought the other NaN behaviour was weird too, and that turned out to be based on the standard.
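For what it’s worth, numpy lets you change this policy with numpy.seterr – the divide-by-zero and invalid (zero-by-zero) cases can be made to warn or raise instead of silently producing inf and NaN:


>>> import numpy
>>> old = numpy.seterr(divide='raise', invalid='raise')
>>> numpy.array([2.]) / numpy.array([0.])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FloatingPointError: divide by zero encountered in divide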

To get sensible heatmaps from this, I have to filter out both infs and nans and replace them with 0s. Filtering inf is easy, using numpy’s selective assignment:


normalised[normalised == numpy.inf] = 0

but you can’t do this with nan, because numpy.nan != numpy.nan. Fortunately, numpy provides a function to do just what I’m after:


normalised = numpy.nan_to_num(normalised)
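(The selective assignment does work if you test with numpy.isnan instead of ==, so normalised[numpy.isnan(normalised)] = 0 would do the same job.) One caveat with nan_to_num: it replaces inf too, but with the largest representable float rather than 0, which is another reason to zero out the infs separately first:


>>> import numpy
>>> numpy.nan_to_num(numpy.array([numpy.nan, numpy.inf]))
array([  0.00000000e+000,   1.79769313e+308])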

Chromosome Interactions, Heatmaps and matplotlib

July 5, 2010

I’m currently working on some chromosome conformation data based on 3C and related technologies. These tools allow you to look at the physical structure of the nucleus and determine how chromosomes are positioned in three dimensions, and where regions of one chromosome physically touch regions of another. This touching affects the regulation of the genes in those regions.

The earlier iterations of these techniques are based on looking at interactions from a particular region or locus, to study the interaction with a specific set of genes. Later techniques allow a more flexible whole-genome approach. We are working with Hi-C data, which combines 3C/4C methods with paired-end sequencing, so that each end of the pair is at a different point on the genome.

After processing, this data is a neat list of interacting pairs, with each end as a chromosome coordinate. Our data produces between 2 and 10 million pairs. Most pairs are intra-chromosome (also called “cis” interactions) but around 5% are inter-chromosome (“trans” interactions).

We now need a method to visualise this data. The prevalent method is the heatmap. The interaction points are grouped together into chromosomes, or regions of 100kb or 1Mb (we could think of this as quantization). Interacting regions are marked out in a 2-dimensional matrix and then drawn as a heatmap.
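Building the binned matrix only takes a few lines. Here’s a sketch, assuming pairs is a list of (x, y) base-pair coordinates on the two chromosomes – the function name and input format are just for illustration:


import numpy

BINSIZE = 1000000  # 1Mb bins

def bin_pairs(pairs, xbins, ybins):
    # count each interacting pair into its pair of 1Mb bins
    heatmap = numpy.zeros((ybins, xbins))
    for x, y in pairs:
        heatmap[y // BINSIZE][x // BINSIZE] += 1
    return heatmap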

For chromosome-chromosome heatmaps we might also normalise each spot based on the number of expected interactions. This is based on the length of the chromosomes – longer chromosomes should give more interactions on average. We also tend to remove cis-interactions – by reducing them to 0 – since otherwise they dominate and we can’t see the relative strengths of the trans-interactions.
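Zeroing the cis-interactions amounts to clearing the diagonal of the chromosome-by-chromosome matrix, which numpy.fill_diagonal handles directly (a sketch with a stand-in matrix):


import numpy

values = numpy.arange(16.).reshape(4, 4)  # stand-in interaction matrix
numpy.fill_diagonal(values, 0)            # clear the diagonal, i.e. the cis-interactions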

Given the pairs of interactions, it’s easy enough to build the matrix that represents the interactions. For rendering it, matplotlib provides quite a few options. I’ve generated some sample data and shown a few of the options below, using the following code:

import pylab

chromnames = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY']
# 24x24 chromosome-by-chromosome matrix of interaction counts
values = pylab.zeros((24, 24))

# a few symmetric sample interactions (mirrored about the diagonal)
values[2][6] = 3
values[6][2] = 3
values[23][1] = 13
values[1][23] = 13
values[5][2] = 5
values[2][5] = 5
values[12][14] = 11
values[14][12] = 11
values[19][9] = 6
values[9][19] = 6


# matshow() creates its own figure, so no explicit figure is needed

#pylab.pcolor(values)
#imagename = "pcolor.png"

# pylab.imshow(values)
# imagename = "imshow.png"

pylab.matshow(values)
imagename = "matshow.png"


pylab.xticks(range(len(chromnames)), chromnames, rotation=90)
pylab.yticks(range(len(chromnames)), chromnames)

pylab.savefig(imagename)

My first stab was to use imshow(), but this bleeds each coloured blob into the surroundings, which looks nice but makes it hard to read the individual blobs:

imshow()
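The bleeding comes from imshow’s default interpolation; asking for nearest-neighbour interpolation would have kept each cell as a solid block:


pylab.imshow(values, interpolation='nearest')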

There is also a pcolor() function, but this plots from bottom to top (as you might expect), which is not ideal for what we want here. I’d have to turn the data and the y axis labelling upside down, which is a little awkward:

pcolor()
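Something like this would have done it – reversing the rows and the y labels, with a 0.5 offset to centre each label on its cell – which shows why I didn’t bother:


pylab.pcolor(values[::-1])
pylab.yticks(pylab.arange(len(chromnames)) + 0.5, chromnames[::-1])
pylab.xticks(pylab.arange(len(chromnames)) + 0.5, chromnames, rotation=90)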

I finally plumped for matshow() (which stands for “matrix show”), which does things in the right order with solid blobs:

matshow()

However, it’s clearly not without its problems. The xtick labels have not rotated, despite the rotation=90 argument, so I had to do it manually, like this:

im = pylab.matshow(values)
for label in im.axes.xaxis.get_ticklabels():
	label.set_rotation(90)

You might also be able to see that the top and bottom rows are half-height – I had to correct this by adding dummy row labels here. I don’t know why this happens.

There are various other problems with the other plots too. I think it would be safe to say that matplotlib is a powerful tool, but if you want plots done in a particular way, it can take a lot of documentation bashing to get there. As always, I usually feel that the way I want it should be the standard and it should do that in the first place, but I can never tell how unreasonable I am being when I say this.

An open question here is whether heatmaps are actually the best way to visualise this data. I’m also working on a student project at the moment where we use a graph-based visualisation instead. My theory is that we can use a spring embedder algorithm to automatically lay out the graph of interactions between a pair of chromosomes and show their conformation. I think the usual wisdom here is that no visualisation is “right” – best to have several at your disposal to allow you to look at the data in a variety of different ways.