Chromosome Interactions, Heatmaps and matplotlib

July 5, 2010

I’m currently working on some chromosome conformation data based on 3C and related technologies. These tools allow you to look at the physical structure of the nucleus and determine how chromosomes are positioned in 3 dimensions, and where regions of one chromosome physically touch regions of another. These contacts affect the regulation of the genes in those regions.

The earlier iterations of these techniques are based on looking at interactions from a particular region or locus, to study the interaction with a specific set of genes. Later techniques allow a more flexible whole-genome approach. We are working with Hi-C data, which combines 3C/4C methods with paired-end sequencing, so that each end of the pair is at a different point on the genome.

After processing, this data is a neat list of interacting pairs, with each end as a chromosome coordinate. Our data produces between 2 and 10 million pairs. Most pairs are intra-chromosome (also called “cis” interactions) but around 5% are inter-chromosome (“trans” interactions).

We now need a method to visualise this data. The prevalent method is the heatmap. The interaction points are grouped together into chromosomes, or regions of 100kb or 1mb (we could think of this as quantization). Interacting regions are marked out in a 2-dimensional matrix and then drawn as a heatmap.
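That quantization step is straightforward: bin each end of each pair and count. A minimal sketch (the pair format and bin size here are invented for illustration):

```python
import numpy

binsize = 1000000  # 1mb bins
nbins = 250        # enough bins to cover the longest chromosome

# hypothetical interacting pairs: (coordinate on chromosome A, coordinate on chromosome B)
pairs = [(1500000, 73000000), (1700000, 73400000), (99000000, 2000000)]

matrix = numpy.zeros((nbins, nbins))
for coorda, coordb in pairs:
    # quantize each coordinate into its bin
    bina = coorda // binsize
    binb = coordb // binsize
    # count the interaction symmetrically so the matrix can be read either way
    matrix[bina][binb] += 1
    matrix[binb][bina] += 1
```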

For chromosome-chromosome heatmaps we might also normalise each spot based on the number of expected interactions. This is based on the length of the chromosomes – longer chromosomes should give more interactions on average. We also tend to remove cis-interactions – by reducing them to 0 – since otherwise they dominate and we can’t see the relative strengths of the trans-interactions.
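That normalisation might be sketched like this, assuming the expected count for a chromosome pair is proportional to the product of the two chromosome lengths (the lengths and counts below are invented):

```python
import numpy

lengths = numpy.array([250.0, 150.0, 100.0])  # chromosome lengths in mb
counts = numpy.array([[900.0, 30.0, 20.0],
                      [30.0, 400.0, 12.0],
                      [20.0, 12.0, 200.0]])

# expected[i][j] is proportional to lengths[i] * lengths[j]
expected = numpy.outer(lengths, lengths)
normalised = counts / expected

# zero the cis-interactions (the diagonal) so they don't dominate the colour scale
numpy.fill_diagonal(normalised, 0)
```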

Given the pairs of interactions, it’s easy enough to build the matrix that represents them. Rendering this in matplotlib provides quite a few options. I’ve generated some sample data and show a few of the options below, using the following code:

import pylab

chromnames = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY']
values = pylab.zeros((24, 24))

values[2][6] = 3
values[6][2] = 3
values[23][1] = 13
values[1][23] = 13
values[5][2] = 5
values[2][5] = 5
values[12][14] = 11
values[14][12] = 11
values[19][9] = 6
values[9][19] = 6

# pylab.pcolor(values)
# imagename = "pcolor.png"

# pylab.imshow(values)
# imagename = "imshow.png"

pylab.matshow(values)
imagename = "matshow.png"

pylab.xticks(range(len(chromnames)), chromnames, rotation=90)
pylab.yticks(range(len(chromnames)), chromnames)

pylab.savefig(imagename)


My first stab was to use imshow(), but this bleeds each coloured blob into the surroundings, which looks nice but makes it hard to read the individual blobs:


There is also a pcolor() function, but this plots from bottom to top (as you might expect) which is not ideal for what we want here. I’d have to turn the data and y axis labelling upside down, which is a little awkward:
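For the record, the pcolor() workaround would look something like this (a sketch against the same sample matrix; I haven’t tested it on every matplotlib version):

```python
import pylab

chromnames = ['chr%d' % i for i in range(1, 23)] + ['chrX', 'chrY']
values = pylab.zeros((24, 24))

# flip the matrix vertically so that row 0 is drawn at the top...
pylab.pcolor(values[::-1])
# ...and reverse the y tick labels to match the flipped data
pylab.yticks(range(len(chromnames)), chromnames[::-1])
pylab.xticks(range(len(chromnames)), chromnames, rotation=90)
```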


I finally plumped for matshow() (which stands for “matrix show”), which does things in the right order with solid blobs:


However, it’s clearly not without its problems. The xtick labels have not rotated, so I had to do that manually, like this:

im = pylab.matshow(values)
for label in im.axes.xaxis.get_ticklabels():
	label.set_rotation(90)

You might also be able to see that the top and bottom rows are half-height – I had to correct this by adding dummy row labels here. I don’t know why this happens.

There are various other problems with the other plots too. I think it would be safe to say that matplotlib is a powerful tool but if you want plots in a particular way, it can take a lot of documentation bashing to get there. As always, I usually feel that the way I want it should be the standard and it should do that in the first place, but I can never tell how unreasonable I am being when I say this.

An open question here is whether heatmaps are actually the best way to visualise this data. I’m also working on a student project at the moment where we use a graph-based visualisation instead. My theory is that we can use a spring embedder algorithm to automatically lay out the graph of interactions between a pair of chromosomes and show their conformation. I think the usual wisdom here is that no visualisation is “right” – best to have several at your disposal to allow you to look at the data in a variety of different ways.

Teaching Programming to Biologists

January 28, 2010

I’ve just finished delivering a pilot version of a one-day course giving biologists a “taster” of programming.

The course came about, as is the way of these things, when I overheard a conversation in a meeting about something entirely different. The University is buying a license for a bioinformatics toolkit called CLCBio. You can write your own plugins for this tool, in Java. Somebody else at the meeting joked that they would need to teach all their postgraduate students Java. I could tell that they weren’t really joking, so I mentioned my software engineering for scientists course. I said I was thinking of writing a more fundamental course and they said they were interested.

It turns out, there was a lot of interest in such a course. Both the biological science and vet faculties have been thinking about providing this training to both their PhD and masters students, which could be well over 100 students a year. As with the software carpentry course, we also had support from the training people who provided the facilities.

There was quite a bit of to-ing and fro-ing about the format for the course. There are already a number of programming courses out there, some of them even for bioinformatics. The problem is that all the courses are several days long and are somewhat pedantic about introducing variables, loops, data structures and so on. I wanted something that could show some of the things programming can achieve in biology with a bit more immediacy – a show-off tour of some sort. I was also keen to keep it short because biologists don’t all need to program, and many of them might like to have a look without getting too serious.

I also wanted to provide inspiration to the small number of people who had a natural affinity. My experience is that some people see programming and just get it straight away – they even find it enjoyable. Maybe 20% of people might fall into this category.

So the format we came up with was a one-day taster course, including a rapid tour of some cool stuff. The 20% who really got it can go on and do something more thorough. Those who really hate it are only trapped for one day. This is one of the advantages of a postgraduate course like this. People choose their own research area so they can pick up or drop programming with no harm to their careers. You can aim to inspire the top and not worry that you’re terrifying the bottom, because they can ignore it if they don’t like it.

My analogies for this tactic got more and more brutal as things went on: basically you drop all the students out of a helicopter into a high sea. Then you come back 15 minutes later and pick up all those who haven’t yet drowned.

After we had things set up, we sent out an email advertising the course (I didn’t mention the helicopter thing) and were oversubscribed 3 times over within 2 days. That seemed like a good start – at least people are keen.

The course itself was based on Perl, which is still the lingua franca of bioinformatics, for better or worse (mostly worse). Perl in bioinformatics is a rant for another day. I had 4 sessions:

  1. Basic principles
  2. Sequence processing
  3. CSV file processing
  4. BioPerl

It’s a lot to pack into one day, especially since I covered loops, arrays and hashes in the first two sessions. As expected, some people found this completely overwhelming and were still struggling with even declaring variables by the end of the day. However, some really excelled, and by the end had written code to parse a CSV file, take the mean of a group of columns and print only the rows where that mean is above a certain threshold. You could see that some of them were starting to get that excitement from programming that comes from being able to tackle problems and see what can be achieved.
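That final exercise can be sketched like this (in Python rather than the Perl we used on the course; the column layout and threshold are invented):

```python
import csv
from io import StringIO

# hypothetical input: a gene name followed by three replicate measurements
data = """geneA,1.0,2.0,3.0
geneB,10.0,11.0,12.0
geneC,4.0,5.0,6.0
"""

threshold = 5.0
kept = []
for row in csv.reader(StringIO(data)):
    name = row[0]
    measurements = [float(v) for v in row[1:]]
    mean = sum(measurements) / len(measurements)
    # only keep the rows where the mean is above the threshold
    if mean > threshold:
        kept.append(name)

print(kept)
```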

There was a professor present who loved the course and is very keen to roll it out to all the postgrads we had discussed. We’ll push it through a few more pilot stages first, but it’s gratifying to know that this is somewhere I can make a useful contribution. In fact, as the plaudits rolled in, I realised just how much demand there is and have now started asking whether they could pay me a bit more! It’s a great course and I enjoyed teaching it, but I’ve had the University system take advantage of my good nature in the past and I’m not keen to repeat the experience.

We’re also thinking about how to make sure this course can be deployed to these numbers. It’s already at a stage where the project is too big for just me, so we’ll need to recruit other tutors, and preferably a University department to look after things if and when I leave. The obvious choice is computing science, but they seem very reluctant. Complaining about this is probably also a rant for another day.

So what have I learned from all this:

  1. Biologists want to learn how to program. No, really, they do.
  2. By allowing the weaker students to sink, you can get a lot done with the stronger students.
  3. You can cover a decent amount of programming in a day. No really, you can.

Oh, and Perl still sucks. It also really sucks as a teaching language. Try explaining to people who an hour ago had never seen a variable declaration what “array variable in a scalar context” means. Sigh.

Genome Browser Snapshots

December 22, 2009

We use the UCSC Genome Browser for a wide variety of genomic visualisation tasks. I spend a fair proportion of my time preparing tracks for UCSC, including (but not limited to):

  • Solexa runs.
  • Restriction enzyme digestion sites.
  • Marking particular genes (and also occasionally using a colour map to indicate expression level.)

I’ve also implemented a navigation frame (hello, late 90s technology) so you can click through different regions. It’s a bit of a hack, but has been very useful to some of my colleagues, so on the whole a worthwhile bit of code.

The UCSC browser has a very clear URL based interface to uploading tracks and selecting the proper genome build for working with. You can put a track on a webserver and reference it by its URL to upload it. You can also reference a URL full of other URLs to upload many tracks at once, which is really handy. It also has a neat session file feature, which allows you to save the other tracks you want displayed below your own data.

Most of the time, uploading the tracks to UCSC and browsing manually is the right way to interact with the data. However, as always, it’s sometimes useful to be able to script the visualisations. So I wrote a class that allows you to reference a session and track URL and retrieve a PDF file with a snapshot of the browser with the chosen data.

According to the UCSC techs, there is a limit of one hit every 15 seconds and 5,000 images a day.

#!/usr/bin/env python

import urllib
import re
import os
import sys

# gets a screenshot from UCSC of a specified bed file
# a lot of the work is done in the bed file itself
# and a session file, which tells UCSC what tracks should display and how.
# Upgrades of this code might include exposing a few of those session parameters
# through the object

baseurl = ""
pagename = "hgTracks"
convertcmd = "/usr/bin/convert"

class UCSCTools:
	def __init__(self, params=None):
		self.params = params
	# params is a dictionary of url parameters
	def setParams(self, params):
		self.params = params

	def setOneParam(self, param, value):
		self.params[param] = value

	def removeParam(self, param):
		if param in self.params:
			del self.params[param]

	def getImageURL(self):
		self.setOneParam("hgt.psOutput", "on")
		return self._getURL()
	def getLinkURL(self):
		return self._getURL()

	def _getURL(self):
		paramsenc = urllib.urlencode(self.params)
		url = "%s%s?%s" % (baseurl, pagename, paramsenc)
		return url

	def dumpPDF(self, pdffile):
		url = self.getImageURL()
		imgpagestring = urllib.urlopen(url).read()
		self._getPDFFromImagePage(imgpagestring, pdffile)

	def dumpPDFAndPNG(self, filebase):
		pdffile = filebase + ".pdf"
		self.dumpPDF(pdffile)
		pngfile = self.convertImage(pdffile, "png")
		return pdffile, pngfile

	# do we need both of these parameters?
	def setSessionDetails(self, sessionurl):
		self.setOneParam("hgS_doLoadUrl", "submit")
		self.setOneParam("hgS_loadUrlName", sessionurl)

	# in this case, bedurl could be a single bed file or
	# a url with a list of urls containing the bed files we want to upload
	def setBEDFiles(self, bedurl):
		self.setOneParam("hgt.customText", bedurl)

	def convertImage(self, inpath, outtype):
		# os.path.splitext handles extension removal for us
		basepath = os.path.splitext(inpath)[0]
		outpath = basepath + "." + outtype
		# magic density value gives us the right appearance for a web output
		cmd = "%s -density 125 %s %s" % (convertcmd, inpath, outpath)
		print "executing convert command:", cmd
		os.system(cmd)
		return outpath

	# private
	def _getPDFLinkFromImagePage(self, pagestring):
		pdfre = 'HREF="([^"]+\.pdf)"'
		match =, pagestring)
		return

	# private
	def _getPDFFromImagePage(self, pagestring, pdffile):
		link = self._getPDFLinkFromImagePage(pagestring)
		url = baseurl + link
		pdfdata = urllib.urlopen(url).read()
		pdffh = open(pdffile, "w")
		pdffh.write(pdfdata)
		pdffh.close()

if __name__ == "__main__":
	if len(sys.argv) != 3:
		print "usage: ./ <sessionfile> <bedfileurl>"
		sys.exit(1)
	# these should both be publicly available urls
	sessionurl = sys.argv[1]
	bedurl = sys.argv[2]
	params = { "hgt.psOutput" : "on" }
	ut = UCSCTools(params)
	ut.setSessionDetails(sessionurl)
	ut.setBEDFiles(bedurl)
	print ut.getLinkURL()
#	ut.dumpPDFAndPNG("tester")

(I know I should find a better way to distribute my code than this. Sorry.)

Teaching Programming in the Real World

July 15, 2009

I’m currently reading Code Complete, the software engineering book I have had recommended to me most often. It’s good, although like many of these books it makes me feel guilty about some of the sloppy practices I currently engage in.

I was just struck by a passage from the design chapter, talking about how requirements and an understanding of the problem change as development proceeds. In particular, it addresses “wicked” problems, where the problem only becomes really apparent after a first solution has failed to solve it. The example is the famous Tacoma Narrows Bridge, where the designers did not realise wind across the bridge would cause a destructive harmonic ripple; the problem was not adequately specified until they had a solution that did not solve it.

The passage reads like this:

One of the main differences between programs you develop in school and those you develop as a professional is that the design problems solved by school programs are rarely, if ever, wicked. Programming assignments in school are devised to move you in a beeline from beginning to end. You’d probably want to tar and feather a teacher who gave you a programming assignment, then changed the assignment as soon as you finished the design, and then changed it again just as you were about to turn in the completed program. But that very process is an everyday reality in professional programming.

It sure is. Shouldn’t we be teaching that?

Analysing Solexa data with MAQ

July 14, 2009

We are just starting to use Illumina Solexa sequencing to analyse protein/DNA interactions, also known as ChIP-seq.

As I understand it, when you buy a Genome Analyser machine from Illumina, it does come with software to align the reads and allow basic analysis of them. However, the attitude in my lab is that the real experts are at the Sanger Institute in Cambridge and we should use whatever they use. We have consulted with my boss’s colleagues there to make sure we do it in exactly the same way as them. The tool they use is developed at the Sanger (or the EBI, I’m not sure) and is called MAQ.

In fact, even before you use MAQ, the data that comes out of a Solexa machine has been through a few other analysis stages and arrives as a (frankly incomprehensible) text file, with reads annotated based on their reliability. Here are a few lines from a sample file:


The pipeline from here is actually quite well explained in the MAQ documentation, but to make sure, I verified each step with our Sanger expert colleagues. I also checked the parameters they use, in particular at the map stage; these happen to be the defaults for maq.

Basically, there are two strands – the reference genome and the solexa file itself. The reference genome goes through the following stages:

  1. Download the genome as .fasta files. I do this by chromosome using a Perl script and the Ensembl API.
  2. Convert the .fasta files into a reference file (.bfa) using “maq fasta2bfa”.

Now you have a reference genome in indexed (.bfa) form, you can use it to generate your alignments:

  1. Convert the raw Solexa files into .fastq (or sanger) files using “maq sol2sanger”.
  2. Convert these into indexed .bfq files using “maq fastq2bfq”.
  3. Create an alignment map, based on the .bfq file and the reference genome .bfa file, using “maq map”. For the record, the settings used by our Sanger experts are “-n 2 -m 0.001 -e 70 -a 250”, which are the defaults described in the maq man page.

The map file represents your alignments. Now you can use maq to either output these as raw alignments (each read with its genomic coordinates) or you can generate “pileup” files, where each genomic coordinate is output with the number of reads that overlap that coordinate.

The pileup method seems to make some of my colleagues nervous, because not that many people seem to be using it and you can’t “see” where each individual read has gone. However, if you only want to see where the most protein/DNA interaction is taking place and view it as (say) a wiggle file, the pileup method is ideal.

The only difficulty is that because it generates one line of text for each genomic coordinate, you get 3 billion lines in total and a really massive file – it got to 12GB and then I told it to give up. It also exposed (as bioinformatics tends to do) the limitations of the operating system, because my machine was grinding like mad as Linux was making a lot of fuss about generating a file that big. At some point, I’ll get somebody geekier than me to explain that to me.

In any case, most of the lines from a pileup are not interesting because most coordinates have no overlapping reads. We can attach a pipe process to discard these and get a much more manageable file size. Even so, extracting pileup data from a map is much slower than generating alignments: alignments take a few seconds, pileups tens of minutes.

Below is some Python code that manages the whole pipeline. It uses the multiprocessing module (new to 2.6!) to divide the work on a multi-core machine. I’ve taken the view that using the standard library is a better idea than something 3rd party like parallel python.

I haven’t tuned this to work out the best number of cores to use. The most expensive step is the pileup stage, which needs one process for maq and another for the pipe filter, so I don’t know whether using all 4 cores of a quad-core machine would force the jobs to keep swapping. Using 3 cores, it doesn’t run out of memory and can process 10 files in 2 or 3 hours.

#!/usr/bin/env python

import glob
from multiprocessing import Pool, current_process
from subprocess import Popen, PIPE
import os
import csv

# base directory for the solexa input files
basedir = "/home/pzs/solexa/run2/"
# the glob to identify the Solexa input files
globfile = "*.txt"
# location of the reference genome, already converted into .bfa format
genomebfa = "/home/pzs/genebuilds/human/copy/allhuman.bfa"
# the minimum size of a peak to output in the pileup option 
minpeaksize = 1
# number of cores to use
ncpus = 3
maqpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq"
maqplpath = "/home/pzs/src/maq-0.6.6_i686-linux/"

def sol2fastq(solfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	outpath = solfile + ".fq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s sol2sanger %s %s" % (maqpath, solfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath

def fastq2bfq(fastqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = fastqfile.split(".")[0]
	outpath = filebase + ".bfq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s fastq2bfq %s %s" % (maqpath, fastqfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath

def bfq2map(bfqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = bfqfile.split(".")[0]
	outpath = filebase + ".map"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s map %s %s %s" % (maqpath, outpath, genomebfa, bfqfile)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath

def pilePeaks(mappath):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = mappath.split(".")[0]
	outpath = "%s-PeakPile-%d.pp" % (filebase, minpeaksize)
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s pileup %s %s" % (maqpath, genomebfa, mappath)
	print "core:", mycore, "running command:", cmd
	p = Popen(cmd, shell=True, bufsize=100, stdout=PIPE)
	reader = csv.reader(p.stdout, delimiter="\t")
	writer = csv.writer(open(outpath, "w"), delimiter="\t")
	for row in reader:
		height = int(row[3])
		if height > minpeaksize:
			writer.writerow(row)
	return outpath

solpath = basedir + globfile
solfiles = glob.glob(solpath)

pool = Pool(processes=ncpus)
fqfiles =, solfiles)
bfqfiles =, fqfiles)
mapfiles =, bfqfiles)
ppfiles =, mapfiles)

print ppfiles

Illumina Solexa hype

July 14, 2009

(I was just thinking about a post about MAQ and this aside turned into a post of its own…)

Several people in my lab have the attitude that arrays are obsolete technology, and Solexa sequencing will be entirely replacing it over the coming years. My boss may or may not agree.

In any case, the Illumina people are trying to take full advantage of this wave of hype. Last week, I went to a day-long meeting in Edinburgh about the Illumina “product portfolio” and its potential applications. Just look at the punishing sales schedule! They had brought a half dozen marketing people with them – senior product manager this and marketing manager that – as well as a few local scientists who were using the kit. One guy who gave a talk had spent half his £2.2m budget on Illumina kit. Another was enthusing about how sequencing is now so quick and easy a masters student could now do it in a few months. Apparently, they’re also moving into the consumer sector and were talking about swapping genomes with an iPhone app. I’m afraid I face-palmed at that point.

It’s interesting stuff, but you always worry when there is a monopoly on this kind of technology. For some of the analysis we’ve done, the Solexa data seems to have a worrying bias relative to arrays, although this may of course be because we haven’t figured out how to do the analysis yet.

Anyway, high throughput sequencing is increasingly entering the public arena so I guess we had better get used to it.

EuroPython 2009

July 3, 2009

I’ve just got back from 3 days spent at the annual EuroPython conference, this year held in Birmingham. There were about 450 attendees and a wide variety of different talks from web development to printing money to science stuff.

This was my first time at a non-academic conference and I wasn’t really sure what to expect. Overall, I found it extremely useful, mostly just through being exposed to a diverse range of excellent programmers and seeing some of the ways in which they use their tools. The standard of talks was generally higher than at academic conferences (far fewer people randomly overrunning their time and not caring, for a start) and I also found the people more friendly and easy to talk to. In general, there was a lower freak quota, which makes it more bearable to spend 3 days surrounded by these people.

I realised quite early in the conference that I was surrounded by a large number of programmers who were more proficient in Python than I am, despite having used the language for nearly 10 years. Of course, this is intimidating at first, but I realised that this was a great opportunity to discover new ways of doing things and expand my horizons. During those 10 years almost everything I have learned about Python has been self-taught. This is good as far as it goes but I could advance a lot more quickly by more interaction with real experts, including pair programming, code reviews and just generally working with other people’s source code.

I’m saddened, but not surprised, that so few academic programmers attend these conferences; I would say that fewer than 10% of the attendees were academics. To me, something like EuroPython makes just as much of a contribution to my productivity as a conference with other scientists. Maybe I can persuade other computational scientists to come along with me next year.

Software Carpentry

June 11, 2009

Yesterday I delivered my first day-long Software Carpentry workshop, based on some slides I had botch-converted from the web page. I decided to do object oriented programming (both parts), version control, unit testing and debugging.

I used an IT training lab here at Glasgow with a stack of customised live CDs, which seemed to work reasonably well. I may have inadvertently converted a few people to Ubuntu which obviously wasn’t the aim of the course! The only problem was that the liveCD didn’t boot properly without graphical “safe mode” which meant a rather low 800×600 resolution. I had hoped to demonstrate Eric, but it doesn’t actually fit on the screen at 800×600.

I may also have converted a few people to Python, which can only be a good thing if it stops them using Fortran.

I did struggle with a few of my examples – I haven’t set up an SVN repository for ages and I kept making silly path errors – which just shows you have to test everything before the course runs.

Most of my students were from maths or engineering. The success was roughly bimodal: half of them were not strong enough programmers to get much from the course at all. They were mathematicians or engineers who had never seen Python before and could not adapt fast enough. The other half were stronger programmers who picked things up quickly and nodded sagely as I identified the less-than-optimal practices they were currently using and explained how modern techniques might benefit them.

The people who enjoyed it most were actually staff (all from maths), rather than students or post-docs. I told them Greg Wilson’s story of 3 of 38 faculty using version control and stressed how important it is for staff to lead from the front and insist their students and RAs use best practice and they seemed to buy into that. They also agreed that a software best-practice course should be compulsory for new students in subjects with a strong computational element.

I enjoyed giving the course, even though it was a really exhausting day. I’m meeting with the training people and hope to arrange further courses – maybe a “basic programming in Python” course for those poor Fortran programmers.

Computational Science

May 22, 2009

In a few weeks, I’ll be teaching a software engineering course to “computational scientists”: scientists (including mathematicians and statisticians) who use programming as an every-day tool. I’ve always found it interesting that there is so little support and teaching in good software practice for computational scientists. Even as somebody with an undergraduate and post-graduate degree in computing, almost all my practical programming skills are self taught.

Of course, I’m not alone in worrying about this. The software carpentry pages provide course materials for exactly this purpose. The person behind this, Greg Wilson, also has some other cool materials and opinions on his web page. I especially liked this opinion paper (rant?) about how little computer scientists understand this problem.

Apparently, in a straw poll in Greg’s department (computer science), he found that only 3 of the 38 faculty were using version control or encouraging their post-docs and PhD students to do so. I bet my local CS department is just as bad or worse.

I start to wonder whether computer science and software engineering are only tangentially related disciplines. If this is the case, are computer science departments the right place to look for advice and expertise about computational science?

Interval trees: even faster in cython

April 1, 2009

Cython is a clever Python extension that allows you to write Python modules using C data types. This improves performance – quite a lot, as it happens. The guy whose interval trees I was using recommended a Cython implementation that reduced my bin-filling time from 90 minutes to about 7. Yikes.

This actually reminds me of my previous career in formal methods and specifically model checking. When faced with a large computational problem, it’s very tempting to give up and look to faster hardware to give you execution that is twice or three times as fast. However, with a bit of thought and by knowing your tools (or in this case, knowing who to ask) you can get a much greater improvement without spending a penny…