Archive for July, 2009

Teaching Programming in the Real World

July 15, 2009

I’m currently reading Code Complete, the software engineering book I have had recommended to me most often. It’s good, although like many of these books it makes me feel guilty about some of the sloppy practices I currently engage in.

I was just struck by a passage from the design chapter, talking about how requirements and an understanding of the problem change as development proceeds. In particular, it addresses “wicked” problems, where the problem only becomes really apparent after a first solution has failed to solve it. The example is the famous Tacoma Narrows Bridge, where the designers did not realise wind across the bridge would cause a destructive harmonic ripple; the problem was not adequately specified until they had a solution that did not solve it.

The passage reads like this:

One of the main differences between programs you develop in school and those you develop as a professional is that the design problems solved by school programs are rarely, if ever, wicked. Programming assignments in school are devised to move you in a beeline from beginning to end. You’d probably want to tar and feather a teacher who gave you a programming assignment, then changed the assignment as soon as you finished the design, and then changed it again just as you were about to turn in the completed program. But that very process is an everyday reality in professional programming.

It sure is. Shouldn’t we be teaching that?

Analysing Solexa data with MAQ

July 14, 2009

We are just starting to use Illumina Solexa sequencing to analyse protein/DNA interactions, a technique known as ChIP-seq.

As I understand it, when you buy a Genome Analyzer machine from Illumina, it does come with software to align the reads and allow basic analysis of them. However, the attitude in my lab is that the real experts are at the Sanger Institute in Cambridge and that we should use whatever they use. We have consulted with my boss’s colleagues there to make sure we do it in exactly the same way as they do. The tool they use is developed at the Sanger (or the EBI, I’m not sure) and is called MAQ.

In fact, even before you use MAQ, the data that comes out of a Solexa machine has been through a few other analysis stages and arrives as a (frankly incomprehensible) text file, with each read annotated with per-base quality scores. Each read takes up four lines: an identifier, the sequence itself, a separator line repeating the identifier, and a string encoding the quality of each base. Here are a few lines from a sample file:


@HWI-EAS349:4:17:504:763#0/1
AACTNGGACACAGAGCCTGTCAGATCGGAAGAGCNCGTANNNNNNANTNNTNNN
+HWI-EAS349:4:17:504:763#0/1
ab`\DV[\WYW_^a\NU[^VUZP^[TX\BBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS349:4:17:533:887#0/1
CCTCNTCGAAAGATAAAAATTTACATACCCATAANGAGGNNNNNNGNANNANNN
+HWI-EAS349:4:17:533:887#0/1
Y`]TDTXV\bJ^JS`[`a`bY]b[UaaBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS349:4:17:534:420#0/1
AGACNAGAATGCCATCTTCCCGATAGATGGTTNTNNNGCNNNNNNANANNANNN

The pipeline from here is actually quite well explained in the MAQ documentation, but to be sure I verified each step with our expert colleagues at the Sanger. I also checked the parameters they use, in particular at the map stage; these happen to be the defaults for maq.

Basically, there are two strands – the reference genome and the Solexa reads themselves. The reference genome goes through the following stages:

  1. Download the genome as .fasta files. I do this by chromosome using a Perl script and the Ensembl API.
  2. Convert the .fasta files into a reference file (.bfa) using “maq fasta2bfa” (a sketch of this step follows the list).
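
For reference, here is a minimal sketch of step 2 in the same Python 2 style as the pipeline script further down. The paths are hypothetical, and concatenating the per-chromosome .fasta files into one file before conversion is my assumption rather than part of the original setup.

#!/usr/bin/env python

import glob
import os

# hypothetical locations; adjust to your own layout
maqpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq"
fastadir = "/home/pzs/genebuilds/human/"
allfasta = fastadir + "allhuman.fasta"
allbfa = fastadir + "allhuman.bfa"

# concatenate the per-chromosome .fasta files into a single fasta file
out = open(allfasta, "w")
for path in sorted(glob.glob(fastadir + "*.fasta")):
	# skip the combined file itself if the script is re-run
	if path == allfasta:
		continue
	out.write(open(path).read())
out.close()

# convert the combined fasta into maq's binary .bfa reference format
os.system("%s fasta2bfa %s %s" % (maqpath, allfasta, allbfa))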

Now that you have a reference genome in indexed (.bfa) form, you can use it to generate your alignments:

  1. Convert the raw Solexa files into Sanger-format .fastq files using “maq sol2sanger”.
  2. Convert these into indexed .bfq files using “maq fastq2bfq”.
  3. Create an alignment map, based on the .bfq file and the reference genome .bfa file, using “maq map”. For the record, the settings used by our Sanger experts are “-n 2 -m 0.001 -e 70 -a 250”, which are the defaults described in the maq man page.

The map file represents your alignments. Now you can use maq either to output these as raw alignments (each read with its genomic coordinates) or to generate “pileup” files, where each genomic coordinate is output along with the number of reads that overlap it.

The pileup method seems to make some of my colleagues nervous, because not that many people seem to be using it and you can’t “see” where each individual read has gone. However, if you only want to see where most of the protein/DNA interaction is taking place and view it as (say) a wiggle file, the pileup method is ideal.
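
As an aside, converting a filtered pileup into a wiggle track is straightforward. Below is a minimal sketch, not part of the original pipeline: it assumes, as the pipeline script further down does, that the chromosome, position and read depth are the first, second and fourth tab-separated columns, and it reads from standard input and writes the track (with a made-up name) to standard output.

#!/usr/bin/env python

import csv
import sys

# pileup rows can contain very long fields (the column of read bases)
csv.field_size_limit(1000000000)

# read (filtered) pileup lines from stdin and write a variableStep wiggle
# track to stdout; columns used: chromosome, position, read depth
reader = csv.reader(sys.stdin, delimiter="\t")
print 'track type=wiggle_0 name="solexa pileup"'
lastchrom = None
for row in reader:
	chrom, pos, depth = row[0], row[1], row[3]
	if chrom != lastchrom:
		print "variableStep chrom=%s" % chrom
		lastchrom = chrom
	print "%s\t%s" % (pos, depth)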

The only difficulty is that because it generates one line of text for each genomic coordinate, you get 3 billion lines in total and a really massive file – it got to 12GB and then I told it to give up. It also exposed (as bioinformatics tends to do) the limitations of the operating system, because my machine was grinding like mad as Linux was making a lot of fuss about generating a file that big. At some point, I’ll get somebody geekier than me to explain that to me.

In any case, most of the lines from a pileup are not interesting because most coordinates have no overlapping reads. We can attach a pipe process to discard these and get a much more manageable file size. Even so, extracting pileup data from a map is much slower than generating alignments: alignments take a few seconds, pileups tens of minutes.

Below is some Python code that manages the whole pipeline. It uses the multiprocessing module (new in 2.6!) to divide the work on a multi-core machine. I’ve taken the view that using the standard library is a better idea than a third-party package like Parallel Python.

I haven’t tuned this to work out the best number of cores to use. The most expensive step is the pileup stage, which needs one process for maq and another for the pipe filter, so I don’t know whether using all 4 cores of a quad-core machine would just mean the jobs keep swapping. Using 3 cores, it doesn’t run out of memory and can process 10 files in 2 or 3 hours.

#!/usr/bin/env python

import glob
from multiprocessing import Pool, current_process
from subprocess import Popen, PIPE
import os
import csv


csv.field_size_limit(1000000000)
# base directory for the solexa input files
basedir = "/home/pzs/solexa/run2/"
# the glob to identify the Solexa input files
globfile = "*.txt"
# location of the reference genome, already converted into .bfa format
genomebfa = "/home/pzs/genebuilds/human/copy/allhuman.bfa"
# the minimum size of a peak to output in the pileup option 
minpeaksize = 1
# number of cores to use
ncpus = 3
maqpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq"
maqplpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq.pl"

def sol2fastq(solfile):
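	# field 39 of /proc/<pid>/stat is the CPU this process last ran on;
	# it is only used to label the log messages below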
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	outpath = solfile + ".fq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s sol2sanger %s %s" % (maqpath, solfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath
	
def fastq2bfq(fastqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = fastqfile.split(".")[0]
	outpath = filebase + ".bfq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s fastq2bfq %s %s" % (maqpath, fastqfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath

def bfq2map(bfqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = bfqfile.split(".")[0]
	outpath = filebase + ".map"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
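	# the map settings used by our Sanger colleagues
	# (-n 2 -m 0.001 -e 70 -a 250) are the maq defaults,
	# so they are not passed explicitly here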
	cmd = "%s map %s %s %s" % (maqpath, outpath, genomebfa, bfqfile)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath
	
def pilePeaks(mappath):
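	# stream "maq pileup" output through a pipe, keeping only coordinates
	# covered by more than minpeaksize reads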
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = mappath.split(".")[0]
	outpath = "%s-PeakPile-%d.pp" % (filebase, minpeaksize)
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s pileup %s %s" % (maqpath, genomebfa, mappath)
	print "core:", mycore, "running command:", cmd
	p = Popen(cmd, shell=True, bufsize=100, stdout=PIPE)
	reader = csv.reader(p.stdout, delimiter="\t")
	writer = csv.writer(open(outpath, "w"), delimiter="\t")
	for row in reader:
		height = int(row[3])
		if height > minpeaksize:
			writer.writerow(row[:-1])
	return outpath

	
	
solpath = basedir + globfile
solfiles = glob.glob(solpath)

pool = Pool(processes=ncpus)
	
fqfiles = pool.map(sol2fastq, solfiles)
bfqfiles = pool.map(fastq2bfq, fqfiles)
mapfiles = pool.map(bfq2map, bfqfiles)
ppfiles = pool.map(pilePeaks, mapfiles)

print ppfiles

Illumina Solexa hype

July 14, 2009

(I was just thinking about a post about MAQ and this aside turned into a post of its own…)

Several people in my lab have the attitude that arrays are obsolete technology and that Solexa sequencing will entirely replace them over the coming years. My boss may or may not agree.

In any case, the Illumina people are trying to take full advantage of this wave of hype. Last week, I went to a day-long meeting in Edinburgh about the Illumina “product portfolio” and its potential applications. Just look at the punishing sales schedule! They had brought half a dozen marketing people with them – senior product manager this and marketing manager that – as well as a few local scientists who were using the kit. One guy who gave a talk had spent half his £2.2m budget on Illumina kit. Another was enthusing about how sequencing is now so quick and easy that a masters student could do it in a few months. Apparently, they’re also moving into the consumer sector and were talking about swapping genomes with an iPhone app. I’m afraid I face-palmed at that point.

It’s interesting stuff, but you always worry when there is a monopoly on this kind of technology. For some of the analysis we’ve done, the Solexa data seems to show a worrying bias relative to arrays, although this may of course be because we haven’t figured out how to do the analysis yet.

Anyway, high-throughput sequencing is increasingly entering the public arena, so I guess we’d better get used to it.

EuroPython

July 3, 2009

I’ve just got back from 3 days spent at the annual EuroPython conference, held this year in Birmingham. There were about 450 attendees and a wide variety of talks, from web development to printing money to science stuff.

This was my first time at a non-academic conference and I wasn’t really sure what to expect. Overall, I found it extremely useful, mostly just through being exposed to a diverse range of excellent programmers and seeing some of the ways in which they use their tools. The standard of talks was generally higher than at academic conferences (far fewer people randomly overrunning their time and not caring, for a start) and I also found the people more friendly and easier to talk to. In general, there was a lower freak quota, which makes it more bearable to spend 3 days surrounded by these people.

I realised quite early in the conference that I was surrounded by a large number of programmers who were more proficient in Python than I am, despite my having used the language for nearly 10 years. Of course, this is intimidating at first, but I realised it was a great opportunity to discover new ways of doing things and expand my horizons. During those 10 years, almost everything I have learned about Python has been self-taught. This is good as far as it goes, but I could advance a lot more quickly with more interaction with real experts, including pair programming, code reviews and just generally working with other people’s source code.

I’m saddened, but not surprised, that so few academic programmers attend these conferences; I would say that fewer than 10% of the attendees were academics. To me, something like EuroPython makes just as much of a contribution to my productivity as a conference with other scientists. Maybe I can persuade other computational scientists to come along with me next year.