Analysing Solexa data with MAQ

We are just starting to use Illumina Solexa sequencing to analyse protein/DNA interactions, also known as ChIP-seq.

As I understand it, when you buy a Genome Analyser machine from Illumina, it does come with software to align the reads and allow basic analysis of them. However, the attitude in my lab is that the real experts are at the Sanger Institute in Cambridge and we should use whatever they use. We have consulted with my boss’s colleagues there to make sure we do it in exactly the same way as them. The tool they use is developed at the Sanger (or the EBI, I’m not sure) and is called MAQ

In fact, even before you use MAQ, the data that comes out of a Solexa machine has been through a few other analysis stages and arrives as a (frankly incomprehensible) text file, with reads annotated based on their reliability. Here are a few lines from a sample file:

@HWI-EAS349:4:17:504:763#0/1 AACTNGGACACAGAGCCTGTCAGATCGGAAGAGCNCGTANNNNNNANTNNTNNN +HWI-EAS349:4:17:504:763#0/1 ab`\DV[\WYW_^a\NU[^VUZP^[TX\BBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS349:4:17:533:887#0/1 CCTCNTCGAAAGATAAAAATTTACATACCCATAANGAGGNNNNNNGNANNANNN +HWI-EAS349:4:17:533:887#0/1 Y`]TDTXV\bJ^JS`[`a`bY]b[UaaBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS349:4:17:534:420#0/1 AGACNAGAATGCCATCTTCCCGATAGATGGTTNTNNNGCNNNNNNANANNANNN

The pipeline from here is actually quite well explained in the MAQ documentation, but to make sure I verified each step with our Sanger expert colleagues. I also checked that the parameters they use, in particular at the map stage, which happen to be the defaults for maq.

Basically, there are two strands – the reference genome and the solexa file itself. The reference genome goes through the following stages:

Download the genome as .fasta files. I do this by chromosome using a Perl script and the Ensembl API.
Convert the .fasta files into a reference file (.bfa) using “maq fasta2bfa”.

Now you have a reference genome in indexed (.bfa) form, you can use it to generate your alignments:

Convert the raw Solexa files into .fastq (or sanger) files using “maq sol2sanger”.
Convert these into indexed .bfq files using “maq fastq2bfq”.
Create an alignment map, based on the .bfq file and the reference genome .bfa file, using “maq map”. For the record, the settings used by our Sanger experts are “-n 2 -m 0.001 -e 70 -a 250”, which are the defaults described in the maq man page.

The map file represents your alignments. Now you can use maq to either output these as raw alignments (each read with its genomic coordinates) or you can generate “pileup” files, where each genomic coordinate is output with the number of reads that overlap that coordinate.

The pileup method seems to make some of my colleagues nervous, because not that many people seem to be using it and you can’t “see” where each individual read has gone. However, if you only want to see where the most protein/DNA interaction is taking place and view it as (say) a wiggle file, the pileup method is ideal.

The only difficulty is that because it generates one line of text for each genomic coordinate, you get 3 billion lines in total and a really massive file – it got to 12GB and then I told it to give up. It also exposed (as bioinformatics tends to do) the limitations of the operating system, because my machine was grinding like mad as Linux was making a lot of fuss about generating a file that big. At some point, I’ll get somebody geekier than me to explain that to me.

In any case, most of the lines from a pileup are not interesting because most coordinates have no overlapping reads. We can attach a pipe process to discard these and get much more manageable file size. Even so, extracting pileup data from a map is much slower than generating alignments. Alignments take a few seconds, pileups are 10s of minutes.

Below is some Python code that manages the whole pipeline. It uses the multiprocessing module (new to 2.6!) to divide the work on a multi-core machine. I’ve taken the view that using the standard library is a better idea than something 3rd party like parallel python.

I haven’t tuned this to work out the best number of cores to use. The most expensive step is the pileup stage, which needs one process for maq and another for the pipe filter so I don’t know whether using all 4 cores in a quad core setup they’d have to keep swapping jobs. Using 3 cores it doesn’t run out of memory and can process 10 files in 2 or 3 hours.

#!/usr/bin/env python

import glob
from multiprocessing import Pool, current_process
from subprocess import Popen, PIPE
import os
import csv


csv.field_size_limit(1000000000)
# base directory for the solexa input files
basedir = "/home/pzs/solexa/run2/"
# the glob to identify the Solexa input files
globfile = "*.txt"
# location of the reference genome, already converted into .bfa format
genomebfa = "/home/pzs/genebuilds/human/copy/allhuman.bfa"
# the minimum size of a peak to output in the pileup option 
minpeaksize = 1
# number of cores to use
ncpus = 3
maqpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq"
maqplpath = "/home/pzs/src/maq-0.6.6_i686-linux/maq.pl"

def sol2fastq(solfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	outpath = solfile + ".fq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s sol2sanger %s %s" % (maqpath, solfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath
	
def fastq2bfq(fastqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = fastqfile.split(".")[0]
	outpath = filebase + ".bfq"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s fastq2bfq %s %s" % (maqpath, fastqfile, outpath)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath

def bfq2map(bfqfile):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = bfqfile.split(".")[0]
	outpath = filebase + ".map"
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s map %s %s %s" % (maqpath, outpath, genomebfa, bfqfile)
	print "core:", mycore, "running command:", cmd
	os.system(cmd)
	return outpath
	
def pilePeaks(mappath):
	mycore = open("/proc/%i/stat" % os.getpid()).read().split()[38]
	filebase = mappath.split(".")[0]
	outpath = "%s-PeakPile-%d.pp" % (filebase, minpeaksize)
	if os.access(outpath, os.R_OK):
		print "skipping file already present", outpath
		return outpath
	cmd = "%s pileup %s %s" % (maqpath, genomebfa, mappath)
	print "core:", mycore, "running command:", cmd
	p = Popen(cmd, shell=True, bufsize=100, stdout=PIPE)
	reader = csv.reader(p.stdout, delimiter="\t")
	writer = csv.writer(open(outpath, "w"), delimiter="\t")
	for row in reader:
		height = int(row[3])
		if height > minpeaksize:
			writer.writerow(row[:-1])
	return outpath

	
	
solpath = basedir + globfile
solfiles = glob.glob(solpath)

pool = Pool(processes=ncpus)
	
fqfiles = pool.map(sol2fastq, solfiles)
bfqfiles = pool.map(fastq2bfq, fqfiles)
mapfiles = pool.map(bfq2map, bfqfiles)
ppfiles = pool.map(pilePeaks, mapfiles)

print ppfiles

Tags: illumina, maq, parallel processing, solexa

This entry was posted on July 14, 2009 at 2:48 pm and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

One Response to “Analysing Solexa data with MAQ”

Peter C Says:
September 16, 2009 at 5:02 pm | Reply
Hi Peter,

Are you aware that there are two versions of Sanger/Illumina FASTQ files? Using MAQ’s sol2sanger assumes you have the old style with Solexa scores. You may instead have the Illumina pipeline 1.3+ files which use PHRED scores. This does make a difference with poor quality reads.

There is a patch to MAQ to add ill2sanger support:
http://sourceforge.net/tracker/?func=detail&aid=2841164&group_id=191815&atid=938895

Or, since you are already using Python, you could use Biopython 1.51 or later to do the conversion between any of the three FASTQ variants (Sanger, Solexa, Illumina 1.3+) using the Bio.SeqIO library.

Peter C

Peter Saffrey's Weblog