
High Throughput Primer Design

February 16, 2009

Using PCR primers to amplify specific DNA sequences is a fundamental part of modern genome work. Designing the primers is essentially an algorithmic process, and there are quite a few tools out there to help you do it.

Some of my colleagues recently attended a talk by CLC Bio about their new all-singing, all-dancing genomics toolkit. Much of it is concerned with handling the output of so-called Next Generation Sequencing (a bit like using the word “new” in the title of a technology – don’t these people realise that it’s never tomorrow today?), such as Illumina Solexa reads, but it also includes a flashy graphical tool for designing primers. You can see your sequence on the screen and watch the effectiveness of the primers change in real time as you fiddle with them.

The problem with these tools, along with web-enabled offerings like Primer3, is that they’re a manual process for building one primer pair at a time. That doesn’t work so well when you have a big list of sequences needing primers, and this is where you need some software support. In my case, I had 155,000 promoter sequences listed in a GenBank file, and we wanted a primer pair covering the region from 100 bases upstream to 100 bases downstream of the central occurrence of the motif GTAC in each one.

First of all, I got to find out all about how “standard” a GenBank file really is. I also discovered that in a GenBank file, you write the sequence in the order it is transcribed, not the order it appears in the standard genome listing. That means that if you have a reverse-strand feature, you need to take the reverse complement of the sequence.
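For illustration, here’s roughly what that looks like – a minimal sketch using Biopython’s GenBank parser (my library choice for the example; the file name is made up and the pipeline doesn’t depend on it):

```python
from Bio import SeqIO

# Minimal sketch: pull each feature's sequence out of a GenBank file,
# reverse-complementing the reverse-strand ones. File name is invented.
for record in SeqIO.parse("promoters.gb", "genbank"):
    for feature in record.features:
        start, end = int(feature.location.start), int(feature.location.end)
        sub = record.seq[start:end]
        if feature.strand == -1:
            # Reverse-strand feature: take the reverse complement so the
            # sequence reads in the order it is transcribed.
            sub = sub.reverse_complement()
        print feature.type, sub[:20]
```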

Then it was on to the meat of the design pipeline itself.

It helps if you know roughly how PCR works. Your primers are short sequences, about 20 bases each, matching the start and end of the region you want to amplify. As I understand it, you put copies of your primers in some magic fluid along with a copy of the genome from which you want to amplify a sequence. The primers “find” the parts of the genome they match, and lots of copies get made of the bits between them. This means you want your primers to match only one place in the genome; otherwise you’ll also copy sections of sequence you’re not interested in.

Once you’ve got the primers designed, there are lots of companies that will do the amplification for you based on primer lists.

Roughly speaking, all of the necessary tools are available to download and run on the command line. You need:

  1. RepeatMasker (this one is the most hassle to set up – it needs some databases you have to register for, as well as an install of BLAST).
  2. Primer3.
  3. The exonerate package, which is in the Ubuntu package repository.
  4. A copy of the human genome, in FASTA format. I used the Ensembl API to grab this by chromosome.

Once they’re all set up, it works like this (there’s a code sketch of the first two steps just after the list):

  1. Pass the sequence of interest into RepeatMasker. This replaces repeated fragments of sequence with N characters, to stop you making primers based on these repeated parts.
  2. Pass the masked sequence into Primer3 along with a set of suitable parameters. The parameters define the tolerance on how long the primers and product can be and some other factors. If successful, Primer3 generates several primer-pair candidates.
  3. Use ipcress (part of exonerate) to simulate the PCR reaction. This gives you a number of “hits” – places where the pair of sequences appears close together in the genome. A good primer pair has only one hit.
  4. Use exonerate itself to check for occurrences of each primer in the genome. Although it shouldn’t matter if the primers match parts of sequence that are a long way apart, it’s best if they don’t match anywhere except where you intend to amplify. A good primer pair will have one (or even zero) matches for each half of the pair – you can get zero matches because of the way the exonerate algorithm works, rather than because your sequence isn’t present at all.
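To make the first half of that chain concrete, here’s a stripped-down sketch of steps 1 and 2, shelling out to RepeatMasker and primer3_core with subprocess. The file names and parameter values are invented, and the Boulder-IO tag names follow Primer3 1.x (later versions renamed some of them), so treat it as a shape rather than a drop-in implementation:

```python
import subprocess

def mask(fasta_path):
    """Step 1: RepeatMasker replaces repeated fragments with Ns."""
    subprocess.check_call(["RepeatMasker", "-species", "human", fasta_path])
    # RepeatMasker writes <input>.masked next to the input (only when it
    # actually finds repeats, so real code needs a fallback to the input).
    return fasta_path + ".masked"

def design(seq_id, sequence):
    """Step 2: one Boulder-IO record in, candidate primer pair out."""
    record = ("PRIMER_SEQUENCE_ID=%s\n"
              "SEQUENCE=%s\n"
              "PRIMER_PRODUCT_SIZE_RANGE=150-250\n"  # invented tolerance
              "PRIMER_NUM_RETURN=5\n"
              "=\n") % (seq_id, sequence)
    proc = subprocess.Popen(["primer3_core"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(record)
    # Primer3 answers in the same tag=value format it was fed.
    results = dict(line.split("=", 1)
                   for line in out.splitlines() if "=" in line)
    # First candidate pair; the keys are only present if design succeeded.
    return (results.get("PRIMER_LEFT_0_SEQUENCE"),
            results.get("PRIMER_RIGHT_0_SEQUENCE"))
```

Steps 3 and 4 are the same subprocess pattern again; the batching sketch below shows the ipcress half.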

This is all fine and works once you’ve debugged it. Unfortunately, it’s also hellishly slow, because ipcress and exonerate are searching for strings of length 20 in a genome about 3 billion bases long. My Xeon system, using one core at a time, was finding a primer pair in a few minutes – but at, say, three minutes a pair, 155,000 primer pairs works out to the best part of a year of compute.

To speed up the process, both ipcress and exonerate allow you to search for more than one set of primers at a time. If you bundle 10 sets of candidate primers together, it takes only a little longer than one set does. You can go higher than this, but I found that at 20 sets at a time, the amount of output from exonerate was using up too much memory – literally gigabytes of it. Maybe I did something wrong here, but I was getting tired of debugging it.
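As a sketch of what that batching looks like for ipcress (the 150–250 product-size bounds are invented, and the output parsing is deliberately naive – field positions vary between versions, so real code should parse the report properly):

```python
import subprocess

def count_hits(candidates, genome_fasta):
    """Run one ipcress job over a whole batch of primer pairs.
    candidates is a list of (id, left_primer, right_primer) tuples."""
    batch = open("batch.ipcress", "w")
    for pair_id, left, right in candidates:
        # One experiment per line:
        # id  primer_A  primer_B  min_product_len  max_product_len
        batch.write("%s %s %s 150 250\n" % (pair_id, left, right))
    batch.close()
    proc = subprocess.Popen(["ipcress", "batch.ipcress", genome_fasta],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    hits = dict((pair_id, 0) for pair_id, _, _ in candidates)
    for line in out.splitlines():
        # Each "ipcress:" line in the report describes one simulated product.
        if line.startswith("ipcress:"):
            for pair_id in hits:
                if pair_id in line:
                    hits[pair_id] += 1
    return hits  # a good pair scores exactly one hit
```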

Next, I adapted my code to use Parallel Python. PP has a clever server system: with a few lines of code you declare a server object with a number of workers (cores) available, then submit jobs to it, and those jobs are executed in parallel.
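Roughly, the PP side looks like this – a minimal sketch assuming the count_hits function from the batching example above and a list of candidate primer pairs called candidates:

```python
import pp

# Two workers, leaving the other cores free for everyday work.
job_server = pp.Server(ncpus=2)

# Chunk the candidate pairs into batches of 10 (see above for why).
batches = [candidates[i:i + 10] for i in range(0, len(candidates), 10)]

# Each submit() ships count_hits off to a worker; modules= names the
# imports the function needs on the worker side.
jobs = [job_server.submit(count_hits, (batch, "genome.fa"),
                          modules=("subprocess",))
        for batch in batches]

# Calling a job object blocks until that job's result is ready.
results = [job() for job in jobs]
```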

The problem with this approach is that on my quad-core box I had two or three instances of exonerate running at once, each of which held a whole human genome in memory – about 3GB apiece. My machine couldn’t cope with this. Luckily, exonerate also comes with a multi-threaded server mode for exactly these eventualities: you run a few commands to index the genome you’re using, start the server, and it deals with requests as they come in using a single cached copy. Phew.
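In outline: index the genome once, start the server, and point exonerate at the server address instead of a FASTA file. Something like this (tool names are from the exonerate-server workflow, but check the flags against your version’s documentation):

```python
import subprocess

# One-off: build exonerate's on-disk database and index from the FASTA.
subprocess.check_call(["fasta2esd", "genome.fa", "genome.esd"])
subprocess.check_call(["esd2esi", "genome.esd", "genome.esi"])

# Start the server; it holds a single cached copy of the genome.
server = subprocess.Popen(["exonerate-server", "--port", "12886",
                           "genome.esi"])

# Clients then pass the server address where the target FASTA would go.
subprocess.check_call(["exonerate", "primers.fa", "localhost:12886"])
```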

Still, the memory requirements for this code are pretty intense. I’m currently running two cores on the design and another two for my everyday work, and the machine is still pretty sluggish, largely because of the memory constraints. This machine has 8GB of memory, but I can see that doubling up to 16GB would probably be a sensible investment for this kind of work.

If anybody is interested in the finished code for the primer design pipeline, feel free to get in touch.