Archive for April, 2011

Overlapping Genomic Intervals Revisited

April 5, 2011

I’ve posted before on finding the right approach to overlapping genomic intervals. Broadly speaking, the problem is this: you have two sets of intervals, where each interval is (chromosome, start, end). You want to know where the intervals in the first set overlap the intervals in the second set.

For a while now, my go-to tool for this problem has been interval trees, which are a flexible and efficient data structure for inserting intervals from one set and making queries with the other. I have a cython implementation which is nice and fast.

Using an interval tree to find interval overlap has two stages:

  1. Loading intervals into the tree to be queried.
  2. Querying those intervals.

In general, a tree data structure like this is optimised so that adding items is (relatively) slower than making queries, so we should try to minimise this stage. Once the tree is built, we can make nice fast queries over and over again. This is good in a database-like environment, where we build a tree at the start and then make queries here and there throughout the rest of our application.

However, many of my interval overlap applications are not like that. I have two sets of intervals and I want to find the overlaps between them once and then I’m done. I have to go through the expensive tree construction stage every time.

My alternative to this approach is to use a plane sweep algorithm. Conceptually, we build a list of the starts and ends of all the intervals from both sets, sorted in genomic order. Then we “sweep” from left to right recording when we enter and exit intervals from each set, and output when we’re in an interval from both. The problem is complicated slightly because with very large sets of intervals we don’t want to construct the whole list at the start, so I wanted to do the plane sweep dynamically, using pre-sorted lists of both sides and popping off items as I go.

After a bit of bashing, I got this working and the code is below: (I know this is verbose – it has quite a few comments)

# uses a plane sweep to determine the overlaps between two sets of genomic intervals
# they must come in as iterators
# sorted, with the same chromosome order
def overlapGeneratorPlaneSweep(left, right):
	# rightcache holds the points we've retrieved from the right set
	# that match the current left set chromosome, but we haven't finished filing yet
	rightcache = []
	# nextcache holds points from right that from a later chromosome 
	# to what we're currently looking at in left
	nextcache = []


	lastchrom = None
	for linterval in left:
		lchrom, lstart, lend, lvalue = linterval

		# slight complex mechanics to handle a change in chromosome in left
		if lchrom != lastchrom:
			# still some left over in right
			# make sure there aren't any points leftover in right
			# from the last chromosome from left
			if rightcache != []:
				righti = rightcache[-1]
				while righti[0] < lchrom:
					righti = right.next()
				nextcache = [ righti ]
			rightcache = []
			assert len(nextcache) <= 1
			# the next cache is from a chromosome *before* the left chrom
			# suck up all the points from before left's chrom 
			# - they will never overlap now
			if nextcache != [] and nextcache[0][0] < lchrom:
				righti = nextcache[0]
				while righti[0] < lchrom:
					righti = right.next()
				assert righti[0] >= lchrom
				nextcache = [ righti ]
			# if the point in nextcache matches the left chrom
			# nextcache is now current, so make it rightcach
			if nextcache != [] and nextcache[0][0] == lchrom:
				rightcache = nextcache[:]
				nextcache = []
			assert rightcache == [] or rightcache[0][0] == lchrom
			
		# if the last point in the rightcache overlaps this left point,
		# there might be more points in the right that are relevant now
		while nextcache == [] and (rightcache == [] or rightcache[-1][1] <= lend):
			try:
				righti = right.next()
			except StopIteration:
				break
			if righti[0] == lchrom:
				rightcache.append(righti)
			elif righti[0] == lastchrom or (righti[0] == lchrom and righti[2] <= lstart):
				pass
			else:
				nextcache.append(righti)
			
		removecount = 0
		# the bit where we actually check and report the overlaps
		for rinterval in rightcache:
			# we've gone past this one. Remove it
			if rinterval[2] <= lstart:
				removecount += 1
			# we haven't reached this one yet. 
			# We must have filed everything relevant from right.
			elif rinterval[1] >= lend:
				break
			else:
				# if we get here, this is an overlap. Yield it.
				yield((linterval, rinterval))
		
		rightcache = rightcache[removecount:]
					
		lastchrom = lchrom

This operates on two iterators of intervals, left and right which must be sorted in the same chromosome order. We pop off an interval from left, and then pop off intervals from right, dropping those that are before the left interval and caching the others, until we go past the left interval. We report all the overlaps we found, and then pop the next left interval and continue.

A lot of the complicated logic happens in making the transition between chromosomes, particularly when there are gaps; both sets of intervals do not have intervals from the same chromosomes. This is where the algorithm relies on the chromosomes from both being in alphanumeric order, so that we can see if a chromosome is “before” another one. This works if we presort the intervals using (for example) the shell sort utility, and I have some wrapper functions to make this work with bed files:

import csv

def sortBedGenerator(bedfile):
	sortfile = bedfile + ".sort"
	cmd = "sort -k 1.4,1.5 -k 2,2n %s > %s" % (bedfile, sortfile)
	os.system(cmd)
	reader = csv.reader(open(sortfile), delimiter="\t")
	for row in reader:
		yield (row[0], int(row[1]), int(row[2]), 1)
	os.remove(sortfile)

def overlapBedFilesPlaneSweep(leftbedfile, rightbedfile):
	leftgen = sortBedGenerator(leftbedfile)
	rightgen = sortBedGenerator(rightbedfile)
	return overlapGeneratorPlaneSweep(leftgen, rightgen)

So we sort the bed file into a temp file, and then yield the sorted intervals one at a time.

The plane sweep algorithm returns pairs of intervals, one from each set, which overlap. As well as the interval itself, it returns a 4th item which could be anything, so you could overlap objects (or whatever) to provide extra context in whatever application we’re writing. This is slightly different from the way the interval tree would be used, of which more later.

I ran some timing tests on this to compare it to my old faithful cython interval tree, using one tree for each chromosome. My test data was based on the mouse genome. On one side, I used the csp6i restriction enzyme (GTAC), a 4-base cutter we use in a lot of our 4C and Hi-C work; there are 5436400 in the mouse genome. On the other side, I generated random bed files of increasing numbers of 50bp reads – 10, 1000, 10000, 1000000, 2000000 and 5000000. I also did a 10 read example, to make sure the plane sweep algorithm could cope with gaps in the chromosome roster. I made sure both the interval tree and plane sweep methods identified the same intervals (the 10 read examples is missing for this plot, but I did make sure the overlaps were the same). I realise this test scheme is a little arbitrary but this is a typical use case for me, so it seemed like a good place to start.

Since it makes a difference whether you use the csp6i file as the tree or the query and as the “left” or “right” in the plane sweep, I tested it both ways. I did three tests of each. Below are the results, using the triplicate results to generate standard deviation error bars. Note that “git” stands for Genomic Interval Tree (one interval tree for each chromosome) and “PS” stands for Plane Sweep. Since the GIT method does not require the intervals to be sorted, every PS run includes the time it takes to sort the files first.

The plot headings refer to whether the csp6i file was on the left (“big-left” – which means it was used to build the interval trees) or the right (“big-right” – used as the queries). The plane sweep seems to perform much better than the interval trees for all the file sizes. The times are starting to ramp up at the 5,000,000 read level, but are still 10 times faster than the GIT method. I haven’t checked this, but it also ought to be much more memory efficient, since we’re not storing all the intervals in trees – we just visit each interval once to check its overlaps as we make the sweep.

As expected, the interval tree approach is much more efficient when we build a tree with the small number of intervals and make lots of queries on that tree. It makes a difference with the plane sweep too, since one side will have to make use of the more complicated caching machinery. I did a separate analysis to examine this:

To me, this is a good start, and I have already used the plane sweep approach to do some analysis work. I could probably speed it up even more by allowing the use of pre-sorted files. However, there are quite a few caveats:

  1. This is a very specific set of conditions. Before we could conclusively say the plane sweep approach was faster, there are lots of other conditions I would need to test:
    • Different sizes of intervals.
    • Many intervals within each group that overlap each other.
    • Interval sizes that vary wildly within (and between) the groups.
  2. There is a lot of tuning possible on the structure of the interval trees that will affect performance.
  3. I’m not entirely comfortable with the reliance on coordinates in alphanumeric order here. It probably needs more error checking to make sure it reports if something goes wrong with that.
  4. As I mentioned, the way this should be used is different to the interval tree approach. With the csp6i sites, I usually want to look at a 4C read and say “is it near a csp6i site?” interval trees are a natural fit for this – I just query the tree, based on some tolerance for how near the site should be, and if I get any hits then I can act on those. To use the plane sweep approach, I need to change the layout of my code slightly to look for the presence of the read in the list of returned overlaps. I would also need to add an extra wrapper to generate intervals to take account of the tolerance I want to use. These are not insurmountable obstacles, but the plane sweep is not a drop-in replacement for the interval tree approach.
  5. Although comparing two sets of intervals is the basic use case, you always have to be wary of a one-size-fits-all approach. The plane sweep might be quicker for a pair of interval sets like this but, again with respect to the csp6i example, I might want to load the csp6i sites into a set of interval trees, and then query a list of 4C read-sets (lanes) against the tree. With only a few lanes of 4C, the plane sweep will probably still be quicker. How many lanes would I need before the interval tree approach becomes quicker again?

I guess for me the biggest caveat is that the plane sweep approach is more complex and is therefore more susceptible to edge cases and programmer error. The GIT method is mature and I trust it, so I will probably have more anxiety about the PS for a while. Still, these preliminary results look promising.

I would also be interested to implement a multi-way plane sweep that overlaps several sets of intervals against a reference set. Maybe this would be quicker than building a tree for the reference set and comparing the others – something I still do very often.