Media Server Odyssey Part 4: Autorip

March 27, 2013

If you have time on your hands, feel free to start at the beginning.

So now the thing plays music. I also want to be able to update the music supply painlessly. Specifically, I would like to be able to insert a CD into a drive attached to the headless Pi and have it rip the disk, with ID3 tags, and pop the disk without any further interaction.

This is where a flexible solution like a Raspberry Pi starts to come into its own – it doesn’t just play music, like the Streamium or Sonos, it’s also possible to extend it to do anything a Linux box can do. There are packaged devices that play and rip, but they are pretty expensive.

I guess here I’m demonstrating that I’m an ancient crusty person who still buys CDs. I could just buy the MP3s instead and copy them straight to the disk. Meh.

The hardware for CD ripping was easy to come by. I bought a cheap DVD drive on eBay. It’s also USB powered, so my Raspberry Pi media centre (aka giant pile of wires and USB devices) is still driven from a single power socket:


For the software, a tiny amount of googling will identify the best hands-off Linux ripping tool – abcde, which stands for A Better CD Encoder. There are many things I like about this tool:

  1. Command line driven
  2. Configurable, but also designed for use-cases like mine. Some highly configurable tools are very hard to get to do the simple things you want…
  3. Capable of ripping completely non-interactively (the -N switch)
  4. Based on standard tools like lame and cdparanoia
  5. Written in pure bash! In a single 4,500 line script! It’s some of the most transparent Linux software I’ve ever used because I love bash and wallow around in it all day at work. If I want to know how abcde does things, it’s extremely easy to open up the one source file that does the work and browse through it to find the answer.

I found a standard config file for MP3 ripping, which required minimal customisation.
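For flavour, the settings that matter look something like this (a sketch in the style of my abcde.conf rather than a verbatim copy – the output path is an example):

```shell
# Sketch of the important abcde.conf settings for hands-off MP3 ripping.
# The output path is an example - point it at wherever mpd looks for music.
OUTPUTTYPE=mp3                   # encode with lame
OUTPUTDIR=/media/usbdrive/music  # rip straight into the music store
INTERACTIVE=n                    # never prompt (the -N switch does the same)
EJECTCD=y                        # pop the disk when we're done
OUTPUTFORMAT='${ARTISTFILE}/${ALBUMFILE}/${TRACKNUM}.${TRACKFILE}'
```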

I wanted to use the SD card as the wav file temp space, since I thought it might be faster, but there wasn’t enough room – possibly there is too much clutter in my install by now. In any case, I’m not actually that interested in how quickly (within reason) it can rip the CDs – I just want it to be easy. Since my media centre sits in my living room, I should be able to gently feed in CDs at my leisure.

The tougher part was getting abcde to launch on CD insert. As always with Linux, there are several hardware event toolkits which succeed each other and all look completely different. The latest in the chronology is udev, but I found that if I ran “udevadm monitor” and plugged in a CD I got no events 😦 Maybe I just didn’t understand the man page – it certainly wouldn’t be the first time.

So I tried halevt. This time, running the monitor command:

lshal -m

and then inserting an audio CD gave this:

Start monitoring devicelist:
21:19:30.866: storage_serial_HL_DT_ST_DVDRAM_GSA_T20L________________________________0_0 property storage.removable.media_available = true
21:19:30.991: storage_serial_HL_DT_ST_DVDRAM_GSA_T20L________________________________0_0 property storage.cdrom.write_speeds = {'4234', '2822', '1764', '706'}
21:19:31.361: volume_part_1_size_472217600 added

I also got useful information about the relevant device by examining the output from lshal, which dumps information about all your devices. There was still a bit of argy-bargy to get the halevt listener/handler configured properly. The first thing I had to do was remove all the default entries, which are all to do with automatically mounting USB drives, something that interfered with my existing mechanism for doing this.

The configuration documentation has some good information, but as usual examples are most useful, so I just adapted what was in the default config file. One thing that confused me is that it looks like there is a halevt directory for scripts launched from halevt events (/usr/lib/hal/scripts/), but if you put “env > somefile.txt” (note you’ll need to use XML encoding for the file redirection operators! It’s hard to show these in my main code examples because WordPress sucks) as your event script, you find that the $PATH is actually just /sbin:/usr/sbin:/bin:/usr/bin. That’s fine – abcde and all its tools are installed there anyway. This was my final config:

<?xml version="1.0" encoding="UTF-8"?>
<halevt:Configuration version="0.1" xmlns:halevt="">
<halevt:Device match="hal.storage.removable.media_available = true">
<halevt:Property name="hal.storage.removable.media_available">
<halevt:Action value="true" exec="echo $USER &gt; /home/pi/log/abcde.log ; sleep 10 ; abcde -N &gt;&gt; /home/pi/log/abcde.log 2&gt;&amp;1"/>
</halevt:Property>
</halevt:Device>
</halevt:Configuration>

This has two crucial components – selecting the appropriate event (a combination of the Device match attributes and the Property name attributes) and the commands:

exec="echo $USER > /home/pi/log/abcde.log ; sleep 10 ; abcde -N >> /home/pi/log/abcde.log 2>&1"

This keeps track of what’s happening in /home/pi/log/abcde.log for debugging purposes. I think the $USER is actually root for halevt runs – I probably don’t need this part anymore. Note the sleep 10 – I found that it took some time after the event for the CD to settle and be available for abcde.

It works beautifully – put in a CD and the Pi rips it to the target directory and pops the disk. You just need to wait a little while for the encoding to finish and then update the mpd database from one of the clients – you can do this with mpdroid.

One potential problem is that the disk pops once the wav files have been ripped, not when the whole process has finished. This means you can feed in the next disk straight away and the poor old Pi has to rip it while still encoding the last one. However, I found it coped perfectly well with this: the Pi can rip one track, encode two others and play MP3s through mpd all at the same time. The system load goes up to 5-7, but the Pi soldiers on with no obvious problems. Impressive.


Media Centre Odyssey Part 3: Architecture Revision

March 26, 2013

Hello, weary traveller. You might want to start at the beginning.

My original architecture used an Android device running Bubble uPnP, connected to my Bluetooth speaker. It seemed to work fine, but occasionally the sound would drop out for up to a minute at a time. Now, the problem of debugging this. It could be:

  1. The Pi not supplying the file fast enough
  2. The Wifi not supplying the file fast enough
  3. An encoding bottleneck on the tablet
  4. A Bluetooth problem communicating with the speaker

I could hardly believe the supply of a small MP3 file could cause the problem, but on the other hand local files on an Android device played over the Bluetooth without problems. This was also a frustratingly intermittent problem – sometimes it would play for over an hour with no blips.

I also tried changing the architecture slightly by installing mpd on the Pi and using mpdroid on the Android device. I used the mpd http stream support to play the music on the device and pipe it through Bluetooth onto the speaker. In this scheme, the Pi acts as both storage and decoder, with the tablet as controller but also a conduit from the Pi to the speaker. This architecture was the most complicated yet, and the dropouts were even more frequent. I heard from the Bubble uPnP developer that http streaming on older Android devices like my HTC Desire was buggy and I should upgrade.

But this seemed like hard work and I thought the number of moving parts in this solution was bound to have problems. I also still didn’t like the dependency on a conduit Android device that was subject to battery life (or, for example, the person with the phone wanting to leave the house). Time for a rethink.

Although the http streaming didn’t work, I really liked mpd, which has a server-side playlist you can control from multiple client devices. The mpdroid client is a fully fledged smart-phone remote control with the ability to edit the upcoming tracks and search for the music you want to play. The search is fast, too; my 10,000 track collection usually returns from a search in under a second. You can also enable a software volume control that is, again, accessible from the remote, by uncommenting this line in /etc/mpd.conf:

mixer_type "software"

It was time to relax one of my previous constraints – that the Pi could be sited anywhere. If I sited the Pi next to the speaker, I could connect them with a robust and easy-to-debug technology: a wire into the 3.5mm audio jack. Multiple Android devices, each with an mpdroid install, could control the playlist. Having connected all this together, I was very pleased – it all seemed to work! The remote was easy enough that I could install it on my wife’s Android phone and she actually felt able to use it!

Unfortunately, the solution was still not perfect. I found that when the volume was turned down too low, the sound quality became very poor. I spent ages trying to change drivers and tune ALSA (which is bloody awful, by the way) on the Pi. I’d even gone as far as assuming my Pi was broken and ordering another when I found out the answer. The Pi is not meant to have high quality audio through the audio jack. Of course.

The solution, apparently, is to buy a USB sound card. So I tried this one, reported by lsusb as:

C-Media Electronics, Inc. Audio Adapter (Planet UP-100, Genius G-Talk)

It kind of worked, but was fuzzy. I tried a lot of hints I found online, including this and some of this but it was still fuzzy. Then I bought this only to find it contained identical hardware. Argh! Then I tried this (notice how I’m gradually getting more expensive), but when it arrived it reported as this:

C-Media Electronics, Inc. Audio Adapter

Argh again! It looks the same! However, this one works without fuzziness. I only had to buy three sound-cards to get there. I also had to do this in /etc/modprobe.d/alsa-base.conf:

options snd-usb-audio index=0

and this in /etc/mpd.conf:

audio_output {
	type "alsa"
	name "USB Audio"
	device "hw:0,0"
}
So now, we have a completely working high-quality audio jukebox, controlled from multiple remotes that are easy enough my wife can use them. Of course, this might now seem moot since you can get a canned Raspberry Pi distribution that does this. But I’m not done yet!

Media Centre Odyssey Part 2: The Hardware

March 25, 2013

If you’ve arrived in the middle and want to read the whole story, you can start at the beginning.

The core of my media centre is the Raspberry Pi that stores and supplies the music files. Although the Pi is cheap, it comes utterly bare. You can think of it like buying a motherboard, memory, processor and graphics card: it is the core of what you need, but is quite a way from being a full computer.

Alongside the Pi, I also need:

  1. A power supply. To start with, I used the power supply from my HTC Desire, but later used a powered USB hub – see below.
  2. An SD card for the operating system. I have one 4GB and one 8GB and the 4GB is a bit on the pokey side. I recommend 8GB.
  3. A hard drive for the music. I started with one with an external power supply, but I really wanted the Pi to run on a single power socket. There are plenty of cheap-ish USB drives now that are powered from USB only, although you will need a powered USB hub. I have a 750GB drive.
  4. A wifi dongle. I tried a few that might have worked in theory but after several evenings fruitless module bashing were not working. I ended up with a “Mini Nano 802.11n” (very generic name) which was cheap and worked out of the box. It reports itself (lsusb) as “Realtek Semiconductor Corp. RTL8188CUS 802.11n WLAN Adapter”. They can also be powered by a Raspberry Pi without a powered hub, which will be useful for satellite players.
  5. Depending on what came with your other peripherals, you will also probably need a fistful of cables of all shapes and sizes (HDMI, USB, audio…)

Also useful:

  1. A screen, for the first time you boot the Pi, to get ssh set up. Possibly you can get a Raspbian image that will just boot with ssh switched on, so you can go headless from the get-go.
  2. A keyboard. I bought a weird floppy thing (and a keyboard, snigger) which is really horrible to type with but folds up small for storage (so does the keyboard, snigger again. OK, enough sniggering).
  3. A case for the Pi, if I don’t want the bare board sitting around. I bought one of these snazzy plastic things. This is nice, but if you have a USB hub and a hard drive, you still have a really messy pile of components, so the case on the Pi itself doesn’t really help much.

As well as all the core stuff, you also need a USB hub. The Pi has 2 USB ports, which is enough for Wifi and a USB hard drive, but it can only power very low-power peripherals, so as soon as you ask it to run a few USB devices you need a powered USB hub. Luckily, you can also power the Pi itself from the USB hub. The Pi is a sensitive beast, and with the wrong power supply or an unreliable connection, it will crash often especially if you wiggle the cables by moving the Pi while it’s on.

The hub is the thing I had the most problems getting right. I started by buying an anonymous piece of crap on eBay, and the Pi crashed all the time. Then I bought a recommended device, which worked for a while until I couldn’t find its power supply and plugged in the wrong one. There was a nasty smell, and instead of a USB hub I had a smouldering little plastic brick. I bought another one, but even then either the Pi or the hard drive plugged into it kept giving up. Also, the later stages of my project used more and more USB ports and it didn’t have enough. Finally, and I don’t want to get precious about this, but it felt really cheap and nasty. So I bought a bigger one on eBay. This thing has more ports, is cheaper and even feels nicer than the one from ModMyPi, so I recommend it.

This is where you see that the Raspberry Pi project can become a bit of a non stop buy-a-thon. I’ve already spent more just on USB hubs than I have on the Pi itself. On the other hand, all the bits and pieces are fairly cheap on their own and if you move slowly and you’re happy to piss away £10 a month on a tinkering project I guess you can probably justify it. At least I’m not pimping my car or anything.

If I cost up the stuff I really needed for this part of the project it looks roughly like this, including postage:

  • Pi: £35
  • SD Card: £8
  • Hard Drive: I blagged one from somewhere else, but you can get one for about £50
  • USB hub: £5
  • mini-USB power cable: £2
  • Wifi dongle: £5

So that’s £105. If I add the keyboard, case and extra cables it’s maybe another £20-25. If I add all the stupid other crap I bought and it didn’t work, about another £40. So more expensive than buying an AppleTV, then. Sigh. But more “fun”! Sigh again.

Anyway, wiring up all this lot, what you get is a bit of a mess, but it does work:

Pi Media Centre: v1

So from top to bottom, there is a single plug; the USB hub, the USB-powered hard drive and the Pi itself in a snazzy case. You can see the Pi is plugged into the hub twice, once to provide the Pi with power and the second time to connect the Pi to the other devices plugged into the hub. You might also be able to see the tiny wifi adapter, plugged into the bottom left slot on the hub. These things are very small, which is both good and bad, I guess. I once lost one when it was stuck in one of the holes on the case.

This is a bit of a mess, but with the current architecture I only need wifi and power, so I can site this anywhere it can reach a socket. It can sit in the bottom of a drawer and I never have to look at it. Unfortunately, as I will explain in the next instalment, that architecture needs to be modified so that I need way more devices, way more wires and to site it somewhere specific.

Media Server Odyssey Part 1: The Architecture

March 24, 2013

This is part of my Media Server Project (odyssey!). You might want to start at the beginning.

It was only after I started thinking about this problem that I realised how complicated it could be. A media player actually has four separate components:

  1. The Storage, which holds the encoded music.
  2. The Decoder, to translate the stored music.
  3. The Renderer, which actually makes the noises. I guess a purist might separate this into amplifier and speakers.
  4. The Controller, which allows the user to choose the music. This usually includes a visual display and some buttons to press.

As an example, in an old-fashioned integrated CD player, the storage is a CD; the decoder is inside the device somewhere and it’s rendered through an internal amplifier and the speakers; the controller is probably an infra-red remote control, but you also have some crappy LCD display telling you what track is playing and so forth.

An iPod is actually pretty much the same – everything is contained in the fancy unit you buy from Apple. The only difference is that there is a degree of faffing to get your music (from CDs or the web) and then upload it onto the hard drive of the iPod. Also the controller part involves less manual changing of media (storage) than a CD player.

It’s only when you start to put one of these things together for yourself that you realise you are responsible not only for each component, but also for the linkage between components. In your integrated player this is all invisible inside the device but building your own means worrying about wires or wireless protocols connecting each piece and all the potential problems there can be at each stage.

At the storage stage, a digital media centre (thing) is probably based on a hard drive (or maybe an SD card or SSD? Those are probably too small/expensive though). That’s great – now you need to plug the drive into something and serve it. In my case I’m going to use a Raspberry Pi running Linux of some kind, but I still need to export the files. I could do this with a Samba share, but then I’ll need something that can mount that – probably another PC-grade device like another Pi. A better choice might be to export the files with uPnP or the more advanced/less general DLNA, and then another device can consume those files. The auxiliary benefit of this is that I can also serve media other than music, of which more later. Of course, there are then many choices of uPnP/DLNA servers that run on Linux.

Quite a few devices can consume music from uPnP, including a smart phone (I have a crappy old HTC Desire), tablet or even a fancy TV. Or, of course, another Pi – they only cost £30 after all. So I could serve the music and play it on my phone using a uPnP client like Bubble uPnP (which is a cool app by the way, although the search is a bit slow. Maybe that’s a server-side problem). Bubble uPnP has an interface to search, queue up and play music, so this use of uPnP has the phone doing the decoding and the controlling. Then I guess I could plug my phone into my old CD player. This would work, but if I’m going to have a device tethered to a CD player I might as well just use an iPod.

One alternative is to use XBMC, which is integrated media centre software that runs on Linux. There is even a Raspberry Pi XBMC distribution. It’s actually a very polished product and looks great, but as far as I could tell, it’s really not designed to run without a monitor (to do the “controller” part). You can download an XBMC remote control for Android, and this thing confused me for a long time – it’s actually designed to operate a bit like a remote mouse to control what you can see on the XBMC interface. This makes perfect sense if you’re expecting the display to be on all the time, like if your XBMC device is supposed to be playing videos, but I would really like my Pi to be headless (no monitor), if possible. This is how you might expect a plastic-and-rubber remote control to work, but having a smart phone, which has a screen and a keyboard, work as just a cursor control seems to be missing an opportunity. Feel free to email me and tell me I got it all wrong.

Anyway, my first design went like this:

  1. Raspberry Pi running Raspbian with a USB drive full of MP3s. There were some hardware complications with this, which I’ll come back to later.
  2. Minidlna to serve the files. I also tried MediaTomb, but I found it a bit slow and I thought the web interface was awful. I’m an old Linux nerd and I prefer text config and log files. The Pi with minidlna constitutes the storage.
  3. A reconditioned Galaxy Tab 2 Android tablet I got for £200 on eBay. I guess it must be one step behind the curve, because that seemed like a good price to me. I guess I’m just not sophisticated enough to know any better. This has Bubble uPnP installed on it, so it’s doing the decoding and controlling.
  4. A Creative ZiiSound D5x Bluetooth enabled speaker that was going cheap on Amazon. This is the renderer.

So as you can see, not only do we have quite a few technologies involved in the components, but we also have USB, WiFi and Bluetooth linking the components. Complicated. The good thing is that the whole thing is wireless, which is convenient. I can site the Pi wherever I want – in a cupboard or something – provided I can wire it into my home network, probably with wifi (more on that later). The only problem is that it’s not as flexible as I might like – if the tablet is reading and playing the music, it can’t be controlled unless you have the tablet. If the tablet runs out of batteries, it stops playing.

Did it work? Stay tuned.

Home Media Centre Project: The Odyssey Begins

March 22, 2013

I’ve always liked the idea of having lots of my music available at the same time, rather than listening to just one CD at a time. I owned a 3 CD changer and then a 5 CD changer, one that made a noise like a car-crusher when it changed disks.

When I first saw an MP3 player advertised on Slashdot, I was very excited. It had a whole 6GB of storage! Imagine a shuffle on that!

When I finally got an iPod I was very excited about it, and I was one of the sad bastards who had his name laser engraved on the back. It was a cool piece of kit, and it was great carrying your whole music collection around with you. However, I was still missing a user-friendly living room music solution with a remote control and the ability to choose the music without being tethered to the speaker.

Time passed and eventually I had time to think about this in earnest. I felt sure that since portable MP3 technology was so well developed, I ought to be able to buy a cheap-ish solution for the home too. I wanted something with a central music store and the ability to fling music out to all my existing stereos, with central control and the ability to have different or shared music in different rooms.

There were solutions you could buy, but most of them were rather expensive and/or had proprietary components that tied you into a particular provider – things like the Streamium or later the Sonos. The other problem was that these devices often required a media server, and I wasn’t prepared to have a power-hungry buzzing server in the corner of my living room.

Of course, the really easy way to implement this would be to genuflect towards Cupertino and buy Apple. Just buy Apple TV, Airports, iTunes, an iPhone as a controller and use AirPlay. Job done. The trouble is, despite buying an iPod (in 2004, when I was young and naive) and owning a Mac, I don’t like Apple very much, especially iTunes. They want to own what you do and they tell you how to do it. There is a loss of control. Even though it would be much, much easier to go this route, I’m stubborn. Screw Apple.

I decided that if no decent system existed I would build one myself. I would use a low-cost low-power PC like a SheevaPlug, perhaps with satellite receivers of some sort and install Linux on goddamn everything. The full power of a flexible operating system at every node and full control everywhere.

Then the Raspberry Pi came out, which was abso-bloody-lutely perfect for this kind of project. There was a bit of a wait before I could buy one, but then the project could begin in earnest.

From One Thing to Another

March 22, 2013

I stopped working in my old job and started a new one in the commercial world. I stopped making posts because I wasn’t sure what I’d be able to say, and in this job I also don’t have much spare time.

Anyway, this means I’m going to repurpose this blog to describe my hobby projects. It’s still Linux and hacky ranting, so it’s not much of a deviation anyway.

Overlapping Genomic Intervals Revisited

April 5, 2011

I’ve posted before on finding the right approach to overlapping genomic intervals. Broadly speaking, the problem is this: you have two sets of intervals, where each interval is (chromosome, start, end). You want to know where the intervals in the first set overlap the intervals in the second set.

For a while now, my go-to tool for this problem has been interval trees, which are a flexible and efficient data structure for inserting intervals from one set and making queries with the other. I have a cython implementation which is nice and fast.

Using an interval tree to find interval overlap has two stages:

  1. Loading intervals into the tree to be queried.
  2. Querying those intervals.

In general, a tree data structure like this is optimised so that adding items is (relatively) slower than making queries, so we should try to minimise this stage. Once the tree is built, we can make nice fast queries over and over again. This is good in a database-like environment, where we build a tree at the start and then make queries here and there throughout the rest of our application.

However, many of my interval overlap applications are not like that. I have two sets of intervals and I want to find the overlaps between them once and then I’m done. I have to go through the expensive tree construction stage every time.

My alternative to this approach is to use a plane sweep algorithm. Conceptually, we build a list of the starts and ends of all the intervals from both sets, sorted in genomic order. Then we “sweep” from left to right recording when we enter and exit intervals from each set, and output when we’re in an interval from both. The problem is complicated slightly because with very large sets of intervals we don’t want to construct the whole list at the start, so I wanted to do the plane sweep dynamically, using pre-sorted lists of both sides and popping off items as I go.
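Before the dynamic version, it helps to see the sweep in its simplest form. This little sketch (my illustration – single chromosome, everything materialised up front) builds the sorted event list explicitly and tracks which intervals from each set are currently "open":

```python
def overlap_plane_sweep_simple(left, right):
    # left, right: lists of (start, end) half-open intervals on one chromosome
    # build the full event list: at equal positions, end events sort first
    # so that touching intervals don't count as overlapping
    events = []
    for i, (start, end) in enumerate(left):
        events.append((start, 1, 'L', i))
        events.append((end, 0, 'L', i))
    for i, (start, end) in enumerate(right):
        events.append((start, 1, 'R', i))
        events.append((end, 0, 'R', i))
    events.sort()

    open_left, open_right = set(), set()
    overlaps = []
    for pos, kind, side, i in events:
        if kind == 0:
            # leaving an interval
            (open_left if side == 'L' else open_right).discard(i)
        elif side == 'L':
            # entering a left interval: it overlaps every open right interval
            overlaps.extend((i, j) for j in open_right)
            open_left.add(i)
        else:
            # entering a right interval: it overlaps every open left interval
            overlaps.extend((j, i) for j in open_left)
            open_right.add(i)
    return sorted(overlaps)
```

The dynamic version below does the same walk, but pulls the events lazily from the two sorted iterators instead of building (and sorting) the whole list.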

After a bit of bashing, I got this working and the code is below: (I know this is verbose – it has quite a few comments)

# uses a plane sweep to determine the overlaps between two sets of genomic intervals
# they must come in as iterators of (chrom, start, end, value) tuples,
# sorted, with the same chromosome order
def overlapGeneratorPlaneSweep(left, right):
	# rightcache holds the points we've retrieved from the right set
	# that match the current left set chromosome, but we haven't finished filing yet
	rightcache = []
	# nextcache holds points from right that are from a later chromosome
	# than what we're currently looking at in left
	nextcache = []

	lastchrom = None
	for linterval in left:
		lchrom, lstart, lend, lvalue = linterval

		# slightly complex mechanics to handle a change in chromosome in left
		if lchrom != lastchrom:
			# make sure there aren't any points left over in right
			# from the last chromosome from left
			if rightcache != []:
				righti = rightcache[-1]
				try:
					while righti[0] < lchrom:
						righti = next(right)
					nextcache = [ righti ]
				except StopIteration:
					nextcache = []
			rightcache = []
			assert len(nextcache) <= 1
			# the next cache may be from a chromosome *before* the left chrom:
			# suck up all the points from before left's chrom
			# - they will never overlap now
			if nextcache != [] and nextcache[0][0] < lchrom:
				try:
					righti = nextcache[0]
					while righti[0] < lchrom:
						righti = next(right)
					assert righti[0] >= lchrom
					nextcache = [ righti ]
				except StopIteration:
					nextcache = []
			# if the point in nextcache matches the left chrom,
			# nextcache is now current, so make it rightcache
			if nextcache != [] and nextcache[0][0] == lchrom:
				rightcache = nextcache[:]
				nextcache = []
			assert rightcache == [] or rightcache[0][0] == lchrom
		# if the last point in the rightcache overlaps this left point,
		# there might be more points in the right that are relevant now
		while nextcache == [] and (rightcache == [] or rightcache[-1][1] <= lend):
			try:
				righti = next(right)
			except StopIteration:
				break
			if righti[0] == lchrom:
				# on the current chromosome - file it for checking below
				rightcache.append(righti)
			elif righti[0] < lchrom:
				# from an earlier chromosome - it can never overlap now
				continue
			else:
				# from a later chromosome - save it until left catches up
				nextcache = [ righti ]
		newcache = []
		# the bit where we actually check and report the overlaps
		for idx, rinterval in enumerate(rightcache):
			# we've gone past this one. Drop it from the cache
			if rinterval[2] <= lstart:
				continue
			newcache.append(rinterval)
			# we haven't reached this one yet - nothing later can overlap,
			# but keep the rest of the cache for the next left interval
			if rinterval[1] >= lend:
				newcache.extend(rightcache[idx + 1:])
				break
			# if we get here, this is an overlap. Yield it.
			yield (linterval, rinterval)
		rightcache = newcache
		lastchrom = lchrom
This operates on two iterators of intervals, left and right which must be sorted in the same chromosome order. We pop off an interval from left, and then pop off intervals from right, dropping those that are before the left interval and caching the others, until we go past the left interval. We report all the overlaps we found, and then pop the next left interval and continue.

A lot of the complicated logic happens in making the transition between chromosomes, particularly when there are gaps – when the two sets do not contain intervals from the same chromosomes. This is where the algorithm relies on the chromosomes from both being in alphanumeric order, so that we can tell whether one chromosome is “before” another. This works if we presort the intervals using (for example) the standard Unix sort utility, and I have some wrapper functions to make this work with bed files:

import csv
import subprocess

def sortBedGenerator(bedfile):
	sortfile = bedfile + ".sort"
	cmd = "sort -k 1.4,1.5 -k 2,2n %s > %s" % (bedfile, sortfile)
	subprocess.check_call(cmd, shell=True)
	reader = csv.reader(open(sortfile), delimiter="\t")
	for row in reader:
		yield (row[0], int(row[1]), int(row[2]), 1)

def overlapBedFilesPlaneSweep(leftbedfile, rightbedfile):
	leftgen = sortBedGenerator(leftbedfile)
	rightgen = sortBedGenerator(rightbedfile)
	return overlapGeneratorPlaneSweep(leftgen, rightgen)

So we sort the bed file into a temp file, and then yield the sorted intervals one at a time.
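As an aside, “alphanumeric order” here just means plain string order, which is not the natural karyotype order – but that doesn’t matter, as long as both sides agree on it. A quick illustration (made-up intervals):

```python
# Plain string sorting of (chrom, start, end, value) tuples puts chr10
# before chr2 - unnatural, but consistent, which is all the sweep needs.
intervals = [('chr2', 50, 60, 1), ('chr10', 5, 15, 1), ('chr1', 0, 10, 1)]
sweep_sorted = sorted(intervals)
# sweep_sorted is [('chr1', 0, 10, 1), ('chr10', 5, 15, 1), ('chr2', 50, 60, 1)]
```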

The plane sweep algorithm yields pairs of intervals, one from each set, which overlap. Each interval carries a 4th item which can be anything, so you can attach objects (or whatever) to the intervals to provide extra context for whatever application you’re writing. This is slightly different from the way the interval tree would be used, of which more later.

I ran some timing tests on this to compare it to my old faithful cython interval tree, using one tree for each chromosome. My test data was based on the mouse genome. On one side, I used the csp6i restriction enzyme (GTAC), a 4-base cutter we use in a lot of our 4C and Hi-C work; there are 5,436,400 sites in the mouse genome. On the other side, I generated random bed files of increasing numbers of 50bp reads – 10, 1000, 10000, 1000000, 2000000 and 5000000. The 10 read example was there to make sure the plane sweep algorithm could cope with gaps in the chromosome roster. I made sure both the interval tree and plane sweep methods identified the same intervals (the 10 read example is missing from this plot, but I did make sure the overlaps were the same). I realise this test scheme is a little arbitrary, but this is a typical use case for me, so it seemed like a good place to start.

Since it makes a difference whether you use the csp6i file as the tree or the query and as the “left” or “right” in the plane sweep, I tested it both ways. I did three tests of each. Below are the results, using the triplicate results to generate standard deviation error bars. Note that “git” stands for Genomic Interval Tree (one interval tree for each chromosome) and “PS” stands for Plane Sweep. Since the GIT method does not require the intervals to be sorted, every PS run includes the time it takes to sort the files first.

The plot headings refer to whether the csp6i file was on the left (“big-left” – which means it was used to build the interval trees) or the right (“big-right” – used as the queries). The plane sweep seems to perform much better than the interval trees for all the file sizes. The times are starting to ramp up at the 5,000,000 read level, but the plane sweep is still about 10 times faster than the GIT method. I haven’t checked this, but it also ought to be much more memory-efficient, since we’re not storing all the intervals in trees – we just visit each interval once to check its overlaps as we make the sweep.

As expected, the interval tree approach is much more efficient when we build a tree with the small number of intervals and make lots of queries on that tree. It makes a difference with the plane sweep too, since one side will have to make use of the more complicated caching machinery. I did a separate analysis to examine this:

To me, this is a good start, and I have already used the plane sweep approach to do some analysis work. I could probably speed it up even more by allowing the use of pre-sorted files. However, there are quite a few caveats:

  1. This is a very specific set of conditions. Before we could conclusively say the plane sweep approach was faster, there are lots of other conditions I would need to test:
    • Different sizes of intervals.
    • Many intervals within each group that overlap each other.
    • Interval sizes that vary wildly within (and between) the groups.
  2. There is a lot of tuning possible on the structure of the interval trees that will affect performance.
  3. I’m not entirely comfortable with the reliance on coordinates in alphanumeric order here. It probably needs more error checking to make sure it reports if something goes wrong with that.
  4. As I mentioned, the way this should be used is different to the interval tree approach. With the csp6i sites, I usually want to look at a 4C read and ask “is it near a csp6i site?” Interval trees are a natural fit for this – I just query the tree, based on some tolerance for how near the site should be, and if I get any hits then I can act on them. To use the plane sweep approach, I need to change the layout of my code slightly to look for the presence of the read in the list of returned overlaps. I would also need to add an extra wrapper to generate intervals that take account of the tolerance I want to use. These are not insurmountable obstacles, but the plane sweep is not a drop-in replacement for the interval tree approach.
  5. Although comparing two sets of intervals is the basic use case, you always have to be wary of a one-size-fits-all approach. The plane sweep might be quicker for a pair of interval sets like this but, again with respect to the csp6i example, I might want to load the csp6i sites into a set of interval trees, and then query a list of 4C read-sets (lanes) against the tree. With only a few lanes of 4C, the plane sweep will probably still be quicker. How many lanes would I need before the interval tree approach becomes quicker again?
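To give a flavour of the tolerance wrapper from caveat 4, a hypothetical generator (with_tolerance is my name for it, not part of any existing code) that widens each interval before it enters the sweep:

```python
def with_tolerance(intervals, tol):
    # Hypothetical wrapper: widen each (chrom, start, end, payload)
    # interval by tol bases on each side before it enters the plane
    # sweep, so "overlaps" becomes "within tol of". Clamping the start
    # at 0 keeps the coordinates valid and preserves the sort order.
    for chrom, start, end, payload in intervals:
        yield (chrom, max(0, start - tol), end + tol, payload)
```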

I guess for me the biggest caveat is that the plane sweep approach is more complex and is therefore more susceptible to edge cases and programmer error. The GIT method is mature and I trust it, so I will probably have more anxiety about the PS for a while. Still, these preliminary results look promising.

I would also be interested to implement a multi-way plane sweep that overlaps several sets of intervals against a reference set. Maybe this would be quicker than building a tree for the reference set and comparing the others – something I still do very often.

Automount USB drives – Ubuntu Server

March 21, 2011

If you try to use Ubuntu as a server, you sometimes come a cropper by relying on facilities provided by the desktop version. I use USB drives to manage some of my data, and give them volume labels so that they’re always mounted in the same place – /media/label. This doesn’t work if you’re not logged in to the GUI.

At first, I got tangled up in usbmount, which is recommended. However, it mounts the drives into usb0, usb1 etc. I couldn’t get it to mount by volume name – it even complained “/dev/sdb does not contain a filesystem or disklabel”.

The solution was to install halevt, which provides the hardware abstraction layer and an event handler to deal with plugging in USB drives.

Gene and Exon Expression Using Affymetrix Mouse Exon Arrays (MoEx)

March 15, 2011

Our lab makes extensive use of Affymetrix expression arrays to measure gene expression. I’ve already processed several different types of mouse and human gene arrays to determine which genes are active or inactive in the experimental sample. The pipeline for this is based on advice from contacts at the Sanger, and is based on the bioconductor Affy package. We use RMA but also MAS5 absent/present calls.

Some of our work requires greater resolution than gene expression – we want a measure of which exons are the most transcribed. This is closely related to the problem of determining which specific transcript of a particular gene (which set of exons) is transcribed in the cell type used in the experiment. You can do this with RNA-seq, but it’s quite expensive and the tools aren’t very mature yet. We chose to go with Affymetrix’s mouse exon arrays, or MoEx for short.

However, using R/Bioconductor to derive RMA and MAS5 values for MoEx is not as easy as for the gene arrays. In R, you should (crudely) be able to do:

> library("affy")
Loading required package: Biobase

Welcome to Bioconductor

Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation(pkgname)'.

> rawAffy <- ReadAffy()
> rmaAffy <- rma(rawAffy)

but for MoEx you get:

Error in getCdfInfo(object) :
Could not obtain CDF environment, problems encountered:
Specified environment does not contain MoEx-1_0-st-v1
Library - package moex10stv1cdf not installed
Bioconductor - moex10stv1cdf not available

For gene arrays, you can usually just use biocLite to install a Bioconductor library that includes the CDF environment, and indeed there is such a package for MoEx, but installing it doesn’t fix the error message.

I’m sure if I were better at R I could resolve this, but instead I downloaded the CDF manually from the XMap page (more about XMap in a minute) and then did this:

adata <- ReadAffy()
adata@cdfName <- "mouseexonpmcdf"
rma <- rma(adata)

Then you can dump this list of RMA values like this:

write.exprs(rma, "RMA_output.csv", sep="\t")

This produces a file that has RMA values for each MoEx probeset for each biorep we used. Each probeset represents a set of oligo probes which, together, cover a region of the genome to provide information about the exon it overlaps. Now we need to translate probeset IDs to exons. In my case, I want Ensembl exon IDs, since Ensembl is the gold standard annotation for us.

With GeneChip arrays, Ensembl has computed mappings from Affy’s probeset IDs to Ensembl genes, and you can get hold of this information quickly using BioMart. Ensembl also has mappings for MoEx, but these are from MoEx probeset IDs to gene IDs, and not exon IDs.

I tried a variety of methods to derive a specific mapping to exon IDs. I used the coordinates Affy provides for its probes (and thus probesets), which can be obtained under technical documentation (registration required?), to look up the overlapping exons from a chosen Ensembl annotation. I also tried using the sequences of the probes (also available on the Affy web page), BLASTing them against the genome to obtain the coordinates of the probesets myself, and then doing the overlapping as before. However, all these methods are rather hacky, and I’m by no means an expert with this kind of technology, so I can’t tell whether what I’ve done is appropriate or not.

Instead, we took advice and decided to use the exonmap mapping from the Paterson Institute in Manchester. Exonmap is the new name for XMap, which I mentioned earlier – they seem to be the MoEx experts. Exonmap is actually a full Bioconductor package that lets you derive RMA values for specific gene lists, among a variety of other functionality. However, as always, we need more fine-grained control over how probesets are combined, so I just extracted the probeset to Ensembl exon ID mapping from the exonmap database tables to use for my stuff.
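As an illustration of the kind of join involved, a hedged sketch – the file layouts (a two-column tab-separated mapping dump, and the write.exprs() output with the probeset ID in the first column) are my assumptions, not the exact formats:

```python
import csv
from collections import defaultdict

def load_probeset_to_exon(mapping_file):
    # Assumed two-column tab-separated dump of the exonmap tables:
    # probeset ID in column 1, Ensembl exon ID in column 2. A probeset
    # can map to more than one exon, so collect lists.
    mapping = defaultdict(list)
    with open(mapping_file) as fh:
        for probeset, exon in csv.reader(fh, delimiter="\t"):
            mapping[probeset].append(exon)
    return mapping

def rma_by_exon(rma_file, mapping):
    # Join the RMA table (probeset ID in column 1, one RMA value per
    # biorep in the remaining columns) onto Ensembl exon IDs.
    by_exon = defaultdict(list)
    with open(rma_file) as fh:
        next(fh)  # skip the header row of sample names
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            probeset, values = fields[0], [float(v) for v in fields[1:]]
            for exon in mapping.get(probeset, []):
                by_exon[exon].append(values)
    return by_exon
```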

Downstream, I have custom code to work out expression level for individual exons and genes as well as differential expression between treatment groups. We use an approach based on the splice index method.
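For concreteness, one common formulation of the splice index score, sketched here with linear-scale inputs (RMA values are actually log2, in which case the ratios become subtractions):

```python
import math

def splicing_index(exon_a, gene_a, exon_b, gene_b):
    # One common formulation: the log2 ratio of the exon-to-gene
    # expression ratio between treatment groups A and B. Inputs are
    # assumed to be on a linear (unlogged) scale; a score of 0 means
    # the exon tracks its gene identically in both groups.
    return math.log2((exon_a / gene_a) / (exon_b / gene_b))
```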

However, there are other packaged solutions for doing this. One we tried was AltAnalyze, a Python tool with a nice GUI that allows you to run a variety of types of exon analysis on MoEx and other experiment types. It comes automatically packaged with the Affy power tools (although you can also get those yourself) and a selection of Affy annotation files.

As usual, we needed something we could fiddle with more, so we still ended up with custom code. However, we did use the background-corrected calculation for RMA values from AltAnalyze, which is actually an Affy power tools command:

AFFY=$HOME/src/AltAnalyze_v1release/AltDatabase/affymetrix
$AFFY/APT/Linux/64bit/apt-probeset-summarize \
  -p $AFFY/LibraryFiles/MoEx-1_0-st-v1.r2.pgf \
  -c $AFFY/LibraryFiles/MoEx-1_0-st-v1.r2.clf \
  -b $AFFY/LibraryFiles/MoEx-1_0-st-v1.r2.antigenomic.bgp \
  --kill-list $HOME/src/AltAnalyze_v1release/AltDatabase/EnsMart55/Mm/exon/Mm_probes_to_remove.txt \
  -a rma-sketch -a dabg \
  --cel-files cel_files.txt -o .

Obviously this has a lot of file dependencies from AltAnalyze, but this gives us not only RMA values, but p-value confidences for those values, which we use downstream. This replaces the R/Bioconductor RMA value generation. I expect there are arguments to be had about which tool generates the best expression values.

I like this Affy power tools approach, because it does not rely on R/Bioconductor and can be easily scripted. We managed to find an ArrayExpress entry containing MoEx bioreps for 54 different mouse tissues. Using the Affy power tools, I wrote a script to generate RMA values and apply some other parts of our pipeline to all 54 groups. We could then compare probeset levels in our experiments to all these tissue types to derive the relative level of expression for exons in our samples – all this took only a matter of days.

numpy and NaN again

July 30, 2010

As a continuation of my heatmap work, I’ve been normalising the interaction frequencies against the density of other biological features. In particular, we are more likely to detect interactions around sites of the restriction enzyme we used to separate the interacting partners after cross-linking. In our case, this is HindIII (which my biological colleagues pronounce “hindi – 3”). Therefore, we’d like to normalise against this density – if we pick up interactions in a region of low HindIII density, we should consider them more significant than those in an area of high HindIII density.

First of all, we compute the density of HindIII sites at the same resolution as our heatmaps – 1Mb. We then divide each density by the median HindIII density; this gives us the density of each 1Mb region relative to the median. Each interaction is between two 1Mb regions, and is more likely to occur the denser the HindIII sites in each of those regions. Therefore, we want to normalise each spot on the heatmap by the product of the relative HindIII densities of the two regions.

We can do this conveniently in numpy, because numpy gives us native matrix operations. Our heatmap has one chromosome on the x axis and another on the y axis. We take the relative hindIII densities for the y axis chromosome and turn this into a column vector by doing:

ypoints.reshape(-1, 1)

Then we can generate a matrix of products with the x axis densities by doing:

normmatrix = xpoints * ypoints.reshape(-1, 1)

To satisfy myself that this produced sensible output, I did this:

>>> from numpy import arange
>>> arange(10).reshape(-1,1) * arange(10)
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
[ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27],
[ 0, 4, 8, 12, 16, 20, 24, 28, 32, 36],
[ 0, 5, 10, 15, 20, 25, 30, 35, 40, 45],
[ 0, 6, 12, 18, 24, 30, 36, 42, 48, 54],
[ 0, 7, 14, 21, 28, 35, 42, 49, 56, 63],
[ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72],
[ 0, 9, 18, 27, 36, 45, 54, 63, 72, 81]])

Then normalise the heatmap matrix by doing:

normalised = matrix / normmatrix

This is very concise and intuitive. However, as the title of this post suggests, I ended up tangling with NaNs again.

At various points, the density of hindIII sites drops to 0. This means that we’ll be dividing by zero for some of the matrix cells. Exactly what numpy does about this is subject to some slightly odd variations, in particular when our heatmap matrix also contains 0 at that position, so we’ll be dividing zero by zero.

In regular Python, here’s what happens when you divide zero by zero:

>>> 0 / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> 0. / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
>>> 0 / 0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division

Note that in each case, we get a ZeroDivisionError exception, but when either the denominator or numerator is a float, the error message is slightly different. If we do the same inside a numpy array:

>>> from numpy import array
>>> array([0]) / array([0])
array([0])
>>> array([0.]) / array([0])
array([ NaN])
>>> array([0]) / array([0.])
array([ NaN])

This is particularly odd because:

>>> array([2]) / array([0])
array([0])
>>> array([2.]) / array([0])
array([ Inf])
>>> array([2.]) / array([0.])
array([ Inf])
>>> array([0.]) / array([0.])
array([ NaN])

So integers divided by zero in numpy give 0, floats divided by zero give inf, and a float zero divided by zero gives a NaN. This seems pretty odd to me, but then I thought the other NaN behaviour was weird too, and that turned out to be based on the standard.

To get sensible heatmaps from this, I have to filter out both infs and nans and replace them with 0s. Filtering inf is easy, using numpy’s selective assignment:

normalised[normalised==numpy.inf] = 0

but you can’t do this with nan, because numpy.nan != numpy.nan. Fortunately, numpy provides a function to do just what I’m after:

normalised = numpy.nan_to_num(normalised)
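Putting the whole normalisation together – a sketch (the function and array names are mine), using numpy’s errstate to quiet the divide warnings along the way:

```python
import numpy as np

def normalise_heatmap(matrix, xdens, ydens):
    # Relative densities: each bin's HindIII density over the median.
    xrel = xdens / np.median(xdens)
    yrel = ydens / np.median(ydens)
    # Broadcasting a row against a column vector gives the per-cell
    # product of relative densities; suppress the divide warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        normalised = matrix / (xrel * yrel.reshape(-1, 1))
    # Zero-density bins produce infs (x/0) and NaNs (0/0); zero them out.
    normalised[np.isinf(normalised)] = 0
    return np.nan_to_num(normalised)
```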