Archive for November, 2008

Finding GTAC

November 28, 2008

In continuing to get to grips with the various tools and technologies of bioinformatics, I was asked a few weeks ago to do some genome searching. A colleague wanted to find every instance of the sequence GTAC in the human genome – about 3 billion bases in total. He wanted the genome coordinate as well as to know what was 100 bases upstream and downstream of each match.

As with most computing tasks, there are a few different tools for this job, but I chose to use the Ensembl API and posted to the ensembl-dev list to ask about it. After a little toing and froing, I arrived at the following Perl script (sorry about the code formatting – WP doesn’t support Perl, so this is using Python formatting…):

#!/usr/bin/perl

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
	-host => "ensembldb.ensembl.org",
	-user => "anonymous"
);

$slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

my @chromosomes = (	"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
					"12", "13", "14", "15", "16", "17", "18", "19", "20", "21", 
					"22", "X", "Y" );

foreach(@chromosomes) { $sitecounts{$_} = 0; }
my $coord_system_name = "chromosome";

foreach $chr (@chromosomes) {           # make an array of chromosome names
	print "working at chromosome: ". $chr . "\n";
	my $chr_slice = $slice_adaptor->fetch_by_region($coord_system_name,$chr) or die "Failed to slice chromosome $chr";
	my $chr_seq = $chr_slice->seq();
	print length($chr_seq)." bases read.\n";        # just to check
	my ($site, $prev) = (0, 0, -1);
	until ($site == -1) {
		$sitecounts{$chr}++;            # record number of sites per chromosome
		$site = index($chr_seq, 'GTAC', $prev + 1);
		last if ($site == -1);
		$prev = $site;
		my $truesite = $site + 1;       # because $chr_seq starts at 0, according to index()
		my $upstream = substr($chr_seq, $site+4, 100);          # $site starts at the 'G' in 'GTAC', so begin capture after 'C'
		my $downstream = substr($chr_seq, $site-100, 100);
		push @output, "$chr\t$truesite\t$upstream\t$downstream\n";
	}
	print "site count: " . $sitecounts . "\n";
}

open FPOUT, "> tester.txt" || die;

foreach(@output) { print FPOUT $_; }
close FPOUT;

while( my($key, $value) = each(%sitecounts) ) {
	print "$key => $value\n";
}

Which did the job in around 10 minutes – most of the time spent downloading the massive sequence files themselves.

I continue to be impressed by the power of these tools.

Of course, now I need to design PCR primers for each of the located sequences, which is another story altogether.

Advertisements