Finding GTAC

In continuing to get to grips with the various tools and technologies of bioinformatics, I was asked a few weeks ago to do some genome searching. A colleague wanted to find every instance of the sequence GTAC in the human genome – about 3 billion bases in total. He wanted the genome coordinate as well as to know what was 100 bases upstream and downstream of each match.

As with most computing tasks, there are a few different tools for this job, but I chose to use the Ensembl API and posted to the ensembl-dev list to ask about it. After a little toing and froing, I arrived at the following Perl script (sorry about the code formatting – WP doesn’t support Perl, so this is using Python formatting…):

#!/usr/bin/perl

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
	-host => "ensembldb.ensembl.org",
	-user => "anonymous"
);

$slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

my @chromosomes = (	"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
					"12", "13", "14", "15", "16", "17", "18", "19", "20", "21", 
					"22", "X", "Y" );

foreach(@chromosomes) { $sitecounts{$_} = 0; }
my $coord_system_name = "chromosome";

foreach $chr (@chromosomes) {           # make an array of chromosome names
	print "working at chromosome: ". $chr . "\n";
	my $chr_slice = $slice_adaptor->fetch_by_region($coord_system_name,$chr) or die "Failed to slice chromosome $chr";
	my $chr_seq = $chr_slice->seq();
	print length($chr_seq)." bases read.\n";        # just to check
	my ($site, $prev) = (0, 0, -1);
	until ($site == -1) {
		$sitecounts{$chr}++;            # record number of sites per chromosome
		$site = index($chr_seq, 'GTAC', $prev + 1);
		last if ($site == -1);
		$prev = $site;
		my $truesite = $site + 1;       # because $chr_seq starts at 0, according to index()
		my $upstream = substr($chr_seq, $site+4, 100);          # $site starts at the 'G' in 'GTAC', so begin capture after 'C'
		my $downstream = substr($chr_seq, $site-100, 100);
		push @output, "$chr\t$truesite\t$upstream\t$downstream\n";
	}
	print "site count: " . $sitecounts . "\n";
}

open FPOUT, "> tester.txt" || die;

foreach(@output) { print FPOUT $_; }
close FPOUT;

while( my($key, $value) = each(%sitecounts) ) {
	print "$key => $value\n";
}

Which did the job in around 10 minutes – most of the time spent downloading the massive sequence files themselves.

I continue to be impressed by the power of these tools.

Of course, now I need to design PCR primers for each of the located sequences, which is another story altogether.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: