BLAST database automation using mirror and cron

This article explains how to maintain a set of BLAST databases on your Linux system automatically (i.e. with no input from you except the initial setup). I use Debian, but the procedure should be easily adapted for other distros.

There are 6 steps:

  1. Install BLAST
  2. Configure BLAST
  3. Install mirror
  4. Configure mirror
  5. Write a formatdb script
  6. Configure cron jobs

Install BLAST

You can use the Debian packages ncbi-tools-bin and ncbi-data. If you want the latest versions, it's best to compile from source. Grab the NCBI toolkit from ftp.ncbi.nih.gov/toolkit/ncbi_tools/ncbi.tar.gz, unpack and follow the instructions in make/readme.unx.

Configure BLAST

You may want to move the compiled binaries to a location in your path (e.g. /usr/local/bin). NCBI data and matrices are supplied in the toolkit's data/ folder; you may also want to move this to e.g. /usr/share/ncbi/data, which is the default location if you use the deb package. Finally, choose a location for your BLAST databases. I normally use /usr/local/databases/blast.

Next, you'll need to create the configuration file, .ncbirc, in your home directory (e.g. /home/your_home/.ncbirc). Using the locations outlined above, .ncbirc would look like this:

[NCBI]
Data=/usr/share/ncbi/data
[BLAST]
BLASTDB=/usr/local/databases/blast
BLASTMAT=/usr/share/ncbi/data
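A mistyped path in .ncbirc is easy to miss until BLAST fails to find its data. As a minimal sketch, the following Perl helper reads a file in the format shown above and reports any setting that does not point to an existing directory. The script name check_ncbirc.pl and the subs parse_ncbirc and missing_dirs are made up for this example and are not part of the NCBI toolkit. It also assumes every value in the file is a directory path, which holds for the three settings used here.

```perl
#!/usr/bin/perl
# check_ncbirc.pl (hypothetical helper, not part of the NCBI toolkit)
use strict;
use warnings;

# Parse an INI-style .ncbirc into a hash of "SECTION.KEY" => value.
# Lines that are neither [SECTION] nor KEY=value are ignored.
sub parse_ncbirc {
    my ($path) = @_;
    my (%conf, $section);
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;                      # trim whitespace
        if ($line =~ /^\[(.+)\]$/) {
            $section = $1;
        }
        elsif (defined $section && $line =~ /^([^=]+?)\s*=\s*(.*)$/) {
            $conf{"$section.$1"} = $2;
        }
    }
    close $fh;
    return %conf;
}

# Return the settings whose value is not an existing directory.
sub missing_dirs {
    my (%conf) = @_;
    return grep { !-d $conf{$_} } sort keys %conf;
}

# Check every file named on the command line.
for my $rc (@ARGV) {
    my %conf    = parse_ncbirc($rc);
    my @missing = missing_dirs(%conf);
    print "$_ = $conf{$_} (directory not found)\n" for @missing;
    print "$rc: all configured directories exist\n" unless @missing;
}
```

Run it as perl check_ncbirc.pl ~/.ncbirc after creating the directories.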

Install mirror

Mirror is used to synchronise your local files with an ftp site. On Debian, installing it is as simple as apt-get install mirror. Alternatively, get the software from ftp://sunsite.org.uk/old-sunsite/mirror/.

Configure mirror

First, explore the NCBI ftp site and decide which fasta files you want to download. I am interested primarily in microbial genomes, which are found in ftp.ncbi.nih.gov/genomes/Bacteria.

Next, set up the mirror configuration file. You create config files in /etc/mirror/packages/, giving them an appropriate name. Mine is called ftp.ncbi.nih.gov and looks like this:

package=genomes
comment=NCBI genomes
site=ftp.ncbi.nih.gov
remote_dir=/genomes/Bacteria
local_dir=/usr/local/databases/genomes/Bacteria
exclude_patt=(\.asn|\.val|\.gene2|accession|microbe|README)
remote_user=anonymous
remote_password=neil@nodalpoint.org
timeout=300
user=neil
group=neil

The syntax is pretty straightforward; see the mirror manual for more details. As is, this downloads the Bacteria directory and most files in the sub-directories (which are named by organism): ptt, genbank and so on. I use these files, but if you want only protein and DNA fasta files (.faa and .fna), you can exclude the other suffixes on the exclude_patt line.
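Before changing exclude_patt, it can help to try the pattern against a few file names. The sketch below assumes that mirror treats exclude_patt as a Perl regular expression matched against each remote path. The extra suffixes (.ptt, .gbk, .gff, .rnt) are illustrative guesses at what the organism directories contain, so check them against the ftp site. The sub kept_files is a name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the package file above, extended with some non-fasta
# suffixes. The added suffixes are assumptions for illustration only.
my $exclude_patt =
    qr/(\.asn|\.val|\.gene2|accession|microbe|README|\.ptt|\.gbk|\.gff|\.rnt)/;

# Return the paths that do NOT match the exclude pattern.
sub kept_files {
    my ($patt, @paths) = @_;
    return grep { $_ !~ $patt } @paths;
}

# Sample paths built from the organism and accession used in this article.
my @sample = qw(
    Escherichia_coli_K12/NC_001234.faa
    Escherichia_coli_K12/NC_001234.fna
    Escherichia_coli_K12/NC_001234.ptt
    Escherichia_coli_K12/NC_001234.gbk
    Escherichia_coli_K12/NC_001234.asn
);

print "$_\n" for kept_files($exclude_patt, @sample);
```

With these sample names only the .faa and .fna paths are printed. A dry run with mirror -n remains the real test of the package file.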

Write a formatdb script

Now the fun begins. We want a script that we can use to run the NCBI program formatdb on our fasta files at set intervals. I use a Perl script named formatdb.pl, which goes like this:

#!/usr/bin/perl -w
# formatdb.pl
# Used to run formatdb on NCBI microbial genomes
# Input files are .fna and .faa retrieved using mirror
# Organism name is extracted to name the blastdb files
 
use strict;
use File::Find;
 
## globals
# location of your .fna and .faa files
 
my $genomes = '/usr/local/databases/genomes/Bacteria';
my %faa;
my %fna;
find(\&files, $genomes);
&formatdb;
 
## concatenate .faa|.fna files by organism name
 
sub files {
 
  if(/\.fna$/) {
    my $org = $File::Find::dir;
    if($org =~/\/Bacteria\/(.*?)$/) {
      push @{$fna{$1}}, $File::Find::name;
    }
  }
 
  if(/\.faa$/) {
    my $org = $File::Find::dir;
    if($org =~/\/Bacteria\/(.*?)$/) {
      push @{$faa{$1}}, $File::Find::name;
    }
  }
}
 
## create the blast dbs, logfile
 
sub formatdb {
  my $logfile = '/home/neil/logs/formatdb.log';
 
  if(-e $logfile) {
    unlink $logfile;
  }
 
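  # must be the same directory as BLASTDB in .ncbirc
  chdir '/usr/local/databases/blast' or die "Cannot chdir to BLAST database directory: $!\n";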
 
  for my $key(sort keys %faa) {
    system("cat ".join(" ", @{$faa{$key}})." | /usr/local/bin/formatdb -i stdin -o T -n $key.aa -l $logfile");
  }
 
  for my $key(sort keys %fna) {
    system("cat ".join(" ", @{$fna{$key}})." | /usr/local/bin/formatdb -i stdin -o T -p F -n $key.nt -l $logfile");
  }
}

It's a straightforward script with a couple of little tricks. The File::Find module is used to locate our fasta files (.fna or .faa). These are named by accession number, e.g. NC_001234.fna, which is not very informative when it comes to identifying the BLAST database files. So we use a regex to get the directory name (in this case an organism name like Escherichia_coli_K12) and associate it with the files found there. The full path of each fasta file is pushed onto a hash of arrays, where the key is the organism name and the array elements are the paths (there may be more than one per organism if there are multiple chromosomes or plasmids).
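The grouping step can be tried without touching the file system. This sketch uses one sub for both suffixes instead of the two near-identical blocks in the script. Its regex is a little stricter than the original: it accepts only files that sit directly inside an organism directory under Bacteria/. The sub names classify and group_by_organism are made up for this example, as is the second accession, NC_005678.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (organism, suffix) for a fasta path, or an empty list otherwise.
sub classify {
    my ($path) = @_;
    return $path =~ m{/Bacteria/([^/]+)/[^/]+\.(faa|fna)$} ? ($1, $2) : ();
}

# Build $group{suffix}{organism} = [ full paths ].
sub group_by_organism {
    my (@paths) = @_;
    my %group;
    for my $path (@paths) {
        my ($org, $suffix) = classify($path) or next;   # skip non-fasta files
        push @{ $group{$suffix}{$org} }, $path;
    }
    return %group;
}

# Sample paths; NC_005678 is a made-up accession for illustration.
my $root  = '/usr/local/databases/genomes/Bacteria';
my @paths = (
    "$root/Escherichia_coli_K12/NC_001234.fna",
    "$root/Escherichia_coli_K12/NC_005678.fna",
    "$root/Escherichia_coli_K12/NC_001234.faa",
    "$root/Escherichia_coli_K12/NC_001234.ptt",
);

my %group = group_by_organism(@paths);
for my $suffix (sort keys %group) {
    for my $org (sort keys %{ $group{$suffix} }) {
        printf "%s %s: %d file(s)\n",
            $suffix, $org, scalar @{ $group{$suffix}{$org} };
    }
}
```

In the real script, File::Find would supply the paths instead of the sample list.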

Next we set up a log file, move to the BLAST database location and run formatdb. For each organism we concatenate the fasta files, then pass those as standard input to formatdb, using the hash key to name the BLAST database by organism, specifying the correct switches to formatdb for either protein or nucleotide and directing messages to our logfile.
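The script passes file names straight to the shell and ignores the exit status of each pipeline. A slightly more defensive sketch is shown below. It quotes every name and reports failures, using only the formatdb switches already used in the script. The subs shell_quote, formatdb_command and run_logged are names made up for this example. The status reported is that of the last command in the pipeline, i.e. formatdb.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap a string in single quotes so the shell treats it as one literal word.
sub shell_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;        # close quote, escaped quote, reopen quote
    return "'$s'";
}

# Build the cat | formatdb pipeline for one organism.
# $type is 'aa' (protein) or 'nt' (nucleotide, which adds -p F).
sub formatdb_command {
    my ($type, $name, $logfile, @files) = @_;
    my $inputs = join ' ', map { shell_quote($_) } @files;
    my $nucl   = $type eq 'nt' ? ' -p F' : '';
    return "cat $inputs | /usr/local/bin/formatdb -i stdin -o T$nucl"
         . ' -n ' . shell_quote("$name.$type")
         . ' -l ' . shell_quote($logfile);
}

# Run a command through the shell; warn and return false if it fails.
sub run_logged {
    my ($cmd) = @_;
    my $status = system($cmd);
    warn "Command failed (status $status): $cmd\n" if $status != 0;
    return $status == 0;
}
```

Inside the loops of the script, the system(...) lines could then become run_logged(formatdb_command('aa', $key, $logfile, @{$faa{$key}})) and the equivalent call with 'nt' and %fna.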

Configure cron jobs

You should test your mirror configuration (use -n for a dry run), make sure that it downloads your fasta files and test your formatdb script to make sure that it creates BLAST database files in the desired location without error. Then you can use crontab -e to schedule both jobs at an interval of your choosing. Mine looks like so:

0 6 * * * mirror /etc/mirror/packages/ftp.ncbi.nih.gov
0 0 * * 0 /home/neil/projects/utilities/scripts/perl/bio/formatdb.pl

So mirror runs at 6 am every day, looks for new genomes and downloads them if found. At midnight every Saturday night (00:00 on Sunday, which is day 0 in cron), my formatdb script runs and builds the BLAST databases. Currently I rebuild all databases each week, but you could easily tweak the script to run on only newly-downloaded files.
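One way to rebuild only what has changed is to keep a stamp file per database and compare modification times. This is a minimal sketch under assumptions: the subs needs_rebuild and touch_stamp and the .stamp naming are made up for this example. It also compares mtime, so if mirror preserves the remote timestamps on downloaded files, a file that arrives late could be missed, and comparing ctime may be safer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the stamp file is missing, or if any fasta file is newer than it
# (or cannot be stat'ed at all).
sub needs_rebuild {
    my ($stamp, @fasta) = @_;
    return 1 unless -e $stamp;
    my $built = (stat $stamp)[9];          # mtime of the last successful build
    for my $file (@fasta) {
        my $mtime = (stat $file)[9];
        return 1 if !defined $mtime || $mtime > $built;
    }
    return 0;
}

# Record a successful build by creating the stamp file or updating its time.
sub touch_stamp {
    my ($stamp) = @_;
    open my $fh, '>', $stamp or die "Cannot write $stamp: $!\n";
    close $fh;
    my $now = time;
    utime $now, $now, $stamp or die "Cannot update $stamp: $!\n";
    return 1;
}
```

In the protein loop of the script this might be used as next unless needs_rebuild("$key.aa.stamp", @{$faa{$key}}); before the formatdb call, with touch_stamp("$key.aa.stamp") after it succeeds.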

There's a modification of the script which enables it to generate files for the NCBI WWW BLAST interface, but we'll save that for another day.

 
automated_databases.txt · Last modified: 2007/08/20 13:27 (external edit)
 