For example, in taxonomy-based tools such as Kraken, mapping
1) taxonomy id to sequence id (gi or accession) and
2) taxonomy id to a human-readable taxonomy tree,
are built-in and transparent to the user.
Unfortunately, with BLAST+ these steps must be completed manually and are included in two separate programs,
makeblastdb
for (1) and blastn
/blastp
/blastx
for (2). (1) Taxonomy id <–> sequence id
In BLAST+, a taxid_map file file must be created and passed to makeblastdbmakeblastdb -in <FASTA file> -dbtype nucl -parse_seqids -taxid_map taxid_map.txt
where taxid_map.txt is a space or tab separated list of sequence ids (either gi or accession) and taxonomy ids. For example, with gi:
taxid_map.txtAlternatively with accession:
556927176 4570
556926995 4573
501594995 3914
taxid_map.txtThere is no turn-key way to generate this mapping
NC_022714.1 4570
NC_022666.1 4573
NC_021092.1 3914
taxid
to sequence_id
for a moderately large set of sequencing.- Creating a look-up dictionary from the complete NCBI taxonomy database is memory prohibitive.
- Entrez API calls are capped at 3 per second.
sequence_id
and taxid
. They can both be obtained from the NCBI, searching and exporting with Send to: This simple Python code snippet will do the trick for small and moderately large datasets.
from Bio import SeqIO
genbankfile = "DNA.gb"
f = open('taxid_map.txt','w')
for gb in SeqIO.parse(genbankfile,"gb"):
try:
annotations = gb.annotations['gi']
taxid = gb.features[0].qualifiers['db_xref'][0].split(':')[1]
f.write("{} {}\n".format(annotations, taxid))
except:
pass
f.close()
For large datasets, the bandwidth cost of of downloading the GenBank from NCBI becomes prohibitive, and the dictionary approach would probably be warranted.Download both the FASTA and GenBank, alternatively extract the FASTA from GenBank, e.g. with BioPython.
(2) Taxonomy id <–> Taxonomy tree
Simply include this NCBI database in the same directory as your database for the look-up to work withblastn
/blastp
/blastx
: ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gzEating your cake
blastn -db <DATABASE> -query <QUERY> -outfmt "10 qseqid sseqid pident staxids sscinames scomnames sblastnames sskingdoms"
1,gi|312233363|ref|NC_014692.1|,86.26,310261,Sus scrofa taiwanensis,Sus scrofa taiwanensis,even-toed ungulates,EukaryotaYour comma separated file (
1,gi|223976078|ref|NC_012095.1|,86.26,9825,Sus scrofa domesticus,domestic pig,even-toed ungulates,Eukaryota
1,gi|5835862|ref|NC_000845.1|,86.26,9823,Sus scrofa,pig,even-toed ungulates,Eukaryota
-outfmt
) showing human-readable taxonomy info.
2 comments:
Hi, it was really nice but unfortunately I am unable to have a taxon id in BLAST output. It says N/A. I have downloaded FASTA sequences and I am doing BLAST all vs all.
makeblastdb -in file.fasta -db prot -out ./database/db
TaxDB is in the same folder of BLAST database. BLAST query:
blastp -db ./database/db -query file.fasta -out results.out -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend staxid sscinames scomnames stitle sstart send evalue bitscore"
Any suggestions? Your help will be appreciated.
Very helpful post! Thanks a million!
You may also want to add a test like this, to make sure that the taxid is what you think it is, rather than some sort of numbder from the Human Microbiome Project or ATCC. I ran into a couple of those in my set of genbank files.
if gb.features[0].qualifiers['db_xref'][0].split(':')[0] == "taxon":
taxid = gb.features[0].qualifiers['db_xref'][0].split(':')[1]
Post a Comment