BLAST+ Usage: Supplementary Notes

Category: Programming Languages · XML · Published: 6 years ago

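BLAST is probably the most widely used local alignment tool for short sequences, but it has a lot of parameters, and there are several I had never looked at carefully. My notes on them follow: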

Adding more information to BLAST results

The output format is controlled by the -outfmt parameter. Running blastp -help lists every available format, as shown below:

*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
       qaccver means Query accesion.version
          qlen means Query sequence length
        sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       saccver means Subject accession.version
       sallacc means All subject accessions
          slen means Subject sequence length
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
          btop means Blast traceback operations (BTOP)
        staxid means Subject Taxonomy ID
      ssciname means Subject Scientific Name
      scomname means Subject Common Name
    sblastname means Subject Blast Name
     sskingdom means Subject Super Kingdom
       staxids means unique Subject Taxonomy ID(s), separated by a ';'
             (in numerical order)
     sscinames means unique Subject Scientific Name(s), separated by a ';'
     scomnames means unique Subject Common Name(s), separated by a ';'
    sblastnames means unique Subject Blast Name(s), separated by a ';'
             (in alphabetical order)
    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
             (in alphabetical order) 
        stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
       sstrand means Subject Strand
         qcovs means Query Coverage Per Subject
       qcovhsp means Query Coverage Per HSP
        qcovus means Query Coverage Per Unique Subject (blastn only)

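In practice we mostly use -outfmt 5 or -outfmt 6. The former writes XML and the latter writes tab-delimited text. I covered the XML format in an earlier post, "Blast+ xml格式解读" (on reading the BLAST+ XML format); it carries the most complete information and is the most broadly useful. The tabular format is the one I use most day to day, and it is also the one many downstream tools expect.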

The columns of the tabular format are as follows (compare them with the specifier descriptions above):

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
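
As a rough illustration of how these twelve default columns line up, the Python sketch below parses one tab-separated line into a dictionary keyed by column name. The helper name `parse_hit` and the sample line are invented for illustration; the values are not from a real BLAST run.

```python
# Default column order of -outfmt 6, as listed above.
DEFAULT_COLUMNS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()

# Columns holding integers or floats; everything else stays a string.
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_hit(line, columns=DEFAULT_COLUMNS):
    """Turn one tab-separated result line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    hit = {}
    for name, value in zip(columns, fields):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Hypothetical example line (invented values, for illustration only).
sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t365"
hit = parse_hit(sample)
print(hit["sseqid"], hit["pident"], hit["evalue"])
```

Passing a longer `columns` list to `parse_hit` should handle custom layouts as well, although any extra numeric column would stay a string unless it is added to the sets above.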

Sometimes, though, these 12 columns are not enough. For example, I may also want the coverage of each hit (qcovs: Query Coverage Per Subject).

All it takes is listing the column specifiers you want in the BLAST command. For example, to add coverage on top of the default outfmt 6 columns:

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"

Note: append as many columns as you need, separated by spaces.
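
The same idea can be scripted. The Python sketch below assembles the -outfmt value from a list of column names and then keeps only the hits whose appended qcovs column passes a threshold. The 80% cutoff, the helper names, and the two sample lines are assumptions made for illustration, not part of BLAST itself.

```python
import csv
import io

# The twelve default columns plus the extra one requested above.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qcovs",
]


def outfmt_string(columns, fmt=6):
    """Build the value passed to -outfmt, e.g. '6 qseqid sseqid ...'."""
    return " ".join([str(fmt), *columns])


def filter_by_coverage(handle, columns, min_qcovs=80.0):
    """Yield hits (as dicts of strings) whose qcovs is at least min_qcovs."""
    reader = csv.DictReader(handle, fieldnames=columns, delimiter="\t")
    for row in reader:
        if float(row["qcovs"]) >= min_qcovs:
            yield row


# Two invented result lines standing in for a real output file.
sample = io.StringIO(
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t365\t100\n"
    "q1\ts2\t75.00\t60\t15\t0\t5\t64\t1\t60\t1e-10\t60.1\t30\n"
)
kept = list(filter_by_coverage(sample, COLUMNS))
print(outfmt_string(COLUMNS))
print([hit["sseqid"] for hit in kept])
```

Keeping the column list in one place means the format string given to BLAST and the field names used for parsing cannot drift apart.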

Searching a subset of NR

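Until now I split NR into sub-databases with the approach from an earlier post, "创建NR子库以及从NR库提取特定物种分类的序列" (on building NR sub-databases and extracting sequences for a given taxonomic group from NR).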

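NCBI has since released BLAST 2.8, which can restrict a search to a taxonomic subset of NR directly, using a list of taxids generated by NCBI's own script. This is much more convenient. From the release notes: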

  • Support for a new version of the BLAST database that allows you to limit search by taxonomy as well some other improvements.

Of course, using this version means downloading NR again in the new format: ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/

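Usage is fairly simple (certainly simpler than the old approach). To search against a single species only (for example human, taxid 9606), the command is: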

blastp -db nr -query query.fasta -taxids 9606 -outfmt 6 -out blast.outfm6

To search against the mammalian subset of NR, first generate the list of taxids that fall under Mammalia (taxid 40674):

get_species_taxids.sh -t 40674 > 40674.txids

Then search the query sequences against the mammalian subset of NR:

blastp -db nr -query query.fasta -taxidlist 40674.txids -outfmt 6 -out blast.outfm6
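
For scripted runs, the single-species and taxid-list searches differ only in how the taxonomy filter is passed. A minimal Python sketch that builds the argument list is shown here; it only constructs the command and does not run BLAST. The function name `blastp_command` and its defaults are hypothetical, and the file names are the ones used in this post.

```python
import shlex


def blastp_command(query, out, taxids=None, taxidlist=None, db="nr", outfmt="6"):
    """Return a blastp argument list restricted by taxonomy.

    Pass either taxids (an iterable of IDs) or taxidlist
    (path to a file with one ID per line), not both.
    """
    if (taxids is None) == (taxidlist is None):
        raise ValueError("give exactly one of taxids or taxidlist")
    cmd = ["blastp", "-db", db, "-query", query]
    if taxids is not None:
        # Several IDs are joined with commas into a single argument.
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    else:
        cmd += ["-taxidlist", taxidlist]
    cmd += ["-outfmt", outfmt, "-out", out]
    return cmd


human = blastp_command("query.fasta", "blast.outfm6", taxids=[9606])
mammals = blastp_command("query.fasta", "blast.outfm6", taxidlist="40674.txids")
print(shlex.join(human))
print(shlex.join(mammals))
```

Assuming BLAST+ 2.8 or later and a v5 NR database are installed, the returned list could be handed to `subprocess.run(cmd, check=True)`.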

For details, see: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

This article originally appeared at http://www.bioinfo-scrounger.com. Please credit the source when reposting.

