Diffing coronaviruses

栏目: IT技术 · 发布时间: 4年前

内容简介:We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviru

We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed

This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.

That genome_diff script looks like this:

#!/bin/bash
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
  | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > $1
}
fetch_genome $1
fetch_genome $2
wdiff -s -123 $1 $2

This script works by fetching the genome from the NCBI database . The strings “MG772933.1” and “MN988713.1” are accession numbers . The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:

$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with > , describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>' . Next, we don’t need those newline characters, so we strip them with tr -d -C 'ATGC' . Finally, because diff doesn’t work at the “character” level, we’ll instead use wdiff , but first separating the characters into separate words using sed 's/\(.\)/\1 /g' . This gives us genomes that look like A T A T T A G G ... .

Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123 , we get the actual base differences between the sequences:

A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...

A different way to see similarities is to use NCBI’s BLAST tool . Enter the accession number MN988713.1 , and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.

Get updates on Twitter

More by Jim

Tagged#programming,#bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer. Found an error? Edit this page.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

CSS实战精粹

CSS实战精粹

(美)克拉夫特 / 李方进 / 电子工业 / 2008-1 / 59.80元

《CSS实战精粹》囊括了CSS工作原理的所有细节,主要介绍了能够在项目中使用的具体CSS设计技巧,包括CSS布局、创建一般页面元素以及如何排版从而使内容兼具可读性和艺术性。另外,《CSS实战精粹》还讨论了CSS中一些常见的问题,给出了一些解决办法。一起来看看 《CSS实战精粹》 这本书的介绍吧!

JS 压缩/解压工具
JS 压缩/解压工具

在线压缩/解压 JS 代码

MD5 加密
MD5 加密

MD5 加密工具

RGB CMYK 转换工具
RGB CMYK 转换工具

RGB CMYK 互转工具