We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:
$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed
This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.
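The percentages follow from the word counts in that output. As a quick sanity check, assuming only the counts printed above, shell integer arithmetic with rounding to the nearest whole percent reproduces them. The helper name `percent` is ours for illustration and is not part of wdiff:

```shell
#!/bin/bash
# percent PART TOTAL: print PART as a whole-number percentage of TOTAL,
# rounded to the nearest integer. Hypothetical helper, not a wdiff feature.
percent() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# Counts copied from the wdiff output line for MG772933.1 (the bat CoV).
total=29802
common=26618
deleted=861
changed=2323

echo "common:  $(percent "$common" "$total")%"
echo "deleted: $(percent "$deleted" "$total")%"
echo "changed: $(percent "$changed" "$total")%"
# The three categories together account for every base in the genome.
echo "sum:     $(( common + deleted + changed )) of $total"
```

The same check works for the MN988713.1 line, with "inserted" taking the place of "deleted".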
That genome_diff script looks like this:
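#!/bin/bash
# Usage: ./genome_diff ACCESSION_1 ACCESSION_2

# Download a genome as FASTA and save it, one base per word,
# in a file named after the accession number.
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
    | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > "$1"
}

fetch_genome "$1"
fetch_genome "$2"
# -s prints statistics; -1, -2 and -3 suppress the deleted, inserted and common words.
wdiff -s -123 "$1" "$2"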
This script works by fetching the genome from the NCBI database. The strings “MG772933.1” and “MN988713.1” are accession numbers. The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:
$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...
The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need the newline characters, so we strip them with tr -d -C 'ATGC', which deletes every character other than A, T, G and C. Then, because diff compares whole lines rather than individual characters, we use wdiff instead, first splitting the bases into separate words with sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ...
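To watch those three filters at work without touching the network, the same pipeline can be run on a tiny FASTA record. This is a minimal sketch: the accession `EXAMPLE.1`, its sequence, and the function names `sample_fasta` and `massage` are all invented for illustration.

```shell
#!/bin/bash
# A tiny, invented FASTA record: one header line, then wrapped sequence lines.
sample_fasta() {
  printf '%s\n' '>EXAMPLE.1 hypothetical sequence' 'ATTAAAGG' 'TTTATACC'
}

# The same three filters as fetch_genome, minus the download:
# drop the header, drop everything that is not a base, put a space after each base.
massage() {
  grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g'
}

sample_fasta | massage
echo   # the pipeline's output may lack a trailing newline, so add one
```

Because tr removes the newlines, sed sees the whole genome as a single line, and the result is one long line of space-separated bases.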
Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123, we get the actual base differences between the sequences:
A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...
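A similar base-level comparison is possible with plain diff if each base goes on its own line instead of in its own word. This is a different route from the wdiff one the script takes, sketched here on two short invented sequences. The function name `one_per_line` is ours:

```shell
#!/bin/bash
# Print a sequence one base per line, so line-oriented diff compares single bases.
one_per_line() {
  echo "$1" | fold -w 1
}

# Two short invented sequences, not real genome data.
tmp=$(mktemp -d)
one_per_line ATATTAGG > "$tmp/a"
one_per_line ATTAAAGG > "$tmp/b"

# Lines marked "<" appear only in a; lines marked ">" appear only in b.
# diff exits with status 1 when the files differ, hence the "|| true".
diff "$tmp/a" "$tmp/b" || true
rm -r "$tmp"
```

The output is more verbose than wdiff’s inline [-...-] and {+...+} markers, which is one reason to prefer wdiff for long sequences.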
A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.
Tagged #programming, #bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer.