Diffing coronaviruses

Category: IT Technology · Published: 5 years ago


We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed

This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, wdiff finds a common subsequence of 26618 bases (not necessarily contiguous) shared by both, out of a total genome of only ~29800 bases.
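To check where those percentages come from, here is a minimal sketch that recomputes them from the raw counts in the output above. The percent helper is a name made up for this illustration; it is not part of wdiff.

```shell
#!/bin/bash
# Sketch: recompute the rounded percentages from the counts that wdiff -s printed.

# percent PART TOTAL -> PART as a whole-number percentage of TOTAL
# (hypothetical helper, not part of wdiff)
percent() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.0f\n", 100 * part / total }'
}

percent 26618 29802   # common bases as a share of MG772933.1
percent 861 29802     # deleted
percent 2323 29802    # changed
```

Note that the three counts for each genome add up to its total word count: 26618 + 861 + 2323 = 29802 for MG772933.1, and 26618 + 896 + 2360 = 29874 for MN988713.1.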

That genome_diff script looks like this:

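#!/bin/bash
# Usage: ./genome_diff ACCESSION1 ACCESSION2

# Download a genome as FASTA, drop the header line, keep only the bases,
# and write them space-separated to a file named after the accession number.
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
  | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > "$1"
}
fetch_genome "$1"
fetch_genome "$2"
# -s prints statistics; -1, -2 and -3 suppress deleted, inserted and common words.
wdiff -s -123 "$1" "$2"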

This script works by fetching each genome from the NCBI database. The strings “MG772933.1” and “MN988713.1” are accession numbers. The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:

$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need the newline characters, so we strip them with tr -d -C 'ATGC', which deletes every character other than A, T, G and C. Finally, because diff compares whole lines rather than individual characters, we’ll use wdiff instead, which compares word by word. To make each base its own word, we first separate the characters with sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ... .
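To see the massaging without hitting the network, here is a small sketch that runs the same three steps on an inline FASTA sample. The fasta_to_words function name and the sample header are made up for this illustration; the bases are the first 16 from the MN988713.1 listing above.

```shell
#!/bin/bash
# Sketch: the same clean-up steps, applied to FASTA text read from stdin.

# fasta_to_words (hypothetical name): FASTA in, space-separated bases out.
fasta_to_words() {
  # 1. drop the ">" header line
  # 2. delete everything that is not a base (tr -cd is equivalent to tr -d -C)
  # 3. put a space after every base so each one becomes a "word"
  grep -v '^>' | tr -cd 'ATGC' | sed 's/\(.\)/\1 /g'
}

# Made-up header; bases are the start of the listing above, split over two lines.
printf '>sample header\nATTAAAGG\nTTTATACC\n' | fasta_to_words
echo   # the pipeline strips newlines, so end the line ourselves
```

The result is one long line of single-letter words, which is the shape wdiff needs.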

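Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123, we get the actual base differences between the sequences, with [-...-] marking bases found only in the first genome and {+...+} marking bases found only in the second: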

A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...
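If wdiff isn’t installed, a rough alternative is to put one base on each line and let plain diff do the comparison. This is a swapped-in technique, not what the genome_diff script does, and the common_bases function is a name made up for this sketch. It assumes each input file contains only the bases of one sequence, and that diff finds a near-minimal set of changes, which it may not do on very large inputs.

```shell
#!/bin/bash
# Sketch: count shared bases with plain diff by putting one base per line.

# common_bases FILE1 FILE2 -> number of bases in FILE1 that diff leaves unchanged
# (hypothetical helper; an alternative to wdiff, not the post's method)
common_bases() {
  local a b total removed
  a=$(mktemp)
  b=$(mktemp)
  # grep -o prints every matching base on its own line.
  grep -o '[ATGC]' "$1" > "$a"
  grep -o '[ATGC]' "$2" > "$b"
  total=$(wc -l < "$a")
  # In diff's default output, lines starting with "<" exist only in the first file.
  removed=$(diff "$a" "$b" | grep -c '^<')
  rm -f "$a" "$b"
  echo $((total - removed))
}

# Toy sequences for illustration, not real genomes.
tmp=$(mktemp -d)
printf 'ACGT\n' > "$tmp/one"
printf 'AGT\n' > "$tmp/two"
common_bases "$tmp/one" "$tmp/two"   # prints 3: A, G and T are shared
rm -r "$tmp"
```

Dividing that count by the length of either sequence gives the same kind of “percent common” figure that wdiff -s reports.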

A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.

Tagged #programming, #bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer.
