Diffing coronaviruses

栏目: IT技术 · 发布时间: 4年前

内容简介:We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviru

We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed

This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.

That genome_diff script looks like this:

#!/bin/bash
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
  | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > $1
}
fetch_genome $1
fetch_genome $2
wdiff -s -123 $1 $2

This script works by fetching the genome from the NCBI database . The strings “MG772933.1” and “MN988713.1” are accession numbers . The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:

$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with > , describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>' . Next, we don’t need those newline characters, so we strip them with tr -d -C 'ATGC' . Finally, because diff doesn’t work at the “character” level, we’ll instead use wdiff , but first separating the characters into separate words using sed 's/\(.\)/\1 /g' . This gives us genomes that look like A T A T T A G G ... .

Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123 , we get the actual base differences between the sequences:

A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...

A different way to see similarities is to use NCBI’s BLAST tool . Enter the accession number MN988713.1 , and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.

Get updates on Twitter

More by Jim

Tagged#programming,#bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer. Found an error? Edit this page.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

HTML5经典实例

HTML5经典实例

Christopher Schmitt、Kyle Simpson / 李强 / 中国电力出版社 / 2013-7 / 48.00元

《HTML5经典实例》对于从中级到高级的Web和移动Web开发者来说是绝佳之选,它帮助你选择对你有用的HTML5功能,并且帮助你体验其他的功能。个技巧的信息十分丰富,都包含了示例代码,并详细讨论了解决方案为何有效以及如何工作。一起来看看 《HTML5经典实例》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

SHA 加密
SHA 加密

SHA 加密工具

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换