Diffing coronaviruses

栏目: IT技术 · 发布时间: 4年前

内容简介:We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviru

We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:

$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed

This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.

That genome_diff script looks like this:

#!/bin/bash
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
  | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > $1
}
fetch_genome $1
fetch_genome $2
wdiff -s -123 $1 $2

This script works by fetching the genome from the NCBI database . The strings “MG772933.1” and “MN988713.1” are accession numbers . The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:

$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...

The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with > , describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>' . Next, we don’t need those newline characters, so we strip them with tr -d -C 'ATGC' . Finally, because diff doesn’t work at the “character” level, we’ll instead use wdiff , but first separating the characters into separate words using sed 's/\(.\)/\1 /g' . This gives us genomes that look like A T A T T A G G ... .

Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123 , we get the actual base differences between the sequences:

A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...

A different way to see similarities is to use NCBI’s BLAST tool . Enter the accession number MN988713.1 , and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.

Get updates on Twitter

More by Jim

Tagged#programming,#bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer. Found an error? Edit this page.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

CSS 3实战

CSS 3实战

成林 / 机械工业出版社 / 2011-5 / 69.00元

全书一共分为9章,首先从宏观上介绍了CSS 3技术的最新发展现状、新特性,以及现有的主流浏览器对这些新特性的支持情况;然后详细讲解了CSS 3的选择器、文本特性、颜色特性、弹性布局、边框和背景特性、盒模型、UI设计、多列布局、圆角和阴影、渐变、变形、转换、动画、投影、开放字体、设备类型、语音样式等重要的理论知识,这部分内容是本书的基础和核心。不仅每个知识点都配有丰富的、精心设计的实战案例,而且详细......一起来看看 《CSS 3实战》 这本书的介绍吧!

随机密码生成器
随机密码生成器

多种字符组合密码

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码

Markdown 在线编辑器
Markdown 在线编辑器

Markdown 在线编辑器