We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:
$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed
This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.
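The percentages can be checked by hand from the word counts in that output. This is a small sketch, not part of the original script; `percent_common` is a hypothetical helper name, and it assumes `awk` is available.

```shell
#!/bin/bash
# percent_common COMMON TOTAL: print COMMON as a percentage of TOTAL.
# (Hypothetical helper for illustration; not part of genome_diff.)
percent_common() {
  awk -v common="$1" -v total="$2" \
    'BEGIN { printf "%.1f%%\n", 100 * common / total }'
}

percent_common 26618 29802   # bat CoV, MG772933.1    -> 89.3%
percent_common 26618 29874   # human nCoV, MN988713.1 -> 89.1%
```

Both values round to the 89% that wdiff reports.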
That genome_diff script looks like this:
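#!/bin/bash
# Usage: ./genome_diff ACCESSION1 ACCESSION2
# Download a genome in FASTA format and save it as one base per word.
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
    | grep -v '^>' | tr -d -C 'ATGC' | sed 's/\(.\)/\1 /g' > "$1"
}
fetch_genome "$1"
fetch_genome "$2"
# -s prints statistics; -123 suppresses the deleted, inserted and common words.
wdiff -s -123 "$1" "$2"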
This script works by fetching the genome from the NCBI database. The strings “MG772933.1” and “MN988713.1” are accession numbers. The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:
$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...
The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need the newline characters, so we strip them with tr -d -C 'ATGC', which deletes every character that is not A, T, G or C. Lastly, because diff compares whole lines rather than individual characters, we use wdiff instead, after first splitting the sequence into one-base “words” with sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ...
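The three massaging steps can be tried offline on a tiny made-up FASTA record. This is only a sketch: `clean_fasta` is a hypothetical name for the same pipeline used in the script, and the sample header and bases are invented.

```shell
#!/bin/bash
# clean_fasta: read FASTA on stdin, write one base per word on stdout.
# (Hypothetical wrapper around the pipeline from genome_diff.)
clean_fasta() {
  grep -v '^>' |            # drop the ">" description line
    tr -d -C 'ATGC' |       # delete everything except A, T, G, C (newlines too)
    sed 's/\(.\)/\1 /g'     # put a space after every base
}

# Invented two-line sample sequence; prints: A T T A G C
printf '>demo made-up header\nATTA\nGC\n' | clean_fasta
```

One thing to keep in mind is that the tr step silently drops anything outside ATGC, so ambiguity codes such as N would disappear rather than show up as differences.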
Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123, we get the actual base differences between the sequences:
A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C
...
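If wdiff is not installed, a similar comparison might be done with plain diff by putting one base per line instead of one base per word. The sketch below is an alternative technique, not what the script above does; `base_diff` is a hypothetical helper, and it takes short sequences as strings purely for illustration.

```shell
#!/bin/bash
# base_diff SEQ1 SEQ2: diff two short sequences, one base per line.
# (Hypothetical helper; the function body is a subshell, so "dir" stays local.)
base_diff() (
  dir=$(mktemp -d)
  printf '%s\n' "$1" | fold -w 1 > "$dir/a"
  printf '%s\n' "$2" | fold -w 1 > "$dir/b"
  # diff exits with status 1 when the inputs differ, so ignore its status.
  diff "$dir/a" "$dir/b" || true
  rm -rf "$dir"
)

# Invented sequences that differ in their third base.
base_diff ATTAGG ATAAGG
```

Lines starting with < are bases only in the first sequence, and lines starting with > are bases only in the second. Unlike wdiff -s, this does not print summary statistics.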
A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.
Tagged #programming, #bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer.