We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:
    $ ./genome_diff MG772933.1 MN988713.1
    MG772933.1: 29802 words  26618 89% common  861 3% deleted  2323 8% changed
    MN988713.1: 29874 words  26618 89% common  896 3% inserted  2360 8% changed
This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, in a total genome of only ~29800 bases.
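The percentages in that summary can be checked by hand. The snippet below is only a sketch: `pct` is a helper name made up here, and the numbers are the ones printed by wdiff above.

```shell
#!/bin/bash
# pct PART TOTAL: print PART as a percentage of TOTAL, to one decimal place.
# (Hypothetical helper, not part of the genome_diff script.)
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

pct 26618 29802   # common bases as a share of the bat CoV genome
pct 26618 29874   # common bases as a share of the human nCoV genome
```

Both values come out just above 89%, which wdiff rounds to the 89% shown in its summary.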
The genome_diff script looks like this:
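    #!/bin/bash
    # Usage: ./genome_diff ACCESSION_A ACCESSION_B

    # Download one genome as FASTA and save it as space-separated bases.
    fetch_genome() {
      curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
        | grep -v '^>' \
        | tr -d -C 'ATGC' \
        | sed 's/\(.\)/\1 /g' > "$1"
    }

    fetch_genome "$1"
    fetch_genome "$2"
    # -s prints statistics; -123 suppresses deleted, inserted and common words.
    wdiff -s -123 "$1" "$2"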
This script works by fetching each genome from the NCBI database. The strings “MG772933.1” and “MN988713.1” are accession numbers.
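As a rough sketch of how an accession number maps to a download, the helpers below build the same viewer URL the script uses and wrap curl with basic error checks. The names `fasta_url` and `fetch_fasta` are made up for illustration, and the error handling is an addition rather than something the original script does.

```shell
#!/bin/bash
# Build the NCBI viewer URL for a given accession number.
fasta_url() {
  printf 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=%s\n' "$1"
}

# Hypothetical helper: download one record, failing on HTTP errors or an
# empty response instead of silently producing an empty file.
fetch_fasta() {
  local out
  out=$(curl -sS -f "$(fasta_url "$1")") || return 1
  [ -n "$out" ] || { echo "empty response for $1" >&2; return 1; }
  printf '%s\n' "$out"
}

# Only print the URL here; nothing is downloaded until fetch_fasta is called.
fasta_url MN988713.1
```

With these defined, `fetch_fasta MN988713.1 > MN988713.1.fasta` would save the raw record, assuming the NCBI endpoint is reachable.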
The text at https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1 is 2019-nCoV’s RNA sequence in FASTA format, which looks like:
    $ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
    >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
    CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
    TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
    TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
    ...
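To get a feel for the format, here is a small sketch that prints the header and counts the bases of a FASTA record read from standard input. `fasta_summary` is a hypothetical helper, and the demo record is made up so the example runs without network access.

```shell
#!/bin/bash
# Print the header line and the number of A/T/G/C bases in a FASTA record.
fasta_summary() {
  awk '
    /^>/ { print "header: " substr($0, 2); next }        # metadata line
    { gsub(/[^ATGC]/, ""); bases += length($0) }         # sequence lines
    END { print "bases: " (bases + 0) }
  '
}

# Made-up two-line record, in the same shape as the NCBI output.
printf '>DEMO.1 made-up record\nATTAAAGG\nTTTATACC\n' | fasta_summary
```

Piping a saved NCBI record through the same function should report a base count that is comparable to the word count wdiff prints for that genome.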
The FASTA format needs a bit of “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need the newline characters, so we strip them with tr -d -C 'ATGC', which deletes every character that is not A, T, G or C. Then, because diff compares line by line rather than character by character, we’ll use wdiff instead, after first splitting the characters into separate words using sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ....
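The three massaging steps can be tried on a tiny input without touching the network. This is a sketch: `clean_fasta` is a name invented here for the same grep, tr and sed pipeline the script uses, and the input record is made up.

```shell
#!/bin/bash
# Turn a FASTA record on stdin into space-separated bases on stdout.
clean_fasta() {
  grep -v '^>' |            # drop the ">" metadata line
    tr -d -C 'ATGC' |       # delete everything except A, T, G, C (incl. newlines)
    sed 's/\(.\)/\1 /g'     # put a space after every base
}

printf '>DEMO.1 made-up record\nATTA\nAAGG\n' | clean_fasta
echo
```

The two sequence lines are joined into a single run of eight bases, each followed by a space, which is the word-per-base layout wdiff needs.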
Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity: -s prints the statistics, and -123 hides the deleted, inserted and common words. If we omit -s -123, we get the actual base differences between the sequences:
A T [-A T-] T A {+A A+} G G T T T [-T-] {+A+} T A C C ...
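For comparison, plain diff can also be made to work if each base sits on its own line. The sketch below is an alternative to the wdiff approach, not what the script above does; `one_per_line` is a hypothetical helper and the two short sequences are made up.

```shell
#!/bin/bash
# Emit each A/T/G/C character of stdin on its own line, so that
# line-oriented diff ends up comparing individual bases.
one_per_line() {
  grep -o '[ATGC]'
}

a=$(mktemp)
b=$(mktemp)
printf 'ATTAGGTTT\n' | one_per_line > "$a"
printf 'ATAAGGTTA\n' | one_per_line > "$b"

# diff exits with status 1 when the inputs differ, so tolerate that here.
diff "$a" "$b" || true
rm -f "$a" "$b"
```

The output is far more verbose than wdiff's inline markers, which is one reason the word-based approach is more convenient for genomes of roughly 30,000 bases.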
A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.
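To illustrate what a “percent identity” figure measures, here is a deliberately simplified sketch that compares two equal-length, already-aligned strings position by position. BLAST itself computes identity over the alignments it finds, so this is not its algorithm; `percent_identity` is a made-up helper and the inputs are toy sequences.

```shell
#!/bin/bash
# percent_identity A B: share of positions at which two equal-length
# strings hold the same character, as a percentage with one decimal.
percent_identity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    if (n == 0 || n != length(b)) {
      print "need two non-empty strings of equal length" > "/dev/stderr"
      exit 1
    }
    same = 0
    for (i = 1; i <= n; i++)
      if (substr(a, i, 1) == substr(b, i, 1)) same++
    printf "%.1f\n", 100 * same / n
  }'
}

percent_identity ATTAAAGG ATTACAGG   # 7 of 8 positions match
```

Real genomes differ in length and contain insertions and deletions, which is why tools like wdiff or BLAST align the sequences first rather than comparing fixed positions.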
Tagged #programming, #bioinformatics. All content copyright James Fisher 2020. This post is not associated with my employer.