March 18, 2020
The Corona pandemic is on everyone's mind. If your country has not been locked down yet, it will be soon. The world and human society are not going to be the same as before. Let's not worry too much about how the economy is going to be hit, but hope that we solve it health-wise.
Anyway, soon, most of us will have to spend most of our time inside, and when we solve basic needs, and make sure our loved ones are safe, we'll have some time to pass. Most people will re-watch their favorite TV shows and re-play their games. I'm sure that many will acquire various skills of America's Got Talent quality. Some programmers, though, will itch to try their skills on Covid-19 data.
Maybe you wanted to add some machine learning skills to your programming-fu, but the day job just made that impossible. Why not start now? If this thing is going to be on our minds for months, if not years, we might throw some programming magic at it.
By now, we've seen many shiny visualizations and analyses of the pandemic, published by experts and amateurs. Maybe you itch to throw it into a magic machine learning framework and get some super-x-ray insight from artificial intelligence.
I won't do that here. First, because I know nothing about epidemiology. Please do not take any conclusions you hear from non-experts for granted. They don't know what they're talking about. Second, because the data that we can publicly access is too scarce to be thrown to any machine learning beast. You might get some numbers out, but at best these numbers will tell you only what's obvious from the visualizations anyway, and at worst they will be complete garbage.
Once the data becomes more reliable, and abundant, we might be able to use it for some insight, provided that we learn some basics of epidemiology by then. Until then, I propose that we brush up our basic data skills through pure play.
So, there is no need to feel sorry that you hadn't learned some Big Machine Learning Framework yet. With basic programming skills, you can dissect the data that is currently available just fine, if not more easily! Just pure, plain Clojure, without specialized libraries!
Let's take some basic steps with the Covid-19 data published by Johns Hopkins University. I'll use a copy provided by Oscar Wahltinez in this Git repository.
Loading the data
The data is in a CSV file, so we require some useful namespaces for working with files.
(ns dragan.rocks.covid-19.world
  (:require [clojure.java.io :as io]
            [clojure.data.csv :as csv]))
CSV files are text files in which values are separated by commas and newlines. Typically, each line represents an observation, while the values of the different variables for that observation are separated by commas. The first task is to translate that into convenient data structures in memory.
We load the file as a resource, slurp its contents into a string, and parse it into a lazy sequence.
(csv/read-csv (slurp (io/resource "open-covid-19/output/world.csv")))
(["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"]
 ["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"]
 ["2019-12-31" "AF" "Afghanistan" "0" "0" "33.93911" "67.709953"]
 ["2019-12-31" "AM" "Armenia" "0" "0" "40.069099" "45.038189"]
 ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"]
 ["2019-12-31" "AU" "Australia" "0" "0" "-25.274398" "133.775136"]
 ["2019-12-31" "AZ" "Azerbaijan" "0" "0" "40.143105" "47.576927"]
 ["2019-12-31" "BE" "Belgium" "0" "0" "50.503887" "4.469936"]
 ["2019-12-31" "BH" "Bahrain" "0" "0" "25.930414" "50.637772"]
 ["2019-12-31" "BR" "Brazil" "0" "0" "-14.235004" "-51.92528"]
 ["2019-12-31" "BY" "Belarus" "0" "0" "53.709807" "27.953389"]
 ["2019-12-31" "CA" "Canada" "0" "0" "56.130366" "-106.346771"]
 ["2019-12-31" "CH" "Switzerland" "0" "0" "46.818188" "8.227512"]
 ["2019-12-31" "CN" "China" "27" "0" "35.86166" "104.195397"]
 ["2019-12-31" "CZ" "Czech Republic" "0" "0" "49.817492" "15.472962"]
 ["2019-12-31" "DE" "Germany" "0" "0" "51.165691" "10.451526"]
 ...)
As you can see, this is a sequence of vectors such as ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"], one vector per line of the file.
This is a sign that we successfully loaded the data. This is the data from the world.csv file, while there are a few more similar datasets: usa.csv, china.csv. We can create a convenience function for loading these files.
(defn read-open-covid [csv-name]
  (csv/read-csv
   (slurp (io/resource (format "open-covid-19/output/%s.csv" csv-name)))))
And now use it to load the world data and stash it into a global variable (normally a bad, bad, programming practice, but acceptable if we are only playing in the REPL, notebook-style). BTW, I run this code in emacs+CIDER, and automatically generate this post from org-mode. If you copy and paste the code, it should run in any Clojure REPL setup.
(def covid-world (read-open-covid "world"))
Feeling the basic structure
Now, the most basic question I can ask is "What variables does this data set have?". CSV files typically list them in the first line, which we access as the first element in our sequence.
(first covid-world)
["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"]
To see what example data looks like, let's take the second row.
(second covid-world)
["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"]
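If positional access gets confusing, the header can be paired with a row to get a map keyed by column name. This is only a small sketch on the two rows shown above; the names header, row, and observation are mine, not part of the original code.

```clojure
;; The header and one row, copied from the output above.
(def header ["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"])
(def row ["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"])

;; Pair each column name with the value in the same position.
(def observation (zipmap header row))

(get observation "CountryName")
;; => "United Arab Emirates"
```

The rest of the post sticks to plain vectors and positions, which is perfectly fine for data this small.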
So, the date is in the YYYY-MM-DD format, which could be convenient for sorting. There is hope that Clojure can handle the comparison and sorting of these strings as-is, without conversion to proper date objects (spoiler: it does). Next is the country code of the observation, which is obviously a useful identifier. CountryName is redundant, but can be a fine time saver for all of us who do not remember all country codes. Next are the official number of confirmed cases of infection by Covid-19 and the official death toll. Latitude and longitude refer to the position of the country, and are included because this data set is used as a source for the visualization of the pandemic on the interactive world map that you can access here.
Unsurprisingly, on New Year's Eve there were no (discovered!) cases of infection in the UAE.
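The claim that YYYY-MM-DD strings sort chronologically without any date parsing is easy to check in the REPL. A minimal sketch, with a handful of dates picked just for illustration:

```clojure
;; YYYY-MM-DD strings are zero-padded and ordered from the most significant
;; field to the least, so plain string comparison is also chronological.
(def dates ["2020-03-18" "2019-12-31" "2020-01-05" "2020-03-03"])

(sort dates)
;; => ("2019-12-31" "2020-01-05" "2020-03-03" "2020-03-18")

;; compare returns a negative number when the first string sorts earlier.
(neg? (compare "2019-12-31" "2020-01-01"))
;; => true
```

This only works because every date has the same width; a format such as 3/18/2020 would need real parsing.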
How many observations do we have
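The answer to this question is so easy to get that I'm out of inspiration for this paragraph. The code from here on refers to world-data, whose definition is not shown; presumably it is the loaded data without the header row, which is not an observation.
(def world-data (rest covid-world))
(count world-data)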
We have a little more than 5000 observations.
How many countries do we have this data for? To answer this question, we should access the country code of each observation and then see how many distinct codes we have. It may require more fiddling in some other programming languages, but in Clojure it's the bee's knees.
(count (distinct (map second world-data)))
So, is our data complete?
(rem (count world-data) (count (distinct (map second world-data))))
Apparently not, since there is a remainder in this division. Some dates are certainly missing for some countries.
This means that we can't blindly treat the data for all countries uniformly; whatever analysis we plan to do, we will have to account for the gaps.
How many observations are missing
First, let's see how many distinct dates there are. Today is the 18th March 2020, and I can count that by hand on the calendar, but the point here is to do that using code.
(count (distinct (map first world-data)))
Since there are 143 countries and 79 dates, ideally there would be this many observations:
(* (count (distinct (map first world-data))) (count (distinct (map second world-data))))
That would be 143 × 79 = 11,297 observations, which means that we are missing roughly half the data.
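Counting the gap is one thing; seeing which countries miss which dates is another. Here is a rough sketch of how that could be done, run on a hypothetical toy sample (the codes "AA", "BB", "CC" and the function missing-dates are made up for illustration) rather than on the real file.

```clojure
(require '[clojure.set :as set])

;; Hypothetical toy rows in the same [date country-code ...] shape as the real data.
(def sample
  [["2020-03-01" "AA"] ["2020-03-02" "AA"] ["2020-03-03" "AA"]
   ["2020-03-01" "BB"] ["2020-03-03" "BB"]
   ["2020-03-03" "CC"]])

(defn missing-dates
  "Maps each country code to the sorted dates for which it has no row.
  Countries with a row for every date are left out."
  [data]
  (let [all-dates (set (map first data))]
    (into (sorted-map)
          (for [[code rows] (group-by second data)
                :let [missing (set/difference all-dates (set (map first rows)))]
                :when (seq missing)]
            [code (sort missing)]))))

(missing-dates sample)
;; => {"BB" ("2020-03-02"), "CC" ("2020-03-01" "2020-03-02")}

;; Expected number of rows minus the actual number of rows.
(- (* (count (distinct (map first sample)))
      (count (distinct (map second sample))))
   (count sample))
;; => 3
```

The same function should work on world-data as it is, since it only looks at the first two columns.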
But that's not all. How many observations of the Confirmed variable are 0?
(* (count (filter zero? (map #(nth % 3) world-data))))
Execution error (ClassCastException) at dragan.rocks.covid-19.world/eval15849 (form-init8558230572914833950.clj:1). java.lang.String cannot be cast to java.lang.Number
We get the exception because "0" and "4" are not numbers, but strings of characters.
Let's convert these columns to proper types:
(def world-data2
  (map (fn [[d cc cn conf death]]
         [d cc cn (Long/parseLong conf) (Long/parseLong death)])
       world-data))
#'dragan.rocks.covid-19.world/world-data2
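Long/parseLong throws on anything that is not a whole number, such as a header cell or an empty field, and because map is lazy the exception only shows up when the sequence is consumed. A more defensive variant could look like the sketch below; parse-count is a hypothetical helper of mine, not something the original code uses.

```clojure
(defn parse-count
  "Parses a string such as \"27\" into a long.
  Returns nil when the text is not a whole number."
  [s]
  (try
    (Long/parseLong s)
    (catch NumberFormatException _ nil)))

(parse-count "27")        ;; => 27
(parse-count "Confirmed") ;; => nil
(parse-count "")          ;; => nil

;; keep drops the nils, so only the cells that parsed survive.
(keep parse-count ["0" "4" "" "Confirmed"])
;; => (0 4)
```

Whether a silent nil is better than a loud exception depends on the task; for exploration in the REPL, the loud version used above is arguably the safer default.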
(* (count (filter zero? (map #(nth % 3) world-data2))))
In roughly half of the observations, there were no confirmed cases. But not even all zeros are equal. Some zeros are here because the pandemic hadn't reached a country by the particular date. Some other zeros might be there because no new cases were discovered in a country that has previous cases. But even that does not mean there are no new cases. In my country, Serbia, on some dates no tests were done (or, perhaps, were done but haven't been published, who knows).
The point is that this data is so early, that it is very scattered and very rough.
Anyway, let's see how much data is recorded at all for each day (0 or otherwise).
(def date-freqs (sort-by first (frequencies (map first world-data2))))
(["2019-12-31" 66] ["2020-01-01" 66] ["2020-01-02" 66] ["2020-01-03" 66] ["2020-01-04" 66] ["2020-01-05" 66] ["2020-01-06" 66] ["2020-01-07" 66] ["2020-01-08" 66] ["2020-01-09" 66] ["2020-01-10" 66] ["2020-01-11" 66] ["2020-01-12" 66] ["2020-01-13" 66] ["2020-01-14" 66] ["2020-01-15" 66] ...)
At the beginning, most data is available for (probably) the same 66 countries.
Let's discover (by code) what's the first date with a different number of observations.
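One way to do that, as a sketch: skip the leading pairs whose count equals the very first count and take whatever comes next. The pairs below are hypothetical (only the count of 66 comes from the output above), and first-change is a name I made up for this example.

```clojure
;; Hypothetical [date count] pairs in the shape produced by date-freqs;
;; the counts other than 66 are invented for illustration.
(def sample-freqs
  [["2020-03-01" 66] ["2020-03-02" 66] ["2020-03-03" 66]
   ["2020-03-04" 60] ["2020-03-05" 58]])

(defn first-change
  "Returns the first [date count] pair whose count differs from the
  first pair's count, or nil when the count never changes."
  [freqs]
  (let [baseline (second (first freqs))]
    (first (drop-while #(= baseline (second %)) freqs))))

(first-change sample-freqs)
;; => ["2020-03-04" 60]
```

Calling (first-change date-freqs) on the real sequence should point at the date where the run of 66 ends.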
It seems that this 66 runs right until two weeks ago. And then?
All dates after the 3rd of March first see fewer observations, and then, starting with March 11th, the number of observations suddenly jumps. My hunch is that at first, most countries just submitted the default 0 to whoever collected this data (the World Health Organization, I suppose?), simply ignoring the problem. Then, as they started to realize the immediate danger, they were reluctant to send invented data (or the WHO stopped collecting the default zeros?), and then, on the 15th of March, the data becomes more complete. My hunch is that the global pandemic was officially announced sometime before that. Since this was in the past, I can simply check on the Internet (…typing away in the browser…): the pandemic was announced on March 11th 2020.
How much data do we have for each particular country
Analogously to the frequencies of observations on a particular date, we can count the frequencies related to countries; instead of the first column, we will use the second.
(def country-freqs (sort-by first (frequencies (map second world-data2))))
(["AD" 4] ["AE" 72] ["AF" 68] ["AG" 1] ["AL" 9] ["AM" 69] ["AR" 11] ["AT" 78] ["AU" 78] ["AZ" 71] ["BA" 5] ["BD" 3] ["BE" 78] ["BF" 5] ["BG" 8] ["BH" 77] ...)
Selecting your country
The human eye quickly gets lost in this bunch of numbers. Let's create a function that selects only the data available for the country, or a set of countries, that we are interested in.
For example, for this set of countries: #{"IT" "FR" "ES" "CN"}
(filter (fn [[_ code]] (#{"IT" "FR" "ES" "CN"} code)) world-data2)
(["2019-12-31" "CN" "China" 27 0]
 ["2019-12-31" "ES" "Spain" 0 0]
 ["2019-12-31" "FR" "France" 0 0]
 ["2019-12-31" "IT" "Italy" 0 0]
 ["2020-01-01" "CN" "China" 27 0]
 ["2020-01-01" "ES" "Spain" 0 0]
 ["2020-01-01" "FR" "France" 0 0]
 ["2020-01-01" "IT" "Italy" 0 0]
 ["2020-01-02" "CN" "China" 27 0]
 ["2020-01-02" "ES" "Spain" 0 0]
 ["2020-01-02" "FR" "France" 0 0]
 ["2020-01-02" "IT" "Italy" 0 0]
 ["2020-01-03" "CN" "China" 44 0]
 ["2020-01-03" "ES" "Spain" 0 0]
 ["2020-01-03" "FR" "France" 0 0]
 ["2020-01-03" "IT" "Italy" 0 0]
 ...)
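The filter above works because a Clojure set can be called as a function: it returns the element when it is a member, and nil otherwise. A tiny sketch with hypothetical rows (the numbers are invented, only the shape matches world-data2):

```clojure
(def wanted #{"IT" "FR" "ES" "CN"})

(wanted "IT") ;; => "IT"  (truthy, so filter keeps the row)
(wanted "RS") ;; => nil   (falsey, so filter drops the row)

;; Hypothetical rows in the same shape as world-data2; the counts are made up.
(def rows [["2020-03-17" "IT" "Italy" 10 1]
           ["2020-03-17" "RS" "Serbia" 2 0]
           ["2020-03-17" "CN" "China" 30 3]])

(map second (filter (fn [[_ code]] (wanted code)) rows))
;; => ("IT" "CN")
```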
We'll write some convenient functions for computing the previously discussed values.
(defn take-countries [data country-set] (filter (fn [[_ code]] (country-set code)) data))
(defn date-freqs [data] (sort-by first (frequencies (map first data))))
(defn country-freqs [data] (sort-by first (frequencies (map second data))))
(def my-countries (country-freqs (take-countries world-data #{"IT" "FR" "ES" "CN" "US" "RS" "DE"})))
Now we can see that most of these countries have pretty complete (if not overly reliable) data, while Serbia only recently started doing tests and reporting some numbers.
(2019-12-31 CN China 27 0 35.86166 104.195397)
(2019-12-31 DE Germany 0 0 51.165691 10.451526)
(2019-12-31 ES Spain 0 0 40.463667 -3.74922)
(2019-12-31 FR France 0 0 46.227638 2.213749)
(2019-12-31 IT Italy 0 0 41.87194 12.56738)
(2019-12-31 US United States of America 0 0 37.09024 -95.712891)
(2020-01-01 CN China 27 0 35.86166 104.195397)
(2020-01-01 DE Germany 0 0 51.165691 10.451526)
(2020-01-01 ES Spain 0 0 40.463667 -3.74922)
(2020-01-01 FR France 0 0 46.227638 2.213749)
(2020-01-01 IT Italy 0 0 41.87194 12.56738)
(2020-01-01 US United States of America 0 0 37.09024 -95.712891)
(2020-01-02 CN China 27 0 35.86166 104.195397)
(2020-01-02 DE Germany 0 0 51.165691 10.451526)
(2020-01-02 ES Spain 0 0 40.463667 -3.74922)
(2020-01-02 FR France 0 0 46.227638 2.213749)
…
Draw some plots
Instead of flashy plotting libraries, I'll draw some ASCII art. The first reason is that the data is so obvious, although coarse, that I don't want to give the false impression that you'll learn anything you haven't already seen in the news and on the Internet.
The second is: we are programmers, we present data in any silly way that we please!
I selected a pretty basic Java ASCII plotting library after a quick search on GitHub. Great thanks to Mitch Talmadge for ASCII-Data :)
(import 'com.mitchtalmadge.asciidata.graph.ASCIIGraph)
First I'll just take the number of confirmed cases from Serbia, and remove whatever zeros there are before the first case (we are not interested in plotting a flat line).
(drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"})))
(1 5 18 24 41 46 55 57)
(def rs-data (drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"}))))
Let's plot this.
(println (.plot (ASCIIGraph/fromSeries (double-array rs-data))))
nil
57.00 ┤ ╭ 56.00 ┤ │ 55.00 ┤ ╭╯ 54.00 ┤ │ 53.00 ┤ │ 52.00 ┤ │ 51.00 ┤ │ 50.00 ┤ │ 49.00 ┤ │ 48.00 ┤ │ 47.00 ┤ │ 46.00 ┤ ╭╯ 45.00 ┤ │ 44.00 ┤ │ 43.00 ┤ │ 42.00 ┤ │ 41.00 ┤ ╭╯ 40.00 ┤ │ 39.00 ┤ │ 38.00 ┤ │ 37.00 ┤ │ 36.00 ┤ │ 35.00 ┤ │ 34.00 ┤ │ 33.00 ┤ │ 32.00 ┤ │ 31.00 ┤ │ 30.00 ┤ │ 29.00 ┤ │ 28.00 ┤ │ 27.00 ┤ │ 26.00 ┤ │ 25.00 ┤ │ 24.00 ┤ ╭╯ 23.00 ┤ │ 22.00 ┤ │ 21.00 ┤ │ 20.00 ┤ │ 19.00 ┤ │ 18.00 ┤ ╭╯ 17.00 ┤ │ 16.00 ┤ │ 15.00 ┤ │ 14.00 ┤ │ 13.00 ┤ │ 12.00 ┤ │ 11.00 ┤ │ 10.00 ┤ │ 9.00 ┤ │ 8.00 ┤ │ 7.00 ┤ │ 6.00 ┤ │ 5.00 ┤╭╯ 4.00 ┤│ 3.00 ┤│ 2.00 ┤│ 1.00 ┼╯
Whoaaa. Although the numbers looked pretty tame, the graph shoots up into the sky. This is because the growth is exponential.
Since the exponential function grows really fast, the lower numbers quickly become minuscule. However, we are not interested in absolute numbers, but in growth. Therefore, it is more appropriate to take the logarithm of this function, and see whether the logarithm starts to drop off, if only for a tiny bit.
We need a convenience log function. I could have imported one from Neanderthal, but a fast CPU and GPU library is clearly overkill for such a task. Hopefully there will soon be an abundance of data, and we'll be able to put these nuclear options to use. For now, let's use sticks and stones.
(defn log ^double [^double x] (Math/log x))
#'dragan.rocks.covid-19.world/log
Now, graph the logarithm of the function of interest.
(println (.plot (ASCIIGraph/fromSeries (double-array (map log rs-data)))))
nil
It grows quite fast, and it is only at the beginning.
4.04 ┤   ╭───
3.03 ┤ ╭─╯
2.02 ┤╭╯
1.01 ┤│
0.00 ┼╯
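Why the logarithm helps can be seen without any plot: if the cases multiply by a constant factor each day, the step between consecutive logarithms is constant too, so pure exponential growth becomes a straight line. A small sketch follows; growth-factors, log-steps, and serbia-confirmed are names I introduce only for this example.

```clojure
;; Confirmed cases for Serbia, copied from the output above.
(def serbia-confirmed [1 5 18 24 41 46 55 57])

(defn growth-factors
  "Ratio of each value to the previous one, as doubles."
  [xs]
  (map (fn [prev cur] (/ (double cur) prev)) xs (rest xs)))

(defn log-steps
  "Difference between consecutive natural logarithms."
  [xs]
  (let [logs (map #(Math/log (double %)) xs)]
    (map - (rest logs) logs)))

;; Under perfect doubling every factor is 2.0 and every log step is (log 2).
(growth-factors [1 2 4 8])
;; => (2.0 2.0 2.0)

;; For the real numbers the factors shrink, which is what the flattening
;; log plot shows.
(growth-factors serbia-confirmed)
(log-steps serbia-confirmed)
```

With only eight points this says very little about the epidemic itself; it just shows what the log plot is measuring.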
Italy is overwhelmed
Now, let's see how Italy is holding up. For a few weeks we've been hearing really bad news.
(defn extract-data [country-code]
  (drop-while zero?
              (map #(nth % 3)
                   (take-countries world-data2 #{country-code}))))
(reverse (map log (extract-data "IT")))
(10.357933282865915 10.239245248219472 10.12414802355653 9.959726098983317 9.77905747415795 9.62331057957012 9.430439293104167 9.23824732522919 9.123910643977796 8.905851181208021 8.679822114864455 8.441607204459642 8.27563105457801 8.035602692918582 7.824845691026856 7.618742377670413 …)
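The log-plot helper used from here on is not defined anywhere in the text; presumably it just wraps the plotting expression used for Serbia above:
(defn log-plot [data] (.plot (ASCIIGraph/fromSeries (double-array (map log data)))))
(println (log-plot (extract-data "IT")))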
It hasn't started to slow down yet, although it looks like it is about to.
10.36 ┤                                           ╭───
 9.39 ┤                                      ╭────╯
 8.42 ┤                                 ╭────╯
 7.46 ┤                             ╭───╯
 6.49 ┤                          ╭──╯
 5.53 ┤                       ╭──╯
 4.56 ┤                      ╭╯
 3.59 ┤                      │
 2.63 ┤                     ╭╯
 1.66 ┤                     │
 0.69 ┼─────────────────────╯
China is slowing down
And China has already won this battle and, I hope, the war itself.
(reverse (map log (extract-data "CN")))
(11.303808085389111 11.302451316756681 11.302142703354239 11.301871044753339 11.301636371103024 11.301364574899084 11.301067985672987 11.300709489621875 11.300462176064121 11.29990549682687 11.29933612646083 11.29808484869358 11.295975195631474 11.294520668003193 11.293039103249892 11.291455512408028 …)
See how slowly the numbers are rising on the log scale. The absolute numbers are still bad, but each day they are less bad.
(println (log-plot (extract-data "CN")))
nil
11.30 ┤                                           ╭─────────────────────────────────
10.30 ┤                                  ╭────────╯
 9.30 ┤                             ╭────╯
 8.30 ┤                          ╭──╯
 7.30 ┤                        ╭─╯
 6.30 ┤                    ╭───╯
 5.30 ┤                  ╭─╯
 4.30 ┤    ╭─────────────╯
 3.30 ┼────╯
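The difference between the two countries can also be put into one number. Since the values are natural logarithms, the difference between two consecutive days, pushed back through exp, gives the day-over-day growth. A minimal sketch using the two most recent values from each listing above; daily-growth-percent is a hypothetical helper name.

```clojure
;; The two most recent log values for each country, copied from the outputs
;; above (newest first, because those listings were reversed).
(def it-logs [10.357933282865915 10.239245248219472])
(def cn-logs [11.303808085389111 11.302451316756681])

(defn daily-growth-percent
  "Day-over-day growth in percent, given [newest-log previous-log]."
  [[newest previous]]
  (* 100.0 (- (Math/exp (- newest previous)) 1.0)))

(daily-growth-percent it-logs) ;; roughly 12.6
(daily-growth-percent cn-logs) ;; roughly 0.14
```

Treat these as a reading of the published numbers only, not as an epidemiological estimate.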
Programmers, learn Machine Learning!
I hope this was easy and interesting, and it occupied your attention away from the news for at least some time.
Simple tools really make you think about the problem, so flashy new tools are not necessary when you're just starting.
Although Machine Learning may look like a high mountain to climb, I hope this post proved to you that you made that first step long ago, with your first steps in programming!