Data Science
Types of Data You Need to Know as a Data Scientist
Understand Data Science and Data Types
Apr 17 ·5min read
Table of contents
- What’s Data Science?
- Is Data Science a New Field?
- Types of Data
- Data Formats/Sources
- Summary and Conclusion
What’s Data Science?
Data science is the study of large quantities of data. Just as the biological sciences study living things and the physical sciences study physical phenomena, data science uses data to understand different things, and ultimately to understand the world.
Data science is a field concerned with methods for extracting knowledge from data in its various forms, whether structured or unstructured. It’s a multidisciplinary field that brings together concepts from computer science, mathematics/statistics, and data analysis.
The heart of data science is to always ask questions. A data scientist needs to stay curious about the world:
- What can we learn from this data?
- What actions can we take once we find whatever it is we are looking for?
“In the next 10 years, data science and software will do more for medicine than all of the biological sciences together”. Vinod Khosla
Is Data Science a New Field?
In the years following World War II, deaths from lung cancer rose sharply. Scientists and doctors did not agree on a specific cause, and hardly anyone considered the hypothesis that cigarettes might be responsible, because at that time almost everyone smoked. In Britain, details of all doctors were recorded in a central register (including: smoker?; alive or dead?; and, if dead, the cause of death), so a large dataset already existed, collected for other purposes, that could be used to study the phenomenon. Bradford Hill and Richard Doll analyzed this data, picking out the doctors who had died from lung cancer and checking whether they had been smokers. The result was clear: 100% of the doctors who died from lung cancer during the 29 months studied were smokers. So data science isn’t new; what is new is the vast quantity of data available from massively varied sources.
There are many paths to a career in data science; most, but not all, involve a little math, a little science, and a lot of curiosity about data.
Types of Data
Big Data: No precise or universal definition can be given to Big Data. It generally refers to massive datasets, or data that contains greater variety, arriving in increasing volumes and with ever-higher velocity (the 3V rule):
- Data Volume: Huge data size, from terabytes to petabytes.
- Data Velocity: High speed of data flow, data changes, and data processing.
- Data Variety: Various data sources (social media, mobile, structured data, unstructured data…).
“The world is one big data problem”. Andrew McAfee
Structured Data: Data with a predefined structure, stored in an ordered manner in relational databases or spreadsheets (traditional row-column tables). There are two sources of structured data:
- Machine-generated: Data received from sensors, web logs, and financial systems, including medical devices, GPS data, and usage statistics captured by servers.
- Human-generated: Data that humans enter into computers, such as names and other personal details, websites visited, and types of movies watched. Companies can use it to understand customer behavior and make appropriate business decisions, for example through the recommendation systems of Netflix, YouTube, or Medium.
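To make "predefined structure" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and viewing records are invented for illustration and are not taken from any real service.

```python
import sqlite3

# Hypothetical viewing records: every row has the same three columns.
rows = [
    ("alice", "Inception", 148),
    ("bob", "Coco", 105),
    ("alice", "Coco", 105),
]

# An in-memory relational table with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (viewer TEXT, movie TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)

# Because the structure is known in advance, a query can address columns by name.
minutes_per_viewer = dict(
    conn.execute(
        "SELECT viewer, SUM(minutes) FROM views GROUP BY viewer ORDER BY viewer"
    )
)
conn.close()

print(minutes_per_viewer)
```

The point of the sketch is only that structured data can be queried directly by column, which is what makes aggregations like watch time per viewer straightforward.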
Unstructured Data: Data with no predefined structure. It comes in any size or form, has no clear storage format, and cannot easily be stored in tables. It is also classified based on its source:
- Machine-generated: Includes satellite images, security camera footage, radar data, and much more.
- Human-generated: Found in abundance across the internet, since it includes social media data, mobile data, email, and website content. This covers the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send.
Semi-Structured Data: Information that is not in the traditional database format of structured data, but that contains some organizational properties which make it easier to process. The line between unstructured and semi-structured data has always been unclear, since most semi-structured data appears unstructured at a glance. For example, NoSQL documents are considered semi-structured, since they contain keywords (such as field names) that can be used to process the document easily.
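As a rough illustration of semi-structured data, the sketch below parses a few JSON documents of the kind a document store might hold. The records and field names are made up for this example. Each document labels its own fields, but the documents do not all share the same fields.

```python
import json

# Hypothetical documents: each one names its fields, but the fields differ.
raw = """
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Linus", "tags": ["kernel", "git"]},
  {"id": 3, "name": "Grace", "address": {"city": "Arlington"}}
]
"""
documents = json.loads(raw)

# The field names give us something to process, even without a fixed schema.
all_keys = sorted({key for doc in documents for key in doc})

# Missing fields must be handled explicitly; .get() returns None when absent.
emails = [doc.get("email") for doc in documents]

print(all_keys)
print(emails)
```

Unlike a table, nothing guarantees that a given field exists in every document, so code that reads semi-structured data usually has to allow for gaps.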
Other data types
Quantitative Data: Numerical data. For example: height, weight, salary, prices …
Categorical Data: Data that can be labeled or divided into groups. For example: sex, hair color, race, fruits, animals …
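The practical difference between the two shows up in which operations make sense. The sketch below uses a small invented sample: an average is meaningful for a quantitative column, while a categorical column is usually summarized by counting how often each label occurs.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: one quantitative field and one categorical field.
people = [
    {"height_cm": 170, "hair_color": "brown"},
    {"height_cm": 182, "hair_color": "black"},
    {"height_cm": 164, "hair_color": "brown"},
]

# Quantitative data supports arithmetic such as an average.
average_height = mean(p["height_cm"] for p in people)

# Categorical data is summarized by counting group membership.
hair_counts = Counter(p["hair_color"] for p in people)

print(average_height)
print(hair_counts.most_common())
```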
Data Formats/Sources
Most Common Data Formats:
- CSV - Comma-separated values: as the name says, a delimited text file that uses commas to separate values.
- XML - Extensible Markup Language
- SQL - Structured Query Language (strictly a query language for relational databases rather than a file format)
- JSON - JavaScript Object Notation
- Protocol Buffers
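To show how two of these formats relate, here is a small sketch that reads made-up CSV text with Python's standard csv module and re-encodes the same records as JSON. The column names and values are assumptions chosen for the example.

```python
import csv
import io
import json

# Hypothetical CSV content: a header row followed by two records.
csv_text = "name,height_cm\nAda,170\nLinus,182\n"

# DictReader maps each row to the column names from the header.
records = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores everything as text, so numeric columns are converted by hand.
for record in records:
    record["height_cm"] = int(record["height_cm"])

# JSON keeps the distinction between numbers and strings.
json_text = json.dumps(records)
round_trip = json.loads(json_text)

print(json_text)
```

One design point worth noting: CSV carries no type information, so the conversion step is the reader's responsibility, whereas JSON preserves basic types across a round trip.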
Data Sources: Companies, APIs, Government, Academic, Web Scraping/Crawling …
Summary
- Data science is the study of large quantities of data, which can reveal insights that help organizations make strategic choices.
- Data science isn’t new; what is new is the vast quantity of data available from massively varied sources: web logs, social media, sales data, GPS data, usage statistics captured by servers, email, patient information files, sports performance data, sensor data, security cameras, and many more besides.
- The term Big Data describes collections of very large volumes of data, whether structured, semi-structured, or unstructured, that can be processed and exploited in order to obtain intelligible and relevant information. The problem of Big Data is summarized by the 3V rule: Volume, Velocity, and Variety.
Conclusion
In addition to solving today’s problems, Data Scientists are also at the heart of tomorrow’s projects. You’ve surely heard of self-driving cars. It is, in fact, through better use of data that we are able to build more powerful and intelligent robots. And this is the responsibility of Data Scientists.
Resources:
- https://www.goodreads.com/book/show/7170627-the-emperor-of-all-maladies
- https://www.goodreads.com/book/show/705365.The_Rise_and_Fall_of_Modern_Medicine
- https://www.theguardian.com/news/2005/jun/02/thisweekssciencequestions.cancer
- https://en.wikipedia.org/wiki/Comma-separated_values
- https://www.edvancer.in/50-amazing-big-data-and-data-science-quotes-to-inspire-you/