Data Science
Types of Data You Need to Know as a Data Scientist
Understand Data Science and Data Types
Apr 17 · 5 min read
Table of contents
- What’s Data Science?
- Is Data Science a New Field?
- Types of Data
- Data Formats/Sources
- Summary and Conclusion
What’s Data Science?
Data science is the study of large quantities of data. Just as the biological sciences study living things and the physical sciences study physical phenomena, data science studies data. It is the process of using data to understand different things, to understand the world.
Data science is concerned with ways of extracting information from data in its various forms, whether structured or unstructured. It’s a multidisciplinary field that brings together concepts from computer science, mathematics/statistics, and data analysis.
The heart of data science is to always ask questions. A data scientist always needs to be curious about the world.
- What can we learn from this data?
- What actions can we take once we find whatever it is we are looking for?
“In the next 10 years, data science and software will do more for medicine than all of the biological sciences together.” Vinod Khosla
Is Data Science a New Field?
In the years following World War II, deaths from lung cancer rose sharply. Scientists and doctors did not agree on a specific cause, and no one seriously considered the hypothesis that cigarettes could be the reason, because at that time almost everyone smoked. In Britain, the details of all doctors were recorded in a central register (including: smoker?; alive or dead?; and, if dead, the cause of death), so, unintentionally, a large dataset already existed that could be analyzed to understand this phenomenon. Bradford Hill and Richard Doll started to analyze this data, picking out the doctors who had died from lung cancer and checking whether they were smokers. The result was clear: 100% of the doctors who had died from lung cancer during the previous 29 months were smokers. So Data Science isn’t new; what is new is the vast quantity of data available from massively varied sources.
There are many paths to a career in data science; most, but not all, involve a little math, a little science, and a lot of curiosity about data.
Types of Data
Big Data: There is no precise or universal definition of Big Data. The term generally refers to massive datasets, or data that contains greater variety, arriving in increasing volumes and with ever-higher velocity (the 3V rule).
- Data Volume: Huge data size, from terabytes to petabytes.
- Data Velocity: High speed of data flow, data changes, and data processing.
- Data Variety: Various data sources (social media, mobile, structured data, unstructured data…).
“The world is one big data problem.” Andrew McAfee
Structured Data: Data that has a predefined structure and is stored in an ordered manner in relational databases or spreadsheets (traditional row-column databases). There are two sources of structured data:
- Machine-generated: All the data received from sensors, web logs, and financial systems, including medical devices, GPS data, and usage statistics captured by servers.
- Human-generated: Mainly the data that humans input into computers, such as names and other personal details, websites visited, and types of movies watched. Companies can use this data to understand customer behavior and make appropriate business decisions, for example through the recommendation systems of Netflix, YouTube, or Medium.
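To make the "row-column" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical, invented only for illustration. Because every row follows the same predefined schema, a query can aggregate the data directly.

```python
import sqlite3

# Hypothetical viewing records: (viewer, genre, movies_watched).
# These values are made up for illustration only.
rows = [
    ("alice", "Drama", 3),
    ("bob", "Comedy", 5),
    ("carol", "Drama", 2),
]

# An in-memory relational database with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE views (viewer TEXT, genre TEXT, movies_watched INTEGER)"
)
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)

# Every row has the same columns, so aggregation is a single query.
totals_by_genre = dict(
    conn.execute(
        "SELECT genre, SUM(movies_watched) FROM views "
        "GROUP BY genre ORDER BY genre"
    ).fetchall()
)
conn.close()

print(totals_by_genre)
```

A recommendation system would work with far richer data, but the principle is the same: a known schema is what makes the data easy to query.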
Unstructured Data: Data with no predefined structure. It comes in any size or form, has no clear storage format, and cannot easily be stored in tables. It is also classified based on its source:
- Machine-generated: Includes satellite images, security camera footage, radar data, and much more.
- Human-generated: Found in abundance across the internet, since it includes social media data, mobile data, email, and website content. This covers the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send.
Semi-Structured Data: Information that is not in the traditional database format of structured data but contains some organizational properties that make it easier to process. The line between unstructured and semi-structured data has always been unclear, since most semi-structured data appears unstructured at a glance. For example, NoSQL documents are considered semi-structured, since they contain keys (field names) that can be used to process the document easily.
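As a rough illustration of those organizational properties, the sketch below parses two JSON documents of the kind a document store might hold. The documents and their field names are hypothetical. They share some keys but not all, so there is no fixed schema, yet the keys still let us pull fields out by name.

```python
import json

# Two hypothetical documents from the same collection. They do not share
# the same set of fields, which is what makes them semi-structured.
raw_documents = """
[
  {"name": "Ada", "email": "ada@example.com", "tags": ["admin", "staff"]},
  {"name": "Lin", "phone": "555-0100"}
]
"""

documents = json.loads(raw_documents)

# Look fields up by key and supply a default where a document lacks one.
emails = [doc.get("email", "unknown") for doc in documents]

# The union of keys shows there is no single schema across documents.
all_keys = sorted({key for doc in documents for key in doc})

print(emails)
print(all_keys)
```

The `doc.get(..., default)` pattern is one simple way to cope with missing fields; a real pipeline might validate documents against an explicit schema instead.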
Other data types
Quantitative Data: Numerical data. For example: height, weight, salary, prices …
Categorical Data: Data that can be labeled or divided into groups. For example: sex, hair color, race, fruits, animals …
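The distinction matters in practice because the two kinds of data call for different summaries. The sketch below uses a small, invented sample: a quantitative column can be averaged, while a categorical column is summarized by counting how often each label occurs.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample; the values are invented for illustration.
heights_cm = [160, 170, 180]                # quantitative: arithmetic applies
hair_colors = ["brown", "black", "brown"]   # categorical: labels only

# Quantitative data supports numerical summaries such as the mean.
average_height = mean(heights_cm)

# Categorical data is summarized by frequency, not by arithmetic.
color_counts = Counter(hair_colors)
most_common_color = color_counts.most_common(1)[0][0]

print(average_height)
print(color_counts)
print(most_common_color)
```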
Data Formats/Sources
Most Common Data Formats:
- CSV - Comma-separated values: as its name says, a delimited text file that uses commas to separate values.
- XML - Extensible Markup Language
- SQL - Structured Query Language
- JSON - JavaScript Object Notation
- Protocol Buffers
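To compare three of the formats listed above, here is a minimal sketch that reads the same hypothetical record from CSV, JSON, and XML using only the Python standard library. The record and its field names are assumptions made for illustration. SQL and Protocol Buffers are left out because they need a database engine or a third-party package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record encoded three ways.
csv_text = "name,height_cm\nAda,170\n"
json_text = '{"name": "Ada", "height_cm": 170}'
xml_text = "<person><name>Ada</name><height_cm>170</height_cm></person>"

# CSV: the header row supplies the field names.
from_csv = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: maps directly onto Python dicts, lists, strings, and numbers.
from_json = json.loads(json_text)

# XML: walk the child elements and collect tag/text pairs.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# CSV and XML values arrive as strings; JSON preserves the number type.
print(from_csv["height_cm"], from_json["height_cm"], from_xml["height_cm"])
```

One practical consequence: values read from CSV or XML usually need explicit type conversion before any numerical analysis.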
Data Sources: Companies, APIs, Government, Academic, Web Scraping/Crawling …
Summary
- Data science is the study of large quantities of data, which can reveal insights that help organizations make strategic choices.
- Data Science isn’t new; what is new is the vast quantity of data available from massively varied sources: web logs, social media, sales data, GPS data, usage statistics captured by servers, email, patient information files, sports performance data, sensor data, security cameras, and many more besides.
- The term Big Data describes collections of very large volumes of data, whether structured, semi-structured, or unstructured, that can be processed and exploited in order to obtain intelligible and relevant information. We summarize the problem of Big Data by the 3V rule: Volume, Velocity, and Variety.
Conclusion
In addition to solving today’s problems, Data Scientists are also at the heart of tomorrow’s projects. You’ve surely heard of self-driving cars. It is, in fact, through better use of data that we are able to build more powerful and intelligent robots. And this is the responsibility of Data Scientists.
Resources:
- https://www.goodreads.com/book/show/7170627-the-emperor-of-all-maladies
- https://www.goodreads.com/book/show/705365.The_Rise_and_Fall_of_Modern_Medicine
- https://www.theguardian.com/news/2005/jun/02/thisweekssciencequestions.cancer
- https://en.wikipedia.org/wiki/Comma-separated_values
- https://www.edvancer.in/50-amazing-big-data-and-data-science-quotes-to-inspire-you/