Types of Data You Need to Know as a Data Scientist

Category: IT Technology · Published: 4 years ago

Data Science

Understand Data Science and Data Types

Background by Helloquence on Unsplash

Table of contents

  1. What’s Data Science?
  2. Is Data Science a New Field?
  3. Types of Data
  4. Data Formats/Sources
  5. Summary and Conclusion

What’s Data Science?

Data science is the study of large quantities of data. Just as the biological sciences are the study of biology and the physical sciences are the study of physical reactions, data science is the process of using data to understand different things, to understand the world.

Data science is a field concerned with ways of extracting information from data in its various forms, whether structured or unstructured. It is a multidisciplinary field that brings together concepts from computer science, mathematics/statistics, and data analysis.

The heart of data science is to always ask questions. A data scientist always needs to be curious about the world:

  1. What can we learn from this data?
  2. What actions can we take once we find whatever it is we are looking for?

“In the next 10 years, data science and software will do more for medicine than all of the biological sciences together”. Vinod Khosla

Is Data Science a New Field?

In the years following World War II, deaths from lung cancer rose sharply. Scientists and doctors did not agree on a specific cause, and few considered cigarettes as a possible explanation because, at that time, almost everyone smoked. In Britain, details of all doctors were recorded in a central register (including whether each doctor smoked, whether he was alive or dead, and, if dead, the cause of death), so a large dataset that could be used to study this phenomenon already existed, even though it had not been collected for that purpose. Bradford Hill and Richard Doll began to analyze this data, extracting the doctors who had died from lung cancer and checking whether they were smokers. The result was clear: 100% of the doctors who had died from lung cancer during the preceding 29 months were smokers. So data science isn’t new; what is new is the vast quantity of data available from massively varied sources.

There are many paths to a career in data science; most, but not all, involve a little math, a little science, and a lot of curiosity about data.

Types of Data

Big Data: No precise or universal definition can be given to Big Data ... The term usually refers to massive datasets, or to data that contains greater variety, arriving in increasing volumes and with ever-higher velocity (the 3V rule).

Image source: Author
  1. Data Volume : Huge data size, terabytes — petabytes.
  2. Data Velocity : High speed of data flow, data changes, and data processing.
  3. Data Variety : Various data sources (social media, mobile, structured data, unstructured data…).

“The world is one big data problem”. Andrew McAfee

Structured Data: Data that has a predefined structure and is stored in an ordered manner in relational databases or spreadsheets (traditional row-column databases). There are two sources of structured data:

  1. Machine-generated: All the data received from sensors, web logs, and financial systems, including data from medical devices, GPS data, and usage statistics captured by servers.
  2. Human-generated: Mainly includes all the data that humans input into computers, such as names and other personal details, websites visited, and types of movies watched. Companies can use this data to understand customer behavior and make appropriate business decisions, for example through the recommendation systems of Netflix, YouTube, or Medium.
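
To make the "row-column" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and viewing records are invented for illustration; they are assumptions, not data from any real service.

```python
import sqlite3

# Hypothetical viewing records: every row has the same three fields.
rows = [
    ("alice", "Documentary", 42),
    ("bob", "Comedy", 95),
    ("alice", "Comedy", 30),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE views (user_name TEXT, genre TEXT, minutes_watched INTEGER)"
)
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)

# Because the structure is known in advance, a query can address
# columns by name and aggregate them directly.
minutes_by_genre = dict(
    conn.execute(
        "SELECT genre, SUM(minutes_watched) FROM views GROUP BY genre"
    ).fetchall()
)
conn.close()

print(minutes_by_genre)
```

The point of the sketch is only that a fixed schema lets you ask questions of the data without first working out where each field lives.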

Unstructured Data: Data with no predefined structure. It comes in any size or form (it has no clear storage format) and cannot be easily stored in tables. It is also classified based on its source:

  1. Machine-generated: Accounts for satellite images, security camera footage, radar data captures, and much more besides.
  2. Human-generated: Found in abundance across the internet, since it includes social media data, mobile data, email, and website content. This covers the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send.

Semi-Structured Data: Information that is not in the traditional database format of structured data but contains some organizational properties that make it easier to process. The line between unstructured and semi-structured data has always been unclear, since most semi-structured data appears unstructured at a glance. For example, NoSQL documents are considered semi-structured, since they contain keywords (such as field names) that can be used to process the document easily.
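
As a rough illustration, the sketch below parses two JSON documents of the kind a document store might hold. The documents and their field names are hypothetical. They share some keys but not all, which is what separates them from rows in a fixed table.

```python
import json

# Two hypothetical documents: they share some keys but not all of them.
raw_documents = [
    '{"name": "Alice", "email": "alice@example.com", "tags": ["admin"]}',
    '{"name": "Bob", "address": {"city": "Paris"}}',
]

documents = [json.loads(text) for text in raw_documents]

# Keys make fields easy to locate, but any given key may be absent,
# so lookups other than "name" fall back to None instead of failing.
names = [doc["name"] for doc in documents]
emails = [doc.get("email") for doc in documents]
cities = [doc.get("address", {}).get("city") for doc in documents]

print(names, emails, cities)
```

Using `dict.get` with a default is one simple way to cope with missing fields; a real pipeline would likely validate documents more carefully.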

Other data types

Quantitative Data: Numerical. For example: height, weight, salary, prices …

Categorical Data: Data that can be labeled or divided into groups. For example: sex, hair color, race, fruits, animals …
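
The distinction matters in practice because the two kinds of data support different operations. The following sketch, built on made-up records, averages a quantitative field and counts the groups of a categorical one. The field names `height_cm` and `hair_color` are assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical records mixing one quantitative and one categorical field.
people = [
    {"height_cm": 170.0, "hair_color": "brown"},
    {"height_cm": 182.0, "hair_color": "black"},
    {"height_cm": 164.0, "hair_color": "brown"},
]

# Quantitative data supports arithmetic such as averaging.
average_height = mean(p["height_cm"] for p in people)

# Categorical data is summarized by counting members of each group.
hair_color_counts = Counter(p["hair_color"] for p in people)

print(average_height, hair_color_counts)
```

Averaging hair colors would be meaningless, while counting how often each exact height occurs is rarely useful, which is why the type of a column usually decides how it is analyzed.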

Data Formats/Sources

Most Common Data Formats:

  • CSV - Comma-separated values: as its name says, a delimited text file that uses commas to separate values.
  • XML - Extensible Markup Language
  • SQL - Structured Query Language (the language used to query data stored in relational databases)
  • JSON - JavaScript Object Notation
  • Protocol Buffers
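
To compare three of these formats side by side, here is a small sketch that writes the same hypothetical record as CSV, JSON, and XML using only the Python standard library. The record and the `person` element name are invented for the example. Protocol Buffers is left out because it needs a separately defined schema and a third-party package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# One hypothetical record, serialized three ways.
record = {"name": "Alice", "city": "Paris"}

# CSV: a header line followed by one comma-separated row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "city"])
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

# JSON: key/value pairs in a single object.
json_text = json.dumps(record)

# XML: one child element per field under a root element.
person = ET.Element("person")
for key, value in record.items():
    ET.SubElement(person, key).text = value
xml_text = ET.tostring(person, encoding="unicode")

print(csv_text)
print(json_text)
print(xml_text)
```

All three outputs carry the same information; they differ in how much structure (field names, nesting) travels with each value.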

Data Sources: Companies, APIs, Government, Academic, Web Scraping/Crawling …

Summary

  • Data science is the study of large quantities of data, which can reveal insights that help organizations make strategic choices.
  • Data science isn’t new; what is new is the vast quantity of data available from massively varied sources: web logs, social media, sales data, GPS data, usage statistics captured by servers, email, patient information files, sports performance data, sensor data, security cameras, and many more besides.
  • The term Big Data describes collections of very large volumes of data (whether structured, semi-structured, or unstructured) that can be processed and exploited in order to obtain intelligible and relevant information. The problem of Big Data is summarized by the 3V rule: Volume, Velocity, and Variety.

Conclusion

In addition to solving today’s problems, Data Scientists are also at the heart of tomorrow’s projects. You’ve surely heard of self-driving cars. It is, in fact, through better use of data that we are able to build more powerful and intelligent robots. And this is the responsibility of Data Scientists.

