Types of Data You Need to Know as a Data Scientist

Category: IT Technology · Published: 4 years ago

Data Science

Understand Data Science and Data Types

Background by Helloquence on Unsplash

Table of contents

  1. What’s Data Science?
  2. Is Data Science a New Field?
  3. Types of Data
  4. Data Formats/Sources
  5. Summary and Conclusion

What’s Data Science?

Data science is the study of large quantities of data. Just as the biological sciences study biology and the physical sciences study physical reactions, data science studies data. It is the process of using data to understand different things, to understand the world.

Data science is a field concerned with ways to extract information from data in its various forms, whether structured or unstructured. It’s a multi-disciplinary field that brings together concepts from computer science, mathematics and statistics, and data analysis.

The heart of data science is to always ask questions. A data scientist always needs to be curious about the world.

  1. What can we learn from this data?
  2. What actions can we take once we find whatever it is we are looking for?

“In the next 10 years, data science and software will do more for medicine than all of the biological sciences together”. Vinod Khosla

Is Data Science a New Field?

In the years following World War II, deaths from lung cancer rose sharply. Scientists and doctors did not agree on a specific cause, and no one considered the hypothesis that cigarettes might be responsible, because at that time almost everyone smoked. In Britain, details of all doctors were recorded in a central register (including whether each doctor smoked, whether he was alive or dead, and, if dead, the cause of death), so a large dataset existed, unintentionally, that could be used to understand the phenomenon. Bradford Hill and Richard Doll began working with this data, extracting the doctors who had died from lung cancer and checking whether they were smokers. The result was clear: 100% of those who had died from lung cancer during the preceding 29 months were smokers. So data science isn’t new; what is new is the vast quantity of data available from massively varied sources.

There are many paths to a career in data science; most, but not all, involve a little math, a little science, and a lot of curiosity about data.

Types of Data

Big Data: No precise or universal definition can be given to Big Data ... The term generally refers to massive datasets, or data of greater variety arriving in increasing volumes and with ever-higher velocity (the 3V rule).

Image source: Author
  1. Data Volume: Huge data size, from terabytes to petabytes.
  2. Data Velocity: High speed of data flow, data changes, and data processing.
  3. Data Variety: Various data sources (social media, mobile, structured data, unstructured data…).

“The world is one big data problem”. Andrew McAfee

Structured Data: Data with a predefined structure, typically stored in an ordered manner in relational databases or spreadsheets (traditional row-column databases). There are two sources of structured data:

  1. Machine-generated: All the data received from sensors, web logs, and financial systems, including medical devices, GPS data, and usage statistics captured by servers.
  2. Human-generated: Mainly the data that humans enter into computers, such as names and other personal details, websites visited, and types of movies watched. Companies can use this data to understand customer behavior and make appropriate business decisions, as in the recommendation systems of Netflix, YouTube, or Medium.
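As a rough illustration of what "predefined structure" means in practice, the sketch below loads a few rows into an in-memory relational table using Python's standard sqlite3 module. The table name, column names, and viewing records are all made up for this example.

```python
import sqlite3

# Hypothetical viewing records: every row has the same three fields.
rows = [
    ("alice", "Inception", 148),
    ("bob", "Up", 96),
    ("alice", "Up", 96),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (viewer TEXT, movie TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)

# Because the structure is fixed, a query can address columns by name.
minutes_per_viewer = dict(
    conn.execute("SELECT viewer, SUM(minutes) FROM views GROUP BY viewer")
)
conn.close()
print(minutes_per_viewer)
```

The point is only that every record fits the same columns, which is what makes questions like "total minutes per viewer" a one-line query.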

Unstructured Data: Data with no predefined structure. It comes in any size or form, has no clear storage format, and cannot easily be stored in tables. It is also classified by its source:

  1. Machine-generated: Includes satellite images, security camera footage, radar data captures, and much more.
  2. Human-generated: Found in abundance across the internet, since it includes social media data, mobile data, email, and website content. This covers the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send.
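To make the contrast with tables concrete, here is a minimal sketch of one way to impose some structure on free text, using only the Python standard library. The message is invented, and splitting text into lowercase words is just one simple choice among many possible ones.

```python
import re
from collections import Counter

# Hypothetical free-text message; nothing marks where a "field" begins or ends.
message = "Loved the new camera! Battery life is poor though. Camera 10/10."

# Impose a structure ourselves: lowercase word tokens and their counts.
words = re.findall(r"[a-z]+", message.lower())
word_counts = Counter(words)
print(word_counts.most_common(3))
```

Unlike a table row, the text carries no column names, so any structure (here, word counts) has to be decided on and extracted by the analyst.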

Semi-Structured Data: Information that is not in the traditional database format of structured data, but that contains some organizational properties which make it easier to process. The line between unstructured and semi-structured data has always been unclear, since most semi-structured data appears unstructured at a glance. For example, NoSQL documents are considered semi-structured, since they contain keywords (such as field names) that can be used to process the document easily.
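A small sketch may help here. The two JSON documents below are hypothetical, written in the style a document store might hold them, and are parsed with Python's standard json module. Each document names its own fields, and the fields differ from one document to the next.

```python
import json

# Hypothetical documents: each one names its own fields, and the fields
# need not match between documents.
raw = """
[
  {"name": "Ada", "email": "ada@example.com"},
  {"name": "Lin", "phones": ["555-0100", "555-0101"], "city": "Oslo"}
]
"""

documents = json.loads(raw)

# The keys make processing easy even though there is no fixed schema.
all_keys = sorted({key for doc in documents for key in doc})
emails = [doc.get("email") for doc in documents]
print(all_keys)
print(emails)
```

A missing field simply comes back as None here, whereas a relational table would have required an email column for every row.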

Other data types

Quantitative Data: Data expressed as numbers. For example: height, weight, salary, prices …

Categorical Data: Data that can be labeled or divided into groups. For example: sex, hair color, race, fruits, animals …
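One practical consequence of this distinction is that the two kinds of data call for different summaries. The sketch below uses invented records and hypothetical field names (height_cm, hair_color) with Python's standard library.

```python
from collections import Counter
from statistics import mean

# Hypothetical records mixing one quantitative and one categorical field.
people = [
    {"height_cm": 170, "hair_color": "brown"},
    {"height_cm": 182, "hair_color": "black"},
    {"height_cm": 164, "hair_color": "brown"},
]

# Quantitative values support arithmetic such as an average.
average_height = mean(p["height_cm"] for p in people)

# Categorical values are labels, so counting per group is the natural summary.
hair_counts = Counter(p["hair_color"] for p in people)

print(average_height)
print(hair_counts)
```

Averaging hair colors would be meaningless, while counting how many people share each height is rarely informative, which is why it helps to know the type before choosing an analysis.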

Data Formats/Sources

Most Common Data Formats:

  • CSV - Comma-separated values: as the name says, a delimited text file that uses commas to separate values.
  • XML - Extensible Markup Language
  • SQL - Structured Query Language
  • JSON - JavaScript Object Notation
  • Protocol Buffers
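To compare a few of these formats side by side, the sketch below writes one made-up record as CSV, JSON, and XML and reads each back with Python's standard library. The record and its field names are assumptions for illustration only.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# One hypothetical record written in three of the formats listed above.
csv_text = "name,age\nAda,36\n"
json_text = '{"name": "Ada", "age": 36}'
xml_text = "<person><name>Ada</name><age>36</age></person>"

from_csv = next(csv.DictReader(io.StringIO(csv_text)))
from_json = json.loads(json_text)
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# CSV and XML carry every value as text; JSON keeps the number typed.
print(from_csv)
print(from_json)
print(from_xml)
```

All three carry the same information, but only the JSON version comes back with age as a number; the CSV and XML values are strings that the reader has to convert.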

Data Sources: Companies, APIs, Government, Academic, Web Scraping/Crawling …

Summary

  • Data science is the study of large quantities of data, which can reveal insights that help organizations make strategic choices.
  • Data science isn’t new; what is new is the vast quantity of data available from massively varied sources: web logs, social media, sales data, GPS data, usage statistics captured by servers, email, patient information files, sports performance data, sensor data, security cameras, and many more besides.
  • The term Big Data describes collections of very large volumes of data (whether structured, semi-structured, or unstructured) that can be processed and exploited to obtain intelligible and relevant information. The problem of Big Data can be summarized by the 3V rule: Volume, Velocity, and Variety.

Conclusion

In addition to solving today’s problems, Data Scientists are also at the heart of tomorrow’s projects. You’ve surely heard of self-driving cars. It is, in fact, through better use of data that we are able to build more powerful and intelligent robots. And this is the responsibility of Data Scientists.

