12 Rules that made today’s Data Science


Introduction

I am going to ask you a really stupid and ridiculous question.

Why do we store data in tables?

For anyone who works with data, this might be a question along the lines of ‘Why do we use this sign (1) for the number one?’ Well, it is obvious, isn’t it?

Turns out, no. No one really seems to know. When I searched the term on Wikipedia, I expected to read that someone had gotten their head cut off over its invention in medieval times (that sort of thing was very common back then). But, surprise, surprise, Wikipedia does not have a History section for data tables.

While we can’t know the origin of the data table itself, we do know how tables evolved into the form they exist in today.

As the volume of data grew, we needed more efficiency and order in how we store it. While working at IBM, Edgar ‘Ted’ Codd proposed the relational model in 1970 and later distilled it into a set of 13 rules for databases, known as ‘Codd’s 12 rules’ (zero-indexed, geeky, right?).

At that time, hardware was very limited, so no one could put his vision into practice. After the technology boom, however, his ideas quickly caught on, and they are now the main reason companies like Oracle exist today.

While not all of the rules are strictly followed these days, there are a few important ones that data scientists use in their daily lives. These rules are essential for having clean datasets because they establish a basis for all data cleaning operations.

Relational database model

All of the rules were summarized under one term, the relational database model, and databases that use this model are called relational databases.

In relational databases, according to Rule 1 , all data should be presented at the most logical level and in exactly one way — tables.

The full list of the rules can be found on the Wikipedia page for Codd’s 12 rules.

Let’s say I own a private hospital and want to store data such as patient info, appointments, and staff records.

For my database to be relational, each table should represent exactly one entity type. That means I can’t store both patient information and appointments in the same table.

Sample database schema for the hospital. Image created by the author

Firstly, the above tables are not linked together. You might say that we can cross-match patients and appointments on a common patient_name column, but what if there were two or more patients with the same name? The same goes for the staff info.

Key constraints

To solve these types of problems, Edgar Codd specifies in Rule 5 that each row of each table should be made unique with the use of a primary key. The brilliance of this idea is that the unique key for each row can be used in other tables as a reference to that particular row. So, let’s change our tables a bit to make them compatible with this rule.

Sample database schema for the hospital. Image created by the author

After adding the patient_id and staff_id columns, all of the tables can be linked together. Using languages like SQL, the data can then be combined to answer questions that span several tables: for example, how many patients showed up to their appointment, which region most of the hospital’s patients come from, and even more complex questions.
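As a rough illustration, here is how that kind of key-based join could look in pandas (the Python equivalent of a SQL JOIN). Only the column names follow the schema above; the table contents are made up for the example.

```python
import pandas as pd

# Hypothetical slices of the patients and appointments tables
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "patient_name": ["Alice Smith", "Bob Jones", "Alice Smith"],
    "region": ["North", "South", "North"],
})

appointments = pd.DataFrame({
    "appointment_id": [10, 11, 12],
    "patient_id": [1, 3, 1],
    "attended": [True, False, True],
})

# The shared patient_id key lets us combine both tables unambiguously,
# even though two different patients happen to be named "Alice Smith"
joined = appointments.merge(patients, on="patient_id", how="left")

# e.g. how many appointments were attended per region
print(joined.groupby("region")["attended"].sum())
```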

Domain constraints

Apart from key constraints, Rule 5 also states that each table and column should have domain constraints. These constraints ensure the integrity of the data.

For example, I cannot put information about appointments into my patients table; that would violate the rule that a table should represent only one entity type. Besides, a single row should only contain a single record, which means I cannot record two appointments from the same patient in one row.

When it comes to columns, the data for the whole column should represent a single and logical data type. What I mean by single is that a column like the phone number of a patient cannot be a string in one cell and a large integer in another.

The logical data type constraint is essential for many operations in exploratory data analysis. Columns like address, name, and phone number should be strings, while weight and height should be floats so that we can do calculations with them later.
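To make this concrete, here is a small sketch of how such types could be enforced in pandas; the column values are hypothetical.

```python
import pandas as pd

# Hypothetical patients slice where phone_number arrived as a mix of
# strings and integers, and weight/height arrived as strings
patients = pd.DataFrame({
    "phone_number": ["555-0101", 5550102, "555-0103"],
    "weight": ["70.5", "82", "64.2"],
    "height": ["1.75", "1.80", "1.62"],
})

# Force one logical type per column: strings for identifiers,
# floats for quantities we want to calculate with later
patients["phone_number"] = patients["phone_number"].astype(str)
patients["weight"] = patients["weight"].astype(float)
patients["height"] = patients["height"].astype(float)

print(patients.dtypes)
```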

Other columns are trickier. Take the problem column, which records patients’ health issues: you might think its type should be a string.

But sometimes we want to analyze patients by the severity of their diseases. Unless we explicitly tell it, a computer cannot rank diseases by severity, so we need a different data type.

If you are doing your data cleaning with pandas, the package allows you to set these columns as categorical data and to specify an ordering of the diseases by severity.
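A minimal sketch of how this could look, assuming a made-up set of severity labels for the problem column:

```python
import pandas as pd

# Hypothetical severity labels; the ordering here is an assumption
severity_levels = ["mild", "moderate", "severe", "critical"]

problems = pd.Series(["mild", "severe", "moderate", "mild", "critical"])

# An ordered categorical lets pandas compare and sort by severity
problems = problems.astype(
    pd.CategoricalDtype(categories=severity_levels, ordered=True)
)

# Now comparisons like "worse than moderate" are meaningful
print(problems[problems > "moderate"])
```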

Logical constraints

When I am cleaning a dataset and encounter a column like the attended column from the appointments table, I set its data type to boolean.

Why? If you use the describe() function on a dataframe where such boolean columns have the wrong data type (e.g. string), the function leaves them out.

Sometimes these columns are represented as 1s and 0s. You might say that describe() now works on the column, but it returns the column’s mean, which is not what we want. If you set the column type to boolean, all the functions that involve booleans work as expected.
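Here is a small sketch of the idea, using a made-up attended column that arrives as strings:

```python
import pandas as pd

# Hypothetical attended column stored as strings
appointments = pd.DataFrame({"attended": ["yes", "no", "yes", "yes"]})

# As strings the column is skipped by the default describe(); as 0/1 integers
# it gets a meaningless mean. Converting to boolean gives sensible behaviour.
appointments["attended"] = (
    appointments["attended"].map({"yes": True, "no": False}).astype(bool)
)

print(appointments["attended"].dtype)       # bool
print(appointments["attended"].describe())  # count / unique / top / freq
print(appointments["attended"].sum())       # number of attended appointments
```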

As a final point on logical integrity, you should always check for values that do not make sense. For example, if both the age and the birthday of a person are given, you should make sure the recorded age matches the difference between the current date and the birthday.
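A quick sketch of such a check in pandas, with made-up ages and birthdays:

```python
import pandas as pd

# Hypothetical patients slice with a recorded age and a birthday column
patients = pd.DataFrame({
    "age": [34, 52, 29],
    "birthday": pd.to_datetime(["1989-05-02", "1971-11-20", "2001-03-15"]),
})

# Age implied by the birthday, in whole years (rough, ignores leap days)
today = pd.Timestamp.today()
implied_age = (today - patients["birthday"]).dt.days // 365

# Flag rows where the recorded age disagrees with the birthday
mismatch = (patients["age"] - implied_age).abs() > 1
print(patients[mismatch])
```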

That’s great, but how do we check whether columns like weight and height are logical and correct? One solution I would use is to calculate the body mass index (BMI) for each row and look up reference data on BMIs. There is plenty of information about typical BMIs for each age, so by comparing each patient’s BMI against their age you can catch outliers and values that do not make sense.
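A rough sketch of this idea; the weight and height values, and the simple BMI bounds used to flag rows, are assumptions for illustration (real reference tables by age would give better limits):

```python
import pandas as pd

# Hypothetical weight (kg) and height (m) columns
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "weight": [70.5, 250.0, 8.0],
    "height": [1.75, 1.80, 1.62],
})

# Body mass index = weight / height^2
patients["bmi"] = patients["weight"] / patients["height"] ** 2

# Crude plausibility bounds for adults; rows outside them deserve a closer look
suspicious = patients[(patients["bmi"] < 10) | (patients["bmi"] > 60)]
print(suspicious)
```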

Conclusion

This article showed just how important Edgar Codd’s ideas were to data science and to the whole world. We only looked at two of his rules, Rule 1 and Rule 5 .

These rules ensure that we have high-quality, clean data. Clean data means beautiful visualizations that do not lie, unskewed analyses, and less biased machine learning models.

