Discovering New Data


Illustration by Héizel Vázquez

When you are working in data science, one of the hardest parts is discovering which data to use to solve a business problem.

Remember that before trying to get data to solve a problem, you need to understand the context of the business and the project. By context I mean all the specifics of how the company runs its projects, how it is organized, who its competitors are, how many departments it has, its different objectives and goals, and how it measures success or failure.

When you have all of that, you can start thinking about getting the required data to solve the business problem. In this article I won’t talk that much about data collection; instead, I want to discuss and show you the process of enriching the data you already have with new data.

Remember that getting new data has to be done in a systematic fashion. It’s not just grabbing data out of nowhere: we have to do it consistently, plan it, and create a process for it, and that depends on engineering, architecture, DataOps, and more things that I’ll be discussing in other articles.

Setting up the environment

In this article, we will be using three things: Python, GitHub, and Explorium.

Let’s start by creating a new git repo. Here we will be storing our data, code, and documentation. Go to your terminal and create a new folder and move there:

mkdir data_discovery
cd data_discovery

Then initialize the git repo:

git init

Now let’s create a remote repo on GitHub:

[Screenshot: creating a new repository on GitHub]

Now go to your terminal and type (change the URL to yours):

git remote add origin https://github.com/FavioVazquez/data_discovery.git

Now let’s check:

git remote -v

You should see (with your own URL of course):

origin https://github.com/FavioVazquez/data_discovery.git (fetch)
origin https://github.com/FavioVazquez/data_discovery.git (push)

Now let’s start a Readme file (I’m using Vim):

vim Readme.md

And write whatever you want in there.

Now let’s add the file to git:

git add .

And create our first commit:

git commit -m "Add readme file"

Finally, let’s push this to our GitHub repo:

git push --set-upstream origin master

By this point your repo should look like this:

[Screenshot: the GitHub repository after the first push]

Finding the data

Let’s find some data. I’m not going to do the whole data science process here of understanding the data, exploring it, modeling, or anything like that. I’m just going to find some interesting data as a demo for you.

For this example, we will be exploring Yelp review data for some businesses. The data originally comes from the Yelp Dataset Challenge:

https://www.yelp.com/dataset/challenge

But I’m using a CSV dataset from here. The CSV file I’m using is called “yelp_reviews_RV_categories.csv”.

Let’s add that to git:

git add .
git commit -m "Add data"
git push

We will begin by loading the data with Pandas and running a basic EDA with pandas-profiling:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("yelp_reviews_RV_categories.csv")  # the CSV described above
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

This will give you a comprehensive report on the data that looks like this:

[Screenshot: the pandas-profiling report]
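
If you want to keep the report around as an artifact in the repo, pandas-profiling can also export it to a standalone HTML file (the filename here is just an example):

# Save the profiling report as HTML so it can be committed to the repo
profile.to_file("yelp_reviews_profile.html")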

Using Explorium to get more data

Great, now it’s time to go to Explorium.

Let’s create a new project:

[Screenshot: creating a new project in Explorium]

After naming your project, you should add your data from your local machine. We will add our main dataset, and you’ll see something like this:

[Screenshot: the main dataset uploaded to Explorium]

You can get more basic information about the columns on Explorium in the exploration bar below:

[Screenshot: column details in the Explorium exploration bar]

The data we have contains information about a specific business, like where it’s located (city, state, zip code), its name, its category, and also data about its reviewers. But we would like to know more about the business itself. We will start by blacklisting all the information we don’t care about:

  • user_id
  • text

[Screenshot: blacklisting the user_id and text columns]

Now let’s try to get more information about this dataset. Explorium will ask you for something to predict in order to run; we don’t actually want to predict anything, but let’s pick something so it works (we will use “stars” as the target):

[Screenshot: using “stars” as the target]

When we click Play, the system will start gathering information about the dataset. We have to wait here. At this point the system is not only bringing in new data from external sources but also creating new features based on our existing columns. We won’t use that for now, but in the next articles about feature engineering it will be important.

After some minutes I got this screen:

[Screenshot: Explorium results after the enrichment run]

That means Explorium found 102 datasets that can complement my original data, and in the end it created/fetched 3791 columns from my data and from the external sources. Remember that we are interested in finding more information about the businesses, so I’ll pick some columns coming from the external datasets and add them to my original data.

This is the actual process of enriching the data. As you can see, the system can tell you what the top 50 features are, but with respect to what? If you remember, we are trying to “predict” the stars from the other columns, so what it’s telling you is that these 50 or 100 features have some predictive power with respect to the target we chose.
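
Explorium computes this ranking internally, but as a rough local analogue (not Explorium’s actual method), you could estimate how informative each numeric column is about the target with mutual information from scikit-learn; this assumes the original DataFrame df has a numeric "stars" column:

from sklearn.feature_selection import mutual_info_regression

# Rough local estimate of predictive power: mutual information between
# each numeric column and the "stars" target (column name assumed)
numeric = df.select_dtypes("number").dropna()
X = numeric.drop(columns=["stars"])
y = numeric["stars"]
mi = mutual_info_regression(X, y)
print(sorted(zip(X.columns, mi), key=lambda t: t[1], reverse=True)[:10])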

You can actually get more information about the specific column of interest.

Let’s start with something very basic: getting the website of the business. For that, I’ll use the Search Engine Results dataset:

[Screenshot: the Search Engine Results dataset]

If you click the arrow you’ll download it. After that, you can load it with Pandas:

# search engine results downloaded from Explorium
search = pd.read_csv("Search Engine Results.csv")

As you will see we have a lot of missing data, but that is normal when you do data enrichment. Great, let’s choose the company’s website:
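
To see how sparse the enrichment actually is, a quick look at the share of missing values per column helps (a small sketch on the search DataFrame):

# Share of missing values per column, highest first
missing_share = search.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))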

search[["name", "Company Website"]]

[Screenshot: the name and Company Website columns]

And what you are seeing is the most likely webpage for a specific business given its name. Pretty cool, right?
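
To actually enrich the original table, you can merge that column back into the main DataFrame. A minimal sketch, assuming the business name is a good enough join key (a real pipeline would use something more robust than the name):

# Join the Company Website column onto the original data by business name
websites = search[["name", "Company Website"]].dropna().drop_duplicates("name")
df_enriched = df.merge(websites, on="name", how="left")
df_enriched[["name", "Company Website"]].head()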

What about the number of violent crimes around each business? For that, I’ll be using the “US Crime Statistics” dataset:

[Screenshot: the US Crime Statistics dataset]

We will use a different method to download this data. I only want the violent crimes, so in my features section, after filtering by the crimes dataset, I’ll select just the violent crimes:

[Screenshot: selecting the violent crime features]

And click download. Let’s see it in Python:

crimes = pd.read_csv("US Crime Statistics.csv")  # adjust to the filename you downloaded
crimes[["name", "violent_crime"]].dropna().drop_duplicates()

[Screenshot: violent crime counts per business]
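
As a quick sanity check on what came back, you could, for example, rank the businesses by that count (a small sketch):

# Businesses ranked by violent crimes around them, highest first
(crimes[["name", "violent_crime"]]
    .dropna()
    .drop_duplicates()
    .sort_values("violent_crime", ascending=False)
    .head(10))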

And just like that, you know how many violent crimes there are around a specific store. The actual process of creating this variable is quite complex. To learn more about it, go to Explorium and click Learn More when selecting a variable:

[Screenshot: the Learn More option for a feature]

In the feature origin section, you see how it was created and from which data sources:

[Screenshot: the feature origin section]

As you can see, the process is not that simple, but the system does it for you, so that’s awesome :)

You can do that with every single variable that the system gathered or created for you, and with that, you have full control over the process.

Even more data

If you remember, we put “stars” as the column to predict. Even though we are not interested in that, Explorium did its best to get data for predicting that column. In future articles, I’ll create whole projects with the tool so you can see the whole picture.

For now, we can select the best 50 features for predicting that “test” variable we selected. To do that, we go to the Engine tab and select Features:

[Screenshot: the Features view in the Engine tab]

We will only keep the best 50 variables out of the 3791 the system gathered and created, and then we will download them as before. The downloaded dataset is called “all_features.csv” by default. So let’s load it in our EDA notebook:

data = pd.read_csv("all_features.csv")
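
To see exactly what we got, we can check the shape and list the column names; the list below is essentially that output:

print(data.shape)          # rows and columns of the downloaded feature set
print(list(data.columns))  # the column names shown below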

These are the columns we have:

['name',
 'address',
 'latitude',
 'longitude',
 'categories',
 'review_stars',
 'KNeighbors(latitude, longitude)',
 'review_stars.1',
 '"rv" in categories',
 'KNeighbors(Latitude, Longitude)',
 '"world" in Results Snippets',
 '"camping" in Results Snippets',
 '"camping" in Title of the First Result',
 '"camping" in Results Titles',
 '"world rv" in Results Titles',
 '"motorhomes" in Results Snippets',
 '"campers" in Results Snippets',
 '"accessories" in Results Titles',
 '"rated based" in Results Snippets',
 '"parts accessories" in Results Snippets',
 '"5th wheels" in Results Snippets',
 '"sale" in Results Titles',
 '"based" in Results Snippets',
 '"service center" in Title of the First Result',
 '"rvs" in Results Titles',
 '"buy" in Results Snippets',
 '"dealer" in Results Titles',
 '"inventory" in Results Snippets',
 '"travel" in Results Titles',
 'KNeighbors(LAT, LONG)',
 'Number of Related Links',
 'day(Website Creation Date)',
 'month(Website Creation Date)',
 '"service" in Website Description',
 'year(Website Creation Date)',
 'Percentile',
 'Number of Website Subdomains',
 '"rv" in Website Description',
 'MedianLoadTime',
 '"camping" in Website Description',
 '"buy" in Website Description',
 '"rv dealer" in Title',
 '"dealer" in Title',
 'Number of Connections to Youtube',
 '"trailers" in Website Description',
 'month(Domain Update Date)',
 'Domain Creation Date - Domain Update Date',
 'Domain Creation Date - Domain Expiry Date',
 'Stopword Count(Associated Keywords)',
 '"pinterest instagram" in Social Networks',
 'Number of Social Networks',
 '"facebook" in Social Networks',
 'Year of Establishment',
 'AdultContent->False',
 'AdultContent->empty',
 'Results From Facebook->False',
 'Results From Facebook->True',
 'Url Split(Company Website).company_website_10->empty',
 'Url Split(Company Website).company_website_10->utm_campaign=Yext%20Directory%20Listing',
 'Url Split(Company Website).company_website_10->utm_medium=organic',
 'Url Split(Company Website).company_website_10->utm_source=moz',
 '_TARGET_']
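
Since we never really wanted to predict anything, a reasonable next step would be to drop the placeholder target and keep just the enriched columns you care about. A minimal sketch (the column selection here is only an example):

# Drop the placeholder target and keep a few business-related columns
keep = ["name", "address", "latitude", "longitude", "categories",
        "Number of Social Networks", "Year of Establishment"]
business_features = data.drop(columns=["_TARGET_"])[keep]
business_features.head()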

As you can see, we have very different but interesting data. I’m not going to do more with it right now because the idea was to show you how to get more data to complement the data you already have, but again, I’ll be creating whole projects where I follow the whole data science process.

