One of the hardest parts of working in data science is discovering which data to use when trying to solve a business problem.
Remember that before trying to get data to solve a problem, you need to understand the context of the business and the project. By context I mean all the specifics of how a company runs its projects, how it is organized, its competitors, how many departments it has, their different objectives and goals, and how they measure success or failure.
Once you have all of that, you can start thinking about getting the data required to solve the business problem. In this article I won’t talk much about data collection; instead, I want to discuss and show you the process of enriching the data you already have with new data.
Remember that getting new data has to be done in a systematic fashion. It’s not about grabbing data out of nowhere; we have to do it consistently, plan it, and create a process for it, and that depends on engineering, architecture, DataOps, and other topics I’ll be discussing in other articles.
Setting up the environment
In this article, we will be using three things: Python, GitHub, and Explorium. If you want to know more about Explorium check this:
Let’s start by creating a new git repo. Here we will be storing our data, code, and documentation. Go to your terminal and create a new folder and move there:
mkdir data_discovery
cd data_discovery
Then initialize the git repo:
git init
Now let’s create a remote repo on GitHub:
Now go to your terminal and type (change the URL to yours):
git remote add origin https://github.com/FavioVazquez/data_discovery.git
Now let’s check:
git remote -v
You should see (with your own URL of course):
origin https://github.com/FavioVazquez/data_discovery.git (fetch)
origin https://github.com/FavioVazquez/data_discovery.git (push)
Now let’s start a Readme file (I’m using Vim):
vim Readme.md
And write whatever you want in there.
Now let’s add the file to git:
git add .
And create our first commit:
git commit -m "Add readme file"
Finally, let’s push this to our GitHub repo:
git push --set-upstream origin master
By this point your repo should look like this:
Finding the data
Let’s find some data. I’m not going to go through the whole data science process here of understanding the data, exploring it, modeling, or anything like that. I’m just going to find some interesting data as a demo for you.
For this example, we will be exploring data from Yelp reviews for some businesses. The data comes originally from the Yelp Dataset Challenge:
https://www.yelp.com/dataset/challenge
But I’m using a CSV version of the dataset from here. The CSV file I’m using is called “yelp_reviews_RV_categories.csv”.
Let’s add that to git:
git add .
git commit -m "Add data"
git push
We will begin by loading the data into pandas and running a basic EDA with pandas-profiling:
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("yelp_reviews_RV_categories.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")
profile
This will give you a comprehensive report on the data that looks like this:
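If you want to keep the report alongside the repo or share it without a notebook, pandas-profiling can also export it as a standalone HTML file (the filename here is just an example):
# Save the profiling report as a standalone HTML file
profile.to_file("yelp_reviews_profile.html")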
Using Explorium to get more data
Great now it’s time to go to Explorium. To know more about Explorium click here:
Let’s create a new project:
After naming your project, add your data from your local machine. We will upload our main dataset, and you’ll see something like this:
You can get more basic information about the columns on Explorium in the exploration bar below:
The data we have contains information about each business, such as where it’s located (city, state, zip code), its name and category, and also data about the reviewer. But we would like to know more about the business itself. We will start by blacklisting the information we don’t care about (see the quick pandas sketch after the list):
- user_id
- text
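Blacklisting happens in the Explorium UI, but mirroring it locally is just a pandas column drop (a minimal sketch, assuming those column names match the CSV):
# Mirror the blacklist locally: drop the columns we don't care about
df_clean = df.drop(columns=["user_id", "text"], errors="ignore")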
Now let’s try to get more information about this dataset. Explorium will ask you for something to predict in order to run. We don’t actually want to predict anything, but let’s pick something so it works (we will use “stars” as the target):
When we click Play, the system starts gathering information about the dataset, so we have to wait a bit. At this point the system is not only bringing in new data from external sources but also creating new features based on our existing columns. We won’t use that for now, but it will be important in upcoming articles about feature engineering.
After some minutes I got this screen:
That means Explorium found 102 datasets that can complement my original data and, in the end, it created or fetched 3791 columns from my data and the external sources. Remember that we are interested in finding more information about the businesses, so I’ll pick some columns coming from the external datasets and add them to my original data.
This is the actual process of enriching the data. As you can see, the system can tell you what the top 50 features are, but with respect to what? Remember we are trying to “predict” the stars from the other columns, so what it’s telling you is that these features have some predictive power regarding the target we chose.
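Under the hood, “predictive power” simply means a feature carries information about the target. As a rough local illustration (not Explorium’s actual ranking), here is a minimal sketch using scikit-learn’s mutual information on the numeric columns of our original dataframe, assuming it has a numeric “stars” column:
# Rough illustration of "predictive power": mutual information between each
# numeric column and the target. Assumes df has a numeric "stars" column.
from sklearn.feature_selection import mutual_info_regression

numeric = df.select_dtypes("number").dropna()
target = numeric.pop("stars")
scores = mutual_info_regression(numeric, target)
print(sorted(zip(numeric.columns, scores), key=lambda t: -t[1]))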
You can actually get more information about the specific column of interest.
Let’s start with something very basic: getting the website of each business. For that, I’ll use the Search Engine Results dataset:
If you click the arrow, you’ll download it. After that, you can load it in pandas:
# search engine
search = pd.read_csv("Search Engine Results.csv")
As you will see, we have a lot of missing data, but that is normal when you do data enrichment. Now let’s look at the company’s website column:
search[["name", "Company Website"]]
And what you are seeing is the most likely webpage for a specific business given its name. Pretty cool right?
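To actually use this enrichment, you would typically check how sparse the new column is and then join it back onto the original dataframe. A minimal sketch, assuming “name” is a key shared by both frames:
# Fraction of businesses with no website found
print(search["Company Website"].isna().mean())

# Join the enrichment back onto the original reviews by business name
enriched = df.merge(
    search[["name", "Company Website"]].drop_duplicates("name"),
    on="name",
    how="left",
)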
What about the number of violent crimes around each business? For that, I’ll be using the “US Crime Statistics” dataset:
We will use a different method to download this data. I only want violent crimes, so in the features section, after filtering by the crimes dataset, I’ll select just the violent crime features:
Then click download and let’s look at it in Python (I’m assuming the exported file is named after the dataset; adjust the filename to whatever you downloaded):
crimes = pd.read_csv("US Crime Statistics.csv")  # assumed filename
crimes[["name", "violent_crime"]].dropna().drop_duplicates()
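If you want a quick look at which businesses sit in the most violent areas, you can sort by that new column (just a small usage sketch):
# Businesses ranked by nearby violent crime, highest first
(crimes[["name", "violent_crime"]]
    .dropna()
    .drop_duplicates()
    .sort_values("violent_crime", ascending=False)
    .head(10))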
And just like that, you know how many violent crimes happened around each specific store. The actual process of creating this variable is quite complex; to learn more about it, go to Explorium and click Learn More when selecting a variable:
In the feature origin section, you see how it was created and from which data sources:
As you can see the process is not that simple, but the system is doing it for you so that’s awesome :)
You can do that with every single variable that the system gathered or created for you, and with that, you have full control over the process.
Even more data
If you remember, we set “stars” as the column to predict. Even though we are not interested in that prediction itself, Explorium did its best to get data for predicting that column. In future articles, I’ll create whole projects with the tool so you can see the whole picture.
For now, we can select the best 50 features for predicting that placeholder target we chose. To do that, we go to the Engine tab and select Features:
We will only get the best 50 variables out of the 3791 the system gathered and created, and then we will download them as before. The downloaded dataset is called “all_features.csv” by default, so let’s load it in our EDA notebook:
data = pd.read_csv("all_features.csv")
These are the columns we have:
['name', 'address', 'latitude', 'longitude', 'categories', 'review_stars', 'KNeighbors(latitude, longitude)', 'review_stars.1', '"rv" in categories', 'KNeighbors(Latitude, Longitude)', '"world" in Results Snippets', '"camping" in Results Snippets', '"camping" in Title of the First Result', '"camping" in Results Titles', '"world rv" in Results Titles', '"motorhomes" in Results Snippets', '"campers" in Results Snippets', '"accessories" in Results Titles', '"rated based" in Results Snippets', '"parts accessories" in Results Snippets', '"5th wheels" in Results Snippets', '"sale" in Results Titles', '"based" in Results Snippets', '"service center" in Title of the First Result', '"rvs" in Results Titles', '"buy" in Results Snippets', '"dealer" in Results Titles', '"inventory" in Results Snippets', '"travel" in Results Titles', 'KNeighbors(LAT, LONG)', 'Number of Related Links', 'day(Website Creation Date)', 'month(Website Creation Date)', '"service" in Website Description', 'year(Website Creation Date)', 'Percentile', 'Number of Website Subdomains', '"rv" in Website Description', 'MedianLoadTime', '"camping" in Website Description', '"buy" in Website Description', '"rv dealer" in Title', '"dealer" in Title', 'Number of Connections to Youtube', '"trailers" in Website Description', 'month(Domain Update Date)', 'Domain Creation Date - Domain Update Date', 'Domain Creation Date - Domain Expiry Date', 'Stopword Count(Associated Keywords)', '"pinterest instagram" in Social Networks', 'Number of Social Networks', '"facebook" in Social Networks', 'Year of Establishment', 'AdultContent->False', 'AdultContent->empty', 'Results From Facebook->False', 'Results From Facebook->True', 'Url Split(Company Website).company_website_10->empty', 'Url Split(Company Website).company_website_10->utm_campaign=Yext%20Directory%20Listing', 'Url Split(Company Website).company_website_10->utm_medium=organic', 'Url Split(Company Website).company_website_10->utm_source=moz', '_TARGET_']
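Since the download also includes the _TARGET_ column (the stars placeholder), one quick local sanity check, assuming that column is numeric, is to see which of the downloaded numeric features move with it. This is only a rough sketch; Explorium’s own ranking is what actually selected these 50 features:
# Absolute correlation of the numeric downloaded features with the target column
corr = data.select_dtypes("number").corr()["_TARGET_"].drop("_TARGET_")
print(corr.abs().sort_values(ascending=False).head(10))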
As the column list shows, the data is quite varied, and interesting of course. I’m not going to do more with it right now, because the idea was to show you how to get more data to complement the data you already have, but again, I’ll be creating whole projects where I follow the whole data science process.