Discovering New Data



Illustration by Héizel Vázquez

When you work in data science, one of the hardest parts is discovering which data to use when trying to solve a business problem.

Remember that before trying to get data to solve a problem, you need to understand the context of the business and the project. By context I mean all the specifics of how a company runs its projects: how the company is organized, who its competitors are, how many departments it has, the different objectives and goals it pursues, and how it measures success or failure.

When you have all of that, you can start thinking about getting the data required to solve the business problem. In this article I won’t talk much about data collection; instead, I want to discuss and show you the process of enriching the data you already have with new data.

Remember that getting new data has to be done in a systematic fashion; it’s not just grabbing data out of nowhere. We have to do it consistently, plan it, and create a process for it, and that depends on engineering, architecture, DataOps, and other things I’ll be discussing in other articles.

Setting up the environment

In this article, we will be using three things: Python, GitHub, and Explorium. If you want to know more about Explorium, take a look at their site.

Let’s start by creating a new git repo, where we will store our data, code, and documentation. Go to your terminal, create a new folder, and move into it:

mkdir data_discovery
cd data_discovery

Then initialize the git repo:

git init

Now let’s create a remote repo on GitHub:

[Screenshot: creating a new repository on GitHub]

Now go to your terminal and type (change the URL to yours):

git remote add origin https://github.com/FavioVazquez/data_discovery.git

Now let’s check:

git remote -v

You should see (with your own URL of course):

origin https://github.com/FavioVazquez/data_discovery.git (fetch)
origin https://github.com/FavioVazquez/data_discovery.git (push)

Now let’s create a Readme file (I’m using Vim):

vim Readme.md

And write whatever you want in there.

Now let’s add the file to git:

git add .

And create our first commit:

git commit -m "Add readme file"

Finally, let’s push this to our GitHub repo:

git push --set-upstream origin master

By this point your repo should look like this:

[Screenshot: the GitHub repo with the Readme file]

Finding the data

Let’s find some data. I’m not going to do the whole data science process here (understanding the data, exploring it, modeling, and so on); I’m just going to find some interesting data as a demo for you.

For this example, we will be exploring Yelp reviews for some businesses. The data originally comes from the Yelp Dataset Challenge:

https://www.yelp.com/dataset/challenge

But I’m using a CSV dataset from here. The CSV file I’m using is called “yelp_reviews_RV_categories.csv”.

Let’s add that to git:

git add .
git commit -m "Add data"
git push

We will begin by loading the data in Pandas and performing a basic EDA with pandas-profiling:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("yelp_reviews_RV_categories.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

This will give you a comprehensive report on the data that looks like this:

[Screenshot: the pandas-profiling report]
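Since we are versioning everything in git, it can be handy to export the profiling report as a standalone HTML file as well (the filename here is just an example):

# Save the profiling report as HTML so it can be committed to the repo
profile.to_file("yelp_reviews_profile.html")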

Using Explorium to get more data

Great, now it’s time to go to Explorium.

Let’s create a new project:

[Screenshot: creating a new project in Explorium]

After naming your project, add your main dataset from your local machine. You’ll see something like this:

[Screenshot: the main dataset loaded in Explorium]

You can get more basic information about the columns in Explorium’s exploration bar:

[Screenshot: column information in the exploration bar]

The data we have contains information about each business: where it’s located (city, state, zip code), its name, its category, and also data about its reviewers. But we would like to know more about the business itself. We will start by blacklisting the information we don’t care about (a local Pandas equivalent is sketched after the screenshot):

  • user_id
  • text

[Screenshot: blacklisting the user_id and text columns]
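Note that blacklisting only tells Explorium to ignore those columns. If you also want to drop them from your local copy, a minimal Pandas equivalent (assuming the same column names) is:

# Local equivalent of the blacklist: drop the columns we don't care about
df = df.drop(columns=["user_id", "text"])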

Now let’s try to get more information about this dataset. Explorium will ask you for something to predict in order to run. We don’t actually want to predict anything, but let’s pick something so it works (we will use “stars” as the target):

[Screenshot: using “stars” as the target]

When we click Play, the system will start gathering information about the dataset. We have to wait here. At this point the system is not only bringing in new data from external sources but also creating new features based on our existing columns. We won’t use those for now, but they will be important in upcoming articles about feature engineering.

After some minutes I got this screen:

[Screenshot: Explorium’s results after the run]

That means Explorium found 102 datasets that can complement my original data, and in the end it created/fetched 3791 columns from my data and the external sources. Remember that we are interested in finding more information about the businesses, so I’ll pick some columns coming from the external datasets and add them to my original data.

This is the actual process of enriching the data. As you can see, the system can tell you the top 50 features, but with respect to what? If you remember, we are trying to “predict” the stars from the other columns, so what it’s telling you is that these features have some predictive power regarding the target we chose.

You can actually get more information about the specific column of interest.
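As a rough local analogy (this is not Explorium’s actual algorithm), you could score candidate features against the target yourself, for example with scikit-learn’s mutual information estimator. The sketch below assumes the original CSV has a numeric “stars” column with no missing values:

from sklearn.feature_selection import mutual_info_regression

# Score the numeric columns against the "stars" target
numeric = df.select_dtypes("number").drop(columns=["stars"], errors="ignore").fillna(0)
scores = mutual_info_regression(numeric, df["stars"])
print(pd.Series(scores, index=numeric.columns).sort_values(ascending=False))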

Let’s start with something very basic: getting the website of each business. For that, I’ll use the Search Engine Results dataset:

[Screenshot: the Search Engine Results dataset]

If you click the arrow, you’ll download it. After that, you can load it in Pandas:

# search engine results downloaded from Explorium
search = pd.read_csv("Search Engine Results.csv")

As you will see, we have a lot of missing data, but that is normal when you do data enrichment. Great, let’s look at each company’s website:

search[["name", "Company Website"]]

[Screenshot: the name and Company Website columns]

And what you are seeing is the most likely webpage for a specific business given its name. Pretty cool, right?
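To put a number on the sparsity mentioned above, a quick check (assuming the column names from the downloaded file):

# Fraction of businesses for which no website was found
missing_ratio = search["Company Website"].isna().mean()
print(f"{missing_ratio:.1%} of businesses have no website match")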

What about the number of violent crimes around each business? For that, I’ll be using the “US Crime Statistics” dataset:

[Screenshot: the US Crime Statistics dataset]

We will use a different method to download this data. I just want the violent crimes, so in my features section, after filtering by the crimes dataset, I’ll select only the violent crimes:

[Screenshot: selecting the violent crimes feature]

And click download. Let’s see it in Python:

crimes = pd.read_csv("US Crime Statistics.csv")  # filename assumed
crimes[["name", "violent_crime"]].dropna().drop_duplicates()

[Screenshot: violent crimes per business]

And just like that, you know how many violent crimes happened around a specific store. The actual process of creating this variable is quite complex. To learn more about it, go to Explorium and click on Learn More when selecting a variable:

[Screenshot: the Learn More option for a variable]

In the feature origin section, you see how it was created and from which data sources:

[Screenshot: the feature origin section]

As you can see the process is not that simple, but the system is doing it for you so that’s awesome :)

You can do that with every single variable that the system gathered or created for you, and with that, you have full control over the process.
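Once you’ve downloaded enriched columns like these, folding them back into the original table is a simple join. A minimal sketch, assuming the business name is a usable key (as in the screenshots above):

# Join the downloaded crime enrichment back onto the original reviews
enriched = df.merge(
    crimes[["name", "violent_crime"]].dropna().drop_duplicates(),
    on="name",
    how="left",  # keep every business, even those without crime data
)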

Even more data

If you remember, we set “stars” as the column to predict. Even though we are not interested in that, Explorium did its best to get data for predicting that column. In future articles, I’ll create whole projects with the tool so you can see the whole picture.

For now, we can select the best 50 features for predicting the placeholder target we selected. To do that, we go to the Engine tab and select Features:

[Screenshot: the Features view in the Engine tab]

We will only take the best 50 variables out of the 3791 the system gathered and created, and then download them as before. The downloaded dataset is called “all_features.csv” by default, so let’s load it in our EDA notebook:

data = pd.read_csv("all_features.csv")
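A quick way to see what came back is to check the shape and list the columns:

print(data.shape)   # (rows, columns)
list(data.columns)  # the selected features plus the original identifiers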

These are the columns we have:

['name',
 'address',
 'latitude',
 'longitude',
 'categories',
 'review_stars',
 'KNeighbors(latitude, longitude)',
 'review_stars.1',
 '"rv" in categories',
 'KNeighbors(Latitude, Longitude)',
 '"world" in Results Snippets',
 '"camping" in Results Snippets',
 '"camping" in Title of the First Result',
 '"camping" in Results Titles',
 '"world rv" in Results Titles',
 '"motorhomes" in Results Snippets',
 '"campers" in Results Snippets',
 '"accessories" in Results Titles',
 '"rated based" in Results Snippets',
 '"parts accessories" in Results Snippets',
 '"5th wheels" in Results Snippets',
 '"sale" in Results Titles',
 '"based" in Results Snippets',
 '"service center" in Title of the First Result',
 '"rvs" in Results Titles',
 '"buy" in Results Snippets',
 '"dealer" in Results Titles',
 '"inventory" in Results Snippets',
 '"travel" in Results Titles',
 'KNeighbors(LAT, LONG)',
 'Number of Related Links',
 'day(Website Creation Date)',
 'month(Website Creation Date)',
 '"service" in Website Description',
 'year(Website Creation Date)',
 'Percentile',
 'Number of Website Subdomains',
 '"rv" in Website Description',
 'MedianLoadTime',
 '"camping" in Website Description',
 '"buy" in Website Description',
 '"rv dealer" in Title',
 '"dealer" in Title',
 'Number of Connections to Youtube',
 '"trailers" in Website Description',
 'month(Domain Update Date)',
 'Domain Creation Date - Domain Update Date',
 'Domain Creation Date - Domain Expiry Date',
 'Stopword Count(Associated Keywords)',
 '"pinterest instagram" in Social Networks',
 'Number of Social Networks',
 '"facebook" in Social Networks',
 'Year of Establishment',
 'AdultContent->False',
 'AdultContent->empty',
 'Results From Facebook->False',
 'Results From Facebook->True',
 'Url Split(Company Website).company_website_10->empty',
 'Url Split(Company Website).company_website_10->utm_campaign=Yext%20Directory%20Listing',
 'Url Split(Company Website).company_website_10->utm_medium=organic',
 'Url Split(Company Website).company_website_10->utm_source=moz',
 '_TARGET_']

As you can see, the data is quite varied but interesting, of course. I’m not going to do more with it right now, because the idea was to show you how to get more data to enrich the data you already have; but again, I’ll be creating whole projects where I follow the whole data science process.
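If you do want a quick sanity check that the fetched features relate to the target, you can eyeball one of them (this assumes, as the column list suggests, that _TARGET_ holds the “stars” values):

# Mean target value split by one of the enriched boolean features
print(data.groupby('"rv" in categories')["_TARGET_"].mean())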

