End-to-End Machine Learning Project Tutorial — Part 1

The perpetual question with regards to Data Science that I come across:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no alternative to working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate exam, I'd say that no certificate or course can do it for you; you can only prove your competency with projects that showcase your research, programming skills, mathematical background, and so on.

In my post on how to build an effective Data Science Portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.

Agenda

This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Major topics covered:

  • Pre-requisites and Resources
  • Data Collection and Problem Statement
  • Exploratory Data Analysis with Pandas and NumPy
  • Data Preparation using Sklearn
  • Selecting and Training a few Machine Learning Models
  • Cross-Validation and Hyperparameter Tuning using Sklearn
  • Deploying the Final Trained Model on Heroku via a Flask App

Let’s start building…

Pre-requisites and Resources

This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:

That's it! Make sure you have an understanding of these concepts and tools, and you're ready to go.

Data Collection and Problem Statement

The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), then the first step is to define the problem you want to solve. We don't have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository. Here is the link to the dataset:

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.

  • You can also download the data into your project directly from the notebook using wget:
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

  • The next step is to load this .data file into a pandas DataFrame. For that, make sure you have pandas and other general-purpose libraries installed. Import all the general-purpose libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  • Reading and loading the file into a dataframe using the read_csv() method:
  • Looking at a few rows of the dataframe and reading the description of each attribute from the website helps you define the problem statement.
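The loading step above can be sketched as follows. The column names come from the dataset description on the UCI page; the sep, comment, and na_values settings are assumptions based on the raw file format (whitespace-separated values, '?' for missing entries, and a tab-prefixed car-name field). The snippet parses two sample rows inline for illustration; for the real file, pass 'auto-mpg.data' in place of the StringIO buffer:

```python
import io
import pandas as pd

# Column names taken from the dataset description on the UCI page.
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

# Two sample rows in the raw format of auto-mpg.data; for the real file,
# pass 'auto-mpg.data' instead of this StringIO buffer.
raw = io.StringIO(
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

# '?' marks missing values; the tab-prefixed car name is treated as a comment.
data = pd.read_csv(raw, names=cols, na_values='?',
                   comment='\t', sep=' ', skipinitialspace=True)
print(data)
```

Note that the second row's '?' in Horsepower is parsed as NaN, which is exactly the missing-value situation we handle during exploration below.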

Problem Statement — The data contains the MPG (Miles Per Gallon) variable, which is continuous and tells us about the fuel efficiency of a vehicle from the 1970s and '80s.

Our aim is to predict the MPG value of a vehicle, given the other attributes of that vehicle.

Exploratory Data Analysis with Pandas and NumPy

For this rather simple dataset, the exploration is broken down into a series of steps:

  1. Check the data types of the columns
##checking the data info
data.info()

2. Check for null values.

##checking for all the null values
data.isnull().sum()

The horsepower column has 6 missing values. We’ll have to study the column a bit more.

3. Check for outliers in the Horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])

Since there are a few outliers, we can use the median of the column to impute the missing values using the pandas median() method.

##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()

4. Look for the category distribution in categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)

data['Origin'].value_counts()

The two categorical columns are Cylinders and Origin, which have only a few distinct values. Looking at how the values are distributed among these categories tells us how the data is distributed:

5. Plot for correlation

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")

The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.
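The visual impression from the pair plot can be cross-checked numerically. Here is a minimal sketch on a toy dataframe with made-up values that mimic the dataset's columns; on the real data you would simply run data.corr()['MPG']:

```python
import pandas as pd

# Toy rows (made-up values) mimicking the dataset's columns.
toy = pd.DataFrame({
    'MPG':          [18.0, 25.0, 32.0, 15.0],
    'Displacement': [307.0, 98.0, 71.0, 350.0],
    'Weight':       [3504, 2046, 1836, 4209],
    'Horsepower':   [130.0, 60.0, 65.0, 165.0],
})

# Pearson correlation of each feature with the target.
corr_with_mpg = toy.corr()['MPG'].drop('MPG')
print(corr_with_mpg)

# Heavier, more powerful, larger-displacement cars travel fewer miles
# per gallon, so all three correlations come out negative here.
```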

6. Set aside the test data set

This is one of the first things we should do as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.

Stratified Sampling — We create homogeneous subgroups of the overall population called strata, and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.

In task 4, we saw how the data is distributed over each category of the Cylinders column. We use the Cylinders column to create the strata:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Checking the distribution in the training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)

Testing set:

strat_test_set["Cylinders"].value_counts() / len(strat_test_set)

You can compare these results with the output of train_test_split() to find out which one produces better splits.
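That comparison can be sketched on a toy Cylinders column with made-up counts (the real code would use the data dataframe loaded earlier). The stratified test set reproduces the overall category proportions almost exactly, while a purely random split can drift:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Toy stand-in for the dataset: only the stratification column matters here.
data = pd.DataFrame({'Cylinders': [4] * 40 + [6] * 15 + [8] * 25})

overall = data['Cylinders'].value_counts(normalize=True)

# Plain random split.
_, rand_test = train_test_split(data, test_size=0.2, random_state=42)

# Stratified split on the Cylinders column.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data['Cylinders']):
    strat_test = data.loc[test_index]

# Side-by-side category proportions.
comparison = pd.DataFrame({
    'overall': overall,
    'random': rand_test['Cylinders'].value_counts(normalize=True),
    'stratified': strat_test['Cylinders'].value_counts(normalize=True),
})
print(comparison)
```

The 'stratified' column should match 'overall' to within rounding, which is exactly the property we want from a representative test set.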

7. Checking the Origin Column

The Origin column contains information about the origin of the vehicle; its discrete values look like country codes.

To add some complication and make it more explicit, I converted these numbers to strings:

##converting integer classes to countries in Origin column
strat_train_set['Origin'] = strat_train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
strat_train_set.sample(10)

We’ll have to preprocess this categorical column by one-hot encoding these values:

##one hot encoding
strat_train_set = pd.get_dummies(strat_train_set, prefix='', prefix_sep='')
strat_train_set.head()

8. Testing for new variables — Analyze the correlation of each variable with the target variable

## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)

We found acceleration_on_power and acceleration_on_cyl to be two new variables that turned out to be more positively correlated with the target than the original variables.

This brings us to the end of the exploratory analysis. We are ready to proceed to the next step: preparing the data for Machine Learning.

