A Data Scientist's Guide to Python Modules and Packages



A data science use case for modules

A Python module is simply a set of Python operations, often functions, placed in a single file with a .py extension. This file can then be imported into a Jupyter notebook, an IPython shell or another module for use in your project.

Let’s run through an example.

In the code below I have read in the CSV file I will be working with using pandas.

import pandas as pd

data = pd.read_csv('adults_data.csv')
data.head()

This dataset contains a lot of categorical features. If we were planning to use this to train a machine learning model we would first need to perform some pre-processing.
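As a quick sanity check, we can list the categorical columns and count their unique values to see which ones need special handling. This snippet is a sketch and assumes the data has already been loaded into the data dataframe as above.

# List the categorical (object-dtype) columns and how many unique values each has
categorical_cols = data.select_dtypes(include='object').columns
print(data[categorical_cols].nunique().sort_values(ascending=False))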

Having analysed this data, I have determined that I will take the following steps to preprocess it before training a model.

  • One-hot encode the following columns: workclass, marital-status, relationship, race and gender.
  • Take the most commonly occurring values, group the remaining values as ‘Others’ and one-hot encode the resulting feature. This will need to be performed for the following columns as they have a large number of unique values: education, occupation, native-country.
  • Scale the remaining numerical values.

The code we need to write to perform these tasks is quite long. These are also all tasks that we may want to perform more than once. To make our code more readable, and to make it easy to re-use, we can write a series of functions in a separate file that can be imported for use in our notebook: a module.

Writing a module

To create a module you will need to first make a new blank text file and save it with the .py extension. You can use an ordinary text editor for this, but many people use an IDE (Integrated Development Environment). IDEs provide a lot of additional functionality for writing code, including tools for compiling code, debugging and integrating with GitHub. There are many different IDEs available and it is worth experimenting with a few to find the one that works best for you. I personally prefer PyCharm, so I will be using it in this example.

To start writing the python module I am going to create a new python file.

I will name it preprocessing.py.

Let’s write our first preprocessing function in this file and test importing and using it in a Jupyter Notebook.

I have written the following code at the top of the preprocessing.py file. It is good practice to annotate the code to make it more readable. I have added some notes to the function in the code below.

import pandas as pd


def one_hot_encoder(df, column_list):
    """Takes in a dataframe and a list of columns
    for pre-processing via one hot encoding"""
    df_to_encode = df[column_list]
    df = pd.get_dummies(df_to_encode)
    return df

To import this module into a Jupyter Notebook we simply write the following.

import preprocessing as pr

IPython has a handy magic extension known as autoreload. If you add the following code before the import, any changes you make to the module file will automatically be reflected in the notebook.

%load_ext autoreload
%autoreload 2

import preprocessing as pr

Let’s test using it to preprocess some data.

cols = ['workclass', 'marital-status', 'relationship', 'race', 'gender']
one_hot_df = pr.one_hot_encoder(data, cols)

Now we will add the remaining preprocessing functions to our preprocessing.py file.

import pandas as pd
from sklearn import preprocessing


def one_hot_encoder(df, column_list):
    """Takes in a dataframe and a list of columns
    for pre-processing via one hot encoding, returns
    a dataframe of one hot encoded values"""
    df_to_encode = df[column_list]
    df = pd.get_dummies(df_to_encode)
    return df


def reduce_uniques(df, column_threshold_dict):
    """Takes in a dataframe and a dictionary of
    column name : value count threshold pairs, groups values
    occurring less often than the threshold under 'Others' and
    returns the dataframe"""
    for key, value in column_threshold_dict.items():
        counts = df[key].value_counts()
        others = set(counts[counts < value].index)
        df[key] = df[key].replace(list(others), 'Others')
    return df


def scale_data(df, column_list):
    """Takes in a dataframe and a list of column names to transform,
    returns a dataframe of scaled values"""
    df_to_scale = df[column_list]
    x = df_to_scale.values
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df_to_scale = pd.DataFrame(x_scaled, columns=df_to_scale.columns)
    return df_to_scale

If we go back to the notebook we can use all these functions to transform the data.

import pandas as pd
from sklearn import preprocessing

%load_ext autoreload
%autoreload 2

import preprocessing as pr

one_hot_list = ['workclass', 'marital-status', 'relationship', 'race', 'gender']
reduce_uniques_dict = {'education': 1000, 'occupation': 3000, 'native-country': 100}
scale_data_list = data.select_dtypes(include=['int64', 'float64']).columns

one_hot_enc_df = pr.one_hot_encoder(data, one_hot_list)
reduce_uniques_df = pr.reduce_uniques(data, reduce_uniques_dict)
reduce_uniques_df = pr.one_hot_encoder(data, reduce_uniques_dict.keys())
scale_data_df = pr.scale_data(data, scale_data_list)

final_data = pd.concat([one_hot_enc_df, reduce_uniques_df, scale_data_df], axis=1)
final_data.dtypes

We now have an entirely numerical dataset which is suitable for training a machine learning model.

A snapshot of transformed columns

Packages

When working on a machine learning project it can often be desirable or sometimes necessary to create several related modules and package them so that they can be installed and used together.

For example, in my work I am currently using a Google Cloud deployment solution for machine learning models called AI Platform. This tool requires you to package up the preprocessing, training and prediction steps of a machine learning model so that they can be uploaded and installed on the platform to deploy the final model.

A Python package is a directory containing modules, files and subdirectories. The directory needs to contain a file called __init__.py. This file indicates that the directory containing it should be treated as a package, and it specifies the modules and functions that should be exposed when the package is imported.

We are going to create a package for all the steps in our preprocessing pipeline. The contents of the __init__.py file are as follows.

from .preprocessing import one_hot_encoder
from .preprocessing import reduce_uniques
from .preprocessing import scale_data
from .makedata import preprocess_data
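With this __init__.py in place, the functions can be imported directly from the package in a notebook. The snippet below is a sketch and assumes the package directory is named preprocessing, matching the import used later on.

# Hypothetical usage, assuming the package directory is named 'preprocessing'
from preprocessing import one_hot_encoder, reduce_uniques, scale_data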

Modules within the same package can be imported for use within another module. We are going to add another module to our directory called makedata.py that uses the preprocessing.py module to execute the data transformations and then export the final dataset as a CSV file for later use.

import pandas as pd

# Import the preprocessing module from the same package
from . import preprocessing as pr


def preprocess_data(df, one_hot_list, reduce_uniques_dict, scale_data_list, output_filename):
    one_hot_enc_df = pr.one_hot_encoder(df, one_hot_list)
    reduce_uniques_df = pr.reduce_uniques(df, reduce_uniques_dict)
    reduce_uniques_df = pr.one_hot_encoder(df, reduce_uniques_dict.keys())
    scale_data_df = pr.scale_data(df, scale_data_list)
    final_data = pd.concat([one_hot_enc_df, reduce_uniques_df, scale_data_df], axis=1)
    final_data.to_csv(output_filename)

The new directory now looks like this.
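As a rough sketch, and assuming the package directory shares the preprocessing name used in the notebook import below, the layout is something like this:

preprocessing/
    __init__.py
    preprocessing.py
    makedata.py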

Now we can go back to the Jupyter Notebook and use this package to execute all the preprocessing. Our code is now very simple and clean.

import pandas as pd

%load_ext autoreload
%autoreload 2

import preprocessing as pr

data = pd.read_csv('adults_data.csv')

one_hot_list = ['workclass', 'marital-status', 'relationship', 'race', 'gender']
reduce_uniques_dict = {'education': 1000, 'occupation': 3000, 'native-country': 100}
scale_data_list = data.select_dtypes(include=['int64', 'float64']).columns

pr.preprocess_data(data, one_hot_list, reduce_uniques_dict, scale_data_list, 'final_data.csv')

In our current working directory, there will now be a new CSV file called final_data.csv which contains the preprocessed dataset. Let’s read this back in and inspect a few rows to ensure that our package has performed as expected.

data_ = pd.read_csv('final_data.csv')
data_.head()
