A Worked Example of Data Preprocessing with pandas

Category: Databases · Published: 7 years ago

Importing packages and loading data

import pandas as pd
import numpy as np

train_df = pd.read_csv('../datas/train.csv')  # training set
test_df = pd.read_csv('../datas/test.csv')  # test set
combine = [train_df, test_df]

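Cleaning the data

Inspecting data dimensions and types
Handling missing values
Viewing statistics for object columns
Discretizing numeric attributes
Computing the relationship between features and the target attribute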

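Inspecting data dimensions and types

View the first five rows

print(train_df.head(5))

View each column's data type and non-null count

train_df.info()  # info() prints its report directly and returns None

List all object-typed columns

print(train_df.describe(include=['O']).columns)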
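
As a small self-contained sketch of the inspection steps above, the toy frame below stands in for train.csv. Its values and the name toy_df are invented for illustration, and the string columns are forced to object dtype to mirror how read_csv has traditionally stored text. It shows that describe(include=['O']) and select_dtypes pick out the same object columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; all values are made up for illustration.
toy_df = pd.DataFrame({
    'Name': pd.Series(['Allen', 'Bonnell', 'Cumings'], dtype=object),
    'Sex': pd.Series(['male', 'female', 'female'], dtype=object),
    'Age': [35.0, np.nan, 38.0],
    'Survived': [0, 1, 1],
})

# Columns that describe() reports on when restricted to object data
object_cols = list(toy_df.describe(include=['O']).columns)

# The same selection expressed directly through dtypes
object_cols_alt = list(toy_df.select_dtypes(include=['object']).columns)

# Number of NaN values in each column
nan_counts = toy_df.isnull().sum()

print(object_cols)
print(nan_counts)
```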

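Viewing statistics for object columns

View basic statistics for the continuous numeric attributes

print(train_df.describe())

View statistics for the object attributes

print(train_df.describe(include=['O']))

Count the occurrences of each value in the single column Title

print(train_df['Title'].value_counts())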

Dropping attribute columns

train_df = train_df.drop(['Name', 'PassengerId'], axis=1)

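Handling missing values

Drop the rows that contain missing values

# df4 is the example frame built in the summary of basic pandas operations below
print(df4.dropna(axis=0, subset=['col1']))  # drop rows with NaN; subset names the columns to check
print(df4.dropna(axis=1))  # drop columns that contain NaN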
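
To make the difference between the two dropna calls concrete, here is a minimal sketch on an invented three-row frame (demo and its values are hypothetical, not from the article's data).

```python
import numpy as np
import pandas as pd

# Invented example frame: col1 and col2 each contain one NaN, col3 has none.
demo = pd.DataFrame({
    'col1': [1.0, np.nan, 3.0],
    'col2': [np.nan, 2.0, 3.0],
    'col3': ['a', 'b', 'c'],
})

# axis=0 drops rows; subset restricts the NaN check to col1 only,
# so the row whose NaN is in col2 is kept.
rows_kept = demo.dropna(axis=0, subset=['col1'])

# axis=1 drops every column that contains at least one NaN.
cols_kept = demo.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Neither call modifies demo; both return a new frame.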

Fill with a substitute value

dataset['Cabin'] = dataset['Cabin'].fillna('U')
dataset['Title'] = dataset['Title'].fillna(0)

Fill with the most frequent value

freq_port = train_df.Embarked.dropna().mode()[0]
dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
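
The snippets above use a variable named dataset that the article never defines. A plausible reading, and it is only an assumption, is that dataset is the loop variable when iterating over combine, so the same fill is applied to both the training and the test frame. The sketch below follows that reading on invented data.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for train_df and test_df.
train_demo = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan, 'Q']})
test_demo = pd.DataFrame({'Embarked': ['C', np.nan, 'S']})
combine_demo = [train_demo, test_demo]

# Most frequent value in the training data; mode() ignores NaN by default.
freq_port = train_demo['Embarked'].mode()[0]

# Assumption: "dataset" is each frame in the combined list.
for dataset in combine_demo:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

print(freq_port)
print(test_demo)
```

Taking the mode from the training frame only and reusing it for the test frame keeps test-set information out of the fill value.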

Fill with the median or the mean (use one or the other)

test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())  # median() skips NaN by default
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())  # alternative: fill with the mean
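
Which of the two to use depends on the data. As a rough illustration on invented fares with one extreme value, the median stays close to the typical fare while the mean is pulled toward the outlier.

```python
import numpy as np
import pandas as pd

# Invented fares: four known values (one very large) and one missing.
fare = pd.Series([7.25, 8.05, np.nan, 512.33, 13.0])

median_fare = fare.median()  # NaN is skipped by default
mean_fare = fare.mean()      # NaN is skipped by default

filled_with_median = fare.fillna(median_fare)
filled_with_mean = fare.fillna(mean_fare)

print(median_fare, mean_fare)
```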

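Discretizing numeric attributes and encoding object attributes as numbers

Create a new column, FareBand, that splits the continuous attribute Fare into four quantile bins

train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)

View how the binned attribute relates to the target attribute Survived

print(train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True))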
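
A minimal sketch of the same binning and grouping on eight invented rows. pd.qcut chooses bin edges from quantiles, so each of the four bands here receives the same number of rows; fare_demo and its values are hypothetical.

```python
import pandas as pd

# Invented data: eight evenly spaced fares with made-up outcomes.
fare_demo = pd.DataFrame({
    'Fare': [5, 10, 15, 20, 25, 30, 35, 40],
    'Survived': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Four quantile-based bins; with eight distinct values each bin holds two rows.
fare_demo['FareBand'] = pd.qcut(fare_demo['Fare'], 4)

# Mean of a 0/1 target per band is the survival rate in that band.
rate_by_band = fare_demo.groupby('FareBand', observed=True)['Survived'].mean()

print(fare_demo['FareBand'].value_counts().sort_index())
print(rate_by_band)
```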

Build a mapping dictionary for an object attribute

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royalty": 5, "Officer": 6}
dataset['Title'] = dataset['Title'].map(title_mapping)
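
One detail worth seeing in action: Series.map with a dictionary turns any value that is not a key into NaN, which is likely why the article also fills Title with 0. The sketch below uses an invented series containing one title ('Dr') that the mapping does not cover.

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royalty": 5, "Officer": 6}

# Invented titles; 'Dr' is deliberately absent from the mapping.
titles = pd.Series(['Mr', 'Miss', 'Dr', 'Mrs'])

mapped = titles.map(title_mapping)      # 'Dr' becomes NaN
encoded = mapped.fillna(0).astype(int)  # use 0 as the "unknown" code

print(mapped)
print(encoded)
```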

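Computing the relationship between features and the target attribute

Between an object attribute and a continuous target, group by the attribute and take the mean of the target.

Between an object attribute and a discrete target, first encode the target as numbers, then group by the attribute and take the mean, or draw a bar chart for each category.

A continuous attribute must first be cut into bins before the groupby calculation; alternatively, compute the Pearson correlation coefficient.

print(train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True))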
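
The two options for a continuous attribute can be sketched side by side. The data below is invented, and building AgeBand with pd.cut into four equal-width bins is an assumption, since the article does not show how AgeBand was created.

```python
import pandas as pd

# Invented ages and outcomes for illustration only.
age_demo = pd.DataFrame({
    'Age': [2, 15, 22, 30, 41, 58, 63, 70],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Option 1: Pearson correlation between the raw attribute and the target
# (Series.corr uses the Pearson method by default).
pearson = age_demo['Age'].corr(age_demo['Survived'])

# Option 2: cut into equal-width bands, then average the target per band.
age_demo['AgeBand'] = pd.cut(age_demo['Age'], 4)
rate_by_band = age_demo.groupby('AgeBand', observed=True)['Survived'].mean()

print(pearson)
print(rate_by_band)
```

The banded view can reveal non-monotonic patterns that a single correlation coefficient averages away.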

Summary of basic pandas operations

'''
1. Creating DataFrame objects
'''

s1 = pd.Series([1, 2, 3, np.nan, 4, 5])
s2 = pd.Series([np.nan, 1, 2, 3, 4, 5])
print(s1)

dates = pd.date_range("20130101", periods=6)
print(dates)

df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list("ABCD"))
print(df)

df2 = pd.DataFrame({
    'A': 1,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype=np.int32),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo',
})
print(df2.dtypes)

df3 = pd.DataFrame({'col1': s1, 'col2': s2})
print(df3)

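'''
2. Viewing DataFrame data
'''
print(df3.head(2))  # first few rows
print(df3.tail(3))  # last few rows
print(df.index)  # the index
df.info()  # non-null count per column; info() prints directly
print(type(df.values))  # a two-dimensional numpy array
print(df3.values)
print(df.describe())  # basic summary statistics for each column
print(df3)
print(df3.sort_values(by=['col1'], axis=0, ascending=True))  # sort by the given columns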

'''
3. Selecting data
'''
ser_1 = df3['col1']
print(type(ser_1))  # pandas.core.series.Series
print(df3[0:2])  # first two rows
print(df3.loc[df3.index[0]])  # access a row by index label
print(df3.loc[df3.index[0], ['col2']])  # a row label plus column names pin down a location
print(df3.iloc[1])  # access by position
print(df3.iloc[[1, 2], 1:2])  # access by position
print("===")
print(df3.loc[:, ['col1', 'col2']].to_numpy())  # two-dimensional numpy array; as_matrix() was removed in pandas 1.0
print(type(df3.loc[:, ['col1', 'col2']].to_numpy()))
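
Because df3 uses the default integer index, label-based and position-based access look identical there. The sketch below uses an invented frame with string labels so that loc and iloc are easier to tell apart; the frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented frame with string row labels.
sel_demo = pd.DataFrame(
    {'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)

by_label = sel_demo.loc['b', 'col2']   # row label 'b', column name 'col2'
by_position = sel_demo.iloc[1, 1]      # second row, second column
first_two = sel_demo[0:2]              # slicing with [] selects rows by position

# to_numpy() returns the underlying values as a numpy array.
as_array = sel_demo.loc[:, ['col1', 'col2']].to_numpy()

print(by_label, by_position)
print(as_array.shape)
```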

'''
4. Boolean indexing: filtering data
'''
print(df3[df3.col1 > 2])
df4 = df3.copy()
df4['col3'] = pd.Series(['one', 'two', 'two', 'three', 'one', 'two'])
print(df4)
print(df4[df4['col3'].isin(['one', 'two'])])
df4.loc[:, 'col3'] = "five"
print(df4)
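
The two filters above can also be combined into a single mask. This is a small sketch on invented data; note that each condition needs its own parentheses when joined with & or |, because those operators bind more tightly than comparisons.

```python
import pandas as pd

# Invented frame for illustration.
mask_demo = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col3': ['one', 'two', 'three', 'two'],
})

# Keep rows where col1 > 2 AND col3 is one of the listed values.
mask = (mask_demo['col1'] > 2) & mask_demo['col3'].isin(['one', 'two'])
filtered = mask_demo[mask]

print(mask.tolist())
print(filtered)
```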

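'''
5. Handling missing values; pandas represents missing values as NaN
'''
print(pd.isnull(df4))
print(df4.dropna(axis=0, subset=['col1']))  # drop rows with NaN; subset names the columns to check
print(df4.dropna(axis=1))  # drop columns that contain NaN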

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 码农网.
