3 Highly Practical Operations of Pandas

Category: IT Technology · Published: 3 years ago


Sample, where, isin explained in detail with examples.

Photo by Daniel Cheung on Unsplash

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

In this post, I cover three handy operations that I use quite often: sample, where, and isin.

Sample

The sample method selects values randomly from a Series or DataFrame. It is useful when we want to draw a random sample from a distribution. Suppose we have a random variable whose values are stored in a Series or in the columns of a DataFrame. We can select part of it with the loc or iloc indexers, but then we need to specify the indices or a range. With the sample method, we can select values randomly instead. Before starting on the examples, we import numpy and pandas:

import numpy as np
import pandas as pd

Let’s create a dataframe with 3 columns and 10000 rows:

col_a = np.random.random(10000)
col_b = np.random.randint(50, size=10000)
col_c = np.random.randn(10000)

df = pd.DataFrame({
 'col_a':col_a,
 'col_b':col_b,
 'col_c':col_c
})

print(df.shape)
(10000, 3)

df.head()

We can select n values from any column:

sample1 = df.col_a.sample(n=5)
sample1
3309 0.049868
7856 0.121563
3012 0.073021
9660 0.436145
8782 0.343959
Name: col_a, dtype: float64

sample() returns both the values and their index labels. We specify the number of values with the n parameter, but we can also pass a ratio to the frac parameter. For instance, frac=0.0005 returns 5 of the 10000 values:

sample2 = df.col_c.sample(frac=0.0005)
sample2
8955 1.774066
8619 -0.218752
8612 0.170621
9523 -1.518800
597 1.151987
Name: col_c, dtype: float64
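
To make the behavior of frac a little more concrete, here is a minimal sketch. The frame and the variable names (subset, shuffled, both_allowed) are made up for this illustration. It assumes that n and frac cannot be passed together, and it shows that frac=1 is a convenient way to shuffle all rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, built only for this illustration
df = pd.DataFrame({"col_a": np.random.random(10000)})

# frac is a fraction of the total length: 0.0005 * 10000 = 5 rows
subset = df.sample(frac=0.0005, random_state=0)
print(len(subset))

# frac=1 returns every row exactly once, in shuffled order
shuffled = df.sample(frac=1, random_state=0)
print(sorted(shuffled.index) == list(df.index))

# n and frac are mutually exclusive; passing both raises ValueError
try:
    df.sample(n=5, frac=0.0005)
    both_allowed = True
except ValueError:
    both_allowed = False
print(both_allowed)
```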

By default, sampling is done without replacement, so each value can be selected only once. We can change this by setting the replace parameter to True; then a value can be selected more than once. Please note that this does not mean the sample will definitely include a value more than once. It may or may not select the same value again.

sample3 = df.col_c.sample(n=5, replace=True)
sample3
3775 0.898356
761 -0.758081
522 -0.221239
6586 -1.404669
5940 0.053480
Name: col_c, dtype: float64
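
With 10000 values and n=5, repeats are unlikely to show up. The sketch below uses a deliberately tiny, hypothetical series so the difference between the two modes becomes visible: drawing 8 values from 5 labels is only possible with replacement, and it forces at least one repeat.

```python
import numpy as np
import pandas as pd

# Hypothetical small series, used only for illustration
s = pd.Series(np.arange(5), name="values")

# Without replacement, every index label appears at most once
no_repl = s.sample(n=5, random_state=0)
print(no_repl.index.is_unique)

# With replacement, n may exceed the length of the series,
# so at least one label has to repeat (8 draws from 5 labels)
with_repl = s.sample(n=8, replace=True, random_state=0)
print(with_repl.index.has_duplicates)

# Asking for more values than exist without replacement raises ValueError
try:
    s.sample(n=8)
    oversample_ok = True
except ValueError:
    oversample_ok = False
print(oversample_ok)
```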

By default, each value has the same probability of being selected. In some cases, we may want to select randomly from a specified part of a series or dataframe. For instance, we may want to skip the first 9000 rows and select randomly from the remaining 1000 rows. To accomplish this, we can use the weights parameter.

We assign each data point a weight that indicates its probability of being selected. The weights do not have to add up to 1; if they do not, pandas normalizes them so that they do.

weights = np.zeros(10000)
weights[9000:] = 0.0001

sample4 = df.col_c.sample(n=5, weights=weights)
sample4
9232 -0.429183
9556 -1.282052
9388 -1.041973
9868 -1.809887
9032 -0.330297
Name: col_c, dtype: float64

We set the weights of the first 9000 rows to zero, so the resulting sample only includes values from index 9000 onward.
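
As a quick check of the weighting idea, the following sketch repeats the setup on a stand-in series (the name s and the seed are arbitrary choices for this illustration). Note that these weights sum to 0.1 rather than 1, and the call still works because pandas rescales them.

```python
import numpy as np
import pandas as pd

# Stand-in for df.col_c, created only for this illustration
s = pd.Series(np.random.randn(10000), name="col_c")

# Zero weight for the first 9000 rows, equal weight for the last 1000.
# The weights sum to 0.1 here; pandas normalizes them internally.
weights = np.zeros(10000)
weights[9000:] = 0.0001

picked = s.sample(n=5, weights=weights, random_state=42)

# Rows with zero weight can never be drawn
print(picked.index.min() >= 9000)
```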

To obtain reproducible samples, we can use the random_state parameter. If an integer is passed to random_state, the same sample is produced every time the code is run.

sample6 = df.col_b.sample(n=5, random_state=1)
sample6
9953 31
3850 47
4962 35
3886 16
5437 23
Name: col_b, dtype: int32
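
A small sketch of what reproducibility means in practice: on the same data, two calls with the same seed return identical samples. The series here is a stand-in created for the illustration, so its values will not match the output above.

```python
import numpy as np
import pandas as pd

# Stand-in for df.col_b, created only for this illustration
s = pd.Series(np.random.randint(50, size=10000), name="col_b")

# Same data and same seed give the same rows in the same order
first = s.sample(n=5, random_state=1)
second = s.sample(n=5, random_state=1)
print(first.equals(second))

# Without random_state, each call draws a fresh sample
unseeded = s.sample(n=5)
print(len(unseeded))
```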

We can also select a column randomly by setting the axis parameter to 1.

sample5 = df.sample(n=1, axis=1)
sample5[:5]
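
To see what sampling along axis=1 returns, here is a minimal sketch on a smaller, hypothetical frame. The result is still a DataFrame with all the rows, just with a single randomly chosen column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names, 100 rows for brevity
df = pd.DataFrame({
    "col_a": np.random.random(100),
    "col_b": np.random.randint(50, size=100),
    "col_c": np.random.randn(100),
})

# axis=1 samples column labels instead of row labels
picked = df.sample(n=1, axis=1, random_state=0)

print(picked.shape)                     # all rows, one column
print(picked.columns[0] in df.columns)  # the column comes from df
```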
