1. 数据操作（Pandas）¶

1.1. pandas 对象¶

引入 pandas 等包，DataFrame、Series 属于常用的，所以直接引入

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

DataFrame 对象：Pandas DataFrame 是一个表格型的数据结构，有行索引也有列索引

In [2]:

from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)

Out[2]:

Series 对象：类似于一维数组的对象，由一组同样 type 的数组和索引组成

In [3]:

s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'

index 对象：即 Series 和 DataFrame 的索引

In [4]:

# 获取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index

In [5]:

# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe

In [6]:

# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引

In [7]:

# union of two indexes 合并两个索引
# idx = idx.union(other)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)

Out[7]:

Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

In [8]:

i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

创建 Series 和 DataFrame

http://pandas.pydata.org/pandas-docs/stable/dsintro.html

1.2. DataFrame 入门¶

In [2]:

df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

DataFrame 的一些实用查看方法

In [10]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A    10 non-null float64
B    10 non-null float64
C    10 non-null float64
D    10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes

In [11]:

n=4
dfh = df.head(n) # 看前 n 行

In [12]:

dft = df.tail(n) # 看后 n 行

In [13]:

dfs = df.describe() # 各类统计信息

In [14]:

top_left_corner_df = df.iloc[:5, :5]

In [15]:

dfT = df.T # transpose rows and cols

DataFrame index 的一些特性

In [16]:

l = df.axes # list row and col indexes
l

Out[16]:

[RangeIndex(start=0, stop=10, step=1),
 Index([u'A', u'B', u'C', u'D'], dtype='object')]

In [17]:

(r, c) = df.axes # from above

In [18]:

s = df.dtypes # Series column data types
s

Out[18]:

A    float64
B    float64
C    float64
D    float64
dtype: object

In [19]:

b = df.empty # True for empty DataFrame
b

Out[19]:

False

In [20]:

i = df.ndim # number of axes (2)
i

Out[20]:

In [21]:

t = df.shape # (row-count, column-count)
t

Out[21]:

(10, 4)

In [22]:

(r, c) = df.shape # from above
(r, c)

Out[22]:

(10, 4)

In [23]:

i = df.size # row-count * column-count
i

Out[23]:

In [24]:

a = df.values # get a numpy array for df

实用方法

In [27]:

df = DataFrame([1, 23, 3, 5, 2])

In [28]:

dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每个值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion

/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
  app.launch_new_instance()

In [29]:

# 下面的两个方法没怎么搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

Col name: 0
First value: 1

通用函数

method	##
df = df.abs()	absolute values
df = df.add(o)	add df, Series or value
s = df.count()	non NA/null values
df = df.cummax()	(cols default axis)
df = df.cummin()	(cols default axis)
df = df.cumsum()	(cols default axis)
df = df.cumprod()	(cols default axis)
df = df.diff()	1st diff (col def axis)
df = df.div(o)	div by df, Series, value
df = df.dot(o)	matrix dot product
s = df.max()	max of axis (col def)
s = df.mean()	mean (col default axis)
s = df.median()	median (col default)
s = df.min()	min of axis (col def)
df = df.mul(o)	mul by df Series val
s = df.sum()	sum axis (cols default)

1.3. DataFrame Columns 列处理¶

column 其实也是一个 Series

In [2]:

df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list

In [31]:

lst

Out[31]:

['A', 'B', 'C', 'D']

In [32]:

label

Out[32]:

'A'

In [33]:

idx

Out[33]:

Index([u'A', u'B', u'C', u'D'], dtype='object')

column 改名

In [34]:

# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})

选择 columns, 也就是提取列

In [ ]:

s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改变排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']

用 python 特性提取列

In [4]:

s = df.A # same as s = df['A'],
# 但不能用 python 特性创建新的 columns
# df['new_col'] = df.a / df.b

添加新的 columns，添加一个 column 是极为方便的，只要能添加一组数据就行

In [6]:

df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index

In [8]:

df.head(2)

Out[8]:

	A	B	C	D	new_col	random	index_as_col
0	0.458326	-1.402187	0.446208	-0.459079	NaN	0.920599	0
1	0.366833	0.618661	-0.727332	1.152775	NaN	0.503750	1

详情参考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)

判定函数 pd.Series.where

In [17]:

# 符合 >0 条件的保持原值，其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)

数据格式 转换一列的格式时非常有用。

s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list

columns 的一些特性和方法

value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()

找出最小值和最大值的位置

In [33]:

df['B'].idxmax()
df['B'].idxmin()

Out[33]:

元素级方法

s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)

In [42]:

df = df.mul(s, axis=0) # on matched rows,相当于 * other Series 每行都与之相乘

In [47]:

df.columns.get_loc('B')

Out[47]:

In [ ]:

df = df.iloc[:, 0:2] # exclusive

获取 columns 的具体位置

In [153]:

df

Out[153]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN

下面那个好像没什么软用

In [167]:

for i in ['pop', 'state']:
    print df.columns.get_loc(i)

2
1

In [152]:

Series(df.columns)

Out[152]:

   year
  state
    pop
   debt
dtype: object

1.4. DataFrame rows 行处理¶

获取索引和标签

In [3]:

idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

改变索引或行名

In [ ]:

df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)

添加行

待补。

Drop row 删除行

df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row

查找一些行

In [23]:

# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)

In [4]:

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 这里看不太懂

对行做排序

In [ ]:

df_obj.sort(columns = ‘’)#按列名进行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用时报该函数已过时,请用sort_values
df_obj.sort_values(by=['',''])同上

索引前奏

In [122]:

df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])

In [118]:

df0

Out[118]:

	x	y
3	1	3
2	2	4
1	3	5

In [120]:

df0[1:2]

Out[120]:

	x	y
2	2	4

In [17]:

df1

Out[17]:

	0	1	2
3	1	2	3
2	3	4	5
1	6	7	8

In [18]:

df1[0]

Out[18]:

  1
  3
  6
Name: 0, dtype: int64

In [19]:

df1[0:2]

Out[19]:

	0	1	2
3	1	2	3
2	3	4	5

In [21]:

df1.ix[:, 0:2]

Out[21]:

	0	1	2
3	1	2	3
2	3	4	5
1	6	7	8

In [124]:

df0[['x','y']]

Out[124]:

	x	y
3	1	3
2	2	4
1	3	5

1.5. 索引和切片¶

整数时一般是不包含的，非整数则会包含尾巴（基于 label）

In [41]:

foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))

In [100]:

print(foo)
print '------'
print(bar)

     0
a  4.5
b  7.2
c -5.3
d  3.6
------
     0
0  4.5
1  7.2
2 -5.3
3  3.6

In [99]:

print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']

     0
a  4.5
b  7.2
------
     0
0  4.5
1  7.2
------
     0
a  4.5
b  7.2
c -5.3

ix[::, ::] 可以接受两套切片（axis=0)横向，(axis=1)列向

In [2]:

data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}

df = DataFrame(data,index=['one','two','three','four','five'],
               columns=['year','state','pop','debt'])

In [48]:

df

Out[48]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN

In [81]:

df.ix[:, 'state':'pop']

Out[81]:

	state	pop
one	Ohino	1.5
two	Ohino	1.7
three	Ohino	3.6
four	Nevada	2.4
five	Nevada	2.9

In [84]:

df.ix[1] # 切的是行，所以说 ix 默认切的行， 也就是 axis=0

Out[84]:

year      2001
state    Ohino
pop        1.7
debt       NaN
Name: two, dtype: object

非 ix

ix 可以说是 pandas 的标准切法，而没有 ix 时，情况就略复杂些了，作者说：

索引时，选取的是列
切片时，选取的是行

记住一点，如果你想看单列或少数列的索引，那么直接用 df[‘column’], 其他就

In [6]:

print(type(df['year']))
print(type(df[['year']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

In [128]:

# df['one'] # 会报错，没办法这样索引,这是行
df[['year', 'state']] # 可运行

Out[128]:

	year	state
one	2000	Ohino
two	2001	Ohino
three	2002	Ohino
four	2001	Nevada
five	2002	Nevada

In [91]:

df[0:1] # 切第一行，直接 df[0] 是会报错的。而 ix 不会。

Out[91]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN

In [129]:

df['one':'two'] # 所以他也是可以整数切，也能标签切

Out[129]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN

In [89]:

print(df.columns.tolist())
print(df.index.tolist())

['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']

In [102]:

df.loc[:, 'year':'state']

Out[102]:

	year	state
one	2000	Ohino
two	2001	Ohino
three	2002	Ohino
four	2001	Nevada
five	2002	Nevada

In [104]:

df.iloc[:, 1:2]

Out[104]:

	state
one	Ohino
two	Ohino
three	Ohino
four	Nevada
five	Nevada

.loc[label] 这是严格基于标签的索引
.iloc[inte] 这是严格基于整数位置的索引
.ix[] 更像是这两种严格方式的智能整合版。

In [111]:

# df.loc[1:2] 用 label 的去切整数，自然会出错
# df.iloc['two':'three'] 也会出错

In [109]:

df.loc['two':'three']

Out[109]:

	year	state	pop	debt
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN

In [105]:

df.iloc[1:2]

Out[105]:

	year	state	pop	debt
two	2001	Ohino	1.7	NaN

小结：

尽量写两套切片，除非是索引单列则用 df[column]
多用 iloc 和 loc, 除非你很清晰的基于标签

In [ ]:

# 待补：当标签和整数冲突时

In [ ]:

df2.info() # 说明，ix在这种非整数的整数标签上，他的切片跟loc一样，是基于标签的，而另外两个刚好相反。