1. 数据操作(Pandas)

1.1. pandas 对象

引入 pandas 等包,DataFrame、Series 属于常用的,所以直接引入

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

DataFrame 对象:Pandas DataFrame 是一个表格型的数据结构,有行索引也有列索引

In [2]:
from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)
Out[2]:
../_images/base_01_pandas_5_0.png

Series 对象:类似于一维数组的对象,由一组同样 type 的数组和索引组成

In [3]:
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'

index 对象:即 Series 和 DataFrame 的索引

In [4]:
# 获取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index
In [5]:
# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe
In [6]:
# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引
In [7]:
# union of two indexes 合并两个索引
# idx = idx.union(other)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)
Out[7]:
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
In [8]:
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

创建 Series 和 DataFrame

http://pandas.pydata.org/pandas-docs/stable/dsintro.html

1.2. DataFrame 入门

In [2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

DataFrame 的一些实用查看方法

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A    10 non-null float64
B    10 non-null float64
C    10 non-null float64
D    10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes
In [11]:
n=4
dfh = df.head(n) # 看前 n 行
In [12]:
dft = df.tail(n) # 看后 n 行
In [13]:
dfs = df.describe() # 各类统计信息
In [14]:
top_left_corner_df = df.iloc[:5, :5]
In [15]:
dfT = df.T # transpose rows and cols

DataFrame index 的一些特性

In [16]:
l = df.axes # list row and col indexes
l
Out[16]:
[RangeIndex(start=0, stop=10, step=1),
 Index([u'A', u'B', u'C', u'D'], dtype='object')]
In [17]:
(r, c) = df.axes # from above
In [18]:
s = df.dtypes # Series column data types
s
Out[18]:
A    float64
B    float64
C    float64
D    float64
dtype: object
In [19]:
b = df.empty # True for empty DataFrame
b
Out[19]:
False
In [20]:
i = df.ndim # number of axes (2)
i
Out[20]:
2
In [21]:
t = df.shape # (row-count, column-count)
t
Out[21]:
(10, 4)
In [22]:
(r, c) = df.shape # from above
(r, c)
Out[22]:
(10, 4)
In [23]:
i = df.size # row-count * column-count
i
Out[23]:
40
In [24]:
a = df.values # get a numpy array for df

实用方法

In [27]:
df = DataFrame([1, 23, 3, 5, 2])
In [28]:
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每个值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion
/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
  app.launch_new_instance()
In [29]:
# 下面的两个方法没怎么搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')
Col name: 0
First value: 1

通用函数

method ##
df = df.abs() absolute values
df = df.add(o) add df, Series or value
s = df.count() non NA/null values
df = df.cummax() (cols default axis)
df = df.cummin() (cols default axis)
df = df.cumsum() (cols default axis)
df = df.cumprod() (cols default axis)
df = df.diff() 1st diff (col def axis)
df = df.div(o) div by df, Series, value
df = df.dot(o) matrix dot product
s = df.max() max of axis (col def)
s = df.mean() mean (col default axis)
s = df.median() median (col default)
s = df.min() min of axis (col def)
df = df.mul(o) mul by df Series val
s = df.sum() sum axis (cols default)

1.3. DataFrame Columns 列处理

column 其实也是一个 Series

In [2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list
In [31]:
lst
Out[31]:
['A', 'B', 'C', 'D']
In [32]:
label
Out[32]:
'A'
In [33]:
idx
Out[33]:
Index([u'A', u'B', u'C', u'D'], dtype='object')

column 改名

In [34]:
# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})

选择 columns, 也就是提取列

In [ ]:
s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改变排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']

python 特性提取列

In [4]:
s = df.A # same as s = df['A'],
# 但不能用 python 特性创建新的 columns
# df['new_col'] = df.a / df.b

添加新的 columns,添加一个 column 是极为方便的,只要能添加一组数据就行

In [6]:
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
In [8]:
df.head(2)
Out[8]:
A B C D new_col random index_as_col
0 0.458326 -1.402187 0.446208 -0.459079 NaN 0.920599 0
1 0.366833 0.618661 -0.727332 1.152775 NaN 0.503750 1

详情参考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)

判定函数 pd.Series.where

In [17]:
# 符合 >0 条件的保持原值,其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)

数据格式 转换一列的格式时非常有用。

s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list

columns 的一些特性和方法

value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()

找出最小值和最大值的位置

In [33]:
df['B'].idxmax()
df['B'].idxmin()
Out[33]:
7

元素级方法

s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)

In [42]:
df = df.mul(s, axis=0) # on matched rows,相当于 * other Series 每行都与之相乘
In [47]:
df.columns.get_loc('B')
Out[47]:
0
In [ ]:
df = df.iloc[:, 0:2] # exclusive

获取 columns 的具体位置

In [153]:
df
Out[153]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

下面那个好像没什么软用

In [167]:
for i in ['pop', 'state']:
    print df.columns.get_loc(i)
2
1
In [152]:
Series(df.columns)
Out[152]:
0     year
1    state
2      pop
3     debt
dtype: object

1.4. DataFrame rows 行处理

获取索引和标签

In [3]:
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

改变索引或行名

In [ ]:
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)

添加行

待补。

Drop row 删除行

df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row

查找一些行

In [23]:
# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)
In [4]:
# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 这里看不太懂

对行做排序

In [ ]:
df_obj.sort(columns = ‘’)#按列名进行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用时报该函数已过时,请用sort_values
df_obj.sort_values(by=['',''])同上

索引前奏

In [122]:
df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])
In [118]:
df0
Out[118]:
x y
3 1 3
2 2 4
1 3 5
In [120]:
df0[1:2]
Out[120]:
x y
2 2 4
In [17]:
df1
Out[17]:
0 1 2
3 1 2 3
2 3 4 5
1 6 7 8
In [18]:
df1[0]
Out[18]:
3    1
2    3
1    6
Name: 0, dtype: int64
In [19]:
df1[0:2]
Out[19]:
0 1 2
3 1 2 3
2 3 4 5
In [21]:
df1.ix[:, 0:2]
Out[21]:
0 1 2
3 1 2 3
2 3 4 5
1 6 7 8
In [124]:
df0[['x','y']]
Out[124]:
x y
3 1 3
2 2 4
1 3 5

1.5. 索引和切片

整数时一般是不包含的,非整数则会包含尾巴(基于 label)

In [41]:
foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))
In [100]:
print(foo)
print '------'
print(bar)
     0
a  4.5
b  7.2
c -5.3
d  3.6
------
     0
0  4.5
1  7.2
2 -5.3
3  3.6
In [99]:
print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']
     0
a  4.5
b  7.2
------
     0
0  4.5
1  7.2
------
     0
a  4.5
b  7.2
c -5.3

ix[::, ::] 可以接受两套切片(axis=0)横向,(axis=1)列向

In [2]:
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}

df = DataFrame(data,index=['one','two','three','four','five'],
               columns=['year','state','pop','debt'])
In [48]:
df
Out[48]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [81]:
df.ix[:, 'state':'pop']
Out[81]:
state pop
one Ohino 1.5
two Ohino 1.7
three Ohino 3.6
four Nevada 2.4
five Nevada 2.9
In [84]:
df.ix[1] # 切的是行,所以说 ix 默认切的行, 也就是 axis=0
Out[84]:
year      2001
state    Ohino
pop        1.7
debt       NaN
Name: two, dtype: object

非 ix

ix 可以说是 pandas 的标准切法,而没有 ix 时,情况就略复杂些了,作者说:

  • 索引时,选取的是列
  • 切片时,选取的是行

记住一点,如果你想看单列或少数列的索引,那么直接用 df[‘column’], 其他就

In [6]:
print(type(df['year']))
print(type(df[['year']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
In [128]:
# df['one'] # 会报错,没办法这样索引,这是行
df[['year', 'state']] # 可运行
Out[128]:
year state
one 2000 Ohino
two 2001 Ohino
three 2002 Ohino
four 2001 Nevada
five 2002 Nevada
In [91]:
df[0:1] # 切第一行,直接 df[0] 是会报错的。而 ix 不会。
Out[91]:
year state pop debt
one 2000 Ohino 1.5 NaN
In [129]:
df['one':'two'] # 所以他也是可以整数切,也能标签切
Out[129]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
In [89]:
print(df.columns.tolist())
print(df.index.tolist())
['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']
In [102]:
df.loc[:, 'year':'state']
Out[102]:
year state
one 2000 Ohino
two 2001 Ohino
three 2002 Ohino
four 2001 Nevada
five 2002 Nevada
In [104]:
df.iloc[:, 1:2]
Out[104]:
state
one Ohino
two Ohino
three Ohino
four Nevada
five Nevada
  • .loc[label] 这是严格基于标签的索引
  • .iloc[inte] 这是严格基于整数位置的索引
  • .ix[] 更像是这两种严格方式的智能整合版。
In [111]:
# df.loc[1:2] 用 label 的去切整数,自然会出错
# df.iloc['two':'three'] 也会出错
In [109]:
df.loc['two':'three']
Out[109]:
year state pop debt
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
In [105]:
df.iloc[1:2]
Out[105]:
year state pop debt
two 2001 Ohino 1.7 NaN

小结:

  • 尽量写两套切片,除非是索引单列则用 df[column]
  • 多用 ilocloc, 除非你很清晰的基于标签
In [ ]:
# 待补:当标签和整数冲突时
In [ ]:
df2.info() # 说明,ix在这种非整数的整数标签上,他的切片跟loc一样,是基于标签的,而另外两个刚好相反。