1. 数据操作(Pandas)¶
1.1. pandas 对象¶
引入 pandas 等包,DataFrame、Series 属于常用的,所以直接引入
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
DataFrame 对象:Pandas DataFrame 是一个表格型的数据结构,有行索引也有列索引
In [2]:
from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)
Out[2]:

Series 对象:类似于一维数组的对象,由一组同样 type 的数组和索引组成
In [3]:
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'
index 对象:即 Series 和 DataFrame 的索引
In [4]:
# 获取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index
In [5]:
# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe
In [6]:
# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引
In [7]:
# union of two indexes 合并两个索引
# idx = idx.union(other)
idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)
Out[7]:
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
In [8]:
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label
创建 Series 和 DataFrame
1.2. DataFrame 入门¶
In [2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
DataFrame 的一些实用查看方法
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A 10 non-null float64
B 10 non-null float64
C 10 non-null float64
D 10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes
In [11]:
n=4
dfh = df.head(n) # 看前 n 行
In [12]:
dft = df.tail(n) # 看后 n 行
In [13]:
dfs = df.describe() # 各类统计信息
In [14]:
top_left_corner_df = df.iloc[:5, :5]
In [15]:
dfT = df.T # transpose rows and cols
DataFrame index 的一些特性
In [16]:
l = df.axes # list row and col indexes
l
Out[16]:
[RangeIndex(start=0, stop=10, step=1),
Index([u'A', u'B', u'C', u'D'], dtype='object')]
In [17]:
(r, c) = df.axes # from above
In [18]:
s = df.dtypes # Series column data types
s
Out[18]:
A float64
B float64
C float64
D float64
dtype: object
In [19]:
b = df.empty # True for empty DataFrame
b
Out[19]:
False
In [20]:
i = df.ndim # number of axes (2)
i
Out[20]:
2
In [21]:
t = df.shape # (row-count, column-count)
t
Out[21]:
(10, 4)
In [22]:
(r, c) = df.shape # from above
(r, c)
Out[22]:
(10, 4)
In [23]:
i = df.size # row-count * column-count
i
Out[23]:
40
In [24]:
a = df.values # get a numpy array for df
实用方法
In [27]:
df = DataFrame([1, 23, 3, 5, 2])
In [28]:
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每个值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion
/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
app.launch_new_instance()
In [29]:
# 下面的两个方法没怎么搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
print('Col name: ' + str(name))
print('First value: ' +
str(series.iat[0]) + '\n')
Col name: 0
First value: 1
通用函数
method | ## |
---|---|
df = df.abs() | absolute values |
df = df.add(o) | add df, Series or value |
s = df.count() | non NA/null values |
df = df.cummax() | (cols default axis) |
df = df.cummin() | (cols default axis) |
df = df.cumsum() | (cols default axis) |
df = df.cumprod() | (cols default axis) |
df = df.diff() | 1st diff (col def axis) |
df = df.div(o) | div by df, Series, value |
df = df.dot(o) | matrix dot product |
s = df.max() | max of axis (col def) |
s = df.mean() | mean (col default axis) |
s = df.median() | median (col default) |
s = df.min() | min of axis (col def) |
df = df.mul(o) | mul by df Series val |
s = df.sum() | sum axis (cols default) |
1.3. DataFrame Columns 列处理¶
column 其实也是一个 Series
In [2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list
In [31]:
lst
Out[31]:
['A', 'B', 'C', 'D']
In [32]:
label
Out[32]:
'A'
In [33]:
idx
Out[33]:
Index([u'A', u'B', u'C', u'D'], dtype='object')
column 改名
In [34]:
# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})
选择 columns, 也就是提取列
In [ ]:
s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改变排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']
用 python 特性提取列
In [4]:
s = df.A # same as s = df['A'],
# 但不能用 python 特性创建新的 columns
# df['new_col'] = df.a / df.b
添加新的 columns,添加一个 column 是极为方便的,只要能添加一组数据就行
In [6]:
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
In [8]:
df.head(2)
Out[8]:
A | B | C | D | new_col | random | index_as_col | |
---|---|---|---|---|---|---|---|
0 | 0.458326 | -1.402187 | 0.446208 | -0.459079 | NaN | 0.920599 | 0 |
1 | 0.366833 | 0.618661 | -0.727332 | 1.152775 | NaN | 0.503750 | 1 |
详情参考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)
判定函数 pd.Series.where
In [17]:
# 符合 >0 条件的保持原值,其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)
数据格式 转换一列的格式时非常有用。
s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list
columns 的一些特性和方法
value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()
找出最小值和最大值的位置
In [33]:
df['B'].idxmax()
df['B'].idxmin()
Out[33]:
7
元素级方法
s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)
In [42]:
df = df.mul(s, axis=0) # on matched rows,相当于 * other Series 每行都与之相乘
In [47]:
df.columns.get_loc('B')
Out[47]:
0
In [ ]:
df = df.iloc[:, 0:2] # exclusive
获取 columns 的具体位置
In [153]:
df
Out[153]:
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
下面那个好像没什么软用
In [167]:
for i in ['pop', 'state']:
print df.columns.get_loc(i)
2
1
In [152]:
Series(df.columns)
Out[152]:
0 year
1 state
2 pop
3 debt
dtype: object
1.4. DataFrame rows 行处理¶
获取索引和标签
In [3]:
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list
改变索引或行名
In [ ]:
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)
添加行
待补。
Drop row 删除行
df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row
查找一些行
In [23]:
# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)
In [4]:
# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 这里看不太懂
对行做排序
In [ ]:
df_obj.sort(columns = ‘’)#按列名进行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用时报该函数已过时,请用sort_values
df_obj.sort_values(by=['',''])同上
索引前奏
In [122]:
df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])
In [118]:
df0
Out[118]:
x | y | |
---|---|---|
3 | 1 | 3 |
2 | 2 | 4 |
1 | 3 | 5 |
In [120]:
df0[1:2]
Out[120]:
x | y | |
---|---|---|
2 | 2 | 4 |
In [17]:
df1
Out[17]:
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
1 | 6 | 7 | 8 |
In [18]:
df1[0]
Out[18]:
3 1
2 3
1 6
Name: 0, dtype: int64
In [19]:
df1[0:2]
Out[19]:
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
In [21]:
df1.ix[:, 0:2]
Out[21]:
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
1 | 6 | 7 | 8 |
In [124]:
df0[['x','y']]
Out[124]:
x | y | |
---|---|---|
3 | 1 | 3 |
2 | 2 | 4 |
1 | 3 | 5 |
1.5. 索引和切片¶
整数时一般是不包含的,非整数则会包含尾巴(基于 label)
In [41]:
foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))
In [100]:
print(foo)
print '------'
print(bar)
0
a 4.5
b 7.2
c -5.3
d 3.6
------
0
0 4.5
1 7.2
2 -5.3
3 3.6
In [99]:
print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']
0
a 4.5
b 7.2
------
0
0 4.5
1 7.2
------
0
a 4.5
b 7.2
c -5.3
ix[::, ::]
可以接受两套切片(axis=0)横向,(axis=1)列向
In [2]:
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
df = DataFrame(data,index=['one','two','three','four','five'],
columns=['year','state','pop','debt'])
In [48]:
df
Out[48]:
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
In [81]:
df.ix[:, 'state':'pop']
Out[81]:
state | pop | |
---|---|---|
one | Ohino | 1.5 |
two | Ohino | 1.7 |
three | Ohino | 3.6 |
four | Nevada | 2.4 |
five | Nevada | 2.9 |
In [84]:
df.ix[1] # 切的是行,所以说 ix 默认切的行, 也就是 axis=0
Out[84]:
year 2001
state Ohino
pop 1.7
debt NaN
Name: two, dtype: object
非 ix
ix 可以说是 pandas 的标准切法,而没有 ix 时,情况就略复杂些了,作者说:
- 索引时,选取的是列
- 切片时,选取的是行
记住一点,如果你想看单列或少数列的索引,那么直接用 df[‘column’], 其他就
In [6]:
print(type(df['year']))
print(type(df[['year']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
In [128]:
# df['one'] # 会报错,没办法这样索引,这是行
df[['year', 'state']] # 可运行
Out[128]:
year | state | |
---|---|---|
one | 2000 | Ohino |
two | 2001 | Ohino |
three | 2002 | Ohino |
four | 2001 | Nevada |
five | 2002 | Nevada |
In [91]:
df[0:1] # 切第一行,直接 df[0] 是会报错的。而 ix 不会。
Out[91]:
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
In [129]:
df['one':'two'] # 所以他也是可以整数切,也能标签切
Out[129]:
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
In [89]:
print(df.columns.tolist())
print(df.index.tolist())
['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']
In [102]:
df.loc[:, 'year':'state']
Out[102]:
year | state | |
---|---|---|
one | 2000 | Ohino |
two | 2001 | Ohino |
three | 2002 | Ohino |
four | 2001 | Nevada |
five | 2002 | Nevada |
In [104]:
df.iloc[:, 1:2]
Out[104]:
state | |
---|---|
one | Ohino |
two | Ohino |
three | Ohino |
four | Nevada |
five | Nevada |
.loc[label]
这是严格基于标签的索引.iloc[inte]
这是严格基于整数位置的索引.ix[]
更像是这两种严格方式的智能整合版。
In [111]:
# df.loc[1:2] 用 label 的去切整数,自然会出错
# df.iloc['two':'three'] 也会出错
In [109]:
df.loc['two':'three']
Out[109]:
year | state | pop | debt | |
---|---|---|---|---|
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
In [105]:
df.iloc[1:2]
Out[105]:
year | state | pop | debt | |
---|---|---|---|---|
two | 2001 | Ohino | 1.7 | NaN |
小结:
- 尽量写两套切片,除非是索引单列则用
df[column]
- 多用
iloc
和loc
, 除非你很清晰的基于标签
In [ ]:
# 待补:当标签和整数冲突时
In [ ]:
df2.info() # 说明,ix在这种非整数的整数标签上,他的切片跟loc一样,是基于标签的,而另外两个刚好相反。