5.1. Pandas in 10 minutes
# To embed plots in the notebook, it suffices to run
%matplotlib inline
from jyquickhelper import add_notebook_menu
add_notebook_menu()
The following libraries are typically imported:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
5.1.1. Creating objects
Create a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Create a DataFrame by passing a numpy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.531261 | 1.298587 | 1.879986 | -0.510287 |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | -1.090445 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | -0.254736 |
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 0.188978 |
2013-01-05 | -0.828470 | -0.627016 | 1.664534 | -1.728745 |
2013-01-06 | 0.800437 | -0.675739 | -1.049957 | 0.052375 |
A DataFrame can also be created by passing a dictionary of objects that can each be converted to something Series-like:
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
Each column has its own dtype:
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Display the first and last rows:
print(df.head())
print(df.tail())
A B C D
2013-01-01 0.531261 1.298587 1.879986 -0.510287
2013-01-02 0.113870 1.119906 -0.071703 -1.090445
2013-01-03 0.807842 -0.426218 -0.328528 -0.254736
2013-01-04 1.533555 -0.750685 -0.051212 0.188978
2013-01-05 -0.828470 -0.627016 1.664534 -1.728745
A B C D
2013-01-02 0.113870 1.119906 -0.071703 -1.090445
2013-01-03 0.807842 -0.426218 -0.328528 -0.254736
2013-01-04 1.533555 -0.750685 -0.051212 0.188978
2013-01-05 -0.828470 -0.627016 1.664534 -1.728745
2013-01-06 0.800437 -0.675739 -1.049957 0.052375
Display the index, the columns, and the underlying numpy data:
print(df.index)
print(df.columns)
print(df.values)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.53126106 1.29858738 1.87998645 -0.51028693]
[ 0.11386978 1.11990561 -0.07170277 -1.0904453 ]
[ 0.80784175 -0.42621808 -0.32852846 -0.25473621]
[ 1.53355516 -0.75068514 -0.05121214 0.18897817]
[-0.82846957 -0.62701575 1.66453418 -1.72874545]
[ 0.80043734 -0.67573851 -1.049957 0.0523755 ]]
The describe method shows a quick statistical summary of the data:
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.493083 | -0.010194 | 0.340520 | -0.557143 |
std | 0.795975 | 0.952350 | 1.168538 | 0.731972 |
min | -0.828470 | -0.750685 | -1.049957 | -1.728745 |
25% | 0.218218 | -0.663558 | -0.264322 | -0.945406 |
50% | 0.665849 | -0.526617 | -0.061457 | -0.382512 |
75% | 0.805991 | 0.733375 | 1.235598 | -0.024402 |
max | 1.533555 | 1.298587 | 1.879986 | 0.188978 |
You can transpose the data, and sort by an axis or by values:
print(df.T)
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.531261 0.113870 0.807842 1.533555 -0.828470 0.800437
B 1.298587 1.119906 -0.426218 -0.750685 -0.627016 -0.675739
C 1.879986 -0.071703 -0.328528 -0.051212 1.664534 -1.049957
D -0.510287 -1.090445 -0.254736 0.188978 -1.728745 0.052375
df.sort_index(axis=1, ascending=False)
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | -0.510287 | 1.879986 | 1.298587 | 0.531261 |
2013-01-02 | -1.090445 | -0.071703 | 1.119906 | 0.113870 |
2013-01-03 | -0.254736 | -0.328528 | -0.426218 | 0.807842 |
2013-01-04 | 0.188978 | -0.051212 | -0.750685 | 1.533555 |
2013-01-05 | -1.728745 | 1.664534 | -0.627016 | -0.828470 |
2013-01-06 | 0.052375 | -1.049957 | -0.675739 | 0.800437 |
df.sort_values(by='B')
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 0.188978 |
2013-01-06 | 0.800437 | -0.675739 | -1.049957 | 0.052375 |
2013-01-05 | -0.828470 | -0.627016 | 1.664534 | -1.728745 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | -0.254736 |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | -1.090445 |
2013-01-01 | 0.531261 | 1.298587 | 1.879986 | -0.510287 |
5.1.2. Selecting data
5.1.2.1. Getting
Select a single column, which yields a Series (equivalent to df.A); slicing with [] selects rows:
print(df['A'])
print(df[0:3])
print(df['20130102':'20130104'])
2013-01-01 0.531261
2013-01-02 0.113870
2013-01-03 0.807842
2013-01-04 1.533555
2013-01-05 -0.828470
2013-01-06 0.800437
Freq: D, Name: A, dtype: float64
A B C D
2013-01-01 0.531261 1.298587 1.879986 -0.510287
2013-01-02 0.113870 1.119906 -0.071703 -1.090445
2013-01-03 0.807842 -0.426218 -0.328528 -0.254736
A B C D
2013-01-02 0.113870 1.119906 -0.071703 -1.090445
2013-01-03 0.807842 -0.426218 -0.328528 -0.254736
2013-01-04 1.533555 -0.750685 -0.051212 0.188978
5.1.2.2. Selection by label
Using a label:
df.loc[dates[0]]
A 0.531261
B 1.298587
C 1.879986
D -0.510287
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on multiple axes by label:
df.loc[:,['A','B']]
A | B | |
---|---|---|
2013-01-01 | 0.531261 | 1.298587 |
2013-01-02 | 0.113870 | 1.119906 |
2013-01-03 | 0.807842 | -0.426218 |
2013-01-04 | 1.533555 | -0.750685 |
2013-01-05 | -0.828470 | -0.627016 |
2013-01-06 | 0.800437 | -0.675739 |
With label slicing, both endpoints are INCLUDED:
df.loc['20130102':'20130104', ['A','B']]
A | B | |
---|---|---|
2013-01-02 | 0.113870 | 1.119906 |
2013-01-03 | 0.807842 | -0.426218 |
2013-01-04 | 1.533555 | -0.750685 |
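By contrast, position-based slicing follows the usual Python convention and excludes the endpoint. A quick sketch of the difference (variable names here are only for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

by_label = df.loc['20130102':'20130104']  # both endpoints included -> 3 rows
by_position = df.iloc[1:4]                # end excluded -> rows 1, 2, 3

assert len(by_label) == 3
assert by_label.equals(by_position)  # same rows, different slicing conventions
```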
Getting a scalar value:
df.loc[dates[0],'A']
0.5312610562902266
Faster access to a scalar (equivalent to the previous method):
df.at[dates[0],'A']
0.5312610562902266
5.1.2.3. Selection by position
By integer position:
df.iloc[3]
A 1.533555
B -0.750685
C -0.051212
D 0.188978
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, similar to numpy:
df.iloc[3:5,0:2]
A | B | |
---|---|---|
2013-01-04 | 1.533555 | -0.750685 |
2013-01-05 | -0.828470 | -0.627016 |
By lists of integer positions:
df.iloc[[1,2,4],[0,2]]
A | C | |
---|---|---|
2013-01-02 | 0.113870 | -0.071703 |
2013-01-03 | 0.807842 | -0.328528 |
2013-01-05 | -0.828470 | 1.664534 |
Slicing rows explicitly:
df.iloc[1:3,:]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | -1.090445 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | -0.254736 |
Getting a value explicitly:
df.iloc[1,1]
1.1199056145150672
Fast access to a scalar:
df.iat[1,1]
1.1199056145150672
5.1.2.4. Boolean indexing
Using a single column's values to select rows:
df[df.A > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.531261 | 1.298587 | 1.879986 | -0.510287 |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | -1.090445 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | -0.254736 |
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 0.188978 |
2013-01-06 | 0.800437 | -0.675739 | -1.049957 | 0.052375 |
Selecting values with the where operator:
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.531261 | 1.298587 | 1.879986 | NaN |
2013-01-02 | 0.113870 | 1.119906 | NaN | NaN |
2013-01-03 | 0.807842 | NaN | NaN | NaN |
2013-01-04 | 1.533555 | NaN | NaN | 0.188978 |
2013-01-05 | NaN | NaN | 1.664534 | NaN |
2013-01-06 | 0.800437 | NaN | NaN | 0.052375 |
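A boolean mask can also be used for assignment, for instance to flip the sign of every positive entry (illustrative sketch on a throwaway frame):

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
dfx[dfx > 0] = -dfx  # set values wherever the mask is True
assert (dfx <= 0).all().all()  # every entry is now non-positive
```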
Use the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]
A B C D E
2013-01-01 0.531261 1.298587 1.879986 -0.510287 one
2013-01-02 0.113870 1.119906 -0.071703 -1.090445 one
2013-01-03 0.807842 -0.426218 -0.328528 -0.254736 two
2013-01-04 1.533555 -0.750685 -0.051212 0.188978 three
2013-01-05 -0.828470 -0.627016 1.664534 -1.728745 four
2013-01-06 0.800437 -0.675739 -1.049957 0.052375 three
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | -0.254736 | two |
2013-01-05 | -0.828470 | -0.627016 | 1.664534 | -1.728745 | four |
5.1.2.5. Adding / modifying values and columns
Adding a new column automatically aligns the data by index.
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
print(s1)
df['F'] = s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
Setting a value by label:
df.at[dates[0],'A'] = 0
df.loc[dates[0],'A']
0.0
Setting a value by position:
df.iat[0,1] = 0
Setting values by assigning a numpy array:
df.loc[:,'D'] = np.array([5] * len(df))
df
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 1.879986 | 5 | NaN |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | 5 | 1.0 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | 5 | 2.0 |
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 5 | 3.0 |
2013-01-05 | -0.828470 | -0.627016 | 1.664534 | 5 | 4.0 |
2013-01-06 | 0.800437 | -0.675739 | -1.049957 | 5 | 5.0 |
5.1.2.6. Handling missing data
Pandas uses np.nan to represent missing data. By default it is excluded from computations.
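A quick sketch of what "excluded from computations" means in practice: NaN propagates through raw numpy arithmetic, while pandas reductions skip it unless told otherwise:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

assert np.isnan(np.array([1.0, np.nan, 3.0]).mean())  # numpy: NaN propagates
assert s.mean() == 2.0                 # pandas skips NaN (skipna=True default)
assert np.isnan(s.mean(skipna=False))  # unless skipping is disabled
```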
Reindexing allows you to change/add/delete the index on an axis; it returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 1.879986 | 5 | NaN | 1.0 |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | 5 | 1.0 | 1.0 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | 5 | 2.0 | NaN |
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 5 | 3.0 | NaN |
To drop any rows containing NaN:
df1.dropna(how='any')
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-02 | 0.11387 | 1.119906 | -0.071703 | 5 | 1.0 | 1.0 |
Filling missing values:
df1.fillna(value=5)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 1.879986 | 5 | 5.0 | 1.0 |
2013-01-02 | 0.113870 | 1.119906 | -0.071703 | 5 | 1.0 | 1.0 |
2013-01-03 | 0.807842 | -0.426218 | -0.328528 | 5 | 2.0 | 5.0 |
2013-01-04 | 1.533555 | -0.750685 | -0.051212 | 5 | 3.0 | 5.0 |
Getting the boolean mask of where values are NaN:
pd.isnull(df1)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | False |
2013-01-02 | False | False | False | False | False | False |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
5.1.3. Operations
5.1.3.1. Stats
Operations in general exclude missing data.
print(df.mean())
print(df.mean(1))  # same operation on the other axis
A 0.404539
B -0.226625
C 0.340520
D 5.000000
F 3.000000
dtype: float64
2013-01-01 1.719997
2013-01-02 1.432415
2013-01-03 1.410619
2013-01-04 1.746332
2013-01-05 1.841810
2013-01-06 1.814948
Freq: D, dtype: float64
Operating with objects of different dimensionality that need alignment; pandas automatically broadcasts along the specified dimension:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
print(s)
df.sub(s, axis='index')
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | -0.192158 | -1.426218 | -1.328528 | 4.0 | 1.0 |
2013-01-04 | -1.466445 | -3.750685 | -3.051212 | 2.0 | 0.0 |
2013-01-05 | -5.828470 | -5.627016 | -3.335466 | 0.0 | -1.0 |
2013-01-06 | NaN | NaN | NaN | NaN | NaN |
help(df.sub)
Help on method sub in module pandas.core.ops:
sub(other, axis='columns', level=None, fill_value=None) method of pandas.core.frame.DataFrame instance
Get Subtraction of dataframe and other, element-wise (binary operator `sub`).
Equivalent to ``dataframe - other``, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, `rsub`.
Among flexible wrappers (`add`, `sub`, `mul`, `div`, `mod`, `pow`) to
arithmetic operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`.
Parameters
----------
other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or 'index') or columns
(1 or 'columns'). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Returns
-------
DataFrame
Result of the arithmetic operation.
See Also
--------
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
-----
Mismatched indices will be unioned together.
Examples
--------
>>> df = pd.DataFrame({'angles': [0, 3, 4],
... 'degrees': [360, 180, 360]},
... index=['circle', 'triangle', 'rectangle'])
>>> df
angles degrees
circle 0 360
triangle 3 180
rectangle 4 360
Add a scalar with operator version which return the same
results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df.sub([1, 2], axis='columns')
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
... axis='index')
angles degrees
circle -1 359
triangle 2 179
rectangle 3 359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
... index=['circle', 'triangle', 'rectangle'])
>>> other
angles
circle 0
triangle 3
rectangle 4
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
>>> df.mul(other, fill_value=0)
angles degrees
circle 0 0.0
triangle 9 0.0
rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
... 'degrees': [360, 180, 360, 360, 540, 720]},
... index=[['A', 'A', 'A', 'B', 'B', 'B'],
... ['circle', 'triangle', 'rectangle',
... 'square', 'pentagon', 'hexagon']])
>>> df_multindex
angles degrees
A circle 0 360
triangle 3 180
rectangle 4 360
B square 4 360
pentagon 5 540
hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0)
angles degrees
A circle NaN 1.0
triangle 1.0 1.0
rectangle 1.0 1.0
B square 0.0 0.0
pentagon 0.0 0.0
hexagon 0.0 0.0
5.1.3.2. Apply
Applying functions to the data:
df.apply(np.cumsum)
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 1.879986 | 5 | NaN |
2013-01-02 | 0.113870 | 1.119906 | 1.808284 | 10 | 1.0 |
2013-01-03 | 0.921712 | 0.693688 | 1.479755 | 15 | 3.0 |
2013-01-04 | 2.455267 | -0.056998 | 1.428543 | 20 | 6.0 |
2013-01-05 | 1.626797 | -0.684013 | 3.093077 | 25 | 10.0 |
2013-01-06 | 2.427234 | -1.359752 | 2.043120 | 30 | 15.0 |
df.apply((lambda x: x.max() - x.min()))
A 2.362025
B 1.870591
C 2.929943
D 0.000000
F 4.000000
dtype: float64
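apply works column-wise by default; passing axis=1 applies the function to each row instead. A small deterministic sketch (names here are only for the example):

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# axis=0 (default): the function receives each column as a Series
col_range = dfa.apply(lambda x: x.max() - x.min())
# axis=1: the function receives each row as a Series
row_range = dfa.apply(lambda x: x.max() - x.min(), axis=1)

assert list(col_range) == [9, 9, 9]      # range within each column
assert list(row_range) == [2, 2, 2, 2]   # range within each row
```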
5.1.3.3. Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())
0 2
1 3
2 1
3 5
4 3
5 6
6 5
7 1
8 4
9 2
dtype: int64
1 2
2 2
3 2
5 2
4 1
6 1
dtype: int64
5.1.3.4. String methods
Series are equipped with a set of string processing methods, accessible via the str attribute, that make it easy to operate on each element of the array. Regular expressions are used frequently.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
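The str accessor also understands regular expressions, for instance with contains and replace (illustrative sketch reusing the same Series):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# str.contains with a regex: strings starting with 'A' or 'a'
mask = s.str.contains(r'^[Aa]', na=False)  # na=False treats NaN as no match
assert list(s[mask]) == ['A', 'Aaba']

# str.replace with a regex: strip every lowercase vowel
assert s.str.replace(r'[aeiou]', '', regex=True)[8] == 'ct'  # 'cat' -> 'ct'
```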
5.1.4. Combining data
5.1.4.1. Concatenation
Pandas provides facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join / merge operations.
Pandas objects can be concatenated with concat():
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
0 1 2 3
0 -0.145638 -0.339968 -0.120956 1.200960
1 1.710535 1.260141 1.108210 0.005057
2 0.339810 -0.467186 -0.178729 -1.128419
3 -0.321449 1.771127 -0.441905 0.424739
4 0.059993 0.373694 -0.521661 0.706102
5 0.005170 0.400061 0.947412 1.130690
6 -0.964596 -0.459291 -0.479300 -2.245681
7 -0.920954 0.787782 0.012169 1.357360
8 0.032657 -0.366058 0.238168 1.674530
9 1.156401 1.466012 0.882983 0.726695
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.145638 | -0.339968 | -0.120956 | 1.200960 |
1 | 1.710535 | 1.260141 | 1.108210 | 0.005057 |
2 | 0.339810 | -0.467186 | -0.178729 | -1.128419 |
3 | -0.321449 | 1.771127 | -0.441905 | 0.424739 |
4 | 0.059993 | 0.373694 | -0.521661 | 0.706102 |
5 | 0.005170 | 0.400061 | 0.947412 | 1.130690 |
6 | -0.964596 | -0.459291 | -0.479300 | -2.245681 |
7 | -0.920954 | 0.787782 | 0.012169 | 1.357360 |
8 | 0.032657 | -0.366058 | 0.238168 | 1.674530 |
9 | 1.156401 | 1.466012 | 0.882983 | 0.726695 |
5.1.4.2. Joins
SQL-style merge operations are supported.
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
print(pd.merge(left, right, on='key'))
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
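When the keys differ between the two frames, the how parameter selects the join type, as in SQL (illustrative sketch):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # intersection of keys
outer = pd.merge(left, right, on='key', how='outer')  # union, NaN where absent

assert list(inner['key']) == ['foo']
assert sorted(outer['key']) == ['bar', 'baz', 'foo']
```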
5.1.4.3. Append
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | -0.695328 | -0.995854 | 0.908458 | 1.678616 |
1 | -0.521819 | 0.464408 | 0.396498 | 0.674066 |
2 | 0.635238 | 0.909905 | -0.714702 | 1.185353 |
3 | -0.243663 | -0.329383 | 1.602974 | -0.874731 |
4 | 0.280138 | -1.063717 | 0.305626 | 0.595742 |
5 | 1.382979 | -0.800939 | 1.230337 | -0.922623 |
6 | -0.041754 | 1.021744 | 0.737046 | 0.445048 |
7 | -0.537346 | -1.302530 | -0.196090 | 1.214839 |
s = df.iloc[3]
df.append(s, ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | -0.695328 | -0.995854 | 0.908458 | 1.678616 |
1 | -0.521819 | 0.464408 | 0.396498 | 0.674066 |
2 | 0.635238 | 0.909905 | -0.714702 | 1.185353 |
3 | -0.243663 | -0.329383 | 1.602974 | -0.874731 |
4 | 0.280138 | -1.063717 | 0.305626 | 0.595742 |
5 | 1.382979 | -0.800939 | 1.230337 | -0.922623 |
6 | -0.041754 | 1.021744 | 0.737046 | 0.445048 |
7 | -0.537346 | -1.302530 | -0.196090 | 1.214839 |
8 | -0.243663 | -0.329383 | 1.602974 | -0.874731 |
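Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat produces the same result (sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# equivalent to the old df.append(s, ignore_index=True)
out = pd.concat([df, s.to_frame().T], ignore_index=True)

assert len(out) == 9
assert (out.iloc[8] == out.iloc[3]).all()  # appended row duplicates row 3
```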
5.1.4.4. Grouping
"Group by" refers to a process involving the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.095588 | 0.637490 |
1 | bar | one | 0.567613 | -0.555713 |
2 | foo | two | -0.316915 | 0.437783 |
3 | bar | three | -1.247033 | 0.435075 |
4 | foo | two | 0.233856 | -1.218486 |
5 | bar | two | -0.904002 | 0.940587 |
6 | foo | one | 0.713043 | -0.154975 |
7 | foo | three | 0.324294 | 0.339278 |
Grouping and then applying sum to the resulting groups:
df.groupby('A').sum()
C | D | |
---|---|---|
A | ||
bar | -1.583423 | 0.819950 |
foo | 2.049867 | 0.041091 |
Grouping by multiple columns forms a hierarchical index:
df.groupby(['A','B']).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.567613 | -0.555713 |
three | -1.247033 | 0.435075 | |
two | -0.904002 | 0.940587 | |
foo | one | 1.808631 | 0.482516 |
three | 0.324294 | 0.339278 | |
two | -0.083059 | -0.780702 |
5.1.4.5. Reshaping
Stack
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 0.904984 | 2.099247 |
two | 1.172413 | -0.388574 | |
baz | one | -0.984071 | -1.504685 |
two | -1.141553 | 0.203926 |
The stack() method "compresses" a level of the DataFrame's columns into the index:
stacked = df2.stack()
stacked
first second
bar one A 0.904984
B 2.099247
two A 1.172413
B -0.388574
baz one A -0.984071
B -1.504685
two A -1.141553
B 0.203926
dtype: float64
With a "stacked" DataFrame or Series, the inverse operation is unstack(), which by default unstacks the last level.
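A small sketch of the round trip (variable names here are only for the example):

```python
import numpy as np
import pandas as pd

tuples = list(zip(['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
dfm = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['A', 'B'])

stacked = dfm.stack()         # columns A/B become the innermost index level
restored = stacked.unstack()  # inverse: last index level goes back to columns

assert restored.shape == dfm.shape
assert np.allclose(restored.values, dfm.values)

# unstack(level) chooses which level to pivot back into columns
assert stacked.unstack(0).shape == (4, 2)  # 'first' level becomes the columns
```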