5.1. Pandas in 10 minutes

# To display plots inline in the notebook, simply run
%matplotlib inline

from jyquickhelper import add_notebook_menu
add_notebook_menu()

We generally import the following libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

5.1.1. Creating objects

We create a Series by passing it a list of values, letting pandas create a default integer index

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
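
As a side note (not in the original notebook), a small sketch showing that the index can also be given explicitly instead of letting pandas generate 0..N-1:

# hypothetical example: explicit string labels as the index
s2 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
s2['b']  # label-based access returns 3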

We create a DataFrame by passing a numpy array, with a datetime index and labeled columns

dates = pd.date_range('20130101', periods=6)
print(dates)

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01  0.531261  1.298587  1.879986 -0.510287
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-04  1.533555 -0.750685 -0.051212  0.188978
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745
2013-01-06  0.800437 -0.675739 -1.049957  0.052375

We can also create a DataFrame by passing a dict of objects that can be converted into something series-like.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Each column has its own dtype

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

We can display the first and last rows

print(df.head())
print(df.tail())
                   A         B         C         D
2013-01-01  0.531261  1.298587  1.879986 -0.510287
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-04  1.533555 -0.750685 -0.051212  0.188978
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745
                   A         B         C         D
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-04  1.533555 -0.750685 -0.051212  0.188978
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745
2013-01-06  0.800437 -0.675739 -1.049957  0.052375

We can display the index, the columns, and the underlying numpy data

print(df.index)
print(df.columns)
print(df.values)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.53126106  1.29858738  1.87998645 -0.51028693]
 [ 0.11386978  1.11990561 -0.07170277 -1.0904453 ]
 [ 0.80784175 -0.42621808 -0.32852846 -0.25473621]
 [ 1.53355516 -0.75068514 -0.05121214  0.18897817]
 [-0.82846957 -0.62701575  1.66453418 -1.72874545]
 [ 0.80043734 -0.67573851 -1.049957    0.0523755 ]]

The describe method shows a quick statistical summary of the data

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.493083 -0.010194  0.340520 -0.557143
std    0.795975  0.952350  1.168538  0.731972
min   -0.828470 -0.750685 -1.049957 -1.728745
25%    0.218218 -0.663558 -0.264322 -0.945406
50%    0.665849 -0.526617 -0.061457 -0.382512
75%    0.805991  0.733375  1.235598 -0.024402
max    1.533555  1.298587  1.879986  0.188978

We can take the transpose, or sort by an axis or by values

print(df.T)
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.531261    0.113870    0.807842    1.533555   -0.828470    0.800437
B    1.298587    1.119906   -0.426218   -0.750685   -0.627016   -0.675739
C    1.879986   -0.071703   -0.328528   -0.051212    1.664534   -1.049957
D   -0.510287   -1.090445   -0.254736    0.188978   -1.728745    0.052375
df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -0.510287  1.879986  1.298587  0.531261
2013-01-02 -1.090445 -0.071703  1.119906  0.113870
2013-01-03 -0.254736 -0.328528 -0.426218  0.807842
2013-01-04  0.188978 -0.051212 -0.750685  1.533555
2013-01-05 -1.728745  1.664534 -0.627016 -0.828470
2013-01-06  0.052375 -1.049957 -0.675739  0.800437
df.sort_values(by='B')
                   A         B         C         D
2013-01-04  1.533555 -0.750685 -0.051212  0.188978
2013-01-06  0.800437 -0.675739 -1.049957  0.052375
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-01  0.531261  1.298587  1.879986 -0.510287

5.1.2. Selecting data

5.1.2.1. Getting

Selecting a single column (equivalent to df.A); selecting via [] also slices the rows, by position or by label

print(df['A'])
print(df[0:3])
print(df['20130102':'20130104'])
2013-01-01    0.531261
2013-01-02    0.113870
2013-01-03    0.807842
2013-01-04    1.533555
2013-01-05   -0.828470
2013-01-06    0.800437
Freq: D, Name: A, dtype: float64
                   A         B         C         D
2013-01-01  0.531261  1.298587  1.879986 -0.510287
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
                   A         B         C         D
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-04  1.533555 -0.750685 -0.051212  0.188978

5.1.2.2. Selection by label

Using a label

df.loc[dates[0]]
A    0.531261
B    1.298587
C    1.879986
D   -0.510287
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on multiple axes by label

df.loc[:,['A','B']]
                   A         B
2013-01-01  0.531261  1.298587
2013-01-02  0.113870  1.119906
2013-01-03  0.807842 -0.426218
2013-01-04  1.533555 -0.750685
2013-01-05 -0.828470 -0.627016
2013-01-06  0.800437 -0.675739

With label slicing, both endpoints are INCLUDED

df.loc['20130102':'20130104', ['A','B']]
                   A         B
2013-01-02  0.113870  1.119906
2013-01-03  0.807842 -0.426218
2013-01-04  1.533555 -0.750685

Getting a scalar value

df.loc[dates[0],'A']
0.5312610562902266

Faster access (equivalent to the previous method)

df.at[dates[0],'A']
0.5312610562902266

5.1.2.3. Selection by position

By integer position:

df.iloc[3]
A    1.533555
B   -0.750685
C   -0.051212
D    0.188978
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices, similar to numpy

df.iloc[3:5,0:2]
                   A         B
2013-01-04  1.533555 -0.750685
2013-01-05 -0.828470 -0.627016

By lists of integer positions

df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02  0.113870 -0.071703
2013-01-03  0.807842 -0.328528
2013-01-05 -0.828470  1.664534

Slicing rows explicitly

df.iloc[1:3,:]
                   A         B         C         D
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736

Getting a value explicitly

df.iloc[1,1]
1.1199056145150672

Fast scalar access

df.iat[1,1]
1.1199056145150672

5.1.2.4. Boolean indexing

Using a single column's values to select rows:

df[df.A > 0]
                   A         B         C         D
2013-01-01  0.531261  1.298587  1.879986 -0.510287
2013-01-02  0.113870  1.119906 -0.071703 -1.090445
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736
2013-01-04  1.533555 -0.750685 -0.051212  0.188978
2013-01-06  0.800437 -0.675739 -1.049957  0.052375

The where operation:

df[df > 0]
                   A         B         C         D
2013-01-01  0.531261  1.298587  1.879986       NaN
2013-01-02  0.113870  1.119906       NaN       NaN
2013-01-03  0.807842       NaN       NaN       NaN
2013-01-04  1.533555       NaN       NaN  0.188978
2013-01-05       NaN       NaN  1.664534       NaN
2013-01-06  0.800437       NaN       NaN  0.052375
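
The boolean selection above leaves NaN wherever the condition is false; the explicit DataFrame.where method exposes the same behaviour and additionally lets you substitute another value. A minimal sketch (the substitute value 0 is chosen purely for illustration):

masked = df.where(df > 0)           # same result as df[df > 0]
filled = df.where(df > 0, other=0)  # entries failing the condition become 0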

For filtering, we use the isin() method

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
print(df2)

df2[df2['E'].isin(['two','four'])]
                   A         B         C         D      E
2013-01-01  0.531261  1.298587  1.879986 -0.510287    one
2013-01-02  0.113870  1.119906 -0.071703 -1.090445    one
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736    two
2013-01-04  1.533555 -0.750685 -0.051212  0.188978  three
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745   four
2013-01-06  0.800437 -0.675739 -1.049957  0.052375  three
                   A         B         C         D     E
2013-01-03  0.807842 -0.426218 -0.328528 -0.254736   two
2013-01-05 -0.828470 -0.627016  1.664534 -1.728745  four

5.1.2.5. Adding / modifying values / columns

Setting a new column automatically aligns the data by the indexes.

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
print(s1)
df['F'] = s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
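
Since s1 starts at 2013-01-02 and runs through 2013-01-07, the alignment leaves a NaN for 2013-01-01 and silently drops the 2013-01-07 value, which is absent from df's index (this shows up in the F column a few cells below). A small sketch of the same mechanism on toy data (labels and values invented for illustration):

demo = pd.DataFrame({'x': [10, 20]}, index=['a', 'b'])
demo['y'] = pd.Series([1, 2], index=['b', 'c'])  # 'a' gets NaN, 'c' is dropped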

Setting a value by label

df.at[dates[0],'A'] = 0
df.loc[dates[0],'A']
0.0

Setting a value by position

df.iat[0,1] = 0

Setting values by assigning a numpy array

df.loc[:,'D'] = np.array([5] * len(df))
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.879986  5  NaN
2013-01-02  0.113870  1.119906 -0.071703  5  1.0
2013-01-03  0.807842 -0.426218 -0.328528  5  2.0
2013-01-04  1.533555 -0.750685 -0.051212  5  3.0
2013-01-05 -0.828470 -0.627016  1.664534  5  4.0
2013-01-06  0.800437 -0.675739 -1.049957  5  5.0

5.1.2.6. Handling missing data

Pandas uses np.nan to represent missing values. By default it is excluded from computations.
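
A quick sketch (not in the original notebook) contrasting numpy, where NaN propagates through reductions, with pandas aggregations, which skip it by default:

arr = np.array([1.0, np.nan, 3.0])
ser = pd.Series(arr)
arr.sum()              # nan: numpy propagates missing values
ser.sum()              # 4.0: pandas skips NaN by default (skipna=True)
ser.sum(skipna=False)  # nan: the numpy-like behaviour on request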

reindex lets you change/add/delete the index on a specified axis. It returns a copy of the data

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.879986  5  NaN  1.0
2013-01-02  0.113870  1.119906 -0.071703  5  1.0  1.0
2013-01-03  0.807842 -0.426218 -0.328528  5  2.0  NaN
2013-01-04  1.533555 -0.750685 -0.051212  5  3.0  NaN

To drop any rows that contain NaN:

df1.dropna(how='any')
                  A         B         C  D    F    E
2013-01-02  0.11387  1.119906 -0.071703  5  1.0  1.0

Filling missing values

df1.fillna(value=5)
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.879986  5  5.0  1.0
2013-01-02  0.113870  1.119906 -0.071703  5  1.0  1.0
2013-01-03  0.807842 -0.426218 -0.328528  5  2.0  5.0
2013-01-04  1.533555 -0.750685 -0.051212  5  3.0  5.0

Getting the boolean mask showing where values are NaN

pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

5.1.3. Operations

5.1.3.1. Stats

Operations generally exclude missing data.

print(df.mean())
print(df.mean(1))  # other axis
A    0.404539
B   -0.226625
C    0.340520
D    5.000000
F    3.000000
dtype: float64
2013-01-01    1.719997
2013-01-02    1.432415
2013-01-03    1.410619
2013-01-04    1.746332
2013-01-05    1.841810
2013-01-06    1.814948
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

print(s)

df.sub(s, axis='index')
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.192158 -1.426218 -1.328528  4.0  1.0
2013-01-04 -1.466445 -3.750685 -3.051212  2.0  0.0
2013-01-05 -5.828470 -5.627016 -3.335466  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
help(df.sub)
Help on method sub in module pandas.core.ops:

sub(other, axis='columns', level=None, fill_value=None) method of pandas.core.frame.DataFrame instance
    Get Subtraction of dataframe and other, element-wise (binary operator `sub`).
    
    Equivalent to ``dataframe - other``, but with support to substitute a fill_value
    for missing data in one of the inputs. With reverse version, `rsub`.
    
    Among flexible wrappers (`add`, `sub`, `mul`, `div`, `mod`, `pow`) to
    arithmetic operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`.
    
    Parameters
    ----------
    other : scalar, sequence, Series, or DataFrame
        Any single or multiple element data structure, or list-like object.
    axis : {0 or 'index', 1 or 'columns'}
        Whether to compare by the index (0 or 'index') or columns
        (1 or 'columns'). For Series input, axis to match Series index on.
    level : int or label
        Broadcast across a level, matching Index values on the
        passed MultiIndex level.
    fill_value : float or None, default None
        Fill existing missing (NaN) values, and any new element needed for
        successful DataFrame alignment, with this value before computation.
        If data in both corresponding DataFrame locations is missing
        the result will be missing.
    
    Returns
    -------
    DataFrame
        Result of the arithmetic operation.
    
    See Also
    --------
    DataFrame.add : Add DataFrames.
    DataFrame.sub : Subtract DataFrames.
    DataFrame.mul : Multiply DataFrames.
    DataFrame.div : Divide DataFrames (float division).
    DataFrame.truediv : Divide DataFrames (float division).
    DataFrame.floordiv : Divide DataFrames (integer division).
    DataFrame.mod : Calculate modulo (remainder after division).
    DataFrame.pow : Calculate exponential power.
    
    Notes
    -----
    Mismatched indices will be unioned together.
    
    Examples
    --------
    >>> df = pd.DataFrame({'angles': [0, 3, 4],
    ...                    'degrees': [360, 180, 360]},
    ...                   index=['circle', 'triangle', 'rectangle'])
    >>> df
               angles  degrees
    circle          0      360
    triangle        3      180
    rectangle       4      360
    
    Add a scalar with operator version which return the same
    results.
    
    >>> df + 1
               angles  degrees
    circle          1      361
    triangle        4      181
    rectangle       5      361
    
    >>> df.add(1)
               angles  degrees
    circle          1      361
    triangle        4      181
    rectangle       5      361
    
    Divide by constant with reverse version.
    
    >>> df.div(10)
               angles  degrees
    circle        0.0     36.0
    triangle      0.3     18.0
    rectangle     0.4     36.0
    
    >>> df.rdiv(10)
                 angles   degrees
    circle          inf  0.027778
    triangle   3.333333  0.055556
    rectangle  2.500000  0.027778
    
    Subtract a list and Series by axis with operator version.
    
    >>> df - [1, 2]
               angles  degrees
    circle         -1      358
    triangle        2      178
    rectangle       3      358
    
    >>> df.sub([1, 2], axis='columns')
               angles  degrees
    circle         -1      358
    triangle        2      178
    rectangle       3      358
    
    >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
    ...        axis='index')
               angles  degrees
    circle         -1      359
    triangle        2      179
    rectangle       3      359
    
    Multiply a DataFrame of different shape with operator version.
    
    >>> other = pd.DataFrame({'angles': [0, 3, 4]},
    ...                      index=['circle', 'triangle', 'rectangle'])
    >>> other
               angles
    circle          0
    triangle        3
    rectangle       4
    
    >>> df * other
               angles  degrees
    circle          0      NaN
    triangle        9      NaN
    rectangle      16      NaN
    
    >>> df.mul(other, fill_value=0)
               angles  degrees
    circle          0      0.0
    triangle        9      0.0
    rectangle      16      0.0
    
    Divide by a MultiIndex by level.
    
    >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
    ...                              'degrees': [360, 180, 360, 360, 540, 720]},
    ...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
    ...                                    ['circle', 'triangle', 'rectangle',
    ...                                     'square', 'pentagon', 'hexagon']])
    >>> df_multindex
                 angles  degrees
    A circle          0      360
      triangle        3      180
      rectangle       4      360
    B square          4      360
      pentagon        5      540
      hexagon         6      720
    
    >>> df.div(df_multindex, level=1, fill_value=0)
                 angles  degrees
    A circle        NaN      1.0
      triangle      1.0      1.0
      rectangle     1.0      1.0
    B square        0.0      0.0
      pentagon      0.0      0.0
      hexagon       0.0      0.0

5.1.3.2. Apply

Applying functions to the data

df.apply(np.cumsum)
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  1.879986   5   NaN
2013-01-02  0.113870  1.119906  1.808284  10   1.0
2013-01-03  0.921712  0.693688  1.479755  15   3.0
2013-01-04  2.455267 -0.056998  1.428543  20   6.0
2013-01-05  1.626797 -0.684013  3.093077  25  10.0
2013-01-06  2.427234 -1.359752  2.043120  30  15.0
df.apply((lambda x: x.max() - x.min()))
A    2.362025
B    1.870591
C    2.929943
D    0.000000
F    4.000000
dtype: float64

5.1.3.3. Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())
0    2
1    3
2    1
3    5
4    3
5    6
6    5
7    1
8    4
9    2
dtype: int64
1    2
2    2
3    2
5    2
4    1
6    1
dtype: int64
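
value_counts gives the counts behind a histogram; since %matplotlib inline is active at the top of the notebook, the distribution of s can also be plotted directly. A minimal sketch (the bin count is arbitrary):

s.plot.hist(bins=7)  # equivalently s.hist(bins=7), both go through matplotlib
plt.show()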

5.1.3.4. String methods

Series come with a set of string-processing methods under the str attribute, which makes it easy to operate on each element of the array. Regular expressions are often used; a short sketch follows the example below.

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
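
Since the paragraph above mentions regular expressions, here is a small sketch of regex-based matching and replacement on the same Series (the patterns are chosen purely for illustration):

s.str.contains(r'^[AB]', na=False)     # True for strings starting with A or B
s.str.replace(r'a+', '-', regex=True)  # collapse runs of 'a' into '-'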

5.1.4. Combining data

5.1.4.1. Concatenation

Pandas provides facilities for easily combining Series, DataFrame and Panel objects, with various kinds of set logic for the indexes and relational-algebra functionality for join / merge operations

Pandas objects can be concatenated with concat()

df = pd.DataFrame(np.random.randn(10, 4))
print(df)
          0         1         2         3
0 -0.145638 -0.339968 -0.120956  1.200960
1  1.710535  1.260141  1.108210  0.005057
2  0.339810 -0.467186 -0.178729 -1.128419
3 -0.321449  1.771127 -0.441905  0.424739
4  0.059993  0.373694 -0.521661  0.706102
5  0.005170  0.400061  0.947412  1.130690
6 -0.964596 -0.459291 -0.479300 -2.245681
7 -0.920954  0.787782  0.012169  1.357360
8  0.032657 -0.366058  0.238168  1.674530
9  1.156401  1.466012  0.882983  0.726695
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
          0         1         2         3
0 -0.145638 -0.339968 -0.120956  1.200960
1  1.710535  1.260141  1.108210  0.005057
2  0.339810 -0.467186 -0.178729 -1.128419
3 -0.321449  1.771127 -0.441905  0.424739
4  0.059993  0.373694 -0.521661  0.706102
5  0.005170  0.400061  0.947412  1.130690
6 -0.964596 -0.459291 -0.479300 -2.245681
7 -0.920954  0.787782  0.012169  1.357360
8  0.032657 -0.366058  0.238168  1.674530
9  1.156401  1.466012  0.882983  0.726695
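
concat can also tag each piece or stack column-wise; a short sketch reusing the df and pieces defined above (the keys are arbitrary labels):

pd.concat(pieces, keys=['p1', 'p2', 'p3'])   # adds an outer index level per piece
pd.concat([df[[0, 1]], df[[2, 3]]], axis=1)  # column-wise concatenation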

5.1.4.2. Joins

We can merge in the style of SQL joins.

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

print(left)
print(right)

print(pd.merge(left, right, on='key'))
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
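
When the key columns contain values that do not all match, the join type matters. A small sketch with invented data showing how='left' keeps unmatched left-hand rows while the default inner join drops them:

left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
pd.merge(left2, right2, on='key', how='left')   # 'bar' is kept, its rval is NaN
pd.merge(left2, right2, on='key', how='inner')  # only 'foo' survives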

5.1.4.3. Append

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
          A         B         C         D
0 -0.695328 -0.995854  0.908458  1.678616
1 -0.521819  0.464408  0.396498  0.674066
2  0.635238  0.909905 -0.714702  1.185353
3 -0.243663 -0.329383  1.602974 -0.874731
4  0.280138 -1.063717  0.305626  0.595742
5  1.382979 -0.800939  1.230337 -0.922623
6 -0.041754  1.021744  0.737046  0.445048
7 -0.537346 -1.302530 -0.196090  1.214839
s = df.iloc[3]
df.append(s, ignore_index=True)
          A         B         C         D
0 -0.695328 -0.995854  0.908458  1.678616
1 -0.521819  0.464408  0.396498  0.674066
2  0.635238  0.909905 -0.714702  1.185353
3 -0.243663 -0.329383  1.602974 -0.874731
4  0.280138 -1.063717  0.305626  0.595742
5  1.382979 -0.800939  1.230337 -0.922623
6 -0.041754  1.021744  0.737046  0.445048
7 -0.537346 -1.302530 -0.196090  1.214839
8 -0.243663 -0.329383  1.602974 -0.874731
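
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat covers the same use case. A sketch of the equivalent call, reusing the df and s defined just above:

# s is a single row (a Series); turn it back into a one-row frame and concatenate
pd.concat([df, s.to_frame().T], ignore_index=True)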

5.1.4.4. Grouping

Grouping ("group by") involves the following steps:

  • Splitting the data into groups

  • Applying a function to each group independently

  • Combining the results into a data structure

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
     A      B         C         D
0  foo    one  1.095588  0.637490
1  bar    one  0.567613 -0.555713
2  foo    two -0.316915  0.437783
3  bar  three -1.247033  0.435075
4  foo    two  0.233856 -1.218486
5  bar    two -0.904002  0.940587
6  foo    one  0.713043 -0.154975
7  foo  three  0.324294  0.339278

Grouping and then summing each group

df.groupby('A').sum()
            C         D
A
bar -1.583423  0.819950
foo  2.049867  0.041091

Grouping by multiple columns

df.groupby(['A','B']).sum()
                  C         D
A   B
bar one    0.567613 -0.555713
    three -1.247033  0.435075
    two   -0.904002  0.940587
foo one    1.808631  0.482516
    three  0.324294  0.339278
    two   -0.083059 -0.780702
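
groupby is not limited to sum: agg applies a different aggregation per column, and size counts the rows in each group. A minimal sketch reusing the df defined above (the choice of aggregations is illustrative):

df.groupby('A').agg({'C': 'sum', 'D': 'mean'})
df.groupby(['A', 'B']).size()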

5.1.4.5. Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
                     A         B
first second
bar   one     0.904984  2.099247
      two     1.172413 -0.388574
baz   one    -0.984071 -1.504685
      two    -1.141553  0.203926

The stack() method compresses a level of the DataFrame's columns into the index

stacked = df2.stack()
stacked
first  second   
bar    one     A    0.904984
               B    2.099247
       two     A    1.172413
               B   -0.388574
baz    one     A   -0.984071
               B   -1.504685
       two     A   -1.141553
               B    0.203926
dtype: float64

With a 'stacked' DataFrame or Series, the inverse operation is unstack()
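
The notebook stops here; as a small sketch (assuming the stacked object from the cell above), unstack() moves the innermost index level back into columns, and an explicit level picks a different one:

stacked.unstack()   # back to the layout of df2 (the last level becomes columns)
stacked.unstack(1)  # the 'second' level becomes the columns
stacked.unstack(0)  # the 'first' level becomes the columns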