Pandas 101

DeepLearning
ENG
code
tuto
Author

Julien Combes

Published

December 23, 2025

This is a file I send to people so they have some code samples they can copy-paste and adapt to their own problems. If you need something that is not on this page, or you spot a mistake (there are likely some, we never know), let me know and this page will grow as people ask me to update it :).

Getting the data

In most cases you will already have a file produced by some measurement system. Pandas can read most common file types: CSV (separated by commas, semicolons or any character), Excel files, and the newer but very efficient Parquet files.

Everything happens with the pandas.read_* functions:

import pandas as pd

pd.read_csv("file.csv", sep=",", index_col=0)  # plus many other options
pd.read_excel("file.xlsx")
pd.read_parquet("file.parquet")

The sep argument tells pandas which character separates the columns.
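As a small sketch of how sep and index_col interact, here is a semicolon-separated CSV (the content is made up) read back from an in-memory buffer:

```python
import pandas as pd
from io import StringIO

# A tiny inline CSV, semicolon-separated, as many European tools export.
raw = "id;make;price\n0;volvo;26633\n1;nissan;5900\n"

# index_col=0 uses the first column ("id") as the row index.
df = pd.read_csv(StringIO(raw), sep=";", index_col=0)

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['make', 'price']
```

With the wrong sep, pandas would read each line as a single column, so it is worth checking the result with df.head() right after loading.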

Toy data

Things with cars! It must be good, hehe.

This first step only fetches the data and is not important in itself. We use OpenML to load it: a rich collection of datasets that is very useful for getting toy data.

import numpy as np
import pandas as pd
import openml

datasets_list = openml.datasets.list_datasets(output_format="dataframe", status="active", tag="manufacturing")


dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)

Understanding the data

The tables

Pandas provides multiple ways to inspect our dataset. Here is a non-exhaustive list of attributes and methods we could use if we were students learning the tool to master it:

  • X.shape : number of rows and columns
  • X.info() : column names, dtypes and number of missing values
  • X.describe() : statistical summary of each numerical column (extrema, quantiles, …)
  • X.head(n) : show the first n rows of your dataset
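These four calls can be sketched on a tiny stand-in dataframe (the values are made up, just to show what each call returns):

```python
import pandas as pd

# A hypothetical miniature of the cars dataset.
df = pd.DataFrame({
    "make": ["volvo", "nissan", "honda"],
    "price": [26633.5, 5900.7, 9831.4],
})

print(df.shape)       # (3, 2): 3 rows, 2 columns
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # count / mean / std / min / quartiles / max of numeric columns
print(df.head(2))     # the first 2 rows
```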

But we are here to be efficient and get the information we want the easy way. So let me show you the only tool you need to get to know your data:

from skrub import TableReport

TableReport(X)

It produces an interactive HTML view of your table that you can explore: dimensions, statistical summaries, the classic distribution plots for numerical data, etc. Everything you could have done with plain pandas, and more.

The graphs

The available plotting tools are numerous. If you want an idea of what is possible with Python, here is a gallery with a lot of nice examples. I will focus on the basic but most useful plotting solution.

The library we will use is seaborn, which is designed to work on dataframes. It is built on matplotlib, the basic plotting brick of Python, which is quite hard to use directly. Seaborn gets things done quickly and nicely.

We will explore plotting by asking questions:

What is the proportion of each car maker ?

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,4))
sns.histplot(
    X, 
    x = "make",
    stat = "density"
)
plt.xticks(rotation = 90);

Mutating the data

Pandas provides a lot of very useful methods to filter, modify or delete items from a dataframe.

Warning

The philosophy in data science is to NEVER modify the data you are studying by hand.

All your processing should be done in one clear place, in code.

As examples, we will now go through the most popular operations you might need in your work.

Let’s remember what the dataframe looks like :

X
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

1000000 rows × 25 columns

and see what the prices look like by brand :

plt.figure(figsize=(8,4))
sns.boxplot(
    X, 
    x = "make",
    y = "price",
)
plt.xticks(rotation = 90);

Filtering rows

We have negative prices, which look like errors, so let's start by filtering rows.

Pandas has a very handy method: query. This method eats a string containing conditions; in this string you can use column names as variables, and reference external variables from your code with @, like this:

price_error_threshold = 0
X.query("price >= @price_error_threshold")
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

999875 rows × 25 columns

You see that the table only contains 999 875 rows: some items have been removed. query returns a filtered copy of the dataframe, so if you want to keep the result, you need to assign it to a new variable like this:

new_df = X.query("price >= @price_error_threshold")
new_df
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

999875 rows × 25 columns

I don’t know what the cleanest way to apply multiple filters is, but I would recommend applying them sequentially, to stay as readable as possible, like this:

(
    X
    .query("price > 0")
    .query("make == 'porsche'")
    .query("aspiration== 'turbo'")
    .query("horsepower > 200")
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
3190 185.876800 porsche gas turbo two wagon fwd rear 105.204967 199.830908 ... 94.192687 spfi 3.077697 3.299996 8.572648 203.141156 5247.807116 21.175022 20.079863 16825.034664
7706 169.592917 porsche gas turbo two hardtop rwd rear 94.166153 170.515142 ... 103.474633 spdi 3.343738 3.320666 9.000000 221.580768 5125.481092 24.691797 29.570073 32931.723172
13028 80.884942 porsche gas turbo four sedan rwd front 111.122111 189.999299 ... 165.584111 2bbl 2.938367 3.272089 9.357822 227.389179 4334.123031 18.965615 26.375387 16813.144257
16276 73.890834 porsche gas turbo two hatchback rwd front 109.779892 178.833958 ... 96.387741 mfi 3.651589 3.190689 9.000000 219.965458 5299.653454 19.982165 19.999320 14079.390642
27734 73.767420 porsche gas turbo two hatchback rwd front 106.649009 169.153863 ... 89.379986 mpfi 2.969864 3.084899 9.649326 206.358553 5823.265443 19.884336 24.375660 14600.078186
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
977214 74.475238 porsche gas turbo two sedan rwd front 102.583083 178.282952 ... 119.496931 mpfi 2.909086 2.523007 9.333629 206.509929 5628.363546 21.398545 23.528196 35183.174092
980475 174.669041 porsche gas turbo two convertible rwd front 96.008550 169.611921 ... 114.554297 spdi 3.556880 3.964338 7.101980 228.480555 6094.263138 17.284775 23.653810 14130.291772
981505 87.402943 porsche diesel turbo four sedan rwd front 96.954457 184.067197 ... 95.343936 idi 3.312339 2.429638 9.531729 241.163621 4485.750026 16.386203 28.877237 24801.794845
989078 117.694568 porsche gas turbo four hardtop 4wd front 109.253923 187.952848 ... 135.856749 mfi 3.510137 2.881988 9.431905 217.754979 5080.599992 33.395213 31.847391 14830.463212
996572 97.112480 porsche gas turbo four sedan rwd front 105.481551 201.169575 ... 104.631448 mpfi 3.137802 2.439404 16.990415 228.311859 5363.232239 18.427064 20.890147 16440.108451

299 rows × 25 columns

I wonder how the horsepower is distributed for each brand:

plt.figure(figsize=(8,4))
sns.histplot(
    new_df,
    x = "horsepower",
    hue = "make"
)

We can see an interesting pattern in the horsepower distribution! There are peaks, but we have too many brands to see a pattern clearly, so let's only compare some of them:

plt.figure(figsize=(8,4))
sns.histplot(
    (
        new_df
        .query("make.isin(['peugot','porsche','bmw'])")
        .assign(make  = lambda df_: df_.make.astype(object))
    ),
    x = "horsepower",
    hue = "make",
    stat = "density",
)

Nothing interesting here. Let’s see how to mutate columns. You might want to create new columns computed from other columns.

Tip: Anonymous functions

There is a part of Python that often seems useless but that we need here: anonymous functions (lambda). You use one when you need a function for a single, one-off use, typically when another function takes a function as argument.
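A minimal sketch of the idea: a lambda is just a function without a name, most useful passed inline to something that expects a function.

```python
# A named function and its anonymous (lambda) equivalent.
def double(x):
    return 2 * x

double_anon = lambda x: 2 * x

assert double(21) == double_anon(21) == 42

# Typical use: passed inline as the key of a sort.
values = [3, 1, 2]
print(sorted(values, key=lambda v: -v))  # [3, 2, 1]
```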

Adding columns

In pandas, we use anonymous functions in various places; let's see examples with the assign method:

(
    X.assign(volume = X.height * X.width * X.length)
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price volume
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961 619292.173905
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618 730511.069668
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970 566642.367482
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741 633282.290282
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432 617652.454060
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987 623103.391042
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670 582494.176541
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033 758006.025497
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231 639722.547021
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933 550103.184730

1000000 rows × 26 columns

This computation works because we add the column to X itself, so X.height really is a column of the dataframe we are assigning to. But in most cases we want to chain operations on the same dataframe, like so:

(
    X
    .query("make == 'volvo'")
    .assign(volume = lambda df_: df_.height * df_.width * df_.length)
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price volume
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618 730511.069668
9 98.278149 volvo gas std four hatchback rwd front 113.533132 192.410446 ... mpfi 3.694279 3.083717 8.902826 164.839204 5006.561572 19.564961 20.105817 14736.190072 629304.174602
45 124.123256 volvo gas std four hardtop rwd front 99.312390 190.208244 ... 2bbl 3.608365 3.225135 7.737136 66.343513 4757.383809 28.393852 25.772739 10499.685299 663866.518027
51 152.911817 volvo gas std two hatchback rwd front 95.413365 174.632362 ... spdi 3.084798 2.667112 10.227535 214.732626 5136.182992 19.265381 23.848923 6782.556436 673043.970292
66 92.550045 volvo gas std two sedan rwd front 101.931293 197.169953 ... mpfi 2.899100 3.101075 7.748074 120.767322 5186.690201 22.828964 39.515080 14098.589700 750008.948305
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999868 84.057334 volvo gas std four sedan rwd front 105.603092 189.582948 ... mpfi 3.421972 2.382150 6.177299 77.153412 4172.803042 18.686774 20.241031 5961.977685 717292.274429
999930 96.830261 volvo gas std four sedan fwd front 108.412834 195.293053 ... 2bbl 3.337004 3.898358 9.000000 100.049448 6351.331883 23.539770 33.046702 8267.274693 653497.733018
999936 91.291390 volvo gas turbo four sedan rwd front 95.221499 173.070373 ... spdi 3.332252 3.277535 15.727727 153.916788 5037.995696 20.484083 28.600490 7752.476237 531707.168849
999952 93.885399 volvo gas std four wagon fwd front 106.762610 199.221956 ... mpfi 2.957202 3.721575 9.000000 128.319584 5189.472664 28.267098 35.193040 32660.749929 735475.748951
999976 218.868694 volvo gas std four wagon 4wd front 105.205247 174.310172 ... 2bbl 3.519628 3.096898 17.079278 92.305693 5951.340505 28.094625 29.207487 10869.807786 565988.124681

42242 rows × 26 columns

Here, the assign method takes an anonymous function that receives the dataframe as it is at that step of the chain. Indeed, in the previous cell we filtered some rows of X with the query step, so by the time assign runs, the dataframe is no longer X (it has fewer rows). So, inside a chain, you replace the direct syntax:

X.assign(new_col = (X.col1 * X.col2)/ X.col3 + 7)

by the anonymous-function syntax:

X.assign(new_col = lambda df_: (df_.col1 * df_.col2)/ df_.col3 + 7)
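To make this concrete, here is a minimal sketch with made-up column names and values: the lambda receives the already-filtered dataframe, so the new column lines up with exactly the rows that survived the query.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

# df_ inside the lambda is the dataframe *after* the query step,
# not the original df.
out = (
    df
    .query("col1 > 1")
    .assign(product = lambda df_: df_.col1 * df_.col2)
)
print(out["product"].tolist())  # [40.0, 90.0]
```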

Strings

String columns are very common and you will want to process strings in your daily life :). Pandas can help you with that!

Here we will process the make column with some examples. It's not very useful here since this dataframe is already clean, but the functions you will need are the same!

(
    X.make.str.split("-", expand = True)
)
0 1
0 volkswagen None
1 volvo None
2 nissan None
3 porsche None
4 honda None
... ... ...
999995 toyota None
999996 alfa romero
999997 mercedes benz
999998 audi None
999999 mercedes benz

1000000 rows × 2 columns

Here we create two columns from a single one: we split it on a pattern (here it's "-") and expand the pieces into new columns. If you know you only want one part of the split string, it's even easier (keeping the first element in this example):

(
    X.make.str.split("-").str[0]
)
0         volkswagen
1              volvo
2             nissan
3            porsche
4              honda
             ...    
999995        toyota
999996          alfa
999997      mercedes
999998          audi
999999      mercedes
Name: make, Length: 1000000, dtype: object

You can filter rows based on whether a string cell contains a given pattern, like this:

(
    X.make.str.contains("benz")
)
0         False
1         False
2         False
3         False
4         False
          ...  
999995    False
999996    False
999997     True
999998    False
999999     True
Name: make, Length: 1000000, dtype: bool

This creates a boolean mask that is usable in the query method:

(
    X.query("make.str.contains('benz')")
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
24 78.145246 mercedes-benz gas std four sedan rwd front 98.627816 184.178243 ... 87.152894 2bbl 3.619763 3.428984 9.000000 116.724261 5546.241459 44.471921 39.600851 13506.102886
27 192.138251 mercedes-benz diesel std four sedan 4wd front 101.633589 155.579818 ... 99.760274 spdi 3.144883 3.948666 9.000000 81.175532 6140.981220 24.632908 28.725574 6952.149473
40 92.975466 mercedes-benz gas std four sedan rwd front 97.569063 165.257082 ... 77.373192 1bbl 3.217990 3.070064 9.000000 78.085362 5082.690039 24.088293 30.222645 14855.739089
80 98.843142 mercedes-benz gas turbo four sedan rwd front 110.920185 189.493656 ... 97.019759 mpfi 3.584555 3.083670 9.451520 73.311237 4567.253302 40.911565 49.135100 29070.307145
97 82.214150 mercedes-benz gas std four hatchback fwd front 94.886181 167.832934 ... 102.136645 2bbl 2.892585 3.131031 7.245234 72.781963 4647.368061 29.180213 33.731709 6339.999308
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999943 73.726165 mercedes-benz gas turbo two sedan fwd front 108.369176 184.750858 ... 124.948909 mpfi 3.626866 3.381363 7.545847 118.186202 4550.313865 24.183442 31.209608 7865.394691
999945 69.088197 mercedes-benz gas std four sedan rwd front 112.227270 193.674685 ... 133.136227 mpfi 3.210714 3.429679 9.267030 131.604353 5343.495652 24.026830 30.508760 13145.043462
999946 77.974004 mercedes-benz gas std four wagon rwd front 113.444076 202.028214 ... 94.055280 mpfi 3.172689 3.557503 8.599569 145.838478 5264.356983 18.653069 27.266048 25412.610927
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

42537 rows × 25 columns

So nice isn’t it ?!
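As an aside, the same boolean mask can also be used with plain bracket indexing, which is strictly equivalent to the query form. A minimal sketch (made-up data; note that .str expressions inside query may require engine="python" when numexpr is installed):

```python
import pandas as pd

df = pd.DataFrame({"make": ["mercedes-benz", "audi", "mercedes-benz"]})

# Build the mask once...
mask = df.make.str.contains("benz")

# ...then use it directly with square brackets:
print(df[mask].shape[0])  # 2 rows kept

# Equivalent to the query form:
assert df[mask].equals(df.query("make.str.contains('benz')", engine="python"))
```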

Chain all operations

The goal is to have all the processing in one single chain of operations, like so:

X_cleaned = (
    X
    .query("price > 0")
    .assign(volume = lambda df_: df_.width * df_.height * df_.length)
    .query("volume <= volume.quantile(.25)")
    .query("make.isin(['nissan', 'peugot'])")
    .query("aspiration == 'turbo'")
)

Grouping

You might want to get a statistical summary for sub-categories of individuals, or aggregate over brands, items, people or whatever. Grouping is very easy and modular in pandas, as you can see here :)

This is the syntax I prefer, because it lets you compute several aggregations from the same column.

grouped = (
    X
    .query("price > 0")
    .assign(volume = lambda df_: df_.width * df_.height * df_.length)
    .assign(price_per_cubic_whateverunit = lambda df_: df_.price / df_.volume)
    # observed=True: only keep categories actually present, and silence
    # the pandas FutureWarning about the default changing.
    .groupby(by = "make", observed = True)
    .agg(
        mean_length = ("length", "mean"),
        max_price = ("price", "max"),
        price_std = ("price", "std"),
        median_ppcw = ("price_per_cubic_whateverunit", "median"),
    )
    .sort_values(by = "median_ppcw", ascending = False)
)
grouped.head()
mean_length max_price price_std median_ppcw
make
peugot 181.020325 57085.028930 8495.916584 0.019026
volvo 179.189219 59808.006804 8661.579270 0.018844
mercedes-benz 178.606458 58246.449115 8391.706783 0.018640
porsche 173.588336 57863.174442 7903.200039 0.018223
bmw 177.121137 63918.819174 8276.675082 0.018152

Since the table is sorted by median price per cubic unit in descending order, a peugot or a volvo is what you pay the most per unit of volume for; if you want the best volume/price ratio, look at the other end of this ranking.

Indexing

Columns

This is the most useful part (last but not least, as they say). If you have a big dataframe and you only want a subset of it, how do you do it?

You can index either by position or by name. The simplest way is to subset by name, when you only want to select some columns:

X.loc[:,["make", "fuel-type"]]
make fuel-type
0 volkswagen diesel
1 volvo diesel
2 nissan gas
3 porsche gas
4 honda gas
... ... ...
999995 toyota gas
999996 alfa-romero gas
999997 mercedes-benz diesel
999998 audi gas
999999 mercedes-benz diesel

1000000 rows × 2 columns

If you only want one column (called a pandas Series), you can just use the dot notation:

X.make
0            volkswagen
1                 volvo
2                nissan
3               porsche
4                 honda
              ...      
999995           toyota
999996      alfa-romero
999997    mercedes-benz
999998             audi
999999    mercedes-benz
Name: make, Length: 1000000, dtype: category
Categories (22, object): ['alfa-romero' < 'audi' < 'bmw' < 'chevrolet' ... 'subaru' < 'toyota' < 'volkswagen' < 'volvo']

With loc, we selected all the rows (the :) and the columns whose names were in the list we provided. If you want to select all the columns whose name contains a given string pattern, you can use the filter method:

X.filter(like = "engine")
engine-location engine-type engine-size
0 front ohc 89.249734
1 front ohc 111.024531
2 front ohc 96.317551
3 front ohcv 144.228109
4 front ohc 95.829115
... ... ... ...
999995 front ohc 141.139493
999996 front ohcv 94.672790
999997 front ohc 142.729616
999998 front ohcf 90.780708
999999 front ohcf 100.256336

1000000 rows × 3 columns

If you want to select only the columns with a given data type, you can do:

X.select_dtypes("number")  # can also ask for "object" or "category"
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 96.575655 173.369361 65.329051 54.678538 2193.379366 89.249734 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 109.846269 196.214268 68.502701 54.348620 3211.189054 111.024531 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 95.024314 165.909864 64.804938 52.702199 2098.071603 96.317551 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 100.174878 182.499698 68.385729 50.742249 1879.045004 144.228109 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 96.792849 172.594242 64.200921 55.741238 2353.389802 95.829115 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 96.798115 172.721704 62.991738 57.270320 2847.303204 141.139493 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 95.471999 172.728564 66.455439 50.745425 2847.057699 94.672790 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 106.470022 198.734017 64.394394 59.231453 2148.588288 142.729616 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 98.163839 175.115516 66.112063 55.256875 2833.256423 90.780708 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 99.880484 154.602788 66.169296 53.773750 2307.190623 100.256336 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

1000000 rows × 15 columns

Rows

Well, here you can of course select rows too, and a very common need is to grab a random subset of all the rows, which is what the sample method does:

X.sample(10)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
83491 197.093361 mazda gas turbo two hatchback fwd front 95.392973 167.255735 ... 103.801811 2bbl 3.394902 2.910523 8.886576 74.517819 4681.796429 27.635116 31.926938 10536.424230
248481 140.913852 honda gas std two hatchback fwd front 91.151295 167.453680 ... 137.662655 4bbl 3.885386 3.125268 4.766288 104.927774 5064.190596 16.684367 25.377746 13711.342405
264461 157.157563 dodge gas std two hatchback fwd front 95.103127 172.814581 ... 97.826295 mpfi 2.857029 3.090588 9.000000 88.371563 4794.769624 24.555449 29.280932 7401.049958
980399 142.112089 nissan gas std four sedan fwd front 95.835780 163.238924 ... 105.968642 2bbl 3.115666 3.290441 9.337480 63.626596 5025.679546 32.574230 29.867445 5885.838859
272428 150.544061 renault gas std four sedan rwd front 98.160223 171.300787 ... 84.592308 mpfi 3.223281 3.397151 9.000000 125.014972 4796.745507 35.736208 34.503752 8268.356363
662081 183.936993 jaguar gas std four sedan fwd front 97.548845 174.557561 ... 119.096962 mpfi 3.245586 3.275552 6.911752 113.727681 4760.689718 19.076518 25.700852 5151.664116
890312 93.550394 peugot gas std four sedan rwd front 103.853837 174.907061 ... 103.304423 4bbl 3.101952 3.445152 9.000000 96.615613 4665.775701 15.941203 38.151445 6502.788269
968200 80.420236 toyota gas std four wagon fwd front 95.960047 171.224074 ... 113.124693 mpfi 3.186838 3.012290 8.757598 83.501401 4810.249324 31.619324 33.326111 14121.629337
999680 96.063490 saab gas std four sedan 4wd front 101.410403 180.917694 ... 108.259378 mpfi 3.489375 3.160075 9.000000 123.392444 5117.331306 17.244871 32.839045 10515.510403
281398 139.321670 honda gas turbo two hardtop fwd front 92.661431 153.480296 ... 97.622381 2bbl 3.493441 3.310877 18.896945 90.876814 5646.011784 26.594382 33.536734 9685.452250

10 rows × 25 columns
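Beyond random sampling, rows can also be selected by position with iloc or by index label with loc. A minimal sketch (the index labels and values are made up); passing random_state to sample makes the draw reproducible:

```python
import pandas as pd

df = pd.DataFrame({"make": ["volvo", "nissan", "honda", "audi"]},
                  index=[10, 11, 12, 13])

print(df.iloc[0].make)                     # first row by position: 'volvo'
print(df.iloc[1:3].shape)                  # rows at positions 1 and 2: (2, 1)
print(df.loc[13].make)                     # row by index label: 'audi'
print(df.sample(2, random_state=0).shape)  # reproducible random sample: (2, 1)
```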

Shape of the data

Once upon a time, people were fighting about which was best between the long and the wide format (differences and explanations here).

The best format to give to stats software is the long format, i.e. one row represents one point of measure.

The two methods we will need here are pivot / pivot_table (going from long to wide) and melt (doing the opposite, wide -> long).

Let's see this with some examples !
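Before the real data, here is a minimal round-trip on a toy table (hypothetical sensor readings, invented for illustration) :

```python
import pandas as pd

# Toy long-format data : one row = one measurement
long_df = pd.DataFrame({
    "time": [1, 1, 2, 2],
    "sensor": ["a", "b", "a", "b"],
    "value": [10.0, 20.0, 11.0, 21.0],
})

# long -> wide : one row per time, one column per sensor
wide_df = long_df.pivot(index="time", columns="sensor", values="value")

# wide -> long : back to one measurement per row
back = wide_df.reset_index().melt(id_vars="time", value_name="value")
```

After the round-trip, `back` contains the same four measurements as `long_df`, just possibly in a different row order.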

We will show it with a new dataframe containing the heart rate of 14 people, from fitbit :

from skrub import TableReport

dataset = openml.datasets.get_dataset(46103)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)


TableReport(X)


The data comes in the long format, which is what we want for statistical analysis. But suppose we want time as the index and one column per individual, each cell holding a heart rate value.

The pivot methods let us do that. pivot and pivot_table are similar, but pivot refuses duplicated (index, columns) pairs instead of aggregating them, so you are sure no row silently disappears in the process.
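To see the difference concretely, here is a toy table (hypothetical values) with a duplicated (Time, Id) pair :

```python
import pandas as pd

# Hypothetical long table with a duplicated (Time, Id) pair
dup = pd.DataFrame({
    "Time": [1, 1, 2],
    "Id": ["a", "a", "a"],
    "Value": [60, 62, 70],
})

# pivot refuses the duplicate instead of silently aggregating it
try:
    dup.pivot(values="Value", index="Time", columns="Id")
except ValueError as err:
    print("pivot raised:", err)

# pivot_table aggregates duplicates (mean by default)
wide = dup.pivot_table(values="Value", index="Time", columns="Id")
```

With pivot_table the two measurements at Time 1 are averaged into a single cell, which may or may not be what you want.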

wide_df = X.pivot(
    values = "Value",
    index = "Time",
    columns = "Id"
)
TableReport(wide_df)


We can see that there are a lot of missing values : the participants did not record their measurements at the same timestamps, so most of the data points don't match up.

In the table report, we can see distribution of heart rates and see that some people are in better shape than others.

To get back to the original long format we use the melt method :

(
    wide_df
    .reset_index()
    .melt(id_vars = "Time")
)
Time Id value
0 3/29/2016 10:00:00 AM 2022484408 NaN
1 3/29/2016 10:00:05 PM 2022484408 NaN
2 3/29/2016 10:00:10 PM 2022484408 NaN
3 3/29/2016 10:00:15 AM 2022484408 NaN
4 3/29/2016 10:00:15 PM 2022484408 NaN
... ... ... ...
7148353 4/9/2016 9:59:55 AM 8877689391 74.0
7148354 4/9/2016 9:59:55 PM 8877689391 NaN
7148355 4/9/2016 9:59:57 PM 8877689391 83.0
7148356 4/9/2016 9:59:58 PM 8877689391 NaN
7148357 4/9/2016 9:59:59 PM 8877689391 NaN

7148358 rows × 3 columns

We see that we have way more rows than before : melt kept the NaN values. We can drop them with the dropna method :

(
    wide_df
    .reset_index()
    .melt(id_vars = "Time")
    .dropna(subset = ["value"])
)
Time Id value
35721 4/1/2016 10:00:00 AM 2022484408 98.0
35728 4/1/2016 10:00:05 AM 2022484408 97.0
35741 4/1/2016 10:00:15 AM 2022484408 98.0
35746 4/1/2016 10:00:20 AM 2022484408 99.0
35752 4/1/2016 10:00:25 AM 2022484408 101.0
... ... ... ...
7148343 4/9/2016 9:59:42 PM 8877689391 87.0
7148345 4/9/2016 9:59:45 AM 8877689391 73.0
7148350 4/9/2016 9:59:52 PM 8877689391 86.0
7148353 4/9/2016 9:59:55 AM 8877689391 74.0
7148355 4/9/2016 9:59:57 PM 8877689391 83.0

1154681 rows × 3 columns

And now we have the same data as before.

Time

This is not my speciality, but pandas is very good at dealing with dates and temporal data. Let's see a bit of it here :) (the doc, better than this tuto, can be found here). We are lucky to have fitbit data with the heart rates of 14 people over several days.

We see that pandas treats the time column as an object, which is not what we want. Let's convert this column to a date :)

cleaned_df = (
    X
    .assign(time_col = lambda df_: pd.to_datetime(df_.Time))
    .set_index("time_col")
)
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1154681 entries, 2016-04-01 07:54:00 to 2016-04-12 09:46:45
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype   
---  ------  --------------    -----   
 0   Id      1154681 non-null  category
 1   Time    1154681 non-null  object  
 2   Value   1154681 non-null  uint8   
dtypes: category(1), object(1), uint8(1)
memory usage: 19.8+ MB

Now it is cool ! We got where we wanted to go.
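A taste of why a DatetimeIndex is worth having : resampling. Here is a minimal sketch on a hypothetical miniature of cleaned_df (invented values, same structure) :

```python
import pandas as pd

# A hypothetical miniature of cleaned_df : DatetimeIndex plus a Value column
idx = pd.to_datetime([
    "2016-04-01 07:54:00", "2016-04-01 07:54:05",
    "2016-04-01 08:10:00", "2016-04-01 08:10:05",
])
mini = pd.DataFrame({"Value": [60, 62, 70, 72]}, index=idx)

# With a DatetimeIndex, aggregating to hourly means is a one-liner
hourly = mini.resample("1h")["Value"].mean()
```

On the real data you would first group by "Id" so each person gets their own hourly averages.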

import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(
    cleaned_df,
    x = cleaned_df.index,
    y = "Value",
    hue = "Id"
)
plt.xticks(rotation = 45);

TODO : I don’t know what to say about it ! MB :/

Processing bonus

Sometimes you get a lot of files in some directory that you want to merge into one single dataframe. This is actually very simple and doable with a single beautiful and nice for-loop héhé.

Lets say you have a directory like that :

├── merge.py
└── data
    ├── car_1_day_1.csv
    ├── car_2_day_1.csv
    ├── car_1_day_2.csv

In your merge.py file you will iterate through your data directory to load each csv as a pandas dataframe. All those files will be merged, so each has to contain information to differentiate it.

In this case I will add columns to the dataframe of each csv file, based on the information contained in the filename.

from pathlib import Path

csv_files = Path("data").glob("*.csv")

csv_file_list = []

for file in csv_files:
    # file is a pathlib.Path ; file.stem is e.g. "car_1_day_1"
    parts = file.stem.split("_")
    car_id = parts[1]
    day_id = parts[3]
    file_dataframe = (
        pd.read_csv(file)
        .assign(car_id = car_id)
        .assign(day_id = day_id)
    )
    csv_file_list.append(file_dataframe)


final_dataframe = pd.concat(csv_file_list)
final_dataframe.to_csv("final_dataframe.csv")

final_dataframe holds the content of every sub-dataframe, saved to a single csv. Note that all the dataframes should have the same columns, as they are concatenated vertically.
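For contrast, here is a minimal sketch of the horizontal counterpart, joining two tables on a shared key (toy, hypothetical tables) :

```python
import pandas as pd

# Hypothetical toy tables sharing a key column
cars = pd.DataFrame({"car_id": ["1", "2"], "top_speed": [180, 210]})
days = pd.DataFrame({"car_id": ["1", "2", "2"], "day_id": ["1", "1", "2"]})

# Horizontal join on the key : each day row gets its car's attributes
joined = days.merge(cars, on="car_id", how="left")
```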

Joining horizontally according to a primary key is completely feasible as well, but is not the purpose of this article. If you need it don't hesitate to contact me !

Bonus : how to model off of this ?

Let's use skrub !

Skrub is a library that complements scikit-learn and works natively on dataframes. It is soooooooo nice ! Let's see some examples that could be transposed to other problems.

Let's go back to the car price prediction.

First, ML models require numbers as input, not strings or anything else. You could work on each variable to decide how best to encode it, but the most parsimonious processing is provided by skrub with the TableVectorizer.

from skrub import TableVectorizer
import skrub
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)


data_var = skrub.var("data", X)

X = data_var.drop(columns="price").skb.mark_as_X()
y = data_var["price"].skb.mark_as_y()

Let's start with the DataOps utilities ! We just defined our X and y, which will be used later to cross-validate our estimators. To say a few words about the TableVectorizer (applied below) : it one-hot encodes categorical variables with few modalities and uses a different encoding for categories with a high number of modalities.
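As a rough mental model of the one-hot part (plain pandas here, not what TableVectorizer literally does internally ; the column and values are invented) :

```python
import pandas as pd

# A hypothetical low-cardinality categorical column
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One-hot encoding : one indicator column per modality
encoded = pd.get_dummies(df, columns=["fuel"])
```

Each row now has a 1 (True) in the column of its modality and 0 (False) elsewhere, which is something a linear model can digest.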

Now the modelling part. We are doing regression, so the simple reference model should be linear regression. Let's import some regression estimators from sklearn.

We can use skrub's DataOps capabilities to do the model selection : we will compare the dummy baseline against the simple linear model and the default state-of-the-art regressor, using 2 splits of cross-validation.

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from tqdm.auto import tqdm
import plotly.express as px
model = skrub.choose_from([LinearRegression(), HistGradientBoostingRegressor(), DummyRegressor()])

preds = X.skb.apply(TableVectorizer()).skb.apply(model, y=y)

search = preds.skb.make_randomized_search(
    random_state=1, 
    fitted=True, 
    n_iter=3,
    cv = 2
)


results = search.results_
px.bar(
    results.assign(model = results.iloc[:,0]),
    x= "model",
    y = "mean_test_score"
)

The HPO results are supposed to be plotted using the following syntax, but it looks like I have a bug with the plotly plotting layout.

I put it here anyway until I find a fix :

search.plot_results()