This is a file I send to people so they have some code samples they can copy-paste and adapt to their own problems. If you need something that is not on this page, or you spot a mistake (there are likely to be some, you never know), let me know and this page will grow as people ask me to update it :).
Getting the data
In most cases you will already have a file produced by some measurement system. Pandas can read the most common file types: csv (separated by commas, semi-colons or any other character), Excel files and the newer but efficient parquet files.
Everything happens with the pandas.read_* functions :
import pandas as pd

pd.read_csv("files.csv", sep=",", index_col=0, ...)
pd.read_excel("file.xlsx")
pd.read_parquet("file.parquet")
The sep argument tells pandas which character separates the columns.
Toy data
Things with cars ! it must be good hehehe
This first step only gets the data and is not the important part. We use openml to load the data. It is a rich collection of datasets that is very useful for getting toy data.
import numpy as np
import pandas as pd
import openml

datasets_list = openml.datasets.list_datasets(
    output_format="dataframe", status="active", tag="manufacturing"
)
dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute
)
Understanding the data
The tables
Pandas provides multiple ways to inspect our dataset. Here is a non-exhaustive list of attributes we could have used if we were students learning the tool to master it :
X.shape : number of rows and columns
X.info() : names and types of columns, and number of missing values
X.describe() : statistical summary of each numerical column (extremums, quantiles, …)
X.head(n) : show the first n rows of your dataset
But we are here to be efficient and get the information we want in an easy way, so I will show you the only tool you need to get to know your data:
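from skrub import TableReport

# interactive overview of the whole dataframe
TableReport(X)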
[skrub TableReport of X: 1,000,000 rows × 25 columns, no missing values; 10 categorical columns (make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, engine-type, num-of-cylinders, fuel-system) and 15 numerical columns (normalized-losses, wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, price). Note that price and compression-ratio have negative minimum values.]
It produces an interactive html view of your table that you can explore. You have everything you need: dimensions, statistical summaries, classic graphs with the distribution of numerical data, etc. It is everything you could have done with pandas alone, but more.
The graphs
The available plotting tools are numerous. If you want an idea of what is possible with python, here is a gallery with a lot of nice examples. I will focus on the most basic but most useful plotting solution.
The library we will use is seaborn, which is designed to work on dataframes. It is based on matplotlib, the basic plotting brick of python, which is quite hard to use. Seaborn gets things done quickly and nicely.
We will see plotting by asking questions :
What is the proportion of each car maker ?
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
sns.histplot(X, x="make", stat="density")
plt.xticks(rotation=90);
Mutating the data
Pandas provides a lot of very useful methods to filter, modify and delete items from a dataframe.
Warning
The philosophy in data science is to NEVER modify by hand the data you are studying.
All your processing should be done in one place and be clear.
Here, we will see the most popular operations you might need in your work.
Let’s remember what the dataframe looks like :
X
[output: preview of X — 1000000 rows × 25 columns]
and see what the prices look like by brand :
plt.figure(figsize=(8, 4))
sns.boxplot(X, x="make", y="price")
plt.xticks(rotation=90);
Filtering rows
We have negative prices, which looks like errors, so let's start by filtering rows.
Pandas has a very handy method: query. This method eats a string that contains conditions; in this string you can use column names as variables, as well as external variables from your code (prefixed with @), like this:
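price_error_threshold = 0
X.query("price >= @price_error_threshold")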
You see that the table now only contains 999 875 elements: some items have been removed. query returns a modified copy of the dataframe, so if you want to keep the filtered dataframe you need to assign it to a new variable, like that:
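new_df = X.query("price >= @price_error_threshold")
new_df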
I don't know what the cleanest way to apply multiple filters is, but I would recommend applying them in a sequential manner, to be as clean as possible, like that:
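(
    X
    .query("price > 0")
    .query("make == 'porsche'")
    .query("aspiration == 'turbo'")
    .query("horsepower > 200")
)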
I wonder how the horsepowers are distributed for each brand :
plt.figure(figsize=(8, 4))
sns.histplot(new_df, x="horsepower", hue="make")
We can see an interesting pattern in the horsepower distribution! There are peaks, but we have too many brands to see a pattern, so let's only compare some of them :
plt.figure(figsize=(8, 4))
sns.histplot(
    (
        new_df
        .query("make.isin(['peugot', 'porsche', 'bmw'])")
        .assign(make=lambda df_: df_.make.astype(object))
    ),
    x="horsepower",
    hue="make",
    stat="density",
)
Nothing interesting here. Let's see how to mutate columns. You might want to create new columns from a processing of some other columns.
Tip: Anonymous functions
There is a part of python that is usually not that useful but that we will need here: anonymous functions (lambda). You use one when a function is needed only once, typically when another function takes a function as an argument.
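For example, a minimal illustration of the syntax (plain python, unrelated to the car data):

double = lambda x: x * 2
double(3)                                         # 6
sorted(["bb", "a", "ccc"], key=lambda s: len(s))  # ['a', 'bb', 'ccc']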
Adding columns
In pandas, we use anonymous functions in various cases; let's see examples with the assign method :
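(
    X.assign(volume=X.height * X.width * X.length)
)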
This computation works because we add the column to X itself, so X.height is a column of X. But in most cases we want to chain operations on the same dataframe, like so :
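(
    X
    .query("make == 'volvo'")
    .assign(volume=lambda df_: df_.height * df_.width * df_.length)
)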
Here, the assign method takes as input an anonymous function that processes a dataframe. Indeed, in the previous cell we filter some rows of X with the query step, so when assign is run, the dataframe at that step is no longer X (it has fewer rows). So, in assign, it is required to use the anonymous function syntax, like this :
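# replace this pattern (col1, col2, col3 are placeholder column names) :
X.assign(new_col=(X.col1 * X.col2) / X.col3 + 7)
# by this one :
X.assign(new_col=lambda df_: (df_.col1 * df_.col2) / df_.col3 + 7)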
Strings
String columns are very common and you might want to process strings in your daily life :). Pandas can help you with that !
Here we will process the make column with some examples. It's not very useful since this dataframe is already clean, but the functions you will need are the same !
(
    X.make.str.split("-", expand=True)
)
[output: 1000000 rows × 2 columns — make split on "-"; makes without a dash get None in the second column]
Here we create 2 columns from a single column: we split it by a pattern (here "-") and create a new column from each part of the split. If you know you only want one part of the split string, it's even easier (keeping the first element in this example) :
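(
    X.make.str.split("-").str[0]
)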
Grouping
You might want to get a statistical summary for sub-categories of individuals, or per brand, item, person or whatever. Grouping is very easy and modular in pandas, as you can see here :)
This is the syntax I prefer, because you can compute several aggregations from the same grouping:
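grouped = (
    X
    .query("price > 0")
    .assign(volume=lambda df_: df_.width * df_.height * df_.length)
    .assign(price_per_cubic_whateverunit=lambda df_: df_.price / df_.volume)
    .groupby(by="make")
    .agg(
        mean_length=("length", lambda s: s.mean()),
        max_price=("price", lambda s: s.max()),
        price_std=("price", lambda s: s.std()),
        median_ppcw=("price_per_cubic_whateverunit", lambda s: s.median()),
    )
    .sort_values(by="median_ppcw", ascending=False)
)
grouped.head()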
make            mean_length     max_price        price_std      median_ppcw
peugot          181.020325      57085.028930     8495.916584    0.019026
volvo           179.189219      59808.006804     8661.579270    0.018844
mercedes-benz   178.606458      58246.449115     8391.706783    0.018640
porsche         173.588336      57863.174442     7903.200039    0.018223
bmw             177.121137      63918.819174     8276.675082    0.018152
If you want the best volume/price ratio you should probably get a peugot or a volvo.
Indexing
Columns
This is the most useful part (last but not least, as they say). If you have a big dataframe and you only want a subset of it, how can you do it?
You can either index by indices or by names. The simplest way is to subset by names when you only want to select some columns :
X.loc[:,["make", "fuel-type"]]
[output: the make and fuel-type columns — 1000000 rows × 2 columns]
If you only want one column (called a pandas series) you can just use the dot notation :
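X.make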
In the loc example above, the colon selects all the rows, and the list passed to loc selects the columns whose names are in it. If you want to select all columns whose name matches some string pattern, you can use the filter method :
X.filter(like ="engine")
[output: the engine-location, engine-type and engine-size columns — 1000000 rows × 3 columns]
If you want to select only the columns with some required data types you can do :
X.select_dtypes("number") # can ask for object | category
[output: the 15 numerical columns — 1000000 rows × 15 columns]
Rows
Well, here you can select rows, but you can't really subset them by name; you can however randomly select a subset of all rows with the sample method :
X.sample(10)
[output: 10 randomly sampled rows — 10 rows × 25 columns]
Shape of the data
Once upon a time, people fought about which was best between the long and the wide format (differences and explanations here).
The best format to give to stats software is the long format, i.e. one row represents one point of measure.
The two methods we will need here are pivot_table (going from long to wide) and melt (doing the opposite, wide -> long).
Let's see with some examples !
We will show it with a new dataframe containing the heart rate of 14 people, from fitbit :
dataset = openml.datasets.get_dataset(46103)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute
)
TableReport(X)
[skrub TableReport of X — Id: CategoricalDtype, 14 unique values; Time: ObjectDType, 510597 unique values; Value: UInt8DType, mean 79.8, std 18.7, min 36, median 77, max 185]
The base format is the long format, which is the format we would want for statistical analysis, but say we want time as the index, one column per individual, and heart rates as values.
The pivot methods allow us to do it. pivot and pivot_table are similar, but pivot doesn't allow duplicate values (it raises an error instead of aggregating), so you are sure no row silently disappears in the process.
wide_df = X.pivot(values="Value", index="Time", columns="Id")
TableReport(wide_df)
[skrub TableReport of wide_df: Time as index, one Float64 column per Id (14 columns), each column having between 44% and 99.9% missing values]
We can see that there are a lot of missing values, because people have not recorded their measurements at the same times, so most of the timestamps don't match across individuals.
In the table report, we can see the distribution of heart rates and see that some people are in better shape than others.
To come back to the original long form we use the melt method:
(
    wide_df
    .reset_index()
    .melt(id_vars="Time")
)
[output: 7148358 rows × 3 columns (Time, Id, value), including rows where value is NaN]
We see that we have way more rows than before; that's because melt kept the NaN values. We can drop them with the dropna method :
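(
    wide_df
    .reset_index()
    .melt(id_vars="Time")
    .dropna(subset=["value"])
)

And now we have the same data as before.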
Time
This is not my speciality, but pandas is very good at dealing with dates and temporal data. Let's see a bit of it here :) (the doc, better than this tuto, can be found here). We are lucky to have fitbit data with the heart rates of 14 people over several days.
We saw that pandas treats the Time column as an object; that is not what we want, so let's convert this column to a date :)
cleaned_df = (
    X
    .assign(time_col=lambda df_: pd.to_datetime(df_.Time))
    .set_index("time_col")
)
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1154681 entries, 2016-04-01 07:54:00 to 2016-04-12 09:46:45
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1154681 non-null category
1 Time 1154681 non-null object
2 Value 1154681 non-null uint8
dtypes: category(1), object(1), uint8(1)
memory usage: 19.8+ MB
Now it is cool ! We got where we wanted to go.
sns.lineplot(cleaned_df, x=cleaned_df.index, y="Value", hue="Id")
plt.xticks(rotation=45);
TODO : I don’t know what to say about it ! MB :/
Processing bonus
Sometimes you get a lot of files in some directory that you want to merge into one single dataframe. This is actually very simple and doable in a single beautiful for-loop héhé.
Let's say you have a directory like that :
├── merge.py
└── data
    ├── car_1_day_1.csv
    ├── car_2_day_1.csv
    ├── car_1_day_2.csv
In your merge.py file you will iterate through your data directory to load each csv as a pandas dataframe. All those files will be merged, so they have to carry information that differentiates them. In this case I will add columns to the dataframe of each csv file based on the information contained in the filename:
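from pathlib import Path

import pandas as pd

csv_files = Path("data").glob("*.csv")
csv_file_list = []
for file in csv_files:
    # filenames are assumed to look like car_<car>_day_<day>.csv,
    # so the ids are the 2nd and 4th parts of the stem
    car_id = file.stem.split("_")[1]
    day_id = file.stem.split("_")[3]
    file_dataframe = (
        pd.read_csv(file)
        .assign(car_id=car_id)
        .assign(day_id=day_id)
    )
    csv_file_list.append(file_dataframe)

final_dataframe = pd.concat(csv_file_list)
final_dataframe.to_csv("final_dataframe.csv")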
final_dataframe is one big dataframe with the content of each sub-dataframe. Note that all the dataframes should have the same columns, since they are concatenated vertically.
Joining horizontally according to a primary key is completely feasible as well, but it is not the purpose of this article. If you need it, don't hesitate to contact me !
Bonus : how to model off of this ?
Let's use skrub !
Skrub is a library that complements scikit-learn and is designed to work on dataframes. It is soooooooo nice ! Let's see some examples that could be transposed to any other problem.
Let's go back to the car price prediction.
First, ML models require numbers as input, not strings or anything else. You could work on each variable to understand how to encode it best, but the most parsimonious processing is provided by skrub with the TableVectorizer :
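from skrub import TableVectorizer
import skrub

dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute
)
# wrap the raw table in a skrub DataOps variable and mark the design matrix and target
data_var = skrub.var("data", X)
X = data_var.drop(columns="price").skb.mark_as_X()
y = data_var["price"].skb.mark_as_y()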
The vectorized table is what we will use for our statistical modelling :). To say a few words about the TableVectorizer primitive: it one-hot encodes categorical variables with few modalities and uses a dedicated encoder for categories with many modalities.
Let's start with the DataOps utilities ! In the code above we defined our X and y, which will be used later to cross-validate our estimators.
Now the modelling part. We are doing regression, so the baseline should be a linear regression; let's import some regression estimators from sklearn.
We can use skrub's DataOps capabilities to do the model selection; we will compare a dummy baseline against the simple linear model and the default sota regressor using cross-validation :
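from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.dummy import DummyRegressor
import plotly.express as px

# let skrub choose between the three estimators during the search
model = skrub.choose_from(
    [LinearRegression(), HistGradientBoostingRegressor(), DummyRegressor()]
)
preds = X.skb.apply(TableVectorizer()).skb.apply(model, y=y)
search = preds.skb.make_randomized_search(
    random_state=1, fitted=True, n_iter=3, cv=2
)
results = search.results_
px.bar(
    results.assign(model=results.iloc[:, 0]),
    x="model",
    y="mean_test_score",
)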
The hyper-parameter search results are supposed to be plotted using the following syntax, but it looks like I have some bug with the plotly plotting layout.
I put it here anyway until I find a fix :
search.plot_results()