Pandas 101

DeepLearning
ENG
code
tuto
Author

Julien Combes

Published

December 23, 2025

This is a file I send to people so they have some code samples they can copy-paste and adapt to their own problems. If you need something that is not on this page, or you spot a mistake (there are likely some, we never know), let me know and this page will grow as people ask me to update it :).

Getting the data

In most cases you will already have a file produced by some measurement system. Pandas can read most common file types: CSV (separated by commas, semicolons or any character), Excel files, and the newer but very efficient Parquet files.

Everything happens with the pandas.read_* functions:

import pandas as pd

pd.read_csv("file.csv", sep=",", index_col=0)  # plus many other options
pd.read_excel("file.xlsx")
pd.read_parquet("file.parquet")

The sep argument tells pandas which character separates the columns.
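As a small sketch of how sep and index_col interact, here is a semicolon-separated CSV (the content is made up) read back from an in-memory buffer:

```python
import pandas as pd
from io import StringIO

# A tiny inline CSV, semicolon-separated, as many European tools export.
raw = "id;make;price\n0;volvo;26633\n1;nissan;5900\n"

# index_col=0 uses the first column ("id") as the row index.
df = pd.read_csv(StringIO(raw), sep=";", index_col=0)

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['make', 'price']
```

With the wrong sep, pandas would read each line as a single column, so it is worth checking the result with df.head() right after loading.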

Toy data

Things with cars! It must be good, hehe.

This first step only fetches the data and is not important in itself. We use OpenML to load it: a rich collection of datasets that is very useful for getting toy data.

import numpy as np
import pandas as pd
import openml

datasets_list = openml.datasets.list_datasets(output_format="dataframe", status="active", tag="manufacturing")


dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)

Understanding the data

The tables

Pandas provides multiple ways to inspect our dataset. Here is a non-exhaustive list of attributes and methods we could use if we were students learning the tool to master it:

  • X.shape : number of rows and columns
  • X.info() : column names, dtypes and number of missing values
  • X.describe() : statistical summary of each numerical column (extrema, quantiles, …)
  • X.head(n) : show the first n rows of your dataset
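These four calls can be sketched on a tiny stand-in dataframe (the values are made up, just to show what each call returns):

```python
import pandas as pd

# A hypothetical miniature of the cars dataset.
df = pd.DataFrame({
    "make": ["volvo", "nissan", "honda"],
    "price": [26633.5, 5900.7, 9831.4],
})

print(df.shape)       # (3, 2): 3 rows, 2 columns
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # count / mean / std / min / quartiles / max of numeric columns
print(df.head(2))     # the first 2 rows
```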

But we are here to be efficient and get the information we want the easy way. So let me show you the only tool you need to get to know your data:

from skrub import TableReport

TableReport(X)

It produces an interactive HTML view of your table that you can explore: dimensions, statistical summaries, the classic distribution plots for numerical data, etc. Everything you could have done with plain pandas, and more.

The graphs

The available plotting tools are numerous. If you want an idea of what is possible with Python, here is a gallery with a lot of nice examples. I will focus on the basic but most useful plotting solution.

The library we will use is seaborn, which is designed to work on dataframes. It is built on matplotlib, the basic plotting brick of Python, which is quite hard to use directly. Seaborn gets things done quickly and nicely.

We will explore plotting by asking questions:

What is the proportion of each car maker ?

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,4))
sns.histplot(
    X, 
    x = "make",
    stat = "density"
)
plt.xticks(rotation = 90);

Mutating the data

Pandas provides a lot of very useful methods to filter, modify or delete items from a dataframe.

Warning

The philosophy in data science is to NEVER modify the data you are studying by hand.

All your processing should be done in one clear place, in code.

As examples, we will now go through the most popular operations you might need in your work.

Let’s remember what the dataframe looks like :

X
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

1000000 rows × 25 columns

and see what the prices look like by brand :

plt.figure(figsize=(8,4))
sns.boxplot(
    X, 
    x = "make",
    y = "price",
)
plt.xticks(rotation = 90);

Filtering rows

We have negative prices, which look like errors, so let's start by filtering rows.

Pandas has a very handy method: query. This method eats a string containing conditions; in this string you can use column names as variables, and reference external variables from your code with @, like this:

price_error_threshold = 0
X.query("price >= @price_error_threshold")
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

999875 rows × 25 columns

You see that the table only contains 999 875 rows: some items have been removed. query returns a filtered copy of the dataframe, so if you want to keep the result, you need to assign it to a new variable like this:

new_df = X.query("price >= @price_error_threshold")
new_df
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 89.249734 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... 111.024531 mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... 96.317551 mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 144.228109 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 95.829115 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 141.139493 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... 94.672790 mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... 90.780708 mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

999875 rows × 25 columns

I don’t know what the cleanest way to apply multiple filters is, but I would recommend applying them sequentially, to stay as readable as possible, like this:

(
    X
    .query("price > 0")
    .query("make == 'porsche'")
    .query("aspiration== 'turbo'")
    .query("horsepower > 200")
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
3190 185.876800 porsche gas turbo two wagon fwd rear 105.204967 199.830908 ... 94.192687 spfi 3.077697 3.299996 8.572648 203.141156 5247.807116 21.175022 20.079863 16825.034664
7706 169.592917 porsche gas turbo two hardtop rwd rear 94.166153 170.515142 ... 103.474633 spdi 3.343738 3.320666 9.000000 221.580768 5125.481092 24.691797 29.570073 32931.723172
13028 80.884942 porsche gas turbo four sedan rwd front 111.122111 189.999299 ... 165.584111 2bbl 2.938367 3.272089 9.357822 227.389179 4334.123031 18.965615 26.375387 16813.144257
16276 73.890834 porsche gas turbo two hatchback rwd front 109.779892 178.833958 ... 96.387741 mfi 3.651589 3.190689 9.000000 219.965458 5299.653454 19.982165 19.999320 14079.390642
27734 73.767420 porsche gas turbo two hatchback rwd front 106.649009 169.153863 ... 89.379986 mpfi 2.969864 3.084899 9.649326 206.358553 5823.265443 19.884336 24.375660 14600.078186
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
977214 74.475238 porsche gas turbo two sedan rwd front 102.583083 178.282952 ... 119.496931 mpfi 2.909086 2.523007 9.333629 206.509929 5628.363546 21.398545 23.528196 35183.174092
980475 174.669041 porsche gas turbo two convertible rwd front 96.008550 169.611921 ... 114.554297 spdi 3.556880 3.964338 7.101980 228.480555 6094.263138 17.284775 23.653810 14130.291772
981505 87.402943 porsche diesel turbo four sedan rwd front 96.954457 184.067197 ... 95.343936 idi 3.312339 2.429638 9.531729 241.163621 4485.750026 16.386203 28.877237 24801.794845
989078 117.694568 porsche gas turbo four hardtop 4wd front 109.253923 187.952848 ... 135.856749 mfi 3.510137 2.881988 9.431905 217.754979 5080.599992 33.395213 31.847391 14830.463212
996572 97.112480 porsche gas turbo four sedan rwd front 105.481551 201.169575 ... 104.631448 mpfi 3.137802 2.439404 16.990415 228.311859 5363.232239 18.427064 20.890147 16440.108451

299 rows × 25 columns

I wonder how the horsepower is distributed for each brand:

plt.figure(figsize=(8,4))
sns.histplot(
    new_df,
    x = "horsepower",
    hue = "make"
)

We can see an interesting pattern in the horsepower distribution! There are peaks, but we have too many brands to see a pattern clearly, so let's only compare some of them:

plt.figure(figsize=(8,4))
sns.histplot(
    (
        new_df
        .query("make.isin(['peugot','porsche','bmw'])")
        .assign(make  = lambda df_: df_.make.astype(object))
    ),
    x = "horsepower",
    hue = "make",
    stat = "density",
)

Nothing interesting here. Let’s see how to mutate columns. You might want to create new columns computed from other columns.

Tip: Anonymous functions

There is a part of Python that often seems useless but that we need here: anonymous functions (lambda). You use one when you need a function for a single, one-off use, typically when another function takes a function as argument.
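A minimal sketch of the idea: a lambda is just a function without a name, most useful passed inline to something that expects a function.

```python
# A named function and its anonymous (lambda) equivalent.
def double(x):
    return 2 * x

double_anon = lambda x: 2 * x

assert double(21) == double_anon(21) == 42

# Typical use: passed inline as the key of a sort.
values = [3, 1, 2]
print(sorted(values, key=lambda v: -v))  # [3, 2, 1]
```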

Adding columns

In pandas, we use anonymous functions in various places; let's see examples with the assign method:

(
    X.assign(volume = X.height * X.width * X.length)
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price volume
0 94.235189 volkswagen diesel turbo four hatchback fwd front 96.575655 173.369361 ... 2bbl 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961 619292.173905
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618 730511.069668
2 118.472162 nissan gas std four wagon fwd front 95.024314 165.909864 ... mpfi 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970 566642.367482
3 62.717600 porsche gas std two hatchback rwd front 100.174878 182.499698 ... 1bbl 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741 633282.290282
4 105.808352 honda gas std four sedan fwd front 96.792849 172.594242 ... 2bbl 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432 617652.454060
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 toyota gas std two hatchback rwd front 96.798115 172.721704 ... 2bbl 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987 623103.391042
999996 117.541993 alfa-romero gas std two sedan rwd front 95.471999 172.728564 ... mpfi 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670 582494.176541
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033 758006.025497
999998 195.508001 audi gas std two hatchback fwd front 98.163839 175.115516 ... mpfi 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231 639722.547021
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933 550103.184730

1000000 rows × 26 columns

This computation works because we add the column to X itself, so X.height really is a column of the dataframe we are assigning to. But in most cases we want to chain operations on the same dataframe, like so:

(
    X
    .query("make == 'volvo'")
    .assign(volume = lambda df_: df_.height * df_.width * df_.length)
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price volume
1 106.648646 volvo diesel turbo four sedan rwd front 109.846269 196.214268 ... mfi 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618 730511.069668
9 98.278149 volvo gas std four hatchback rwd front 113.533132 192.410446 ... mpfi 3.694279 3.083717 8.902826 164.839204 5006.561572 19.564961 20.105817 14736.190072 629304.174602
45 124.123256 volvo gas std four hardtop rwd front 99.312390 190.208244 ... 2bbl 3.608365 3.225135 7.737136 66.343513 4757.383809 28.393852 25.772739 10499.685299 663866.518027
51 152.911817 volvo gas std two hatchback rwd front 95.413365 174.632362 ... spdi 3.084798 2.667112 10.227535 214.732626 5136.182992 19.265381 23.848923 6782.556436 673043.970292
66 92.550045 volvo gas std two sedan rwd front 101.931293 197.169953 ... mpfi 2.899100 3.101075 7.748074 120.767322 5186.690201 22.828964 39.515080 14098.589700 750008.948305
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999868 84.057334 volvo gas std four sedan rwd front 105.603092 189.582948 ... mpfi 3.421972 2.382150 6.177299 77.153412 4172.803042 18.686774 20.241031 5961.977685 717292.274429
999930 96.830261 volvo gas std four sedan fwd front 108.412834 195.293053 ... 2bbl 3.337004 3.898358 9.000000 100.049448 6351.331883 23.539770 33.046702 8267.274693 653497.733018
999936 91.291390 volvo gas turbo four sedan rwd front 95.221499 173.070373 ... spdi 3.332252 3.277535 15.727727 153.916788 5037.995696 20.484083 28.600490 7752.476237 531707.168849
999952 93.885399 volvo gas std four wagon fwd front 106.762610 199.221956 ... mpfi 2.957202 3.721575 9.000000 128.319584 5189.472664 28.267098 35.193040 32660.749929 735475.748951
999976 218.868694 volvo gas std four wagon 4wd front 105.205247 174.310172 ... 2bbl 3.519628 3.096898 17.079278 92.305693 5951.340505 28.094625 29.207487 10869.807786 565988.124681

42242 rows × 26 columns

Here, the assign method takes an anonymous function that receives the dataframe as it is at that step of the chain. Indeed, in the previous cell we filtered some rows of X with the query step, so by the time assign runs, the dataframe is no longer X (it has fewer rows). So, inside a chain, you replace the direct syntax:

X.assign(new_col = (X.col1 * X.col2)/ X.col3 + 7)

by the anonymous-function syntax:

X.assign(new_col = lambda df_: (df_.col1 * df_.col2)/ df_.col3 + 7)
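To make this concrete, here is a minimal sketch with made-up column names and values: the lambda receives the already-filtered dataframe, so the new column lines up with exactly the rows that survived the query.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

# df_ inside the lambda is the dataframe *after* the query step,
# not the original df.
out = (
    df
    .query("col1 > 1")
    .assign(product = lambda df_: df_.col1 * df_.col2)
)
print(out["product"].tolist())  # [40.0, 90.0]
```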

Strings

String columns are very common and you will want to process strings in your daily life :). Pandas can help you with that!

Here we will process the make column with some examples. It's not very useful here since this dataframe is already clean, but the functions you will need are the same!

(
    X.make.str.split("-", expand = True)
)
0 1
0 volkswagen None
1 volvo None
2 nissan None
3 porsche None
4 honda None
... ... ...
999995 toyota None
999996 alfa romero
999997 mercedes benz
999998 audi None
999999 mercedes benz

1000000 rows × 2 columns

Here we create two columns from a single one: we split it on a pattern (here it's "-") and expand the pieces into new columns. If you know you only want one part of the split string, it's even easier (keeping the first element in this example):

(
    X.make.str.split("-").str[0]
)
0         volkswagen
1              volvo
2             nissan
3            porsche
4              honda
             ...    
999995        toyota
999996          alfa
999997      mercedes
999998          audi
999999      mercedes
Name: make, Length: 1000000, dtype: object

You can filter rows based on whether a string cell contains a given pattern, like this:

(
    X.make.str.contains("benz")
)
0         False
1         False
2         False
3         False
4         False
          ...  
999995    False
999996    False
999997     True
999998    False
999999     True
Name: make, Length: 1000000, dtype: bool

This creates a boolean mask that is usable in the query method:

(
    X.query("make.str.contains('benz')")
)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
24 78.145246 mercedes-benz gas std four sedan rwd front 98.627816 184.178243 ... 87.152894 2bbl 3.619763 3.428984 9.000000 116.724261 5546.241459 44.471921 39.600851 13506.102886
27 192.138251 mercedes-benz diesel std four sedan 4wd front 101.633589 155.579818 ... 99.760274 spdi 3.144883 3.948666 9.000000 81.175532 6140.981220 24.632908 28.725574 6952.149473
40 92.975466 mercedes-benz gas std four sedan rwd front 97.569063 165.257082 ... 77.373192 1bbl 3.217990 3.070064 9.000000 78.085362 5082.690039 24.088293 30.222645 14855.739089
80 98.843142 mercedes-benz gas turbo four sedan rwd front 110.920185 189.493656 ... 97.019759 mpfi 3.584555 3.083670 9.451520 73.311237 4567.253302 40.911565 49.135100 29070.307145
97 82.214150 mercedes-benz gas std four hatchback fwd front 94.886181 167.832934 ... 102.136645 2bbl 2.892585 3.131031 7.245234 72.781963 4647.368061 29.180213 33.731709 6339.999308
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999943 73.726165 mercedes-benz gas turbo two sedan fwd front 108.369176 184.750858 ... 124.948909 mpfi 3.626866 3.381363 7.545847 118.186202 4550.313865 24.183442 31.209608 7865.394691
999945 69.088197 mercedes-benz gas std four sedan rwd front 112.227270 193.674685 ... 133.136227 mpfi 3.210714 3.429679 9.267030 131.604353 5343.495652 24.026830 30.508760 13145.043462
999946 77.974004 mercedes-benz gas std four wagon rwd front 113.444076 202.028214 ... 94.055280 mpfi 3.172689 3.557503 8.599569 145.838478 5264.356983 18.653069 27.266048 25412.610927
999997 92.971613 mercedes-benz diesel turbo four wagon fwd front 106.470022 198.734017 ... 142.729616 idi 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999999 102.256740 mercedes-benz diesel std four sedan 4wd front 99.880484 154.602788 ... 100.256336 idi 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

42537 rows × 25 columns

So nice isn’t it ?!
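As an aside, the same boolean mask can also be used with plain bracket indexing, which is strictly equivalent to the query form. A minimal sketch (made-up data; note that .str expressions inside query may require engine="python" when numexpr is installed):

```python
import pandas as pd

df = pd.DataFrame({"make": ["mercedes-benz", "audi", "mercedes-benz"]})

# Build the mask once...
mask = df.make.str.contains("benz")

# ...then use it directly with square brackets:
print(df[mask].shape[0])  # 2 rows kept

# Equivalent to the query form:
assert df[mask].equals(df.query("make.str.contains('benz')", engine="python"))
```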

Chain all operations

The goal is to have all the processing in one single chain of operations, like so:

X_cleaned = (
    X
    .query("price > 0")
    .assign(volume = lambda df_: df_.width * df_.height * df_.length)
    .query("volume <= volume.quantile(.25)")
    .query("make.isin(['nissan', 'peugot'])")
    .query("aspiration == 'turbo'")
)

Grouping

You might want to get a statistical summary for sub-categories of individuals, or aggregate over brands, items, people or whatever. Grouping is very easy and modular in pandas, as you can see here :)

This is the syntax I prefer, because it lets you compute several aggregations from the same column.

grouped = (
    X
    .query("price > 0")
    .assign(volume = lambda df_: df_.width * df_.height * df_.length)
    .assign(price_per_cubic_whateverunit = lambda df_: df_.price / df_.volume)
    # observed=True: only keep categories actually present, and silence
    # the pandas FutureWarning about the default changing.
    .groupby(by = "make", observed = True)
    .agg(
        mean_length = ("length", "mean"),
        max_price = ("price", "max"),
        price_std = ("price", "std"),
        median_ppcw = ("price_per_cubic_whateverunit", "median"),
    )
    .sort_values(by = "median_ppcw", ascending = False)
)
grouped.head()
mean_length max_price price_std median_ppcw
make
peugot 181.020325 57085.028930 8495.916584 0.019026
volvo 179.189219 59808.006804 8661.579270 0.018844
mercedes-benz 178.606458 58246.449115 8391.706783 0.018640
porsche 173.588336 57863.174442 7903.200039 0.018223
bmw 177.121137 63918.819174 8276.675082 0.018152

Since the table is sorted by median price per cubic unit in descending order, a peugot or a volvo is what you pay the most per unit of volume for; if you want the best volume/price ratio, look at the other end of this ranking.

Indexing

Columns

This is the most useful part (last but not least, as they say). If you have a big dataframe and you only want a subset of it, how do you do it?

You can index either by position or by name. The simplest way is to subset by name, when you only want to select some columns:

X.loc[:,["make", "fuel-type"]]
make fuel-type
0 volkswagen diesel
1 volvo diesel
2 nissan gas
3 porsche gas
4 honda gas
... ... ...
999995 toyota gas
999996 alfa-romero gas
999997 mercedes-benz diesel
999998 audi gas
999999 mercedes-benz diesel

1000000 rows × 2 columns

If you only want one column (called a pandas Series), you can just use the dot notation:

X.make
0            volkswagen
1                 volvo
2                nissan
3               porsche
4                 honda
              ...      
999995           toyota
999996      alfa-romero
999997    mercedes-benz
999998             audi
999999    mercedes-benz
Name: make, Length: 1000000, dtype: category
Categories (22, object): ['alfa-romero' < 'audi' < 'bmw' < 'chevrolet' ... 'subaru' < 'toyota' < 'volkswagen' < 'volvo']

With loc, we selected all the rows (the :) and the columns whose names were in the list we provided. If you want to select all the columns whose name contains a given string pattern, you can use the filter method:

X.filter(like = "engine")
engine-location engine-type engine-size
0 front ohc 89.249734
1 front ohc 111.024531
2 front ohc 96.317551
3 front ohcv 144.228109
4 front ohc 95.829115
... ... ... ...
999995 front ohc 141.139493
999996 front ohcv 94.672790
999997 front ohc 142.729616
999998 front ohcf 90.780708
999999 front ohcf 100.256336

1000000 rows × 3 columns

If you want to select only the columns with a given data type, you can do:

X.select_dtypes("number")  # can also ask for "object" or "category"
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 94.235189 96.575655 173.369361 65.329051 54.678538 2193.379366 89.249734 3.122279 3.295534 9.219909 57.578864 5988.161816 31.814042 34.505390 8605.484961
1 106.648646 109.846269 196.214268 68.502701 54.348620 3211.189054 111.024531 3.293933 3.392775 8.637036 113.705719 4490.621234 25.817650 35.053335 26633.568618
2 118.472162 95.024314 165.909864 64.804938 52.702199 2098.071603 96.317551 3.051725 3.248152 9.000000 117.757103 4799.168653 22.783708 27.760480 5900.715970
3 62.717600 100.174878 182.499698 68.385729 50.742249 1879.045004 144.228109 3.238181 3.241202 9.278929 85.762326 5086.244824 14.949801 24.464281 8641.179741
4 105.808352 96.792849 172.594242 64.200921 55.741238 2353.389802 95.829115 3.296874 3.506498 7.068202 73.158430 4417.985102 20.781260 19.686021 9831.490432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 135.343962 96.798115 172.721704 62.991738 57.270320 2847.303204 141.139493 3.874087 3.239200 9.284329 82.423102 4795.277229 18.662406 31.668771 8697.083987
999996 117.541993 95.471999 172.728564 66.455439 50.745425 2847.057699 94.672790 3.206900 3.219425 8.490094 108.563509 5831.253661 17.524008 19.988458 15168.656670
999997 92.971613 106.470022 198.734017 64.394394 59.231453 2148.588288 142.729616 3.360154 2.570015 9.376938 103.584382 4820.162755 16.110262 34.389708 5811.145033
999998 195.508001 98.163839 175.115516 66.112063 55.256875 2833.256423 90.780708 3.108470 3.098080 8.625507 186.347715 5148.094022 24.794536 28.859866 30190.403231
999999 102.256740 99.880484 154.602788 66.169296 53.773750 2307.190623 100.256336 3.290561 3.536298 22.934508 73.317623 4777.166405 23.020483 26.392033 7951.146933

1000000 rows × 15 columns

Rows

Well, here you can of course select rows too, and a very common need is to grab a random subset of all the rows, which is what the sample method does:

X.sample(10)
normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
83491 197.093361 mazda gas turbo two hatchback fwd front 95.392973 167.255735 ... 103.801811 2bbl 3.394902 2.910523 8.886576 74.517819 4681.796429 27.635116 31.926938 10536.424230
248481 140.913852 honda gas std two hatchback fwd front 91.151295 167.453680 ... 137.662655 4bbl 3.885386 3.125268 4.766288 104.927774 5064.190596 16.684367 25.377746 13711.342405
264461 157.157563 dodge gas std two hatchback fwd front 95.103127 172.814581 ... 97.826295 mpfi 2.857029 3.090588 9.000000 88.371563 4794.769624 24.555449 29.280932 7401.049958
980399 142.112089 nissan gas std four sedan fwd front 95.835780 163.238924 ... 105.968642 2bbl 3.115666 3.290441 9.337480 63.626596 5025.679546 32.574230 29.867445 5885.838859
272428 150.544061 renault gas std four sedan rwd front 98.160223 171.300787 ... 84.592308 mpfi 3.223281 3.397151 9.000000 125.014972 4796.745507 35.736208 34.503752 8268.356363
662081 183.936993 jaguar gas std four sedan fwd front 97.548845 174.557561 ... 119.096962 mpfi 3.245586 3.275552 6.911752 113.727681 4760.689718 19.076518 25.700852 5151.664116
890312 93.550394 peugot gas std four sedan rwd front 103.853837 174.907061 ... 103.304423 4bbl 3.101952 3.445152 9.000000 96.615613 4665.775701 15.941203 38.151445 6502.788269
968200 80.420236 toyota gas std four wagon fwd front 95.960047 171.224074 ... 113.124693 mpfi 3.186838 3.012290 8.757598 83.501401 4810.249324 31.619324 33.326111 14121.629337
999680 96.063490 saab gas std four sedan 4wd front 101.410403 180.917694 ... 108.259378 mpfi 3.489375 3.160075 9.000000 123.392444 5117.331306 17.244871 32.839045 10515.510403
281398 139.321670 honda gas turbo two hardtop fwd front 92.661431 153.480296 ... 97.622381 2bbl 3.493441 3.310877 18.896945 90.876814 5646.011784 26.594382 33.536734 9685.452250

10 rows × 25 columns
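Beyond random sampling, rows can also be selected by position with iloc or by index label with loc. A minimal sketch (the index labels and values are made up); passing random_state to sample makes the draw reproducible:

```python
import pandas as pd

df = pd.DataFrame({"make": ["volvo", "nissan", "honda", "audi"]},
                  index=[10, 11, 12, 13])

print(df.iloc[0].make)                     # first row by position: 'volvo'
print(df.iloc[1:3].shape)                  # rows at positions 1 and 2: (2, 1)
print(df.loc[13].make)                     # row by index label: 'audi'
print(df.sample(2, random_state=0).shape)  # reproducible random sample: (2, 1)
```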

Shape of the data

Once upon a time, people were fighting about which was best between the long and the wide format (differences and explanations here).

The best format to give to stats software is the long format, i.e. one row represents one point of measure.

The two methods we will need here are pivot / pivot_table (going from long to wide) and melt (doing the opposite, wide -> long).

Let's see this with some examples !
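Before the real data, here is a minimal round-trip on a toy table (hypothetical sensor readings, invented for illustration) :

```python
import pandas as pd

# Toy long-format data : one row = one measurement
long_df = pd.DataFrame({
    "time": [1, 1, 2, 2],
    "sensor": ["a", "b", "a", "b"],
    "value": [10.0, 20.0, 11.0, 21.0],
})

# long -> wide : one row per time, one column per sensor
wide_df = long_df.pivot(index="time", columns="sensor", values="value")

# wide -> long : back to one measurement per row
back = wide_df.reset_index().melt(id_vars="time", value_name="value")
```

After the round-trip, `back` contains the same four measurements as `long_df`, just possibly in a different row order.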

We will show it with a new dataframe containing the heart rate of 14 people, from fitbit :

from skrub import TableReport

dataset = openml.datasets.get_dataset(46103)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)


TableReport(X)


The data comes in the long format, which is what we want for statistical analysis. But suppose we want time as the index and one column per individual, each cell holding a heart rate value.

The pivot methods let us do that. pivot and pivot_table are similar, but pivot refuses duplicated (index, columns) pairs instead of aggregating them, so you are sure no row silently disappears in the process.
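To see the difference concretely, here is a toy table (hypothetical values) with a duplicated (Time, Id) pair :

```python
import pandas as pd

# Hypothetical long table with a duplicated (Time, Id) pair
dup = pd.DataFrame({
    "Time": [1, 1, 2],
    "Id": ["a", "a", "a"],
    "Value": [60, 62, 70],
})

# pivot refuses the duplicate instead of silently aggregating it
try:
    dup.pivot(values="Value", index="Time", columns="Id")
except ValueError as err:
    print("pivot raised:", err)

# pivot_table aggregates duplicates (mean by default)
wide = dup.pivot_table(values="Value", index="Time", columns="Id")
```

With pivot_table the two measurements at Time 1 are averaged into a single cell, which may or may not be what you want.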

wide_df = X.pivot(
    values = "Value",
    index = "Time",
    columns = "Id"
)
TableReport(wide_df)


We can see that there are a lot of missing values : the participants did not record their measurements at the same timestamps, so most of the data points don't match up.

In the table report, we can see distribution of heart rates and see that some people are in better shape than others.

To get back to the original long format we use the melt method :

(
    wide_df
    .reset_index()
    .melt(id_vars = "Time")
)
Time Id value
0 3/29/2016 10:00:00 AM 2022484408 NaN
1 3/29/2016 10:00:05 PM 2022484408 NaN
2 3/29/2016 10:00:10 PM 2022484408 NaN
3 3/29/2016 10:00:15 AM 2022484408 NaN
4 3/29/2016 10:00:15 PM 2022484408 NaN
... ... ... ...
7148353 4/9/2016 9:59:55 AM 8877689391 74.0
7148354 4/9/2016 9:59:55 PM 8877689391 NaN
7148355 4/9/2016 9:59:57 PM 8877689391 83.0
7148356 4/9/2016 9:59:58 PM 8877689391 NaN
7148357 4/9/2016 9:59:59 PM 8877689391 NaN

7148358 rows × 3 columns

We see that we have way more rows than before : melt kept the NaN values. We can drop them with the dropna method :

(
    wide_df
    .reset_index()
    .melt(id_vars = "Time")
    .dropna(subset = ["value"])
)
Time Id value
35721 4/1/2016 10:00:00 AM 2022484408 98.0
35728 4/1/2016 10:00:05 AM 2022484408 97.0
35741 4/1/2016 10:00:15 AM 2022484408 98.0
35746 4/1/2016 10:00:20 AM 2022484408 99.0
35752 4/1/2016 10:00:25 AM 2022484408 101.0
... ... ... ...
7148343 4/9/2016 9:59:42 PM 8877689391 87.0
7148345 4/9/2016 9:59:45 AM 8877689391 73.0
7148350 4/9/2016 9:59:52 PM 8877689391 86.0
7148353 4/9/2016 9:59:55 AM 8877689391 74.0
7148355 4/9/2016 9:59:57 PM 8877689391 83.0

1154681 rows × 3 columns

And now we have the same data as before.

Time

This is not my speciality, but pandas is very good at dealing with dates and temporal data. Let's see a bit of it here :) (the doc, better than this tuto, can be found here). We are lucky to have fitbit data with the heart rates of 14 people over several days.

We see that pandas treats the time column as an object, which is not what we want. Let's convert this column to a date :)

cleaned_df = (
    X
    .assign(time_col = lambda df_: pd.to_datetime(df_.Time))
    .set_index("time_col")
)
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1154681 entries, 2016-04-01 07:54:00 to 2016-04-12 09:46:45
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype   
---  ------  --------------    -----   
 0   Id      1154681 non-null  category
 1   Time    1154681 non-null  object  
 2   Value   1154681 non-null  uint8   
dtypes: category(1), object(1), uint8(1)
memory usage: 19.8+ MB

Now it is cool ! We got where we wanted to go.
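A taste of why a DatetimeIndex is worth having : resampling. Here is a minimal sketch on a hypothetical miniature of cleaned_df (invented values, same structure) :

```python
import pandas as pd

# A hypothetical miniature of cleaned_df : DatetimeIndex plus a Value column
idx = pd.to_datetime([
    "2016-04-01 07:54:00", "2016-04-01 07:54:05",
    "2016-04-01 08:10:00", "2016-04-01 08:10:05",
])
mini = pd.DataFrame({"Value": [60, 62, 70, 72]}, index=idx)

# With a DatetimeIndex, aggregating to hourly means is a one-liner
hourly = mini.resample("1h")["Value"].mean()
```

On the real data you would first group by "Id" so each person gets their own hourly averages.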

import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(
    cleaned_df,
    x = cleaned_df.index,
    y = "Value",
    hue = "Id"
)
plt.xticks(rotation = 45);

TODO : I don’t know what to say about it ! MB :/

Processing bonus

Sometimes you get a lot of files in some directory that you want to merge into one single dataframe. This is actually very simple and doable with a single beautiful and nice for-loop héhé.

Lets say you have a directory like that :

├── merge.py
└── data
    ├── car_1_day_1.csv
    ├── car_2_day_1.csv
    ├── car_1_day_2.csv

In your merge.py file you will iterate through your data directory to load each csv as a pandas dataframe. All those files will be merged, so each has to contain information to differentiate it.

In this case I will add columns to the dataframe of each csv file, based on the information contained in the filename.

from pathlib import Path

csv_files = Path("data").glob("*.csv")

csv_file_list = []

for file in csv_files:
    # file is a pathlib.Path ; file.stem is e.g. "car_1_day_1"
    parts = file.stem.split("_")
    car_id = parts[1]
    day_id = parts[3]
    file_dataframe = (
        pd.read_csv(file)
        .assign(car_id = car_id)
        .assign(day_id = day_id)
    )
    csv_file_list.append(file_dataframe)


final_dataframe = pd.concat(csv_file_list)
final_dataframe.to_csv("final_dataframe.csv")

final_dataframe holds the content of every sub-dataframe, saved to a single csv. Note that all the dataframes should have the same columns, as they are concatenated vertically.
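For contrast, here is a minimal sketch of the horizontal counterpart, joining two tables on a shared key (toy, hypothetical tables) :

```python
import pandas as pd

# Hypothetical toy tables sharing a key column
cars = pd.DataFrame({"car_id": ["1", "2"], "top_speed": [180, 210]})
days = pd.DataFrame({"car_id": ["1", "2", "2"], "day_id": ["1", "1", "2"]})

# Horizontal join on the key : each day row gets its car's attributes
joined = days.merge(cars, on="car_id", how="left")
```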

Joining horizontally according to a primary key is completely feasible as well, but is not the purpose of this article. If you need it don't hesitate to contact me !

Bonus : how to model off of this ?

Let's use skrub !

Skrub is a library that complements scikit-learn and works natively on dataframes. It is soooooooo nice ! Let's see some examples that could be transposed to other problems.

Let's go back to the car price prediction.

First, ML models require numbers as input, not strings or anything else. You could work on each variable to decide how best to encode it, but the most parsimonious processing is provided by skrub with the TableVectorizer.

from skrub import TableVectorizer
import skrub
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

dataset = openml.datasets.get_dataset(248)
X, y, is_categorical, feat_names = dataset.get_data(
    dataset_format="dataframe", target=dataset.default_target_attribute)


data_var = skrub.var("data", X)

X = data_var.drop(columns="price").skb.mark_as_X()
y = data_var["price"].skb.mark_as_y()

Let's start with the DataOps utilities ! We just defined our X and y, which will be used later to cross-validate our estimators. To say a few words about the TableVectorizer (applied below) : it one-hot encodes categorical variables with few modalities and uses a different encoding for categories with a high number of modalities.
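As a rough mental model of the one-hot part (plain pandas here, not what TableVectorizer literally does internally ; the column and values are invented) :

```python
import pandas as pd

# A hypothetical low-cardinality categorical column
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One-hot encoding : one indicator column per modality
encoded = pd.get_dummies(df, columns=["fuel"])
```

Each row now has a 1 (True) in the column of its modality and 0 (False) elsewhere, which is something a linear model can digest.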

Now the modelling part. We are doing regression, so the simple reference model should be linear regression. Let's import some regression estimators from sklearn.

We can use skrub's DataOps capabilities to do the model selection : we will compare the dummy baseline against the simple linear model and the default state-of-the-art regressor, using 2 splits of cross-validation.

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from tqdm.auto import tqdm
import plotly.express as px
model = skrub.choose_from([LinearRegression(), HistGradientBoostingRegressor(), DummyRegressor()])

preds = X.skb.apply(TableVectorizer()).skb.apply(model, y=y)

search = preds.skb.make_randomized_search(
    random_state=1, 
    fitted=True, 
    n_iter=3,
    cv = 2
)


results = search.results_
px.bar(
    results.assign(model = results.iloc[:,0]),
    x= "model",
    y = "mean_test_score"
)

The HPO results are supposed to be plotted using the following syntax, but it looks like I have a bug with the plotly plotting layout.

I put it here anyway until I find a fix :

search.plot_results()