Winter school where I learned about Conformal Prediction (CP) and Transfer Learning. ECAS: European Course in Advanced Statistics.
Motivation
My PhD relates to Active Learning, where we select the images to label so as to minimize the generalization error of an estimator on a dataset. To achieve this goal, two categories of strategies exist: we can focus on the estimator's weaknesses and select the points where its estimated error is maximal, or we can use a feature-based approach and provide the network with items that are as diverse as possible.
Strategies based on weaknesses, like entropy selection, rely on two hypotheses:
- The images for which the model is the least certain are the most informative ones
- The prediction vector is a good estimator of the model's uncertainty.
The problem is that the second hypothesis is wrong, as we can see in this image taken from (Brunekreef et al. 2023).

That is why one of my tasks is to quantify the uncertainty of neural networks with better statistical guarantees, notably by using Conformal Prediction. That part was addressed by the marvelous Margaux Zaffran. She took her time and her talk was very clear and well structured. I can’t thank her enough and hope I will be able to use CP in my work to improve labeling efficiency.
The second problem is that in industry, the data in production is often different from the data in the database. Indeed, multiple shifts can happen (mostly covariate and label shift). We had the chance to have Antoine De Mathelin and Mounir Atiq talk about domain adaptation and all the possible methods to get the best out of our models even when the production data is not the same as the training data.
And last but not least, we had Mathilde Mougeot talking about Physics-Informed Neural Networks (PINNs). It was the first time I had been exposed to this tool, and I have to say I loved seeing physics equations directly linked with gradient descent to model mechanical constraints on tyre materials.
Conformal Prediction
This is the field I want to become good at. Indeed, my PA fell in love with this field, and I’m afraid I did too during this week. I met such nice people working on it and making the best use of this theoretical tool in industry that I want to make our computer vision more reliable as well.
All the course material and slides are in this GitHub repo. The slides are clear and all the proofs are written out as well.
Complete book with theoretical proofs and explanations: (A. N. Angelopoulos, Barber, and Bates 2025).
Short Summary:
Linear regression allows good uncertainty quantification (UQ) using Gaussian hypotheses on the parameters and the target variable. In situations where models are bigger and do not satisfy any assumptions on parameter values, UQ based on assumptions about the model or the errors becomes impossible. That is the problem CP solves: given a single hypothesis (exchangeability of the non-conformity scores), it provides valid UQ that is distribution-free (no assumption on the data distribution) and model-agnostic (independent of the estimator). This is very good because in deep learning we literally have no other tool to quantify uncertainty with statistical guarantees.
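For concreteness, the Gaussian interval mentioned above for linear regression is the textbook formula (a standard result, not taken from the slides):
\[\hat{\mu}(x_{n+1}) \pm t_{n-p,\,1-\alpha/2}\;\hat{\sigma}\sqrt{1 + x_{n+1}^\top (X^\top X)^{-1} x_{n+1}}\]
which is only valid if the noise really is Gaussian; CP is precisely what lets us drop this kind of assumption.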
Given a labeled dataset of size \(n\), \((X_i, Y_i)^n_{i=1}\), we want to be able to predict the label of a new point \(X_{n+1}\) with confidence.
Meaning that for any user-defined risk \(\alpha \in [0,1]\), \(\mathbb{P}\{Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})\} \geq 1 - \alpha\). \(\mathcal{C}_{\alpha}\) should be as small as possible to be informative. Indeed, the case where \(\mathcal{C}_{\alpha}(X_{n+1}) = \mathcal{Y}\) satisfies the \(1-\alpha\) coverage but provides no information about the predictive uncertainty.
A random vector \((Z_i)^n_{i=1}\) taking values in \(\mathcal{Z}^n\) is exchangeable if, for any permutation \(\sigma\) of \(\{1, ..., n\}\):
\[(Z_1, ..., Z_n) \stackrel{d}{=} (Z_{\sigma(1)}, ..., Z_{\sigma(n)})\]
How does it work? (Split Conformal Prediction setting)
In machine vision, our models are so big that we cannot afford leave-one-out cross-validation or cross-validation with the full conformal paradigm. We will therefore focus on the split case, sketched in code after the steps below.
- Split your data into Train, Calibration and Test splits
- Train the learning algorithm \(\mathcal{A}\) on Train to obtain \(\hat{\mu}\)
- Compute the predictions \(\hat{\mu}(X_i)\) on Cal
- Obtain a set of non conformity scores
\[S = \{S_i = |\hat{\mu}(X_i) - Y_i|, i \in Cal \} \cup \{+\infty \}\]
- Compute the \(1 - \alpha\) quantile of these scores, noted \(q_{1-\alpha}(S)\)
- For a new point \(X_{n+1}\), return :
\[\hat{C}_{\alpha}(X_{n+1}) = [\hat{\mu}(X_{n+1})- q_{1-\alpha}(S), \hat{\mu}(X_{n+1})+ q_{1-\alpha}(S)]\]
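Here is a minimal numpy sketch of these steps (the function and variable names are mine, and the estimator is assumed to be any regressor with a `.predict()` method trained on the Train split only):

```python
import numpy as np

def split_conformal_interval(mu_hat, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal prediction intervals with absolute-residual scores."""
    # Non-conformity scores on the calibration set: S_i = |mu_hat(X_i) - Y_i|
    scores = np.abs(mu_hat.predict(X_cal) - y_cal)
    n = len(scores)

    # Finite-sample corrected (1 - alpha) quantile; appending +infinity to the
    # score set amounts to using the ceil((1 - alpha)(n + 1)) / n level.
    level = np.ceil((1 - alpha) * (n + 1)) / n
    if level > 1.0:
        q = np.inf  # not enough calibration points: the interval is the whole line
    else:
        q = np.quantile(scores, level, method="higher")

    # Interval centred on the point prediction
    y_pred = mu_hat.predict(X_new)
    return y_pred - q, y_pred + q
```

With a model fitted on the Train split (e.g. any scikit-learn regressor), `split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1)` returns the lower and upper bounds of the intervals.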
The guarantee of this predictor is the following:
\(\mathbb{P}\{ Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1}) \} \geq 1 - \alpha\). This is the marginal guarantee, meaning that across all our predictions, at least a proportion \(1-\alpha\) of the test points will be covered. But there might be some sub-categories that are not covered at all.
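A quick empirical sanity check of this marginal guarantee, reusing the hypothetical sketch above (so the variable names are assumptions):

```python
# Fraction of test labels that fall inside their predicted interval
lower, upper = split_conformal_interval(mu_hat, X_cal, y_cal, X_test, alpha=0.1)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Empirical marginal coverage: {coverage:.3f} (target: at least 0.90)")
```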
In active learning we want to estimate the uncertainty of the model for one specific data point. What we would like to have is conditional coverage:
\(\mathbb{P}\{ Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})| X_{n+1} \} = 1 - \alpha\).
This is what adaptive methods aim for. Some papers have built adaptive methods for classification with very strong guarantees (see the Discussions section for literature recommendations regarding this matter).

Transfer Learning
This field sounds specific, but actually any statistician working with real-world data will face generalization problems one day or another.
All the material is in this GitHub repo.
Discussions
For classification, (Romano, Sesia, and Candès 2020) provide very strong guarantees with adaptivity. That is what we need in active learning. The RAPS extension of (A. Angelopoulos et al. 2022) improves this with smaller prediction sets. This gives hope for the use of CP in AL. When some defects are very rare, conformal prediction, which only guarantees marginal coverage, might leave some classes uncovered (no conditional coverage in general); she recommended the work of Tiffany Ding (Ding, Fermanian, and Salmon 2025) in collaboration with Plantnet.
She suggests Y-conditional conformal algorithms, as in (Ding et al. 2023), to guarantee that the coverage holds for every class.
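To give an idea of what class-conditional (Y-conditional) calibration looks like, here is a minimal numpy sketch. It is the generic label-conditional construction with a softmax-based score I chose for illustration, not the specific algorithm of (Ding et al. 2023):

```python
import numpy as np

def classwise_thresholds(probs_cal, y_cal, alpha=0.1):
    """One conformal threshold per class, calibrated on that class's points only."""
    # Non-conformity score: 1 - softmax probability assigned to the true class
    scores = 1.0 - probs_cal[np.arange(len(y_cal)), y_cal]
    n_classes = probs_cal.shape[1]
    thresholds = np.full(n_classes, np.inf)  # infinite threshold = class always included
    for k in range(n_classes):
        s_k = scores[y_cal == k]
        n_k = len(s_k)
        if n_k == 0:
            continue  # no calibration point for this class: stay conservative
        level = np.ceil((1 - alpha) * (n_k + 1)) / n_k
        if level <= 1.0:
            thresholds[k] = np.quantile(s_k, level, method="higher")
    return thresholds

def classwise_prediction_sets(probs_new, thresholds):
    # Include class k whenever its score 1 - p_k falls below the class-k threshold
    return [np.where(1.0 - p <= thresholds)[0] for p in probs_new]
```

The price to pay is visible in the code: a very rare class has few calibration points, so its threshold is large (or infinite) and it is almost always included, which keeps per-class coverage valid but enlarges the prediction sets.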
This is still an area at the research stage, but some solutions exist to tackle our problems. We shall not forget that most people using big neural networks do not quantify uncertainty at all, so it is nice to take the first step.
Antoine was kind enough to share his insight about his past with active learning and domain transfer; I really want to thank him again. He agrees that active learning doesn’t beat random in most cases, but he believes more in AL than in domain adaptation… Let’s do our best!
It is normal for AL to bias the distribution, because if we didn’t want to bias it we would have stayed with random sampling.
He prefers k-medoids over core-set selection. It must be similar to TypiClust, so I will work on that and see how I can make the best of it.
He liked the idea of the potato project and was surprised that AL could work that well in some cases (might be a bug? I hope not…). He is not surprised that transfer is strategy-independent. Indeed, when there is a distribution shift, it is likely that nothing will help you in most real-data cases. He notes that even for \(\pi^u = 10\%\), AL can generalize well and beat random.
Thanks



This week has been the most intense of the PhD so far. Every day was filled with either insane classes or deep conversations. I want to particularly thank
Margaux, Louis, Guillaume1, Guillaume11, Francois, Matthias and Paul.
Thank you for all the good times, guys. I hope I will be able to become like you one day.