MSc. Alice Bizeul
PhD Student
- Email: alice.bizeul@inf.ethz.ch
- Address: Department of Computer Science, CAB G 15.2, Universitätstr. 6, CH – 8092 Zurich, Switzerland
- Room: CAB G 15.2
I completed my Bachelor’s in Life Sciences Engineering and Master’s in Computational Neurosciences at École Polytechnique Fédérale de Lausanne (EPFL). My Master’s thesis was conducted at MIT and focused on generative adversarial models for synthetic brain MRI generation. In September 2021, I joined the Medical Data Science group as a PhD student, co-mentored by Bernhard Schölkopf.
My PhD focuses on self-supervised representation learning, with an emphasis on understanding and enhancing methods for learning representations with minimal prior knowledge about the downstream tasks of interest. In 2024, I interned at Amazon research and worked on diffusion-based generative modeling.
Publications
Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.
Authors: Patrik Reizinger*, Alice Bizeul*, Attila Juhos*, Julia E. Vogt, Randall Balestriero, Wieland Brendel, David Klindt (* denotes shared first authorship)
Date: 04.11.2024
Self-Supervised Learning (SSL) methods often consist of elaborate pipelines with hand-crafted data augmentations and computational tricks. However, it is unclear what the provably minimal set of building blocks is that ensures good downstream performance. The recently proposed instance discrimination method, coined DIET, stripped down the SSL pipeline and demonstrated how a simple SSL algorithm can work by predicting the sample index. Our work proves that DIET recovers cluster-based latent representations, while successfully identifying the correct cluster centroids in its classification head. We demonstrate the identifiability of DIET on synthetic data adhering to and violating our assumptions, revealing that the recovery of the cluster centroids is even more robust than the feature recovery.
Authors: Attila Juhos*, Alice Bizeul*, Patrik Reizinger*, David Klindt, Randall Balestriero, Mark Ibrahim, Julia E. Vogt, Wieland Brendel (* denotes shared first authorship)
Submitted: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice
Date: 10.10.2024
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels. A common task is to classify augmentations or different modalities of the data, which share semantic content (e.g. an object in an image) but differ in style (e.g. the object's location). Many approaches to self-supervised learning have been proposed, e.g. SimCLR, CLIP and VICReg, which have recently gained much attention for their representations achieving downstream performance comparable to supervised learning. However, a theoretical understanding of the mechanism behind self-supervised methods remains elusive. Addressing this, we present a generative latent variable model for self-supervised learning and show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations, providing a unifying theoretical framework for these methods. The proposed model also justifies connections drawn to mutual information and the use of a "projection head". Learning representations by fitting the model generatively (termed SimVAE) improves performance over discriminative and other VAE-based methods on simple image benchmarks and significantly narrows the gap between generative and discriminative representation learning in more complex settings. Importantly, as our analysis predicts, SimVAE outperforms self-supervised learning where style information is required, taking an important step toward understanding self-supervised methods and achieving task-agnostic representations.
Authors: Alice Bizeul, Bernhard Schölkopf, Carl Allen
Submitted: Transactions on Machine Learning Research
Date: 01.09.2024
Self-supervised representation learning is a powerful paradigm that leverages the relationship between semantically similar data, such as augmentations, extracts of an image or sound clip, or multiple views/modalities. Recent methods, e.g. SimCLR, CLIP and DINO, have made significant strides, yielding representations that achieve state-of-the-art results on multiple downstream tasks. A number of self-supervised discriminative approaches have been proposed, e.g. instance discrimination, latent clustering and contrastive methods. Though these approaches are often intuitive, a comprehensive theoretical understanding of their underlying mechanisms, or of what they learn, remains elusive. Meanwhile, generative approaches, such as variational autoencoders (VAEs), fit a specific latent variable model and have principled appeal, but lag significantly in terms of performance. We present a theoretical analysis of self-supervised discriminative methods and a graphical model that reflects the assumptions they implicitly make and unifies these methods. We show that fitting this model under an ELBO objective improves representations over previous VAE methods on several common benchmarks, narrowing the gap to discriminative methods, and can also preserve information lost by discriminative approaches. This work brings new theoretical insight to modern machine learning practice.
Authors: Alice Bizeul, Carl Allen
Submitted: NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning
Date: 07.11.2023
Contrastive learning is a cornerstone underlying recent progress in multi-view and multimodal learning, e.g., in representation learning with image/caption pairs. While its effectiveness is not yet fully understood, a line of recent work reveals that contrastive learning can invert the data generating process and recover ground truth latent factors shared between views. In this work, we present new identifiability results for multimodal contrastive learning, showing that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously. Specifically, we distinguish between the multi-view setting with one generative mechanism (e.g., multiple cameras of the same type) and the multimodal setting that is characterized by distinct mechanisms (e.g., cameras and microphones). Our work generalizes previous identifiability results by redefining the generative process in terms of distinct mechanisms with modality-specific latent variables. We prove that contrastive learning can block-identify latent factors shared between modalities, even when there are nontrivial dependencies between factors. We empirically verify our identifiability results with numerical simulations and corroborate our findings on a complex multimodal dataset of image/text pairs. Zooming out, our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
Authors: Imant Daunhawer, Alice Bizeul, Emanuele Palumbo, Alexander Marx, Julia E. Vogt
Submitted: The Eleventh International Conference on Learning Representations, ICLR 2023
Date: 23.03.2023
Orienting in space requires the processing of visual spatial cues. The dominant hypothesis about the brain structures mediating the coding of spatial cues stipulates the existence of a hippocampal‐dependent system for the representation of geometry and a striatal‐dependent system for the representation of landmarks. However, this dual‐system hypothesis is based on paradigms that presented spatial cues conveying either conflicting or ambiguous spatial information and that used the term landmark to refer to both discrete three‐dimensional objects and wall features. Here, we test the hypothesis of complex activation patterns in the hippocampus and the striatum during visual coding. We also postulate that object‐based and feature‐based navigation are not equivalent instances of landmark‐based navigation. We examined how the neural networks associated with geometry‐, object‐, and feature‐based spatial navigation compared with a control condition in a two‐choice behavioral paradigm using fMRI. We showed that the hippocampus was involved in all three types of cue‐based navigation, whereas the striatum was more strongly recruited in the presence of geometric cues than object or feature cues. We also found that unique, specific neural signatures were associated with each spatial cue. Object‐based navigation elicited a widespread pattern of activity in temporal and occipital regions relative to feature‐based navigation. These findings extend the current view of a dual, juxtaposed hippocampal–striatal system for visual spatial coding in humans. They also provide novel insights into the neural networks mediating object versus feature spatial coding, suggesting a need to distinguish these two types of landmarks in the context of human navigation.
Authors: Stephen Ramanoël, Marion Durteste, Alice Bizeul, Anthony Ozier‐Lafontaine, Marcia Bécu, José‐Alain Sahel, Christophe Habas, Angelo Arleo
Submitted: Human Brain Mapping
Date: 01.07.2022