MSc.
Emanuele Palumbo
PhD Student
- emanuele.palumbo@inf.ethz.ch
- Address
-
Department of Computer Science
CAB G 15.2
Universitätstr. 6
CH – 8092 Zurich, Switzerland - Room
- CAB G 15.2
I completed my Bachelor’s Degree in Computer and Automation Engineering at Università Politecnica delle Marche (UNIVPM). Afterwards, I obtained a Master’s Degree in Data Science at ETH Zürich, with a thesis in efficient deep generative learning with a large number of data modalities. In May 2022 I joined the Medical Data Science group as a Doctoral Fellow of the ETH AI Center, co-mentored by Prof. Dr. Andrea Burden.
I am interested in multimodal learning, and its applications in the medical domain. In particular, I conduct research on generative approaches which can efficiently handle a large number of heterogeneous modalities, with the downstream goal of applying such models to real-world medical data.
Publications
Multimodal VAEs have recently received significant attention as generative models for weakly-supervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. Our work lies at the intersection of these two research directions. We propose a novel multimodal VAE model, in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, our method favourably compares to alternative clustering approaches, in weakly-supervised settings. Notably, we propose a post-hoc procedure that avoids the need for our method to have a priori knowledge of the true number of clusters, mitigating a critical limitation of previous clustering frameworks.
AuthorsEmanuele Palumbo, Sonia Laguna, Daphné Chopard, Julia E Vogt
SubmittedICML 2023 Workshop on Structured Probabilistic Inference/Generative Modeling
Date23.06.2023
Multimodal VAEs have recently received significant attention as generative models for weaklysupervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. Our work lies at the intersection of these two research directions. We propose a novel multimodal VAE model, in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, our method favorably compares to alternative clustering approaches, in weakly-supervised settings. Notably, we propose a post-hoc procedure that avoids the need for to have a priori knowledge of the true number of clusters, mitigating a critical limitation previous clustering frameworks.
AuthorsEmanuele Palumbo, Sonia Laguna, Daphné Chopard, Julia E Vogt
SubmittedICML 2023 Workshop DeployableGenerativeAI
Date23.06.2023
Contrastive learning is a cornerstone underlying recent progress in multi-view and multimodal learning, e.g., in representation learning with image/caption pairs. While its effectiveness is not yet fully understood, a line of recent work reveals that contrastive learning can invert the data generating process and recover ground truth latent factors shared between views. In this work, we present new identifiability results for multimodal contrastive learning, showing that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously. Specifically, we distinguish between the multi-view setting with one generative mechanism (e.g., multiple cameras of the same type) and the multimodal setting that is characterized by distinct mechanisms (e.g., cameras and microphones). Our work generalizes previous identifiability results by redefining the generative process in terms of distinct mechanisms with modality-specific latent variables. We prove that contrastive learning can block-identify latent factors shared between modalities, even when there are nontrivial dependencies between factors. We empirically verify our identifiability results with numerical simulations and corroborate our findings on a complex multimodal dataset of image/text pairs. Zooming out, our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
AuthorsImant Daunhawer, Alice Bizeul, Emanuele Palumbo, Alexander Marx, Julia E. Vogt
SubmittedThe Eleventh International Conference on Learning Representations, ICLR 2023
Date23.03.2023
Multimodal VAEs have recently gained attention as efficient models for weakly-supervised generative learning with multiple modalities. However, all existing variants of multimodal VAEs are affected by a non-trivial trade-off between generative quality and generative coherence. In particular mixture-based models achieve good coherence only at the expense of sample diversity and a resulting lack of generative quality. We present a novel variant of the mixture-of-experts multimodal variational autoencoder that improves its generative quality, while maintaining high semantic coherence. We model shared and modality-specific information in separate latent subspaces, proposing an objective that overcomes certain dependencies on hyperparameters that arise for existing approaches with the same latent space structure. Compared to these existing approaches, we show increased robustness with respect to changes in the design of the latent space, in terms of the capacity allocated to modality-specific subspaces. We show that our model achieves both good generative coherence and high generative quality in challenging experiments, including more complex multimodal datasets than those used in previous works.
AuthorsEmanuele Palumbo, Imant Daunhawer, Julia E. Vogt
SubmittedThe Eleventh International Conference on Learning Representations, ICLR 2023
Date02.03.2023
Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data. Yet, despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs, which are completely unsupervised. In an attempt to explain this gap, we uncover a fundamental limitation that applies to a large family of mixture-based multimodal VAEs. We prove that the sub-sampling of modalities enforces an undesirable upper bound on the multimodal ELBO and thereby limits the generative quality of the respective models. Empirically, we showcase the generative quality gap on both synthetic and real data and present the tradeoffs between different variants of multimodal VAEs. We find that none of the existing approaches fulfills all desired criteria of an effective multimodal generative model when applied on more complex datasets than those used in previous benchmarks. In summary, we identify, formalize, and validate fundamental limitations of VAE-based approaches for modeling weakly-supervised data and discuss implications for real-world applications.
AuthorsImant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo, Julia E. Vogt
SubmittedThe Tenth International Conference on Learning Representations, ICLR 2022
Date07.04.2022