Prof. Dr.
Julia Vogt

Group Leader
- Email: julia.vogt@inf.ethz.ch
- Phone: +41 44 633 8714
- Address: Department of Computer Science, CAB G 16.2, Universitätstr. 6, CH-8092 Zurich, Switzerland
- Room: CAB G 16.2
Julia Vogt is an associate professor of Computer Science at ETH Zurich, where she leads the Medical Data Science Group. The focus of her research is on linking computer science with medicine, with the ultimate aim of personalized patient treatment. She has studied mathematics both in Konstanz and in Sydney and earned her Ph.D. in computer science at the University of Basel. She was a postdoctoral research fellow at the Memorial Sloan-Kettering Cancer Center in NYC and with the Bioinformatics and Information Mining group at the University of Konstanz. In 2018, she joined the University of Basel as an assistant professor. In May 2019, she and her lab moved to Zurich where she joined the Computer Science Department of ETH Zurich.
Publications
Intracardiac electrophysiological (EP) signals are frequently contaminated by diverse noise sources, posing a major obstacle to accurate arrhythmia diagnosis. We hypothesized that a physics-inspired conditional denoising diffusion probabilistic model (cDDPM) could outperform both classical filters and variational autoencoders by preserving subtle morphological features. Using 5706 monophasic action potentials from 42 patients, we introduced a range of simulated and real EP noise, then trained the cDDPM in an iterative process analogous to Brownian motion. The proposed model achieved superior performance across RMSE, PCC, and PSNR metrics, confirming its robustness against complex noise while maintaining essential signal fidelity. These findings suggest that diffusion-based methods can significantly enhance the clinical utility of EP signals for arrhythmia management and intervention. Clinical Relevance— We propose a denoising diffusion probabilistic model to reconstruct intracardiac signals in the presence of complex noise, which holds the potential to enhance diagnostic accuracy in EP procedures and inform more targeted treatment strategies.
Authors: Samuel Ruipérez-Campillo, Moritz Rau, Prasanth Ganesan, Kelly A Brennan, Ruibin Feng, Sabyasachi Bandyopadhyay, Albert J Rogers, Sanjiv M Narayan, Julia E Vogt
Submitted: IEEE Engineering in Medicine & Biology Society (47th EMBC, 2025)
Date: 03.12.2025
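The iterative noising that the abstract above likens to Brownian motion is, in standard DDPMs, a fixed forward process that gradually corrupts a clean signal; the denoiser is then trained to reverse it. A minimal numpy sketch of that forward process, with an illustrative linear schedule rather than the paper's configuration:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; alpha_bars[t] = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of the
    clean signal x0, the input the denoiser learns to invert."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise
```

A conditional model (the "c" in cDDPM) would additionally receive the noisy clinical recording as conditioning input at every reverse step.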
Reducing electrophysiological (EP) signal noise is essential for diagnosis, mapping, and ablation procedures in patients with arrhythmias or conditions such as cardiomyopathies. However, traditional approaches have been suboptimal due to the varied sources of noise. We hypothesized that variational autoencoders (VAEs) can learn key components of 'clean' electrophysiological signals by creating robust internal representations, thereby enabling automatic denoising of diverse noise in clinical recordings. We set out to apply a β-VAE model to a dataset of 5706 intra-ventricular monophasic action potential (MAP) signals, selected because their morphology is verifiable and measurable against a reference, from 42 patients with ischemic cardiomyopathy at risk for sudden death. We designed a noise library, and implemented baselines based on state-of-the-art clinical filtering techniques. The proposed β-VAE model was assessed for various noise types, including challenging non-stationary real EP noise. Comprehensive evaluation using general metrics and clinical action potential duration labels by domain experts revealed that our β-VAE outperformed current state-of-the-art filters in denoising efficacy, with key physiological information encoded in the reconstruction. We performed a sensitivity analysis that confirmed the robustness of the β-VAE model to increasing noise levels. These results demonstrate the ability of our model to denoise various sources, including those of time-varying nature. The application to well-studied MAPs verifies that clinically meaningful features were reconstructed in the EP context. This work enhances traditional signal processing approaches to ensure 'clean' electrical signals, and may have promising applications for diagnosis, tracking therapy and prognostication in patients with EP disorders in real-world clinical environments.
Authors: Samuel Ruipérez-Campillo, Alain Ryser, Thomas M Sutter, Brototo Deb, Ruibin Feng, Prasanth Ganesan, Kelly A Brennan, Albert J Rogers, Maarten ZH Kolk, Fleur VY Tjong, Sanjiv M Narayan†, Julia E Vogt†
† denotes shared last authorship
Submitted: Expert Systems with Applications
Date: 05.11.2025
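The β-VAE at the heart of this work optimizes a reconstruction term plus a β-weighted KL divergence to the prior, which pressures the latent code toward robust, disentangled representations of the clean signal. A minimal sketch of that objective for a diagonal Gaussian posterior (the squared-error reconstruction term and default β are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """beta-VAE objective: reconstruction error plus beta-weighted KL
    between the diagonal Gaussian posterior N(mu, diag(exp(logvar)))
    and the standard normal prior N(0, I), averaged over the batch."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=-1))
    kl = np.mean(0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1))
    return recon + beta * kl
```

Setting β > 1 trades reconstruction fidelity for a more regularized latent space, which is what makes the denoising robust to noise types unseen in training.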
Background: Timely and accurate detection of arrhythmias from electrocardiograms (ECGs) is crucial for improving patient outcomes. While artificial intelligence (AI)-based ECG classification has shown promising results, limited transparency and interpretability often impede clinical adoption. Methods: We present ECG-XPLAIM, a novel deep learning model dedicated to ECG classification that employs a one-dimensional inception-style convolutional architecture to capture local waveform features (e.g., waves and intervals) and global rhythm patterns. To enhance interpretability, we integrate Grad-CAM visualization, highlighting key waveform segments that drive the model's predictions. ECG-XPLAIM was trained on the MIMIC-IV dataset and externally validated on PTB-XL for multiple arrhythmias, including atrial fibrillation (AFib), sinus tachycardia (STach), conduction disturbances (RBBB, LBBB, LAFB), long QT (LQT), Wolff-Parkinson-White (WPW) pattern, and paced rhythm detection. We evaluated performance using sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC), and benchmarked against a simplified convolutional neural network, a two-layer gated recurrent unit (GRU), and an external, pre-trained, ResNet-based model. Results: Internally (MIMIC-IV), ECG-XPLAIM achieved high diagnostic performance (sensitivity, specificity, AUROC > 0.9) across most tasks. External evaluation (PTB-XL) confirmed generalizability, with metric values exceeding 0.95 for AFib and STach. For conduction disturbances, macro-averaged sensitivity reached 0.90, specificity 0.95, and AUROC 0.98. Performance for LQT, WPW, and pacing rhythm detection was 0.691/0.864/0.878, 0.773/0.973/0.895, and 0.96/0.988/0.993 (sensitivity/specificity/AUROC), respectively. Compared to baseline models, ECG-XPLAIM offered superior performance across most tests, and improved sensitivity over the external ResNet-based model, albeit at the cost of specificity. 
Grad-CAM revealed physiologically relevant ECG segments influencing predictions and highlighted patterns of potential misclassification. Conclusion: ECG-XPLAIM combines high diagnostic performance with interpretability, addressing a key limitation in AI-driven ECG analysis. The open-source release of ECG-XPLAIM's architecture and pre-trained weights encourages broader adoption, external validation, and further refinement for diverse clinical applications.
Authors: Panteleimon Pantelidis*, Samuel Ruipérez-Campillo*, Julia E Vogt, Alexios Antonopoulos, Ioannis Gialamas, George E Zakynthinos, Michael Spartalis, Polychronis Dilaveris, Jose Millet, Panagiotis Papapetrou, Theodore G Papaioannou, Evangelos Oikonomou, Gerasimos Siasos
* denotes shared first authorship
Submitted: Frontiers in Cardiovascular Medicine
Date: 16.10.2025
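Grad-CAM, the interpretability method integrated here, weights each feature channel by its average gradient and takes the ReLU of the weighted activation sum, localizing which waveform segments drove a prediction. A minimal 1-D sketch assuming precomputed activations and gradients (the array shapes are illustrative, not ECG-XPLAIM's internals):

```python
import numpy as np

def grad_cam_1d(activations, gradients):
    """Grad-CAM for 1-D signals. activations, gradients: (C, L) arrays of
    a convolutional layer's feature maps and the gradients of the class
    score w.r.t. them. Channel weights are length-averaged gradients; the
    map is the ReLU of the weighted activation sum, normalized to [0, 1]."""
    weights = gradients.mean(axis=1)                                # (C,)
    cam = np.maximum((weights[:, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam                                                      # (L,)
```

Overlaying the returned map on the ECG trace highlights the samples (e.g., a delta wave or prolonged QT segment) most responsible for the model's output.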
We present RadVLM, a compact (7B) multitask conversational foundation model designed for CXR interpretation. Its development relies on the curation of a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks - such as report generation, abnormality classification, and visual grounding - and multi-turn, multi-task conversational interactions. Our experiments show that RadVLM, fine-tuned on this instruction dataset, achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks (report generation, classification). Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of the RadVLM model as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
Authors: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas Sutter, Julia Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Bluethgen, Farhad Nooralahzadeh, Michael Krauthammer
Submitted: PhysioNet
Date: 08.10.2025
Generative modeling and clustering are conventionally distinct tasks in machine learning. Variational Autoencoders (VAEs) have been widely explored for their ability to integrate both, providing a framework for generative clustering. However, while VAEs can learn meaningful cluster representations in latent space, they often struggle to generate high-quality samples. This paper addresses this problem by introducing TreeDiffusion, a deep generative model that conditions diffusion models on learned latent hierarchical cluster representations from a VAE to obtain high-quality, cluster-specific generations. Our approach consists of two steps: first, a VAE-based clustering model learns a hierarchical latent representation of the data. Second, a cluster-aware diffusion model generates realistic images conditioned on the learned hierarchical structure. We systematically compare the generative capabilities of our approach with those of alternative conditioning strategies. Empirically, we demonstrate that conditioning diffusion models on hierarchical cluster representations improves the generative performance on real-world datasets compared to other approaches. Moreover, a key strength of our method lies in its ability to generate images that are both representative and specific to each cluster, enabling more detailed visualization of the learned latent structure. Our approach addresses the generative limitations of VAE-based clustering approaches by leveraging their learned structure, thereby advancing the field of generative clustering.
Authors: Jorge Silva Gonçalves, Laura Manduchi, Moritz Vandenhirtz, Julia E Vogt
Submitted: Joint European Conference on Machine Learning and Knowledge Discovery in Databases
Date: 04.10.2025
We release the RadVLM instruction dataset, a large-scale resource used to train the RadVLM model on diverse radiology tasks. The dataset contains 1,115,021 image–instruction pairs spanning five task families: (i) report generation from frontal CXRs using filtered Findings/Impression text; (ii) abnormality classification for the standard 14 CheXpert labels; (iii) anatomy grounding; (iv) abnormality detection and grounding; and (v) phrase grounding from report sentences. To support interactive use, we include ~89k LLM-generated multi-turn, multi-task conversations (~3k with spatial grounding) derived from image-linked attributes (reports, labels, boxes). Creation involved curating datasets from public sources, excluding lateral views, removing prior-study references and other non-image context from reports, fusing multi-reader annotations, and harmonizing label and coordinate formats. The resource is intended for training CXR assistants across diverse radiology tasks and within a conversational format.
Authors: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas Sutter, Julia Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Bluethgen, Farhad Nooralahzadeh, Michael Krauthammer
Submitted: PhysioNet
Date: 25.09.2025
Background: In infants, pulmonary hypertension (PH) increases morbidity and mortality. Echocardiography, though standard, is time- and expertise-demanding. We propose a deep learning approach for automated PH detection using standard echocardiography videos, validated by the systolic eccentricity index (EIs). Methods: The training and validation set comprised 975 videos and the held-out set 378 videos, including five echocardiographic standard views from infants aged 3–90 days, taken between 2018–2021 and 2021–2022, respectively. Echocardiograms were labeled as PH (EIs < 0.82) and healthy (EIs ≥ 0.87). After preprocessing and random segmentation of all videos into 13,530 frames, spatial and spatio-temporal convolutional neural network architectures were used for training of a PH prediction model and gradient-weighted class activation mapping for explainability. Results: The best single-view performance was achieved using the parasternal short-axis view (AUROC spatial and spatio-temporal: 0.91 and 0.94 in the validation set, 0.93 and 0.88 in the held-out set, respectively). Combining three standard views improved accuracy, with AUROC 0.96 and 0.90 in the validation (spatio-temporal) and held-out set (spatial), respectively. Saliency maps revealed model focus on clinically relevant regions, including the interventricular septum and left atrial filling. Conclusions: The presented deep learning model for automated detection of PH in neonates shows high accuracy, explainability, and reproducibility.
Authors: Holger Michel, Ece Ozkan, Kieran Chin-Cheong, Anna Badura, Verena Lehnerer, Stephan Gerling, Julia E. Vogt, Sven Wellmann
Submitted: Pediatric Research
Date: 24.09.2025
Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. Yet, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile-level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further show that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs. We propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
Authors: Boqi Chen, Cédric Vincent-Cuaz, Lydia A Schoenpflug, Manuel Madeira, Lisa Fournier, Vaishnavi Subramanian, Sonali Andani, Samuel Ruipérez-Campillo, Julia E Vogt, Raphaëlle Luisier, Dorina Thanou, Viktor H Koelzer, Pascal Frossard, Gabriele Campanella, Gunnar Rätsch
Submitted: MICCAI 2025
Date: 19.09.2025
Anomaly detection focuses on identifying samples that deviate from the norm. Discovering informative representations of normal samples is crucial to detecting anomalies effectively. Recent self-supervised methods have successfully learned such representations by employing prior knowledge about anomalies to create synthetic outliers during training. However, we often do not know what to expect from unseen data in specialized real-world applications. In this work, we address this limitation with our new approach, Con2, which leverages prior knowledge about symmetries in normal samples to observe the data in different contexts. Con2 consists of two parts: Context Contrasting clusters representations according to their context, while Content Alignment encourages the model to capture semantic information by aligning the positions of normal samples across clusters. The resulting representation space allows us to detect anomalies as outliers of the learned context clusters. We demonstrate the benefit of this approach in extensive experiments on specialized medical datasets, outperforming competitive baselines based on self-supervised learning and pretrained models and presenting competitive performance on natural imaging benchmarks.
Authors: Alain Ryser, Thomas M. Sutter, Alexander Marx, Julia E. Vogt
Submitted: Transactions on Machine Learning Research
Date: 16.09.2025
General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.
Authors: Daphne Chopard*, Sonia Laguna*, Kieran Chin-Cheong*, Annika Dietz, Anna Badura, Sven Wellmann, Julia E Vogt
* denotes shared first authorship
Submitted: Proceedings of Machine Learning Research - Machine Learning for Healthcare 2025; previous version in the ICLR 2025 Workshop AI4CHL (Best Paper Award, Oral)
Date: 15.08.2025
Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines being trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce.
Authors: Andrea Agostini*, Sonia Laguna*, Alain Ryser*, Samuel Ruipérez-Campillo*, Moritz Vandenhirtz, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M Sutter†, Julia E Vogt†
* denotes shared first authorship; † denotes shared last authorship
Submitted: International Conference on Machine Learning (ICML) 2025 Workshop on FM4LS
Date: 15.07.2025
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
Authors: Sonia Laguna, Katarzyna Kobalczyk, Julia E Vogt, Mihaela van der Schaar
Submitted: International Conference on Machine Learning (ICML) 2025 Workshop on PRAL
Date: 12.07.2025
Concept Bottleneck Models (CBMs) aim to enhance interpretability by structuring predictions around human-understandable concepts. However, unintended information leakage, where predictive signals bypass the concept bottleneck, compromises their transparency. This paper introduces an information-theoretic measure to quantify leakage in CBMs, capturing the extent to which concept embeddings encode additional, unintended information beyond the specified concepts. We validate the measure through controlled synthetic experiments, demonstrating its effectiveness in detecting leakage trends across various configurations. Our findings highlight that feature and concept dimensionality significantly influence leakage, and that classifier choice impacts measurement stability, with XGBoost emerging as the most reliable estimator. Additionally, preliminary investigations indicate that the measure exhibits the anticipated behavior when applied to soft joint CBMs, suggesting its reliability in leakage quantification beyond fully synthetic settings. While this study rigorously evaluates the measure in controlled synthetic experiments, future work can extend its application to real-world datasets.
Authors: Mikael Makonnen, Moritz Vandenhirtz, Sonia Laguna, Julia E Vogt
Submitted: International Conference on Learning Representations (ICLR) 2025 Workshop: XAI4Science
Date: 15.04.2025
The dead-in-bed syndrome describes the sudden and unexplained death of young individuals with Type 1 Diabetes (T1D) without prior long-term complications. One leading hypothesis attributes this phenomenon to nocturnal hypoglycemia (NH), a dangerous drop in blood glucose during sleep. This study aims to improve NH prediction in children with T1D by leveraging physiological data and machine learning (ML) techniques. We analyze an in-house dataset collected from 16 children with T1D, integrating physiological metrics from wearable sensors. We explore predictive performance through feature engineering, model selection, architectures, and oversampling. To address data limitations, we apply transfer learning from a publicly available adult dataset. Our results achieve an AUROC of 0.75 ± 0.21 on the in-house dataset, further improving to 0.78 ± 0.05 with transfer learning. This research moves beyond glucose-only predictions by incorporating physiological parameters, showcasing the potential of ML to enhance NH detection and improve clinical decision-making for pediatric diabetes management.
Authors: Marco Voegeli*, Sonia Laguna*, Heike Leutheuser, Marc Pfister, Marie-Anne Burckhardt, Julia E Vogt
* denotes shared first authorship
Submitted: International Conference on Learning Representations (ICLR) 2025 Workshop AI4CHL
Date: 14.04.2025
Pulmonary hypertension (PH) in newborns is a critical condition characterized by elevated pressure in the pulmonary arteries, leading to right ventricular strain and heart failure. While right heart catheterization (RHC) is the diagnostic gold standard, echocardiography is preferred due to its non-invasive nature, safety, and accessibility. However, its accuracy highly depends on the operator, making PH assessment subjective. While automated detection methods have been explored, most models focus on adults and rely on single-view echocardiographic frames, limiting their performance in diagnosing PH in newborns. While multi-view echocardiography has shown promise in improving PH assessment, existing models struggle with generalizability. In this work, we employ a multi-view variational autoencoder (VAE) for PH prediction using echocardiographic videos. By leveraging the VAE framework, our model captures complex latent representations, improving feature extraction and robustness. We compare its performance against single-view and supervised learning approaches. Our results show improved generalization and classification accuracy, highlighting the effectiveness of multi-view learning for robust PH assessment in newborns.
Authors: Lucas Erlacher*, Samuel Ruipérez-Campillo*, Holger Michel, Sven Wellmann, Thomas M. Sutter, Ece Özkan Elsen†, Julia E. Vogt†
* denotes shared first authorship; † denotes shared last authorship
Submitted: ICLR 2025 - Workshop on AI for Children
Date: 06.03.2025
Effective blood glucose forecasting is crucial for detecting events such as hypo- or hyperglycemia in people with diabetes, yet remains challenging in domains with only small, heterogeneous datasets, such as in the pediatric field. We present GluTFT, a novel transfer learning approach that allows leveraging models pretrained on publicly available adult diabetes datasets for pediatric glucose forecasting. We systematically evaluate multiple transfer learning strategies, including zero-shot prediction and fine-tuning across the entire dataset as well as specific subgroups of participants. Our extensive experiments reveal that GluTFT excels on the pretraining datasets and significantly outperforms baseline methods when fine-tuned. To validate the clinical relevance of our approach, we evaluate Parkes Error Grids, demonstrating the quality of GluTFT's blood glucose forecasts and its potential for enhancing clinical decision-making for pediatric diabetes.
Authors: Alain Ryser*, Chuhao Feng*, Tobias Scheithauer, Marc Pfister, Marie-Anne Burckhardt, Sara Bachmann, Alexander Marx†, Julia E. Vogt†
* denotes shared first authorship; † denotes shared last authorship
Submitted: Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:839-860
Date: 15.12.2024
Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the CBMs' efficient training and inference procedure. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.
Authors: Moritz Vandenhirtz*, Sonia Laguna*, Ričards Marcinkevičs, Julia E Vogt
* denotes shared first authorship
Submitted: NeurIPS: Thirty-Eighth Annual Conference on Neural Information Processing Systems
Date: 14.12.2024
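The key SCBM idea, that intervening on a single concept should shift all correlated concepts, can be illustrated by conditioning a Gaussian over concept logits on the intervened value and propagating the change through the covariance. The Gaussian form below is an illustrative stand-in for the paper's distributional parameterization, not its exact construction:

```python
import numpy as np

def intervene(mu, Sigma, i, value):
    """Condition a Gaussian N(mu, Sigma) over concept logits on the
    i-th concept taking the intervened value. Correlated concepts shift
    proportionally to their covariance with concept i; independent
    concepts are left untouched."""
    shift = (Sigma[:, i] / Sigma[i, i]) * (value - mu[i])
    new_mu = mu + shift
    new_mu[i] = value
    return new_mu
```

In a plain CBM with independent concepts (diagonal Sigma), the same call changes only the intervened entry, which is why modeling concept dependencies makes interventions more effective.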
General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.
Authors: Daphné Chopard*, Sonia Laguna*, Kieran Chin-Cheong*, Annika Dietz, Anna Badura, Sven Wellmann, Julia E Vogt
* denotes shared first authorship
Submitted: Findings, Machine Learning for Health (ML4H) Symposium, co-located with NeurIPS
Date: 13.12.2024
Concept-based machine learning methods have increasingly gained importance due to the growing interest in making neural networks interpretable. However, concept annotations are generally challenging to obtain, making it crucial to leverage all their prior knowledge. By creating concept-enriched models that incorporate concept information into existing architectures, we exploit their interpretable capabilities to the fullest extent. In particular, we propose Concept-Guided Conditional Diffusion, which can generate visual representations of concepts, and Concept-Guided Prototype Networks, which can create a concept prototype dataset and leverage it to perform interpretable concept prediction. These results open up new lines of research by exploiting pre-existing information in the quest for rendering machine learning more human-understandable.
Authors: Alba Carballo-Castro, Sonia Laguna, Moritz Vandenhirtz, Julia E Vogt
Submitted: NeurIPS 2024 Workshop Interpretable AI
Date: 12.12.2024
Recently, interpretable machine learning has re-explored concept bottleneck models (CBM), comprising step-by-step prediction of the high-level concepts from the raw features and the target variable from the predicted concepts. A compelling advantage of this model class is the user's ability to intervene on the predicted concept values, affecting the model's downstream output. In this work, we introduce a method to perform such concept-based interventions on already-trained neural networks, which are not interpretable by design, given an annotated validation set. Furthermore, we formalise the model's intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black-box models. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of the proposed techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes can be as intervenable and more performant than CBMs.
Authors: Sonia Laguna*, Ričards Marcinkevičs*, Moritz Vandenhirtz, Julia E. Vogt
* denotes shared first authorship
Submitted: NeurIPS: Thirty-Eighth Annual Conference on Neural Information Processing Systems
Date: 11.12.2024
Self-supervised learning (SSL) has emerged as a powerful approach for learning biologically meaningful representations of single-cell data. To establish best practices in this domain, we present a comprehensive benchmark evaluating eight SSL methods across three downstream tasks and eight datasets, with various data augmentation strategies. Our results demonstrate that SimCLR and VICReg consistently outperform other methods across different tasks. Furthermore, we identify random masking as the most effective augmentation technique. This benchmark provides valuable insights into the application of SSL to single-cell data analysis, bridging the gap between SSL and single-cell biology.
Authors: Philip Toma*, Olga Ovcharenko*, Imant Daunhawer, Julia Vogt, Florian Barkmann†, Valentina Boeva†
* denotes shared first authorship; † denotes shared last authorship
Submitted: Preprint
Date: 06.11.2024
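Random masking, the augmentation this benchmark identifies as most effective, simply zeroes out a random subset of features (e.g., gene-expression values) to create the augmented views that contrastive objectives such as SimCLR compare. A minimal sketch, with an illustrative default mask rate:

```python
import numpy as np

def random_mask(x, mask_rate=0.2, rng=None):
    """Random feature masking for self-supervised pretraining on
    single-cell data: zero out a random subset of entries of x.
    Two independent calls on the same cell yield a positive pair."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) < mask_rate
    return np.where(mask, 0.0, x)
```

Usage: `view_a, view_b = random_mask(cells, rng=rng), random_mask(cells, rng=rng)` produces two corrupted views per cell for the contrastive loss.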
Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.
Authors: Patrik Reizinger*, Alice Bizeul*, Attila Juhos*, Julia E. Vogt, Randall Balestriero, Wieland Brendel, David Klindt (* denotes shared first authorship)
Submitted: The Thirteenth International Conference on Learning Representations, ICLR 2025 (Oral)
Date: 04.11.2024
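The claim that representations encode factors of variation up to a linear transformation is typically checked with a linear probe. A minimal sketch of such a check (our own illustrative code, not the paper's evaluation pipeline):

```python
import numpy as np

def linear_probe_r2(z, factors):
    """Fit a least-squares linear map (with bias) from representations z to
    ground-truth factors and return the per-factor R^2 scores."""
    Z = np.hstack([z, np.ones((z.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Z, factors, rcond=None)
    pred = Z @ W
    ss_res = ((factors - pred) ** 2).sum(axis=0)
    ss_tot = ((factors - factors.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot
```

An R^2 near 1 for every factor indicates the representation is linearly decodable, which is exactly the form of identifiability the paper proves for cross-entropy classifiers.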
Finding clusters of data points with similar characteristics and generating new cluster-specific samples can significantly enhance our understanding of complex data distributions. While clustering has been widely explored using Variational Autoencoders, these models often lack generation quality in real-world datasets. This paper addresses this gap by introducing TreeDiffusion, a deep generative model that conditions Diffusion Models on hierarchical clusters to obtain high-quality, cluster-specific generations. The proposed pipeline consists of two steps: a VAE-based clustering model that learns the hierarchical structure of the data, and a conditional diffusion model that generates realistic images for each cluster. We propose this two-stage process to ensure that the generated samples remain representative of their respective clusters and enhance image fidelity to the level of diffusion models. A key strength of our method is its ability to create images for each cluster, providing better visualization of the learned representations by the clustering model, as demonstrated through qualitative results. This method effectively addresses the generative limitations of VAE-based approaches while preserving their clustering performance. Empirically, we demonstrate that conditioning diffusion models on hierarchical clusters significantly enhances generative performance, thereby advancing the state of generative clustering models.
Authors: Jorge da Silva Goncalves, Laura Manduchi, Moritz Vandenhirtz, Julia E. Vogt
Submitted: arXiv
Date: 22.10.2024
Self-Supervised Learning (SSL) methods often consist of elaborate pipelines with hand-crafted data augmentations and computational tricks. However, the provably minimal set of building blocks that ensures good downstream performance remains unclear. The recently proposed instance discrimination method, coined DIET, stripped down the SSL pipeline and demonstrated how a simple SSL algorithm can work by predicting the sample index. Our work proves that DIET recovers cluster-based latent representations, while successfully identifying the correct cluster centroids in its classification head. We demonstrate the identifiability of DIET on synthetic data adhering to and violating our assumptions, revealing that the recovery of the cluster centroids is even more robust than the feature recovery.
Authors: Attila Juhos*, Alice Bizeul*, Patrik Reizinger*, David Klindt, Randall Balestriero, Mark Ibrahim, Julia E Vogt, Wieland Brendel (* denotes shared first authorship)
Submitted: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice
Date: 10.10.2024
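DIET's objective reduces to ordinary cross-entropy where each training sample is its own class. A toy sketch of that loss (assuming a fixed feature extractor; names and shapes are our own illustration):

```python
import numpy as np

def diet_loss(features, W):
    """Instance-discrimination loss in the style of DIET: sample i's label is its
    own index i, so the linear head W has one column per training sample."""
    logits = features @ W                                  # (n, n)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = features.shape[0]
    return float(-log_prob[np.arange(n), np.arange(n)].mean())
```

Minimizing this over the encoder and the head is the whole stripped-down pipeline; the paper's theory concerns what the head's columns (cluster centroids) and the features provably recover.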
The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.
Authors: Emanuele Palumbo, Moritz Vandenhirtz, Alain Ryser, Imant Daunhawer†, Julia E. Vogt† († denotes shared last authorship)
Submitted: Preprint
Date: 10.10.2024
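A lightweight post-hoc procedure of the kind described above only needs the pre-trained model's logits. The sketch below is one plausible instantiation (centroid-linkage agglomeration in logit space) and not necessarily the paper's exact algorithm:

```python
import numpy as np

def hierarchy_from_logits(logits):
    """Agglomerate flat clusters into a hierarchy using only a pre-trained model's
    logits: repeatedly merge the two clusters with the closest mean logit profile.
    Returns the member sets produced by each merge, in order."""
    assign = logits.argmax(axis=1)
    nodes = {}  # id -> (members, centroid, size)
    for c in range(logits.shape[1]):
        pts = logits[assign == c]
        if len(pts):
            nodes[c] = (frozenset([c]), pts.mean(axis=0), len(pts))
    merges, next_id = [], max(nodes) + 1
    while len(nodes) > 1:
        keys = list(nodes)
        a, b = min(((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
                   key=lambda p: np.linalg.norm(nodes[p[0]][1] - nodes[p[1]][1]))
        ma, ca, na = nodes.pop(a)
        mb, cb, nb = nodes.pop(b)
        nodes[next_id] = (ma | mb, (na * ca + nb * cb) / (na + nb), na + nb)
        merges.append(sorted(ma | mb))
        next_id += 1
    return merges
```

No fine-tuning is involved: the hierarchy is read off from the logits alone, which is what makes such procedures cheap relative to end-to-end hierarchical models.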
This paper introduces Diffuse-TreeVAE, a deep generative model that integrates hierarchical clustering into the framework of Denoising Diffusion Probabilistic Models (DDPMs). The proposed approach generates new images by sampling from a root embedding of a learned latent tree VAE-based structure, propagating it through hierarchical paths, and utilizing a second-stage DDPM to refine and generate distinct, high-quality images for each data cluster. The result is a model that not only improves image clarity but also ensures that the generated samples are representative of their respective clusters, addressing the limitations of previous VAE-based methods and advancing the state of clustering-based generative modeling.
Authors: Jorge da Silva Gonçalves, Laura Manduchi, Moritz Vandenhirtz, Julia E. Vogt
Submitted: ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling
Date: 27.07.2024
Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts. Leveraging the parameterization, we derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.
Authors: Moritz Vandenhirtz*, Sonia Laguna*, Ricards Marcinkevics, Julia E. Vogt (* denotes shared first authorship)
Submitted: ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, Workshop on Models of Human Feedback for AI Alignment, and Workshop on Humans, Algorithmic Decision-Making and Society
Date: 26.07.2024
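The core mechanism of SCBMs, a single intervention shifting all correlated concepts, can be illustrated with a conditional Gaussian update over concept logits. This is a sketch of the idea only, not the paper's exact parameterization or intervention strategy:

```python
import numpy as np

def intervene(mu, cov, idx, value):
    """Condition a Gaussian over concept logits on fixing concept `idx` to `value`.
    Returns the updated mean and covariance of the remaining concepts, which shift
    according to their correlation with the intervened concept."""
    mu, cov = np.asarray(mu, float), np.asarray(cov, float)
    rest = [i for i in range(len(mu)) if i != idx]
    s_ri = cov[np.ix_(rest, [idx])]   # cross-covariances with the intervened concept
    s_ii = cov[idx, idx]
    mu_rest = mu[rest] + (s_ri[:, 0] / s_ii) * (value - mu[idx])
    cov_rest = cov[np.ix_(rest, rest)] - s_ri @ s_ri.T / s_ii
    return mu_rest, cov_rest
```

Because correlated concepts move together, one manual correction can propagate to many concept predictions, which is what makes interventions more effective than in an independence-assuming CBM.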
Patent Ductus Arteriosus (PDA) is a heart condition that affects newborn infants. One way to estimate the severity of PDA is by measuring the Cardiac Output (CO) of the infants. Current techniques for estimating CO include echocardiography or invasive methods. While invasive methods generally provide a more accurate CO estimate, they are difficult procedures for premature infants due to the infants’ size and potential complications. On the other hand, echocardiography CO estimates do not allow for continuous CO monitoring. In this work, we examine the relationship between Pulse Wave Transition Time (PWTT) and stroke volume (SV-VTI), a clinical measure from which it is possible to derive CO. In addition, we build a machine-learning model to predict SV-VTI from PWTT, which provides a novel way of estimating CO inexpensively and non-invasively.
Authors: Doriela Grabocka, Alain Ryser, Sven Wellmann, Holger Michel, Julia E. Vogt
Submitted: 11th ACM Celebration of Women in Computing: womENcourage
Date: 26.06.2024
Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics—inspired by their supervised fairness counterparts—to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.
Authors: Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt†, Asja Fischer†, Johannes Lederer† († denotes shared last authorship)
Submitted: ACM Conference on Fairness, Accountability, and Transparency, 2024
Date: 05.06.2024
Multimodal VAEs have recently gained significant attention as generative models for weakly-supervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. At the intersection of these two research directions, we propose a novel multimodal VAE model in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, we propose a post-hoc procedure to automatically select the number of true clusters, thus mitigating critical limitations of previous clustering frameworks. Notably, our method compares favorably to alternative clustering approaches in weakly-supervised settings. Finally, we integrate recent advancements in diffusion models into the proposed method to improve generative quality for real-world images.
Authors: Emanuele Palumbo, Laura Manduchi, Sonia Laguna, Daphne Chopard, Julia E Vogt
Submitted: ICLR: The Twelfth International Conference on Learning Representations
Date: 17.05.2024
Despite significant progress, evaluation of explainable artificial intelligence remains elusive and challenging. In this paper we propose a fine-grained validation framework that is not overly reliant on any one facet of these sociotechnical systems, and that recognises their inherent modular structure: technical building blocks, user-facing explanatory artefacts and social communication protocols. While we concur that user studies are invaluable in assessing the quality and effectiveness of explanation presentation and delivery strategies from the explainees' perspective in a particular deployment context, the underlying explanation generation mechanisms require a separate, predominantly algorithmic validation strategy that accounts for the technical and human-centred desiderata of their (numerical) outputs. Such a comprehensive sociotechnical utility-based evaluation framework could allow us to reason systematically about the properties and downstream influence of different building blocks from which explainable artificial intelligence systems are composed – accounting for a diverse range of their engineering and social aspects – in view of the anticipated use case.
Authors: Kacper Sokol, Julia E. Vogt
Submitted: Extended Abstracts of the 2024 ACM Conference on Human Factors in Computing Systems (CHI)
Date: 02.05.2024
Pulmonary hypertension (PH) in newborns and infants is a complex condition associated with several pulmonary, cardiac, and systemic diseases contributing to morbidity and mortality. Thus, accurate and early detection of PH and the classification of its severity is crucial for appropriate and successful management. Echocardiography is the primary diagnostic tool in pediatrics, but human assessment of echocardiograms is both time-consuming and expertise-demanding, raising the need for an automated approach. Little effort has been directed towards the automatic assessment of PH using echocardiography, and the few proposed methods only focus on binary PH classification in the adult population. In this work, we present an explainable multi-view video-based deep learning approach to predict and classify the severity of PH for a cohort of 270 newborns using echocardiograms. We use spatio-temporal convolutional architectures for the prediction of PH from each view, and aggregate the predictions of the different views using majority voting. Our results show a mean F1-score of 0.84 for severity prediction and 0.92 for binary detection using 10-fold cross-validation, and 0.63 for severity prediction and 0.78 for binary detection on the held-out test set. We complement our predictions with saliency maps and show that the learned model focuses on clinically relevant cardiac structures, motivating its usage in clinical practice. To the best of our knowledge, this is the first work on the automated assessment of PH in newborns using echocardiograms.
Authors: Hanna Ragnarsdottir*, Ece Özkan Elsen*, Holger Michel*, Kieran Chin-Cheong, Laura Manduchi, Sven Wellmann†, Julia E. Vogt† (* denotes shared first authorship, † denotes shared last authorship)
Submitted: International Journal of Computer Vision
Date: 06.02.2024
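The view-aggregation step described above is a plain majority vote over the per-view predictions. A minimal sketch (our own code; the tie-breaking rule is an assumption, not specified in the abstract):

```python
def majority_vote(view_predictions):
    """Aggregate per-view severity predictions by majority vote.
    Ties resolve in favor of the class predicted first (e.g., by view order)."""
    counts = {}
    for p in view_predictions:
        counts[p] = counts.get(p, 0) + 1
    return max(counts, key=counts.get)  # dict order breaks ties: first seen wins
```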
Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.
Authors: Ricards Marcinkevics*, Patricia Reis Wolfertstetter*, Ugne Klimiene*, Kieran Chin-Cheong, Alyssia Paschke, Julia Zerres, Markus Denzinger, David Niederberger, Sven Wellmann, Ece Özkan Elsen†, Christian Knorr†, Julia E. Vogt† (* denotes shared first authorship, † denotes shared last authorship)
Submitted: Medical Image Analysis
Date: 01.01.2024
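The concept-bottleneck structure that makes such models intervenable is simple: the label is predicted from the concepts alone, so a clinician can overwrite a concept and directly change the downstream output. A toy linear sketch of this forward pass (our own illustration, not the paper's multiview model):

```python
import numpy as np

def cbm_predict(x, W_concepts, W_label, interventions=None):
    """Concept bottleneck forward pass: predict human-readable concepts first,
    then the label from the (sigmoid) concept probabilities alone. Entries of
    `interventions` are clinician-set ground-truth concept values."""
    concepts = 1.0 / (1.0 + np.exp(-(x @ W_concepts)))
    if interventions:
        for i, v in interventions.items():
            concepts[i] = v
    return concepts, concepts @ W_label
```

Because the label head never sees the raw image features, every prediction is explainable in terms of the intermediate concepts.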
We propose the Tree Variational Autoencoder (TreeVAE), a new generative hierarchical clustering model that learns a flexible tree-based posterior distribution over latent variables. TreeVAE hierarchically divides samples according to their intrinsic characteristics, shedding light on hidden structures in the data. It adapts its architecture to discover the optimal tree for encoding dependencies between latent variables. The proposed tree-based generative architecture enables lightweight conditional inference and improves generative performance by utilizing specialized leaf decoders. We show that TreeVAE uncovers underlying clusters in the data and finds meaningful hierarchical relations between the different groups on a variety of datasets, including real-world imaging data. We show empirically that TreeVAE provides a more competitive log-likelihood lower bound than its sequential counterparts. Finally, due to its generative nature, TreeVAE is able to generate new samples from the discovered clusters via conditional sampling.
Authors: Laura Manduchi*, Moritz Vandenhirtz*, Alain Ryser, Julia E. Vogt (* denotes shared first authorship)
Submitted: Spotlight at Neural Information Processing Systems, NeurIPS 2023
Date: 20.12.2023
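Conditional sampling from a tree-structured model amounts to ancestral sampling down the learned tree: route at each internal node, then decode at the reached leaf. A toy sketch of the routing step (the dict-based tree encoding is our own, not TreeVAE's implementation):

```python
import numpy as np

def sample_leaf(node, rng=None):
    """Ancestral sampling through a TreeVAE-style tree: at every internal node,
    go left with the node's learned routing probability until a leaf (cluster)
    is reached; the leaf's decoder would then generate the sample."""
    rng = np.random.default_rng(rng)
    while "children" in node:
        go_left = rng.random() < node["p_left"]
        node = node["children"][0 if go_left else 1]
    return node["leaf_id"]
```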
Recently, interpretable machine learning has re-explored concept bottleneck models (CBM), comprising step-by-step prediction of the high-level concepts from the raw features and the target variable from the predicted concepts. A compelling advantage of this model class is the user's ability to intervene on the predicted concept values, consequently affecting the model's downstream output. In this work, we introduce a method to perform such concept-based interventions on already-trained neural networks, which are not interpretable by design. Furthermore, we formalise the model's intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black-box models. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of the proposed techniques, we apply them to chest X-ray classifiers and show that fine-tuned black boxes can be as intervenable and more performant than CBMs.
Authors: Ricards Marcinkevics*, Sonia Laguna*, Moritz Vandenhirtz, Julia E. Vogt (* denotes shared first authorship)
Submitted: XAI in Action: Past, Present, and Future Applications, NeurIPS 2023
Date: 16.12.2023
Background: The overarching goal of blood glucose forecasting is to assist individuals with type 1 diabetes (T1D) in avoiding hyper- or hypoglycemic conditions. While deep learning approaches have shown promising results for blood glucose forecasting in adults with T1D, it is not known if these results generalize to children. Possible reasons are physical activity (PA), which is often unplanned in children, as well as age and development of a child, which both have an effect on the blood glucose level. Materials and Methods: In this study, we collected time series measurements of glucose levels, carbohydrate intake, insulin dosing, and physical activity from children with T1D for one week in an ethics-approved prospective observational study, which included daily physical activities. We investigate the performance of state-of-the-art deep learning methods for adult data—(dilated) recurrent neural networks and a transformer—on our dataset for short-term (30 min) and long-term (2 h) prediction. We propose to integrate static patient characteristics, such as age, gender, BMI, and percentage of basal insulin, to account for the heterogeneity of our study group. Results: Integrating static patient characteristics (SPC) proves beneficial, especially for short-term prediction. LSTMs and GRUs with SPC perform best for a prediction horizon of 30 min (RMSE of 1.66 mmol/l), while a vanilla RNN with SPC performs best across different prediction horizons; however, performance decays significantly for long-term prediction. For prediction during the night, the best method improves to an RMSE of 1.50 mmol/l. Overall, the results for our baselines and RNN models indicate that blood glucose forecasting for children conducting regular physical activity is more challenging than for previously studied adult data.
Conclusion: We find that integrating static data improves the performance of deep-learning architectures for blood glucose forecasting of children with T1D and achieves promising results for short-term prediction. Despite these improvements, additional clinical studies are warranted to extend forecasting to longer-term prediction horizons.
Authors: Alexander Marx, Francesco Di Stefano, Heike Leutheuser, Kieran Chin-Cheong, Marc Pfister, Marie-Anne Burckhardt, Sara Bachmann†, Julia E. Vogt† († denotes shared last authorship)
Submitted: Frontiers in Pediatrics
Date: 14.12.2023
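One common way to integrate static patient characteristics into a sequence model is to tile them along the time axis and concatenate them to the dynamic inputs at every step. This is a generic sketch under that assumption, not necessarily the paper's exact integration scheme:

```python
import numpy as np

def add_static_features(series, static):
    """Concatenate static patient characteristics (e.g., age, gender, BMI,
    basal-insulin share) to the dynamic inputs at every time step, so a
    recurrent model sees them throughout the sequence."""
    T = series.shape[0]
    return np.concatenate([series, np.tile(static, (T, 1))], axis=1)
```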
Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.
Authors: Thomas M. Sutter*, Alain Ryser*, Joram Liebeskind, Julia E. Vogt (* denotes shared first authorship)
Submitted: NeurIPS 2023
Date: 12.12.2023
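The deterministic skeleton of the two-step construction (sizes first, then ordered filling) can be sketched as follows. This toy version omits the sampling and the reparameterized gradients that are the paper's actual contribution; names are ours:

```python
import numpy as np

def fill_partition(subset_sizes, order_scores):
    """Two-step partition construction (sketch): step one fixes the number of
    elements per subset; step two fills the subsets sequentially, taking
    elements in the learned order given by `order_scores` (higher = earlier)."""
    order = np.argsort(-np.asarray(order_scores, float))
    subsets, start = [], 0
    for size in subset_sizes:
        subsets.append(sorted(order[start:start + size].tolist()))
        start += size
    return subsets
```

In the full model both the sizes and the ordering are random variables with learnable parameters, which is what makes gradients through the partition possible.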
Prototype learning, a popular machine learning method designed for inherently interpretable decisions, leverages similarities to learned prototypes for classifying new data. While it is mainly applied in computer vision, in this work, we build upon prior research and further explore the extension of prototypical networks to natural language processing. We introduce a learned weighted similarity measure that enhances the similarity computation by focusing on informative dimensions of pre-trained sentence embeddings. Additionally, we propose a post-hoc explainability mechanism that extracts prediction-relevant words from both the prototype and input sentences. Finally, we empirically demonstrate that our proposed method not only improves predictive performance on the AG News and RT Polarity datasets over a previous prototype-based approach, but also improves the faithfulness of explanations compared to rationale-based recurrent convolutions.
Authors: Claudio Fanconi*, Moritz Vandenhirtz*, Severin Husmann, Julia E. Vogt (* denotes shared first authorship)
Submitted: Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
Date: 25.10.2023
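A learned weighted similarity of the kind described above can be illustrated as cosine similarity after reweighting the embedding dimensions. This is a hypothetical sketch; the paper's exact measure may differ:

```python
import numpy as np

def weighted_similarity(x, prototype, w):
    """Cosine similarity after reweighting embedding dimensions: a learned weight
    vector w lets the model focus on informative dimensions of the sentence
    embedding and down-weight noisy ones."""
    xw, pw = x * w, prototype * w
    return float(xw @ pw / (np.linalg.norm(xw) * np.linalg.norm(pw)))
```

At classification time, a new sentence is assigned according to its weighted similarity to each learned prototype.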
Background: Hyperbilirubinemia of the newborn infant is a common disease worldwide. However, recognized early and treated appropriately, it typically remains innocuous. We recently developed an early phototherapy prediction tool (EPPT) by means of machine learning (ML) utilizing just one bilirubin measurement and a few clinical variables. The aim of this study is to test the applicability and performance of the EPPT on a new patient cohort from a different population. Materials and methods: This work is a retrospective study of prospectively recorded neonatal data from infants born in 2018 in an academic hospital, Regensburg, Germany, meeting the following inclusion criteria: born with 34 completed weeks of gestation or more, at least two total serum bilirubin (TSB) measurements prior to phototherapy. First, the original EPPT—an ensemble of a logistic regression and a random forest—was used in its freely accessible version and evaluated in terms of the area under the receiver operating characteristic curve (AUROC). Second, a new version of the EPPT model was re-trained on the data from the new cohort. Third, the predictive performance, variable importance, sensitivity and specificity were analyzed and compared across the original and re-trained models. Results: In total, 1,109 neonates were included with a median (IQR) gestational age of 38.4 (36.6–39.9) weeks and a total of 3,940 bilirubin measurements prior to any phototherapy treatment, which was required in 154 neonates (13.9%). For the phototherapy treatment prediction, the original EPPT achieved a predictive performance of 84.6% AUROC on the new cohort. After re-training the model on a subset of the new dataset, 88.8% AUROC was achieved as evaluated by cross-validation. The same five variables as for the original model were found to be most important for the prediction on the new cohort, namely gestational age at birth, birth weight, bilirubin-to-weight ratio, hours since birth, and bilirubin value.
Discussion: The individual risk for treatment requirement in neonatal hyperbilirubinemia is robustly predictable in different patient cohorts with a previously developed ML tool (EPPT) demanding just one TSB value and only four clinical parameters. Further prospective validation studies are needed to develop an effective and safe clinical decision support system.
Authors: Imant Daunhawer, Kai Schumacher, Anna Badura, Julia E. Vogt, Holger Michel, Sven Wellmann
Submitted: Frontiers in Pediatrics, 2023
Date: 09.10.2023
Chronic obstructive pulmonary disease (COPD) is a significant public health issue, affecting more than 100 million people worldwide. Remote patient monitoring has shown great promise in the efficient management of patients with chronic diseases. This work presents the analysis of the data from a monitoring system developed to track COPD symptoms alongside patients’ self-reports. In particular, we investigate the assessment of COPD severity using multisensory home-monitoring device data acquired from 30 patients over a period of three months. We describe a comprehensive data pre-processing and feature engineering pipeline for multimodal data from the remote home-monitoring of COPD patients. We develop and validate predictive models forecasting i) the absolute and ii) differenced COPD Assessment Test (CAT) scores based on the multisensory data. The best models achieve Pearson’s correlation coefficients of 0.93 and 0.37 for absolute and differenced CAT scores, respectively. In addition, we investigate the importance of individual sensor modalities for predicting CAT scores using group sparse regularization techniques. Our results suggest that feature groups indicative of the patient’s general condition, such as static medical and physiological information, date, spirometer, and air quality, are crucial for predicting the absolute CAT score. For predicting changes in CAT scores, sleep and physical activity features are most important, alongside the previous CAT score value. Our analysis demonstrates the potential of remote patient monitoring for COPD management and investigates which sensor modalities are most indicative of COPD severity as assessed by the CAT score. Our findings contribute to the development of effective and data-driven COPD management strategies.
Authors: Zixuan Xiao, Michal Muszynski, Ricards Marcinkevics, Lukas Zimmerli, Adam D. Ivankay, Dario Kohlbrenner, Manuel Kuhn, Yves Nordmann, Ulrich Muehlner, Christian Clarenbach, Julia E. Vogt, Thomas Brunschwiler
Submitted: 25th ACM International Conference on Multimodal Interaction, ICMI'23
Date: 09.10.2023
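Group sparse regularization, used above to rank sensor modalities, zeroes out whole feature groups at once. Its proximal step can be sketched as follows (a generic group-lasso operator, not the paper's specific solver):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal step of the group-lasso penalty used for sensor-modality
    selection: each group's coefficient block is shrunk toward zero, and groups
    whose norm falls below lam are zeroed out entirely (modality dropped)."""
    w = np.asarray(w, float).copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] = 0.0 if norm <= lam else w[idx] * (1.0 - lam / norm)
    return w
```

Which groups survive the thresholding indicates which sensor modalities carry signal for the CAT score.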
Early detection of cardiac dysfunction through routine screening is vital for diagnosing cardiovascular diseases. An important metric of cardiac function is the left ventricular ejection fraction (EF), where lower EF is associated with cardiomyopathy. Echocardiography is a popular diagnostic tool in cardiology, with ultrasound being a low-cost, real-time, and non-ionizing technology. However, human assessment of echocardiograms for calculating EF is time-consuming and expertise-demanding, raising the need for an automated approach. In this work, we propose using the M(otion)-mode of echocardiograms for estimating the EF and classifying cardiomyopathy. We generate multiple artificial M-mode images from a single echocardiogram and combine them using off-the-shelf model architectures. Additionally, we extend contrastive learning (CL) to cardiac imaging to learn meaningful representations from exploiting structures in unlabeled data allowing the model to achieve high accuracy, even with limited annotations. Our experiments show that the supervised setting converges with only ten modes and is comparable to the baseline method while bypassing its cumbersome training process and being computationally much more efficient. Furthermore, CL using M-mode images is helpful for limited data scenarios, such as having labels for only 200 patients, which is common in medical applications.
Authors: Ece Özkan Elsen*, Thomas M. Sutter*, Yurong Hu, Sebastian Balzer, Julia E. Vogt (* denotes shared first authorship)
Submitted: GCPR 2023
Date: 01.09.2023
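An artificial M-mode image is obtained by fixing one scan line in the echo clip and stacking its intensities over time. A minimal sketch of that extraction (our own illustration, using a horizontal image row as the scan line; the paper samples multiple such lines per echocardiogram):

```python
import numpy as np

def mmode_from_video(video, row):
    """Build an artificial M(otion)-mode image from an echo clip: fix one scan
    line (here: image row `row`) and stack its intensities over time, giving a
    2-D image of spatial position vs. time."""
    # video: (frames, height, width) -> M-mode image: (width, frames)
    return video[:, row, :].T
```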
Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. With recent advances in machine learning, data-driven decision support could help clinicians diagnose and manage patients while reducing the number of non-critical surgeries. However, previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored the use of abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. To this end, our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts that are understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed.
Authors: Ricards Marcinkevics*, Patricia Reis Wolfertstetter*, Ugne Klimiene*, Kieran Chin-Cheong, Alyssia Paschke, Julia Zerres, Markus Denzinger, David Niederberger, Sven Wellmann, Ece Özkan Elsen†, Christian Knorr†, Julia E. Vogt† (* denotes shared first authorship, † denotes shared last authorship)
Submitted: Workshop on Machine Learning for Multimodal Healthcare Data, Co-located with ICML 2023
Date: 29.07.2023
Ante-hoc interpretability has become the holy grail of explainable artificial intelligence for high-stakes domains such as healthcare; however, this notion is elusive, lacks a widely-accepted definition and depends on the operational context. It can refer to predictive models whose structure adheres to domain-specific constraints, or ones that are inherently transparent. The latter conceptualisation assumes observers who judge this quality, whereas the former presupposes them to have technical and domain expertise (thus alienating other groups of explainees). Additionally, the distinction between ante-hoc interpretability and the less desirable post-hoc explainability, which refers to methods that construct a separate explanatory model, is vague given that transparent predictive models may still require (post-)processing to yield suitable explanatory insights. Ante-hoc interpretability is thus an overloaded concept that comprises a range of implicit properties, which we unpack in this paper to better understand what is needed for its safe deployment across high-stakes domains. To this end, we outline modelling and explaining desiderata that allow us to navigate its distinct realisations in view of the envisaged application and audience.
Authors: Kacper Sokol, Julia E. Vogt
Submitted: Workshop on Interpretable ML in Healthcare at 2023 International Conference on Machine Learning (ICML)
Date: 28.07.2023
Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on two different challenging experiments: variational clustering and inference of shared and independent generative factors under weak supervision.
Authors: Thomas M. Sutter*, Alain Ryser*, Joram Liebeskind, Julia E. Vogt (* denotes shared first authorship)
Submitted: ICML Workshop on Structured Probabilistic Inference & Generative Modeling
Date: 23.07.2023
Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine-learning problems. However, assigning elements to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We propose a novel two-step method for learning distributions over partitions, including a reparametrization trick, to allow the inclusion of partitions in variational inference tasks. Our method works by first inferring the number of elements per subset and then sequentially filling these subsets in an order learned in a second step. We highlight the versatility of our general-purpose approach on two different experiments: multitask learning and unsupervised conditional sampling.
Authors: Thomas M. Sutter*, Alain Ryser*, Joram Liebeskind, Julia E. Vogt (* denotes shared first authorship)
Submitted: Fifth Symposium on Advances in Approximate Bayesian Inference
Date: 18.07.2023