The Medical Data Science (MDS) group carries out research at the intersection of machine learning and medicine, with the ultimate goal of improving diagnosis and treatment outcomes to the benefit of the care and wellbeing of patients. As medical and health data are heterogeneous and multimodal, our research advances machine learning models and methodologies to address the specific challenges of the medical domain. Specifically, we work in the areas of multimodal data integration, structure detection, and trustworthy (or transparent) models. The challenge lies not only in developing fast, robust, and reliable systems, but also in making these systems easy to interpret and usable in clinical practice.


MDS at ICLR 2024

Several members of the MDS group attended ICLR 2024. Congratulations to everyone who presented work at the main conference and workshops!

Artificial intelligence detects heart defects in newborns

Our recent paper "The Deep Learning Based Prediction of Pulmonary Hypertension in Newborns Using Echocardiograms", published together with KUNO Klinik…

Thomas and Imant defend PhD theses in 2023

Congratulations to Thomas Sutter and Imant Daunhawer, who both successfully defended their PhD theses in 2023.

Thomas' thesis is titled "Imposing and…


Abstract

Despite significant progress, the evaluation of explainable artificial intelligence remains elusive and challenging. In this paper, we propose a fine-grained validation framework that is not overly reliant on any one facet of these sociotechnical systems, and that recognises their inherent modular structure: technical building blocks, user-facing explanatory artefacts and social communication protocols. While we concur that user studies are invaluable in assessing the quality and effectiveness of explanation presentation and delivery strategies from the explainees' perspective in a particular deployment context, the underlying explanation generation mechanisms require a separate, predominantly algorithmic validation strategy that accounts for the technical and human-centred desiderata of their (numerical) outputs. Such a comprehensive sociotechnical utility-based evaluation framework could allow one to systematically reason about the properties and downstream influence of different building blocks from which explainable artificial intelligence systems are composed – accounting for a diverse range of their engineering and social aspects – in view of the anticipated use case.

Authors

Kacper Sokol, Julia E. Vogt

Submitted

Extended Abstracts of the 2024 ACM Conference on Human Factors in Computing Systems (CHI)

Date

02.05.2024

Link

DOI

Abstract

High-density multielectrode catheters are becoming increasingly popular in cardiac electrophysiology for advanced characterisation of the cardiac tissue, due to their potential to identify impaired sites. These sites are often characterised by abnormal electrical conduction, which may cause locally disorganised propagation wavefronts. To quantify this, a novel heterogeneity parameter based on vector field analysis is proposed, utilising finite differences to measure direction changes between adjacent cliques. The proposed Vector Field Heterogeneity metric has been evaluated on a set of simulations with controlled levels of organisation in vector maps and a variety of grid sizes. Furthermore, it has been tested on animal experimental models of isolated Langendorff-perfused rabbit hearts. The proposed parameter exhibited a superior ability to capture heterogeneous propagation wavefronts compared with the classical Spatial Inhomogeneity Index, and the simulations showed that the metric effectively captures gradual increments in the disorganisation of propagation patterns. Notably, it yielded robust and consistent outcomes for 4 × 4 grid sizes, underscoring its suitability for the latest generation of orientation-independent cardiac catheters.

Index Terms: Animal experimental models, cardiac signal processing, electrophysiology, high-density electrode catheters, vector field heterogeneity.

Impact Statement: The authors introduce the Vector Field Heterogeneity (VFH) metric, which provides a precise evaluation of disorganisation in electrical propagation maps within cardiac tissue, potentially improving the diagnosis and characterisation of electrophysiological conditions.
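The abstract describes the metric only at a high level. As a rough illustration of the underlying idea – finite differences of local propagation directions between adjacent cliques – the Python sketch below computes the mean angular change across a grid of conduction-direction vectors. The function name and the aggregation by a simple mean are illustrative assumptions, not the authors' exact VFH definition.

```python
import numpy as np

def vector_field_heterogeneity(vx, vy):
    """Illustrative heterogeneity score for a 2-D field of propagation
    direction vectors (one vector per clique of electrodes).

    vx, vy : 2-D arrays with the x- and y-components of each local
    conduction-direction vector. Returns the mean angular change (in
    radians, within [0, pi]) between horizontally and vertically adjacent
    vectors, computed with first-order finite differences.
    """
    angles = np.arctan2(vy, vx)                       # local propagation angle per clique

    def wrapped_diff(a, b):
        # smallest absolute angular difference, wrapped to [0, pi]
        d = np.abs(a - b) % (2 * np.pi)
        return np.minimum(d, 2 * np.pi - d)

    dx = wrapped_diff(angles[:, 1:], angles[:, :-1])  # horizontal neighbours
    dy = wrapped_diff(angles[1:, :], angles[:-1, :])  # vertical neighbours
    return np.concatenate([dx.ravel(), dy.ravel()]).mean()

# Example on a 4 x 4 grid (the catheter size highlighted in the paper):
rng = np.random.default_rng(0)
organised = np.zeros((4, 4)), np.ones((4, 4))          # all vectors point the same way
disorganised = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(vector_field_heterogeneity(*organised))          # ~0: homogeneous propagation
print(vector_field_heterogeneity(*disorganised))       # larger: disorganised wavefront
```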

Authors

L Pancorbo*, S Ruipérez-Campillo*, A Tormos, A Guill, R Cervigón, A Alberola, FJ Chorro, J Millet, F Castells
* denotes shared first authorship

Submitted

IEEE Open Journal of Engineering in Medicine and Biology

Date

23.02.2024

Link

DOI

Abstract

Pulmonary hypertension (PH) in newborns and infants is a complex condition associated with several pulmonary, cardiac, and systemic diseases contributing to morbidity and mortality. Accurate and early detection of PH and classification of its severity are therefore crucial for appropriate and successful management. Echocardiography is the primary diagnostic tool in pediatrics, but human assessment is both time-consuming and expertise-demanding, raising the need for an automated approach. Little effort has been directed towards automatic assessment of PH using echocardiography, and the few proposed methods focus only on binary PH classification in the adult population. In this work, we present an explainable multi-view video-based deep learning approach to predict and classify the severity of PH for a cohort of 270 newborns using echocardiograms. We use spatio-temporal convolutional architectures for the prediction of PH from each view, and aggregate the predictions of the different views using majority voting. Our results show a mean F1-score of 0.84 for severity prediction and 0.92 for binary detection using 10-fold cross-validation, and 0.63 for severity prediction and 0.78 for binary detection on the held-out test set. We complement our predictions with saliency maps and show that the learned model focuses on clinically relevant cardiac structures, motivating its usage in clinical practice. To the best of our knowledge, this is the first work on automated assessment of PH in newborns using echocardiograms.
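For readers interested in the view-aggregation step, the snippet below sketches the majority voting over per-view predictions mentioned in the abstract. The per-view models in the paper are spatio-temporal convolutional networks; here their outputs are replaced by placeholder class scores, and the class labels and tie-breaking rule are assumptions made purely for illustration.

```python
import numpy as np

def aggregate_views(view_scores):
    """Majority vote over per-view severity predictions.

    view_scores : array of shape (n_views, n_classes) with one row of class
    scores per echocardiographic view (placeholders standing in for the
    outputs of the per-view spatio-temporal networks). Returns the class
    index chosen by most views; ties resolve to the lowest class index via
    np.bincount / argmax.
    """
    view_predictions = np.argmax(view_scores, axis=1)          # hard decision per view
    votes = np.bincount(view_predictions, minlength=view_scores.shape[1])
    return int(np.argmax(votes))

# Toy example with 5 views and 3 illustrative severity classes:
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.4, 0.3],
                   [0.8, 0.1, 0.1]])
print(aggregate_views(scores))   # -> 1: three of the five views vote for class 1
```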

Authors

Hanna Ragnarsdottir*, Ece Özkan Elsen*, Holger Michel*, Kieran Chin-Cheong, Laura Manduchi, Sven Wellmann, Julia E. Vogt
* denotes shared first authorship, denotes shared last authorship

Submitted

International Journal of Computer Vision

Date

06.02.2024

Link

DOI

Abstract

Recently, interpretable machine learning has re-explored concept bottleneck models (CBM), comprising the step-by-step prediction of high-level concepts from the raw features and of the target variable from the predicted concepts. A compelling advantage of this model class is the user's ability to intervene on the predicted concept values, thereby affecting the model's downstream output. In this work, we introduce a method to perform such concept-based interventions on already-trained neural networks, which are not interpretable by design, given an annotated validation set. Furthermore, we formalise the model's intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black-box models. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of the proposed techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes can be as intervenable as, and more performant than, CBMs.
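As a rough illustration of the ideas above, the Python sketch below shows a generic concept-based intervention (replacing predicted concept values with ground-truth annotations before the downstream head) and a toy intervenability-style score, defined here as the resulting drop in downstream loss. The paper's formal definition of intervenability, and its procedure for intervening on and fine-tuning black-box models, differ in detail; all function and variable names here are illustrative assumptions.

```python
import numpy as np

def intervene(predicted_concepts, true_concepts, idx):
    """Replace the predicted values of the concepts in `idx` with their
    ground-truth annotations, leaving the remaining concepts untouched."""
    corrected = predicted_concepts.copy()
    corrected[:, idx] = true_concepts[:, idx]
    return corrected

def intervenability(head, predicted_concepts, true_concepts, y, idx, loss):
    """Toy intervenability-style score: how much the downstream loss drops
    when the concepts in `idx` are set to their true values."""
    before = loss(head(predicted_concepts), y)
    after = loss(head(intervene(predicted_concepts, true_concepts, idx)), y)
    return before - after

# Minimal example with a linear downstream head and squared-error loss.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
head = lambda c: c @ w
sq_loss = lambda pred, y: np.mean((pred - y) ** 2)

true_c = rng.normal(size=(100, 3))
noisy_c = true_c + rng.normal(scale=0.5, size=(100, 3))   # imperfect concept predictor
y = head(true_c)                                          # labels generated from true concepts
print(intervenability(head, noisy_c, true_c, y, idx=[0, 1], loss=sq_loss))  # > 0
```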

Authors

Sonia Laguna*, Ricards Marcinkevics*, Moritz Vandenhirtz, Julia E. Vogt
* denotes shared first authorship

Submitted

arXiv

Date

24.01.2024

Link

Abstract

Background and Objectives: The extensive collection of electrocardiogram (ECG) recordings stored in paper format has provided opportunities for numerous digitization studies. However, the traditional 10 s 12-lead ECG printout typically splits the ECG signals into four asynchronous sections of 3 leads and 2.5 s each. Since each lead corresponds to different time instants, developing a synchronization method becomes necessary for applications such as vectorcardiogram (VCG) reconstruction. Methods: A beat-level synchronization method has been developed and validated using a dataset of 21,674 signals. This method effectively addresses synchronization distortions caused by RR interval variations and preserves the time lags between R peaks across different leads for each beat. Results: The results demonstrate that the proposed method successfully synchronizes the ECG, allowing a VCG reconstruction with an average Pearson Correlation Coefficient of 0.9815±0.0426. The Normalized Root Mean Squared Error (NRMSE) and Mean Absolute Error (MAE) values for the reconstructed VCG are 0.0248±0.0214 mV and 0.0133±0.0123 mV, respectively. These metrics indicate the reliability of the VCG reconstruction achieved by means of the proposed synchronization method. Conclusions: The synchronization method has demonstrated its robustness and high performance compared to existing techniques in the field. Its effectiveness has been observed across a wide variety of signals, showcasing its applicability in real clinical environments. Moreover, its ability to handle a large number of signals makes it suitable for various applications, including retrospective studies and the development of machine learning methods.
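The abstract reports agreement between the reconstructed and reference VCG in terms of Pearson correlation, NRMSE and MAE. The sketch below shows one way such per-lead metrics might be computed; normalising the RMSE by the peak-to-peak amplitude is an assumption, since the paper's exact normalisation is not stated here.

```python
import numpy as np

def vcg_agreement(reconstructed, reference):
    """Per-lead agreement metrics between a reconstructed and a reference
    VCG, both arrays of shape (n_samples, 3) for the X, Y, Z leads (in mV).
    Returns a list of (Pearson r, NRMSE, MAE) tuples, one per lead."""
    metrics = []
    for lead in range(reference.shape[1]):
        rec, ref = reconstructed[:, lead], reference[:, lead]
        pearson = np.corrcoef(rec, ref)[0, 1]
        rmse = np.sqrt(np.mean((rec - ref) ** 2))
        nrmse = rmse / (ref.max() - ref.min())   # normalised by peak-to-peak range (assumption)
        mae = np.mean(np.abs(rec - ref))
        metrics.append((pearson, nrmse, mae))
    return metrics

# Toy check on a synthetic three-lead signal with small reconstruction error:
t = np.linspace(0, 1, 1000)
reference = np.stack([np.sin(2 * np.pi * f * t) for f in (1, 2, 3)], axis=1)
reconstructed = reference + np.random.default_rng(0).normal(scale=0.01, size=reference.shape)
for pearson, nrmse, mae in vcg_agreement(reconstructed, reference):
    print(f"r={pearson:.4f}  NRMSE={nrmse:.4f}  MAE={mae:.4f}")
```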

Authors

E Ramírez, S Ruipérez-Campillo, F Castells, R Casado-Arroyo, J Millet

Submitted

Biomedical Signal Processing and Control

Date

05.01.2024

Link

DOI