The medical data science group carries out research at the intersection of machine learning and medicine with the ultimate goal of improving diagnosis and treatment outcomes for the benefit of patient care and wellbeing. As medical and health data is heterogeneous and multimodal, our research advances machine learning models and methodologies to address the specific challenges of the medical domain. Specifically, we work in the areas of multimodal data integration, structure detection, and trustworthy (or transparent) models. The challenge lies not only in developing fast, robust and reliable systems but also in making those systems easy to interpret and usable in clinical practice.
MDS at NeurIPS 2024
Several members of the MDS group attended NeurIPS 2024. Congratulations to everyone who presented work at the main conference and workshops!
CSNOW awarded 2nd place in 2024 ETH Diversity Award
Congratulations to CSNOW for finishing second in the 2024 ETH Diversity Award!
'Internal portraits' in the past and present - X-ray technology on the “Magic Mountain” and ETH Zurich
What challenges did X-ray diagnostics face back then? How does today's medicine meet them? Marco Stampanoni, Professor of X-ray Imaging, and Julia…
Pulmonary hypertension (PH) in newborns is a critical condition characterized by elevated pressure in the pulmonary arteries, leading to right ventricular strain and heart failure. While right heart catheterization (RHC) is the diagnostic gold standard, echocardiography is preferred due to its non-invasive nature, safety, and accessibility. However, its accuracy highly depends on the operator, making PH assessment subjective. While automated detection methods have been explored, most models focus on adults and rely on single-view echocardiographic frames, limiting their performance in diagnosing PH in newborns. While multi-view echocardiography has shown promise in improving PH assessment, existing models struggle with generalizability. In this work, we employ a multi-view variational autoencoder (VAE) for PH prediction using echocardiographic videos. By leveraging the VAE framework, our model captures complex latent representations, improving feature extraction and robustness. We compare its performance against single-view and supervised learning approaches. Our results show improved generalization and classification accuracy, highlighting the effectiveness of multi-view learning for robust PH assessment in newborns.
Authors: L Erlarcher*, S Ruipérez-Campillo*, H Michel, S Wellmann, TM Sutter, E Ozkan, JE Vogt (* denotes shared first authorship)
Submitted: ICLR 2025 - Workshop on AI for Children
Date: 06.03.2025
Background: It is difficult to identify patients with atrial fibrillation (AF) most likely to respond to ablation. While any arrhythmia patient may recur after acutely successful ablation, AF is unusual in that patients may have long-term arrhythmia freedom despite a lack of acute success. We hypothesized that acute and chronic AF ablation outcomes may reflect distinct physiology and used machine learning of multimodal data to identify their phenotypes. Methods: We studied 561 consecutive patients in the Stanford AF ablation registry (66±10 years, 28% women, 67% nonparoxysmal), from whom we extracted 72 data features of electrograms, electrocardiogram, cardiac structure, lifestyle, and clinical variables. We compared 6 machine learning models to predict acute and long-term end points after ablation and used Shapley explainability analysis to contrast phenotypes. We validated our results in an independent external population of n=77 patients with AF. Results: The 1-year success rate was 69.5%, and the acute termination rate was 49.6%, which correlated poorly on a patient-by-patient basis (φ coefficient=0.08). The best model for acute termination (area under the curve=0.86, Random Forest) was more predictive than for long-term outcomes (area under the curve=0.67, logistic regression; P<0.001). Phenotypes for long-term success reflected clinical and lifestyle features, while phenotypes for AF termination reflected electrical features. The need for AF induction predicted both phenotypes. The external validation cohort showed similar results (area under the curve=0.81 and 0.64, respectively) with similar phenotypes. Conclusions: Long-term and acute responses to AF ablation reflect distinct clinical and electrical physiology, respectively. This de-linking of phenotypes raises the question of whether long-term success operates through factors such as attenuated AF progression. There remains an urgent need to develop procedural predictors of long-term AF ablation success.
Authors: P Ganesan, M Pedron, R Feng, AJ Rogers, B Deb, HJ Chang, S Ruipérez-Campillo, V Srivastava, KA Brennan, W Giles, T Baykaner, P Clopton, PJ Wang, U Schotten, DE Krummen, SM Narayan
Submitted: Circulation: Arrhythmia and Electrophysiology
Date: 10.02.2025
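The abstract above reports a φ coefficient of 0.08 between acute AF termination and 1-year success. As a minimal sketch of how such a coefficient is computed from a 2×2 contingency table of two binary outcomes, the function below uses the standard formula. The function name and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for a 2x2 table of two binary outcomes.

    Table layout (counts of patients):
                      outcome_2 = yes   outcome_2 = no
    outcome_1 = yes         a                 b
    outcome_1 = no          c                 d
    """
    # Product of the four marginal totals (two rows, two columns).
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom


# Hypothetical counts for illustration only (not study data).
example_phi = phi_coefficient(40, 10, 30, 20)
```

Values near 0 indicate that the two outcomes are nearly unrelated on a patient-by-patient basis, while values near 1 or -1 indicate strong agreement or disagreement.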
The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
Authors: N Deperrois, H Matsuo, S Ruipérez-Campillo, M Vandenhirtz, S Laguna, A Ryser, K Fujimoto, M Nishio, TM Sutter, JE Vogt, J Kluckert, T Frauenfelder, C Blüthgen, F Nooralahzadeh, M Krauthammer
Submitted: arXiv
Date: 01.02.2025
Background: Large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) excel at interpreting unstructured data from public sources, yet are limited when responding to queries on private repositories, such as electronic health records (EHRs). We hypothesized that prompt engineering could enhance the accuracy of LLMs for interpreting EHR data without requiring domain knowledge, thus expanding their utility for patients and personalized diagnostics. Methods: We designed and systematically tested prompt engineering techniques to improve the ability of LLMs to interpret EHRs for nuanced diagnostic questions, referenced to a panel of medical experts. In 490 full-text EHR notes from 125 patients with prior life-threatening heart rhythm disorders, we asked GPT-4-turbo to identify recurrent arrhythmias distinct from prior events and tested 220 563 queries. To provide context, results were compared with rule-based natural language processing and Bidirectional Encoder Representations from Transformer-based language models. Experiments were repeated for 2 additional LLMs. Results: In an independent hold-out set of 389 notes, GPT-4-turbo had a balanced accuracy of 64.3%±4.7% out-of-the-box at baseline. This increased when asking GPT-4-turbo to provide a rationale for its answers, a structured data output, and in-context exemplars, to a balanced accuracy of 91.4%±3.8% (P<0.05). This surpassed the traditional logic-based natural language processing and BERT-based models (P<0.05). Results were consistent for GPT-3.5-turbo and Jurassic-2 LLMs. Conclusions: The use of prompt engineering strategies enables LLMs to identify clinical end points from EHRs with an accuracy that surpassed natural language processing and approximated experts, yet without the need for expert knowledge. These approaches could be applied to LLM queries for other domains, to facilitate automated analysis of nuanced data sets with high accuracy by nonexperts.
Authors: R Feng, KA Brennan, Z Azizi, J Goyal, B Deb, HJ Chang, P Ganesan, P Clopton, M Pedron, S Ruipérez-Campillo, Y Desai, H De Larochellière, T Baykaner, M Perez, M Rodrigo, AJ Rogers, SM Narayan
Submitted: Circulation: Arrhythmia and Electrophysiology
Date: 01.01.2025
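The results above are reported as balanced accuracy, which is the mean of sensitivity and specificity and is therefore less affected by class imbalance than plain accuracy. The sketch below shows the standard binary definition. The function name and the toy labels are assumptions for illustration and do not reproduce the study's evaluation code or data.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Toy labels for illustration only: 2 positives, 6 negatives.
toy_true = [1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]
toy_score = balanced_accuracy(toy_true, toy_pred)
```

On the toy labels, plain accuracy is 6/8, while balanced accuracy is lower because one of only two positive cases is missed.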
Background: Vector field heterogeneity (VFH) is a novel omnipolar metric to quantify local propagation heterogeneities that may identify functionally critical sites for ablation in scar-related ventricular tachycardia (VT). Objective: This study aims to assess the diagnostic value of VFH to identify abnormal propagation patterns during ventricular substrate mapping and compare VFH in VT isthmus sites (IS), low-voltage bystander area (LVA), and normal voltage areas (NVA). Methods: Substrate maps acquired with a 16-pole grid catheter in patients with scar-related VT were segmented into sites corresponding to IS, LVA, and NVA (defined as omnipolar voltages > and <1.5 mV) based on corresponding VT activation maps. For each 4 × 4 electrode-array acquisition, omnipolar-derived vector maps of the directions of electrical propagation were computed offline, and a VFH value per clique ranging from 0 (perfect planar wave) to 1 (maximal disorganization) was derived. Results: Sixteen patients were studied (56.3% ischemic), evaluating 9 endocardial and 7 epicardial substrate maps. VFH metric at IS was 0.57 ± 0.26, LVA 0.52 ± 0.34, and NVA 0.16 ± 0.24. VFH at IS was higher (more disorganized) than at LVA (P < .001). VFH in NVA was significantly lower (ie, more homogeneous) compared with IS and LVA (P < .001, respectively). Within the isthmus, highest heterogeneity was recorded at entry (0.61 ± 0.24), lowest at exit sites (0.44 ± 0.27). Conclusion: VFH mapping identified a significant increase in electrical heterogeneity at IS compared with bystanders with highest heterogeneity at the entrance. Yet, absolute differences were small, and substantial overlap among sites was recorded, precluding its use as a stand-alone mapping approach. VFH may be integrated in existing mapping strategies as complementary information to support identification of ablation targets.
Authors: J Tonko*, S Ruipérez-Campillo*, G Cabero-Vidal, E Cabrera-Borrego, C Roney, J Jiminez, J Millet, F Castells, P Lambiase (* denotes shared first authorship)
Submitted: Heart Rhythm
Date: 06.11.2024
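The abstract above describes a per-clique score that runs from 0 (perfect planar wave) to 1 (maximal disorganization) but does not give its formula. As a rough illustration of how a score with that range can be built from propagation directions, the sketch below uses a stand-in: one minus the mean resultant length of the unit direction vectors. This is an assumption chosen for illustration and is not the published VFH definition; the function name and sample vectors are likewise hypothetical.

```python
import math
from typing import Iterable, Tuple


def direction_heterogeneity(vectors: Iterable[Tuple[float, float]]) -> float:
    """Stand-in heterogeneity score in [0, 1] (NOT the published VFH formula).

    Returns 1 minus the length of the mean unit vector: 0 when all
    propagation directions are identical, approaching 1 when directions
    cancel each other out.
    """
    units = []
    for vx, vy in vectors:
        norm = math.hypot(vx, vy)
        if norm == 0:
            continue  # skip sites with no measurable propagation direction
        units.append((vx / norm, vy / norm))

    if not units:
        raise ValueError("at least one non-zero vector is required")

    mean_x = sum(u[0] for u in units) / len(units)
    mean_y = sum(u[1] for u in units) / len(units)
    return 1.0 - math.hypot(mean_x, mean_y)


# Hypothetical cliques: a planar wavefront and four mutually opposing directions.
planar = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0), (1.5, 0.0)]
disorganized = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
```

Because only directions enter the score, vector magnitudes (for example conduction velocity) do not affect it in this sketch.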