Dr. Alexander Marx
Alumni
 alexander.marx@inf.ethz.ch
 Address

Department of Computer Science
CAB G 15.2
Universitätstr. 6
CH – 8092 Zurich, Switzerland
In September 2021, I joined the Medical Data Science group led by Prof. Dr. Julia Vogt as a Post-Doctoral Fellow of the ETH AI Center, and am co-mentored by Peter Bühlmann. I am interested in conducting research related to causality, information theory, and multimodal learning to approach modern challenges in the medical and biological domain. I obtained my Master’s degree in Bioinformatics in 2016 and completed my PhD in Computer Science in 2021 at Saarland University, where I was affiliated with the CISPA Helmholtz Center for Information Security and the Max Planck Institute for Informatics. During my PhD, I did a three-month research visit at the University of Amsterdam in 2019.
You can find a portrait article about me and my research here. For more information on current projects, please visit my personal website.
Publications
Background: The overarching goal of blood glucose forecasting is to assist individuals with type 1 diabetes (T1D) in avoiding hyper- or hypoglycemic conditions. While deep learning approaches have shown promising results for blood glucose forecasting in adults with T1D, it is not known if these results generalize to children. Possible reasons are physical activity (PA), which is often unplanned in children, as well as age and development of a child, which both have an effect on the blood glucose level. Materials and Methods: In this study, we collected time series measurements of glucose levels, carbohydrate intake, insulin dosing and physical activity from children with T1D for one week in an ethics-approved prospective observational study, which included daily physical activities. We investigate the performance of state-of-the-art deep learning methods for adult data, namely (dilated) recurrent neural networks and a transformer, on our dataset for short-term (30 min) and long-term (2 h) prediction. We propose to integrate static patient characteristics, such as age, gender, BMI, and percentage of basal insulin, to account for the heterogeneity of our study group. Results: Integrating static patient characteristics (SPC) proves beneficial, especially for short-term prediction. LSTMs and GRUs with SPC perform best for a prediction horizon of 30 min (RMSE of 1.66 mmol/l), a vanilla RNN with SPC performs best across different prediction horizons, while the performance significantly decays for long-term prediction. For prediction during the night, the best method improves to an RMSE of 1.50 mmol/l. Overall, the results for our baselines and RNN models indicate that blood glucose forecasting for children conducting regular physical activity is more challenging than for previously studied adult data.
Conclusion: We find that integrating static data improves the performance of deep learning architectures for blood glucose forecasting of children with T1D and achieves promising results for short-term prediction. Despite these improvements, additional clinical studies are warranted to extend forecasting to longer-term prediction horizons.
Authors: Alexander Marx, Francesco Di Stefano, Heike Leutheuser, Kieran Chin-Cheong, Marc Pfister, Marie-Anne Burckhardt, Sara Bachmann^{†}, Julia E. Vogt^{†}
^{†} denotes shared last authorship
Frontiers in Pediatrics
Date: 14.12.2023
Contrastive learning is a cornerstone underlying recent progress in multi-view and multimodal learning, e.g., in representation learning with image/caption pairs. While its effectiveness is not yet fully understood, a line of recent work reveals that contrastive learning can invert the data generating process and recover ground truth latent factors shared between views. In this work, we present new identifiability results for multimodal contrastive learning, showing that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously. Specifically, we distinguish between the multi-view setting with one generative mechanism (e.g., multiple cameras of the same type) and the multimodal setting that is characterized by distinct mechanisms (e.g., cameras and microphones). Our work generalizes previous identifiability results by redefining the generative process in terms of distinct mechanisms with modality-specific latent variables. We prove that contrastive learning can block-identify latent factors shared between modalities, even when there are non-trivial dependencies between factors. We empirically verify our identifiability results with numerical simulations and corroborate our findings on a complex multimodal dataset of image/text pairs. Zooming out, our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
Authors: Imant Daunhawer, Alice Bizeul, Emanuele Palumbo, Alexander Marx, Julia E. Vogt
Submitted: The Eleventh International Conference on Learning Representations, ICLR 2023
Date: 23.03.2023
We study the problem of identifying cause and effect over two univariate continuous variables X and Y from a sample of their joint distribution. Our focus lies on the setting when the variance of the noise may be dependent on the cause. We propose to partition the domain of the cause into multiple segments where the noise indeed is dependent. To this end, we minimize a scale-invariant, penalized regression score, finding the optimal partitioning using dynamic programming. We show under which conditions this allows us to identify the causal direction for the linear setting with heteroscedastic noise, for the nonlinear setting with homoscedastic noise, as well as empirically confirm that these results generalize to the nonlinear and heteroscedastic case. Altogether, the ability to model heteroscedasticity translates into an improved performance in telling cause from effect on a wide range of synthetic and real-world datasets.
Authors: Sascha Xu, Osman A Mian, Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the 39th International Conference on Machine Learning, ICML 2022
Date: 28.06.2022
The algorithmic independence of conditionals, which postulates that the causal mechanism is algorithmically independent of the cause, has recently inspired many highly successful approaches to distinguish cause from effect given only observational data. Most popular among these is the idea to approximate algorithmic independence via two-part Minimum Description Length (MDL). Although intuitively sensible, the link between the original postulate and practical two-part MDL encodings is left vague. In this work, we close this gap by deriving a two-part formulation of this postulate, in terms of Kolmogorov complexity, which directly links to practical MDL encodings. To close the cycle, we prove that this formulation leads in expectation to the same inference result as the original postulate.
Authors: Alexander Marx, Jilles Vreeken
Submitted: AAAI'22 Workshop on Information-Theoretic Methods for Causal Inference and Discovery (ITCI’22)
Date: 05.05.2022
We study the problem of identifying the cause and the effect between two univariate continuous variables X and Y. The examined data is purely observational, hence it is required to make assumptions about the underlying model. Often, the independence of the noise from the cause is assumed, which is not always the case for real-world data. In view of this, we present a new method, which explicitly models heteroscedastic noise. With our HEC algorithm, we can find the optimal model, regularized by an information-theoretic score. In thorough experiments we show that our ability to model heteroscedastic noise translates into a superior performance on a wide range of synthetic and real-world datasets.
Authors: Sascha Xu, Alexander Marx, Osman Mian, Jilles Vreeken
Submitted: AAAI'22 Workshop on Information-Theoretic Methods for Causal Inference and Discovery (ITCI’22)
Date: 05.05.2022
Estimating mutual information (MI) between two continuous random variables X and Y allows us to capture nonlinear dependencies between them, nonparametrically. As such, MI estimation lies at the core of many data science applications. Yet, robustly estimating MI for high-dimensional X and Y is still an open research question. In this paper, we formulate this problem through the lens of manifold learning. That is, we leverage the common assumption that the information of X and Y is captured by a low-dimensional manifold embedded in the observed high-dimensional space and transfer it to MI estimation. As an extension to state-of-the-art kNN estimators, we propose to determine the k-nearest neighbors via geodesic distances on this manifold rather than from the ambient space, which allows us to estimate MI even in the high-dimensional setting. An empirical evaluation of our method, G-KSG, against the state of the art shows that it yields good estimations of MI in classical benchmark and manifold tasks, even for high-dimensional datasets, which none of the existing methods can provide.
Authors: Alexander Marx, Jonas Fischer
Submitted: Proceedings of the SIAM International Conference on Data Mining, SDM 2022
Date: 30.04.2022
Estimating conditional mutual information (CMI) is an essential yet challenging step in many machine learning and data mining tasks. Estimating CMI from data that contains both discrete and continuous variables, or even discrete-continuous mixture variables, is a particularly hard problem. In this paper, we show that CMI for such mixture variables, defined based on the Radon-Nikodym derivative, can be written as a sum of entropies, just like CMI for purely discrete or continuous data. Further, we show that CMI can be consistently estimated for discrete-continuous mixture variables by learning an adaptive histogram model. In practice, we estimate such a model by iteratively discretizing the continuous data points in the mixture variables. To evaluate the performance of our estimator, we benchmark it against state-of-the-art CMI estimators as well as evaluate it in a causal discovery setting.
Authors: Alexander Marx, Lincen Yang, Matthijs van Leeuwen
Submitted: Proceedings of the SIAM International Conference on Data Mining, SDM 2021
Date: 21.10.2021
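The "sum of entropies" form referred to in the abstract above is, for purely discrete data, the standard identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The following minimal sketch illustrates only that discrete special case with a plug-in (empirical frequency) estimator; it is not the adaptive histogram estimator of the paper, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in entropy in bits of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    h_xz = entropy(list(zip(xs, zs)))
    h_yz = entropy(list(zip(ys, zs)))
    h_xyz = entropy(list(zip(xs, ys, zs)))
    h_z = entropy(list(zs))
    return h_xz + h_yz - h_xyz - h_z


# Illustrative data (an assumption, not from the paper):
# Y is a copy of X, and Z is independent of both.
xs = [0, 1, 0, 1]
ys = [0, 1, 0, 1]
zs = [0, 0, 1, 1]
cmi_copy = conditional_mutual_information(xs, ys, zs)
```

Here `cmi_copy` evaluates to one bit, since Y fully determines X regardless of Z. Plug-in estimates of this kind are biased on small samples, which is one reason dedicated estimators are studied.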
Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated with genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in super-enhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches, suggesting that a large part of the regulome might be uncharted water.
Authors: Florian Schmidt, Alexander Marx, Nina Baumgarten, Marie Hebel, Martin Wegner, Manuel Kaulich, Matthias S Leisegang, Ralf P Brandes, Jonathan Göke, Jilles Vreeken, Marcel H Schulz
Submitted: Nucleic Acids Research
Date: 01.09.2021
One of the core assumptions in causal discovery is the faithfulness assumption, i.e., assuming that independencies found in the data are due to separations in the true causal graph. This assumption can, however, be violated in many ways, including xor connections, deterministic functions or cancelling paths. In this work, we propose a weaker assumption that we call 2-adjacency faithfulness. In contrast to adjacency faithfulness, which assumes that there is no conditional independence between each pair of variables that are connected in the causal graph, we only require no conditional independence between a node and a subset of its Markov blanket that can contain up to two nodes. Equivalently, we adapt orientation faithfulness to this setting. We further propose a sound orientation rule for causal discovery that applies under weaker assumptions. As a proof of concept, we derive a modified Grow and Shrink algorithm that recovers the Markov blanket of a target node and prove its correctness under strictly weaker assumptions than the standard faithfulness assumption.
Authors: Alexander Marx, Arthur Gretton, Joris M. Mooij
Submitted: Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI 2021
Date: 01.08.2021
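The xor violation mentioned in the abstract above can be made concrete with a small enumeration. The sketch below assumes two independent fair coins X and Y with Z = X xor Y (a textbook construction, not taken from the paper) and uses a hypothetical helper `mutual_information`. It shows that Z is independent of X and of Y individually, yet fully determined by the pair, which is why tests on single parents miss the edges.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits, treating the listed (a, b) outcomes as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2(c * n / (marg_a[a] * marg_b[b]))
        for (a, b), c in joint.items()
    )


# All four equally likely outcomes of two fair coins, with Z = X xor Y.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

mi_x_z = mutual_information([(x, z) for x, y, z in outcomes])        # pairwise
mi_y_z = mutual_information([(y, z) for x, y, z in outcomes])        # pairwise
mi_xy_z = mutual_information([((x, y), z) for x, y, z in outcomes])  # joint
```

Both pairwise quantities are zero while the joint one is one bit, so a dependence only becomes visible when Z is tested against a set of two nodes.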
It is well-known that correlation does not equal causation, but how can we infer causal relations from data? Causal discovery tries to answer precisely this question by rigorously analyzing under which assumptions it is feasible to infer directed causal networks from passively collected, so-called observational data. Classical approaches assume the data to be faithful to the causal graph, that is, independencies found in the distribution are assumed to be due to separations in the true graph. Under this assumption, so-called constraint-based methods can infer the correct Markov equivalence class of the true graph (i.e. the correct undirected graph and some edge directions), only using conditional independence tests. In this dissertation, we aim to alleviate some of the weaknesses of constraint-based algorithms. In the first part, we investigate causal mechanisms, which cannot be detected when assuming faithfulness. We then suggest a weaker assumption based on triple interactions, which allows for recovering a broader spectrum of causal mechanisms. Subsequently, we focus on conditional independence testing, which is a crucial tool for causal discovery. In particular, we propose to measure dependencies through conditional mutual information, which we show can be consistently estimated even for the most general setup: discrete-continuous mixture random variables. Last, we focus on distinguishing Markov equivalent graphs (i.e. infer the complete DAG structure), which boils down to inferring the causal direction between two random variables. In this setting, we focus on continuous and mixed-type data and develop our methods based on an information-theoretic postulate, which states that the true causal graph can be compressed best, i.e. has the smallest Kolmogorov complexity.
Authors: Alexander Marx
Submitted: Saarländische Universitäts- und Landesbibliothek
Date: 06.07.2021
We study the problem of inferring causal graphs from observational data. We are particularly interested in discovering graphs where all edges are oriented, as opposed to the partially directed graph that state-of-the-art methods discover. To this end we base our approach on the algorithmic Markov condition. Unlike the statistical Markov condition, it uniquely identifies the true causal network as the one that provides the simplest (as measured in Kolmogorov complexity) factorization of the joint distribution. Although Kolmogorov complexity is not computable, we can approximate it from above via the Minimum Description Length principle, which allows us to define a consistent and computable score based on nonparametric multivariate regression. To efficiently discover causal networks in practice, we introduce the GLOBE algorithm, which greedily adds, removes, and orients edges such that it minimizes the overall cost. Through an extensive set of experiments we show GLOBE performs very well in practice, beating the state of the art by a margin.
Authors: Osman Mian, Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI 2021
Date: 01.05.2021
We consider the problem of inferring the causal direction between two univariate numeric random variables X and Y from observational data. This case is especially challenging as the graph X causes Y is Markov equivalent to the graph Y causes X, and hence it is impossible to determine the correct direction using conditional independence tests. To tackle this problem, we follow an information-theoretic approach based on the algorithmic Markov condition. This postulate states that in terms of Kolmogorov complexity the factorization given by the true causal model is the most succinct description of the joint distribution. This means that we can infer that X is a likely cause of Y when we need fewer bits to first transmit the data over X, and then the data of Y as a function of X, than for the inverse direction. That is, in this paper we perform causal inference by compression. To put this notion into practice, we employ the Minimum Description Length principle, and propose a score to determine how many bits we need to transmit the data using a class of regression functions that can model both local and global functional relations. To determine whether an inference, i.e. the difference in compressed sizes, is significant, we propose two analytical significance tests based on the no-hypercompression inequality. Last, but not least, we introduce the linear-time Slope and Sloper algorithms, which, as we show through thorough empirical evaluation, outperform the state of the art by a wide margin.
Authors: Alexander Marx, Jilles Vreeken
Submitted: Knowledge and Information Systems
Date: 01.09.2019
We consider the problem of telling apart cause from effect between two univariate continuous-valued random variables X and Y. In general, it is impossible to make definite statements about causality without making assumptions on the underlying model; one of the most important aspects of causal inference is hence to determine under which assumptions we are able to do so. In this paper we show under which general conditions we can identify cause from effect by simply choosing the direction with the best regression score. We define a general framework of identifiable regression-based scoring functions, and show how to instantiate it in practice using regression splines. Compared to existing methods that either give strong guarantees, but are hardly applicable in practice, or provide no guarantees, but do work well in practice, our instantiation combines the best of both worlds; it gives guarantees, while empirical evaluation on synthetic and real-world data shows that it performs at least as well as the state of the art.
Authors: Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD 2019
Date: 01.07.2019
Testing for conditional independence is a core aspect of constraint-based causal discovery. Although commonly used tests are perfect in theory, they often fail to reject independence in practice, especially when conditioning on multiple variables. We focus on discrete data and propose a new test based on the notion of algorithmic independence that we instantiate using stochastic complexity. Amongst others, we show that our proposed test, SCI, is an asymptotically unbiased as well as L2 consistent estimator for conditional mutual information (CMI). Further, we show that SCI can be reformulated to find a sensible threshold for CMI that works well on limited samples. Empirical evaluation shows that SCI has a lower type II error than commonly used tests. As a result, we obtain a higher recall when we use SCI in causal discovery algorithms, without compromising the precision.
Authors: Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS 2019
Date: 01.04.2019
How can we discover whether X causes Y, or vice versa, that Y causes X, when we are only given a sample over their joint distribution? How can we do this such that X and Y can be univariate, multivariate, or of different cardinalities? And, how can we do so regardless of whether X and Y are of the same, or of different data type, be it discrete, numeric, or mixed? These are exactly the questions we answer. We take an information-theoretic approach, based on the Minimum Description Length principle, from which it follows that first describing the data over cause and then that of effect given cause is shorter than the reverse direction. Simply put, if Y can be explained more succinctly by a set of classification or regression trees conditioned on X, than in the opposite direction, we conclude that X causes Y. Empirical evaluation on a wide range of data shows that our method, Crack, infers the correct causal direction reliably and with high accuracy on a wide range of settings, outperforming the state of the art by a wide margin. Code related to this paper is available at: http://eda.mmci.uni-saarland.de/crack.
Authors: Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data, ECML-PKDD 2018
Date: 13.08.2018
We consider the fundamental problem of inferring the causal direction between two univariate numeric random variables X and Y from observational data. The two-variable case is especially difficult to solve since it is not possible to use standard conditional independence tests between the variables. To tackle this problem, we follow an information-theoretic approach based on Kolmogorov complexity and use the Minimum Description Length (MDL) principle to provide a practical solution. In particular, we propose a compression scheme to encode local and global functional relations using MDL-based regression. We infer X causes Y in case it is shorter to describe Y as a function of X than the inverse direction. In addition, we introduce Slope, an efficient linear-time algorithm that, as we show through thorough empirical evaluation on both synthetic and real-world data, outperforms the state of the art by a wide margin.
Authors: Alexander Marx, Jilles Vreeken
Submitted: Proceedings of the IEEE International Conference on Data Mining, ICDM 2017
Date: 01.11.2017
In many research disciplines, hypothesis tests are applied to evaluate whether findings are statistically significant or could be explained by chance. The Wilcoxon–Mann–Whitney (WMW) test is among the most popular hypothesis tests in medicine and life science to analyze if two groups of samples are equally distributed. This nonparametric statistical homogeneity test is commonly applied in molecular diagnosis. Generally, the solution of the WMW test takes a high combinatorial effort for large sample cohorts containing a significant number of ties. Hence, the P value is frequently approximated by a normal distribution. We developed EDISON-WMW, a new approach to calculate the exact permutation of the two-tailed unpaired WMW test without any corrections required and allowing for ties. The method relies on dynamic programming to solve the combinatorial problem of the WMW test efficiently. Beyond a straightforward implementation of the algorithm, we presented different optimization strategies and developed a parallel solution. Using our program, the exact P value for large cohorts containing more than 1000 samples with ties can be calculated within minutes. We demonstrate the performance of this novel approach on randomly generated data, benchmark it against 13 other commonly applied approaches and moreover evaluate molecular biomarkers for lung carcinoma and chronic obstructive pulmonary disease (COPD). We found that approximated P values were generally higher than the exact solution provided by EDISON-WMW. Importantly, the algorithm can also be applied to high-throughput omics datasets, where hundreds or thousands of features are included. To provide easy access to the multithreaded version of EDISON-WMW, a web-based solution of our algorithm is freely available at http://www.ccb.uni-saarland.de/software/wtest/.
Authors: Alexander Marx, Christina Backes, Eckart Meese, Hans-Peter Lenhof, Andreas Keller
Submitted: Genomics, Proteomics & Bioinformatics
Date: 01.01.2016
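To give a feel for how dynamic programming can yield an exact permutation P value with ties, as described in the abstract above, here is a minimal sketch. It is a generic subset-sum-style count over midranks, not the EDISON-WMW implementation, and it includes none of the optimizations or parallelization the paper describes. The function name `exact_wmw_pvalue` and the choice of "deviation from the expected rank sum" as the two-sided criterion are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def exact_wmw_pvalue(x, y):
    """Exact two-sided permutation P value of the rank-sum statistic, ties allowed."""
    pooled = sorted(list(x) + list(y))
    n1, n = len(x), len(pooled)
    # Doubled midranks keep all sums integer: a tie group occupying
    # 1-based positions a..b has midrank (a + b) / 2, so its doubled rank is a + b.
    doubled_rank = {}
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1] == pooled[start]:
            end += 1
        doubled_rank[pooled[start]] = (start + 1) + (end + 1)
        start = end + 1
    ranks = [doubled_rank[v] for v in pooled]
    observed = sum(doubled_rank[v] for v in x)
    # counts[k][s]: number of ways to choose k pooled items with doubled rank sum s.
    counts = [dict() for _ in range(n1 + 1)]
    counts[0][0] = 1
    for r in ranks:
        for k in range(n1, 0, -1):  # descending k so each item is used at most once
            for s, c in counts[k - 1].items():
                counts[k][s + r] = counts[k].get(s + r, 0) + c
    center = n1 * (n + 1)  # doubled expected rank sum under the null hypothesis
    deviation = abs(observed - center)
    extreme = sum(c for s, c in counts[n1].items() if abs(s - center) >= deviation)
    return Fraction(extreme, comb(n, n1))


p_separated = exact_wmw_pvalue([1, 2, 3], [4, 5, 6])
```

For the perfectly separated toy samples above, `p_separated` is `Fraction(1, 10)`: only 2 of the 20 possible group assignments are at least as extreme as the observed one. The table of counts grows with the sample sizes and the range of rank sums, so this sketch is only practical for small cohorts.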