Post-Doctoral position : « Causal structure learning and statistics: computational methods and applications »

Causal structure learning and statistics: computational methods and applications

Date limite de candidature : 20/03/2024
Date de début : 01/06/2024
Date de fin : 01/06/2025

Pôle : Signaux et statistiques
Type de poste : Post-Doc ou ATER
Contact : M.N. EL KORSO (

la fiche

Causal structure learning and statistics: computational methods and applications
AAP Postdoc L2S 2024
March 13, 2024


Supervision Team: M. N. El Korso, A. Tenenhaus, L. Le Brusquet

1 Introduction
Discovering causal relations from observational data emerges as an important problem for
artificial intelligence [2] with fundamental and practical motivations. A notable reason
is that causal models support modes of reasoning, e.g., counterfactual reasoning and
algorithmic recourse [1] that are otherwise out of reach by correlation-based machine
learning, as shown by [3]. One of the most fundamental causal models, appearing in
social sciences and econometrics, is called the linear structural equation model (linear
SEM) [4]. A linear SEM formulates the variations of a d-dimensional random variable
X = (X1, . . . , Xd) by the following linear system:
X1 = β2,1X2 + · · · + βd,1Xd + ϵ1
X2 = β1,2X1 + β3,2X3 + · · · + βd,2Xd + ϵ2
… …
Xd = β1,dX1 + · · · + βd−1,dXd−1 + ϵd,0

where βi,j are explanatory coefficients and {ϵi}i=1,…,d are random noise variables. Basic
assumptions about this model are: (i) the random noises {ϵi}i=1,…,d are independent of
X and are mutually independent, and (ii) the coefficients βij form a matrix B such that
Bij = βi,j and that B is the adjacency matrix of a directed acyclic graph (DAG) G, as
illustrated in Figure 1. The system above can be rewritten as X = BTX + N where
N = (ϵ1, . . . , ϵd). Due to the absence of cycles in the graph of B, the linear SEM offers a
causal explanation to the data generating process of X.

The linear SEM is a special case of a more generic framework called structural causal
model (SCM) where the variables are assumed to satisfy the following properties:
X1 = f1(XPa1(G), ϵ1), . . . , Xd = fd(XPad(G), ϵd),
where XPai(G) denotes the variables that belong to the parent set—i.e., the set of direct
causes—of Xi with respect to the DAG G, and {fi}i=i,…,d are multivariate functions
allowing for more flexible (linear or nonlinear) modeling of the data variations.
Causal structure learning is the problem of learning the DAG G (or its adjacency
matrix B) of a causal model. Given observations of the random variable X, causal struc-
ture learning involves inferring all possible cause-effect explanations among the variables,
while the graph of all such cause-effect directions must be acyclic. Causal structure learn-
ing faces challenges in statistical aspects when the data is limited, and at the same time
it is also a highly complex problem due to the DAG constraint. The number of DAGs
increases super-exponentially with respect to the number d of variables. Learning a causal
DAG is shown to be NP-hard [5] even in the large sample limit.
A usual strategy to overcome computational difficulties is to restrict the search space
of DAGs [6]. In the linear SEM, [7] show that the moral graph of the DAG coincides with
the support graph of the inverse covariance matrix under a mild faithfulness assumption.
This result entails a reduced search of DAGs within the support of the inverse covariance
matrix with respect to the log-likelihood score function involving the diagonal of noise
variances Ω. Furthermore, when the true Ω is provided, the score function admits the
true DAG as the unique minimum [7]. Note however that even when Ω is known, the
actual minimum of score function is unknown, requiring a thorough search.
Another strategy for causal discovery is to formulate a continuous optimization prob-
lem based on a differentiable or subdifferentiable (e.g., ℓ1-penalty) score function [8].
In this line of work, the optimization of the causal model, defined on the set of real
square matrices, is subject to a differentiable DAG constraint [8], or alternatively, a
sparsity-promoting constraint on the sought matrix [9]. The computational cost of these
optimization approaches, different from combinatorial methods, depends mainly on the
gradient computations which scale well enough; but the overall complexity may still be
high due to the nonconvex optimization landscape (see e.g., [8]).
Due to the above difficulties and the limitations in the current algorithms, apply-
ing causal modeling to real-world data still faces computational burdens that are often
prohibitive on large-scale problems. The postdoc will intend to address this issue by im-
proving two fundamental aspects—scalability and statistical performance. The work
plans are outlined in Section 1.2 and are detailed in Section 2. These work plans also
answer to the ever growing demand for eco-responsible learning systems.

1.1 Applications
Causal modeling is not only a research topic on its own but it also plays a crucial role in
various related disciplines including machine learning, signal processing, automatic con-
trol and artificial intelligence. Therefore, causal modeling and causal structure learning
have become an increasingly important topic in these areas. Below are some examples: (i)
Machine learning, deep learning and predictive modeling [10]; (ii) Biomedical signal pro-
cessing [11]; (iii) Time series analysis [12]; (iv) Image processing and computer vision [13];
(v) Signal restoration [14].

1.2 Challenges and objectives
Challenge 1: Scalability. Real world systems usually have hundreds or thousands of
variables, each of which may be causally connected to dozens of other variables. Models
of functional brain imaging (fMRI), functional genomics, electronic health records, and
financial systems are a few examples. Therefore, in order for causal structure learning
methods to have an impact on these real world problems, they need to be not only highly
accurate but also scalable in the number of variables and in the node degree of causal
graph. The challenge is in making causal structure learning methods more scalable is
largely due to the acyclicity constraint on the causal graph G.
Challenge 2: Statistical performance. In high-dimensional statistics, a common
challenge arises when dealing with datasets where the number of variables (features) is
much larger than the number of available data samples. This scenario is often referred to as
the “large p, small n” problem. This challenge is prevalent in many of the aforementioned
application areas, including genomics and bioinformatics, fMRI, financial time series,
remote sensing and imaging.
Objectives. The primary focus of this postdoc position is to conceive more efficient and
scalable methods for causal structure learning. The objectives of this research project are
defined to address the aforementioned challenges in causal structure learning and causal
modeling, namely scalability (“large p”, for large number of variables) and statistical
performance (“small n”, for insufficient data).
In this project, we will tackle two research themes: (i) matrix and tensor decompo-
sition models for learning from correlations and high-order statistics, and (ii) studies on
statistical performance and sample-efficient algorithms.

2 Detailed description
Learning causal structures from observational data is an optimization problem where
the objective function is defined on a discrete space of a certain causal graphical model.
Among the various different graphical models for causal structures, a most important
feature is that the graph must be directed and acyclic. Therefore, the optimization of a
causal graphical model is subject to an acyclicity constraint in a discrete search space.
Matrix and tensor decomposition models. In multivariate data analysis, signal
processing, medical and social sciences, the computation of correlations is an important
tool for capturing and analysing the interactions among the system variables. However,
the generating system of the observational data is usually associated with a causal mecha-
nism (which is no longer symmetric but directional and acyclic). Therefore the inference of
causal relations from the observed correlations is another fundamental challenge for causal
structure learning. In this line of research, matrix decomposition-based methods include
DAG enumeration [7], Cholesky-like decomposition given a certain causal ordering [15],
search of sparsest decompositions of the precision matrix [16], ICA-based approaches [17]
and mixed-graphical extensions [18].
Taking these previous works into account, one may summarize the primary question as
follows: how to learn causal relations through the correlations and higher-order statistics
of multivariate data?
An immediate work plan is to enhance the constrained optimization algorithm (an
augmented Lagrangian method) for the sparse decomposition method named ICID [19].
Given the inverse covariance matrix Θ of X, the ICID is formulated as follows, for a fixed
diagonal matrix D ≻ 0,
B∈Rd×d, diag(B)=0
subject to (I − B)D(I − B)T = Θ
Supp(B) ⊂ Supp(Θ),
where Supp(·) denotes the support graph, i.e., the graph encoded by the nonzero entries
of a matrix. The above support-preserving constraint enables ICID to have significantly
reduced computational cost than dense matrix methods. At the same time, the solution
set of ICID is guaranteed to include the true DAG of the linear SEM, under a mild
faithfulness assumption [7], as illustrated in Figure 2.0

Figure 2: The support of B is guaranteed to be included in Supp(Θ) under mild assump-
tion [7].
A second work plan is to identify the hardness of the optimization problem underlying
ICID. It is worth mentioning that the work of the postdoc candidate shows that the
performance of ICID is still sensitive to the accurateness of the precision matrix estimators.
On the other hand, ICID also entails enhanced results when a regularization term using
skewness-based statistics [20] is added (see [19] and the activities report (Fig. 4)),
showing promising signs for investigating in this direction.
To go beyond linear SEMs and extend a matrix or tensor decomposition method to
more complex causal models, we may consider a path through the theoretical framework
of kernel methods (e.g., [21]). Similar to the extension of covariance matrices to nonlinear
kernels, a matrix or tensor decomposition model has the potential of being used in the
context of nonlinear causal models.
Statistical performance and sample-efficient algorithms. In high-dimensional
statistics, a common challenge arises when dealing with datasets where the number of
variables (features) is much larger than the number of available data samples. This sce-
nario is often referred to as the “large p, small n” problem.
In the literature on structure learning and more general structure learning, a significant
open question is the optimal sample complexity of learning a directed acyclic graphical
model. Previous work in the study of upper bounds on the sample complexity include [22].
Despite these upper bounds, a tight characterization of the optimal sample complexity is
missing. This is to be contrasted with the situation for learning undirected graphs, also
known as Markov random fields, for which optimal rates were established in [23], alongside
similar results for support recovery in linear models [24]. This is in fact unsurprising given
the connection between these two problems via neighbourhood regression. Unfortunately,
learning a directed acyclic graph (DAG) does not reduce to neighbourhood regression as
it involves a more difficult constraint for acyclicity.
Gao et al. (2022) [25] introduced an algorithm that combines causal order recovery and
causal structure learning, and showed its optimality in sample complexity. However, their
analysis and result is restricted to the linear SEM with one particularly strong assumption,
that is, the noise variables of the SEM are assumed to have equal variances. For more
general cases without the equal noise-variance assumption, the sample complexity and
even the identifiability of the DAG remain a question.
In this context, we will investigate more refined sparsity-pursuing methods such as
structured sparsity regularization [26]. For causal structure learning, a key challenge
is to adapt the structured sparsity methodology to learning DAGs, while the typical
scenario in the previous work is the learning of undirected graphs (e.g., group sparsity
for graph LASSO [27]). The postdoc will tackle the above questions through lens of his
sparse matrix decomposition approach (ICID) [19] and a related work on sparse-ICA [28].
Both methods involves an ℓ1 norm term for finding sparse graphs according to a specific
matrix decomposition model. Their learning accuracies depend on the accurateness of
either the precision matrix or the covariance matrix, which all rely on a sufficient number
of samples.

[1] Stratis Tsirtsis, Amir-Hossein Karimi, Ana Lucic, Manuel Gomez-Rodriguez, Isabel Valera, and Hima Lakkaraju.
ICML workshop on algorithmic recourse. 2021.
[2] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
[3] Jonas Peters, Peter B¨uhlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification
and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012,
[4] Otis Dudley Duncan. Introduction to structural equation models. Elsevier, 2014.
[5] David Maxwell Chickering. Learning bayesian networks is NP-complete. In Learning from data, pages 121–130.
Springer, 1996.
[6] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT
press, 2000.
[7] Po-Ling Loh and Peter B¨uhlmann. High-dimensional learning of linear causal networks via inverse covariance estima-
tion. The Journal of Machine Learning Research, 15(1):3065–3105, 2014.
[8] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization
for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018.
[9] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear
DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
[10] Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models.
Journal of Causal Inference, 6(2):20170016, 2018.
[11] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables.
Journal of the American Statistical Association, 91(434):444–455, 1996.
[12] Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets.
In Jonas Peters and David Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence
(UAI), volume 124 of Proceedings of Machine Learning Research, pages 1388–1397. PMLR, 03–06 Aug 2020.
[13] Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging. Nature Communications,
11(1):3673, 2020.
[14] Peng Ding and Fan Li. Causal inference: A missing data perspective. Statistical Science, 33(2):214–237, 2018.
[15] Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity.
In International Conference on Artificial Intelligence and Statistics, pages 1466–1475. PMLR, 2018.
[16] Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat,
7(1):e183, 2018.
[17] Shohei Shimizu, Patrik O Hoyer, Aapo Hyv¨arinen, and Antti Kerminen. A linear non-Gaussian acyclic model for
causal discovery. Journal of Machine Learning Research, 7(10), 2006.
[18] Yiheng Liu, Elina Robeva, and Huanqing Wang. Learning linear non-gaussian graphical models with multidirected
edges. Journal of Causal Inference, 9(1):250–263, 2021.
[19] Shuyu Dong, Kento Uemura, Akito Fujii, Shuang Chang, Yusuke Koyanagi, Koji Maruhashi, and Mich`ele Se-
bag. Learning large causal structures from inverse covariance matrix via matrix decomposition. arXiv preprint
arXiv:2211.14221, 2023.
[20] Aapo Hyv¨arinen and Stephen M Smith. Pairwise likelihood ratios for estimation of non-gaussian structural equation
models. The Journal of Machine Learning Research, 14(1):111–152, 2013.
[21] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Sch¨olkopf, et al. Kernel mean embedding of
distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
[22] Or Zuk, Shiri Margel, and Eytan Domany. On the number of samples needed to learn the correct structure of a
bayesian network. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages
560–567, 2006.
[23] Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models
in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.
[24] Martin J Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting.
IEEE transactions on information theory, 55(12):5728–5741, 2009.
[25] Ming Gao, Wai Ming Tai, and Bryon Aragam. Optimal estimation of gaussian dag models. In International Conference
on Artificial Intelligence and Statistics, pages 8738–8757. Probab.
[26] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.
[27] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group LASSO with overlap and graph LASSO. In
Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.
[28] Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Kun Zhang. On the identifiability of sparse ICA without assuming
non-Gaussianity. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.