Causal structure learning and statistics: computational methods and applications

AAP Postdoc L2S 2024

March 13, 2024

Supervision Team: M. N. El Korso, A. Tenenhaus, L. Le Brusquet

1 Introduction

Discovering causal relations from observational data emerges as an important problem for

artificial intelligence [2] with fundamental and practical motivations. A notable reason

is that causal models support modes of reasoning, e.g., counterfactual reasoning and

algorithmic recourse [1], that are otherwise out of reach for correlation-based machine

learning, as shown by [3]. One of the most fundamental causal models, appearing in

social sciences and econometrics, is called the linear structural equation model (linear

SEM) [4]. A linear SEM formulates the variations of a d-dimensional random variable

X = (X1, . . . , Xd) by the following linear system:

X1 = β2,1X2 + · · · + βd,1Xd + ϵ1
X2 = β1,2X1 + β3,2X3 + · · · + βd,2Xd + ϵ2
⋮
Xd = β1,dX1 + · · · + βd−1,dXd−1 + ϵd,

where βi,j are explanatory coefficients and {ϵi}i=1,…,d are random noise variables. Basic

assumptions about this model are: (i) the random noises {ϵi}i=1,…,d are independent of

X and are mutually independent, and (ii) the coefficients βij form a matrix B such that

Bij = βi,j and that B is the adjacency matrix of a directed acyclic graph (DAG) G, as

illustrated in Figure 1. The system above can be rewritten as X = BᵀX + N, where

N = (ϵ1, . . . , ϵd). Due to the absence of cycles in the graph of B, the linear SEM offers a

causal explanation to the data generating process of X.
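As a concrete illustration, a linear SEM can be simulated by solving X = BᵀX + N for X. The sketch below uses a hypothetical 3-node DAG with arbitrary coefficients; it is illustrative only, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500

# Hypothetical DAG on 3 nodes: X1 -> X2, X1 -> X3, X2 -> X3,
# encoded by a strictly upper-triangular weight matrix B (hence acyclic).
B = np.array([[0.0, 0.8,  0.5],
              [0.0, 0.0, -1.2],
              [0.0, 0.0,  0.0]])

# Independent noises; X = B^T X + N  =>  X = (I - B^T)^{-1} N,
# which in row-vector form is X = N (I - B)^{-1}.
N = rng.standard_normal((n, d))
X = N @ np.linalg.inv(np.eye(d) - B)

# By construction, X (I - B) recovers the noise matrix N exactly.
```

Because B is strictly triangular in a topological order, the solve always exists and the samples admit the causal reading described above.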

The linear SEM is a special case of a more generic framework called structural causal

model (SCM) where the variables are assumed to satisfy the following properties:

X1 = f1(XPa1(G), ϵ1), . . . , Xd = fd(XPad(G), ϵd),

where XPai(G) denotes the variables that belong to the parent set—i.e., the set of direct

causes—of Xi with respect to the DAG G, and {fi}i=1,…,d are multivariate functions

allowing for more flexible (linear or nonlinear) modeling of the data variations.
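To make the SCM definition concrete, such a model can be sampled by evaluating each fi in a topological order of G (ancestral sampling). The structural functions below are hypothetical choices made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n):
    """Ancestral sampling of a hypothetical 3-node SCM:
    X1 -> X2 and (X1, X2) -> X3, evaluated in topological order."""
    eps = rng.standard_normal((n, 3))
    x1 = eps[:, 0]                       # f1: root node, X1 = eps1
    x2 = np.tanh(2.0 * x1) + eps[:, 1]   # f2: nonlinear in its parent X1
    x3 = x1 * x2 + 0.5 * eps[:, 2]       # f3: interaction of both parents
    return np.column_stack([x1, x2, x3])

X = sample_scm(1000)
```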

Causal structure learning is the problem of learning the DAG G (or its adjacency

matrix B) of a causal model. Given observations of the random variable X, causal struc-

ture learning involves inferring all possible cause-effect explanations among the variables,

while the graph of all such cause-effect directions must be acyclic. Causal structure learning faces statistical challenges when data is limited and, at the same time, is computationally hard due to the DAG constraint. The number of DAGs

increases super-exponentially with respect to the number d of variables. Learning a causal

DAG is shown to be NP-hard [5] even in the large sample limit.
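The super-exponential growth can be made explicit with Robinson's recurrence for the number of labeled DAGs on n nodes, a standard combinatorial identity:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:
    a(n) = sum_{k=1}^{n} (-1)^(k+1) C(n, k) 2^(k(n-k)) a(n-k), a(0) = 1."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

counts = [num_dags(n) for n in range(1, 6)]   # 1, 3, 25, 543, 29281
```

Already at d = 5 there are 29,281 DAGs, and the count exceeds 10^18 well before d = 10, ruling out exhaustive search.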

A usual strategy to overcome computational difficulties is to restrict the search space

of DAGs [6]. In the linear SEM, [7] show that the moral graph of the DAG coincides with

the support graph of the inverse covariance matrix under a mild faithfulness assumption.

This result allows the search for DAGs to be restricted to the support of the inverse covariance

matrix, with respect to a log-likelihood score function involving the diagonal matrix of noise

variances Ω. Furthermore, when the true Ω is provided, the score function admits the

true DAG as the unique minimum [7]. Note however that even when Ω is known, the

actual minimizer of the score function is unknown, requiring a thorough search.
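The result of [7] can be verified numerically on a toy linear SEM: with precision matrix Θ = (I − B)Ω⁻¹(I − B)ᵀ, the support of Θ is the moral graph, which adds an edge between parents of a common child. The v-structure below is a hypothetical example.

```python
import numpy as np

# Hypothetical v-structure X1 -> X3 <- X2 with unit noise variances.
d = 3
B = np.zeros((d, d))
B[0, 2] = 0.9    # X1 -> X3
B[1, 2] = -0.7   # X2 -> X3
Omega = np.eye(d)                      # diagonal matrix of noise variances

I = np.eye(d)
Theta = (I - B) @ np.linalg.inv(Omega) @ (I - B).T   # precision matrix of X

# Supp(Theta) is the moral graph: the DAG skeleton plus an edge X1 - X2
# "marrying" the two parents of their common child X3.
moral = (np.abs(Theta) > 1e-12) & ~np.eye(d, dtype=bool)
```

In particular, the support of B is contained in Supp(Θ), which is the reduction of the search space exploited below.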

Another strategy for causal discovery is to formulate a continuous optimization prob-

lem based on a differentiable or subdifferentiable (e.g., ℓ1-penalty) score function [8].

In this line of work, the optimization of the causal model, defined on the set of real

square matrices, is subject to a differentiable DAG constraint [8], or alternatively, a

sparsity-promoting constraint on the sought matrix [9]. The computational cost of these

optimization approaches, unlike that of combinatorial methods, depends mainly on gradient computations, which scale reasonably well; the overall complexity may nevertheless remain high due to the nonconvex optimization landscape (see, e.g., [8]).
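For reference, the differentiable DAG constraint of [8] is h(B) = tr(exp(B ∘ B)) − d, which vanishes exactly on weighted DAGs. A minimal sketch, using the identity tr(e^M) = Σᵢ e^{λᵢ(M)} instead of a matrix-exponential routine:

```python
import numpy as np

def h(B):
    """Differentiable acyclicity measure of [8]: h(B) = tr(exp(B ∘ B)) - d,
    computed via tr(e^M) = sum_i e^{lambda_i(M)}.
    h(B) = 0 iff the weighted graph of B is acyclic."""
    lam = np.linalg.eigvals(B * B)     # B * B is the elementwise square
    return float(np.sum(np.exp(lam)).real) - B.shape[0]

B_dag = np.array([[0.0, 1.5],
                  [0.0, 0.0]])        # acyclic: X1 -> X2
B_cyc = np.array([[0.0, 1.5],
                  [0.5, 0.0]])        # cyclic: X1 <-> X2
# h(B_dag) is (numerically) zero, while h(B_cyc) is strictly positive.
```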

Due to the above difficulties and the limitations in the current algorithms, apply-

ing causal modeling to real-world data still faces computational burdens that are often

prohibitive on large-scale problems. The postdoc will address this issue by improving two fundamental aspects—scalability and statistical performance. The work plans are outlined in Section 1.2 and detailed in Section 2. These work plans also respond to the ever-growing demand for eco-responsible learning systems.

1.1 Applications

Causal modeling is not only a research topic in its own right but also plays a crucial role in

various related disciplines including machine learning, signal processing, automatic con-

trol and artificial intelligence. Therefore, causal modeling and causal structure learning

have become an increasingly important topic in these areas. Below are some examples: (i)

Machine learning, deep learning and predictive modeling [10]; (ii) Biomedical signal pro-

cessing [11]; (iii) Time series analysis [12]; (iv) Image processing and computer vision [13];

(v) Signal restoration [14].

1.2 Challenges and objectives

Challenge 1: Scalability. Real world systems usually have hundreds or thousands of

variables, each of which may be causally connected to dozens of other variables. Models

of functional brain imaging (fMRI), functional genomics, electronic health records, and

financial systems are a few examples. Therefore, in order for causal structure learning

methods to have an impact on these real world problems, they need to be not only highly

accurate but also scalable in the number of variables and in the node degree of the causal graph. The difficulty in making causal structure learning methods more scalable stems largely from the acyclicity constraint on the causal graph G.

Challenge 2: Statistical performance. In high-dimensional statistics, a common

challenge arises when dealing with datasets where the number of variables (features) is

much larger than the number of available data samples. This scenario is often referred to as

the “large p, small n” problem. This challenge is prevalent in many of the aforementioned

application areas, including genomics and bioinformatics, fMRI, financial time series,

remote sensing and imaging.

Objectives. The primary focus of this postdoc position is to conceive more efficient and

scalable methods for causal structure learning. The objectives of this research project are

defined to address the aforementioned challenges in causal structure learning and causal

modeling, namely scalability (“large p”, for large number of variables) and statistical

performance (“small n”, for insufficient data).

In this project, we will tackle two research themes: (i) matrix and tensor decompo-

sition models for learning from correlations and high-order statistics, and (ii) studies on

statistical performance and sample-efficient algorithms.

2 Detailed description

Learning causal structures from observational data is an optimization problem where

the objective function is defined on a discrete space of a certain causal graphical model.

Among the various graphical models for causal structures, the defining

feature is that the graph must be directed and acyclic. Therefore, the optimization of a

causal graphical model is subject to an acyclicity constraint in a discrete search space.

Matrix and tensor decomposition models. In multivariate data analysis, signal

processing, medical and social sciences, the computation of correlations is an important

tool for capturing and analysing the interactions among the system variables. However,

the generating system of the observational data is usually associated with a causal mechanism (which is no longer symmetric but directed and acyclic). Therefore, the inference of

causal relations from the observed correlations is another fundamental challenge for causal

structure learning. In this line of research, matrix decomposition-based methods include

DAG enumeration [7], Cholesky-like decomposition given a certain causal ordering [15],

search of sparsest decompositions of the precision matrix [16], ICA-based approaches [17]

and mixed-graphical extensions [18].

Taking these previous works into account, one may summarize the primary question as

follows: how to learn causal relations through the correlations and higher-order statistics

of multivariate data?

An immediate work plan is to enhance the constrained optimization algorithm (an

augmented Lagrangian method) for the sparse decomposition method named ICID [19].

Given the inverse covariance matrix Θ of X, ICID is formulated as follows, for a fixed

diagonal matrix D ≻ 0,

minimize_{B ∈ R^{d×d}, diag(B) = 0}   ∥B∥ℓ1
subject to   (I − B)D(I − B)ᵀ = Θ,
             Supp(B) ⊂ Supp(Θ),

where Supp(·) denotes the support graph, i.e., the graph encoded by the nonzero entries

of a matrix. The above support-preserving constraint gives ICID a significantly lower computational cost than dense-matrix methods. At the same time, the solution set of ICID is guaranteed to include the true DAG of the linear SEM under a mild faithfulness assumption [7], as illustrated in Figure 2.

Figure 2: The support of B is guaranteed to be included in Supp(Θ) under a mild assumption [7].
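For intuition on the optimization underlying ICID, the sketch below replaces the augmented Lagrangian method of [19] with a simpler penalized proximal-gradient scheme on 0.5∥(I − B)D(I − B)ᵀ − Θ∥²_F + λ∥B∥ℓ1, restricted to Supp(Θ). The chain DAG, step size, and penalty weight are hypothetical choices, not the actual ICID implementation.

```python
import numpy as np

# Hypothetical 4-node chain DAG X1 -> X2 -> X3 -> X4, unit noise variances.
d = 4
B_true = np.diag([0.8, -0.6, 0.7], k=1)
I = np.eye(d)
D = np.eye(d)                              # D = Omega^{-1}, fixed and known here
Theta = (I - B_true) @ D @ (I - B_true).T  # inverse covariance of the SEM

# Search restricted to Supp(Theta) with a zero diagonal, as in ICID.
mask = (np.abs(Theta) > 1e-12) & ~np.eye(d, dtype=bool)

def soft(x, t):
    """Soft-thresholding: proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def loss(B):
    """Smooth part: 0.5 * || (I - B) D (I - B)^T - Theta ||_F^2."""
    return 0.5 * np.linalg.norm((I - B) @ D @ (I - B).T - Theta) ** 2

B, eta, lam = np.zeros((d, d)), 0.02, 0.01
loss0 = loss(B)
for _ in range(2000):
    R = (I - B) @ D @ (I - B).T - Theta
    grad = -2.0 * R @ (I - B) @ D          # gradient of the smooth term
    B = soft(B - eta * grad, eta * lam) * mask   # prox step + support projection
```

The projection onto the support mask is what keeps the per-iteration cost low; the nonconvexity of the factorization term is the source of the hardness question raised below.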

A second work plan is to identify the hardness of the optimization problem underlying

ICID. It is worth mentioning that the work of the postdoc candidate shows that the

performance of ICID remains sensitive to the accuracy of the precision matrix estimator.

On the other hand, ICID also yields improved results when a regularization term using skewness-based statistics [20] is added (see [19] and the activities report (Fig. 4)), a promising sign for further investigation in this direction.

To go beyond linear SEMs and extend a matrix or tensor decomposition method to

more complex causal models, we may consider a path through the theoretical framework

of kernel methods (e.g., [21]). Similar to the extension of covariance matrices to nonlinear

kernels, a matrix or tensor decomposition model has the potential of being used in the

context of nonlinear causal models.

Statistical performance and sample-efficient algorithms. As discussed in Section 1.2, a common challenge in high-dimensional statistics arises when the number of variables (features) is much larger than the number of available samples, i.e., the “large p, small n” problem.

In the literature on causal structure learning and structure learning more generally, a significant

open question is the optimal sample complexity of learning a directed acyclic graphical

model. Previous work in the study of upper bounds on the sample complexity include [22].

Despite these upper bounds, a tight characterization of the optimal sample complexity is

missing. This is to be contrasted with the situation for learning undirected graphs, also

known as Markov random fields, for which optimal rates were established in [23], alongside

similar results for support recovery in linear models [24]. This is in fact unsurprising given

the connection between these two problems via neighbourhood regression. Unfortunately,

learning a directed acyclic graph (DAG) does not reduce to neighbourhood regression as

it involves a more difficult constraint for acyclicity.

Gao et al. (2022) [25] introduced an algorithm that combines causal order recovery and

causal structure learning, and showed its optimality in sample complexity. However, their

analysis and results are restricted to the linear SEM with one particularly strong assumption,

that is, the noise variables of the SEM are assumed to have equal variances. For more

general cases without the equal noise-variance assumption, the sample complexity and

even the identifiability of the DAG remain open questions.

In this context, we will investigate more refined sparsity-pursuing methods such as

structured sparsity regularization [26]. For causal structure learning, a key challenge

is to adapt the structured sparsity methodology to learning DAGs, while the typical

scenario in the previous work is the learning of undirected graphs (e.g., group sparsity

for graph LASSO [27]). The postdoc will tackle the above questions through the lens of his

sparse matrix decomposition approach (ICID) [19] and a related work on sparse-ICA [28].

Both methods involve an ℓ1-norm term for finding sparse graphs according to a specific matrix decomposition model. Their learning accuracy depends on the accuracy of either the precision matrix or the covariance matrix, both of which rely on a sufficient number of samples.
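As a building block of structured sparsity regularization [26], the proximal operator of the group-ℓ2 penalty is block soft-thresholding; the grouping of coefficients below is a hypothetical example.

```python
import numpy as np

def prox_group_l2(x, groups, t):
    """Block soft-thresholding: proximal operator of t * sum_g ||x_g||_2,
    the basic computation behind group-sparse regularization [26]."""
    out = x.copy()
    for g in groups:
        norm = np.linalg.norm(x[g])
        out[g] = 0.0 if norm <= t else (1.0 - t / norm) * x[g]
    return out

# Hypothetical grouping of 4 coefficients into two groups of two.
x = np.array([3.0, 4.0, 0.1, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
z = prox_group_l2(x, groups, 1.0)
# First group (norm 5) is shrunk by factor 0.8; second (norm ~0.14) is zeroed.
```

Zeroing whole groups at once is what distinguishes this penalty from the plain ℓ1 norm used in ICID and sparse-ICA, and is the mechanism to be adapted to DAG learning.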

References

[1] Stratis Tsirtsis, Amir-Hossein Karimi, Ana Lucic, Manuel Gomez-Rodriguez, Isabel Valera, and Hima Lakkaraju.

ICML workshop on algorithmic recourse. 2021.

[2] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.


[3] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification

and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012,

2016.

[4] Otis Dudley Duncan. Introduction to structural equation models. Elsevier, 2014.

[5] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from data, pages 121–130.

Springer, 1996.

[6] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT

press, 2000.

[7] Po-Ling Loh and Peter Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105, 2014.

[8] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization

for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018.

[9] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear

DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.

[10] Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models.

Journal of Causal Inference, 6(2):20170016, 2018.

[11] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables.

Journal of the American Statistical Association, 91(434):444–455, 1996.

[12] Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets.

In Jonas Peters and David Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence

(UAI), volume 124 of Proceedings of Machine Learning Research, pages 1388–1397. PMLR, 03–06 Aug 2020.

[13] Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging. Nature Communications,

11(1):3673, 2020.

[14] Peng Ding and Fan Li. Causal inference: A missing data perspective. Statistical Science, 33(2):214–237, 2018.

[15] Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity.

In International Conference on Artificial Intelligence and Statistics, pages 1466–1475. PMLR, 2018.

[16] Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat,

7(1):e183, 2018.

[17] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for

causal discovery. Journal of Machine Learning Research, 7(10), 2006.

[18] Yiheng Liu, Elina Robeva, and Huanqing Wang. Learning linear non-Gaussian graphical models with multidirected

edges. Journal of Causal Inference, 9(1):250–263, 2021.

[19] Shuyu Dong, Kento Uemura, Akito Fujii, Shuang Chang, Yusuke Koyanagi, Koji Maruhashi, and Michèle Sebag. Learning large causal structures from inverse covariance matrix via matrix decomposition. arXiv preprint

arXiv:2211.14221, 2023.

[20] Aapo Hyvärinen and Stephen M Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation

models. The Journal of Machine Learning Research, 14(1):111–152, 2013.

[21] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of

distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.

[22] Or Zuk, Shiri Margel, and Eytan Domany. On the number of samples needed to learn the correct structure of a

Bayesian network. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages

560–567, 2006.

[23] Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models

in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.


[24] Martin J Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting.

IEEE transactions on information theory, 55(12):5728–5741, 2009.

[25] Ming Gao, Wai Ming Tai, and Bryon Aragam. Optimal estimation of Gaussian DAG models. In International Conference on Artificial Intelligence and Statistics, pages 8738–8757. PMLR, 2022.

[26] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal

Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.

[27] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group LASSO with overlap and graph LASSO. In

Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.

[28] Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Kun Zhang. On the identifiability of sparse ICA without assuming

non-Gaussianity. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
