Machine Learning in Statistical Genetics: A Unified Framework from Representation Learning to Causal Inference

Xuanjun Fang

Research Article

Hainan Provincial Key Laboratory of Crop Molecular Breeding, Hainan Institute of Tropical Agricultural Resources (HITAR), Sanya, 572025, Hainan, China

Computational Molecular Biology, 2026, Vol. 16, No. 5
Received: 16 Jul., 2026 Accepted: 18 Aug., 2026 Published: 03 Sep., 2026

This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The genetic dissection of complex traits is undergoing a paradigm shift driven by high-dimensional data, nonlinear architectures, and multi-modal integration. Traditional statistical genetics approaches, centered on linear mixed models (LMMs), provide robust frameworks for population structure correction, effect estimation, and causal inference, yet remain limited in capturing higher-order interactions and complex regulatory patterns. In parallel, machine learning and artificial intelligence (ML/AI) methods have demonstrated superior predictive performance through representation learning and nonlinear modeling in large-scale genomic and multi-omics datasets. However, their outputs are largely confined to association-level findings and are often challenged by limited interpretability and portability. Here, we systematically reconceptualize the relationship between statistical genetics and ML/AI by proposing a unified analytical framework. In this framework, ML/AI functions as a representation layer, compressing high-dimensional signals and extracting latent structures, while statistical genetics serves as an inference layer, enabling effect estimation and hypothesis testing. These components are further integrated within a causal inference framework, forming a closed-loop pipeline from prediction to representation, inference, and causality. Within this context, we identify interpretability stability and result portability as key constraints governing ML/AI applications and highlight the role of structural priors—such as regulatory networks and causal graphs—in mitigating spurious associations and facilitating causal discovery. Our analysis demonstrates that the primary value of ML/AI lies not in replacing traditional statistical models, but in enhancing the representation of complex genetic signals, whereas statistical genetics remains indispensable for ensuring inferential validity and reproducibility. 
Future progress will depend on structurally constrained model integration, alongside the development of standardized benchmark datasets and evaluation frameworks. Such advances will enable a transition from association-based analysis toward causal understanding, ultimately unifying predictive performance with biological interpretability.

Keywords

Statistical genetics; Machine learning; Explainable AI; Causal inference; Representation learning; Linear mixed models; Multi-omics integration; Structural constraints
