Background. Genome-wide association studies (GWAS) define how genetic variation influences phenotypic variation by providing gene-trait associations. We hypothesised that genes with shared functions influence complex trait phenotypes in similar ways. Therefore, sampling a sufficiently diverse landscape of gene-trait variation provides an unsupervised strategy to parse the organisation of cellular gene programs.
Methods. MultiXcan analysis was performed for 1,393 complex traits sampled from ~400,000 individuals to generate a gene-trait association matrix in which each gene is linked with each phenotype by the significance of their association. Genes were clustered using dimensionality reduction methods from Seurat, and the resulting clusterings were consolidated into a consensus matrix for hierarchical clustering.
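The consensus step can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the three toy clustering runs, the function names consensus_matrix and average_linkage, and the choice of average linkage on 1 minus the co-clustering frequency are all assumptions made for illustration. In practice the label vectors would come from repeated Seurat clustering runs on the gene-trait association matrix.

```python
from itertools import combinations

def consensus_matrix(runs):
    """Fraction of clustering runs in which each pair of genes shares a cluster."""
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]

def average_linkage(consensus, n_clusters):
    """Agglomerative clustering on the distance 1 - consensus (average linkage)."""
    clusters = [[i] for i in range(len(consensus))]

    def dist(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(1.0 - consensus[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (a < b, so deleting b is safe).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical input: cluster labels for six genes from three clustering runs.
genes = ["A", "B", "C", "D", "E", "F"]
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 2],
]

consensus = consensus_matrix(runs)
modules = average_linkage(consensus, n_clusters=2)
print([[genes[i] for i in module] for module in modules])
```

Genes that co-cluster in most runs end up in the same module even when individual runs disagree, which is the motivation for consolidating several clusterings before the final hierarchical step.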
Results. 16,849 genes were clustered into 242 unique gene groups predicted to share biological functions based on their pattern of association with complex traits. Gene clusters were significantly enriched for known biological gene sets and protein-protein interactions governing development, signalling, disease, and homeostasis, with high specificity for the top-ranked gene ontologies across all clusters. We show that genes associated with an independent GWAS phenotype reproducibly cluster within our identified gene modules, indicating that genetic effects on complex trait phenotypes are biologically conserved and predictable. Furthermore, our clustering predictions can identify gene programs influencing complex traits from underpowered and transethnic GWAS without requiring larger cohorts to gain statistical power. Lastly, we show that despite the considerable size of the UK Biobank, the data do not saturate gene clustering predictions, which will therefore improve in quality and accuracy as more data become available.
Conclusion. Our analysis provides a novel approach for predicting genetic mechanisms of development and disease by identifying genes coordinating cell function in an unsupervised way using large-scale GWAS data. As such, we demonstrate that the effect of genetic variation on biological phenotypes is predictable.