Poster Presentation 43rd Lorne Genome Conference 2022

Machine learning pipeline, VariantSpark + BitEpi, reveals novel variants and epistatic interactions associated with coronary artery disease (#252)

Letitia M.F. Sng 1, Piotr Szul 2, Johan Verjans 3, Denis C. Bauer 1, Natalie A. Twine 1
  1. Health & Biosecurity, CSIRO, Sydney, NSW, Australia
  2. Data61, CSIRO, Brisbane, QLD, Australia
  3. South Australia Health and Medical Research Centre, Adelaide, SA, Australia

Cardiovascular disease (CVD) is the leading cause of mortality worldwide. Although behavioural risk factors are important, CVD aetiology also has a strong genetic component. Genome-wide association studies have identified hundreds of loci associated with CVD risk, but these loci account for less than 50% of CVD heritability. Epistasis, the combinatorial effect of multiple genetic variants, may explain part of this ‘missing heritability’. However, epistasis analysis poses multiple challenges for parametric statistical methods, including high computational demand and a heavy multiple-testing burden.

VariantSpark is a cloud-based machine-learning platform that can efficiently identify complex interactions between millions of SNPs from thousands of samples. We applied VariantSpark to the UK Biobank dataset and identified 141 significant SNPs associated with coronary artery disease (CAD). These SNPs map to known CAD genes including LPA, CDKN2B, and CELSR2 as well as novel genes including FLJ34503 and FBN2. For comparison, the SNPs found significant by a logistic regression model on the same dataset were also significant in VariantSpark, and all were known associations.

We then used our novel epistasis platform, BitEpi, to search for interactions between the 141 significant SNPs. We found that all the novel SNPs identified by VariantSpark were involved in higher-order epistatic interactions with known CAD SNPs. For example, the microRNA MIR1538 was involved in a three-way interaction with the LPA and LPAL2 genes, suggesting that the miRNA may regulate the known deleterious effect of LPA and LPAL2 on CAD.
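
To illustrate why restricting the interaction search to the 141 significant SNPs matters, the sketch below counts the candidate SNP combinations at each interaction order and derives a Bonferroni-style per-test threshold. It is a minimal arithmetic example only: the function names are hypothetical, the one-million-SNP figure is an illustrative assumption, and Bonferroni correction is used as a familiar stand-in rather than as the procedure BitEpi applies.

```python
from math import comb


def interaction_tests(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of the given interaction order."""
    return comb(n_snps, order)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction (illustrative)."""
    return alpha / n_tests


# Search space after filtering to the 141 significant SNPs.
pairs_141 = interaction_tests(141, 2)
triples_141 = interaction_tests(141, 3)

# Assumed genome-wide panel of one million SNPs, for contrast only.
pairs_million = interaction_tests(1_000_000, 2)

print(f"141 SNPs, pairs:   {pairs_141:,}")
print(f"141 SNPs, triples: {triples_141:,}")
print(f"1M SNPs, pairs:    {pairs_million:,}")
print(f"Per-test threshold for 141-SNP triples: {bonferroni_threshold(triples_141):.2e}")
```

Under these assumptions, the pairwise search alone grows from under ten thousand tests to roughly half a trillion when moving from the filtered set to a genome-wide panel, which is the computational and multiple-testing pressure that a two-stage approach sidesteps.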

Further exploration of the other interactions through pathway analysis and gene ontology found that most interactions occur between key CAD pathways. Finally, incorporating these epistatic interactions into risk prediction models can improve the predictive ability of existing genetic risk scores, which are based on variants with additive effects only.
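
The difference between an additive-only risk score and one that also includes epistatic terms can be sketched as follows. This is a toy illustration, not the model used in this work: the function names, the weights, and the choice to encode an interaction as the product of allele dosages are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def additive_score(dosages: Sequence[int], weights: Sequence[float]) -> float:
    """Additive genetic risk score: weighted sum of allele dosages (0, 1 or 2)."""
    return sum(w * d for w, d in zip(weights, dosages))


def epistatic_score(
    dosages: Sequence[int],
    weights: Sequence[float],
    interactions: Dict[Tuple[int, ...], float],
) -> float:
    """Additive score plus one weighted term per SNP interaction.

    Each interaction is keyed by the indices of the participating SNPs and
    contributes weight * product of their dosages (an assumed encoding).
    """
    score = additive_score(dosages, weights)
    for snp_indices, weight in interactions.items():
        product = 1
        for i in snp_indices:
            product *= dosages[i]
        score += weight * product
    return score


# Toy individual with four SNPs; all values are made up for illustration.
dosages = [2, 1, 0, 1]
weights = [0.5, 0.25, 0.5, 0.0]      # the fourth SNP has no additive effect
interactions = {(0, 1, 3): 0.5}      # hypothetical three-way interaction

base = additive_score(dosages, weights)
full = epistatic_score(dosages, weights, interactions)
print(base, full)
```

In this toy case the fourth SNP contributes nothing to the additive score but changes the total through the three-way term, which is the kind of signal an additive-only score cannot capture.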