Poster Presentation 43rd Lorne Genome Conference 2022

PeakCNV: an artificial intelligence based tool for genome-wide copy number variation association study (#152)

Mahdieh Labani 1, Ali Afrasiabi 1, Amin Beheshti 2, Hamid Alinejad-Rokny 1
  1. BioMedical Machine Learning Lab (BML), The Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, NSW 2052, Australia
  2. Department of Computing, Data Analytics Lab, Macquarie University, Sydney, NSW 2109, Australia

To date, many copy number variations (CNVs; structural genomic variations that delete or amplify a segment of the genome) have been identified as having pathogenic roles in several diseases. A major obstacle in CNV-based genome-wide association studies is categorising CNVs across all cases (individuals with the phenotype of interest) and controls (healthy individuals), because CNVs are inconsistent in sequence, size and genomic coordinates across individuals. An efficient strategy for categorising CNVs in genome-wide CNV-phenotype association studies is to build CNV regions (CNVRs; genomic regions in which CNVs overlap). However, this approach is susceptible to a high false-positive rate, because some CNVRs appear associated with the phenotype only through overlapping or co-occurring with true-positive CNVRs. We developed a tool, PeakCNV, that corrects this bias by assessing whether the association of each CNVR with the phenotype is independent of other CNVRs located at the same locus. PeakCNV differentiates false-positive CNVRs from true positives by calculating a new metric, the independence ranking score (IR-score), using an artificial intelligence-based feature ranking approach. We compared the performance of PeakCNV with existing tools by analysing CNV genotype data for individuals with neurodevelopmental disorders (19,663 cases and 6,479 healthy subjects). Crucially, our benchmarking tests indicated that PeakCNV identifies smaller candidate CNVRs that discriminate cases from controls significantly better. By integrating data from the FANTOM5 expression atlas and the Clinical Genomic Database, we showed that CNVRs identified by PeakCNV contain significantly more genes with brain-enriched expression and more genes associated with neurological conditions. We also showed that the accuracy of PeakCNV in identifying relevant candidate CNVRs is reproducible for prostate cancer.
Taken together, these results indicate that PeakCNV outperforms existing tools by identifying more biologically meaningful CNVRs relevant to the phenotype of interest.
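
To make the notion of a CNV region concrete, the following minimal Python sketch merges overlapping CNVs on the same chromosome into CNVRs and counts the distinct case and control carriers in each region. It is an illustration only and is not PeakCNV's algorithm: it does not compute the IR-score or perform any feature ranking. The names CNV, build_cnvrs and carrier_counts, the inclusive-coordinate convention and the toy coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    """One CNV call (hypothetical record layout, inclusive coordinates)."""
    sample: str
    chrom: str
    start: int
    end: int
    is_case: bool


def build_cnvrs(cnvs):
    """Merge CNVs that overlap on the same chromosome into CNV regions."""
    regions = []
    for cnv in sorted(cnvs, key=lambda c: (c.chrom, c.start, c.end)):
        last = regions[-1] if regions else None
        if last and last["chrom"] == cnv.chrom and cnv.start <= last["end"]:
            # Overlaps the current region: extend it and record the member.
            last["end"] = max(last["end"], cnv.end)
            last["members"].append(cnv)
        else:
            # No overlap: start a new region.
            regions.append({"chrom": cnv.chrom, "start": cnv.start,
                            "end": cnv.end, "members": [cnv]})
    return regions


def carrier_counts(region):
    """Return (number of distinct case carriers, number of distinct control carriers)."""
    cases = {m.sample for m in region["members"] if m.is_case}
    controls = {m.sample for m in region["members"] if not m.is_case}
    return len(cases), len(controls)


# Toy data, invented for illustration.
cnvs = [
    CNV("case1", "chr1", 100, 200, True),
    CNV("case2", "chr1", 150, 300, True),
    CNV("ctrl1", "chr1", 280, 350, False),
    CNV("case3", "chr1", 1000, 1100, True),
    CNV("ctrl2", "chr2", 100, 200, False),
]

regions = build_cnvrs(cnvs)
for r in regions:
    n_case, n_ctrl = carrier_counts(r)
    print(r["chrom"], r["start"], r["end"], "cases:", n_case, "controls:", n_ctrl)
```

In the toy data, the control CNV at 280-350 does not overlap the first case CNV at 100-200, yet all three calls are chained into a single region. This shows how simple overlap merging can produce broad CNVRs whose case and control counts mix distinct signals, which is the kind of ambiguity the abstract describes.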