BACKGROUND: Interpreting genomic variants to the clinical standards of a diagnostic laboratory is labour-intensive and can take hours to days. We aimed to reduce this workload by eliminating informationally redundant genomic attributes, thereby shrinking the number of variants that need manual curation.
METHODS: We evaluated several machine learning algorithms using a clinically validated training dataset of pathogenic and non-pathogenic variants to develop NINO, a parsimonious decision tree genomic classifier built on optimisation of variant annotations in our existing pipeline. We used NINO to generate a candidate list of potentially pathogenic variants and AMELIE, a freely available phenomic classifier, to rank these variants based on phenotypic relevance. The resultant workflow is TOP MOVIE, a Tandem, Orthogonal Parsimonious Mendelian Optimized Variant Interpretation Engine.
RESULTS: NINO reduced the number of genomic attributes needing evaluation by an order of magnitude. Running the phenomic classifier AMELIE in tandem with NINO further decreased the number of candidate variants requiring curation. The resultant TOP MOVIE workflow significantly reduces the variant search space and identifies the causative pathogenic variant with a marked decrease in turnaround time (TAT).
CONCLUSIONS: TOP MOVIE performs as well as human experts but is far faster. It can be easily implemented in any clinical diagnostic laboratory and optimised using that laboratory's existing annotation pipeline and referral population, and it can be customised and updated without any programming expertise. Our machine-learning-optimised parsimonious classifier (NINO) correctly classified known pathogenic variants using only a small proportion of commonly used genomic attributes, suggesting that existing in silico annotation tools may already hold sufficient information content for accurate diagnosis. TOP MOVIE is currently clinically validated for single nucleotide variants and indels (≤ 20 nt) in genomic coding regions; we have not yet applied it to structural variants or non-coding variants.
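
ILLUSTRATIVE SKETCH: The tandem design described in METHODS (a genomic filter that produces a candidate list, followed by a phenotype-based ranking of that list) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the NINO or AMELIE implementation: the attribute names, the thresholds, the gene labels and the phenotype scores are all hypothetical, and the real AMELIE service is not called here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    allele_freq: float      # population allele frequency (hypothetical attribute)
    deleteriousness: float  # in silico score scaled to 0..1 (hypothetical attribute)
    gene: str


def genomic_filter(v: Variant) -> bool:
    """Stage 1: toy two-split decision tree standing in for a parsimonious
    genomic classifier. Both thresholds are illustrative assumptions."""
    if v.allele_freq > 0.01:           # common variants are rejected first
        return False
    return v.deleteriousness >= 0.5    # rare variants need a high in silico score


def tandem_rank(variants, phenotype_score):
    """Stage 2: rank the surviving candidates by a per-gene phenotype score.
    The score dictionary is a hypothetical stand-in for a phenomic classifier."""
    candidates = [v for v in variants if genomic_filter(v)]
    return sorted(candidates,
                  key=lambda v: phenotype_score.get(v.gene, 0.0),
                  reverse=True)


# Made-up example data, for illustration only.
variants = [
    Variant("var_a", 0.20, 0.90, "GENE1"),    # too common, filtered out
    Variant("var_b", 0.001, 0.80, "GENE2"),   # kept, weak phenotype match
    Variant("var_c", 0.0005, 0.95, "GENE3"),  # kept, strong phenotype match
    Variant("var_d", 0.002, 0.10, "GENE3"),   # low in silico score, filtered out
]
phenotype_score = {"GENE2": 0.3, "GENE3": 0.9}

ranked = tandem_rank(variants, phenotype_score)
print([v.name for v in ranked])  # → ['var_c', 'var_b']
```

The two stages use different kinds of evidence (genomic annotations first, phenotype second), so the second stage only has to order the short list that survives the first.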