Characterizing the intra-tumoral heterogeneity that arises as somatic mutations accumulate over time is critical to understanding the natural histories of cancer cell populations and to guiding patient treatment. Previous approaches to studying clonal structure from scRNA-seq data focus on different aspects of somatic mutation information. For example, Cardelino uses single-nucleotide variants (SNVs), while inferCNV and HoneyBADGER use copy number alterations (CNAs).
Here, we develop a more comprehensive Bayesian model that integrates several data inputs (including gene expression measurements, germline allelic fraction and somatic SNVs), allowing these orthogonal sources of information to borrow strength from each other. The proposed model jointly assigns single cells to subclonal populations and infers clone-level SNV and CNA profiles. Further, it is well known that copy number carries spatial dependency that must be properly accounted for in the model. While existing methods handle this in a separate step after clustering the cells, our model captures the spatial dependency within the clustering procedure itself. We derive a Gibbs sampler for the model in closed form and implement it.
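To illustrate what a Gibbs sampler with closed-form updates looks like for joint cell assignment and clone-level profile estimation, the following is a minimal sketch on a deliberately simplified toy model, not the model proposed here. It assumes Gaussian expression with known variance `sigma2`, a Normal(0, `tau2`) prior on each clone's per-gene mean, and a uniform prior over clones; it omits the SNV and allelic-fraction inputs and the spatial dependency of copy number entirely. The function name `gibbs_cluster` and all parameter names are hypothetical.

```python
import numpy as np

def gibbs_cluster(x, n_clones, n_iter=200, sigma2=1.0, tau2=10.0, seed=0):
    """Toy Gibbs sampler (illustrative assumption, not the paper's model).

    x: (cells, genes) expression-like matrix.
    Returns the last sampled cell labels and clone-level mean profiles.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = x.shape
    z = rng.integers(n_clones, size=n_cells)  # random initial assignment
    mu = np.zeros((n_clones, n_genes))
    for _ in range(n_iter):
        # Update clone profiles: mu_k | z, x is Normal by conjugacy.
        for k in range(n_clones):
            members = x[z == k]
            prec = 1.0 / tau2 + len(members) / sigma2
            mean = members.sum(axis=0) / sigma2 / prec
            mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
        # Update cell assignments: z_i | mu, x is categorical.
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logp = -0.5 * sq / sigma2
        logp -= logp.max(axis=1, keepdims=True)  # stabilise before exp
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        # Inverse-CDF draw of one label per cell.
        u = rng.random(n_cells)
        z = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        z = np.minimum(z, n_clones - 1)  # guard against rounding in cumsum
    return z, mu
```

Each sweep alternates between the two full conditionals, both of which are available in closed form under these assumptions. A fuller analysis would discard burn-in and summarise over retained draws rather than return only the last sample.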
We simulate 500 scRNA-seq datasets, each with 200 cells and 1000 genes, following the Splatter pipeline to maintain the key properties of a real dataset. Given the same input (gene expression), our model outperforms inferCNV in around 75% of cases, gaining an average of 11–14.5% in accuracy in both cell clustering and copy number state estimation. Its ability to integrate germline allelic fraction provides a further 10% efficiency gain in copy number state estimation. We apply our model to published scRNA-seq data from five melanoma patients, identifying subclones with distinct CNA profiles, and observe hundreds of differentially expressed genes between tumour clones, as well as recurrent differential activity in cancer-related MSigDB gene sets such as Myc targets.