Ultraconserved elements (UCEs) are DNA sequences that are extremely conserved and found almost unchanged in the genomes of multiple, divergent species [1]. UCEs have been found in a wide variety of organisms, including mammals, fish, insects, birds, and plants. Whilst the evidence suggests that they are the result of natural selection, indicating biological importance, their function has thus far proven elusive [2]. The recent explosion in the quality and quantity of reference genomes across multiple taxa provides new opportunities for investigating the prevalence, evolution and role of UCEs. However, the field is hampered by a lack of fast and resource-efficient algorithms to identify UCEs. Furthermore, common alignment-based algorithms fail to identify non-syntenic UCEs.
Here, we present dedUCE, a novel tool for identifying all UCEs in a set of genomes. dedUCE uses a hash-based algorithm to rapidly identify core UCE k-mers that are shared by multiple genomes, before extending and merging candidates into a final set of UCEs that is comprehensive but non-redundant. dedUCE supports UCEs that appear out of order due to genetic rearrangements and/or assembly artefacts, and is able to return UCEs with inexact homology and support. The tool can then apply these UCEs to genome assemblies and produce three related measures of completeness, based on the presence of UCEs and their ordering in a consensus syntenic map.
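The hash-then-merge idea described above can be illustrated with a minimal Python sketch. It is not dedUCE's implementation: the function names (`kmers`, `shared_core_kmers`, `merge_candidates`), the `min_support` parameter and the toy sequences are all hypothetical, and the sketch assumes exact matching on the forward strand only. Merged regions are candidates only; a run of shared k-mers in one genome is not guaranteed to be contiguous in the others, so a real pipeline would still need to verify each region.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_core_kmers(genomes, k, min_support):
    """Return k-mers present in at least min_support genomes.

    genomes maps a genome name to its sequence. A dict stands in for the
    hash table: each k-mer is counted once per genome that contains it.
    """
    counts = {}
    for seq in genomes.values():
        for kmer in kmers(seq, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer for kmer, n in counts.items() if n >= min_support}


def merge_candidates(reference, core, k):
    """Merge runs of consecutive core k-mers in reference into maximal regions.

    The result is a list of candidate sequences, not confirmed UCEs.
    """
    regions = []
    start = None
    end = 0
    for i in range(len(reference) - k + 1):
        if reference[i:i + k] in core:
            if start is None:
                start = i          # a new run of core k-mers begins here
            end = i + k            # the run now extends to the end of this k-mer
        elif start is not None:
            regions.append(reference[start:end])
            start = None
    if start is not None:          # close a run that reaches the sequence end
        regions.append(reference[start:end])
    return regions


# Toy data (hypothetical): three short "genomes" sharing ACGTTGCA.
genomes = {
    "a": "GGACGTTGCATT",
    "b": "CCACGTTGCAGG",
    "c": "TAACGTTGCACC",
}
core = shared_core_kmers(genomes, k=4, min_support=3)
candidates = merge_candidates(genomes["a"], core, k=4)
print(candidates)  # ['ACGTTGCA']
```

Setting `min_support` below the number of genomes is one simple way to model partial support. Inexact homology and out-of-order elements would need additional logic that this sketch does not attempt.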
Preliminary results show that dedUCE can identify all UCEs in a group of 40 mammalian genomes in 8 hours on a 16-core machine, which is orders of magnitude faster than previous algorithms. Using UCEs identified in organisms closely related to four target assemblies, dedUCE provides a good complement to existing completeness measurements, especially in targeting non-coding regions and identifying local rearrangements of UCE pairs.
dedUCE therefore improves on existing methods for UCE identification, enabling large-scale analysis, and provides new tools for measuring assembly completeness.