DNA encoding method aids data-driven genetics research

New approach to representing genetic inheritance could support studies involving large datasets.

A new method of encoding and understanding the ancestry of a set of related DNA sequences could support storage and analysis of large genetic datasets.

The development has widespread applications, from understanding the evolution of viruses such as SARS-CoV-2, and mapping areas of DNA linked to disease, to leveraging these insights to benefit agricultural populations of plants and animals.

Inheritance network

A team of researchers including Professor Gregor Gorjanc of The Roslin Institute developed a simple, efficient method of capturing the history and relatedness of DNA sequences from sampled individuals.

Their method keeps track of which sections of DNA were inherited from which ancestors, and how these various sections are related.

This can be applied to large sets of DNA sequences from different individuals, creating a network of inheritance paths between sampled DNA sequences and their ancestors, also known as a genetic genealogy or an ancestral recombination graph (ARG).

Such genealogies can be used to shed light on the history of DNA sequences of sampled individuals, to compress this DNA data, and to speed up genetic analyses.
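As a rough illustration of the idea of tracking which sections of DNA were inherited from which ancestors, the sketch below stores a genealogy as a list of inheritance records, each saying that a child genome inherited one stretch of sequence from a parent genome. It is a minimal toy under stated assumptions, not the published implementation: the `Edge` record, the `parent_at` and `lineage` helpers, the genome names and the 100-base coordinates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    """One inheritance record (hypothetical layout, for illustration only)."""
    left: int    # start of the inherited DNA section (inclusive)
    right: int   # end of the inherited DNA section (exclusive)
    parent: str  # genome the section was inherited from
    child: str   # genome that inherited the section


# Toy genealogy over a 100-base sequence. sample_2 is a recombinant:
# its first 40 bases come from anc_A and the rest from anc_B.
EDGES = [
    Edge(0, 100, "anc_A", "sample_1"),
    Edge(0, 40, "anc_A", "sample_2"),
    Edge(40, 100, "anc_B", "sample_2"),
    Edge(0, 100, "root", "anc_A"),
    Edge(0, 100, "root", "anc_B"),
]


def parent_at(edges: List[Edge], child: str, position: int) -> Optional[str]:
    """Return the genome that `child` inherited `position` from, if recorded."""
    for edge in edges:
        if edge.child == child and edge.left <= position < edge.right:
            return edge.parent
    return None


def lineage(edges: List[Edge], genome: str, position: int) -> List[str]:
    """Follow one sequence position back through its chain of ancestors."""
    path = [genome]
    parent = parent_at(edges, genome, position)
    while parent is not None:
        path.append(parent)
        parent = parent_at(edges, parent, position)
    return path


if __name__ == "__main__":
    print(lineage(EDGES, "sample_2", 10))
    print(lineage(EDGES, "sample_2", 70))
```

In this toy, the two positions in `sample_2` trace back through different ancestors, which is the kind of signal that marks a recombinant sequence.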

Wide application

The novel approach allows scientists to store and analyse large amounts of genetic data, and it can be applied to any species of life on Earth.

For example, it forms the basis of a unified genealogy of more than 7,000 publicly available whole human genome sequences from a previous study and, more recently, of a genealogy of millions of SARS-CoV-2 genomes.

These SARS-CoV-2 genomes were collected over the span of the coronavirus pandemic, and their genealogy allows analysis of the recent history of the virus, pinpointing the emergence of novel mixed, or recombinant, strains.

The outcome will support ongoing research in population genetics by presenting a universal approach for encoding genealogies as graphs, allowing for easier sharing and comparison of results from scientists all over the world.

By using this simple method of recording genome-to-genome transmission of information, the study shows that a genealogy can be stored to different degrees of precision, which has important implications when the genealogy is inferred from the sampled DNA sequences.

This means relationships between different DNA sequences can be represented with limited precision regarding joins and splits that underlie the true genealogy.

The study, led by the Big Data Institute and involving the Roslin Institute, is published in Genetics in an article entitled "A general and efficient representation of ancestral recombination graphs".

This work highlights a way forward in working with ever-increasing genomic datasets both in number of individuals and number of DNA markers. We expect that analyses based on this DNA data encoding will also enable richer analyses and their applications across several areas of genetics.