Finding needles in haystacks: making sense of genomic data
Sequencing the genome is just the start; interpreting the sequence is the next big challenge
Earlier this month, Genomics England announced their chosen providers of clinical interpretation services for the first 8,000 patients enrolled in the main phases of the 100,000 Genomes Project. Just what do such services provide?
With modern high-throughput technology, a few facilities can now sequence an entire human genome within a day. The volume of data generated is massive: about 200 gigabytes (enough to fill an average laptop’s hard drive). Yet this is just the beginning; for the sequence to be useful, it needs to be processed and analysed. As a first step, the many short ‘reads’ of DNA sequence have to be mapped (aligned) against a known guide or reference genome sequence.
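To give a feel for what mapping reads against a reference involves, here is a minimal sketch in Python. It is an illustration only, not the method used by any sequencing provider: the sequences are made up, the function names are hypothetical, and real aligners cope with insertions, deletions and billions of bases. The idea is to index every short stretch of the reference, look up where the start of a read occurs, and then check how well the rest of the read matches at that position.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Record every position at which each k-letter substring occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) pairs where the read fits without gaps."""
    hits = []
    # Use the first k letters of the read as a 'seed' to find candidate positions.
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data, invented for illustration (positions are counted from 0).
reference = "ACGTTGCATGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
hits = map_read("TGCATGACG", reference, index, k=4)
print(hits)
```

In this toy case the read fits at one place in the reference with a single mismatching letter; that mismatch is exactly the kind of difference the later steps go on to examine.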
A reliable reference?
The original reference genome sequence is actually a composite of several different human sequences, but even so it is far from a perfect template, despite ongoing efforts by the international Genome Reference Consortium to improve it.
Some researchers have suggested that ultimately a different form of reference, a graph or ‘pan-genome’ representing hundreds or even thousands of different human genomes, may offer a superior alternative. Others report that advances in sequencing and computational analytic techniques will allow reference-free sequencing that reveals the true diversity of human genomes, even to the point of distinguishing the variants inherited from a person’s mother from those inherited from their father.
Spot the (significant) difference
For now, the test and reference genome sequences must be compared in order to identify variants. While human genomes are almost identical, the combined number of rare and common variations in any one sequence typically adds up to more than three million. But which of all these variants are associated with disease, and which could account for a particular disorder in the patient?
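At its simplest, comparing a test sequence with the reference means walking along both and noting where they differ. The sketch below assumes the two sequences are already aligned and the same length, and only looks for single-letter substitutions; the sequences and the function name are invented for illustration, and real variant calling also has to handle insertions, deletions and sequencing errors.

```python
def call_substitutions(reference, sample, offset=0):
    """List single-letter differences as (position, reference base, sample base).

    Positions are counted from 1, as is conventional when reporting variants.
    """
    variants = []
    for i, (ref_base, alt_base) in enumerate(zip(reference, sample)):
        if ref_base != alt_base:
            variants.append((offset + i + 1, ref_base, alt_base))
    return variants


# Toy data, invented for illustration.
reference = "ACGTTGCATGCC"
sample = "ACGTAGCATGTC"
variants = call_substitutions(reference, sample)
print(variants)
```

Here two of the twelve letters differ. Scaled up to a whole genome, the same comparison yields the millions of variants described above.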
Sophisticated computational algorithms compare the variants against databases of known disease-linked variants, seeking for example the rarer variants and those with potential biological links to the patient’s condition. In cancer genomes, mutations that derail the normal cellular checks and balances on replication are of particular clinical interest.
These steps help to filter the many variants down to a much smaller number of potential clinical relevance. Ultimately, deciding which (if any) of these are likely to have a genuine causal involvement in the disease in question typically requires in-depth, expert analysis.
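The filtering idea can be sketched in a few lines of Python. Everything here is hypothetical: the gene names, frequencies, the 1% rarity cut-off and the `shortlist` function are assumptions made for illustration, not the criteria used by any clinical interpretation service. The sketch keeps only rare variants that either have a known disease link or fall in a gene thought to be relevant to the patient’s condition, and lists known disease-linked variants first.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    change: str
    population_frequency: float  # fraction of people carrying it (made-up values)
    known_disease_link: bool     # would come from a disease-variant database


def shortlist(variants, panel_genes, max_frequency=0.01):
    """Keep rare variants that are disease-linked or in a gene of interest."""
    kept = [
        v for v in variants
        if v.population_frequency <= max_frequency
        and (v.known_disease_link or v.gene in panel_genes)
    ]
    # Known disease-linked variants first, then the rarest.
    return sorted(kept, key=lambda v: (not v.known_disease_link,
                                       v.population_frequency))


# Toy data, invented for illustration.
variants = [
    Variant("GENE_A", "c.100A>G", 0.25, False),    # common: filtered out
    Variant("GENE_B", "c.55C>T", 0.0001, True),    # rare, known disease link
    Variant("GENE_C", "c.7G>A", 0.002, False),     # rare, in a gene of interest
    Variant("GENE_D", "c.300T>C", 0.0005, False),  # rare, but no apparent relevance
]
candidates = shortlist(variants, panel_genes={"GENE_C"})
print([v.gene for v in candidates])
```

A shortlist like this is only a starting point; as noted above, judging whether any candidate genuinely causes the disease still calls for expert analysis.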
Keeping it simple
Bioinformatics is the professional discipline that seeks to make sense of the vast volumes of data generated by genomic sequencing. Increasingly, bioinformaticians are working as part of multidisciplinary teams to interpret raw genomic data and distil the clinically actionable information into a format that can be used in a healthcare setting.
At the end of the day, a non-expert health professional needs a comprehensible result delivered in a form that they can readily act on and explain to the patient.
This is the ultimate demand of the impressive combination of genomic, mathematical and computational power and expertise that makes up practical clinical interpretation.
Find out more
If you would like to learn more about bioinformatics and the problems it can help to solve, take our free short online course: Introduction to Bioinformatics.