Decoding the human genome: the most accurate version ever

Jun 12, 2023 | Molecular Biology

Figure 1 – The researchers succeeded in adding a missing piece of the human genome, represented here as a complex puzzle.


Abstract

Ever since the first version of the human genome was released, it has been clear that it represents one of the most important scientific projects of our time, if not the most important. The latest effort to achieve a complete genome was made in 2022 at the National Human Genome Research Institute in Maryland (USA), in collaboration with many other American scientific institutions (1). The resulting Telomere-to-Telomere assembly, T2T-CHM13, comprises a total of 3.055 billion base pairs, obtained with the most advanced long-read sequencing technologies available: PacBio circular consensus sequencing (HiFi) and Oxford Nanopore ultralong-read sequencing (ONT).

Discussion

The first version of the human genome was released in 2003 by the Human Genome Project. Since then, many updated versions have followed, each adding more reads with greater precision. One of the most recent, GRCh38.p13 (2), was assembled from sequenced bacterial artificial chromosomes (BACs), an approach that left several regions unfinished or incorrect. In total, it lacked 151 Mbp of sequence, mainly ribosomal DNA (rDNA) and the short arms of all five acrocentric chromosomes.

Genome assembly

To overcome the limitations of BAC-based assembly, the researchers used long-read shotgun sequencing. PacBio HiFi and ONT reads were combined to obtain the longest and most accurate uninterrupted sequences possible, producing a telomere-to-telomere assembly that was supported by complementary techniques.

The chosen cell line, derived from a complete hydatidiform mole (CHM), is homozygous, and no excess of singleton alleles or loss-of-function variants was detected in it.

The core of the T2T-CHM13 assembly is a high-resolution string graph built from HiFi reads, which have a very low error rate (0.1%). Errors were mostly small insertions or deletions within repeats, and they were corrected by comparing reads with one another. Most chromosomes were completed with high accuracy using this approach; the only exceptions were chromosome 9 and the five acrocentric chromosomes. Chromosome 9 was easily resolved with ONT reads, which helped to fill gaps in HiFi coverage and to untangle the complex structures in the graph.
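To put the 0.1% error rate in perspective, it can be converted into a Phred quality score and into an expected number of errors per read. The sketch below is only an illustration: the 20,000-base read length is a hypothetical value chosen for the example, and the function names are not taken from the study.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a per-base error probability into a Phred quality score (Q)."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate


hifi_error_rate = 0.001        # 0.1%, the HiFi error rate quoted in the text
example_read_length = 20_000   # hypothetical read length, for illustration only

print("Phred quality:", phred_quality(hifi_error_rate))
print("Expected errors per read:", expected_errors(example_read_length, hifi_error_rate))
```

Under these assumptions, an error rate of 0.1% corresponds to roughly Q30, that is, about one wrong base in every thousand.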

rDNA assembly

The same procedure could not be applied to the most complex regions of the CHM13 string graph, namely the rDNA arrays and their surrounding sequences. To assemble these highly dynamic genomic regions and to overcome the limitations of string graph assembly, the researchers constructed sparse de Bruijn graphs from HiFi reads for each of the five rDNA arrays (3). These graphs were then aligned with ONT reads to identify a set of walks, which were clustered into “morphs” and, finally, converted into sequence. The final T2T-CHM13 assembly contains 219 complete rDNA copies, and 99.86% of the assembly lies within 3 standard deviations of the mean read coverage.
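The idea of a de Bruijn graph can be illustrated with a toy example. The sketch below builds an ordinary (dense) de Bruijn graph from three made-up reads and spells out the sequence along an unambiguous walk. It is not the minimizer-based sparse construction used by MBG, nor the researchers' pipeline; the reads, the value of k and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Follow edges from `start` while each node has exactly one successor,
    appending the last base of every new node to the spelled sequence."""
    sequence = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping around a cycle
            break
        sequence += next_node[-1]
        visited.add(next_node)
        node = next_node
    return sequence


# Toy overlapping reads (hypothetical data, not real sequencing reads).
toy_reads = ["ATGGCGT", "GCGTACC"]
toy_graph = build_de_bruijn(toy_reads, k=4)
print(walk_unambiguous(toy_graph, "ATG"))
```

In real repeat arrays such as rDNA, many nodes have more than one successor, so the walk is ambiguous. This is why long ONT reads are needed to choose among the possible paths.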

Figure 2 – A total of 238 Mbp of sequence was added or corrected. Most of it consists of centromeric satellites (156.2 Mbp), segmental duplications (44.2 Mbp) and rDNA (9.9 Mbp).
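The figures in the caption can be turned into percentages with a few lines of arithmetic. The sketch below uses only the numbers quoted in the caption; the "other" category is simply the remainder left after subtracting the three named classes, and the function name is an assumption for illustration.

```python
def percentage_shares(categories_mbp, total_mbp):
    """Return each category's share of the total (in percent),
    plus an 'other' entry for whatever is not listed."""
    shares = {name: 100 * size / total_mbp for name, size in categories_mbp.items()}
    listed = sum(categories_mbp.values())
    shares["other"] = 100 * (total_mbp - listed) / total_mbp
    return shares


total_added_or_corrected = 238.0  # Mbp, from the figure caption
categories = {
    "centromeric satellites": 156.2,  # Mbp
    "segmental duplications": 44.2,   # Mbp
    "rDNA": 9.9,                      # Mbp
}

for name, share in percentage_shares(categories, total_added_or_corrected).items():
    print(f"{name}: {share:.1f}%")
```

The three named classes together account for most of the 238 Mbp, with centromeric satellites alone making up roughly two thirds.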

Materials and Methods

The techniques and materials used in the study are summarized below:

  • 30× PacBio circular consensus sequencing (HiFi);
  • 20× Oxford Nanopore ultralong-read sequencing (ONT);
  • 100× Illumina PCR-Free sequencing (ILMN);
  • 70× Illumina Arima Genomics Hi-C (Hi-C);
  • BioNano optical maps;
  • single-cell DNA template strand sequencing (Strand-seq);
  • CHM13: diploid, homozygous cell line with a 46,XX karyotype.
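The "×" values in the list are sequencing depths, meaning the average number of times each base of the genome is covered by reads. As a rough sketch, the amount of raw sequence implied by each depth can be estimated by multiplying the depth by the genome size. The example uses the 3.055 billion base pairs of the final assembly as the genome size, which is a simplifying assumption, and the function name is hypothetical.

```python
def bases_for_coverage(depth: int, genome_size: int) -> int:
    """Total number of sequenced bases needed to reach a given average depth."""
    return depth * genome_size


GENOME_SIZE = 3_055_000_000  # base pairs in the T2T-CHM13 assembly

# Depths taken from the list above.
depths = {"HiFi": 30, "ONT": 20, "ILMN": 100, "Hi-C": 70}

for technology, depth in depths.items():
    total_gbp = bases_for_coverage(depth, GENOME_SIZE) / 1e9
    print(f"{technology}: {depth}x is about {total_gbp:.1f} Gbp of reads")
```

For example, 30× HiFi coverage of a 3.055 Gbp genome corresponds to about 91.65 Gbp of sequenced bases.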

Conclusions

The high-quality sequence of satellite repeats and segmental duplications, in which the acrocentric chromosomes are enriched, allowed the researchers to put the last piece of the puzzle in place and obtain the most complete version of the human genome, with almost 100% fidelity. A total of 238 Mbp was added or corrected, of which 182 Mbp has no primary alignments to previous assemblies and is exclusive to T2T-CHM13 (Fig. 2). Most of these 182 Mbp are noncoding, but they include about a hundred new protein-coding genes.

This is particularly important for the study of facioscapulohumeral muscular dystrophy (FSHD), because several paralogs of the genes that cause this disease were found in the acrocentric chromosomes. The newly completed assembly will therefore support advanced clinical analyses that were not possible until now (4).

However, there are still some limitations. First, T2T-CHM13 completely lacks chromosome Y, because the cell line has a 46,XX karyotype. Moreover, the newly added 182 Mbp of sequence could be exclusive to T2T-CHM13 not only because of technical limitations in earlier assemblies, but also because of differences between the cell lines used in prior work.

Despite these limitations, the researchers have undoubtedly produced the most complete and faithful version of the human genome to date.

References

  1. Sergey Nurk et al., The complete sequence of a human genome. Science 376, 44–53 (2022).
  2. Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, Fulton RS, Kremitzki M, Magrini V, Markovic C, McGrath S, Steinberg KM, Auger K, Chow W, Collins J, Harden G, Hubbard T, Pelan S, Simpson JT, Threadgold G, Torrance J, Wood JM, Clarke L, Koren S, Boitano M, Peluso P, Li H, Chin CS, Phillippy AM, Durbin R, Wilson RK, Flicek P, Eichler EE, Church DM. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017 May;27(5):849-864.
  3. M. Rautiainen, T. Marschall, MBG: Minimizer-based sparse de Bruijn Graph construction. Bioinformatics 37, 2476–2478 (2021).
  4. Schätzl, T., Kaiser, L. & Deigner, HP. Facioscapulohumeral muscular dystrophy: genetics, gene activation and downstream signalling with regard to recent therapeutic approaches: an update. Orphanet J Rare Dis 16, 129 (2021).

Sara Blengino

Master's student in Industrial Biotechnology

Daniele Pelloni

Master's student in Industrial Biotechnology