How UC Santa Cruz scientists bridged the gap in the human genome

UCSC Genomics Institute Associate Director Karen Miga.
(Via UC Santa Cruz)

An international team of scientists, co-led by UCSC Genomics Institute Associate Director Karen Miga, completed the first gap-free human genome sequence — an achievement experts weren’t expecting to happen this quickly. Her team’s work builds on the efforts of UCSC computer scientists who helped assemble the first working draft of a human genome with the international Human Genome Project in 2000.

More than 20 years ago, a team of international researchers including UC Santa Cruz scientists sequenced about 92% of the human genome. Last month, a new team, the Telomere-to-Telomere Consortium (T2T) — co-led by Karen Miga, associate director at the UCSC Genomics Institute — made the “gapless human genome sequence” available to the public, filling in that final crucial 8%.

Miga, an assistant professor of biomolecular engineering at UCSC, worked with more than 100 scientists from 32 centers and universities from the United States, the United Kingdom, Germany and Russia to complete the sequence by using new long-read technologies, including nanopore sequencing technology pioneered at UCSC.

By filling in the gaps, scientists will now be able to study the sequence variations related to disease, evolution and human biology, which will continue to lead to new understandings and treatments.

“I think everybody assumed that eventually technology would come to the point of being able to do it [bridging the gap],” said David Haussler, UCSC Genomics Institute director. “But there wasn’t any expectation that would happen soon.”

Haussler led the UCSC team that assembled the human genome in 2000 and posted it on the then-newly formed UCSC Genome Browser website — where Miga’s team recently posted its new complete genome sequence. Now, Haussler says, about 20,000 scientists visit the site every day, making it the most widely used genomics platform in human genetics.

With the completion of the project, one recognized globally by genomics experts, researchers will now have a fuller understanding of the human genome. A genome is all of an organism’s DNA, the instruction manual that guides a single cell as it grows into a fully developed organism, be it an apple, a worm or a human. DNA molecules are bound up tightly on structures called chromosomes, of which humans have 46.

A better understanding of chromosomes is key.

“If there are errors in these regions of the genome, it could lead to an imbalance in the number of chromosomes between the resulting daughter cells,” Miga said. “Gains and loss or rearrangements of chromosomes can lead to cancers, infertility and conditions that arise early in development.”

Miga and her colleagues first reported completing the new gapless genome, called T2T-CHM13, in July 2020. The team then published six papers detailing its analysis on March 31 in the journal Science.

Senior investigator Adam Phillippy of the Maryland-based National Human Genome Research Institute co-led the group’s work. As a collaborative, grassroots project, labs and centers received grants to support their work, but the project wasn’t centrally funded by the National Institutes of Health (NIH) or another institution.

Five UCSC scientists participated in the consortium: Miga, along with Benedict Paten, Kishwar Shafin, Mark Diekhans and Miten Jain.

When and how UC Santa Cruz got involved in the international Human Genome Project

UCSC’s continuing role in this great advance goes unrecognized by many locals.

Its 37-year path began in May 1985, when Chancellor Robert Sinsheimer — a leading molecular biologist — held the “Santa Cruz Workshop” on the feasibility of sequencing a human genome.

UCSC Genomics Institute Director Haussler recalls the visionary meeting.

“The idea at the time of sequencing a genome, that wasn’t just a few thousand DNA letters long, but one that was 3 billion DNA letters long, was viewed by many as totally in outer space,” he said.

Haussler, who wasn’t at the meeting, said it was an important step in gaining momentum to start sequencing. Sinsheimer tried to raise the funding to launch a project, but ultimately couldn’t.

For the next several years, the NIH and the Department of Energy held meetings and came up with plans for a possible project before it was finally launched in October 1990. Twenty sequencing centers in the U.S. and around the world began working on the technology to sequence the human genome.

In general, Haussler said, the Human Genome Project faced three major challenges. Think of the human genome as a book with 3 billion letters, written in a language of As, Cs, Ts and Gs. But the book is shredded into tiny pieces.

DNA molecules are bound up tightly on structures called chromosomes.
(Via National Human Genome Research Institute)

“The technology that you have can only read tiny little snippets at a time. And even that technology was very hard to develop,” said Haussler. “From 1990 until 2000, most of the work in the human genome project was to get the machines to read all of those little pieces as efficiently as possible. And that was 10 years of really hard work.”

Next, the machines reading the letters need to preserve all the information so the mix of words coming out can be put back together. Assembling the mix of words that came out of the machines back into book form turned out to be a bigger problem than expected.
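The shredded-book problem can be illustrated with a toy example. The sketch below is not the method Kent's program used, which the article does not describe. It is a minimal greedy approach that assumes error-free reads and no repeated passages, two assumptions that real genomes break. The function names and the short example sequence are made up for illustration.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two pieces that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left overlaps; the remaining pieces stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# A made-up 15-letter "book" and four overlapping snippets cut from it.
book = "ATGGCGTACGTTAGC"
reads = ["TACGTTA", "ATGGCGT", "GTTAGC", "GCGTACG"]
print(greedy_assemble(reads) == [book])  # True
```

When two snippets share no overlap, or when the same passage appears in many places, this simple approach leaves gaps or joins pieces incorrectly.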

In 1999, project leaders asked Haussler to help assemble the genome.

The Human Genome Project pushed to complete the genome because a private company had entered the race.

Celera, an Alameda-based for-profit corporation, aimed to sequence the human genome ahead of the Human Genome Project’s public efforts. Celera had far better machines and was on its way to sequencing the human genome first, said Haussler. The concern: The company would not make it freely accessible to scientists, but instead protect it with patents.

Haussler recruited a team of scientists to work on the assembly. He got the help of Jim Kent, a graduate student in UCSC’s Department of Molecular, Cell and Developmental Biology, in addition to systems engineer Patrick Gavin and graduate students Terrence Furey and David Kulp.

Members of the UCSC Genome Browser team, from left: Jim Kent, David Haussler, Patrick Gavin and Scot Free Kennedy at UCSC.
(Via UC Santa Cruz)

In May 2000, Kent started writing the 10,000 lines of code that eventually became the computer program that assembled the working draft of the human genome.

Haussler said the code took Kent four weeks to write. The process was repetitive: write code, take a nap, write code, ice his hands and write code.

On June 22, 2000, Kent’s program assembled the first working draft of the human genome, and on June 26, his work and that of the Human Genome Project were announced at a White House ceremony.

In the global science competition, Haussler’s team beat Celera by the narrow margin of a few days, allowing the consortium to share the genome with scientists worldwide. On July 7, 2000, UCSC scientists posted the draft on the internet.

Jim Kent sits next to the computer he used to write 10,000 lines of code to assemble the first draft assembly of the human genome in his garage in Santa Cruz in May 2000.
(Via UC Santa Cruz)

“For the first time, there’s a life form on the planet that’s read its own story,” said Haussler. “We are now able to understand the genetic sequence that starts every person’s life. We have an example of one.”

The early findings were remarkable. When the International Human Genome Sequencing Consortium published a draft sequence and analysis in the journal Nature in early 2001, it found that the DNA sequences of any two humans are about 99.9% identical. The other 0.1% accounts for all the genomic differences among the human species.
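To put that 0.1% in perspective, here is a rough back-of-the-envelope calculation. It assumes the round figure of 3 billion DNA letters cited earlier in this article and treats "about 99.9%" as exact, so the result is an order-of-magnitude estimate rather than a measured count.

```python
GENOME_LETTERS = 3_000_000_000  # the article's round figure of about 3 billion letters

# 0.1% is one part in a thousand; integer division keeps the result exact.
differing_letters = GENOME_LETTERS // 1000

print(f"{differing_letters:,} letters differ on average")  # 3,000,000 letters differ on average
```

On these figures, any two people differ at roughly 3 million positions in their DNA.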

Still, the only complete human genome sequenced so far is of European ancestry. Researchers are now eager to see what a global reference of genomes (a set of multiple, diverse genomes) can tell them.

In 2019, the Telomere-to-Telomere Consortium, co-led by Miga, joined the Human Pangenome Reference Consortium with the aim of creating a new “human pangenome reference,” which would include the genome sequences of 350 individuals.

UCSC scientists continue to lead genome research

Paten, a UCSC biomolecular engineering professor and a leader in the pangenome reference project, explains that having more individual genomes sequenced will provide a higher level of confidence that scientists have accurately captured the whole genome.

“We do differ by millions of variations. And there are certain regions of our genome that are quite variable between human beings. So in order to represent that sequence better, we can’t really make do with just one reference,” he said. “To be able to basically build a map that’s comprehensive, we need to sequence lots of genomes and they have to be done at really high quality.”

In 2019, the NIH announced $29.5 million in funding toward the pangenome project. It’s gotten a lot cheaper to do the work; sequencing the first human genome between 1990 and 2000 cost $300 million. Today, researchers say, it takes just a few days and costs about $1,000 to sequence someone’s genome, depending on where it’s done.

“Over the last 20-odd years, we’ve had this massive technological race to produce cheaper, better, faster, more scalable DNA sequencing technology,” Paten said. “It’s a lot like how the chips in our phones and so forth get faster every year, except kind of on steroids. It’s been going way, way, way faster than the progression of our phones.”

Paten was recently involved in breaking the world record for the fastest sequencing of a human genome: under eight hours.

Miga said these advances in technology and the Human Pangenome Reference Consortium are big steps in making genomics something that everyone can benefit from.

“I want to emphasize that we did not entitle our paper ‘the human genome’ on purpose,” she said. “It’s a human genome. And this genome that we’re representing is European ancestry. And so what we’d love to do now is to broaden that.”

FOR THE RECORD: This article has been updated to reflect the technologies used to complete the human genome sequence.
