MSU computer scientists help expand horizon of genetics research


A tweaked gene or two among the millions or even billions of base pairs that make up an organism's DNA is often all that distinguishes a drought-tolerant plant or a person predisposed to cancer.

That's why a better understanding of genetic variation within a species could, among other things, help improve selection of crops for local conditions and detection of disease, according to Joann Mudge, senior research scientist at the nonprofit National Center for Genome Resources.

A generation ago, recording an organism's DNA from beginning to end was so laborious and expensive that scientists celebrated when they completed the task for a single bacterium. But as genome sequencing becomes faster and cheaper, scientists increasingly have access to insights about which genes do what, Mudge said.

"We're sequencing multiple individuals of some species," including plants and other complex organisms, Mudge said. That allows scientists to begin to sort out which segments of DNA form a species' core genome and which correspond to traits shared by only some individuals, she said.

But the growing field of pangenomics, as it is called, presents a major analytical challenge. That's why NCGR recently partnered with Montana State University computer scientists to develop software that can compare multiple genomes and make sense of the results. The project is backed by a three-year, $662,000 grant from the National Science Foundation.
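
The distinction between a species' core genome and the genes found in only some individuals can be illustrated with a toy sketch. This is not the NCGR and MSU software, and it assumes the hard part is already done: each strain's genes have been identified and matched across strains. The function name, strain names and gene labels are all hypothetical.

```python
from collections import Counter


def split_pangenome(genomes):
    """Split genes into core (in every strain) and variable (in only some).

    genomes: dict mapping a strain name to the set of gene labels it carries.
    """
    # Count how many strains carry each gene; sets ensure one count per strain.
    counts = Counter(gene for genes in genomes.values() for gene in genes)
    total = len(genomes)
    core = {gene for gene, n in counts.items() if n == total}
    variable = {gene for gene, n in counts.items() if n < total}
    return core, variable


# Hypothetical example data: three strains and four gene labels.
strains = {
    "strain_a": {"gene1", "gene2", "gene3"},
    "strain_b": {"gene1", "gene2", "gene4"},
    "strain_c": {"gene1", "gene2"},
}

core, variable = split_pangenome(strains)
print(sorted(core))      # ['gene1', 'gene2']
print(sorted(variable))  # ['gene3', 'gene4']
```

In practice, deciding which stretches of DNA in different individuals count as "the same gene" is itself a large part of the analytical challenge, which this sketch sidesteps.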

"We've been very happy with the way it's working," said Brendan Mumey, professor in the Gianforte School of Computing in MSU's Norm Asbjornson College of Engineering. He and Mudge are co-leading the project.

According to Mumey, previously available software struggled to analyze pangenomes even for relatively simple organisms such as the common yeast Saccharomyces cerevisiae, whose genome contains only 12 million of the DNA units known as base pairs. (By comparison, the human genome contains 3 billion base pairs.) Among the known strains of the yeast, minor genetic variations account for physical adaptations such as the ability of brewer's yeast to tolerate alcohol during the making of beer and wine.

"It's a classic 'big data' problem," Mumey said, referring to the field of computing that deals with exceptionally large and complex data sets.

MSU assistant professor of computer science Indika Kahanda, a member of the research team, specializes in developing the "machine learning" models that help the new software adjust its gene-sorting analysis according to input from scientists. That approach has helped the team, which includes NCGR research scientist Thiru Ramaraj, identify genes of interest in a yeast pangenome that includes roughly 100 strains. Ramaraj earned his doctorate in computer science in 2010 at MSU, where Mumey was his adviser.

Mumey said the researchers' next step is to continue to refine the software so it can handle larger and more complex genomes, such as those of plants. The computational techniques being used "are still in their infancy," he said.

Eventually, pangenomics could help medical professionals diagnose a variety of diseases that have a genetic component, Mudge said. Most inherited breast cancer can be traced to mutations in just two genes, but other genetic diseases are thought to stem from more complex changes across larger areas of DNA.

The improved pangenomics tool is already helping scientists break out of the mold of comparing genomes to a single, arbitrary reference, Mudge said. Instead, researchers can represent a species' entire genome with all its nuance and variety.

"It's a hard problem to solve," Mudge said. "This has been a great collaboration."