UF team develops AI tool to make genetic research more comprehensive

University of Florida researchers are working to close a critical gap in medical genetic research by ensuring it better represents and benefits people of all backgrounds.

Their work, led by Kiley Graim, Ph.D., an assistant professor in the Department of Computer & Information Science & Engineering, focuses on improving human health by addressing "ancestral bias" in genetic data, a problem that arises when most research is based on data from a single ancestral group. This bias limits advancements in precision medicine, Graim said, and leaves large portions of the global population underserved when it comes to disease treatment and prevention. 

To solve this, the team developed PhyloFrame, a machine-learning tool that accounts for ancestral diversity in genetic data. The work is supported by funding from the National Institutes of Health, and its goal is to improve how diseases are predicted, diagnosed, and treated for everyone, regardless of ancestry. A paper describing the PhyloFrame method, and how it showed marked improvements in precision medicine outcomes, was published Monday in Nature Communications.

Graim, a member of the UF Health Cancer Center, said her inspiration to focus on ancestral bias in genomic data evolved from a conversation with a doctor who was frustrated by a study's limited relevance to his diverse patient population. This encounter led her to explore how AI could help bridge the gap in genetic research. 

“I thought to myself, ‘I can fix that problem,’” said Graim, whose research centers on machine learning and precision medicine and who is trained in population genomics. “If our training data doesn’t match our real-world data, we have ways to deal with that using machine learning. They’re not perfect, but they can do a lot to address the issue.”

Drawing on the population genomics database gnomAD, PhyloFrame integrates massive databases of healthy human genomes with the smaller, disease-specific datasets used to train precision medicine models. The models it creates are better equipped to handle diverse genetic backgrounds. For example, PhyloFrame can predict the differences between subtypes of diseases like breast cancer and suggest the best treatment for each patient, regardless of ancestry.

Processing such massive amounts of data is no small feat. The team uses UF’s HiPerGator, one of the most powerful supercomputers in the country, to analyze genomic information from millions of people. For each person, that means processing 3 billion base pairs of DNA. 

“I didn’t think it would work as well as it did,” said Graim, noting that her doctoral student, Leslie Smith, contributed significantly to the study. “What started as a small project using a simple model to demonstrate the impact of incorporating population genomics data has evolved into securing funds to develop more sophisticated models and to refine how populations are defined.” 

What sets PhyloFrame apart is its ability to ensure predictions remain accurate across populations by considering genetic differences linked to ancestry. This is crucial because most current models are built using data that does not fully represent the world’s population. Much of the existing data comes from research hospitals and patients who trust the health care system. This means populations in small towns or those who distrust medical systems are often left out, making it harder to develop treatments that work well for everyone.  

Graim also estimated that 97% of sequenced samples come from people of European ancestry. That is due largely to national and state-level funding and priorities, but also to socioeconomic factors that snowball at different levels: insurance affects whether people get treated, for example, which in turn affects how likely they are to be sequenced.

“Some other countries, notably China and Japan, have recently been trying to close this gap, and so there is more data from these countries than there had been previously, but still nothing like the European data,” she said. “Poorer populations are generally excluded entirely.”

Thus, diversity in training data is essential, Graim said.  

“We want these models to work for any patient, not just the ones in our studies,” she said. “Having diverse training data makes models better for Europeans, too. Having the population genomics data helps prevent models from overfitting, which means that they’ll work better for everyone, including Europeans.”

Graim believes tools like PhyloFrame will eventually be used in the clinical setting, replacing traditional models to develop treatment plans tailored to individuals based on their genetic makeup. The team’s next steps include refining PhyloFrame and expanding its applications to more diseases. 

“My dream is to help advance precision medicine through this kind of machine learning method, so people can get diagnosed early and are treated with what works specifically for them and with the fewest side effects,” she said. “Getting the right treatment to the right person at the right time is what we’re striving for.”  

Graim’s project received funding from the UF College of Medicine Office of Research’s AI2 Datathon grant award, which is designed to help researchers and clinicians harness AI tools to improve human health.