Making sense of the new scientific data published every year — including well over a million cancer-related journal articles — is a tall order for the contemporary scientist.
Even if a scientist were capable of reading every article and memorizing its content, drawing connections to answer real-world questions would require supernatural cognition.
Figuring out how to actually read hundreds of thousands of scientific papers and apply their findings to real challenges, such as the treatment of cancer patients, is an arduous, uphill battle.
But an associate professor in the University of Arizona's School of Information, Clayton Morrison, is doing just that, one algorithm at a time.
He wonders, as many others in his field do, whether the solutions to big problems are already there, in existing data, but no one has yet been able to put the pieces together.
Morrison, the project's co-principal investigator, and a team of collaborators are using a research grant of more than $3.6 million to find out. Funded by the Defense Advanced Research Projects Agency, "REACH: Reading and Assembling Contextual and Holistic Mechanisms From Text" will create a computer system that reads papers, extracts information on biochemical pathways and plugs it all into large-scale, interactive models.
REACH researchers are laying the foundation for interactive software that would allow drug developers, or maybe even doctors, to supply detailed information, such as a patient's genome. The software could then model how a specific treatment would interact with that patient.
"They'll be the Microsofts and Googles of biomedicine," Morrison said.
Such software has mass appeal and big implications: fast, individualized and precise biomedical care.
"The REACH project is applied to cancer biology, but we have an even bigger vision than that, although cancer biology is big enough," Morrison said.
If big data is a two-part challenge, Morrison said, then storing it and moving it around is the first part. The second part is understanding it.
REACH works on the understanding part in three phases: extraction, assembly and inference.
Extraction was put to the test this summer. Over the course of a year, researchers led by Mihai Surdeanu, associate professor in the Computer Science Department and REACH's principal investigator, trained a computer system to read papers using hundreds of algorithms. One, for example, allows it to understand that "mouse," "mice" and "Mus musculus" all refer to the same thing.
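To make the "mouse" example concrete, here is a minimal Python sketch of one way a synonym-resolution step could work. It is only an illustration: the lookup table and the `normalize_species` function are hypothetical names invented for this example, and nothing here describes how REACH's own algorithms are actually built.

```python
from typing import Optional

# Hypothetical synonym table, assumed for illustration only.
# A real system would draw on a far larger vocabulary.
SPECIES_SYNONYMS = {
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def normalize_species(mention: str) -> Optional[str]:
    """Map a surface mention to one canonical species name.

    Returns None when the mention is not in the table.
    """
    key = mention.strip().lower()
    return SPECIES_SYNONYMS.get(key)


if __name__ == "__main__":
    for text in ["mouse", "Mice", "Mus musculus", "yeast"]:
        print(text, "->", normalize_species(text))
```

A plain dictionary lookup like this handles only exact matches after lowercasing; recognizing misspellings or new terms would take more than this sketch shows.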
Others on the UA research team include Ryan Gutenkunst, assistant professor of molecular and cellular biology; Guang Yao, assistant professor of molecular and cellular biology; and Kobus Barnard, professor of computer science.
Morrison, who also has a strong academic background in developmental psychology, said, "I think that collaborative computers are going to be like children, and we'll have to raise them, in a way. They’ll be as smart as we’re able to teach them, and we need them to be able to communicate with us."
In the recent evaluation of this first phase of REACH, the system was able to process 1,000 papers on RAS-related cancers in a matter of hours, yielding results that exceeded state-of-the-art predecessors — all by relying on algorithms. Asking a human scientist to do the same would be outrageous.
Focusing their efforts on modeling how RAS functions in cancer cells was an easy choice, for a couple of reasons.
First, RAS proteins control the chemical pathways responsible for growth, migration and survival within a cell. Basically, they've got a big job. Second, RAS oncogenes are mutated in 33 percent of all human cancers, making them one of the most heavily researched classes of oncogenes. And when you need thousands of papers on one subject, a heavily researched topic is essential.
Now that the REACH system knows how to read, it needs context. Morrison is currently building that in by teaching it to differentiate between species (a yeast cell is different from a mouse cell). REACH is already familiar with 30 different species affected by RAS-related cancers. It also will need to understand differences among cell types, organs and tissue types. This is all part of the project's assembly phase.
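As a rough illustration of what attaching context can mean, the Python sketch below tags each extracted interaction with the species mentioned most recently in the text. The "most recent mention" rule, the small vocabulary and every name in the code (`Interaction`, `find_species`, `attach_species`) are assumptions made for this example, not a description of the REACH team's method.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical vocabulary mapping surface words to a species label.
SPECIES_TERMS = {"mouse": "mouse", "mice": "mouse", "yeast": "yeast", "human": "human"}


@dataclass
class Interaction:
    description: str          # e.g. "RAS activates RAF"
    sentence_index: int       # sentence the interaction was extracted from
    species: Optional[str] = None


def find_species(sentence: str) -> Optional[str]:
    """Return the first species label mentioned in a sentence, if any."""
    cleaned = sentence.lower().replace(",", " ").replace(".", " ")
    for word in cleaned.split():
        if word in SPECIES_TERMS:
            return SPECIES_TERMS[word]
    return None


def attach_species(sentences: List[str], interactions: List[Interaction]) -> List[Interaction]:
    """Assumed heuristic: use the species mentioned most recently,
    at or before the sentence containing the interaction."""
    latest: List[Optional[str]] = []
    current: Optional[str] = None
    for sentence in sentences:
        found = find_species(sentence)
        if found is not None:
            current = found
        latest.append(current)
    for interaction in interactions:
        interaction.species = latest[interaction.sentence_index]
    return interactions


if __name__ == "__main__":
    text = [
        "We studied mice with RAS mutations.",
        "RAS activates RAF.",
        "In yeast, RAS regulates growth.",
    ]
    found = [Interaction("RAS activates RAF", 1), Interaction("RAS regulates growth", 2)]
    for item in attach_species(text, found):
        print(item.description, "| species:", item.species)
```

The same pattern could in principle be extended to cell types, organs and tissue types, though a single nearest-mention rule would likely be too crude for real papers.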
By the end of the four-year project, REACH should be able to make inferences. In other words, it will hypothesize much as a scientist or a doctor might.
"I would like to see this usher in computers understanding complex things at a level that we just can't," Morrison said.
"It's awesome. I can't tell you how excited and passionate I get that I'm able to take things I've developed and apply them to something that could potentially, directly improve peoples' lives."