There are no Erlenmeyer flasks in Christian Collberg’s lab, no Bunsen burners or centrifuges. But there is a laptop computer, the hardware that makes the research of Collberg and his University of Arizona colleague Todd Proebsting possible.
By contrast, what makes their research impossible is this: fellow scientists who are unwilling or unable to share their source code and data. Sharing that information allows computer scientists to repeat others’ experiments, an idea analogous to keeping log books in the wet sciences, such as biology.
After all, reproducibility is a cornerstone of the scientific process; in essence, it allows researchers to gain confidence in others’ work. What’s more, sharing research artifacts lets researchers build on others’ work, avoiding needless duplication of effort and advancing science, a process the two call benefaction.
After being unable to obtain code and data from a group of researchers, Collberg and Proebsting, both UA professors of computer science, wanted to learn more about how and when computer systems researchers share — or don’t share — their code and data. So the two launched a study to find out.
Collberg, Proebsting and an array of undergraduate and graduate students examined 601 peer-reviewed papers from Association for Computing Machinery conferences and journals. They tried to locate each paper’s source code through the paper itself, Web searches, source-code repositories and queries to the authors.
The researchers then looked at something they termed "weak repeatability rate" — that is, whether authors made available buildable source code or confirmed that the code was buildable. They found the weak repeatability rate fell between 32 and 54 percent.
Among other findings, Collberg and Proebsting saw no significant difference in repeatability between National Science Foundation-funded and non-NSF-funded research. They did find, however, that authors from industry had lower repeatability rates than authors from academia.
Also, they noted that authors’ published code doesn’t necessarily correspond to the version that was used to produce their results.
The results of their study are published in the March issue of Communications of the ACM, the flagship magazine of the Association for Computing Machinery.
With the results from their study in mind, Collberg and Proebsting formulated two "modest proposals" to improve sharing, repeatability and benefaction.
The first proposal would require authors, at the time of submission, to tell conference organizers or journal editors whether they plan to share their code and data. Reviewers could then weigh that answer when recommending a paper’s acceptance or rejection.
"In some ways, sharing your code and data seems redundant," Proebsting says. "You’re publishing your work. You’re sharing your work. You’re sharing your conclusions. You’re sharing what you did. But there’s this one other part: Science has always said you’re supposed to share your methods, and we’re taking that to the logical extreme."
The researchers’ second proposal would call on funding agencies to encourage researchers to request additional funds for repeatability. Computers, operating systems and commercial software change over time and need updating to keep working properly, and research software is no different.
However, professors and graduate students don’t have the time to fix old code or to help other researchers use it, so they need engineers whose full-time job is that kind of maintenance, Collberg says.
"But it’s probable that money given toward reproducibility actually pays for itself because if, for example, I get to build on your software, I’m not investing time rebuilding it myself," Proebsting says. "Perhaps not only may I build on it, but someone else may build on it, too. So while funding reproducibility may initially look like an expense, it may pay off fantastically well in the long run."