Syndicate will be powered by supercomputers such as Stampede at the Texas Advanced Computing Center, a partner institution of the iPlant Collaborative.

Between Netflix and Big Data

A recent NSF grant of $3.8 million will fund development of a general-purpose data storage platform, enabled by the iPlant Collaborative’s community of scientists, developers and educators.
Sept. 8, 2015

Creating lots of data in 2015 is rather easy.

Take, for example, a whole human genome, composed of roughly 3 billion DNA base pairs and 20,000 genes. Scientists began sequencing the first human genome in 1990. It took 13 years and $3 billion. Today, for less than $1,000 and in a matter of hours (not weeks, not months), it can be sequenced and stored as a gigabyte and a half of data that would fit on a compact disc the size of NSYNC's "No Strings Attached," the best-selling album of 2000.

Scientists amass unprecedented amounts of data in very little time, but they cannot always manage the data as efficiently as they produce it. Syndicate, a four-year big data project led by University of Arizona professor of computer science Larry Peterson, addresses the problem.

Funded by a $3.8 million National Science Foundation research grant, Syndicate will be a general-purpose storage platform for data, adding to services of the data management infrastructure developed by the UA-led iPlant Collaborative, an all-science computational platform also funded by NSF. Peterson and his team of collaborators hope to evoke a time when the scientist didn’t also have to be the data management expert. The iPlant Collaborative will provide the infrastructure to integrate Syndicate — and the user community to pilot test the platform in its array of potential uses.

The conversation is no longer about whether scientists can turn out big data. It's about how it can be managed.

In order to build on each other's research, scientists must be able to share their data, and this does happen. But not always fast. Sending hundreds of terabytes from far-away origin servers (say, from Tucson to Beijing) can take so much time that the data becomes stale as it's passed from one research lab to another.

"If you're dealing with large datasets, the data changes. Computations happen," Peterson said.

Syndicate aims to make sharing faster, so scientists will receive only the freshest version of a dataset. The ability to more easily store and manage large amounts of data with a platform such as Syndicate will in turn make collaboration among scientists easier.

"We're trying to wean scientists off having their own local hardware, and help them tap into resources that are worldwide," Peterson said.

Slow-going data transfer is only part of the problem; currently, managing a large dataset also requires significant user involvement. Syndicate will address this, as it is designed for self-management. For example, users no longer will have to manually and individually dole out passkeys. 

As it stands, according to Peterson, "Privacy can sometimes be a nightmare."

The goal is to be minimally disruptive in the process, by creating a system that utilizes many of the same cloud storage services scientists already use, such as Google Drive and Dropbox.

The crown jewel of Syndicate's design is the same technology Netflix and Amazon Prime use to deliver television episodes and films: content distribution networks, or CDNs. Using CDNs, Syndicate will pull large datasets from an origin server and place copies all over the globe. This way, the scientist in Beijing will not have to wait a month for data from an origin server in Tucson, because it also will be hosted somewhere closer, such as Tokyo.

Essentially, CDNs don't move big data faster, but they bring it closer so it is received sooner. 
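To make the "closer, not faster" idea concrete, here is a minimal Python sketch of how a client might pick the nearest copy of a dataset. It is an illustration only, not Syndicate's actual mechanism: the site list, the coordinates, and the function names `great_circle_km` and `nearest_replica` are all assumptions, and real CDNs typically choose a server based on network measurements rather than raw geographic distance.

```python
import math

# Hypothetical replica sites as (latitude, longitude) in degrees; illustrative only.
REPLICAS = {
    "Tucson": (32.22, -110.97),  # origin server in the article's example
    "Tokyo": (35.68, 139.69),    # a closer copy placed by the CDN
}

def great_circle_km(a, b):
    """Haversine distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_replica(client, replicas=REPLICAS):
    """Return the name of the replica site closest to the client."""
    return min(replicas, key=lambda name: great_circle_km(client, replicas[name]))

beijing = (39.90, 116.40)
print(nearest_replica(beijing))  # prints "Tokyo"
```

For a client in Beijing, the Tokyo copy is roughly 2,000 kilometers away, while the Tucson origin is more than 10,000, so the request goes to Tokyo.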

"CDNs are really common for video but they haven't been used a lot for big data," Peterson said.

Why?

"They're a challenge," he said. "Today, CDNs are typically used for (files) that don't change."

His team will have to integrate an element that allows the data to change due to computation — no small feat.
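One common way to serve changing data from a nearby cache is to check a small version number before handing out a stored copy. The Python sketch below illustrates that general idea under stated assumptions. The article does not describe how Syndicate handles it, and the `Origin` and `EdgeCache` classes are hypothetical names invented for this example.

```python
class Origin:
    """Hypothetical origin server that bumps a version number on every write."""

    def __init__(self, data):
        self.data = data
        self.version = 1
        self.full_fetches = 0  # counts expensive whole-dataset transfers

    def write(self, data):
        # A computation changed the dataset: store it and advance the version.
        self.data = data
        self.version += 1

    def current_version(self):
        # Cheap metadata lookup; no dataset bytes are transferred.
        return self.version

    def fetch(self):
        # Expensive: ships the whole dataset along with its version.
        self.full_fetches += 1
        return self.version, self.data


class EdgeCache:
    """Hypothetical nearby cache that revalidates its copy before serving it."""

    def __init__(self, origin):
        self.origin = origin
        self.version = None
        self.data = None

    def read(self):
        # Refetch only when the cached version differs from the origin's.
        if self.version != self.origin.current_version():
            self.version, self.data = self.origin.fetch()
        return self.data


origin = Origin("results-v1")
edge = EdgeCache(origin)
edge.read()                  # first read pulls the dataset from the origin
edge.read()                  # second read is served from the cache
origin.write("results-v2")   # the data changes due to computation
print(edge.read(), origin.full_fetches)
```

In this sketch, repeated reads cost only a small version check, and the full dataset crosses the long-haul link again only after it has actually changed.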

Peterson is hoping to deploy a pilot version of the Syndicate platform by the end of fall. The iPlant Collaborative will provide the community of scientists, developers, and educators necessary to ensure the platform is capable of translational use. Additional pilot users will include the M-Lab Consortium, of which Peterson is a founding member, and scientists who will house data from a clinical study in a Syndicate cloud.

The project brings together collaborators from all across the UA campus, including Nirav Merchant of the iPlant Collaborative and Arizona Research Laboratories; John Hartman of UA computer science; Anita Bhappu of the Norton School of Family and Consumer Sciences; and Bonnie Hurwitz of the College of Agriculture and Life Sciences, creator of the iMicrobe project, an affiliate of iPlant.