MIT News


Big Data Without the Liabilities

Data scientists find that artificial data give the same results as real data without compromising privacy

Published: Thursday, March 16, 2017 - 12:02

Although data scientists can gain great insights from large data sets—and can ultimately use these insights to tackle major challenges—accomplishing this is much easier said than done. Many such efforts are stymied from the outset, as privacy concerns make it difficult for scientists to access the data they would like to work with.

The paper “The Synthetic Data Vault,” presented at the IEEE International Conference on Data Science and Advanced Analytics, describes a system that builds machine-learning models from real databases in order to create artificial, or synthetic, data. Synthetic data can be used to develop and test algorithms and models, enabling science efforts that might otherwise be thwarted by a lack of real data.

“Once we model an entire database, we can sample and recreate a synthetic version of the data that very much looks like the original database, statistically speaking,” says Kalyan Veeramachaneni, principal research scientist at MIT Laboratory for Information and Decision Systems (LIDS). “If the original database has some missing values and some noise in it, we also embed that noise in the synthetic version.... In a way, we are using machine learning to enable machine learning.”
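The core idea can be illustrated with a toy sketch: fit a distribution to the rows of a real table, then sample new rows from the fitted model. The two-column table and the single-Gaussian model below are illustrative assumptions for brevity; the actual Synthetic Data Vault models far richer structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "real" table with two numeric columns (e.g., amount, quantity);
# in practice this would come from a production database.
real = rng.multivariate_normal([50.0, 3.0], [[100.0, 5.0], [5.0, 1.0]], size=1000)

# Model the table: here, a single multivariate Gaussian fit to the rows.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a synthetic table from the fitted model: statistically similar
# to the original, but no row corresponds to a real record.
synthetic = rng.multivariate_normal(mu, cov, size=1000)
```

The synthetic table's column means and covariances closely match the original's, which is what "looks like the original database, statistically speaking" amounts to in this miniature setting.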

The algorithm, called “recursive conditional parameter aggregation,” exploits the hierarchical organization common to relational databases. For example, it can take a customer-transactions table and build a multivariate model for each customer based on that customer’s transactions.


This model captures correlations between multiple fields within those transactions, for example the purchase amount and type, the time at which the transaction took place, and so on. After the algorithm has modeled and assembled parameters for each customer, it forms a multivariate model of these parameters themselves, and so recursively models the entire database. Once a model is learned, it can synthesize an entire database filled with artificial data.
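The two-level recursion described above can be sketched as follows. This is a simplified illustration under invented assumptions (Gaussian models, mean/standard-deviation summaries, made-up field names), not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy customer-transactions table: 20 customers, 30 transactions each,
# with two fields per transaction (purchase amount, hour of day).
transactions = {c: rng.normal([40.0 + c, 12.0], [5.0, 3.0], size=(30, 2))
                for c in range(20)}

# Level 1: collapse each customer's transactions into a parameter vector
# (per-field mean and standard deviation).
params = np.array([np.concatenate([t.mean(axis=0), t.std(axis=0)])
                   for t in transactions.values()])

# Level 2: model the parameter vectors themselves with a multivariate Gaussian.
p_mu = params.mean(axis=0)
p_cov = np.cov(params, rowvar=False)

# To synthesize: draw parameters for a new, artificial customer, then
# generate that customer's transactions from those sampled parameters.
new_params = rng.multivariate_normal(p_mu, p_cov)
synth_transactions = rng.normal(new_params[:2], np.abs(new_params[2:]),
                                size=(30, 2))
```

Sampling at the top level first and then conditioning downward is what lets the model emit entire artificial customers, not just shuffled copies of real ones.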

Outcome and impact

After building the Synthetic Data Vault, the team used it to generate synthetic data for five publicly available datasets. They then hired 39 freelance data scientists, working in four groups, to develop predictive models in a crowdsourced experiment. The question they wanted to answer was: “Is there any difference between the work of data scientists given synthesized data and those with access to real data?” To test this, one group was given the original datasets, while the other three were given the synthetic versions. Each group used its data to solve a predictive modeling problem, eventually conducting 15 tests across the five datasets. When the solutions were compared, those built by the group using real data and those built by the groups using synthetic data showed no significant performance difference in 11 of the 15 tests (roughly 73 percent of the time).
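A miniature version of this comparison can be run in a few lines: fit the same predictive model once on real data and once on a synthetic stand-in, and check that the two fits agree. The data, the linear model, and the Gaussian synthesizer here are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Real" data with a known linear relationship, y ≈ 2x plus noise.
x = rng.normal(0.0, 1.0, 500)
y = 2.0 * x + rng.normal(0.0, 0.5, 500)
real = np.column_stack([x, y])

# Synthetic stand-in: sample from a Gaussian fit to the real joint distribution.
synthetic = rng.multivariate_normal(real.mean(axis=0),
                                    np.cov(real, rowvar=False), size=500)

def fit_slope(data):
    """Least-squares slope of y on x for a two-column (x, y) array."""
    X = np.column_stack([data[:, 0], np.ones(len(data))])
    return np.linalg.lstsq(X, data[:, 1], rcond=None)[0][0]

slope_real = fit_slope(real)
slope_synth = fit_slope(synthetic)
# Both fits recover approximately the same slope (about 2.0 here),
# mirroring the "no significant difference" finding at toy scale.
```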

These results suggest that synthetic data can successfully stand in for real data in software development and testing, meaning that data scientists can use it to overcome a massive barrier to entry. “Using synthetic data gets rid of the ‘privacy bottleneck’ so work can get started,” says Veeramachaneni.

This has implications for data science across a spectrum of industries. Besides enabling work to begin, synthetic data will allow data scientists to continue ongoing work without involving real, potentially sensitive data.

“Companies can now take their data warehouses or databases and create synthetic versions of them,” says Veeramachaneni. “So they can circumvent the problems currently faced by companies like Uber, and enable their data scientists to continue to design and test approaches without breaching the privacy of the real people—including their friends and family—who are using their services.”

In addition, the machine-learning model from Veeramachaneni and his team can easily be scaled to create very small or very large synthetic datasets, facilitating rapid development cycles or stress tests for big-data systems. Artificial data are also a valuable tool for educating students: real data are often too sensitive for them to work with, but synthetic data can be used effectively in their place. This innovation can allow the next generation of data scientists to enjoy all the benefits of big data without any of the liabilities.
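The scaling point follows directly from the generative setup: once a model is fit, it can be sampled at whatever size a task requires. A minimal sketch, again assuming a simple Gaussian model over an invented two-column table:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fit a model once on a modest "real" table...
real = rng.multivariate_normal([10.0, 5.0], [[4.0, 1.0], [1.0, 2.0]], size=200)
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)

# ...then sample synthetic tables at whatever scale the task needs.
tiny = rng.multivariate_normal(mu, cov, size=10)        # e.g., a classroom example
large = rng.multivariate_normal(mu, cov, size=100_000)  # e.g., a stress test
```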


About The Author


MIT News

MIT News is the Massachusetts Institute of Technology’s central hub for news about MIT research, initiatives, and events. It reports MIT news directly and works with journalists around the world to help showcase the achievements of the institute’s students, faculty, and staff.