Adam Conner-Simons

Statistics

Less Scatterbrained Scatterplots

An open-source system makes it possible to create interactive scatterplots of large datasets

Published: Wednesday, February 10, 2021 - 12:03

This story was originally published by MIT Computer Science & Artificial Intelligence Lab (CSAIL).

Scatterplots. You may not know them by name, but if you spend more than 10 minutes online, you’ll find them everywhere.

They’re popular in news articles, in the data science community, and perhaps most crucially, for internet memes about the digestive quality of pancakes.

By depicting data as a mass of points across two axes, scatterplots are effective in visualizing trends, correlations, and anomalies. But using them for large datasets often leads to overlapping dots that make them more or less unreadable.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) say they’ve solved this with a new open-source system that makes it possible to create interactive scatterplots based on large-scale datasets that have upward of billions of distinct data points.

Called “Kyrix-S,” the system has an interface that allows users to pan, zoom, and jump around a scatterplot as if they were looking at directions on Google Maps. Whereas other systems developed for large datasets often focus on very specific applications, Kyrix-S is generalizable enough to work for a wide range of visualization styles, including heatmaps, pie charts, and radar-style graphics. (The team showed that the system allows users to create visualizations with 800 times less code than a similar state-of-the-art authoring system.)

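Users can produce a scatterplot by writing just a few dozen lines of JSON, a human-readable text format, which the system then turns into an interactive visualization.

[Figures not reproduced: an example JSON specification and the scatterplot generated from it.]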

Lead developer Wenbo Tao, a Ph.D. student at MIT CSAIL, points to a static scatterplot from The New York Times (not reproduced here) as an example of a chart that he says would be improved by being made interactive via a system like Kyrix-S.

“In these scatterplots, you are able to see overall trends and outliers, but the overplotting and the static nature of the plot limit the user’s ability to interact with the chart,” says Tao.

In contrast, Kyrix-S can produce a version (not reproduced here) that spreads the data across several zoom levels, so that users can interact with each county. To avoid overplotting, the Kyrix-S scatterplot also shows only the most important examples, such as the most populous counties.
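
One way to picture the zoom-level idea is a simple grid-based selection: at each zoom level the plot is divided into cells, and only the highest-ranked point in each cell is drawn, so finer grids at deeper zoom levels reveal more points. The Python sketch below illustrates that general technique only. It is not Kyrix-S's actual algorithm, and the function name `select_for_zoom`, its parameters, and the four sample "counties" with made-up populations are all assumptions introduced for illustration.

```python
from collections import defaultdict


def select_for_zoom(points, zoom, per_cell=1, base_cells=2):
    """Keep only the highest-weight points in each grid cell at a zoom level.

    points: iterable of (x, y, weight, label), with x and y in [0, 1).
    zoom:   0 is the most zoomed-out level; each level doubles the
            number of grid cells along each axis.
    Hypothetical helper for illustration, not part of Kyrix-S.
    """
    cells_per_axis = base_cells * (2 ** zoom)
    buckets = defaultdict(list)
    for x, y, weight, label in points:
        # Assign each point to the grid cell that contains it.
        cell = (int(x * cells_per_axis), int(y * cells_per_axis))
        buckets[cell].append((x, y, weight, label))

    kept = []
    for cell_points in buckets.values():
        # Rank the points in a cell by weight and keep the top few.
        cell_points.sort(key=lambda p: p[2], reverse=True)
        kept.extend(cell_points[:per_cell])
    return kept


# Made-up sample data: (x, y, population, name).
counties = [
    (0.10, 0.10, 900, "A"),
    (0.20, 0.20, 500, "B"),
    (0.40, 0.40, 100, "C"),
    (0.80, 0.80, 700, "D"),
]

for zoom in range(3):
    visible = select_for_zoom(counties, zoom)
    print(zoom, sorted(label for *_, label in visible))
```

With this sample data, the most zoomed-out level shows only A and D, the next level adds C, and the deepest level shows all four points. Crowded regions are thinned to their most populous entries until the user zooms in.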

“As a visualization researcher, I am constantly at the edge of data sizes that are possible to visualize, which forces me to summarize or partition my data to get any insights,” says Kristi Potter, a data visualization scientist at the National Renewable Energy Laboratory who was not involved in the research. “With Kyrix-S, it’s possible to use all of the data, providing much more confidence in visualizations of large-scale data.”

Kyrix-S is currently being used by Data Civilizer 2.0, a data integration platform developed at MIT. An earlier version was also employed to help Massachusetts General Hospital analyze a massive brain activity dataset (EEG) that clocks in at 30 terabytes—the equivalent of more than 50,000 hours of digital music. (The goal of that study was to train a model that predicts seizures, given a series of two-second EEG segments.)

Moving forward, the researchers will be adapting Kyrix-S to work as part of a graphical user interface. They also plan to add functionality so that the system can handle data that are being continuously updated.

Reprinted with permission of MIT CSAIL News.

About The Author

Adam Conner-Simons

Adam Conner-Simons is a communications professional, consultant, and content creator who has 15+ years of experience in journalism and public relations. He oversees communications and public relations for MIT’s largest research lab, the Computer Science and Artificial Intelligence Lab (CSAIL), leading all efforts related to media outreach, digital strategy, social media, and web content, as well as speechwriting and translating difficult concepts for general audiences. He regularly speaks about communications and media relations at conferences. As a freelance writer, he contributes regularly to outlets such as The New York Times, Slate Magazine, and The Boston Globe.

Comments

Great article (did I find a typo?)

Hi!

Very interesting and useful article. 

In the following line, I think the "800" could be a typo because 100% is the largest reduction possible from an original quantity.

"The team showed that the system allows users to create visualizations with 800-percent less code compared to a similar state-of-the-art authoring system."

Thank you.

Nice catch

Nice catch, Eunice! Checking with the author, but I believe what he meant is "800 times less": "times," not "percent."