Statistics Article

By: David Darais, Joseph Near

In our last article, we discussed how to determine how many people drink pumpkin spice lattes in a given time period without learning their identifying information. But say, for example, you would like to know the total amount spent on pumpkin spice lattes this year, or the average price of a pumpkin spice latte since 2010. To protect customers’ privacy, you’d like to detect these trends without being able to learn identifying information about any specific customer. To do this, you can use summation and average queries answered with differential privacy.

In this article, we will move beyond counting queries and dive into answering summation and average queries with differential privacy. Starting with the basics: In SQL, summation and average queries are specified using the SUM and AVG aggregation functions:

SELECT SUM(price) FROM PumpkinSpiceLatteSales WHERE year = 2020
SELECT AVG(price) FROM PumpkinSpiceLatteSales WHERE year > 2010

In Pandas, these queries can be expressed using the sum() and mean() functions, respectively. But how would we run these queries while also guaranteeing differential privacy?
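One common way to answer such queries privately, sketched here as an assumption rather than as the authors' own method, is to clip each value to a fixed range and add Laplace noise scaled to that range. The function names, the clipping bounds, and the even split of the privacy budget between sum and count are all illustrative choices, and the sketch uses only the Python standard library instead of Pandas.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sum(values, lower, upper, epsilon, rng):
    """Return a noisy sum of the values after clipping them to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one row changes the clipped sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


def dp_count(values, epsilon, rng):
    """Return a noisy row count; one row changes the count by at most 1."""
    return len(values) + laplace_noise(1.0 / epsilon, rng)


def dp_mean(values, lower, upper, epsilon, rng):
    """Return a noisy average: noisy sum over noisy count, half the budget each."""
    noisy_sum = dp_sum(values, lower, upper, epsilon / 2, rng)
    noisy_count = dp_count(values, epsilon / 2, rng)
    # Clamping the denominator is post-processing and does not affect privacy.
    return noisy_sum / max(noisy_count, 1.0)


# Hypothetical prices; the bounds 0.0 and 10.0 are an assumed price range.
prices = [4.25, 4.75, 5.25, 4.95]
rng = random.Random(0)
print(dp_sum(prices, 0.0, 10.0, 1.0, rng))
print(dp_mean(prices, 0.0, 10.0, 1.0, rng))
```

The clipping bounds matter: a wider range protects against losing large values but adds proportionally more noise to every answer.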

By: David Darais, Joseph Near

How many people drink pumpkin spice lattes in October, and how would you calculate this without learning specifically who is drinking them, and who is not?

Although they seem simple or trivial, counting queries are used extremely often. Counting queries such as histograms can express many useful business metrics. How many transactions took place last week? How did this compare to the previous week? Which market has produced the most sales? In fact, one paper showed that more than half of queries written at Uber in 2016 were counting queries.

Counting queries are often the basis for more complicated analyses, too. For example, the U.S. Census releases data that are constructed essentially by issuing many counting queries over sensitive raw data collected from residents. Each of these queries belongs in the class of counting queries we will discuss below and computes the number of people living in the United States with a particular set of properties (e.g., living in a certain geographic area, having a particular income, belonging to a particular demographic).

By: Donald J. Wheeler

Inspection sounds simple. Screen out the bad stuff and ship the good stuff. However, measurement error will always create problems of misclassification where good stuff is rejected, and bad stuff gets shipped. While guard-bands and tightened inspection have been offered as a way to remedy the problem of shipping bad stuff, it turns out that they are often prohibitively expensive in practice. Here we look at how tightened inspection improves the quality of the product stream and compare those improvements with the associated excess costs.

The problem of inspection

A product measurement, X, may be thought of as consisting of the product value, Y, plus some measurement error, E, so that X = Y + E. With this model, the relationship between X and Y can be shown using a bivariate normal probability model where:

By: Scott A. Hindle

A quick Google search returns many instances of the saying, “A man with a watch knows what time it is. A man with two watches is never sure.” The doubt implied by this saying extends to manufacturing plants: If you measure a product on two (supposedly identical) devices, and one measurement is in specification and the other out of specification, which is right?

The aforementioned doubt also extends to healthcare, where measurement data abound. As part of the management of asthma, I measure my peak expiratory flow rate (discussed below), and I now have two handheld peak flow meter devices. Are the two devices similar or dissimilar? How would I know? To see how I investigated this, and to see the outcome, read on. A postscript is included for those wanting to dig a bit deeper.


In 2015, I was diagnosed with asthma, a chronic condition where the airways in the lungs can narrow and swell, making breathing more difficult. The worst of it occurred at my in-laws’ home, where I experienced wheezing and had difficulty breathing. The cause? The family cat!

By: William A. Levinson

Traditional statistical methods for computing the process performance index (Ppk) and control limits for process-control purposes assume that measurements are available for all items or parts. If, however, the critical-to-quality (CTQ) characteristic is something undesirable, such as a trace impurity, trace contaminant, or pollutant, the instrument or gauge may have a lower detection limit (LDL) below which it cannot measure the characteristic. When this is the case, a measurement of zero or “not detected” does not mean zero; it means that the measurement is somewhere between zero and the LDL.

If the statistical distribution is known and is unlikely to be a normal (i.e., bell curve) distribution, we can nonetheless fit the distribution’s parameters by means of maximum likelihood estimation (MLE). This is how Statgraphics handles censored data, i.e., data sets for which not all of the measurements are available, and the process can even be done with Microsoft Excel’s Solver feature. Goodness-of-fit tests can be performed to test the distributional fit, whereupon we can calculate the process performance index Ppk and set up a control chart for the characteristic in question.
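As a rough illustration of fitting by maximum likelihood when some results fall below the LDL, the sketch below assumes an exponential distribution, which is a choice made here for simplicity and not one named in the article. Each detected value contributes its density to the likelihood, and each non-detect contributes the probability of falling below the LDL. The function name and the sample data are hypothetical.

```python
import math


def censored_exponential_mle(detected, n_censored, ldl, iterations=200):
    """Estimate the exponential rate from detected values and a count of non-detects.

    detected   -- measurements at or above the lower detection limit
    n_censored -- number of results reported as "not detected"
    ldl        -- the lower detection limit
    """
    if not detected:
        raise ValueError("at least one detected measurement is required")

    def log_likelihood(rate):
        # Detected values contribute the log density: log(rate) - rate * x.
        ll = sum(math.log(rate) - rate * x for x in detected)
        if n_censored:
            # A non-detect contributes log P(X < LDL) = log(1 - exp(-rate * LDL)).
            ll += n_censored * math.log(-math.expm1(-rate * ldl))
        return ll

    # The log-likelihood is concave in the rate, and its maximum lies between
    # the uncensored estimate and (total count) / (sum of detected values).
    total = sum(detected)
    lo = len(detected) / total
    hi = (len(detected) + n_censored) / total

    # Ternary search for the maximum of a concave function on [lo, hi].
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1) < log_likelihood(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Hypothetical data: three detected impurity readings and two non-detects.
rate = censored_exponential_mle([1.0, 2.0, 3.0], n_censored=2, ldl=0.5)
print(rate)
```

Treating the non-detects as zeros or dropping them would bias the fit; the censored likelihood uses exactly the information they carry, namely that the true value lies below the LDL.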

By: Adam Conner-Simons

This story was originally published by MIT Computer Science & Artificial Intelligence Lab (CSAIL).

Scatterplots. You may not know them by name, but if you spend more than 10 minutes online, you’ll find them everywhere.

They’re popular in news articles, in the data science community, and perhaps most crucially, for internet memes about the digestive quality of pancakes.

By: Jay Arthur—The KnowWare Man

There are two ways to increase profits: increase sales or reduce costs. Although most data analysis seeks to find more ways to sell more stuff to more people, addressing preventable problems is an often overlooked opportunity. Preventable problems consume a third or more of corporate expenses and profits.

Data analysis can pinpoint problems and eliminate them forever. Problem solving with data is a much more reliable and controllable way to cut costs and increase profits. Sadly, few people know how to do this consistently.

How do you solve operational problems with a 100-percent success rate? Take out the guesswork. The vast majority of improvement projects involve reducing or eliminating defects, mistakes, and errors. If you have raw data about when the defect occurred, where it happened, and what type of defect it was, you can create a world-class improvement project that eliminates the guesswork. And you can do it using a tool you most likely already have: Microsoft Excel.

By: Matthew Bundy

Burning plastic cart carrying a fax machine, a laptop computer, and a three-ring binder. Credit: FCD/NIST

Several centuries ago, scientists discovered oxygen while experimenting with combustion and flames. One scientist called it “fire air.” Today, at the National Institute of Standards and Technology (NIST), we continue to measure oxygen to study the behavior of fires.

By: Douglas Allen

Any number derived from real observation is made up of three components. The first of these is the intended signal, the “perfect” value from the object being observed. The second is error (or noise) caused by environmental disturbance and/or interference. The third is bias, a regular and consistent deviation from the perfect value.

O = S + N + B, or observation equals signal plus noise plus bias

The signal usually is predictably constant, as is the bias. Identifying and eliminating bias requires a set of techniques beyond the scope of this article, so from here on we will treat the bias as part of the signal, leaving a somewhat simpler equation for our observation.

O = S + N, or observation equals signal plus noise

This article focuses on removing the random noise component from the observation and leaving the signal component. The noise is in the form of chance variation, which sometimes enhances the signal and sometimes detracts from it. If we could separate the noise from the signal and eliminate it, our observation would be pure signal, or a precise and consistent value.
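The O = S + N model can be made concrete with a small simulation. The sketch below assumes zero-mean Gaussian noise and uses plain averaging of repeated observations to recover the signal; averaging is one common approach chosen here for illustration and is not necessarily the technique the article goes on to describe. All names and values are hypothetical.

```python
import random
import statistics


def observe(signal, noise_sd, rng):
    """Return one observation O = S + N, with N drawn from a zero-mean Gaussian."""
    return signal + rng.gauss(0.0, noise_sd)


def estimate_signal(observations):
    """Estimate the signal as the mean of repeated observations."""
    return statistics.fmean(observations)


rng = random.Random(42)
true_signal = 10.0  # assumed constant signal
readings = [observe(true_signal, 1.0, rng) for _ in range(10_000)]

estimate = estimate_signal(readings)
print(f"single reading error: {abs(readings[0] - true_signal):.4f}")
print(f"averaged error:       {abs(estimate - true_signal):.4f}")
```

Because chance variation sometimes adds to the signal and sometimes subtracts from it, the noise terms tend to cancel in the average, and the standard error shrinks in proportion to the square root of the number of observations.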

By: Anthony D. Burns

Augmented reality (AR) means adding objects, animations, or information that don’t really exist to the real world. The idea is that the real world is augmented (or overlaid) with computer-generated material, ideally for some useful purpose.

Augmented reality has been around for about 30 years. But it’s only during the last five years or so that it has been widely used on mobile devices. If you have wondered why your new iPhone 12 has a LiDAR depth sensor, the answer is, in part, for augmented reality. Almost all modern phones now have depth sensors for AR. LiDAR makes depth sensing more accurate.

Unlike virtual reality (VR), AR on mobiles requires no special equipment. There’s no need for headsets or handheld devices. All you need is your mobile phone.

More than fun and games

Although games are probably the most notable use of AR on mobiles (Pokémon Go is a good example), there are business and training applications as well. Perhaps the simplest AR business application is labeling real-world objects. Google Maps, for example, recently launched Live View, adding real-world labeling of objects and directions via the mobile phone’s camera. Real-world objects, when viewed through the mobile phone, can show added text, objects, or 3D animations. Live View has all of these.