Counting Queries: Extracting Key Business Metrics From Datasets

How to achieve differential privacy with common counting queries
Published: Wednesday, April 21, 2021 - 12:03

How many people drink pumpkin spice lattes in October, and how would you calculate this without learning specifically who is drinking them, and who is not? Although they may seem simple or even trivial, counting queries are used extremely often. Counting queries such as histograms can express many useful business metrics. How many transactions took place last week? How did this compare to the previous week? Which market has produced the most sales? In fact, one paper showed that more than half of the queries written at Uber in 2016 were counting queries.

Counting queries are often the basis for more complicated analyses, too. For example, the U.S. Census releases data that are constructed essentially by issuing many counting queries over sensitive raw data collected from residents. Each of these queries belongs in the class of counting queries we will discuss below and computes the number of people living in the United States with a particular set of properties (e.g., living in a certain geographic area, having a particular income, belonging to a particular demographic).

Defining counting queries

Counting queries over tabular data are the simplest kind of query to answer with differential privacy, and they have the support of a large amount of research. Informally, counting queries have the form: “How many rows in the database have the property X?” For example, each row could correspond to a survey respondent, and the property X could be “replied yes to the survey.” As we mentioned above, this simple form can be extended to fit a number of common use cases when analyzing data. Our treatment of counting queries requires some background knowledge of tabular data and database queries, ideally the basics of either SQL or Pandas.

In SQL, counting queries look like the following:
SELECT COUNT(*)
FROM T
WHERE P1
AND P2
AND P3
...
GROUP BY A1, A2, ...

Here, T is a database table we want to search, P1... are the desired properties of the rows to be counted, and A1... are the columns of the data to group by. The number and complexity of the properties in the query do not affect the difficulty of answering the query with differential privacy (as long as the properties don’t depend on other rows in the database). When the list A1... is non-empty, the query is called a histogram.

In Pandas, the same class of queries can be expressed using a series of filters over a data frame, followed by the use of len or shape:

df_1 = df[P1]
df_2 = df_1[P2]
df_3 = df_2[P3]
...
df_n.groupby([A1, ...]).size()

The class of simple counting queries excludes queries that perform other operations (e.g., SUM and AVG), and queries with joins and semijoins (more on these later in the series).
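To make this concrete, here is a minimal, hypothetical example of both forms in Pandas; the survey data and column names are invented for illustration:

import pandas as pd

# Invented survey data for illustration.
df = pd.DataFrame({
    "replied_yes": [True, True, False, True, False],
    "region": ["North", "South", "North", "North", "South"],
})

# "How many rows have the property X?" as a simple count.
count = len(df[df["replied_yes"]])  # 3

# Grouping the same filter by a column yields a histogram.
hist = df[df["replied_yes"]].groupby("region").size()  # North: 2, South: 1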
Achieving differential privacy

To achieve differential privacy for counting queries, we add noise to the query’s final answer. To correctly satisfy the definition of differential privacy, we scale the noise to the sensitivity of the query. In general, determining the sensitivity of an arbitrary query is not an easy task. However, determining the sensitivity of counting queries is essentially a solved problem, and what follows is a summary of the general solution.

As described in our first post, we add noise drawn from the Laplace distribution to the answer of a counting query to achieve differential privacy. The noise needs to be scaled to the sensitivity of the query. Sensitivity measures how much the answer changes based on a single user’s contribution of data. For counting queries, this value is always 1: The final count can only change by 1 when a single user’s data are added or removed. Crucially, this argument holds no matter what the property is or which columns are grouped.

As a rule of thumb, counting queries have a sensitivity of 1, so we can easily determine the scale of the noise required. The formula, as Python code, is:

len(df[P1]) + np.random.laplace(loc=0, scale=1/ε)

For counting queries, this is all that’s required to achieve differential privacy! In fact, some of the software tools described below take exactly this approach.

A common pitfall to avoid when using differential privacy: Ensure the signal is significantly larger than the noise to achieve useful results. Differential privacy works extremely well for counting large groups (e.g., residents of New York), but as the group size gets smaller, accuracy degrades. In the worst case, a group of size 1, the “signal” (i.e., the exact count) and the noise (the added Laplacian noise) have the same scale. This is exactly what we want in order to protect privacy: a query that examines a single individual would definitely violate that individual’s privacy, so it should return a useless result.
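Here is a minimal sketch of this behavior, assuming NumPy; the group sizes are invented for illustration:

import numpy as np

epsilon = 0.1

def dp_count(true_count, epsilon):
    # Laplace noise scaled to the counting-query sensitivity of 1.
    return true_count + np.random.laplace(loc=0, scale=1/epsilon)

# Large group: the noise is negligible relative to the signal.
print(dp_count(100000, epsilon))  # e.g., 100003.71, a tiny relative error

# Group of size 1: the noise has the same scale as the signal.
print(dp_count(1, epsilon))       # e.g., -12.44, a useless answer (by design)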
Software tools

Let’s look at two of the more actively supported and accessible tools for answering counting queries with differential privacy. The first, Diffprivlib, is best suited for datasets that fit in memory; queries are written as Python programs. The second, Google’s differential privacy library, integrates with PostgreSQL to support larger datasets; queries are written in SQL. Both tools can be found in NIST’s Privacy Engineering Collaboration Space, where you are invited to share feedback or use cases related to these tools.

Differentially private counts in Python: Diffprivlib

IBM recently released an open-source Python library for enforcing differential privacy, called Diffprivlib. Diffprivlib integrates with other commonly used Python libraries like NumPy and scikit-learn, and makes it easy for Python programmers to add differential privacy to their programs. For example, the following Python code constructs a differentially private histogram with an ε value of 0.1:
from diffprivlib import tools as dp

dp_hist, _ = dp.histogram(data, epsilon=0.1)
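To get a feel for the noise being added, you can compare the result against NumPy’s exact histogram on the same data. A small sketch, assuming synthetic stand-in data and Diffprivlib’s NumPy-style bins and range parameters:

import numpy as np

data = np.random.normal(loc=0, scale=1, size=10000)  # invented example data

dp_hist, _ = dp.histogram(data, epsilon=0.1, bins=10, range=(-4, 4))
true_hist, _ = np.histogram(data, bins=10, range=(-4, 4))

print(true_hist)  # exact counts
print(dp_hist)    # noisy counts: close for well-populated bins, far off for sparse ones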
Diffprivlib is suitable for differentially private analyses on any data that will easily fit in memory. Diffprivlib relies on libraries like NumPy to perform the underlying query tasks, and these libraries are not well-suited to large-scale data that don’t fit in memory. For any kind of analysis you might normally perform on your laptop, Diffprivlib is likely to work well; for larger problems, other solutions are likely to be faster.

Programmers track the privacy budget in Diffprivlib explicitly, using a BudgetAccountant, which provides considerable flexibility but also requires the programmer to understand privacy budgets. The following example will throw an error while computing the second histogram, since it exceeds the total allowed privacy budget.

from diffprivlib import BudgetAccountant

acc = BudgetAccountant(epsilon=1, delta=0)
dp_hist1, _ = dp.histogram(data, epsilon=0.5, accountant=acc)
dp_hist2, _ = dp.histogram(data, epsilon=0.8, accountant=acc)
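In practice, you may want to catch this failure rather than let it crash the analysis. A short sketch, assuming the BudgetError exception that Diffprivlib defines in diffprivlib.utils:

from diffprivlib.utils import BudgetError

try:
    dp_hist2, _ = dp.histogram(data, epsilon=0.8, accountant=acc)
except BudgetError:
    # Only 0.5 of the epsilon=1 budget remains, so this query is refused.
    print("Privacy budget exhausted; no result released.")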
Differentially private counts in SQL: Google’s differential privacy library

Google’s differential privacy library is an open-source library written in C++ for high-performance, differentially private analytics. The library includes other functionality in addition to the C++ API, such as an extension to PostgreSQL for executing differentially private database queries.

Through this integration, the library is capable of directly answering queries written in SQL. In most cases, the programmer can simply replace the COUNT aggregation function with the ANON_COUNT function (passing a value for ε), and the system will output differentially private results. For example, the following SQL query will compute a differentially private histogram with ε=0.1:

SELECT ANON_COUNT(*, 0.1)
FROM data
GROUP BY age
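Ordinary SQL filters compose with ANON_COUNT just as they do with COUNT. For example, a hedged sketch of the latte question from the introduction, with an invented table and invented column names:

SELECT ANON_COUNT(*, 0.1)
FROM latte_orders
WHERE month = 'October'
GROUP BY region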
Because Google’s library integrates directly with a commonly used database system, it is likely to scale more easily to large datasets than Diffprivlib. Unlike NumPy, PostgreSQL is designed to work well even when the database being queried does not fit in memory, and Google’s approach retains this benefit. Data environments that use another DBMS would need to switch to PostgreSQL or implement a similar integration for their own DBMS. For environments already using PostgreSQL, Google’s library provides a directly deployable solution for answering queries written in SQL. The library also provides APIs in Go and Java for building more complex differentially private algorithms.

Limitations and areas for improvement

The two tools discussed here are both robust and well-maintained, but both require at least some expertise in differential privacy. Systems with additional tools to aid in exploring data and understanding the implications of differential privacy have been proposed in academic projects but have not yet been developed into practical, well-supported tools. For example, the Ψ system, developed as part of the Harvard Privacy Tools project, has a user interface designed specifically around helping the analyst understand how best to allocate the privacy budget and how much error to expect from each query.

Coming up next

What if, instead of the total, you’d like to know the average number of people drinking pumpkin spice lattes across different geographic areas? Our next post will go beyond counting queries into other kinds of aggregations, like sums and averages, to answer questions like these while maintaining differential privacy. These kinds of queries can be more challenging to answer with differential privacy and require special care to ensure privacy and accuracy in the results.

First published Oct. 29, 2020, on NIST’s Cybersecurity Insights blog.
About The Authors
David Darais
David Darais is a principal scientist at Galois and supports NIST as a moderator for the Privacy Engineering Collaboration Space.
Joseph Near
Joseph Near is an assistant professor of computer science at the University of Vermont who supports NIST as a moderator for the Privacy Engineering Collaboration Space.
© 2023 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.
“Quality Digest" is a trademark owned by Quality Circle Institute, Inc.