Building gold-standard AI systems requires gold-standard AI measurement science—the scientific study of methods used to assess AI systems’ properties and effects. The National Institute of Standards and Technology (NIST) works to improve measurements of AI performance, reliability, and security that American companies and consumers rely on to develop, adopt, and benefit from AI technologies.
Among other groups at NIST, the Center for AI Standards and Innovation (CAISI) works with the larger community of AI practitioners to identify and make progress on open questions that are key to maturing the field of AI measurement science and advancing AI innovation. This article highlights an initial selection of such questions that CAISI has identified through its initiatives to date.
The need for improved AI measurement science
Many modern evaluations of AI systems don’t precisely articulate what’s been measured, much less whether the measurements are valid. Metrology, the field of measurement science, provides methods for rigorous and trustworthy use of measured values, qualified with assessments of uncertainty.
In line with its long tradition of research and leadership in measurement science, NIST has been charged with leading “the development of the science of measuring and evaluating AI models.” Working as part of NIST’s efforts, CAISI is developing new methods and frameworks to measure capabilities and risks of leading AI models as part of its mandate to “facilitate testing and collaborative research related to harnessing and securing the potential of commercial AI systems.”
A selection of open AI measurement science questions
AI measurement science is an emerging field. CAISI’s work to develop robust evaluations of AI systems in domains such as cybersecurity or biosecurity relies on ongoing research into methods for rigorous measurement. Below, we spotlight a nonexhaustive set of challenging AI measurement science questions.
Ensuring measurement validity
What concepts do current AI evaluations measure, and do they generalize to other domains and real-world settings? How can evaluators ensure measurement instruments are reliable and valid?
Construct validity
Often, claims about the capabilities (e.g., mathematical reasoning) of AI systems don’t match the construct actually measured by the benchmark (e.g., accuracy at answering math problems). A critical step in AI evaluation is the assessment of construct validity, or whether a testing procedure accurately measures the intended concept or characteristic.
Construct validity is needed to accurately assess whether AI systems meet expectations:
• How can evaluation designers and practitioners establish construct validity?
• How should evaluators select appropriate measurement targets, metrics, and experimental designs for an evaluation goal? For example, how can the concept of AI-driven productivity be systematized and measured in different domains?
• What evaluation approaches can distinguish best- or worst-case performance from average-case reliability?
Generalization
Some AI evaluation results are unjustifiably generalized beyond the test setting. Research is needed to determine the extent to which evaluation results apply to other contexts and predict real-world performance:
• To what extent do domain- or task-specific evaluations generalize beyond a specific use case or between domains?
• How well do measures of general-purpose problem solving predict downstream performance on specific tasks?
• How well do predeployment evaluations predict postdeployment functionality, risk, and effects? What evaluation design practices enhance real-world informativeness?
Benchmark design and assessment
Researchers rely heavily on standardized benchmarks to compare performance across models and over time. But there are few guidelines for constructing reliable, rigorous benchmarks:
• What are valid and reliable methods to create and grade AI benchmarks? How can evaluators assess the validity and quality of existing AI benchmarks using publicly available information?
• To what degree are benchmark results sensitive to prompt selection and task design?
• How can evaluators determine the degree of train-test overlap affecting an evaluation? How does public release of benchmarks enhance or constrain future testing?
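On the train-test overlap question above: one common heuristic is to flag benchmark items that share long word n-grams with documents in a training corpus. The sketch below is a minimal illustration of that idea, assuming the benchmark items and training documents are available as plain-text strings; the function names and the 13-gram length are illustrative choices, not a standard contamination audit.

```python
from typing import Iterable

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlap(benchmark_items: Iterable[str],
                 training_docs: Iterable[str],
                 n: int = 13) -> list[int]:
    """Return indices of benchmark items that share at least one
    n-gram with any training document (a rough contamination signal)."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = []
    for i, item in enumerate(benchmark_items):
        if ngrams(item, n) & train_grams:
            flagged.append(i)
    return flagged
```

Shorter n-grams tend to over-flag common phrasing, while longer ones miss paraphrased overlap, so the threshold itself is a measurement choice that merits reporting.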
Measurement instrument innovation
Researchers are developing new approaches to measuring and monitoring AI system characteristics at scale. Appropriate uses, limitations, and best practices for these emerging methods are not yet clear:
• How and when can information from model reasoning and hidden activations be used to produce more reliable measurements than those based on system outputs alone?
• How and when should AI systems be used to test, evaluate, or monitor AI systems? What are best practices for reliable use and validation of LLM-as-a-judge?
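On validating LLM-as-a-judge: one simple check is to compare the automated judge’s verdicts against human labels on a sample of the same items. The sketch below assumes binary pass/fail verdicts and a hypothetical judge callable supplied by the evaluator (not any specific product’s API), and reports raw agreement alongside Cohen’s kappa.

```python
from typing import Callable, Sequence

def validate_judge(judge: Callable[[str, str], bool],
                   prompts: Sequence[str],
                   responses: Sequence[str],
                   human_labels: Sequence[bool]) -> dict[str, float]:
    """Compare an LLM judge's pass/fail verdicts to human labels
    on the same items; report agreement and Cohen's kappa."""
    verdicts = [judge(p, r) for p, r in zip(prompts, responses)]
    n = len(human_labels)
    agree = sum(v == h for v, h in zip(verdicts, human_labels)) / n
    # Chance agreement for a binary label, needed for Cohen's kappa.
    p_judge = sum(verdicts) / n
    p_human = sum(human_labels) / n
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (agree - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return {"agreement": agree, "kappa": kappa}
```

Reporting kappa alongside raw agreement guards against judges that appear accurate only because one label dominates the sample.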
Interpreting results and claims
How can practitioners appropriately interpret and act on the results of benchmarks, field studies, and other evaluations?
Uncertainty
All measurements involve some degree of uncertainty. Accurate claims about AI systems require honest and transparent communication of this uncertainty. But some presentations of benchmark results omit error bars and other basic expressions of uncertainty.
How should evaluators identify, quantify, and communicate sources of uncertainty in an evaluation setup?
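As one minimal illustration of the kind of uncertainty reporting at issue, the sketch below attaches a percentile bootstrap confidence interval to a benchmark accuracy computed from per-item pass/fail scores. It captures only sampling uncertainty over items, not other sources such as prompt variation or model nondeterminism, and it is an illustrative sketch rather than a recommended procedure.

```python
import random

def accuracy_with_ci(item_scores: list[int],
                     n_boot: int = 10_000,
                     alpha: float = 0.05,
                     seed: int = 0) -> tuple[float, float, float]:
    """Benchmark accuracy with a percentile bootstrap confidence
    interval over items (1 = correct, 0 = incorrect)."""
    rng = random.Random(seed)
    n = len(item_scores)
    point = sum(item_scores) / n
    boots = []
    for _ in range(n_boot):
        resample = [item_scores[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(resample) / n)
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

A result reported as, say, 0.72 with a 95% interval of 0.66 to 0.78 supports very different claims than a bare 0.72.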
Baselines
Many evaluations lack relevant human or other non-AI baselines. Baselines are necessary to accurately interpret results (e.g., comparing the accuracy of an AI diagnostic tool to that of expert physicians).
What are appropriate baselines for interpreting AI system performance compared to non-AI alternatives and to AI-assisted human performance?
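As a small, hedged illustration of how a baseline changes interpretation, the sketch below reports an AI system’s accuracy relative to a human or other non-AI baseline measured on the same task, with a normal-approximation standard error on the difference of two independent proportions. The counts are placeholders; a real study would also need matched tasks, conditions, and expertise levels.

```python
import math

def compare_to_baseline(ai_correct: int, ai_total: int,
                        base_correct: int, base_total: int) -> tuple[float, float]:
    """Difference between an AI system's accuracy and a (human or
    other non-AI) baseline accuracy, with a normal-approximation
    standard error for the difference of two independent proportions."""
    p_ai = ai_correct / ai_total
    p_base = base_correct / base_total
    se = math.sqrt(p_ai * (1 - p_ai) / ai_total +
                   p_base * (1 - p_base) / base_total)
    return p_ai - p_base, se
```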
Model and benchmark comparison
Benchmark results are helpful for comparing the performance of different AI systems. But many reports ignore uncertainty and other factors needed for accurate comparison. Practitioners need valid methods to rank models, measure AI system improvement over time, resolve conflicting benchmark results, and make other comparisons of evaluation results.
How can practitioners compare or combine the results of different AI evaluations? And how should testing procedures for comparing multiple models differ from testing a single model?
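One illustration of the comparison problem: when two models are scored on the same benchmark items, a paired analysis is usually more informative than comparing two standalone accuracy numbers. The sketch below uses a paired bootstrap over items, assuming aligned per-item 0/1 scores for each model; it is a sketch of one possible approach, not a prescribed method.

```python
import random

def paired_bootstrap_diff(scores_a: list[int],
                          scores_b: list[int],
                          n_boot: int = 10_000,
                          seed: int = 0) -> tuple[float, float]:
    """Paired bootstrap over items for two models scored on the same
    benchmark. Returns the observed accuracy difference (A minus B)
    and the fraction of resamples in which A outperforms B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n
    a_wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff > 0:
            a_wins += 1
    return observed, a_wins / n_boot
```

The paired design accounts for the fact that both models face the same items, which typically narrows the uncertainty on the difference relative to treating the two accuracies as independent.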
Reporting
Measurement validity depends not only on the measurement instrument but also on the context of the evaluation and the claims it supports. Often, reports of model performance from developers and researchers lack key details needed to assess validity. Accurate and useful interpretations of evaluation results depend on evaluators reporting sufficient detail.
What are best practices or standards for reporting the results of AI evaluations? What information should AI benchmark developers include when publishing benchmarks in order to support their sound usage in evaluations?
Taking measurements in the field
What methods enable measurement of AI systems in real-world settings?
Downstream outcome measurement
Postdeployment evaluations are often neglected, but are necessary to assess AI systems’ performance, risks, and effects in the real world:
• What are reproducible experimental methods to measure AI-driven changes in downstream outcomes over time? For example, can AI assist or uplift human ability to carry out physical tasks such as realistic biological laboratory research?
• What are methods to measure the causal effect of safeguards and other interventions on downstream outcomes?
• What are methods to measure, categorize, and track components of complex, multiturn human-AI interaction workflows?
Stakeholder consultation
Domain stakeholders have valuable subject matter expertise needed to ensure successful AI adoption. But researchers and developers typically conduct AI evaluations with little involvement from the public or other important stakeholders in the success of American AI.
What are best practices for including stakeholders in the assessment process, including end users, subject matter experts, and the public?
CAISI welcomes your engagement as it evaluates how it can best support stakeholders in advancing AI innovation through measurement science. Feel free to share comments or feedback via email to caisi-metrology@nist.gov.
Published Dec. 2, 2025, by CAISI @ NIST.
