In the social sciences, as in engineering, it is common practice to use benchmarks as indicators of performance: several countries, or regions within a country, are compared with respect to a quantitative indicator. Take employment ratios. A higher employment ratio that includes many persons working few hours in part-time jobs means something quite different from a slightly lower employment ratio with hardly any part-time employees.
The same rationale holds for benchmarks of AI systems, including the newer agentic AI systems under construction in many fields. Yuxuan Zhu et al. (2025) propose the Agentic Benchmark Checklist (ABC) for developers of agentic AI. According to this checklist, benchmark reports should address (1) transparency and validity, (2) efforts to mitigate known limitations, and (3) result interpretation supported by statistical significance measures and interpretation guidelines.
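To make point (3) concrete, the following sketch compares the pass rates of two agents on the same task set and reports an effect size, test statistic, and p-value. It is a minimal illustration only: the agent results, the sample sizes, and the choice of a two-proportion z-test are our assumptions, not requirements of the checklist.

```python
"""Hedged sketch: reporting statistical significance for agentic
benchmark results (agent names and numbers are hypothetical)."""
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two pass rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, z, p_value

# Hypothetical results: agent A passes 62/100 tasks, agent B 55/100.
diff, z, p = two_proportion_ztest(62, 100, 55, 100)
print(f"difference in pass rate: {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```

Reporting the test used, the sample sizes, and the resulting p-value alongside the headline score lets readers judge whether an apparent improvement is statistically meaningful or within the range of chance variation.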
The aim of this research is to establish good practices for constructing benchmarks in the field of agentic AI. The set of criteria to test is large; how an agentic AI system treats statistical outliers, i.e., values far above or below the mean (more than 2 standard deviations from the mean, assuming a normal distribution), is only one application case, illustrated in the sketch below.
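A minimal sketch of this outlier rule, assuming normally distributed scores; the data and the 2-standard-deviation threshold are illustrative:

```python
"""Hedged sketch: flag values more than 2 standard deviations from the
mean, the outlier rule discussed above (data are hypothetical)."""
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical benchmark scores containing one extreme value.
scores = [71, 68, 73, 70, 69, 72, 95, 70, 71, 68]
print(flag_outliers(scores))  # prints [95]
```

Whether a benchmarked agent reports such flagged values, down-weights them, or silently drops them is exactly the kind of behavior the criteria should make visible.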
We welcome these efforts to benchmark the benchmarks in the field of AI, as is already common practice in other sciences.