Benchmarks
AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.
Benchmark | Description |
---|---|
MMLU  | Tests knowledge and problem-solving with multiple-choice questions spanning 57 subjects, including math, history, law, and medicine. |
HellaSwag  | A sentence-completion benchmark that challenges LLMs to pick the most plausible continuation of an everyday scenario, testing commonsense inference. |
PIQA  | Evaluates physical commonsense reasoning: choosing the sensible way to carry out everyday tasks involving objects and their properties. |
SIQA  | Assesses an LLM’s commonsense reasoning and understanding of social situations. |
BoolQ  | Measures models on yes/no questions that often require complex reasoning. |
Winogrande  | A challenging benchmark focusing on commonsense reasoning by resolving pronoun ambiguity. |
CQA  | CommonsenseQA: tests general commonsense knowledge through multiple-choice questions about everyday concepts and their relations. |
OBQA  | OpenBookQA: evaluates multi-step reasoning on elementary-science questions that combine a small “open book” of facts with broad common knowledge. |
ARC-e/ARC-c  | A set of science exam questions measuring reasoning and understanding. ‘e’ stands for easy and ‘c’ for challenging. |
TriviaQA  | Assesses LLMs on open-domain trivia questions obtained from real sources. |
NQ  | Evaluates question answering on challenging real-world Google search queries. |
HumanEval  | Measures code generation: the model completes Python functions from their docstrings, and solutions are scored by executing them against unit tests (see the sketch after this table). |
MBPP  | Evaluates code generation on roughly 1,000 crowd-sourced, entry-level Python programming problems, each paired with test cases. |
GSM8K  | Evaluates LLMs on challenging multi-step grade-school mathematical problems. |
MATH  | A dataset of competition-level mathematics problems that require multi-step reasoning to reach an exact answer. |
AGIEval  | Tests an LLM’s reasoning on questions drawn from human standardized exams such as the SAT, LSAT, and national college-entrance exams. |
BBH  | BIG-Bench Hard: a suite of especially difficult BIG-Bench tasks that require multi-step, compositional reasoning. |
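
Code benchmarks such as HumanEval and MBPP are not scored by string matching: the model’s completion is executed against unit tests, and pass@1 reports the share of problems solved on the first attempt. The snippet below is a minimal sketch of that idea, not the official harness; the `task` dictionary and helper names are invented for illustration.

```python
# Minimal sketch of unit-test-based scoring for a code benchmark (HumanEval/MBPP style).
# Illustrative only: real harnesses sandbox execution and sample many completions per problem.

# Hypothetical task: a prompt, a model completion, and the hidden unit tests.
task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "completion": "    return a + b\n",
    "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

def passes_tests(t: dict) -> bool:
    """Concatenate prompt + completion + tests and run them; any exception counts as a failure."""
    program = t["prompt"] + t["completion"] + "\n" + t["tests"]
    try:
        exec(program, {})  # never exec untrusted model output outside a sandbox
        return True
    except Exception:
        return False

def pass_at_1(tasks: list[dict]) -> float:
    """pass@1 with one sample per problem: the fraction of problems whose solution passes all tests."""
    return sum(passes_tests(t) for t in tasks) / len(tasks)

print(f"pass@1 = {pass_at_1([task]):.2f}")  # 1.00 for this toy example
```

Most of the other benchmarks in the table (MMLU, ARC, BoolQ, and so on) are instead scored as plain accuracy over multiple-choice or yes/no answers.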