
Benchmarks

AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.

| Benchmark | Description |
| --- | --- |
| MMLU | Tests a model's knowledge and problem-solving ability across 57 subjects, including math, history, law, and more. |
| HellaSwag | Challenges LLMs to pick the most plausible continuation of a scenario, testing commonsense reasoning and inference. |
| PIQA | Evaluates physical commonsense reasoning: how everyday objects are used and how physical situations play out. |
| SIQA | Assesses an LLM's commonsense reasoning about social situations and interactions. |
| BoolQ | Measures models on naturally occurring yes/no questions that often require complex inference. |
| Winogrande | A challenging commonsense reasoning benchmark based on resolving pronoun ambiguity. |
| CQA | Tests conversational question answering, where LLMs need to follow the flow of the conversation history. |
| OBQA | OpenBookQA: evaluates answering elementary science questions that require combining a small "book" of provided facts with broader common knowledge. |
| ARC-e/ARC-c | A set of grade-school science exam questions measuring reasoning and understanding; 'e' stands for Easy and 'c' for Challenge. |
| TriviaQA | Assesses LLMs on open-domain trivia questions gathered from real sources. |
| NQ | Natural Questions: evaluates question answering on real-world Google search queries. |
| HumanEval | Measures code generation: models must produce Python functions that pass the benchmark's unit tests. |
| MBPP | Mostly Basic Python Problems: evaluates the ability to solve short, entry-level Python programming tasks. |
| GSM8K | Evaluates LLMs on multi-step grade-school math word problems. |
| MATH | A dataset of competition-level mathematics problems for assessing mathematical reasoning. |
| AGIEval | A benchmark built from human standardized exams (e.g., SAT, LSAT, Gaokao) to assess general reasoning ability. |
| BBH | BIG-Bench Hard: a suite of especially challenging BIG-Bench tasks that require multi-step reasoning. |
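Many of these benchmarks (MMLU, ARC, HellaSwag, and others) are scored as multiple-choice accuracy: the model is prompted with a question and its answer choices, and its reply is compared against the gold answer. The sketch below illustrates that scoring loop under simplified assumptions; the item format, `query_model` function, and sample question are hypothetical stand-ins, not the official harness for any of these benchmarks.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# `query_model` is a hypothetical stand-in for whatever inference API you use.

from typing import Callable

# Each item: a question, its answer choices, and the index of the gold answer.
SAMPLE_ITEMS = [
    {
        "question": "What is 7 * 8?",
        "choices": ["54", "56", "58", "64"],
        "answer": 1,
    },
]


def format_prompt(item: dict) -> str:
    """Render a question and lettered choices as a single prompt string."""
    letters = "ABCD"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(item["choices"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def score(items: list[dict], query_model: Callable[[str], str]) -> float:
    """Return accuracy: the fraction of items where the model picks the gold letter."""
    letters = "ABCD"
    correct = 0
    for item in items:
        reply = query_model(format_prompt(item)).strip().upper()
        if reply[:1] == letters[item["answer"]]:
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    # Dummy model that always answers "B", just to show the scoring loop runs.
    print(score(SAMPLE_ITEMS, lambda prompt: "B"))
```

Generation-style benchmarks such as HumanEval and GSM8K are scored differently: generated code is run against unit tests, and math answers are extracted from the model's worked solution and compared to the reference result.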