Benchmarks

AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.

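Most of the benchmarks in the table below are multiple-choice and reduce to the same loop: pose each question, record the option the model picks, and report the fraction it gets right as accuracy. Here is a minimal sketch of that loop; the item field names and the `ask_model` callable are illustrative assumptions, not any particular harness's API.

```python
from typing import Callable

# A benchmark item in a common multiple-choice layout (field names are assumptions):
# {"question": str, "choices": list[str], "answer": int}

def evaluate(items: list[dict], ask_model: Callable[[str, list[str]], int]) -> float:
    """Return accuracy: the fraction of items where the model picks the reference option."""
    correct = sum(
        int(ask_model(item["question"], item["choices"]) == item["answer"])
        for item in items
    )
    return correct / len(items)

# Toy usage with a stand-in "model" that always picks the first option.
items = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": 0},
    {"question": "Capital of France?", "choices": ["Berlin", "Paris"], "answer": 1},
]
print(evaluate(items, lambda question, choices: 0))  # 0.5
```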
| Benchmark | Description |
| --- | --- |
| MMLU | Tests a model's knowledge and problem-solving across 57 subjects, including math, history, law, and more. |
| HellaSwag | Challenges LLMs to pick the most plausible continuation of an everyday scenario, testing commonsense inference. |
| PIQA | Evaluates physical commonsense reasoning: choosing sensible ways to accomplish everyday tasks using intuition about the physical world. |
| SIQA | Assesses an LLM's commonsense reasoning about social situations and people's motivations. |
| BoolQ | Measures models on naturally occurring yes/no questions that often require complex reasoning. |
| Winogrande | A challenging commonsense-reasoning benchmark built around resolving pronoun ambiguity. |
| CQA | Tests conversational question answering, where the model must follow the flow of the dialogue history. |
| OBQA | OpenBookQA: multiple-choice science questions that require combining a small set of provided facts with broad common knowledge. |
| ARC-e / ARC-c | Science exam questions measuring reasoning and understanding; 'e' is the easy set, 'c' the challenge set. |
| TriviaQA | Assesses LLMs on open-domain trivia questions drawn from real sources and paired with evidence documents. |
| NQ | Natural Questions: question answering on real Google search queries, with answers grounded in Wikipedia. |
| HumanEval | Measures code generation: the model writes Python functions that must pass hand-written unit tests (scored with pass@k; see the sketch after this table). |
| MBPP | Mostly Basic Python Problems: entry-level Python programming tasks, likewise checked against unit tests. |
| GSM8K | Evaluates LLMs on challenging multi-step grade-school math word problems (answer-extraction scoring is sketched after this table). |
| MATH | Competition-level mathematics problems that test advanced mathematical reasoning. |
| AGIEval | Questions drawn from human standardized exams such as the SAT, LSAT, and math competitions. |
| BBH | BIG-Bench Hard: a suite of particularly difficult BIG-Bench tasks that require multi-step reasoning. |
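Free-form math benchmarks such as GSM8K and MATH are usually scored by extracting the model's final answer from its generated reasoning and comparing it with the reference. Below is a rough sketch under the simplifying assumption that the final answer is the last number in the output; real harnesses enforce stricter answer formats.

```python
import re

def extract_final_answer(generation: str):
    """Take the last number in the model's output as its final answer (a simplification)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(generation: str, reference: str) -> bool:
    """Exact-match the extracted answer against the reference solution."""
    predicted = extract_final_answer(generation)
    return predicted is not None and float(predicted) == float(reference)

# Toy usage: the model "reasons" and ends with its answer.
print(is_correct("3 apples plus 5 apples is 8 apples. The answer is 8.", "8"))  # True
```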
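HumanEval and MBPP are scored by executing the generated programs against their unit tests and reporting pass@k, the probability that at least one of k sampled completions passes. The unbiased pass@k estimator from the HumanEval paper, where n is the number of samples per problem and c the number that pass all tests, can be written as follows.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn per problem, c of them pass all unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 pass their tests, report pass@10.
print(round(pass_at_k(200, 30, 10), 3))
```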