Benchmarks
AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.
Benchmark | Description |
---|---|
MMLU  | Tests knowledge and problem-solving with multiple-choice questions spanning 57 subjects, including math, history, law, and medicine. |
HellaSwag  | A sentence-completion benchmark that challenges LLMs to pick the most plausible continuation of an everyday scenario, testing commonsense inference. |
PIQA  | Evaluates physical commonsense reasoning: choosing the sensible way to carry out everyday tasks involving objects and their properties. |
SIQA  | Assesses an LLM’s commonsense reasoning and understanding of social situations. |
BoolQ  | Measures models on yes/no questions that often require complex reasoning. |
Winogrande  | A challenging benchmark focusing on commonsense reasoning by resolving pronoun ambiguity. |
CQA  | CommonsenseQA: tests general commonsense knowledge through multiple-choice questions about everyday concepts and their relations. |
OBQA  | OpenBookQA: evaluates multi-step reasoning on elementary-science questions that combine a small “open book” of facts with broad common knowledge. |
ARC-e/ARC-c  | A set of science exam questions measuring reasoning and understanding. ‘e’ stands for easy and ‘c’ for challenging. |
TriviaQA  | Assesses LLMs on open-domain trivia questions obtained from real sources. |
NQ  | Evaluates question answering on challenging real-world Google search queries. |
HumanEval  | Measures code generation: the model completes Python functions from their docstrings, and solutions are scored by executing them against unit tests (see the sketch after this table). |
MBPP  | Evaluates code generation on roughly 1,000 crowd-sourced, entry-level Python programming problems, each paired with test cases. |
GSM8K  | Evaluates LLMs on challenging multi-step grade-school mathematical problems. |
MATH  | A dataset of competition-level mathematics problems that require multi-step reasoning to reach an exact answer. |
AGIEval  | Tests an LLM’s reasoning on questions drawn from human standardized exams such as the SAT, LSAT, and national college-entrance exams. |
BBH  | BIG-Bench Hard: a suite of especially difficult BIG-Bench tasks that require multi-step, compositional reasoning. |
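
Code benchmarks such as HumanEval and MBPP are not scored by string matching: the model’s completion is executed against unit tests, and pass@1 reports the share of problems solved on the first attempt. The snippet below is a minimal sketch of that idea, not the official harness; the `task` dictionary and helper names are invented for illustration.

```python
# Minimal sketch of unit-test-based scoring for a code benchmark (HumanEval/MBPP style).
# Illustrative only: real harnesses sandbox execution and sample many completions per problem.

# Hypothetical task: a prompt, a model completion, and the hidden unit tests.
task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "completion": "    return a + b\n",
    "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

def passes_tests(t: dict) -> bool:
    """Concatenate prompt + completion + tests and run them; any exception counts as a failure."""
    program = t["prompt"] + t["completion"] + "\n" + t["tests"]
    try:
        exec(program, {})  # never exec untrusted model output outside a sandbox
        return True
    except Exception:
        return False

def pass_at_1(tasks: list[dict]) -> float:
    """pass@1 with one sample per problem: the fraction of problems whose solution passes all tests."""
    return sum(passes_tests(t) for t in tasks) / len(tasks)

print(f"pass@1 = {pass_at_1([task]):.2f}")  # 1.00 for this toy example
```

Most of the other benchmarks in the table (MMLU, ARC, BoolQ, and so on) are instead scored as plain accuracy over multiple-choice or yes/no answers.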