Benchmarks
AI benchmarks are standardized datasets, tests, or evaluation methods used to measure the performance of various AI systems.
| Benchmark | Description |
|---|---|
| MMLU | Tests a model’s ability to perform well on a wide range of tasks across 57 different domains like math, history, law, and more. |
| HellaSwag | Challenges models to pick the most plausible continuation of a described scenario, testing commonsense reasoning and inference. |
| PIQA | Evaluates physical commonsense reasoning, such as choosing the most sensible way to accomplish everyday physical tasks. |
| SIQA | Assesses an LLM’s commonsense reasoning and understanding of social situations. |
| BoolQ | Measures models on yes/no questions that often require complex reasoning. |
| Winogrande | A challenging benchmark focusing on commonsense reasoning by resolving pronoun ambiguity. |
| CQA | Tests conversational question answering where LLMs need to follow the flow of conversational history. |
| OBQA | OpenBookQA: multiple-choice elementary-science questions that require combining a small “open book” of facts with broader common knowledge. |
| ARC-e/ARC-c | A set of science exam questions measuring reasoning and understanding. ‘e’ stands for easy and ‘c’ for challenging. |
| TriviaQA | Assesses LLMs on open-domain trivia questions, with supporting evidence drawn from Wikipedia and the web. |
| NQ | Evaluates question answering on challenging real-world Google search queries. |
| HumanEval | A code-generation benchmark of hand-written Python programming problems; model-generated solutions are checked against unit tests. |
| MBPP | Mostly Basic Python Problems: evaluates code generation on entry-level Python programming tasks, each paired with test cases. |
| GSM8K | Evaluates LLMs on challenging multi-step grade-school mathematical problems. |
| MATH | A dataset of competition-level mathematics problems for assessing mathematical reasoning in language models. |
| AGIEval | Tests general reasoning using questions drawn from human standardized exams, such as college entrance and law school admission tests. |
| BBH | BIG-Bench Hard: a suite of the most challenging BIG-Bench tasks, requiring multi-step reasoning. |
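
To make the idea concrete, below is a minimal sketch of how a multiple-choice benchmark in the style of MMLU or ARC is typically scored: the model picks an answer for each question, and the benchmark reports the fraction it gets right. The `MultipleChoiceItem` schema, `evaluate_accuracy` function, and the stand-in `predict` callable are illustrative assumptions, not the API of any particular evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MultipleChoiceItem:
    """One benchmark question with candidate answers (hypothetical schema)."""
    question: str
    choices: List[str]
    answer_index: int  # index of the correct choice


def evaluate_accuracy(
    items: List[MultipleChoiceItem],
    predict: Callable[[str, List[str]], int],
) -> float:
    """Score a model on a multiple-choice benchmark.

    `predict` is any function that maps (question, choices) to the index
    of the model's chosen answer, e.g. a thin wrapper around an LLM call.
    """
    correct = sum(
        1 for item in items
        if predict(item.question, item.choices) == item.answer_index
    )
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Tiny illustrative dataset and a trivial baseline that always picks choice 0.
    dataset = [
        MultipleChoiceItem("2 + 2 = ?", ["4", "5"], 0),
        MultipleChoiceItem("Capital of France?", ["Berlin", "Paris"], 1),
    ]
    baseline = lambda question, choices: 0
    print(f"accuracy: {evaluate_accuracy(dataset, baseline):.2f}")  # 0.50
```

Code benchmarks such as HumanEval and MBPP differ mainly in the scoring step: instead of comparing an answer index, the harness executes the generated program against unit tests and counts a problem as solved only if all tests pass.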