Reasoning benchmarks

There is no single "Reasoning Index", but there are various benchmarks, tests, and evaluations used within the AI community to assess how models perform on different reasoning tasks.

Find links to all of the benchmarks at the bottom of this text.

1. GLUE and SuperGLUE Benchmarks: GLUE (General Language Understanding Evaluation) and its harder successor SuperGLUE bundle tasks such as natural language inference, question answering, and coreference resolution into a single score for general language understanding.

2. ARC (AI2 Reasoning Challenge): Multiple-choice grade-school science questions from the Allen Institute for AI, split into an Easy set and a Challenge set designed to resist simple retrieval and word co-occurrence methods.

3. BIG-bench (Beyond the Imitation Game Benchmark): A large, collaboratively built collection of more than 200 tasks contributed by researchers, probing abilities such as logical deduction, mathematics, and multi-step reasoning.

4. MMLU (Massive Multitask Language Understanding): Multiple-choice questions spanning 57 subjects, from elementary mathematics to law and medicine, usually reported as zero-shot or few-shot accuracy (a minimal accuracy-scoring sketch follows this list).

5. CommonsenseQA: Multiple-choice questions that require everyday commonsense knowledge rather than facts stated in an accompanying passage.

6. Winograd Schema Challenge (WSC): Pronoun-resolution puzzles that hinge on commonsense, e.g. "The trophy doesn't fit in the suitcase because it is too big"; deciding what "it" refers to requires reasoning about the objects, not grammar alone.

7. HumanEval: Hand-written Python programming problems, introduced alongside OpenAI's Codex, where generated code is checked against unit tests; results are usually reported as pass@k (a sketch of that estimator follows this list).

8. SQuAD (Stanford Question Answering Dataset): Reading-comprehension questions over Wikipedia paragraphs, scored with exact match and token-level F1; SQuAD 2.0 adds unanswerable questions, so models must also know when to abstain (a simplified F1 sketch follows this list).

9. TruthfulQA: Questions crafted around common misconceptions, measuring whether a model answers truthfully rather than echoing plausible-sounding falsehoods.

10. OpenAI and Anthropic Evaluations: Both labs publish their own evaluation suites and results, for example OpenAI's open-source Evals framework and the benchmark tables in Anthropic's model cards, alongside the public benchmarks above.
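
Several of these benchmarks (MMLU, ARC, CommonsenseQA, much of SuperGLUE) boil down to multiple-choice accuracy: the model picks one of the answer options and the score is the fraction of items it gets right. The Python sketch below shows that scoring loop in miniature; the two sample items and the pick_answer stand-in are illustrative placeholders, not taken from any dataset, and in practice you would load the real benchmark (for example via the Hugging Face datasets library) and call your model instead.

    from typing import Callable

    # Each item: a question, a list of answer choices, and the index of the correct choice.
    # These two items are made up for illustration only.
    ITEMS = [
        {"question": "What is 7 * 8?", "choices": ["54", "56", "64", "49"], "answer": 1},
        {"question": "Water freezes at what temperature in Celsius?",
         "choices": ["100", "50", "0", "-10"], "answer": 2},
    ]

    def pick_answer(question: str, choices: list) -> int:
        """Stand-in for a model call: always guesses the first choice.
        Replace with a prompt to your model that returns a choice index."""
        return 0

    def accuracy(items: list, choose: Callable) -> float:
        """Fraction of items for which the chosen index matches the gold answer."""
        correct = sum(1 for it in items if choose(it["question"], it["choices"]) == it["answer"])
        return correct / len(items)

    print(f"accuracy = {accuracy(ITEMS, pick_answer):.2f}")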
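
HumanEval reports pass@k: the probability that at least one of k generated solutions passes a problem's unit tests, estimated from n generated samples of which c pass. Below is a minimal sketch of that unbiased estimator (the formula from the Codex paper); the n = 20, c = 3 numbers in the demo are made up for illustration.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate for one problem.
        n: samples generated, c: samples that passed the tests, k: evaluation budget."""
        if n - c < k:
            # Fewer than k failures exist, so any k draws must include a passing sample.
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(n=20, c=3, k=k):.3f}")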
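
SQuAD answers are short text spans, scored with exact match and a token-overlap F1. The sketch below shows a simplified F1; the official SQuAD script additionally strips punctuation and articles before comparing, which is omitted here for brevity, and the demo strings are illustrative.

    from collections import Counter

    def squad_f1(prediction: str, gold: str) -> float:
        """Token-level F1 between a predicted answer span and the gold span (simplified)."""
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap counts
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(f"F1 = {squad_f1('the Eiffel Tower', 'Eiffel Tower'):.2f}")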

Links:

SuperGLUE

The ARC Prize (note that the ARC Prize concerns the Abstraction and Reasoning Corpus, ARC-AGI, a different benchmark from the AI2 Reasoning Challenge listed above)

google/BIG-bench on Hugging Face

MMLU (Massive Multitask Language Understanding) on paperswithcode.com

CommonsenseQA

WSC (Winograd Schema Challenge) on paperswithcode.com

HumanEval on paperswithcode.com

SQuAD (Stanford Question Answering Dataset) on paperswithcode.com

TruthfulQA on paperswithcode.com