A standard benchmark for NLG evaluation

Table of Contents

1. Downstream Datasets
- 1.1. Downstream Tasks: Status
- 1.2. Generalized Ability: Status
  - 1.2.1. Safety
  - 1.2.2. Reasoning (math)

1. Downstream Datasets

Card Name	Selected SeqLen	Comments
wikisql	512
spider	512
allenai/common_gen	256
e2e_nlg	512
UCL-DARK/openai-tldr-filtered	2048	filter
cnn_dailymail	4096	filter
samsum	2048	filter
piqa	256
truthful_qa	256
allenai/ai2_arc	256
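
The table above can be kept as a small machine-readable config. The following is a minimal sketch, not the project's actual code: the names `DatasetCard`, `CARDS`, `whitespace_len`, and `filter_by_length` are hypothetical, and it assumes that "filter" in the Comments column means dropping examples longer than the selected SeqLen. Whitespace splitting stands in for whatever tokenizer the bench really uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DatasetCard:
    name: str       # card name as listed in the table
    seq_len: int    # selected sequence length
    filtered: bool  # True where the Comments column says "filter"


# Values copied from the table; the structure itself is illustrative.
CARDS = [
    DatasetCard("wikisql", 512, False),
    DatasetCard("spider", 512, False),
    DatasetCard("allenai/common_gen", 256, False),
    DatasetCard("e2e_nlg", 512, False),
    DatasetCard("UCL-DARK/openai-tldr-filtered", 2048, True),
    DatasetCard("cnn_dailymail", 4096, True),
    DatasetCard("samsum", 2048, True),
    DatasetCard("piqa", 256, False),
    DatasetCard("truthful_qa", 256, False),
    DatasetCard("allenai/ai2_arc", 256, False),
]


def whitespace_len(text: str) -> int:
    """Stand-in length function: counts whitespace-separated tokens."""
    return len(text.split())


def filter_by_length(
    texts: Iterable[str],
    card: DatasetCard,
    length_fn: Callable[[str], int] = whitespace_len,
) -> List[str]:
    """Assumed meaning of "filter": drop examples longer than card.seq_len.

    Cards without the "filter" comment are returned unchanged.
    """
    if not card.filtered:
        return list(texts)
    return [t for t in texts if length_fn(t) <= card.seq_len]
```

In practice `length_fn` would be replaced by a function that counts tokens with the model's own tokenizer, since whitespace counts usually underestimate subword token counts.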

1.1. Downstream Tasks: Status

1.1.1. Translation

REVIEW WMT16: cs-en
REVIEW WMT16: de-en
REVIEW WMT16: fi-en

1.1.2. Summarization

TODO TLDR
TODO cnn_dailymail (too long?)
TODO samsum

1.1.3. Structured Text Generation

TODO WikiSQL
TODO Spider

1.1.4. Data-to-Text

TODO e2e_nlg
TODO CommonGen

1.1.5. Question Answering

TODO piqa
TODO truthful_qa
TODO allenai/ai2_arc

1.2. Generalized Ability: Status

1.2.1. Safety

TODO HELM: bias, toxicity, fairness
TODO …

1.2.2. Reasoning (math)

TODO MMLU (59 subsets)

Author: Zi Liang (zi1415926.liang@connect.polyu.hk)
Created: Tue Mar 26 11:08:33 2024
Last modified: 2026-05-12 Tue 20:20
Creator: Emacs 31.0.90 (Org mode 9.8.5)