A Standard Evaluation Benchmark for NLG

1. Downstream Datasets

Card Name                        Selected SeqLen   Comments
wikisql                          512
spider                           512
allenai/common_gen               256
e2e_nlg                          512
UCL-DARK/openai-tldr-filtered    2048              filter
cnn_dailymail                    4096              filter
samsum                           2048              filter
piqa                             256
truthful_qa                      256
allenai/ai2_arc                  256
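
The table above can be captured as a small configuration, sketched here in Python. This is only an illustration, not the project's actual code. The names `DatasetCard`, `CARDS`, `whitespace_len`, and `apply_card` are hypothetical. It also assumes that "filter" in the Comments column means "drop examples longer than the selected SeqLen", which the table does not confirm. A whitespace token count stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DatasetCard:
    name: str                    # card name as listed in the table
    max_seq_len: int             # the "Selected SeqLen" column
    length_filter: bool = False  # True where Comments says "filter"


# One entry per table row; values copied from the table.
CARDS = [
    DatasetCard("wikisql", 512),
    DatasetCard("spider", 512),
    DatasetCard("allenai/common_gen", 256),
    DatasetCard("e2e_nlg", 512),
    DatasetCard("UCL-DARK/openai-tldr-filtered", 2048, length_filter=True),
    DatasetCard("cnn_dailymail", 4096, length_filter=True),
    DatasetCard("samsum", 2048, length_filter=True),
    DatasetCard("piqa", 256),
    DatasetCard("truthful_qa", 256),
    DatasetCard("allenai/ai2_arc", 256),
]


def whitespace_len(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def apply_card(
    card: DatasetCard,
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_len,
) -> List[str]:
    """Drop over-length examples for cards marked "filter" (assumed meaning).

    Cards without the marker are returned unchanged; any truncation is left
    to whatever tokenizer is used downstream.
    """
    if not card.length_filter:
        return list(texts)
    return [t for t in texts if count_tokens(t) <= card.max_seq_len]
```

For example, `apply_card(DatasetCard("toy", 3, length_filter=True), ["a b c", "a b c d"])` keeps only the first string. Passing a real tokenizer's length function as `count_tokens` would make the limits comparable to model sequence lengths.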

1.1. Downstream Tasks: Status

1.1.1. Translation

  1. REVIEW WMT16: cs-en
  2. REVIEW WMT16: de-en
  3. REVIEW WMT16: fi-en

1.1.2. Summarization

  1. TODO TLDR
  2. TODO cnn_dailymail (open question: inputs may be too long)
  3. TODO samsum

1.1.3. Structured Text Generation

  1. TODO WikiSQL
  2. TODO Spider

1.1.4. Data-to-Text

  1. TODO e2e_nlg
  2. TODO CommonGen

1.1.5. Question Answering

  1. TODO piqa
  2. TODO truthful_qa
  3. TODO allenai/ai2_arc

1.2. Generalized Ability: Status

1.2.1. Safety

  1. TODO HELM: bias, toxicity, fairness
  2. TODO

1.2.2. Reasoning (math)

  1. TODO MMLU (59 subsets)

Author: Zi Liang (zi1415926.liang@connect.polyu.hk)
Created: Tue Mar 26 11:08:33 2024
Last modified: 2024-03-26 Tue 11:09
Creator: Emacs 28.1 (Org mode 9.5.2)