Benchmark Lineage

Graph legend: samples from, derived from, uses method of, uses source, annotated with, raw source (non-benchmark)

About This Dashboard

What is this?

This dashboard visualizes the lineage of tabular benchmark datasets used in NLP and data management research. Each node is a benchmark dataset or a raw, non-benchmark source it draws from; edges show how they relate to one another: whether one was sampled from another, extends it, or shares methodology. All data and source code used to generate this dashboard are open and available in this GitHub repo: https://github.com/inwonakng/table-corpus-lineage.

Click any node to see its metadata (year, venue, paper link, tags) in the right-hand panel. Click an edge to see the relationship type and any notes. Use the tag filter to highlight subsets of the graph.
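
Conceptually, the tag filter is a membership test over each node's tags. The following is a minimal sketch in Python, assuming each node is a dictionary with an id and a list of tags; the function name highlight_by_tag and the sample nodes are hypothetical and are not taken from the dashboard's source code.

```python
def highlight_by_tag(nodes, tag):
    """Return the ids of nodes whose tag list contains `tag`."""
    return {node["id"] for node in nodes if tag in node.get("tags", [])}


# Hypothetical nodes, for illustration only.
nodes = [
    {"id": "dataset_a", "tags": ["qa", "table"]},
    {"id": "dataset_b", "tags": ["table"]},
    {"id": "dataset_c", "tags": []},
]

highlighted = highlight_by_tag(nodes, "qa")
print(sorted(highlighted))
```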

Organization philosophy

Data lives in two places:

datasets/<name>.yaml — one file per benchmark dataset, containing its metadata and any relations it introduces.
sources.yaml — raw, non-benchmark sources (corpora, annotation pools, external resources) that datasets draw from, plus any cross-cutting relations.

Keeping each dataset in its own file makes it easy to add, remove, or update a single entry without touching anything else. Relations are stored with the dataset that introduces the dependency, so provenance stays close to its source.
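
As a rough illustration of how the two locations could be combined into one graph, here is a sketch that merges already-parsed YAML documents (for example, the dictionaries returned by PyYAML's yaml.safe_load). The function name build_graph, the "kind" field, and the assumption that sources.yaml has top-level "sources" and "relations" keys are hypothetical; the actual layout used by the repository may differ.

```python
def build_graph(dataset_docs, sources_doc):
    """Merge parsed YAML documents into a node dict and an edge list.

    dataset_docs: one dict per datasets/<name>.yaml file.
    sources_doc:  the dict parsed from sources.yaml (layout assumed).
    """
    nodes, edges = {}, []

    for doc in dataset_docs:
        dataset = doc["dataset"]
        nodes[dataset["id"]] = {**dataset, "kind": "benchmark"}
        # Relations live with the dataset that introduces them.
        edges.extend(doc.get("relations") or [])

    # Assumed layout: a "sources" list plus optional cross-cutting "relations".
    for source in sources_doc.get("sources") or []:
        nodes[source["id"]] = {**source, "kind": "raw_source"}
    edges.extend(sources_doc.get("relations") or [])

    return nodes, edges


# Hypothetical input, for illustration only.
dataset_docs = [
    {
        "dataset": {"id": "my_dataset", "label": "My Dataset"},
        "relations": [
            {"from": "some_corpus", "to": "my_dataset", "type": "uses_source"}
        ],
    }
]
sources_doc = {"sources": [{"id": "some_corpus", "label": "Some Corpus"}]}

nodes, edges = build_graph(dataset_docs, sources_doc)
print(sorted(nodes))
print(len(edges))
```

Because each file contributes only its own node and relations, dropping a file from dataset_docs removes that dataset without any other change to the input.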

Edge types

samples_from — draws a subset of examples from the parent.
extends — adds labels, splits, or tasks on top of the parent.
derived_from — substantively re-processes or re-annotates the parent.
uses_method_of — borrows a collection or annotation methodology.
uses_source — pulls raw text/data from a non-benchmark source.
misc — any other meaningful relationship.
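
One way to guard against typos in a relation's type field is to check it against the six names listed above. The sketch below is an assumption about how such a check might look, not the repository's own code; EdgeType and parse_edge_type are hypothetical names.

```python
from enum import Enum


class EdgeType(str, Enum):
    """The six edge types listed in this section."""

    SAMPLES_FROM = "samples_from"
    EXTENDS = "extends"
    DERIVED_FROM = "derived_from"
    USES_METHOD_OF = "uses_method_of"
    USES_SOURCE = "uses_source"
    MISC = "misc"


def parse_edge_type(value):
    """Return the matching EdgeType, or raise ValueError for unknown names."""
    try:
        return EdgeType(value)
    except ValueError:
        allowed = ", ".join(t.value for t in EdgeType)
        raise ValueError(
            f"unknown edge type {value!r}; expected one of: {allowed}"
        ) from None


print(parse_edge_type("extends"))
```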

Extending locally

To add a new dataset, create datasets/my_dataset.yaml:

dataset:
  id: my_dataset
  label: "My Dataset"
  year: 2024
  venue: ACL
  url: https://...
  tags: [qa, table]
  notes: "Optional free-text notes."
relations:
  - from: parent_dataset
    to: my_dataset
    type: extends

Then regenerate the dashboard by running python generate.py (or python generate.py -o my_output.html for a custom filename). The script reads every *.yaml in datasets/ automatically — no other configuration needed.
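
Before regenerating, it can help to sanity-check a new entry. The following sketch works on an already-parsed document and reports missing fields and relation endpoints that do not match any known id. It is not part of generate.py: the function name check_entry is hypothetical, and treating id and label as required fields is an assumption.

```python
def check_entry(doc, known_ids):
    """Return a list of human-readable problems found in one dataset document."""
    problems = []
    dataset = doc.get("dataset") or {}

    # Assumption: id and label are mandatory; the real script may differ.
    for field in ("id", "label"):
        if not dataset.get(field):
            problems.append(f"dataset.{field} is missing")

    # A relation may point at the new dataset itself or at any known node.
    ids = set(known_ids)
    if dataset.get("id"):
        ids.add(dataset["id"])

    for i, rel in enumerate(doc.get("relations") or []):
        for end in ("from", "to"):
            if rel.get(end) not in ids:
                problems.append(
                    f"relations[{i}].{end} refers to unknown id {rel.get(end)!r}"
                )
        if "type" not in rel:
            problems.append(f"relations[{i}].type is missing")
    return problems


# Mirrors the example entry above, as it would look after YAML parsing.
entry = {
    "dataset": {"id": "my_dataset", "label": "My Dataset", "year": 2024},
    "relations": [
        {"from": "parent_dataset", "to": "my_dataset", "type": "extends"}
    ],
}

print(check_entry(entry, known_ids={"parent_dataset"}))
print(check_entry(entry, known_ids=set()))
```

The second call reports a problem because parent_dataset is not among the known ids, which is the kind of dangling reference a new file is most likely to introduce.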