This dashboard visualizes the lineage of tabular benchmark datasets used in NLP and data management research. Each node is a dataset; edges show how datasets relate to one another — whether one was sampled from another, extends it, or shares methodology. All data and source code used to generate this dashboard are open and available in this GitHub repo: https://github.com/inwonakng/table-corpus-lineage.
Click any node to see its metadata (year, venue, paper link, tags) in the right-hand panel. Click an edge to see the relationship type and any notes. Use the tag filter to highlight subsets of the graph.
Data lives in two places:
datasets/<name>.yaml: one file per benchmark dataset, containing its metadata and any relations it introduces.
sources.yaml: raw, non-benchmark sources (corpora, annotation pools, external resources) that datasets draw from, plus any cross-cutting relations.
Keeping each dataset in its own file makes it easy to add, remove, or update a single entry without touching anything else. Relations are stored with the dataset that introduces the dependency, so provenance stays close to its source.
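As a rough illustration of this layout, the sketch below merges already-parsed file contents into one set of nodes and one list of edges. It assumes each dataset file parses to a mapping with `dataset` and `relations` keys. The function name `merge_entries` and the `sources` / `relations` keys used for sources.yaml are hypothetical and are not taken from generate.py.

```python
from typing import Any


def merge_entries(
    dataset_files: list[dict[str, Any]],
    sources_file: dict[str, Any],
) -> tuple[dict[str, dict[str, Any]], list[dict[str, Any]]]:
    """Combine parsed per-dataset files and the sources file into nodes and edges."""
    nodes: dict[str, dict[str, Any]] = {}
    edges: list[dict[str, Any]] = []

    # Each dataset file contributes one node plus the relations it introduces.
    for entry in dataset_files:
        dataset = entry["dataset"]
        nodes[dataset["id"]] = dataset
        edges.extend(entry.get("relations") or [])

    # Assumed (hypothetical) sources.yaml layout: a "sources" list of nodes
    # and a "relations" list holding the cross-cutting edges.
    for source in sources_file.get("sources") or []:
        nodes[source["id"]] = source
    edges.extend(sources_file.get("relations") or [])

    return nodes, edges
```

Because every relation travels with the file that introduces it, deleting a dataset file in this sketch removes both its node and the edges it declared, without edits anywhere else.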
Each relation has one of the following types:
samples_from: draws a subset of examples from the parent.
extends: adds labels, splits, or tasks on top of the parent.
derived_from: substantively re-processes or re-annotates the parent.
uses_method_of: borrows a collection or annotation methodology.
uses_source: pulls raw text/data from a non-benchmark source.
misc: any other meaningful relationship.
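A small check like the following could catch misspelled relation types before the dashboard is regenerated. It is only a sketch: the six type names come from the list above, while `RELATION_TYPES` and `check_relation` are hypothetical names, and the required keys (`from`, `to`, `type`) are assumed from the example entry on this page rather than from generate.py.

```python
# Relation types listed on this page.
RELATION_TYPES = {
    "samples_from",
    "extends",
    "derived_from",
    "uses_method_of",
    "uses_source",
    "misc",
}


def check_relation(relation: dict) -> list[str]:
    """Return a list of problems found in one relation entry (empty if none)."""
    problems = []

    # Assumed required keys, based on the example dataset file.
    for key in ("from", "to", "type"):
        if key not in relation:
            problems.append(f"missing key: {key}")

    # Flag a type that is present but not one of the known names.
    rel_type = relation.get("type")
    if rel_type is not None and rel_type not in RELATION_TYPES:
        problems.append(f"unknown relation type: {rel_type}")

    return problems
```

For example, `check_relation({"from": "a", "to": "b", "type": "extends"})` returns an empty list, while an entry whose type is not in the set yields a single "unknown relation type" message.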
To add a new dataset, create datasets/my_dataset.yaml:
dataset:
  id: my_dataset
  label: "My Dataset"
  year: 2024
  venue: ACL
  url: https://...
  tags: [qa, table]
  notes: "Optional free-text notes."
relations:
  - from: parent_dataset
    to: my_dataset
    type: extends
Then regenerate the dashboard by running python generate.py (or python generate.py -o my_output.html for a custom filename). The script reads every *.yaml in datasets/ automatically; no other configuration is needed.
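The automatic pickup of new files can be pictured with the sketch below, which lists every .yaml file directly inside a directory using the standard library. This is an assumption about how such discovery might work, not a copy of generate.py; `find_dataset_files` is a hypothetical name.

```python
from pathlib import Path


def find_dataset_files(root: str = "datasets") -> list[Path]:
    """Return every .yaml file directly inside `root`, sorted by path."""
    # Path.glob is non-recursive for a plain "*.yaml" pattern, so files in
    # subdirectories and files with other extensions are ignored.
    return sorted(Path(root).glob("*.yaml"))
```

Under this approach, dropping datasets/my_dataset.yaml into the folder is enough for it to appear in the next run, since no file list has to be maintained by hand.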