Data Sources & Licenses

I built this on top of several open linguistic datasets. Every analysis you run surfaces data from one or more of them. Their authors did the work — this page names them and their licenses so attribution travels with the data.

Data sources used by Diachronica's etymology analyzer
Source | License | What I use it for
Wiktionary | CC BY-SA 4.0 | Etymology text, definitions, pronunciations, related words
Glottolog | CC BY 4.0 | Language families, tree structure, coordinates (~8,500 languoids)
Lexibank | CC BY 4.0 (per dataset) | Expert-annotated cognate sets (4,981 sets, 25,741 members). Cards on the etymology page mark Lexibank-sourced entries with a ★ badge.
IE-CoR | CC BY 4.0 | Indo-European Cognate Relationships: 25,731 lexemes across 160 languages with LIV²/NIL references. Distributed via Lexibank.
PHOIBLE | CC BY-SA 3.0 | Phoneme inventories (via the sister /api/linguistic/ endpoints)
WALS | CC BY 4.0 | Typological features (via /api/linguistic/)
ISO 639-3 | Open (SIL attribution) | Canonical three-letter language codes
COCA | Licensed (paid) | Modern American English word frequency, 1990–2019. Licensed from Mark Davies / english-corpora.org; redistribution is restricted, and summary statistics are surfaced here.
Wikimedia Commons | Varies (mostly CC BY-SA) | Illustrative images; each hero caption links back to the file page with its specific license

A note on share-alike

Wiktionary and PHOIBLE are released under share-alike licenses (CC BY-SA): anything I ship that meaningfully incorporates them inherits the same license terms. That means the etymology graphs and data tables you see here are redistributable under CC BY-SA. Take them, build on them, credit the source.
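
If you reuse data from this page, it can help to track which sources fed a given output and whether any of them carries a share-alike term. The following is a minimal Python sketch of that check, not code from Diachronica itself. The SOURCE_LICENSES table is transcribed from the list above, and the function names (share_alike_sources, attribution_line) are hypothetical. COCA, ISO 639-3, and Wikimedia Commons are left out because their terms are not a single Creative Commons license.

```python
# Hypothetical lookup table, transcribed from the source list on this page.
# Lexibank licenses are set per dataset; CC BY 4.0 is the value listed here.
SOURCE_LICENSES = {
    "Wiktionary": "CC BY-SA 4.0",
    "Glottolog": "CC BY 4.0",
    "Lexibank": "CC BY 4.0",
    "IE-CoR": "CC BY 4.0",
    "PHOIBLE": "CC BY-SA 3.0",
    "WALS": "CC BY 4.0",
}


def share_alike_sources(sources):
    """Return the sources whose license includes a share-alike (SA) term."""
    return [s for s in sources if "BY-SA" in SOURCE_LICENSES[s]]


def attribution_line(sources):
    """Build a one-line credit naming each source with its license."""
    return "; ".join(f"{s} ({SOURCE_LICENSES[s]})" for s in sources)


# Example: an etymology graph built from three of the sources above.
graph_sources = ["Wiktionary", "Glottolog", "Lexibank"]
print(share_alike_sources(graph_sources))
print(attribution_line(graph_sources))
```

A non-empty result from share_alike_sources is a signal that the combined output should be treated as share-alike. It is a bookkeeping aid only and does not replace reading the license texts.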

Something missing?

If I surfaced data from a source not named here, or if an attribution needs fixing, let me know. Reach out to luke@lukesteuber.com or open an issue at github.com/lukeslp/diachronica.