Data Sources & Licenses
I built this on top of several open linguistic datasets. Every analysis you run surfaces data from one or more of them. Their authors did the work — this page names them and their licenses so attribution travels with the data.
| Source | License | What I use it for |
|---|---|---|
| Wiktionary | CC BY-SA 4.0 | Etymology text, definitions, pronunciations, related words |
| Glottolog | CC BY 4.0 | Language families, tree structure, coordinates (~8,500 languoids) |
| Lexibank | CC BY 4.0 (per dataset) | Expert-annotated cognate sets (4,981 sets, 25,741 members). Cards on the etymology page mark Lexibank-sourced entries with a ★ badge. |
| IE-CoR | CC BY 4.0 | Indo-European Cognate Relationships — 25,731 lexemes across 160 languages with LIV²/NIL references. Distributed via Lexibank. |
| PHOIBLE | CC BY-SA 3.0 | Phoneme inventories (via the sister /api/linguistic/ endpoints) |
| WALS | CC BY 4.0 | Typological features (via /api/linguistic/) |
| ISO 639-3 | Open (SIL attribution) | Canonical three-letter language codes |
| COCA | Licensed (paid) | Modern American English word frequency, 1990–2019. Licensed from Mark Davies / english-corpora.org. Redistribution is restricted; summary statistics are surfaced here. |
| Wikimedia Commons | Varies (mostly CC BY-SA) | Illustrative images; each hero caption links back to the file page with its specific license |
A note on share-alike
Wiktionary and PHOIBLE are released under share-alike licenses: anything I ship that meaningfully incorporates them inherits the same license terms. That means the etymology graphs and data tables you see here are redistributable under CC BY-SA, so take them, build on them, and credit the source.
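As a rough illustration of how attribution can travel with the data, here is a minimal Python sketch. It is not the site's actual code: the `Source` class, the `SOURCES` registry, and `attribution_for` are hypothetical names, and the registry copies only a few rows of the table above. Given the sources an output draws on, it returns the credit lines to display and the names of any share-alike sources involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    license: str
    share_alike: bool  # True when the license carries a share-alike (SA) clause


# Hypothetical registry mirroring part of the table above (assumed structure).
SOURCES = {
    "wiktionary": Source("Wiktionary", "CC BY-SA 4.0", True),
    "glottolog": Source("Glottolog", "CC BY 4.0", False),
    "lexibank": Source("Lexibank", "CC BY 4.0 (per dataset)", False),
    "phoible": Source("PHOIBLE", "CC BY-SA 3.0", True),
    "wals": Source("WALS", "CC BY 4.0", False),
}


def attribution_for(source_keys):
    """Return (credit lines, share-alike source names) for the given sources."""
    used = [SOURCES[key] for key in source_keys]
    credits = [f"{s.name} ({s.license})" for s in used]
    share_alike = [s.name for s in used if s.share_alike]
    return credits, share_alike


# Example: an etymology card built from Wiktionary text and Glottolog families.
credits, share_alike = attribution_for(["wiktionary", "glottolog"])
print(credits)
print(share_alike)
```

A non-empty second list is the signal that the output incorporates share-alike material and should be passed along under matching terms. The sketch only flags the sources; it does not try to decide which license version applies to a combined work.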
Something missing?
If I surfaced data from a source not named here, or if an attribution needs fixing, let me know. Reach out to luke@lukesteuber.com or open an issue on the code at github.com/lukeslp/diachronica.