Data Sources & Licenses

I built this on top of several open linguistic datasets. Every analysis you run surfaces data from one or more of them. Their authors did the work; this page names them and their licenses so attribution travels with the data.

Data sources used by Diachronica's etymology analyzer
Source	License	What I use it for
Wiktionary	CC BY-SA 4.0	Etymology text, definitions, pronunciations, related words
Glottolog	CC BY 4.0	Language families, tree structure, coordinates (~8,500 languoids)
Lexibank	CC BY 4.0 (per dataset)	Expert-annotated cognate sets (4,981 sets, 25,741 members). Cards on the etymology page mark Lexibank-sourced entries with a ★ badge.
IE-CoR	CC BY 4.0	Indo-European Cognate Relationships: 25,731 lexemes across 160 languages with LIV²/NIL references. Distributed via Lexibank.
WOLD	CC BY 4.0	World Loanword Database: documented borrowings with source language and confidence. Powers the loanword badges (➜ glyph) and the "borrowed from X" pills on headwords and cognates. Haspelmath, Martin & Tadmor, Uri (eds.) 2009. WOLD. Leipzig: Max Planck Institute for Evolutionary Anthropology.
ASJP	CC BY 4.0	Automated Similarity Judgment Program: 40 Swadesh-list concepts surveyed across 11,540 languages. Powers the "attested in N languages worldwide" coverage stat. Wichmann, Søren, Eric W. Holman & Cecil H. Brown (eds.) 2022. The ASJP Database (version 20).
CLICS³	CC BY 4.0	Database of Cross-Linguistic Colexifications: which concepts share a single word across language families. Powers the "in other languages this word also means..." section. Rzymski, Tresoldi et al. 2020. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies.
PHOIBLE	CC BY-SA 3.0	Phoneme inventories (via the sister `/api/linguistic/` endpoints)
WALS	CC BY 4.0	Typological features (via `/api/linguistic/`)
ISO 639-3	Open (SIL attribution)	Canonical three-letter language codes
COCA	Licensed (paid)	Modern American English word frequency, 1990–2019. Licensed from Mark Davies / english-corpora.org. Redistribution is restricted; summary statistics are surfaced here.
Wikimedia Commons	Varies (mostly CC BY-SA)	Illustrative images and pronunciation recordings; each image caption and audio credit links back to the file page with its specific license
Google Books Ngrams	CC BY 3.0	Word frequency per decade, 1500s–2010s (English corpus, version 3), accessed via ngrams.dev. Powers the usage sparkline.
kaikki.org (wiktextract)	CC BY-SA (from Wiktionary)	Machine-readable pronunciation data (IPA and audio recording links). Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured Data, LREC 2022, pp. 1317–1325.
Natural Earth	Public domain	Simplified 1:110m coastlines behind the journey map in the mobile app, via world-atlas.

A note on share-alike

Wiktionary and PHOIBLE are released under share-alike licenses: anything I ship that meaningfully incorporates their data inherits the same license terms. That means the etymology graphs and data tables you see here are redistributable under CC BY-SA: take them, build on them, credit the source.

Something missing?

If I surfaced data from a source not named here, or if an attribution needs fixing, let me know. Reach out to luke@lukesteuber.com or open an issue at github.com/lukeslp/diachronica.