Error bars for lexicostatistical estimates, with a case study comparing the diversity of Chinese and Romance

Source document: Linguistica Brunensia. 2024, vol. 72, iss. 1, pp. 5-21
Extent: 5-21
ISSN: 2336-4440 (online)
Type: Article
Rights access: open access
 


Abstract(s)
This paper applies statistical techniques for measuring sampling error to lexicostatistics, a field in which error has often been discussed, but only rarely measured. We specifically calculate a margin of error for lexicostatistical comparisons based on Swadesh-type vocabulary lists, and use chi-squared tests to estimate a minimum threshold for when two lexicostatistical measurements will be statistically significantly different from one another. The article includes charts which mathematically unsophisticated scholars can easily use to check margins of error. We use margin of error calculations to test the claim that the relative internal diversity of Romance "languages" and Chinese "dialects" is equivalent, finding that no result is possible with extant lexicostatistical studies. We end by suggesting that lexicostatistical dendrograms depict uncertainty with "fat branches," that is, branches whose width corresponds to statistical uncertainty.
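The two calculations named in the abstract can be illustrated with a minimal sketch. It is not taken from the article: it assumes the standard normal-approximation margin of error for a proportion and an uncorrected Pearson chi-squared test on a 2x2 table, and it treats each item on the word list as an independent observation. The function names and the figures in the usage lines (80 and 70 shared items on 100-item lists) are hypothetical.

```python
import math

# Two-sided 95% critical value of the standard normal distribution.
Z_95 = 1.959964
# 95% critical value of the chi-squared distribution with 1 degree of freedom.
CHI2_CRIT_95_DF1 = 3.841


def margin_of_error(shared: int, list_size: int, z: float = Z_95) -> float:
    """Normal-approximation margin of error for a proportion of shared items.

    Assumption: list items behave like an independent random sample.
    """
    p = shared / list_size
    return z * math.sqrt(p * (1 - p) / list_size)


def chi_squared_2x2(shared_a: int, size_a: int, shared_b: int, size_b: int) -> float:
    """Pearson chi-squared statistic (no continuity correction) comparing
    two shared-vocabulary counts. Undefined if every item, or no item,
    is shared in both comparisons (an expected count would be zero).
    """
    observed = [
        [shared_a, size_a - shared_a],
        [shared_b, size_b - shared_b],
    ]
    row_totals = [size_a, size_b]
    col_totals = [shared_a + shared_b, (size_a - shared_a) + (size_b - shared_b)]
    total = size_a + size_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical figures: pair A shares 80 of 100 items, pair B shares 70 of 100.
moe = margin_of_error(80, 100)
stat = chi_squared_2x2(80, 100, 70, 100)
print(f"80% +/- {moe:.1%}")
print(f"chi-squared = {stat:.2f}, significant at 5%: {stat > CHI2_CRIT_95_DF1}")
```

Under these assumptions, a ten-point gap between two 100-item comparisons falls short of significance at the 5% level, which is the kind of threshold question the abstract raises.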