The present book discusses the possible applications, as well as the limitations, of corpus and quantitative methods in historical linguistics, with special attention paid to the history of the Polish language. The authors examine several language changes that were well known to previous generations of scholars in order to obtain a more detailed picture of language evolution. They also use machine-learning methods of classification to detect language phenomena that have not previously been reported in the literature. In other words, the book combines the corpus-based and the corpus-driven approaches.

Chapter 1 provides a general overview of the advances of corpus and quantitative linguistics in examining the history of a language. Understandably, the vast majority of the studies conducted so far focus on the history of English, which is the best-resourced language in terms of both the available corpora and natural language processing tools.

Chapter 2 discusses the research material used in the present study, namely an annotated diachronic corpus of the Polish language covering the timespan 1380–1850. The corpus was compiled from existing diachronic corpora and supplemented by several texts that were scraped from the internet, OCR-ed from printed sources, or manually transcribed from early modern prints. The chapter also discusses the different text normalization and modernization strategies applied to particular texts so that the corpus can be searched in a consistent way across the centuries. In its final form, the diachronic corpus consists of over 12 million words. Although inherently opportunistic, the corpus tries to meet the requirement of representativeness and to provide an even coverage of texts over the centuries, so as to reduce the timespans not represented by any text.

Chapter 3 is devoted to modeling several language changes in Polish that took place in the 15th–18th centuries. These include isolated changes (więtszy > większy, abo > albo, barzo > bardzo, wszytek > wszystek), a morphological change of the superlative marker (na- > naj-), changes in verbal inflection (-bych > -bym, -bychmy > -byśmy, and -łech > -łem), and a phonological change (-ir- > -ier-). Since the replacement of an older (recessive) form by a new (innovative) one is gradual and prolonged in time, yet never linear, it can be best modeled by logistic regression. This is a mathematical model designed to capture a dynamic transition between two states or two phases of a given phenomenon, and it seems to describe language changes quite well. In fact, most of the changes chosen for this study could be modeled with reasonable accuracy. A special case is abo > albo, which turned out to be a reversed (or not fully accomplished) change. It can be modeled either with polynomial logistic regression (with relatively high accuracy) or, less accurately, with a combination of two independent logistic models.
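To make the idea of an S-shaped replacement curve concrete, the sketch below fits p(t) = 1 / (1 + exp(-(a + b·t))) to the share of an innovative form per period. It is only an illustration: the counts are invented, the function names are hypothetical, and the fit uses ordinary least squares on empirical logits, which is a simplification and not necessarily the fitting procedure used in the book.

```python
import math

def fit_logistic(years, innovative, total):
    """Fit p(t) = 1 / (1 + exp(-(a + b*t))) by least squares on empirical logits.

    Assumption: this linearised fit stands in for a full logistic regression.
    """
    # Empirical logit; the 0.5 correction keeps 0% and 100% shares finite.
    ys = [math.log((k + 0.5) / (n - k + 0.5)) for k, n in zip(innovative, total)]
    xs = list(years)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope: speed of the change
    a = mean_y - b * mean_x    # intercept
    return a, b

def predict(a, b, year):
    """Expected share of the innovative form in a given year."""
    return 1.0 / (1.0 + math.exp(-(a + b * year)))

# Invented counts of an innovative form per 100 attestations (not real data).
years = [1500, 1550, 1600, 1650, 1700, 1750]
innovative = [5, 20, 50, 80, 95, 99]
total = [100] * 6

a, b = fit_logistic(years, innovative, total)
midpoint = -a / b  # year in which both forms are expected to be equally frequent
print(f"slope={b:.4f}, midpoint={midpoint:.0f}")
```

The slope b expresses how fast the innovative form spreads, and the midpoint -a/b marks the year in which the two variants are expected to be in balance.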

When modeling any diachronic process, the researcher has to divide the corpus into chronologically ordered subcorpora of an arbitrarily set size, and this decision may affect the final results. Obviously, fewer but bigger subcorpora (e.g., 10 units covering 50 years each) provide smoother results, while smaller but more numerous subcorpora (e.g., 50 units covering 10 years each) lead to fine-grained outcomes. To examine how stable a model remains as the size of the subcorpora changes, the fit of the logistic regression and the input parameters were systematically tested. The tests showed that 20 years is the minimal timespan of a subcorpus that yields credible data. Still, the size of the subcorpus affected the goodness of fit only to a very limited extent for those changes where the observed and expected values were close to each other. Conversely, where the observed values were far removed from the expected ones, the goodness of fit increased with the timespan of the subcorpus.
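The effect of the subcorpus size can be explored by re-binning the same dated counts with different window widths. The following sketch shows only the binning step; the record format, the function name, and the counts are assumptions made for illustration, and the goodness-of-fit test itself is not reproduced here.

```python
from collections import defaultdict

def bin_counts(records, width, start):
    """Aggregate (year, innovative_count, total_count) records into windows.

    Returns a list of (window_start, innovative, total, share) tuples.
    """
    bins = defaultdict(lambda: [0, 0])
    for year, k, n in records:
        idx = (year - start) // width   # index of the window the text falls into
        bins[idx][0] += k
        bins[idx][1] += n
    return [
        (start + idx * width, k, n, k / n)
        for idx, (k, n) in sorted(bins.items())
        if n > 0                        # skip windows without any attestation
    ]

# Invented data: one text every 5 years, 20 attestations each.
records = [(1500 + 5 * i, i, 20) for i in range(20)]

coarse = bin_counts(records, width=50, start=1500)  # few large subcorpora
fine = bin_counts(records, width=10, start=1500)    # many small subcorpora
print(len(coarse), len(fine))
```

Each binned series can then be passed to the same model, so that the fit obtained with, say, 10-year and 50-year windows can be compared directly.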

Chapter 4 discusses different methods of automatic text classification, including multidimensional scaling, bootstrap consensus networks, supervised binary classifiers, and so forth. First and foremost, however, it introduces a new method of finding turning points in the history of a language. The underlying assumption is that, although language evolves continuously, there are certain moments when this evolution accelerates. Here, a corpus is a collection of chronologically ordered texts. If there is a turning point in the evolution of the language, then the texts written before this date should be more similar to each other than to those created after it, and vice versa. The corpus is divided into two subcorpora, one preceding and the other following the hypothetical date of maximal change; for the sake of convenience we call them ante and post, respectively. A supervised classification is then conducted in order to attribute each text to either the ante or the post class. The entire procedure is repeated several times, with each iteration shifting the date dividing the corpus by a fixed number of years. The highest accuracy over the iterations indicates that the corpus has been divided into the two most dissimilar subcorpora or, in other words, that the texts written before and after that date show the biggest difference. This, in turn, suggests that such a date marks a major change in the course of the language's evolution.
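The sliding ante/post procedure can be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid classifier evaluated with leave-one-out cross-validation stands in for whatever classifier the authors actually used, the function names are hypothetical, and the feature vectors are invented.

```python
def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def loo_accuracy(texts, split_year):
    """Leave-one-out accuracy of a nearest-centroid ante/post classifier."""
    correct = 0
    for i, (year, vec) in enumerate(texts):
        rest = texts[:i] + texts[i + 1:]
        ante = [v for y, v in rest if y < split_year]
        post = [v for y, v in rest if y >= split_year]
        if not ante or not post:
            return 0.0  # the split leaves one class empty
        guess_post = sq_dist(vec, centroid(post)) < sq_dist(vec, centroid(ante))
        correct += guess_post == (year >= split_year)
    return correct / len(texts)

def find_turning_point(texts, first, last, step):
    """Shift the dividing date and return the one with the highest accuracy."""
    scores = {y: loo_accuracy(texts, y) for y in range(first, last + 1, step)}
    return max(scores, key=scores.get), scores

# Invented corpus: one text per decade, with an abrupt shift in 1600.
texts = []
for year in range(1500, 1700, 10):
    noise = ((year // 10) % 3) * 0.01
    base = [0.1, 0.9] if year < 1600 else [0.9, 0.1]
    texts.append((year, [base[0] + noise, base[1] - noise]))

best, scores = find_turning_point(texts, first=1520, last=1680, step=10)
print(best, scores[best])
```

With real data the accuracy curve is rarely this clean, so it is worth inspecting the whole `scores` dictionary rather than only its maximum.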

Another approach to detecting major turning points is a variant of hierarchical cluster analysis. In its classical form, the method successively merges the most similar subcorpora until they are finally grouped into two clusters. In the variant used here, only neighboring subcorpora are allowed to merge. Consequently, the two top clusters cover the subcorpora preceding and following a certain date. The two most dissimilar adjacent subcorpora are the one directly before and the one directly after that date, which can therefore be seen as the moment of maximal change in the phenomenon observed. In this study, prepositions were used as a discriminator. The most significant change in the relative frequency of prepositions occurred in the 16th century, which is commonly taken to be the beginning of the Middle Polish period.
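A chronologically constrained clustering of this kind can be sketched in a few lines. The example is illustrative only: the linkage criterion (distance between cluster centroids) is an assumption, the function name is hypothetical, and the preposition frequencies are invented.

```python
def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def constrained_clusters(labels, vectors, k=2):
    """Agglomerative clustering in which only adjacent clusters may merge."""
    clusters = [([label], [vec]) for label, vec in zip(labels, vectors)]
    while len(clusters) > k:
        # Find the pair of neighbours whose centroids are closest.
        i = min(
            range(len(clusters) - 1),
            key=lambda j: sq_dist(centroid(clusters[j][1]),
                                  centroid(clusters[j + 1][1])),
        )
        left, right = clusters[i], clusters[i + 1]
        clusters[i:i + 2] = [(left[0] + right[0], left[1] + right[1])]
    return [members for members, _ in clusters]

# Invented relative frequencies of two prepositions per half-century.
labels = [1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750]
vectors = [
    [0.10, 0.30], [0.11, 0.29], [0.12, 0.29],
    [0.20, 0.22], [0.21, 0.21], [0.22, 0.21], [0.22, 0.20], [0.23, 0.20],
]

ante, post = constrained_clusters(labels, vectors)
print(ante, post)
```

Because merges are restricted to neighbors, the two final clusters are always contiguous in time, and the boundary between them can be read directly as a candidate turning point.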

Chapter 5 is a case study on the history of a selected grammatical category, namely the adverbial past participle. Since the frequency of this category underwent a substantial change, the question arises as to which underlying factor was responsible for it. Two answers were considered: either it was mere linguistic fashion that made speakers overuse the form in the 17th century or, alternatively, the form became more productive. In the first case, a similar set of verbs would be used with higher frequency. In the second, the number of word types would increase significantly, presumably because the semantic restrictions imposed on coining these participles became somewhat looser. Productivity was estimated with quantitative criteria only. The data show that there are some changes in productivity, but they are less significant than the changes in frequency. This, in turn, supports both claims: the form did become somewhat fashionable, and this consequently led to higher productivity. The subsequent drop in frequency was followed by a much smaller drop in productivity.
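The contrast between the two scenarios can be illustrated by comparing token frequency with the number of types. The sketch below is hypothetical throughout: the participle lists and corpus size are invented, and the hapax ratio (hapax legomena divided by tokens of the category) is one common quantitative productivity measure, not necessarily the one used in the book.

```python
from collections import Counter

def profile(participles, corpus_size):
    """Frequency and productivity indicators for one period."""
    counts = Counter(participles)
    tokens = sum(counts.values())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "per_million": tokens / corpus_size * 1_000_000,  # token frequency
        "types": len(counts),                             # distinct participles
        "hapax_ratio": hapaxes / tokens,                  # assumed productivity measure
    }

# Invented attestations in three equally sized subcorpora (1,000 words each).
baseline = ["rzekłszy"] * 3 + ["wziąwszy"] * 2
fashion = ["rzekłszy"] * 6 + ["wziąwszy"] * 4            # same verbs, used more often
productive = baseline + ["usłyszawszy", "uczyniwszy", "przyszedłszy",
                         "ujrzawszy", "zebrawszy"]       # new verbs join the category

for name, sample in [("baseline", baseline), ("fashion", fashion),
                     ("productive", productive)]:
    print(name, profile(sample, corpus_size=1000))
```

In the "fashion" sample the frequency doubles while the number of types stays the same, whereas in the "productive" sample the same rise in frequency comes with more types and a higher hapax ratio.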

Górski, R.L., Król, M. & Eder, M. (2019). Zmiana w języku. Studia kwantytatywno-korpusowe. Kraków: IJP PAN.

A PDF version can be downloaded here. Electronic versions are available here (EPUB) and here (Kindle). The original figures can be found here, and the datasets used in the book are available here.