C. Beck and M. Köllner, “GHisBERT -- Training BERT from scratch for lexical semantic investigations across historical German language stages,” in
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, N. Tahmasebi, S. Montariol, H. Dubossarsky, A. Kutuzov, S. Hengchen, D. Alfter, F. Periti, and P. Cassotti, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 33–45. [Online]. Available:
https://aclanthology.org/2023.lchange-1.4
Abstract
While static embeddings have dominated computational approaches to lexical semantic change for quite some time, recent approaches try to leverage the contextualized embeddings generated by the language model BERT for identifying semantic shifts in historical texts. However, despite their usability for detecting changes in the more recent past, it remains unclear how well language models scale to investigations going back further in time, where the language differs substantially from the training data underlying the models. In this paper, we present GHisBERT, a BERT-based language model trained from scratch on historical data covering all attested stages of German (going back to Old High German, c. 750 CE). Given a lack of ground truth data for investigating lexical semantic change across historical German language stages, we evaluate our model via a lexical similarity analysis of ten stable concepts. We show that, in comparison with an unmodified and a fine-tuned German BERT-base model, our model performs best in terms of assessing inter-concept similarity as well as intra-concept similarity over time. This in turn argues for the necessity of pre-training historical language models from scratch when working with historical linguistic data.
M. Butt, L. Carnesale, and T. Ahmed, “Experiencers vs. agents in Urdu/Hindi nominalized verbs of perception,” in
Proceedings of the Lexical Functional Grammar Conference, vol. 28. 2023, pp. 90–113. [Online]. Available:
https://lfg-proceedings.org/lfg/index.php/main/article/view/46
Abstract
Urdu/Hindi displays a curious construction in which a nominalized verb of perception combines with the verb ‘give’. As an experiencer predicate, it takes a dative subject; however, there is no other instance in the language in which the subject of ‘give’ is a dative. Furthermore, the verb ‘give’ is a three-place predicate, but the N-V experiencer predicate is only two-place. We propose an analysis by which the construction originates in a ditransitive agentive N-V complex predicate whose goal argument is reanalyzed into an experiencer. We propose that the mechanism is similar to that posited by Schätzle (2018) for the rise of dative subjects in Icelandic, where an originally locative predication gave rise to experiencer predicates.
D. Hägele
et al., “Uncertainty Visualization: Fundamentals and Recent Developments,”
it - Information Technology, vol. 64, no. 4–5, 2022, doi:
10.1515/itit-2022-0033.
Abstract
This paper provides a brief overview of uncertainty visualization along with some fundamental considerations on uncertainty propagation and modeling. Starting from the visualization pipeline, we discuss how the different stages along this pipeline can be affected by uncertainty and how they can deal with this and propagate uncertainty information to subsequent processing steps. We illustrate recent advances in the field with a number of examples from a wide range of applications: uncertainty visualization of hierarchical data, multivariate time series, stochastic partial differential equations, and data from linguistic annotation.
H. Booth and C. Beck, “Verb-second and Verb-first in the History of Icelandic,”
Journal of Historical Syntax, vol. 5, no. 27, 2021, doi:
10.18148/hs/2021.v5i28.112.
Abstract
The occurrence of V1 declaratives in Icelandic has attracted much attention in the generative literature (e.g. Sigurðsson 1990, Franco 2008), and such structures are known to be more frequent in earlier stages compared to the modern language. In this paper, we provide an account for the diachrony of V1 and V2 in Icelandic where the decreasing frequency of V1 is argued to be related to an ongoing change concerning the preferred structural position for subject topics. Our claims are supported by corpus evidence from IcePaHC (Wallenberg, Ingason, Sigurðsson & Rögnvaldsson 2011) and the formal analysis is conducted within Lexical Functional Grammar, which allows us to neatly capture the changing associations between clause structure and information structure. As we show, this overall change can also be linked to wider diachronic developments in Icelandic involving Stylistic Fronting and expletives.
R. Sevastjanova, A.-L. Kalouli, C. Beck, H. Schäfer, and M. El-Assady, “Explaining Contextualization in Language Models using Visual Analytics,” in
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 464–476. doi:
10.18653/v1/2021.acl-long.39.
Abstract
Despite the success of contextualized language models on various NLP tasks, it is still unclear what these models really learn. In this paper, we contribute to the current efforts of explaining such models by exploring the continuum between function and content words with respect to contextualization in BERT, based on linguistically-informed insights. In particular, we utilize scoring and visual analytics techniques: we use an existing similarity-based score to measure contextualization and integrate it into a novel visual analytics technique, presenting the model's layers simultaneously and highlighting intra-layer properties and inter-layer differences. We show that contextualization is neither driven by polysemy nor by pure context variation. We also provide insights on why BERT fails to model words in the middle of the functionality continuum.
C. Beck, “DiaSense at SemEval-2020 Task 1: Modeling Sense Change via Pre-trained BERT Embeddings,” in
Proceedings of the Fourteenth Workshop on Semantic Evaluation. Barcelona (online): International Committee for Computational Linguistics, Dec. 2020, pp. 50–58. [Online]. Available:
https://www.aclweb.org/anthology/2020.semeval-1.4
Abstract
This paper describes DiaSense, a system developed for Task 1 ‘Unsupervised Lexical Semantic Change Detection’ of SemEval 2020. In DiaSense, contextualized word embeddings are used to model word sense changes. This allows for the calculation of metrics which mimic human intuitions about the semantic relatedness between individual use pairs of a target word for the assessment of lexical semantic change. DiaSense is able to detect lexical semantic change in English, German, Latin and Swedish (accuracy = 0.728). Moreover, DiaSense differentiates between weak and strong change.
C. Schätzle and M. Butt, “Visual Analytics for Historical Linguistics: Opportunities and Challenges,”
Journal of Data Mining and Digital Humanities, 2020, doi:
10.46298/jdmdh.6707.
Abstract
In this paper we present a case study in which Visual Analytic methods for interactive data exploration are applied to the study of historical linguistics. We discuss why diachronic linguistic data poses special challenges for Visual Analytics and show how these are handled in a collaboratively developed web-based tool: HistoBankVis. HistoBankVis allows an immediate and efficient interaction with underlying diachronic data and we go through an investigation of the interplay between case marking and word order in Icelandic and Old Saxon to illustrate its features. We then discuss challenges posed by the lack of annotation standardization across different corpora as well as the problems we encountered with respect to errors, uncertainty and issues of data provenance. Overall we conclude that the integration of Visual Analytics methodology into the study of language change has an immense potential but that the full realization of its potential will depend on whether issues of data interoperability and annotation standards can be resolved.
C. Beck, H. Booth, M. El-Assady, and M. Butt, “Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias,” in
Proceedings of the 14th Linguistic Annotation Workshop. Barcelona, Spain: Association for Computational Linguistics, Dec. 2020, pp. 60–73. [Online]. Available:
https://www.aclweb.org/anthology/2020.law-1.6
Abstract
The development of linguistic corpora is fraught with various problems of annotation and representation. These constitute a very real challenge for the development and use of annotated corpora, but as yet not much literature exists on how to address the underlying problems. In this paper, we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias. We outline and characterize these sources, discussing how their improper treatment can have stark consequences for research outcomes. Finally, we discuss how an adequate treatment can inform corpus-related linguistic research, both computational and theoretical, improving the reliability of research results and NLP models, as well as informing the more general reproducibility issue.
C. Schätzle, F. L. Dennig, M. Blumenschein, D. A. Keim, and M. Butt, “Visualizing Linguistic Change as Dimension Interactions,” in
Proceedings of the International Workshop on Computational Approaches to Historical Language Change. 2019, pp. 272–278. doi:
10.18653/v1/W19-4734.
Abstract
Historical change typically is the result of complex interactions between several linguistic factors. Identifying the relevant factors and understanding how they interact across the temporal dimension is the core remit of historical linguistics. With respect to corpus work, this entails a separate annotation, extraction and painstaking pair-wise comparison of the relevant bits of information. This paper presents a significant extension of HistoBankVis, a multilayer visualization system which allows a fast and interactive exploration of complex linguistic data. Linguistic factors can be understood as data dimensions which show complex interrelationships. We model these relationships with the Parallel Sets technique. We demonstrate the powerful potential of this technique by applying the system to understanding the interaction of case, grammatical relations and word order in the history of Icelandic.
Abstract
In this paper, we present a revised LFG account for Icelandic clause structure, factoring in new historical data from IcePaHC (Wallenberg et al., 2011). This builds on previous work by Sells (2001, 2005) and Booth et al. (2017), focusing more closely on the syntactic encoding of information structure. Based on findings from a series of corpus-based investigations, we argue that the functional category I was already obligatory in Old Icelandic, accounting for both V1 and V2 orders and the absence of V3/V-later orders. Moreover, we show that the basic c-structure skeleton persists throughout the diachrony; what changes is the way in which information structure is syntactically encoded, i.e. the association between c- and i-structure. Topics increasingly target SpecIP, which allows the finite verb in I to serve as a boundary between topic and comment. This goes hand in hand with certain discourse adverbs losing their function as a discourse partitioner in the midfield and ties in with other changes shown for Icelandic (Booth et al., 2017).
C. Schätzle and H. Booth, “DiaHClust: an Iterative Hierarchical Clustering Approach for Identifying Stages in Language Change,” in
Proceedings of the International Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics, 2019, pp. 126–135. doi:
10.18653/v1/W19-4716.
Abstract
Language change is often assessed against a set of pre-determined time periods in order to be able to trace its diachronic trajectory. This is problematic, since a pre-determined periodization might obscure significant developments and lead to false assumptions about the data. Moreover, these time periods can be based on factors which are either arbitrary or non-linguistic, e.g., dividing the corpus data into equidistant stages or taking into account language-external events. Addressing this problem, in this paper we present a data-driven approach to periodization: ‘DiaHClust’. DiaHClust is based on iterative hierarchical clustering and offers a multi-layered perspective on change from text-level to broader time periods. We demonstrate the usefulness of DiaHClust via a case study investigating syntactic change in Icelandic, modelling the syntactic system of the language in terms of vectors of syntactic change.
Abstract
The Icelandic case system presents an interesting linguistic puzzle. Languages tend to use either word order, case and/or agreement to signal grammatical relations (Kiparsky 1987, 1988, 1997). Icelandic is atypical in this respect as it has a rather rigid word order, but also retained a rich morphological case system over the centuries. Moreover, non-nominative subjects exist in the language, with in particular the synchronic existence of dative subjects being well-established (Andrews 1976, Zaenen et al. 1985). From a diachronic perspective, dative subjects have also attracted a good deal of research, specifically with respect to the question about whether dative subjects are a common Proto-Indo-European feature or whether they are a more recent historical innovation (see, e.g., Haspelmath 2001, Barðdal and Eythórsson 2009, Barðdal et al. 2012). In this thesis, I investigate factors conditioning the diachronic occurrence of dative subjects in the Icelandic Parsed Historical Corpus (IcePaHC, Wallenberg et al. 2011) to provide a window of understanding of the complex system licensing grammatical relations in the language, contributing to the discussion which evolved around the historical origin of dative subjects. As method of investigation, I utilize novel visualization techniques coming from the field of Visual Analytics (Keim et al. 2008). The investigations presented in this thesis show that dative subjects are part of a complex interlinked system in which case, word order, grammatical relations, lexical semantics and event structure interact in the mapping of arguments to grammatical relations. For one, I provide my findings with respect to the interaction between dative subjects, thematic roles, event structure and voice in IcePaHC, showing that the distribution of dative subjects has been changing in the history of Icelandic, in particular with respect to an increasingly systematic association between dative subjects and experiencer semantics. 
This correlates with an increasing use of verbs carrying middle morphology, which have been lexicalized as stative experiencer predicates with a dative subject over time. I furthermore present an investigation of the interaction between subject case and word order which examines the interrelation between dative case, subject positions, and verb placement in IcePaHC. This investigation provides evidence for the diachronic development of structure and the rise of positional licensing in the language (in line with Kiparsky 1997); developments in which dative subjects consistently lag behind. For the theoretical analysis of the historical developments observed in IcePaHC, I present a novel linking theory couched in the Lexical-Functional Grammar (LFG) framework in this thesis. My linking theory builds on the enhancements of LFG’s Lexical Mapping Theory by Zaenen (1993) and Kibort (2014) with respect to lexical semantic entailments and argument positions, separating out lexical semantics from structural positions. As core component of the linking system, I implement a reference frame in the form of Talmy’s (1978) figure-ground division, which functions as mediator between word order, lexical semantics, and event structure. Grammatical relations are linked to arguments via a set of lexical semantic entailments which follow from the event structure, the reference frame, and the sentience of arguments, associating grammatical relations with particular structural positions. Event structure is encoded in the linking system via the event participants assumed in Ramchand’s (2008) event-decompositional framework of the first-phase syntax and is taken to license case marking in Icelandic as has been suggested by Svenonius (2002). 
Overall, the linking analysis of the diachronic corpus data shows that the licensing conditions for case and grammatical relations have been changing over time, which questions the inheritance of a stable and monolithic dative subject construction from earlier language stages.
A. Hautli-Janisz, C. Rohrdantz, C. Schätzle, A. Stoffel, M. Butt, and D. A. Keim, “Visual Analytics in Diachronic Linguistic Investigations,” Linguistic Visualizations, 2018.
C. Schätzle, “Genitiv als Stilmittel in der Novelle,”
Scalable Reading. Zeitschrift für Literaturwissenschaft und Linguistik (LiLi), vol. 47, pp. 125–140, 2017, doi:
10.1007/s41244-017-0043-9.
Abstract
In this paper, I present several corpus linguistic studies that show the continuity of the diachronic loss of the German genitive within novellas from the past two centuries. Moreover, I found that not all genitive constructions are diachronically receding and that e.g. the adnominal genitive is particularly stable along the analyzed time frame. Furthermore, some authors in Paul Heyse’s Deutscher Novellenschatz use genitives as a stylistic device in order to relate their novellas to a specific register or an exalted stylistic level.
C. Schätzle, M. Hund, F. L. Dennig, M. Butt, and D. A. Keim, “HistoBankVis: Detecting Language Change via Data Visualization,” in
Proceedings of the NoDaLiDa 2017 Workshop Processing Historical Language, G. Bouma and Y. Adesam, Eds. Linköping University Electronic Press, 2017, pp. 32–39. [Online]. Available:
https://www.aclweb.org/anthology/W17-0507
Abstract
We present HistoBankVis, a novel visualization system designed for the interactive analysis of complex, multidimensional data to facilitate historical linguistic work. In this paper, we illustrate the visualization’s efficacy and power by means of a concrete case study investigating the diachronic interaction of word order and subject case in Icelandic.
Abstract
We present the results of research on two areas of Icelandic historical syntax: dative subjects and V1 word order. These strands of syntax had previously been examined independently, but were found to be intimately connected as part of a broader collaboration between theoretical and computational linguistics involving the Icelandic Parsed Historical Corpus (IcePaHC). The interaction we found between V1 declaratives and dative subjects provides evidence for: a) changes over time with respect to the association of dative arguments with the subject role (contra Barðdal and Eythórsson 2009); b) the gradual development of left peripheral structure and the rise of positional licensing (in line with Kiparsky 1995, 1997). We provide an analysis of positional licensing in LFG terms and account for the newly observed complex interaction between datives, subjects and word order presented in this paper.
C. Schätzle and D. Sacha, “Visualizing Language Change: Dative Subjects in Icelandic,” in
Proceedings of the LREC 2016 Workshop VisLRII: Visualization as Added Value in the Development, Use and Evaluation of Language Resources. 2016, pp. 8–15. [Online]. Available:
http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-VisLR%20II_Proceedings.pdf
Abstract
This paper presents a visualization tool for the analysis of diachronic multidimensional language data. Our tool was developed with respect to a corpus study of dative subjects in Icelandic based on the Icelandic Parsed Historical Corpus (Wallenberg et al., 2011) which investigates determining factors for the appearance of dative subjects in the history of Icelandic. The visualization provides an interactive access to the underlying multidimensional data and significantly facilitates the analysis of the complex diachronic interactions of factors at hand. We were able to identify various interactions of conditioning factors for dative subjects in Icelandic via the visualization tool and showed that dative subjects are increasingly associated with experiencer arguments in Icelandic across time. We also found that the rise of dative subjects with experiencer arguments is correlated with an increasing use of middle voice. This lexical semantic change argues against dative subjects as a Proto Indo-European inheritance. Moreover, the visualization helped us to draw conclusions about uncertainties and problems of our lexical semantic data annotation which will be revised for future work.
C. Schulz
et al., “Generative Data Models for Validation and Evaluation of Visualization Techniques,” in
Proceedings of the Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV). ACM, 2016, pp. 112–124. doi:
10.1145/2993901.2993907.
Abstract
We argue that there is a need for substantially more research on the use of generative data models in the validation and evaluation of visualization techniques. For example, user studies will require the display of representative and unconfounded visual stimuli, while algorithms will need functional coverage and assessable benchmarks. However, data is often collected in a semi-automatic fashion or entirely hand-picked, which obscures the view of generality, impairs availability, and potentially violates privacy. There are some sub-domains of visualization that use synthetic data in the sense of generative data models, whereas others work with real-world-based data sets and simulations. Depending on the visualization domain, many generative data models are "side projects" as part of an ad-hoc validation of a techniques paper and thus neither reusable nor general-purpose. We review existing work on popular data collections and generative data models in visualization to discuss the opportunities and consequences for technique validation, evaluation, and experiment design. We distill handling and future directions, and discuss how we can engineer generative data models and how visualization research could benefit from more and better use of generative data models.