Event date: January 22, 2018, 9:00 AM to 6:00 PM

Co-located with the Treebanks and Linguistic Theory (TLT) conference 2018 in Prague
Charles University, Prague, Czech Republic
January 23-24, 2018

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Call for posters!

Deadline for submission: December 22nd, 2017
Notification of acceptance: December 31st, 2017

Further information about the call for posters is available at https://typo.uni-konstanz.de/dataprovenance/index.php/call-for-posters/

------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

Invited Speakers

Adriane Boyd, Universität Tübingen
Peter Buneman, University of Edinburgh
Nicoletta Calzolari, Italian National Research Council
Sarah Cohen Boulakia, Université Paris Sud

 

The workshop seeks to bring together researchers from the fields of provenance, data annotation, and data curation with researchers working within computational linguistics and dealing with the annotation of language data. Provenance is concerned with understanding how to model, record, and share metadata about the origin of data and the further sharing or processing that data has undergone. While provenance has been studied in various domains (e.g., for business applications or in the life sciences), many of the central issues are also of vital interest for computational linguistics.

For example, "data cleaning" and data curation both have serious repercussions for the reproducibility of analyses or experiments. In general, computational linguistic work with data tends to involve several pre-processing steps (stop-lists, data normalization, error correction, or filtering out information that is considered not to be at issue). However, these steps are seldom documented or described in detail. Data sets may also undergo several rounds of pre-processing, with information about the successive changes again not well documented. Data may also be automatically or semi-automatically generated; in computational linguistics, this often takes the form of automatic or semi-automatic data annotation. Such annotation, like manual annotation, is prone to errors and inter-annotator disagreement, leading to rounds of adjudication or correction. This work with data is also generally not documented in detail, so annotation decisions may be hard to "undo". Finally, once a data set is released, newer versions will inevitably also have to be released to deal with data expansion or correction. In this case, proper versioning and data curation are vital to ensure experimental and analytical reproducibility.

While computational linguists deal with these issues on a daily basis, there is little awareness of established methodology and best practices coming from the field of data provenance. The aim of this workshop is to begin a dialog. On the one hand, we aim to create awareness of the needs and challenges posed by linguistic data in the data provenance community. On the other hand, we aim to import an understanding of the experiences and best practices established with respect to data provenance into the computational linguistics community.

 

Organising committee

Miriam Butt, University of Konstanz
Melanie Herschel, University of Stuttgart
Christin Schätzle, University of Konstanz

 

For more information about the workshop, follow this link: https://typo.uni-konstanz.de/dataprovenance/