Preface:

Since their publication a few years ago, the FAIR criteria have been the subject of intense debate in the scientific community, because the terms summarised in the acronym, although intuitively plausible, are functionally intertwined and therefore not always clearly distinguishable from one another. Even within VerbaAlpina, the in-depth discussion of FAIR has meant that not all texts written by individual project members on this topic are fully congruent in content. Regardless of these individually divergent interpretations, VerbaAlpina takes the criteria formulated by FORCE11 (https://www.force11.org/group/fairgroup/fairprinciples) as its benchmark for FAIR compliance. From a methodological point of view, adherence to these criteria is well founded. With regard to practical implementation, one limitation remains: the generic metadata (according to the DataCite metadata schema) have not yet been compiled. This will, however, be done in the course of transferring the VerbaAlpina data to the LMU's Open Data repository, whose operationalisation is currently being worked on.
--
In 2016, a large group of scientists from many countries published an article in the journal Scientific Data formulating guidelines for the handling of research data (Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018, doi: 10.1038/sdata.2016.18 (2016) 🔗). The ideas presented in this publication ultimately stem from a workshop held in January 2014 at the Lorentz Center at Leiden University in the Netherlands, entitled "Jointly designing a data FAIRPORT".

In the meantime, these ideas, summed up in the acronym FAIR, have become an established point of reference in the current debate on the proper handling of research data (this became clear at the network meeting of the GeRDI project in October 2018; cf. also the FAIRGROUP of the FORCE11 community).

The acronym FAIR stands for the following central, partly interdependent postulates that should guide the handling of research data (🔗):

- Findable
- Accessible
- Interoperable
- Reusable

These keywords implicitly entail a whole series of consequences for the handling of digital research data.

To ensure that data is findable, there should be at least one central portal through which search queries can be initiated. It would make sense to incorporate the research data – essentially its content and its location – into the long-established library catalogues. Approaches that would require searching in several different places should be avoided.

In order to be found, data must of course exist physically. This is not so much a question of technical realisation – which can be provided, for example, by the computing centres operating throughout the country – as of institutional responsibility. From this point of view, too, libraries are the obvious candidates for this task because of their history, their inherent role as preservers of knowledge and their long-term perspective. They should take responsibility for the sustainable preservation of digital data. In what form this is ultimately achieved – whether libraries set up and manage their own repositories or rely on data centres as service providers – is of secondary importance and can be decided case by case.

The design and assignment of metadata, which make the actual research data findable, are of great importance. The use of at least one binding, hierarchically structured metadata schema appears indispensable; together with likewise binding controlled vocabularies, it permits a categorisation of the stored research data. For the time being, VerbaAlpina has opted for the widely used DataCite schema. Using several competing metadata schemas would be possible, but only makes sense if each is applied consistently to all the research data collected. Subordinate subject-specific metadata schemas can usefully supplement the higher-level ones.
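As an illustration, a minimal record along the lines of the DataCite schema might look like the following sketch; the property names follow DataCite's mandatory fields, while all values are hypothetical placeholders rather than actual VerbaAlpina metadata.

```python
import json

# Hypothetical record: the keys correspond to mandatory properties of the
# DataCite metadata schema; all values are illustrative placeholders.
record = {
    "identifier": {"identifier": "10.5282/example-doi", "identifierType": "DOI"},
    "creators": [{"creatorName": "VerbaAlpina Project"}],
    "titles": [{"title": "Example lexical record"}],
    "publisher": "Open Data LMU",
    "publicationYear": "2020",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

print(json.dumps(record, indent=2))
```

A controlled vocabulary would additionally constrain the admissible values of fields such as `resourceTypeGeneral`, so that categorisation remains consistent across the whole data stock.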

"Accessible" refers primarily to the accessibility of data that is not restricted by legal barriers such as copyright. Those who collect or produce data have the least influence on this point. In addition to copyright, the protection of personal rights often has to be taken into account when collecting data. Accordingly, the demand for accessibility is aimed primarily at ensuring that producers do not themselves impose legal access restrictions on data that is free of such constraints. In concrete terms, this mainly means renouncing copyright claims and applying a licensing model that conforms to the conditions of Open Access. Creative Commons (CC) licences are widespread in the academic environment, although not all of them meet the criteria of Open Access. In particular, the prohibition of commercial use, which can be part of a CC licence, violates the concept of Open Access, because almost any use of data can, under certain circumstances, be considered "commercial use", and a clear demarcation is virtually impossible from a legal point of view (see also the methodology article "Licensing").

Like findability, interoperability has two sides: a technical one and a conceptual-organisational one. In order to effectively link data sets and allow them to refer to each other, the data must in many cases be logically and finely granulated, following (mostly subject-specific) rules. A central role in this context is played by so-called authority files: defined and, ideally, standardised concept categories whose individual instances (digital objects) are "distinct", i.e. singular, with respect to a clearly defined type and number of properties. Assigning numeric or alphanumeric identifiers ("IDs") to the individual objects of a concept category allows objects to be referenced unambiguously. The granulation of data sets along the boundaries of categories and their individual instances/objects, in conjunction with these identifiers, then allows separate data sets with congruent content to be linked. However, real added value only arises when it is also technically possible to reference individual objects directly, and thus to move from one data stock to an object in another data stock with a single click. This seems possible only if each individual data object ("granum") is actually assigned its own URL. Finally, for the sake of sustainability, each such URL must also be assigned a DOI.
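The linking mechanism described above can be sketched in a few lines. The concept IDs, field names and URL pattern below are invented for illustration and are not VerbaAlpina's actual scheme.

```python
# Two hypothetical data sets that share nothing but an agreed-upon
# authority identifier, here an invented "concept ID".
AUTHORITY = {
    "C123": "BUTTER",   # concept-category instance with a stable ID
    "C456": "CHEESE",
}

def object_url(concept_id: str) -> str:
    # Each granular object gets its own resolvable URL (pattern invented).
    return f"https://example.org/objects/{concept_id}"

# Data set A (e.g. lexical attestations) references concepts only by ID ...
dataset_a = [{"form": "butiro", "concept": "C123"}]
# ... and so does data set B (e.g. geo-referenced survey points) ...
dataset_b = [{"point": "Trento", "concept": "C123"}]

# ... which makes the two independently maintained sets linkable:
linked = [(a["form"], b["point"])
          for a in dataset_a for b in dataset_b
          if a["concept"] == b["concept"]]
print(linked)              # [('butiro', 'Trento')]
print(object_url("C123"))
```

The decisive design choice is that neither data set copies the other's content; both reference the shared authority ID, so each object remains addressable via its own URL (and, for sustainability, a DOI registered for that URL).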

Ultimately, the reusability of data sets results from careful attention to and implementation of the three preceding postulates. Among others, VerbaAlpina's technology has been replicated by the APPI project of the University of Lille; the corresponding documentation can be found at the following link: https://github.com/anr-appi/verba-picardia-doc/wiki/Documentation-du-syst%C3%A8me-Verba.
The collaboration between VerbaAlpina and the T-Migrants project is another example of the implementation of the FAIR principle of reusability. The WebGL map technology, which was originally developed for VerbaAlpina and allows the visualisation of large amounts of data, was successfully exported to the T-Migrants project.
However, in order to adopt the WebGL map technology, it had to be partially adapted to the specific requirements of T-Migrants. These adaptations required some changes to the implementation, such as support for persistent animation, to best meet T-Migrants' specific needs and goals. By reusing the WebGL map technology, the T-Migrants project was able to benefit from existing technology without spending its own resources on developing something similar. This saved time and resources and promoted research efficiency, even though some adaptations were necessary for the new project's specific requirements. The successful reuse in this case demonstrates the importance of making scientific data and technologies available in a documented, structured and licensed form that allows other projects to use and develop them in new contexts. The topic was also presented and discussed at the workshop "Challenges of Linguistic Data Visualisation/LDDB 2022".

VerbaAlpina strives to align all data-related procedures and regulations with the FAIR principles. Thomas Krefeld essentially regards this as the basis of a DH research ethic (Thomas Krefeld [2018]: Linguistic theories in the context of digital humanities. Corpus in Text. Version 2 (05.11.2018, 11:35). Paragraph 4. url: http://www.kit.gwi.uni-muenchen.de/?p=28010&v=2#p:4).

The findability of the data is supported by the cooperation with the University Library (UB) of the LMU as well as by the DFG project GeRDI, which is currently being carried out as part of the project "e-humanities – interdisziplinär". Above all, the central database of the module VA_DB will be provided with metadata version by version and transferred in several forms to the UB of the LMU, where it will be stored in the Open Data repository. At least the metadata will then also be incorporated into the index currently being built as part of the GeRDI project. The aim is the centralised findability of the data collected and processed by VerbaAlpina via the library catalogue of the UB and also via the search portal of the GeRDI project, which is still under development.

All data managed by VerbaAlpina will, as far as possible, be placed under an Open Access-compliant Creative Commons licence (up to version 18/1 CC BY-SA 3.0 DE, from 18/2 CC BY-SA 4.0).

Interoperability is achieved, in part, through fine granulation of the data set. It is also guided by the concept of authority data: already existing authority records are linked with VerbaAlpina's data material. One example is geographical data such as the political municipalities, which represent VerbaAlpina's central geographical reference system. For the data categories "morpho-lexical type" and "concept", which are central to VerbaAlpina, there are so far, at least in part, no authority files to which the VerbaAlpina data could be referred.
In these cases, VerbaAlpina is endeavouring to set up appropriate authority files or authority data categories in cooperation with designated institutions such as the German National Library (DNB). To meet the technical requirements of efficient interoperability, the central lexical data material is stored record by record in a large number of small files, which can be accessed via individual DOIs on Open Data LMU. Each individual file is also accompanied by a metadata file in the DataCite format. The entirety of the metadata ultimately enables the targeted retrieval of individual files via the library catalogue.
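The record-by-record storage with metadata sidecars could be sketched roughly as follows; the file layout, field names and DOIs are invented for illustration only.

```python
import json
import tempfile
from pathlib import Path

# Sketch of the storage scheme described above: each lexical record becomes
# one small data file plus one DataCite-style metadata sidecar. All names,
# fields and DOI prefixes are hypothetical.
records = [
    {"id": "rec-0001", "form": "malga", "concept": "ALPINE_PASTURE"},
    {"id": "rec-0002", "form": "baita", "concept": "HUT"},
]

out_dir = Path(tempfile.mkdtemp())
for rec in records:
    # one data file per record ...
    (out_dir / f"{rec['id']}.json").write_text(json.dumps(rec))
    # ... plus one minimal metadata sidecar per record
    sidecar = {
        "identifier": {"identifier": f"10.5282/{rec['id']}",
                       "identifierType": "DOI"},
        "titles": [{"title": f"Lexical record {rec['id']}"}],
    }
    (out_dir / f"{rec['id']}.datacite.json").write_text(json.dumps(sidecar))

names = sorted(p.name for p in out_dir.iterdir())
print(names)
```

Because every record carries its own identifier and sidecar, each file can be registered, harvested and retrieved individually, which is what makes catalogue-level access to single records possible.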

As part of the "eHumanities – interdisziplinär" project funded by the Bavarian Ministry of Science, the core data collected by VerbaAlpina (individual records, morpho-lexical types, concepts, geo-references) are being mapped to the so-called CIDOC CRM schema. CIDOC CRM is a formal ontology (in the computer-science sense) that has been under development since at least the beginning of the 1990s and whose roots lie in the museum sector. The origins of the Conceptual Reference Model (CRM) go back to a working group of the Comité International pour la Documentation (CIDOC), which in turn is a branch of the International Council of Museums (ICOM). The intention is to make data findable independently of variable category designations: instead of "author", the terms "Autor", "writer" or "auteur" may equally be used to label the same category. CIDOC CRM provides the class code E39 ("Actor") for this purpose, so that the corresponding information can be found entirely independently of individual labels. The ICOM/CRM Special Interest Group is continuously developing the CRM. The latest version of the standard (it is even an ISO standard, ISO 21127:2014, which further supports its adoption) can be downloaded from the following page: http://cidoc-crm.org/versions-of-the-cidoc-crm. Currently (June 2020), the standard comprises a total of 99 entities, supplemented by a total of 197 "properties". The latter mainly describe the relationships between different entities of the model (examples: P1 "is identified by", P15 "was influenced by", etc.). The following preliminary diagram, produced by Julian Schulz, shows an attempt to assign the VerbaAlpina entities to the CRM categories (E and P numbers) (PDF version: https://www.verba-alpina.gwi.uni-muenchen.de/wp-content/uploads/cidoc-verbaalpina_v2.pdf):


CIDOC CRM schema of the VerbaAlpina core data
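The label-independent retrieval that motivates the CRM mapping can be illustrated with a small sketch; the records and the lookup table below are invented examples, not part of the actual mapping shown in the diagram.

```python
# Sketch: a record tagged with a CRM class code can be found regardless of
# whether the original data said "author", "Autor", "writer" or "auteur".
# The records and name values are invented examples.
CRM_CLASS = {"author": "E39", "Autor": "E39", "writer": "E39", "auteur": "E39"}

records = [
    {"label": "Autor",  "value": "K. Jaberg"},
    {"label": "auteur", "value": "J. Jud"},
    {"label": "place",  "value": "Trento"},
]

# Query by CRM class (E39 "Actor") instead of by the variable category label:
actors = [r["value"] for r in records if CRM_CLASS.get(r["label"]) == "E39"]
print(actors)  # ['K. Jaberg', 'J. Jud']
```

In a real mapping the lookup would of course not be a hand-written dictionary but the result of mapping each data category once to its CRM entity, after which all queries operate on the E and P codes alone.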


The medium-term goal is to transfer VerbaAlpina's core data stock to the LMU University Library's research data repository in finely granulated form, with standardised metadata according to CIDOC CRM. There, a search portal realised with the Ruby on Rails engine Blacklight and based on an Apache Solr index will provide access to the data.
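To indicate what access via a Solr-backed portal typically looks like, here is a sketch of a Solr select query; the host, core name and field names are hypothetical and are not the configuration of the planned portal.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and field names, for illustration only.
base = "https://solr.example.org/solr/verbaalpina/select"
params = {"q": "concept:BUTTER", "wt": "json", "rows": 10}

url = base + "?" + urlencode(params)
print(url)
```

Blacklight sits in front of such an index and translates user searches in the portal interface into queries of this kind, returning facetted result lists built from the indexed metadata.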