M 6.2.1 – Specification of Services (public version)
Version 15.07.2014
Cluster 6.2
Responsible partners: STI, TUD, BBAW
DARIAH-DE: Building Research Infrastructures for the e-Humanities
BMBF funding reference number: 01UG1110A to N
Project duration: 2011-03 – 2016-02
This research and development project has been and still is funded by the Bundesministerium für Bildung und Forschung (BMBF, Federal Ministry of Education and Research), reference number 01UG1110A to N, and is supervised by the project sponsor at the Deutsches Zentrum für Luft- und Raumfahrt (PT-DLR, German Aerospace Center).
Document status: Final
Availability: public
Authors:
Rainer Becker (TUD), Luise Borek (TUD), Thomas Kollatz (STI), Harald Lordick (STI), Beata Mache (STI), Ruth Reiche (TUD), Niels-Oliver Walkowski (BBAW)
Introduction
Aim
The overall purpose of this milestone is a conceptual specification of the technical services to be developed by work package 6.3 of the DARIAH-DE-2 cluster on annotation. With a view to inclusion in the DARIAH-DE infrastructure, work package 6.2 specifies annotation tools whose area of scholarly application is as broad as possible. Such generic requirements lead to complex user scenarios and relations between annotations, which were investigated at an early project stage. Visualization techniques may serve as powerful tools for examining and analysing these contexts further and are therefore included in the cluster's prospective activities.
Beyond the selection of suitable existing and emerging annotation and visualisation tools, the resources used in each concrete annotation process (e.g. authority files, ontologies) remain an essential factor for further consideration. Equally relevant is methodological, conceptual and field-specific support for integrating the services provided.
Result: DARIAH-DE annotation services
The concept of the DARIAH II annotation cluster includes two predetermined tools that are associated with the partners involved. They are designed for annotating different kinds of objects in several fields: the MEI Score Editor allows users to view and edit MEI-encoded music documents, whereas SemToNotes covers a wide range of semantic enrichment of texts and images. Both tools are being developed further and integrated into the DARIAH-DE infrastructure as described in the project proposal.
Apart from the services agreed on at the beginning of the project, we present some results of contextualising our cluster activities. In addition to the tools mentioned above, the work package specifies the integration into the DARIAH-DE infrastructure of an annotation service that is both generic and easy to use. To this end, we surveyed a large number of tools. As a result, work package 6.2 recommends that the annotation tool Annotator.js (see http://annotatorjs.org/) be adapted, developed further and integrated into the DARIAH-DE infrastructure.
Procedure
- Description of the declared tools MEISE and SemToNotes
- Catalogue of criteria of a generic, infrastructural DARIAH-DE annotation service
- Outline of existing annotation tools
- Identification of candidates
- Specification of candidates
- Selection and specification of one candidate tool
Catalogue of criteria of a generic, infrastructural DARIAH-DE annotation service
Work package 6.2 made a preliminary selection of annotation tools potentially suited for the infrastructure. Among other things, features such as generic character, usability, field-specific interest and relations to existing infrastructure components were taken into account during this project stage. Attention was likewise paid to synergies with external projects. In this context, members of the work package 6.2 team participated in the international round table discussion 'Annotations in Digital Media' (see http://dhd-blog.org/?p=3831). After the event, a mailing list was set up in order to establish the DARIAH-DE annotation cluster as a basis for further talks about annotations as a topic of shared interest. We also follow relevant external projects, e.g. Hypothes.is (see http://hypothes.is), which aims to "be an open platform for the collaborative evaluation of knowledge". Topics of discussion included developments in entity recognition in texts, authority file services and tools for manual annotation. This approach ultimately led to a proposal for a generic annotation service based on Annotator.js and to its specification within DARIAH-DE.
The catalogue of criteria for a generic DARIAH-DE annotation service, applicable across multiple technical areas, included:
- generic approach: wide benefit across disciplinary borders
- usability, with an emphasis on humanists as primary users: the user interface should be intuitive (measured against humanists' expectations)
- relatively low complexity of the user interface: it should connect to users' main expectations (clearly specified functionality)
- technical interoperability: observe the specifications of the infrastructure while offering plug-in functionality for extension and using sustainable formats etc. (sustainability specifications)
- institutional interoperability: needs of specific scholarly communities and eHumanities cooperations, as well as compatibility with the two predetermined tools (no functional overlaps, potential to combine tools)
Outline of existing annotation tools
Below is an overview of existing tools in the annotation context (mainly based on initiatives such as Bamboo DiRT, OKF, the Research Space Project and an overview of Video Annotation Tools (VAT), with technical additions of our own). This overview was compiled during work on the milestone. It contains a collection of about two hundred annotation tools.
This overview is accessible as an open document: https://docs.google.com/spreadsheet/ccc?key=0AgJnN0WXLOardFBVWmlJbk9HUk5iRW9LN19WYVhIVnc&usp=drive_web#gid=0
At least parts of this overview are based on information from the following sources:
- Bamboo DiRT: http://dirt.projectbamboo.org/categories/annotation?field_platform_value=All&field_cost_value=All&term_node_tid_depth=All&sort_by=title&sort_order=ASC
- Open Knowledge Foundation: https://docs.google.com/spreadsheet/ccc?key=0Aujm_HldNh4WdHJrcTNIZ2tOQWhETjNDMGdkRjZEVGc&usp=drive_web#gid=0
- Research Space Project: https://confluence.ontotext.com/display/ResearchSpace/07%2C18+Image+Annotation+%28w+Deep+Zoom%29
- Overview: Video Annotation Tools (VAT): https://docs.google.com/spreadsheet/ccc?key=0Ai4YoIsQkBJEdG5ndHVWM1RxMVJ4WGhCMEFZaEdkU0E&usp=drive_web#gid=0
- DARIAH
Short-list of annotation tools and choice for a generic DARIAH-DE annotation service
Within the work package, several stages of selection resulted in a short-list of tools that might serve as a service suitable for integration into the infrastructure. As mentioned above, the criteria taken into account included the generic character of the tool, its usability, its scholarly relevance, its fit with the existing components and services of the infrastructure, and its relation to developments in external projects or existing use cases.
Among others, a closer look was taken at tools that enable automatic enrichment, focus on authority file services or support manual annotations. This procedure led to Annotator.js as a generic annotation service that meets all the criteria for integration into DARIAH-DE. Annotator.js is a state-of-the-art tool. Numerous notable projects and institutions rely on Annotator.js for their activities, among others http://Hypothes.is, The Open Video Annotation Project (e.g. Harvard), EdX (e.g. Berkeley) and Annotation Studio (e.g. MIT) (see http://annotatorjs.org/showcase.html), as well as the Open Knowledge Foundation (a member of DARIAH-DE).
Specification of a Generic Annotation Service in DARIAH-DE
Annotations in Between Perspectives
Annotations are a type of academic resource that is created in, and depends on, highly dynamic and context-related research situations: they represent different stages of research and different opinions in pluralistic environments, they accentuate what is important for a specific research situation, and so forth. This aspect constitutes their potential as well as their complexity. The Generic DARIAH Annotation Service will be implemented around this theme.
Implementing such a service involves four action points:
- The Annotator, which was chosen as the core component of a DARIAH Annotation Service, must be integrated with existing DARIAH services.
- Two features of the Annotator must be extended:
a. the referencing component must also allow XML nodes with an ID to be referenced; an ideal environment would be, among others, a repository of high-quality encoded sources that many people work on, such as the TextGridRep.
b. the tag plugin has to be extended so that it can integrate external semantics.
- Two semantics (the TaDiRAH taxonomy and the MPT) have to be serialized and put into the Schema Registry so that they are accessible to the tag plugin of the Annotator.
- In cooperation with Cluster 5, a bridge to visualization services such as d3.js has to be implemented that can visualize annotation data using the tags from the two taxonomies (see Action Point 4).
The formal semantic integration plugin could offer pre-defined semantics such as TaDiRAH and MPT, make existing semantics importable (from the Schema Registry or the Semantic Web), or allow semantics to be created during annotation (as in the tag plugin); such semantics could afterwards be made sustainable through the DARIAH infrastructure. What can be realised will depend on an evaluation of resources.
Core Component: OKF-Annotator
- An Annotation is a JSON document that contains a number of fields describing the position and content of an annotation within a specified document.
- Store Plugin
- The store holds users, tags, annotations and annotated text fragments.
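To make the notion of an annotation as a JSON document more concrete, the following minimal sketch builds one such document in Python and serializes it. The field names (`uri`, `user`, `text`, `quote`, `tags`, `ranges`) follow the format commonly shown in the Annotator documentation and should be treated as an assumption here; all values, and the helper `describe`, are invented for illustration.

```python
import json

# Illustrative annotation document. Field names are assumed from the
# Annotator documentation; every value is made up for this example.
annotation = {
    "id": "example-annotation-1",
    "uri": "http://example.org/document.html",   # the annotated document
    "user": "alice",                              # who annotated
    "text": "This passage refers to an earlier edition.",  # the note itself
    "quote": "as printed in the first edition",   # the annotated fragment
    "tags": ["edition", "provenance"],
    "ranges": [                                   # position within the document
        {
            "start": "/div[1]/p[3]",
            "startOffset": 12,
            "end": "/div[1]/p[3]",
            "endOffset": 43,
        }
    ],
}


def describe(doc: dict) -> str:
    """Hypothetical helper: one-line summary of an annotation document."""
    return f"{doc['user']} annotated '{doc['quote']}' in {doc['uri']}"


# What would be sent to, or returned from, an annotation store.
serialized = json.dumps(annotation, indent=2)
summary = describe(json.loads(serialized))
```

The four kinds of stored data named above map directly onto this document: the user, the tags, the annotation text and the annotated fragment with its position.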
Opportunities and advantages
Compared to the other tools evaluated, it is easy to install and well documented, the OKF is a partner in DARIAH, and its main structure is modular and designed to be extended. In summary, it is a solution that can be integrated into DARIAH and extended with comparatively little time and effort.
Potential risks and limitations
While Elasticsearch is a well-proven, state-of-the-art query engine, the Python code of the server should be reviewed by a qualified developer for sustainability and stability.
There are efforts to bring the Open Annotation Model to Annotator, but these have not yet been completed.
Potential for future development
In its current state the Annotator project only permits free-text annotations.
There is a plugin for using tags, but these tags are user-generated, are not distributable and cannot be imported from existing semantics or ontologies. For academic use it is important to build on existing, often domain-specific semantics. The tag plugin should therefore be extended for this purpose. For academic use it is also important to be able to use predefined, project-specific tagsets.
Annotator does not primarily address an academic community. Its focus is on web documents. Hypothes.is states that it will extend this functionality to PDF (via pdf.js), images and video. However, many objects of primary data in academia cannot be rendered in a browser or do not belong to the object types mentioned. In the Digital Humanities, XML is a core data format, although for presentation and publication purposes it is processed to HTML or other formats. If annotations are to be of real value for digital methods, and not only for reading, it is necessary to reference them back to the primary data, which is most often XML. As part of a generic digital annotation tool for DARIAH, Cluster 6 could develop a plugin that stores referencing information in the annotation, making it possible to connect annotations to the XML data.
Potential as a DARIAH Service
There are three areas of potential when the Annotator is integrated as a DARIAH service:
- The infrastructural perspective of sustainability of service and data.
- The enrichment of the Annotator functionality provided by other DARIAH services.
- The promotion and provision of other DARIAH activities like TaDiRAH and dat:Motivation as part of a DARIAH service.
Integration into DARIAH Infrastructure and Implementation of Generic Features
Action Point 1: Integration into DARIAH Infrastructure
The DARIAH service could consist of installing the annotation store on a DARIAH server and offering it to Humanities projects (DARIAH Storage). This would make the annotation software easy to use for non-technical audiences. The annotation store hosted by DARIAH should be connected to the DARIAH-AAI. DARIAH could thus provide a project-specific annotation layer on the texts dealt with in a research project. Additionally, the annotation data could be connected to other DARIAH services, such as Long Term Preservation, thereby making annotation data sustainable. Annotations could be included in the Generic Search, or mapped by the DARIAH-DE Geo-Browser when spatio-temporal information is included. The Schema Registry could be used to create and map project or domain semantics that can be used as predicates for the annotations. The Generic Search could thereby also distinguish between the different contexts in which annotations were created.
This development began with the annotation workshop in DARIAH I and is carried on in various activities of the annotation cluster, such as TaDiRAH (a joint endeavour of NeDiMAH and DARIAH towards a DH ontology), the Scholarly Domain Model and the creation of the Multilayered Provenance Terms (MPT). The DARIAH AAI and the Schema Registry are well suited to organising and querying interdisciplinary and multi-perspective annotation data. This would create real added value, because annotations could serve as a communicative entity for cross-domain and cross-project research, whereas the current Annotator lacks this functionality because it is not connected to any infrastructure and does not have the services mentioned. On the other hand, projects like pund.it (see https://thepund.it/), which offer some of these services, do not have the research community behind them and also work on an infrastructure that does not specifically address an academic community.
Action Point 2: Extension of the Tag-PlugIn
DARIAH should enhance the functionality of the tag plugin so that it can use tagsets specifically designed for individual research projects (which are of high value for collaborative research), offer commonly used or generic tagsets (TaDiRAH, MPT) by default, and import tagsets stored in the Schema Registry.
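As a rough illustration of what such an extension would have to decide, the sketch below separates the tags of an annotation into those covered by an enabled tagset and remaining free-text tags. It is a sketch under stated assumptions only: the `REGISTRY` dictionary is a hypothetical stand-in for tagsets fetched from the Schema Registry, the tagset names and terms are placeholders (not actual TaDiRAH or MPT terms), and none of the function names belong to Annotator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tagset:
    """A named controlled vocabulary that the tag plugin may offer."""
    name: str
    terms: frozenset


# Hypothetical stand-in for tagsets fetched from the Schema Registry.
# Names and terms are placeholders, not real TaDiRAH or MPT content.
REGISTRY = {
    "generic-default": Tagset("generic-default", frozenset({"term-a", "term-b"})),
    "project-x": Tagset("project-x", frozenset({"inscription", "symbol"})),
}


def allowed_terms(enabled: list[str]) -> set[str]:
    """Union of the terms of all tagsets enabled for a project."""
    terms: set[str] = set()
    for name in enabled:
        terms |= REGISTRY[name].terms
    return terms


def split_tags(tags: list[str], enabled: list[str]) -> tuple[list[str], list[str]]:
    """Separate tags covered by an enabled tagset from free-text tags."""
    allowed = allowed_terms(enabled)
    known = [t for t in tags if t in allowed]
    unknown = [t for t in tags if t not in allowed]
    return known, unknown


known, unknown = split_tags(
    ["symbol", "term-a", "my-own-tag"],
    ["generic-default", "project-x"],
)
```

Whether free-text tags outside the enabled tagsets are rejected, flagged or simply stored alongside the controlled ones is a design decision that this specification leaves open.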
Action Point 3: Referencing of XML Nodes
Every annotation is given a base URL. We propose to extend this to xml:ids. annotator.js therefore has to be extended so that it can reference xml:ids and XML nodes. Since the TextGrid Rep is based on TEI XML, embedding annotator.js into the TGRep could serve as a proof of concept for referencing XML nodes or even xml:ids. In the TGRep, each <div>, <p>, <l>, <lb> etc. is provided with its own xml:id.
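The idea of referencing by xml:id can be sketched with Python's standard library as follows. This is an illustration under assumptions, not a design for the plugin: the `xml_id` field on the annotation is hypothetical (Annotator itself does not define it), and the TEI fragment and its identifiers are invented.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented TEI fragment in which every element carries its own xml:id.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div xml:id="div1">
      <p xml:id="p1">First paragraph.</p>
      <lg xml:id="lg1"><l xml:id="l1">A line of verse</l></lg>
    </div>
  </body></text>
</TEI>"""


def index_by_xml_id(root: ET.Element) -> dict:
    """Map each xml:id in the document to its element."""
    return {el.attrib[XML_ID]: el for el in root.iter() if XML_ID in el.attrib}


def resolve(annotation: dict, index: dict):
    """Return the XML node an annotation points to, or None if it is unknown."""
    return index.get(annotation.get("xml_id"))


root = ET.fromstring(TEI_SAMPLE)
index = index_by_xml_id(root)

# "xml_id" is a hypothetical extra field stored next to the usual base URL.
annotation = {
    "uri": "http://example.org/edition.xml",
    "xml_id": "l1",
    "text": "Note on this verse line.",
}
target = resolve(annotation, index)
```

Because the reference is an identifier rather than a path through the rendered HTML, it stays valid however the XML is transformed for presentation, provided the xml:ids are carried through or can be mapped back.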
Action Point 4: Connection to Visualization Services
As mentioned above, with Annotator as a DARIAH-DE service it should be possible for multiple users to annotate a text and thereby enrich it with multiple perspectives. A network of knowledge is therefore inherent in the stored annotations. To take a closer look at this network, it may be helpful to visualize the connections between four categories: the annotated text fragments, the annotations themselves, the tags given and the users. To reach that goal, a bridge to visualization services such as d3.js has to be implemented in cooperation with Cluster 5 that can visualize annotation data in the way outlined above, using the tags from the two taxonomies.
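One possible shape for such a bridge is a small transformation from stored annotations into a node-link structure, a format commonly consumed by d3.js force-directed layouts. The sketch below is an assumption about how this could look, not a specification: the input records, the node kinds and the function `build_graph` are all hypothetical.

```python
import json

# Invented sample of stored annotations (users, fragments, tags).
annotations = [
    {"id": "a1", "user": "alice", "quote": "fragment one", "tags": ["tag-x"]},
    {"id": "a2", "user": "bob", "quote": "fragment one", "tags": ["tag-x", "tag-y"]},
]


def build_graph(records: list) -> dict:
    """Link each annotation to its user, its text fragment and its tags."""
    nodes = {}
    links = []

    def node(kind: str, label: str) -> str:
        # One node per distinct (kind, label); "group" can drive colouring.
        key = f"{kind}:{label}"
        nodes.setdefault(key, {"id": key, "group": kind})
        return key

    for rec in records:
        ann = node("annotation", rec["id"])
        links.append({"source": node("user", rec["user"]), "target": ann})
        links.append({"source": ann, "target": node("fragment", rec["quote"])})
        for tag in rec["tags"]:
            links.append({"source": ann, "target": node("tag", tag)})

    return {"nodes": list(nodes.values()), "links": links}


graph = build_graph(annotations)
graph_json = json.dumps(graph)  # could be handed to a visualization front end
```

In this sample both users annotate the same fragment and share one tag, so the fragment and tag nodes each connect two annotations; it is exactly such shared nodes that make multiple perspectives on one text visible.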
Dissemination
Outcomes of the specification-process outlined above will be part of 'report 6.2.1' in month 18. Revised parts of this report could and should be part of an already planned publication supported by the DARIAH-DE Working Paper-Series ("Digitale Annotationen: ‚Best Practices‘ und Potentiale").
In parallel, many of the planned cluster activities (such as expert and method workshops) will also support the specification and development process outlined above (see R 6.4.1).
A DARIAH expert workshop in cooperation with Cluster 5 (BBAW/Uni Hamburg) is planned as a use case for working with multi-perspective annotation data that uses MPT and TaDiRAH (https://github.com/dhtaxonomy/TaDiRAH) and for identifying visualisation and query strategies that make use of this data. It will make extensive use of the extended tag plugin (Action Point 2). The outcome of these workshops can serve as input to Report 6.2.1, because it identifies and evaluates new ways in which annotation can be used in a digital environment. Experiences will also feed back into the MPT model and TaDiRAH for the purpose of appropriate model description.
A DARIAH expert workshop on text and image annotations, already planned, proposed and accepted, could serve as a use case for collaborative and interdisciplinary annotations and their requirements.
In the workshop, annotation scenarios could draw on use cases from the DARIAH II application, using them as sources to annotate.