Details about the integrated system can be found in the KYOTO deliverable D10.2. On this page, we present a summarized overview.
KyotoCore is the central module of KYOTO that processes textual data to generate knowledge. It consists of a collection of separate modules that read and write text representations in the Kyoto Annotation Format (see the web page explaining KAF). Each module adds a new layer of analysis to KAF, taking previous layers as input. The KAF representations of text are stored in separate databases that are maintained by the document base. The sequential processing of text into KAF according to a pipeline of modules is controlled by a job-dispatcher. The job-dispatcher checks the status of each document in the databases and applies the next module in the pipeline that is associated with that database. The so-called PipeT module can be used to create any pipeline of modules. The KyotoCore system thus consists of the set of modules built for KYOTO, together with the document base and the job-dispatcher, where we created specific pipelines for the KYOTO process flow. With PipeT, any developer or system integrator can build any other pipeline.
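The layered design can be pictured as follows: each module takes a KAF document, adds one annotation layer on top of the previous ones, and passes it on. The sketch below illustrates this idea only; the module names, layer names and data structure are invented for the example and are not the actual KYOTO APIs.

```python
# Minimal sketch of the KyotoCore pipeline idea: each module reads a KAF
# document, adds one annotation layer, and returns it. Names are
# illustrative placeholders, not real KYOTO module interfaces.

def tokenizer(kaf):
    kaf["layers"].append("text")       # wordform layer
    return kaf

def term_tagger(kaf):
    kaf["layers"].append("terms")      # lemmas, POS, etc.
    return kaf

def build_pipeline(*modules):
    """Chain modules; each takes the previous layers as input."""
    def run(kaf):
        for module in modules:
            kaf = module(kaf)
        return kaf
    return run

pipeline = build_pipeline(tokenizer, term_tagger)
doc = {"id": "doc1", "layers": []}
result = pipeline(doc)
print(result["layers"])  # ['text', 'terms']
```

Because every module speaks KAF on both sides, modules can be reordered or swapped as long as the layers they depend on are already present.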
KyotoCore is embedded in the overall architecture of KYOTO, where documents and websites are collected in WikyPlanet (a Semantic Media Wiki for the environment). The Capture module can collect these sources (or any other set of documents) and will push them into a database in the document-base. The job-dispatcher will then apply the pipeline associated with that database to each document in the database.
A Multilingual Knowledge Base is used by many modules to carry out the semantic processing of text. It contains a central ontology and generic wordnets in various languages, stored in the DebVisDic platform. The Knowledge Base can be extended through the Wikyoto editor for a specific domain or application. Through one of the KYOTO pipelines, we can build a database with terms and relations from any set of documents in KAF. Wikyoto can take these concepts and terms as input and combine them with background vocabularies and generic resources in the Multilingual Knowledge Base. This results in a domain wordnet (possibly in multiple languages) that is combined with the generic resources for processing KAF. Another pipeline can then generate factual data out of the KAF on the basis of the domain model and the generic resources.
In the next figure, we give a detailed overview of the KyotoCore architecture and its embedding in the complete KYOTO platform:
The KyotoCore system includes the following components:
- The KYOTO document base, which maintains documents, databases, and users with rights (as a usual DMS), but which also assigns pipelines of processing modules to databases. Any type of document can be uploaded into the document-base, but documents need to be converted to text or HTML before they can be processed further. It is also possible to upload KAF documents directly.
- The KYOTO job-dispatcher that continuously monitors the documents in databases, checks the status of the documents and tries to perform the next step in the pipeline for each document;
- The file system for storing the KAF files that are produced and processed by the modules in KYOTO;
- The KYOTO modules, which are combined in a pipeline architecture to produce KAF, a term database and facts:
- Capture module: a wrapper around third party software to convert PDF to text. The wrapper takes care of page boundaries and corrects conversion errors.
- Linguistic processors (LP), which are client programs that send HTML to an LP server, which returns KAF representations including tokenization, lemmatized term representation, chunks and dependencies. Servers are available for Dutch, Spanish, Basque, English and Italian; for Chinese and Japanese, applications are available and server versions are being deployed.
- Multiword (MW) tagger, which reads KAF and groups sequences of terms as multiword terms on the basis of the multiwords in generic wordnets and domain wordnets;
- Sense tagger (UKB), which is a word-sense-disambiguation system that uses a graph of semantic relations (based on wordnets) and a personalized page-rank algorithm to detect the synsets of words in context; UKB reads KAF and generates KAF with synsets added to the term layer;
- Named-entity (NE) tagger: detects time points and places in KAF as named entities. It applies named-entity disambiguation and represents the named entities in a separate layer in KAF, with GeoNames properties and wordnet mappings for locations;
- Ontology (ON) tagger: reads the synsets in KAF and inserts into the term structure of KAF the full set of ontological implications that apply, where the ontological implications are drawn from the central ontology through synset-to-ontology mappings;
- Tybot: reads KAF and extracts terms and their relations using structural, distributional and pattern-based rules. The results are stored in a MySQL database that is input for the Wikyoto system for editing the domain wordnet;
- Kybot: reads KAF and a specified set of profiles to extract events and facts from KAF, where the profiles can specify patterns at any level of KAF (wordform, terms, synsets, ontological implications, named-entities, etc.).
- A term database in MySQL with new terms that are learned from KAF representations of documents;
- PipeT: a platform for creating pipelines of processing modules through input- and output-stream connections;
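The layers that these modules add can be seen in the KAF documents themselves. The fragment below gives a rough impression of how an LP-produced term layer is enriched by the sense tagger with a synset reference; the element and attribute names are simplified for illustration and do not reproduce the exact KAF schema, and the synset identifier is an invented example.

```python
# Illustrative KAF-like fragment: the LP produces the term layer and the
# sense tagger (UKB) adds a synset reference to it. This is a simplified
# stand-in for the real KAF schema, with an invented synset identifier.
import xml.etree.ElementTree as ET

kaf = ET.fromstring("""
<KAF>
  <text>
    <wf wid="w1">rivers</wf>
  </text>
  <terms>
    <term tid="t1" lemma="river" pos="N">
      <externalRef resource="wn" reference="eng-30-09411430-n" conf="0.85"/>
    </term>
  </terms>
</KAF>
""")

for term in kaf.iter("term"):
    ref = term.find("externalRef")
    print(term.get("lemma"), "->", ref.get("reference"))
```

Later modules, such as the ontology tagger and the Kybots, query exactly these layers to find the terms, synsets and implications they operate on.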
The overview figure also shows the WikyPlanet module for feeding KyotoCore with textual sources and the Wikyoto system for editing the domain wordnets and the central ontology. The latter are stored in the Multilingual Knowledge Base that is implemented in the DebVisDic environment. WikyPlanet, Wikyoto and DebVisDic are external to KyotoCore, and it is possible to install and run KyotoCore without these external modules. For example, documents can be captured directly through the Capture module from a list of URLs. Furthermore, although many modules in KyotoCore use external resources such as wordnets and an ontology, these can be provided as separate data files independently of the Multilingual Knowledge Base. Wikyoto and the Multilingual Knowledge Base are only needed to build the domain-specific resources. By tailoring the knowledge to a specific domain, the quality of the output of KyotoCore is improved (both in recall and precision). The cyclic nature of KYOTO allows you to continuously improve the ontology and domain wordnet, which directly feed back into all the modules that are applied after the LPs. As shown in this overview, Wikyoto connects to large background databases such as Species2000 and DBPedia, which contain millions of concepts and terms. Any domain wordnet can thus be aligned to wordnets, the central ontology and the background databases, and it can draw from any of these for creating concepts in addition to the term database. Nevertheless, KyotoCore can also run using generic resources.
At the heart of KyotoCore lie the document-base and the job-dispatcher. The document-base keeps track of databases, users and documents. Users get rights to work on specific databases, or to create/delete databases. Each database can contain any number of documents, and the document-base keeps track of the administration, versioning and status of each document in the database. To process documents in KYOTO, a registered user needs to add a document to a database. Another important function of the document-base is the administration of the modules that should be applied to each database. Users can register which modules and pipelines of modules should run on which database. In the case of KYOTO, there are a number of standard pipelines of modules that are applied. The first step in the process is the processing of the text in any HTML file in the database to generate the KAF representation of the text with a structural analysis: tokenization, term detection, chunking and dependency structures. Once the text is represented in KAF, the other KYOTO modules are applied and add further data (layers) to the KAF representation.
The processing of the documents in the databases is controlled by the job-dispatcher. The job-dispatcher permanently monitors the status of each file in the document-base and checks what module should be applied next to the file, given the modules and pipelines that are associated with the database. The KYOTO engine is thus started by pushing documents into the document-base. Adding documents to the document-base can be done through the document-base API. If PDF documents are added, the Capture module converts them to textual sources to prepare them for processing in KYOTO. Next, the LPs take the HTML files as input streams and produce KAF/LP as output streams (as is specified in the pipeline configuration). Note that it is also possible to generate KAF outside the KYOTO platform (e.g. using your own LP for your language) and add KAF representations of your document to the document-base directly, bypassing both Capture and LP. This will also kick-start the KYOTO processing but since the status of the document is already set to KAF/LP, the next step in the pipeline is immediately applied, i.e. multiword tagging.
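The dispatcher's behaviour can be sketched as a simple status-driven loop: look up the last completed step for each document and run the next module in the pipeline associated with its database. The status names, pipeline and module interface below are assumptions made for the illustration, not the actual job-dispatcher implementation.

```python
# Simplified sketch of the job-dispatcher loop: check each document's
# status and apply the next module of the pipeline. Status values and
# the module interface are invented for illustration.

PIPELINE = ["capture", "lp", "mw", "ukb", "kybot"]

def next_module(status):
    """Return the module that should run next, or None if done."""
    idx = PIPELINE.index(status) + 1 if status in PIPELINE else 0
    return PIPELINE[idx] if idx < len(PIPELINE) else None

def dispatch(documents, modules):
    for doc in documents:
        step = next_module(doc["status"])
        if step is not None:
            modules[step](doc)        # module reads and writes KAF
            doc["status"] = step      # record the completed step
```

Note how a document submitted directly as KAF (status already at the LP step) skips Capture and LP and is picked up at the multiword-tagging step, as described above.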
An important aspect of this architecture is that any pipeline can be created from any set of modules using the PipeT system and that all processing is centred around KAF as input and output streams. This makes the system very flexible and easily extensible. Furthermore, after the LPs have produced a structural representation of the text in their language, the remaining modules in KYOTO are language-neutral, except for some basic and simple patterns and the wordnets that are used for each language. Note that currently the linguistic processors (LPs) are servers that are hosted at different sites, while the KYOTO LP modules are clients that simply pass HTML files to the server and send the KAF back to the job-dispatcher for storage in the document-base. Parsers that generate KAF can be obtained for most KYOTO languages. Otherwise, developers can include their own parser in the pipeline or simply submit KAF directly to the document-base, which will start the further processing. As a result, the language dependency for processing text in KYOTO is minimal.
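The stream-centred design can be pictured as ordinary Unix pipes: each module is a process that reads KAF on stdin and writes KAF on stdout, and PipeT wires the streams together. The helper below demonstrates the principle with `cat` as a stand-in module that passes the stream through unchanged; the real KYOTO module commands are not shown here.

```python
# PipeT-style chaining pictured as Unix pipes: each module reads KAF on
# stdin and writes KAF on stdout. `cat` is a stand-in that passes the
# stream through; it is not an actual KYOTO module.
import subprocess

def run_pipeline(commands, text):
    """Feed `text` through a chain of commands connected by pipes."""
    data = text.encode()
    for cmd in commands:
        data = subprocess.run(cmd, input=data,
                              capture_output=True, check=True).stdout
    return data.decode()

out = run_pipeline([["cat"], ["cat"]], "<KAF/>")
print(out)  # <KAF/>
```

Since every stage only agrees on the stream format, swapping in a different parser, or skipping stages by submitting KAF directly, requires no change to the rest of the chain.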
Given the total set of modules that are developed for KYOTO, shown at the top-right side, we can create any set of pipelines reading and writing KAF back into the database. In the overview figure, we represent two different databases, each with a different associated pipeline. The pipeline for the database at the right side ends with a Tybot that generates the MySQL term database for the KAF documents in the database. The term database is used to build the domain wordnet in the Wikyoto editor and to map the wordnet to the central ontology. The domain modelling in the Multilingual Knowledge Base can be exported to data files that are used by the modules in the pipeline again. For example, the multiword tagging can use multiwords from the generic wordnet in combination with the multiwords from any domain wordnet. Similarly, WSD can add the relations from a domain wordnet to the relations from the generic wordnet.
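The combination of generic and domain resources in a module can be sketched for the multiword tagger: it simply merges the two multiword inventories before grouping tokens. The multiword entries and the grouping strategy below are invented for illustration and are much simpler than the actual tagger.

```python
# Sketch of multiword tagging that combines multiwords from a generic
# wordnet with those from a domain wordnet. Entries and the greedy
# pairwise grouping are invented illustrations.
GENERIC_MW = {("water", "supply")}
DOMAIN_MW = {("river", "basin"), ("water", "quality")}

def tag_multiwords(tokens, multiwords):
    """Group adjacent token pairs that form a known multiword."""
    tagged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in multiwords:
            tagged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            tagged.append(tokens[i])
            i += 1
    return tagged

print(tag_multiwords(["the", "river", "basin"], GENERIC_MW | DOMAIN_MW))
# ['the', 'river_basin']
```

Merging the inventories (`GENERIC_MW | DOMAIN_MW`) mirrors how the exported domain wordnet data files extend, rather than replace, the generic resources.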
The second cycle, shown for the database at the left side, ends with the Kybot that generates the final output for KYOTO. Typically, we expect domain experts first to do domain modelling for a selection of representative documents after which they can apply the domain model to another database for extracting information and knowledge (facts) from a larger database.
How to install the KyotoCore system
KyotoCore has been tested on Linux/Unix platforms; most modules will also run on Windows. Since KyotoCore includes a large variety of software and packages, a large and dedicated server is required. Some modules make calls to remote servers, e.g. for parsing text into the initial KAF. Access to these remote servers needs to be requested on the basis of IP registration (see the web page on LPs for more details). All servers are also available for local installation.
A minimal installation requires the installation of the document-base with the job-dispatcher. Using PipeT, developers and integrators can build their own pipelines. We also include standard pipelines for KYOTO that include all the KYOTO modules. These pipelines are specific for each language and require available lexical resources such as wordnets for these languages. To obtain the wordnets, you should consult the builders of these wordnets. All wordnets in KYOTO are free for research: see the web page on the Multilingual Knowledge Base for more details.
Wikyoto is provided as a separate package, as described in the corresponding web page.