TR008 Vossen & Hielkema & de Jonge "Evaluation of PDF to TEXT conversion" (April 2, 2010)
We thoroughly researched the conversion of PDF documents to text in order to find a suitable existing software package. In addition, we required such a package to run on both Windows and Linux, to be accessible through an API, to support new versions of Adobe PDF, and preferably to be open source. Unfortunately, none of the ten open-source and commercial packages that we considered fully met our requirements, as none of them truly interprets the PDF as text objects. Instead, they generate an approximation of the visual data. For example, they all apply a standard approach to handling columns. A multi-column page can be interpreted in two different ways:
- All text on a single line belongs together: n columns with m rows result in m lines;
- All text in a column represents consecutive lines: n columns with m rows result in n*m lines.
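As an illustrative sketch (the coordinates and labels are hypothetical, not taken from any of the evaluated packages), the two interpretations amount to nothing more than different sort orders over positioned text fragments:

```python
# Each text fragment: (label, x, y), where x is the column position
# and y is the vertical position on the page.
page = [
    ("a1", 0, 0), ("b1", 1, 0),
    ("a2", 0, 1), ("b2", 1, 1),
    ("t",  0, 2),               # full-width object (e.g. a table)
    ("c1", 0, 3), ("d1", 1, 3),
    ("c2", 0, 4), ("d2", 1, 4),
]

def row_major(frags):
    # "All text on a single line belongs together": sort by y, then x.
    return [f[0] for f in sorted(frags, key=lambda f: (f[2], f[1]))]

def column_major(frags):
    # "All text in a column represents consecutive lines": sort by x, then y.
    return [f[0] for f in sorted(frags, key=lambda f: (f[1], f[2]))]

print(row_major(page))     # lines of columns a and b are interleaved
print(column_major(page))  # whole left strip first: a ... c before b ... d
```

On a page that mixes adjacent columns with a full-width object, the row-wise strategy interleaves the lines of a and b, while the column-wise strategy emits all of a and c before any of b and d, which is exactly the a-c-b-d ordering problem discussed below.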
One of the fiercest challenges of PDF conversion is that a page may combine the two column layouts, in which case either strategy fails. For example, if a text has two columns (a and b) that represent adjacent text, followed by another object (e.g. a table), followed by another two columns (c and d) of adjacent text, then all the solutions will at best generate the order a-c-b-d, whereas they should generate a-b-c-d. The only way to decide correctly which text block follows an earlier text block is to apply some linguistic processing that checks whether a concatenation results in valid text. None of the approaches attempts this, and the quality therefore depends on the degree to which the input PDF fits the package's default approach. As none of the packages solves these fundamental problems, we decided to use:
- httrack (version 3.42) to capture websites;
- pdftk (version 1.41) to split PDFs into pages;
- pdftotext (version 3.00) to convert to text.
Together, these form a reasonably robust suite of software that can be integrated easily and
can handle Asian fonts to some extent.
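As a minimal sketch of how the last two tools chain together (the file names are illustrative, and this assumes pdftk and pdftotext are on the PATH):

```python
import glob
import subprocess

def txt_name(page_pdf):
    # pdftotext writes page.txt next to page.pdf by default.
    return page_pdf[:-len(".pdf")] + ".txt"

def split_pages(pdf):
    # pdftk's "burst" operation writes one file per page (pg_0001.pdf, ...).
    subprocess.run(["pdftk", pdf, "burst", "output", "pg_%04d.pdf"], check=True)
    return sorted(glob.glob("pg_*.pdf"))

def page_to_text(page_pdf):
    # -layout asks pdftotext to preserve the physical page layout.
    subprocess.run(["pdftotext", "-layout", page_pdf], check=True)
    return txt_name(page_pdf)

if __name__ == "__main__":
    for page in split_pages("input.pdf"):
        print(page_to_text(page))
```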
In this document, we describe in detail the results of our evaluation of the different systems.
TR007 Alvez & Lucio & Rigau "Providing First-Order Reasoning Support to Large and Complex Ontologies" (March 18, 2010)
In this paper, we summarize the results of adapting a large and complex ontology, using first-order theorem provers as inference engines, to allow for its use in formal reasoning. In particular, our study focuses on providing first-order reasoning support to SUMO (Suggested Upper Merged Ontology). Our contribution to the area of ontological formal reasoning is threefold. First, we describe our procedure for translating SUMO from its original format into a standard first-order language. Second, we use first-order theorem provers as inference engines for debugging the ontology; in this way, we detect and repair several significant problems in the axiomatization of SUMO, including incorrectly defined axioms, redundancies, undesirable properties, and axioms that do not produce the expected logical consequences. Third, in the course of adapting SUMO, we discovered a basic design problem of the ontology that impedes its appropriate use with first-order theorem provers.
Consequently, we also propose a new transformation to overcome this limitation. As a result of this process, we obtain a validated and consistent first-order version of the ontology that can be used by first-order theorem provers.
TR009 Agirre & Artola & Diaz de Ilarraza & Rigau & Soroa & Bosma "KAF: Kyoto Annotation Framework" (July 29, 2009)
This document presents the current draft of KAF (as of July 29, 2009): Kyoto Annotation
Framework to be used within the KYOTO project. KAF aims to provide a reference
format for the representation of semantic annotations.
TR002 Soria & Monachini "Kyoto-LMF: WordNet representation format" (Oct. 29, 2008)
The format described in this technical paper is the current proposal for representing wordnets inside the Kyoto project (henceforth, “Kyoto-LMF wordnet format”). The reference model is Lexical Markup Framework (LMF), version 16, probably one of the most widely recognized standards for the representation of NLP lexicons. LMF is a model providing a common standardized framework for the description and representation of NLP lexicons. The goals of LMF are to provide a common model for the creation and use of such lexical resources, to manage the exchange of data between and among them, and to enable the merging of a large number of individual resources to form extensive global electronic resources.
TR006 Bosma & Vossen "Fact Annotation Format for KYOTO" (Sept. 16, 2008)
This report describes Kyoto FactAF, an XML annotation layer for facts. Within the Kyoto project, the goal of concept extraction is to acquire generic domain knowledge, i.e. knowledge which is true under any circumstances. Specific knowledge (so-called 'facts') is extracted by means of fact mining. Facts generally refer to instances rather than classes of processes and concepts. A fact is an assertion of something which may or may not hold (it can be true or false) at a particular place and time, given one or more entities. The goal of this document is to define a representation format for facts, using existing standards where possible.
TR004 Vossen & Bosma "The representation of terms" (June 23, 2008)
In this report, we present the representation format of the terms and term data that are automatically extracted from the document collection provided by the user.
TR003 Marchetti & Ronzano & Tesconi & Minutoli "Formalizing Knowledge by Ontologies: OWL and KIF" (May 29, 2008)
In recent years, the activities of knowledge formalization and sharing that enable semantically aware management of information have attracted growing attention, especially in distributed environments like the Web.
In this report, after a general introduction about the basis of knowledge abstraction and its formalization through ontologies, we briefly present
a list of relevant formal languages used to represent knowledge: CycL, FLogic, OOM, KIF, Ontolingua, RDF(S) and OWL. Then we focus our
attention on the Web Ontology Language (OWL) and the Knowledge Interchange Format (KIF). OWL is the main language used to describe and share ontologies over the Web; it comprises three sublanguages of increasing expressiveness. We describe its structure as well as the way it is used to reason over asserted knowledge. Moreover, we briefly present three relevant OWL ontology editors: Protégé, SWOOP and OntoTrack,
and two important OWL reasoners: Pellet and FaCT++.
KIF is mainly a standard for describing knowledge among different computer systems so as to facilitate its exchange. We describe the main elements of KIF syntax; we also consider Sigma, an environment for creating, testing, modifying, and performing inference with KIF ontologies. We discuss some meaningful examples of both OWL and KIF ontologies and, in conclusion, compare their main expressive features.
TR001 Vossen "User Scenarios Wikyoto" (May 25, 2008)
In this report, we present a design of the user scenarios for editing domain knowledge in the Wikyoto system. Wikyoto helps domain experts to anchor their knowledge to a common ontology backbone that can be shared across languages and cultures in their domain. The knowledge from their domain is first extracted automatically from textual documents that they can upload to the system. The domain terms in the language of the documents are automatically extracted and presented to the user in a knowledge hierarchy. This term extraction is done by so-called TYBOTS (Term Yielding Robots). Via wordnets and an ontology, the knowledge structures are then used by the KYOTO system to extract facts from the documents and to build smart semantic indexes. This is done by so-called KYBOTS (Knowledge Yielding Robots). Wikyoto hides the complex knowledge structures behind an easy-to-use interface while still generating the complex and rich knowledge structures that we need for mining the knowledge in the source documents.
TR005 Agirre & Rigau "Storyboard: 'to mine by example' for building Kybots" (May 16, 2008)
The Kybot wiki is the editing system used by the Kybot editor to mine the domain corpus by example, and it helps to define the Kybots. To clarify the terminology: the Kybot wiki and the Kybot compiler are software components of the whole system, whereas the Kybot editor is a person.