8.4: Text Processing to Extract Content for Ontologies


If all else fails, and there happens to be a good amount of text available in the subject domain of the (prospective) ontology, one can try Natural Language Processing (NLP) to develop the ontology⁵. Which approaches and tools suit best depends on the goal (and background) of its developers and prospective users, ontological commitment, and available resources.

There are two principal possibilities to use NLP for ontology development:

Use NLP to populate the TBox of the ontology, i.e., obtaining candidate terms from the text, which is also called ontology learning (from text).
Use NLP to populate the ABox of the ontology, i.e., obtaining named entities, which is also called ontology population (from text).

A review of NLP and (bio-)ontologies can be found in [LHC11] and some examples in [CSG⁺10, AWP⁺08].

But why the “if all else fails...” at the start of the section? The reason is that information in text is unstructured and natural language is inherently ambiguous. The first step researchers attempted was to find candidate terms for OWL classes. This requires a Part-of-Speech (POS) tagger to annotate each word in the text with its category; e.g., ‘apple’ is a noun. One then selects only the nouns and counts how often each occurs, grouping synonyms together and assessing which terms are homonyms that are used in different ways and therefore have to be split into different buckets. This process may be assisted by, e.g., WordNet⁶. Challenges arise with euphemisms, slang, and colloquialisms, as well as with the datedness of texts, since terms may have undergone concept drift (i.e., mean something else now) and new ones have been invented. The resulting candidate list is then assessed by humans for relevance, and a selection is subsequently added to the ontology.
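The counting step described above can be sketched as follows. This is a minimal illustration only: the `TOY_POS` lexicon is a hypothetical stand-in for a real POS tagger, and the `CANONICAL` table is a hypothetical stand-in for the variant and synonym grouping that a resource such as WordNet would support. Homonym splitting and the human relevance check are not covered.

```python
from collections import Counter

# Hypothetical stand-in for a POS tagger: a tiny hand-made lexicon.
TOY_POS = {
    "apple": "NOUN", "apples": "NOUN", "fruit": "NOUN", "tree": "NOUN",
    "trees": "NOUN", "orchard": "NOUN", "grows": "VERB", "is": "VERB",
    "the": "DET", "an": "DET", "a": "DET", "in": "ADP", "on": "ADP",
}

# Hypothetical grouping table: variants mapped to one canonical label.
CANONICAL = {"apples": "apple", "trees": "tree"}


def candidate_terms(text):
    """Count the nouns in the text after grouping variants together."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    # Words missing from the toy lexicon are simply skipped.
    nouns = [CANONICAL.get(t, t) for t in tokens if TOY_POS.get(t) == "NOUN"]
    return Counter(nouns)


text = ("The apple grows on a tree. Apples grow on trees in the orchard. "
        "An apple is a fruit.")
counts = candidate_terms(text)
print(counts.most_common())
# [('apple', 3), ('tree', 2), ('orchard', 1), ('fruit', 1)]
```

The ranked list is only a list of candidates; deciding which of them become classes remains a human task.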

The process for candidate relations is a bit more challenging. Although one can easily find the verbs with a POS tagger, it is not always easy to determine what denotes the subject and what denotes the object in the sentence, and authors are ‘sloppy’ or at least imprecise. For instance, one could say (each) ‘human has a heart’, where ‘has’ actually refers to structural parthood, ‘human has a house’ where ‘has’ probably means ownership, and ‘human has a job’ which again has a different meaning. The taxonomy of part-whole relations we have seen in Section 6.2: Part-Whole Relations has been used to assist with this process (e.g., [THU⁺16]). DOLCE and WordNet are linked, so for a noun in the text that is also in WordNet, one can find its DOLCE category. Knowing the DOLCE category, one can check which part-whole relation fits with that thanks to the formal definitions of the relations. For instance, ‘human’ and ‘heart’ are both physical endurants, which are endurants, which are particulars. One can then use OntoPartS’s algorithm: return only those relations whose domain and range are among those three categories, and no others. In this case, it can be (proper) structural parthood or the more generic plain (proper) parthood, but not, say, involvement, because ‘human’ and ‘heart’ are not perdurants. A further strategy is to use, e.g., VerbNet⁷, which uses compatible roles that the nouns play in the relation.
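The filtering idea can be illustrated with a small sketch. It is not the OntoPartS implementation: the taxonomy fragment in `PARENT`, the relation names, and the domain and range declarations in `RELATIONS` are all simplified assumptions made for this example.

```python
# Hypothetical fragment of a DOLCE-like taxonomy: category -> parent category.
PARENT = {
    "PhysicalEndurant": "Endurant",
    "Endurant": "Particular",
    "Perdurant": "Particular",
    "Particular": None,
}

# Hypothetical (domain, range) declarations for three part-whole relations.
RELATIONS = {
    "proper_part_of": ("Particular", "Particular"),
    "proper_structural_part_of": ("PhysicalEndurant", "PhysicalEndurant"),
    "involved_in": ("Perdurant", "Perdurant"),
}


def category_chain(category):
    """Return the category followed by all of its ancestors."""
    chain = []
    while category is not None:
        chain.append(category)
        category = PARENT[category]
    return chain


def candidate_relations(part_category, whole_category):
    """Keep relations whose domain and range lie on the two category chains."""
    part_ok = set(category_chain(part_category))
    whole_ok = set(category_chain(whole_category))
    return sorted(
        name for name, (dom, rng) in RELATIONS.items()
        if dom in part_ok and rng in whole_ok
    )


# 'heart' and 'human' are both physical endurants.
print(candidate_relations("PhysicalEndurant", "PhysicalEndurant"))
# ['proper_part_of', 'proper_structural_part_of']
```

With these assumed declarations, involvement is filtered out because neither category chain contains Perdurant.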

Intuitively, one may be led to think that generic NLP tools will also do for specialized domains, such as (bio-)medicine. Applications in such domains do indeed use those techniques and tools, but, generally, they do not suffice to obtain ‘acceptable’ results. Domain-specific peculiarities are many and wide-ranging. For instance, one has 1) to deal with the variations of terms (e.g., scientific name, variants, abbreviations, and common misspellings) and the grounding step (linking a term to an entity in a biological database) in the ontology-NLP preparation and instance classification [WKB07]; 2) to characterize the question in a question answering system correctly (e.g., [VF09]); and 3) to find ways to deal with the rather long strings and noun phrases that denote a biological entity or concept or universal [AWP⁺08]. Taking such peculiarities into account does generate better overall results than generic or other domain-specific usages of NLP tools, but it requires extra manual preparatory work and a basic understanding of the subject domain and its applications in order to include such rules. For instance, most enzyme names end with ‘-ase’, so one can devise a rule with a regular expression to detect terms ending in ‘-ase’ and add them to the taxonomy as subclasses of Enzyme.
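A rule of this kind might look as follows. This is a sketch under stated assumptions: the pattern, the `NOT_ENZYMES` stoplist, and the textual axiom format are made up for illustration. The stoplist is needed because many ordinary English words also end in ‘-ase’, and the rule misses enzymes whose names do not follow the pattern.

```python
import re

# A word of at least one letter followed by the suffix 'ase'.
ASE_TERM = re.compile(r"\b[A-Za-z][A-Za-z-]*ase\b")

# Hypothetical stoplist of common non-enzyme words ending in '-ase'.
NOT_ENZYMES = {"database", "disease", "phase", "increase", "decrease",
               "base", "case", "release", "purchase"}


def enzyme_candidates(text):
    """Return distinct lower-cased '-ase' terms in order of first appearance."""
    found = []
    for match in ASE_TERM.finditer(text):
        term = match.group(0).lower()
        if term not in NOT_ENZYMES and term not in found:
            found.append(term)
    return found


def subclass_axioms(terms, parent="Enzyme"):
    """Render each candidate as a plain-text subclass statement."""
    return [f"{term.capitalize()} SubClassOf {parent}" for term in terms]


text = ("Lactase and DNA polymerase act in this phase; "
        "the database lists each kinase.")
terms = enzyme_candidates(text)
print(terms)
# ['lactase', 'polymerase', 'kinase']
print(subclass_axioms(terms)[0])
# Lactase SubClassOf Enzyme
```

Note that the single-word pattern returns only ‘polymerase’ for ‘DNA polymerase’, which is one instance of the long noun phrase problem mentioned in point 3.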

Ontology population in the sense of actually adding a lot of objects to the ABox of the OWL file is not exciting, because reasoning over it does not scale well, partially due to the complexity of OWL 2 and partially because ODEs by default load the whole OWL file into main memory and, at least with default settings, will run out of memory. There are alternatives to that, such as putting the instances in a database, or annotating the instances named in the text with the terms of the ontology and storing those texts in a digital library, which can then be queried. The process to realize it requires, among others, a named entity tagger that can tag, say, the ‘Kruger park’ as a named entity. It then has to find a way to figure out that that is an instance of Nature Reserve. For geographic entities, a gazetteer can be used. As with nouns, named entities can also have different strings yet refer to the same entity; e.g., the strings ‘Luis Fonsi’ and ‘L. Fonsi’ refer to the same singer-songwriter of the smash-hit Despacito. There are further issues, such as referring expressions within the same sentence as well as across successive sentences; e.g., in the sentence “he wrote the song during a sizzling Sunday sunset”, “he” refers to Fonsi and “the song” to Despacito, which has to be understood and represented formally as a triple, say, \(\langle\texttt{Fonsi, songwriter, Despacito}\rangle\) and linked to the classes in the ontology.
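Two of these steps, the gazetteer lookup and the merging of different strings for the same entity, can be sketched with plain dictionaries. The `GAZETTEER` and `ALIASES` tables are hypothetical, as is the class name `NatureReserve`; resolving referring expressions such as “he” and “the song” is left out entirely.

```python
# Hypothetical gazetteer: place name (lower-cased) -> ontology class.
GAZETTEER = {"kruger park": "NatureReserve"}

# Hypothetical alias table: surface string (lower-cased) -> entity identifier.
ALIASES = {"luis fonsi": "Fonsi", "l. fonsi": "Fonsi"}


def classify_place(mention):
    """Look up the ontology class of a place name, or None if unknown."""
    return GAZETTEER.get(mention.lower())


def resolve_entity(mention):
    """Map a surface string to its entity identifier, if one is known."""
    return ALIASES.get(mention.lower(), mention)


print(classify_place("Kruger park"))
# NatureReserve

# Both strings resolve to the same entity, so they yield the same triple.
triple = (resolve_entity("L. Fonsi"), "songwriter", "Despacito")
print(triple)
# ('Fonsi', 'songwriter', 'Despacito')
```

Keeping instances in such external tables, rather than in the ABox, is in line with the database alternative mentioned above.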

NLP for ontology learning and population is its own subfield in ontology learning. The brief summary and illustration of some aspects of it does not cover the whole range, but may at least have given some idea of the non-triviality of the task. If you are interested in this topic, a more comprehensive overview is given in [CMSV09] and there are several handbooks.

Footnotes

⁵Of course, once the ontology is there, it can be used as a component in an ontology-driven information system, and an NLP application can be enhanced with an ontology, but that is a separate theme.

⁶https://wordnet.princeton.edu/

⁷https://verbs.colorado.edu/verbnet/