10.1: Toward Multilingual Ontologies


Let us first have a look at just one natural language and an ontology (Section 9.1: Toward Multilingual Ontologies "Linking a Lexicon to an Ontology") before complicating matters with multiple languages in Section 9.1: Toward Multilingual Ontologies "Multiple Natural Languages".

Linking a Lexicon to an Ontology

Most, if not all, ontologies you will have inspected, and all examples given in the preceding chapters, simply gave a human-readable name to the DL concept or OWL class. Perhaps you have loaded an ontology in the ODE and the class hierarchy showed identifiers such as GO:00012345, so that you had to check the class’s annotations to find out what was actually meant by that cryptic identifier. This is an example of a practical difference between OBO and OWL (recall Section 7.1: Relational Databases and Related ‘legacy’ KR), which is, however, based on different underlying modelling principles. DLs assume that a concept is identified by the name given to it; that is, there is a 1:1 correspondence between what a concept, say, being a vegetarian, means and the name we give to it. Natural language and the knowledge, reality, etc. are thus tightly connected and, perhaps, even conflated. Not everybody agrees with that underlying assumption. An alternative viewpoint is to assume that there are language-independent entities (i.e., they exist regardless of whether humans name them or not) that somehow have to be identified, and then one sticks one or more labels or names to them. Put differently: knowledge is one thing and natural language another, and they should be kept as distinct kinds of things.

My impression is that the second view is prevailing, at least within the ontology engineering arena. To date, two engineering solutions have been proposed for how to handle this in an ontology. The first one is the OBO solution, which has been used by the Gene Ontology Consortium [Gen00] since its inception in 1998: each language-independent entity gets an identifier and it must have at least one label that is human-readable. This solution clearly also allows for easy recording of synonyms, variants, and abbreviations, which were commonplace in genetics, especially at the inception of the GO.
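The identifier-plus-labels idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the OBO file format: the `Entity` class, the `build_index` helper, and the `EX:` identifiers are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A language-independent entity: an opaque ID plus human-readable names."""
    identifier: str                       # hypothetical ID, e.g. "EX:0000001"
    label: str                            # the required human-readable label
    synonyms: list[str] = field(default_factory=list)

    def names(self) -> list[str]:
        """All names under which the entity may be looked up."""
        return [self.label, *self.synonyms]


def build_index(entities: list[Entity]) -> dict[str, str]:
    """Map every label or synonym (case-insensitive) to its entity ID."""
    index: dict[str, str] = {}
    for entity in entities:
        for name in entity.names():
            index[name.lower()] = entity.identifier
    return index


# Illustrative data only; the IDs and synonyms are made up for this sketch.
entities = [
    Entity("EX:0000001", "lion", ["Panthera leo"]),
    Entity("EX:0000002", "impala"),
]
index = build_index(entities)
```

Whichever name a user types, the lookup resolves to the same identifier, so renaming or adding a synonym never changes what the axioms refer to.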

The second option emerged within the Semantic Web field [BCHM09]. It also acknowledges the distinction between the knowledge layer and the language layer, yet it places the latter explicitly on top of the knowledge layer. This has the effect that the solution does not ‘overload’ OWL’s annotation fields, but proposes to store all the language and linguistic information in a separate file that interacts with the OWL file. What that separate file should look like, what information should be stored in it, and how it should interact with the OWL file is open to manifold possible solutions. One such proposal will be described here to illustrate how something like that may work: the Lemon model [MdCB⁺12, MAdCB⁺12], of which a fragment has been accepted as a community standard by the W3C³. It also aims to cater for multilingual ontologies.

Consider the Lemon model as depicted in Figure 9.1.1, which shows the kinds of things one can annotate and how those elements relate to each other. At the bottom-center of the figure, there is the ontology to which the language information is linked. Each vocabulary element of the ontology will have an entry in the Lemon file, with more or less lexical information, among others: in which sense the word is meant, what the surface string is, the POS tag (noun, noun phrase, verb, etc.), and gender, case, and related properties (if applicable).


Figure 9.1.1: The Lemon model for multilingual ontologies (Source: [MdCB⁺12])

A simple entry in the Lemon file could look like this, which lists, in sequence: the location of the lexicon, the location of the ontology, the location of the Lemon specification, the lexical entry (including stating in which language the entry is), and then the link to the class in the OWL ontology:

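# Location of the lexicon (the empty prefix resolves against the base)
@base <http://www.example.org/lexicon> .
@prefix : <#> .
# Location of the ontology and of the Lemon specification
@prefix ontology: <http://www.example.org/AfricanWildlifeOntology1#> .
@prefix lemon: <http://www.monnetproject.eu/lemon#> .

# The lexicon and the language of its entries
:myLexicon a lemon:Lexicon ;
    lemon:language "en" ;
    lemon:entry :animal .

# The lexical entry and its link to the class in the OWL ontology
:animal a lemon:LexicalEntry ;
    lemon:form [ lemon:writtenRep "animal"@en ] ;
    lemon:sense [ lemon:reference ontology:animal ] .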

One can also specify rules in the Lemon file, such as how to generate the plural from a singular. However, because the approach is principally a declarative specification, it is not as well equipped to handle rules as the well-established grammar systems for NLP. Also, while Lemon covers a fairly wide range of language features, it may not cover all that is needed; e.g., the noun class system emblematic of the indigenous languages spoken in a large part of sub-Saharan Africa does not quite fit [CK14]. Nonetheless, Lemon, and other proposals with a similar idea of separation of concerns, are a distinct step forward for ontology engineering where interaction with languages is a requirement. Such a separation of concerns is even more important when the scope is broadened to a multilingual setting, which is the topic of the next section.

Multiple Natural Languages

Although this textbook is written in one language, English, for it is currently the dominant language in science, the vast majority of people in the world speak another language. They have information systems in their own language, and they may develop an ontology in their own language or else localise an ontology into it. One could just develop the ontology in one’s own language in the same way as the examples were given in English in the previous chapters and be done with it. But what if, say, SNOMED CT [SNO12] should be translated into one’s own language for electronic health records, like with OpenMRS [Ope], or the ontology has to import an existing ontology that happens not to be represented in the target language and compatibility with the original ontology has to be maintained? What if some named class is not translatable into one single term? For instance, in French, there are two words for the English ‘river’: one for a river that ends in the sea and another for a river that doesn’t (fleuve and rivière, respectively), and isiZulu has two words and corresponding meanings for the participation relation: one as we have seen in Section 6.2: Part-Whole Relations and another for participation of collectives in a process (-hlanganyela). The following example illustrates some actual (unsuccessful) ‘struggling’ with trying to handle this when there is not even a name for the entity in the other language (example from [AFK12]); a more extensive list of the types of issues can be found in [LAF14].

Example \(\PageIndex{1}\):

South Africa has a project on indigenous knowledge management systems, but the example can equally well be generalised to cultural historic museum curation in any country (AI for cultural heritage). Take ingcula, which is a ‘small bladed hunting spear’ (in isiZulu) and has no equivalent term in English. If one tries to represent it in the ‘English understanding’, i.e., adding it not as a single class but as a set of axioms, then one could introduce a class Spear that has two properties, e.g., \(\texttt{Spear}\sqsubseteq\exists\texttt{hasShape.Bladed}\sqcap\exists\texttt{participatesIn.Hunting}\). To represent the ‘small’, one could resort to fuzzy concepts; e.g., following the fuzzy OWL notation of [BS11],

\(\texttt{MesoscopicSmall : Natural}\to \texttt{[0, 1]}\) is a fuzzy datatype,

\(\texttt{MesoscopicSmall(x)} = \texttt{trz(x, 1, 5, 13, 20)}\), with trz the trapezoidal function,

so that a small spear can be defined as

\(\texttt{SmallSpear}\equiv\texttt{Spear}\sqcap\exists\texttt{size.MesoscopicSmall}\)

Then one can create a class in English and declare something like

\(\texttt{SmallBladedHuntingSpear}\equiv\texttt{SmallSpear}\sqcap\exists\texttt{hasShape.Bladed}\sqcap\exists\texttt{participatesIn.Hunting}\)

This is just one of the possibilities of a formalised transliteration of an English natural language description⁴, not a definition of ingcula as it may appear in an ontology about indigenous knowledge of hunting.
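To make the fuzzy datatype above concrete, here is a minimal Python sketch of a trapezoidal membership function with the parameters used for MesoscopicSmall. It assumes the usual reading of a trapezoid (0 up to the first parameter, a linear rise to 1, a plateau, then a linear fall back to 0); the unit of the size values is not stated in the example and is left open here.

```python
def trz(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge between a and b
    if x <= c:
        return 1.0                 # plateau between b and c
    return (d - x) / (d - c)       # falling edge between c and d


def mesoscopic_small(x: float) -> float:
    """Degree in [0, 1] to which a size x counts as 'small' in this example."""
    return trz(x, 1, 5, 13, 20)


# A size of 3 lies halfway up the rising edge, 10 is on the plateau,
# and 25 is beyond the trapezoid altogether.
degrees = [mesoscopic_small(size) for size in (3, 10, 25)]
```

A crisp OWL class would have to pick a hard cut-off for ‘small’, whereas the membership degree lets a spear be small to degree 0.5, say.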

Let’s assume for now that the developer does want to go in this direction. It then requires more advanced capabilities than even lexicalised ontologies offer to keep the two ontologies in sync: lexicalised ontologies only link dictionaries and grammars to the ontologies, but here one would need to map sets of axioms between ontologies.

That is, what was intended as a translation exercise ended up as a different ontology file at least⁵. It gets even more interesting in multilingual organisations and societies, like the European Union with over 20 languages and, e.g., South Africa that has 11 official languages, for then it would require some way of managing all those versions.

Several approaches have been proposed for the multilingual setting, both for localisation and internationalisation of the ontology with links to the original ontology and for multiple languages at the same time in the same system. The simplest approach is called semantic tagging. This means that the ontology is developed ‘in English’, i.e., the vocabulary elements are named in one language and labels are added for other languages, such as Fakultät and Fakulteit for the US-English School. This may be politically undesirable and in any case it does not solve the issue of non-1:1 mappings of vocabulary elements. It might be a quick ‘smart’ solution if you’re lucky (i.e., there happen to be only 1:1 mappings for the vocabulary elements in your ontology), but a solid reusable solution it certainly is not. OBO’s approach of IDs and labels avoids the language politics: one ID with multiple labels for each language, so that it at least treats all the natural languages as equals.

However, both falter as soon as there is no neat 1:1 translation of a term into another single term in a different language, which is quite often the case except for very similar languages. Within the scientific realm this is much less of an issue, and handling synonyms may be more relevant there.
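The limitation can be made tangible with a small Python sketch of a label table, using the School and river examples from the text. The table layout, the `EX:` identifiers, and the two helper functions are hypothetical and only serve to show where a label-per-language scheme stops working.

```python
# Hypothetical label table: one language-neutral ID, candidate terms per language.
LABELS: dict[str, dict[str, list[str]]] = {
    "EX:0000010": {"en": ["school"], "de": ["Fakultät"]},
    # French has no single term for 'river': fleuve and rivière differ in meaning.
    "EX:0000020": {"en": ["river"], "fr": ["fleuve", "rivière"]},
}


def label_for(entity_id: str, lang: str) -> str:
    """Return the single label of an entity in a language.

    Raises LookupError when the mapping is not 1:1, i.e., there is no term
    or there are several candidate terms with different meanings.
    """
    terms = LABELS.get(entity_id, {}).get(lang, [])
    if len(terms) != 1:
        raise LookupError(
            f"{entity_id} has {len(terms)} candidate terms in '{lang}': {terms}"
        )
    return terms[0]


def non_one_to_one(lang: str) -> list[str]:
    """IDs whose label in the given language is missing or ambiguous."""
    return [
        entity_id
        for entity_id, by_lang in LABELS.items()
        if len(by_lang.get(lang, [])) != 1
    ]
```

Calling `label_for("EX:0000020", "fr")` raises an error rather than silently picking one of the two French words, which is exactly the case that labels alone cannot resolve.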

One step forward is a mildly “lexicalised ontology” [BCHM09], of which an example is depicted in Figure 9.1.2. Although it still conflates the entity and its name and promotes one language as the primary one, the handling of other languages is much more extensive and, at least in theory, it will be able to cope with multilingual ontologies to a greater extent. This is thanks to its relatively comprehensive information about the lexical aspects in its own linguistic ontology, with the WordForm etc., which is positioned orthogonally to the domain ontology. In Figure 9.1.2, the English OralMucosa has its equivalent in German as Mundschleimhaut, which is composed here of two sub-words that are nouns themselves, Mund ‘mouth’ and Schleimhaut ‘mucosa’. It is this idea that has been made more precise and comprehensive in its successor, the Lemon model, which is tailored to the Semantic Web setting [MdCB⁺12]. This is indeed the same Lemon as in the previous section. The Lemon entries can become quite large for multiple languages and, as Lemon uses RDF for the serialisation, they are not easily readable. An example for the class Cat in English, French, and German is shown diagrammatically in Figure 9.1.3, and two annotated short entries of the Friend Of A Friend (FOAF)⁶ structured vocabulary in Chichewa (a language spoken in Malawi) are shown in Figure 9.2.1.


Figure 9.1.2: Ontologies in practice: Semantic Tagging—Lexicalized Ontologies. (Source: www.deri.ie/fileadmin/documen...lNLP.final.pdf)


Figure 9.1.3: The Lemon model for multilingual ontologies to represent the class Cat (Source: [MdCB⁺12])
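As a rough textual counterpart to the idea behind Figure 9.1.3 (not a transcription of it), the following Python sketch generates Lemon-style Turtle entries for one class in three languages. The pattern follows the `animal` entry from the previous section. The lexicon and ontology IRIs, the entry naming scheme, and the helper names are assumptions made for this illustration, and the Lemon namespace is written with an explicit `http://` scheme.

```python
LEMON = "http://www.monnetproject.eu/lemon#"    # Lemon namespace (scheme assumed)
ONTOLOGY = "http://www.example.org/ontology#"   # hypothetical ontology IRI
LEXICON = "http://www.example.org/lexicon#"     # hypothetical lexicon IRI


def lexical_entry(entry_id: str, written_rep: str, lang: str, class_name: str) -> str:
    """Serialise one lexical entry that points at an ontology class."""
    return (
        f":{entry_id} a lemon:LexicalEntry ;\n"
        f'    lemon:form [ lemon:writtenRep "{written_rep}"@{lang} ] ;\n'
        f"    lemon:sense [ lemon:reference ontology:{class_name} ] .\n"
    )


def build_turtle(class_name: str, words: dict[str, str]) -> str:
    """One entry per language, all referring to the same ontology class."""
    lines = [
        f"@prefix : <{LEXICON}> .",
        f"@prefix ontology: <{ONTOLOGY}> .",
        f"@prefix lemon: <{LEMON}> .",
        "",
    ]
    for lang, word in words.items():
        entry_id = f"{class_name.lower()}_{lang}"   # e.g. cat_fr (made-up scheme)
        lines.append(lexical_entry(entry_id, word, lang, class_name))
    return "\n".join(lines)


# The written forms for Cat in English, French, and German.
turtle = build_turtle("Cat", {"en": "cat", "fr": "chat", "de": "Katze"})
print(turtle)
```

The ontology class is mentioned once per entry as a reference only; all language-specific information stays in the lexicon side, which is the separation of concerns the model is after.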

There are only a few tools that can cope with ontologies and multiple languages. A web-based tool for creating Lemon files is under development at the time of writing. It would be better if at least some version of language management were to be integrated in ODEs. At present, to the best of my knowledge, only MoKI provides such a service, partially and for a few languages, inclusive of a localised interface [BDFG14].

As a final note: those non-1:1 mappings, in the form of one class in ontology \(O_{1}\) and one or more axioms in \(O_{2}\) or sets of axioms in both, as in Example 9.1.1 with the hunting spear, as well as non-1:1 property alignments, are feasible by now with the (mostly) theoretical results presented in [FK17, Kee17b], so this ingcula example could be solved in theory at least. The details are not pursued here, because they intersect with the topic of ontology alignment. Further, one may counter that an alternative might be to SKOSify it, for that would avoid the complex mapping between a named class and a set of axioms. However, the differences would then be hidden in the labels of the concepts rather than the modelling problem being solved.

Footnotes

²E.g., it has been shown to enhance precision and recall of queries (including enhancing dialogue systems [VF09]), to sort results of an information retrieval query to the digital library [DAA⁺08], (biomedical) text mining, and annotating textbooks for ease of navigation and automated question generation [CCO⁺13] as an example of adaptive e-learning.

³www.w3.org/community/ontolex..._Specification

⁴Plain OWL cannot deal with this, though, for it deals with crisp knowledge only. Refer to Section 10.1: Uncertainty and Vagueness "Fuzzy Ontologies" for some notes on fuzzy ontologies.

⁵Whether it is also a different conceptualisation is a separate discussion.

⁶http://xmlns.com/foaf/spec/