# 8.3: Thesauri


A thesaurus is a simple concept hierarchy in which the concepts are related through three core relations: $$\textbf{BT}$$ (broader term), $$\textbf{NT}$$ (narrower term), and $$\textbf{RT}$$ (related term), plus the auxiliary relations UF/USE (use for/use). For instance, a small section of the Educational Resources Information Center (ERIC) thesaurus looks like this:

reading ability

BT ability

RT reading

RT perception

and the AGROVOC thesaurus about agriculture of the Food and Agriculture Organisation (FAO) of the United Nations has the following asserted, among others:

milk

NT cow milk

NT milk fat

How does one go from this to an ontology? Three approaches exist (thus far):

• Automatically translate the ‘legacy’ representation of the thesaurus into an OWL file and call it an ontology (by virtue of being represented in OWL, regardless of the content);
• Find some conversion rules that are informed by the subject domain and foundational ontologies (e.g., introducing parthood, constitution, etc.);
• Give up on the idea of converting it into an ontology and settle for the W3C-standardized Simple Knowledge Organization System (SKOS) format to achieve compatibility with other Semantic Web technologies.

We will look at the problems with the first option, and at what has been achieved with the second and third options.

## Converting a Thesaurus Into an Ontology

Before looking at conversions, one first has to examine what a typical thesaurus really looks like, in analogy to examining databases before trying to port them into an ontology.

### Problems

The main issues with thesauri, and for which thus a solution has to be found, are that:

• Thesauri are generally a lexicalisation of a conceptualization, or: writing out and describing concepts in the name of the concept, rather than adding characterizing properties;
• Thesauri have low ontological precision with respect to the categories and the relations: there are typically no formal details defined for the concept names, and BT/NT/RT are the only relations allowed in the concept hierarchy.

As thesauri were already in widespread use before ontologies came into the picture for ontology-driven information systems, they lack basic categories like those in DOLCE and BFO. Hence, an alignment activity to such foundational ontology categories will be necessary. Harder to figure out, however, are the relations. RT can be anything, from parthood to transformation, to participation, or anything else, and BT/NT turns out not to be the same as class subsumption; hence, the relations are overloaded with (ambiguous) subject domain semantics. As a result, those relationships are used inconsistently, or at least not precisely enough for an ontology. For instance, in the aforementioned example, milk and milk fat relate to each other in a different way than milk and cow milk do: milk fat is a component of milk, whereas cow milk indicates the milk's origin (and, arguably, it is part of the cow), yet both were NT-ed to milk.
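To see why the first approach, a purely automatic translation, falls short, consider the following minimal Python sketch. It is an illustration only: the tuple representation, the function name `naive_subclass_axioms`, and the textual `SubClassOf` rendering are assumptions made here, not part of any thesaurus tool. It reads every BT/NT assertion from the ERIC and AGROVOC fragments above as class subsumption.

```python
# Thesaurus assertions as (term, relation, other term) tuples,
# copied from the ERIC and AGROVOC fragments shown earlier.
THESAURUS = [
    ("reading ability", "BT", "ability"),
    ("milk", "NT", "cow milk"),
    ("milk", "NT", "milk fat"),
]


def naive_subclass_axioms(entries):
    """Naively read BT/NT as class subsumption (hypothetical helper).

    'X BT Y' becomes 'X SubClassOf Y'; 'X NT Y' becomes 'Y SubClassOf X'.
    RT assertions are skipped because they have no subsumption reading.
    """
    axioms = []
    for term, relation, other in entries:
        if relation == "BT":
            axioms.append(f"{term} SubClassOf {other}")
        elif relation == "NT":
            axioms.append(f"{other} SubClassOf {term}")
    return axioms


axioms = naive_subclass_axioms(THESAURUS)
for axiom in axioms:
    print(axiom)
```

The last axiom produced, `milk fat SubClassOf milk`, states that every milk fat is a kind of milk, which is not what the thesaurus authors meant: milk fat is a component of milk. The translation is syntactically fine, yet the result is ontologically wrong.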

### A Sample Solution: Rules as You Go

Because of the relatively low precision of a thesaurus, it will take a bit more work to convert it into an ontology than it takes for a database. Basically, the ontological analysis that was skipped when developing the thesaurus, in favor of low-hanging fruit for system development, will have to be done now. For instance, a nebulous term like “Communication (Thought Transfer)” in the ERIC thesaurus will have to be clarified and distinguished from other types of communication, such as communication in computer networks. The concepts could then be aligned to a foundational ontology or a top-domain ontology after some additional analysis of the concepts in the hierarchy, aided by a decision diagram like D3. One should also settle on the relations that will replace BT/NT/RT. An approach to this particular aspect of refinement is presented in [KJLW12].

This is a lot of manual work, and there may be ways to automate some aspects of the process. Soergel and co-authors [SLL+04] took a ‘rules as you go’ approach that can be applied after the aforementioned ontological analysis. This means that as soon as some repetitiveness was encountered in the manual activity, a rule was devised, the rest of the thesaurus was assessed for occurrences of the pattern, and all matches were converted in one go. A few examples are included below.

Example $$\PageIndex{1}$$:

Soergel and co-authors observed that, e.g., cow NT cow milk should become $$\texttt{cow} <\texttt{hasComponent}>\texttt{cow milk}$$. There are more animals with milk; hence, a pattern could be $$\texttt{animal} <\texttt{hasComponent}>\texttt{milk}$$, or, more generally, $$\texttt{animal} <\texttt{hasComponent}>\texttt{body part}$$. With that rule, one can automatically find, e.g., goat NT goat milk and convert it into $$\texttt{goat} <\texttt{hasComponent}>\texttt{goat milk}$$. Other pattern examples were $$\texttt{plant} <\texttt{growsIn}>\texttt{soil type}$$ and $$\texttt{geographical entity} <\texttt{spatiallyIncludedIn}>\texttt{geographical entity}$$.
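The ‘rules as you go’ idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in [SLL+04]: the `CATEGORY` table stands in for the outcome of the manual ontological analysis, and the names `RULES` and `apply_rules` are hypothetical. A rule maps a pattern (category of the first term, thesaurus relation, category of the second term) to the refined relation, and every assertion matching the pattern is converted in one pass.

```python
# Assumed result of the manual analysis: each analysed term gets a category.
# Terms not yet analysed (e.g. "milk fat") are simply absent.
CATEGORY = {
    "cow": "animal",
    "goat": "animal",
    "cow milk": "body part",
    "goat milk": "body part",
}

# Rules devised as repetition is noticed:
# (category of left term, thesaurus relation, category of right term) -> refined relation
RULES = {
    ("animal", "NT", "body part"): "hasComponent",
}

ENTRIES = [
    ("cow", "NT", "cow milk"),
    ("goat", "NT", "goat milk"),
    ("milk", "NT", "milk fat"),
]


def apply_rules(entries, category, rules):
    """Convert every entry that matches a rule; keep the rest for manual review."""
    converted, unresolved = [], []
    for left, relation, right in entries:
        pattern = (category.get(left), relation, category.get(right))
        if pattern in rules:
            converted.append((left, rules[pattern], right))
        else:
            unresolved.append((left, relation, right))
    return converted, unresolved


converted, unresolved = apply_rules(ENTRIES, CATEGORY, RULES)
print(converted)   # assertions rewritten with the refined relation
print(unresolved)  # assertions that still need analysis or a new rule
```

Here the single rule converts both the cow and the goat assertion, while milk NT milk fat is left untouched until someone analyses it and, possibly, adds a further rule. Keeping the unresolved list explicit reflects that the approach automates only the repetitive part of the work.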

## Avoiding Ontologies with SKOS

Thesauri tend to be very large, and it may well be too much effort to convert them into a real ontology, yet one would still want some interoperation of thesauri with other systems so as to avail of the large amounts of information they contain. To this end, the W3C developed a standard called Simple Knowledge Organization System(s): SKOS [MB09]. More broadly, it is intended for converting thesauri, classification schemes, taxonomies, subject headings, etc. into one interoperable syntax, thereby enabling concept-based search instead of text-based search, reuse of each other’s concept definitions, search across institutional boundaries, and the use of standard software. This is a step forward compared to isolated thesauri.

However, there are also some limitations to it: ‘unusual’ concept schemes do not fit into SKOS because the original structure is sometimes too complex; skos:Concept has no clear properties like classes have in OWL; much of the subject domain semantics is still in the natural language text, which makes it less amenable to advanced computer processing; and the SKOS ‘semantic relations’ have little semantics, as skos:narrower does not guarantee that it denotes is-a or part-of, for it is just the standardized version of NT.

Then there is a peculiarity in the encoding. Let us take the example where Enzyme is a subtype of Protein, hence, we declare:

SKOSPaths:protein rdf:type skos:Concept

SKOSPaths:enzyme rdf:type skos:Concept

SKOSPaths:enzyme SKOSPaths:broaderGeneric SKOSPaths:protein

in the SKOSPaths SKOS file, which are, mathematically, statements about instances. This holds true also if we were to transform an OWL file to SKOS: each OWL class becomes a SKOS instance due to the mapping of skos:Concept to owl:Class [IS09]. This is a design decision of SKOS. From a purely technical point of view, that can be dealt with easily, but one has to be aware of it when developing applications.
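The instance-level nature of the encoding can be made explicit with a small Python sketch. It uses plain tuples rather than an RDF library, and the helper name `instances_of` is an assumption introduced only for this illustration. The sketch collects everything typed as skos:Concept and then checks that the remaining statement relates two such individuals rather than two classes.

```python
# The three statements from the SKOSPaths example, as (subject, predicate, object).
TRIPLES = [
    ("SKOSPaths:protein", "rdf:type", "skos:Concept"),
    ("SKOSPaths:enzyme", "rdf:type", "skos:Concept"),
    ("SKOSPaths:enzyme", "SKOSPaths:broaderGeneric", "SKOSPaths:protein"),
]


def instances_of(triples, cls):
    """Return all subjects asserted to be an instance of cls (hypothetical helper)."""
    return {s for s, p, o in triples if p == "rdf:type" and o == cls}


concepts = instances_of(TRIPLES, "skos:Concept")

# All statements other than the typing ones should link two individuals.
links = [(s, p, o) for s, p, o in TRIPLES if p != "rdf:type"]
links_between_instances = all(s in concepts and o in concepts for s, _, o in links)

print(sorted(concepts))
print(links_between_instances)
```

Both protein and enzyme end up in `concepts`, i.e., they are individuals of the class skos:Concept, and the broaderGeneric statement holds between those two individuals. An application expecting a class hierarchy, as one would get from an OWL subclass axiom, has to account for this difference.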

As the scope of this book is ontology engineering, SKOS will not be elaborated on further.

This page titled 8.3: Thesauri is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Maria Keet.