Semantic search improves both search and display of metadata records over what was previously possible. For example, with traditional, free-text search, if a metadata record described a dataset on American green tree frogs (Hyla cinerea) but never included the term “amphibian” in the metadata record’s text, a free-text search for “data on amphibians” may not return the record. With semantic search, we can index ahead of time the fact that American green tree frogs are amphibians using a suitable knowledge graph, and the record would be returned.
Note: At the moment, annotations are only supported on EML 2.2.0 records via EML’s semantics module (see What’s New in EML 2.2.0). However, the design allows other metadata standards (e.g., schema.org) to make use of these features. Here, we’ll focus on how semantic search applies to EML.
Annotations are inlined into the EML record in various locations and take the following structure:
<attribute id="myatt">
...
<annotation>
<propertyURI label="of characteristic">
http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic
</propertyURI>
<valueURI label="Mass">
http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass
</valueURI>
</annotation>
...
</attribute>
The above annotation asserts that the attribute myatt is “of characteristic” “Mass”. Both terms are defined in the OBOE ontology with specific definitions and logical relationships with other terms. This annotation is both searchable and displayed on the dataset’s landing page.
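To make the annotation structure concrete, here is a minimal Python sketch that pulls the property and value URIs out of a fragment like the one above. It is an illustration only, not part of the indexer: the function name `extract_annotations` and the dictionary keys are made up for this example, and the fragment is a trimmed copy of the sample annotation.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample annotation shown above.
EML_FRAGMENT = """
<attribute id="myatt">
  <annotation>
    <propertyURI label="of characteristic">
      http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic
    </propertyURI>
    <valueURI label="Mass">
      http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass
    </valueURI>
  </annotation>
</attribute>
"""


def extract_annotations(xml_text):
    """Return one dict per <annotation> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    found = []
    # Visit every element (including the root) and look at its direct
    # <annotation> children, so the annotated element's id is available.
    for parent in root.iter():
        for annotation in parent.findall("annotation"):
            prop = annotation.find("propertyURI")
            value = annotation.find("valueURI")
            if prop is None or value is None:
                continue  # skip incomplete annotations
            found.append({
                "subject": parent.get("id"),
                "property_label": prop.get("label"),
                "property_uri": (prop.text or "").strip(),
                "value_label": value.get("label"),
                "value_uri": (value.text or "").strip(),
            })
    return found


if __name__ == "__main__":
    for item in extract_annotations(EML_FRAGMENT):
        print(item["subject"], item["property_label"], item["value_label"])
```

The URIs are stripped because the sample places them on their own lines inside the element, so the raw text carries surrounding whitespace.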
Harvesting an EML record with one or more annotations triggers the normal SystemMetadataSubprocessor and the appropriate ScienceMetadataSubprocessor for the record, but also triggers the EMLAnnotationSubprocessor, which does two things:
1. Performs query expansion on each //annotation/valueURI on the record
2. Stores the expanded terms in the sem_annotation field in the search index
See below for a simplified architectural diagram:
In the above diagram, the EMLAnnotationSubprocessor uses the OntologyModelService to perform query expansion.
The OntologyModelService loads a set of Whitelisted Ontologies at startup into a single Jena OntologyModel, which other index subprocessors can query at index time.
Each term is turned into a Property Path query to find all superclasses of the term:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sem_annotation
WHERE {
<$CONCEPT_URI> rdfs:subClassOf* ?sem_annotation .
}
The current architecture is flexible enough to allow other types of SPARQL queries to be run on annotations.
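As a rough illustration of what the superclass expansion computes, the Python sketch below fills in the query template for one term and then emulates the `rdfs:subClassOf*` property path over a tiny in-memory class hierarchy. Everything here is an assumption made for the example: the `http://example.org/onto#` URIs, the `SUBCLASS_OF` table, and the `expand` helper are hypothetical, and the real service runs the query against a Jena model rather than a dictionary.

```python
from string import Template

# The property path query from above, with the rdfs prefix declared.
QUERY_TEMPLATE = Template("""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sem_annotation
WHERE {
  <$CONCEPT_URI> rdfs:subClassOf* ?sem_annotation .
}""")

# Hypothetical subclass edges: child URI -> list of direct superclass URIs.
SUBCLASS_OF = {
    "http://example.org/onto#GreenTreeFrog": ["http://example.org/onto#Frog"],
    "http://example.org/onto#Frog": ["http://example.org/onto#Amphibian"],
}


def expand(concept_uri, edges=SUBCLASS_OF):
    """Emulate rdfs:subClassOf*: the term itself plus every superclass.

    The '*' operator matches zero or more steps, so the starting term
    is always part of the result, even if the model does not know it.
    """
    seen = []
    stack = [concept_uri]
    while stack:
        uri = stack.pop()
        if uri in seen:
            continue  # guards against cycles and repeated superclasses
        seen.append(uri)
        stack.extend(edges.get(uri, []))
    return seen


if __name__ == "__main__":
    term = "http://example.org/onto#GreenTreeFrog"
    print(QUERY_TEMPLATE.substitute(CONCEPT_URI=term))
    # The expanded terms are what would be stored in sem_annotation.
    print({"sem_annotation": expand(term)})
```

Because the expanded list contains the amphibian class, a search on that class would match a record annotated only with the tree frog class, which is the behavior described in the introduction.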
The search UI leverages BioPortal’s API and their tree view widget to provide users with a way to find terms and search by them. Currently, only terms from ECSO’s MeasurementType tree are viewable and searchable.
The search UI provides popovers on landing pages for annotations and provides an enhanced tooltip if the term is present in BioPortal. When the popover is clicked, a request is made to BioPortal’s class search API to find a definition for the term, and the popover is updated with the definition that was found.
If the term is not found in BioPortal, the popover is still shown and works mostly the same, minus the added definition.
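The lookup-with-fallback behavior described above can be sketched in Python as follows. This is not the search UI's code. The endpoint `https://data.bioontology.org/search`, the parameter names (`q`, `ontologies`, `require_exact_match`, `apikey`), and the response shape (a `collection` list whose items may carry a `definition` list) are assumptions about BioPortal's REST API that should be checked against its documentation. The helper names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed BioPortal search endpoint; verify against the API docs.
BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term_uri, api_key, ontology="ECSO"):
    """Build a class search URL for one annotation term (hypothetical)."""
    params = {
        "q": term_uri,
        "ontologies": ontology,
        "require_exact_match": "true",
        "apikey": api_key,
    }
    return BIOPORTAL_SEARCH + "?" + urlencode(params)


def popover_text(label, uri, response):
    """Compose popover content from a parsed JSON response.

    `response` is the decoded JSON body, or None if the request failed.
    When no definition is available, the popover still shows the label
    and URI, just without the added definition.
    """
    results = (response or {}).get("collection", [])
    definitions = results[0].get("definition", []) if results else []
    lines = [label, uri]
    if definitions:
        lines.append(definitions[0])
    return "\n".join(lines)


if __name__ == "__main__":
    uri = "http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass"
    print(build_search_url(uri, api_key="YOUR_API_KEY"))
    print(popover_text("Mass", uri, None))
```

Keeping the fallback inside `popover_text` means the caller does not need to distinguish a failed request from a term that BioPortal does not know.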
For performance and security reasons, the OntologyModelService doesn’t support loading arbitrary ontologies at query time. Instead, a set of whitelisted ontologies was established.
These are loaded into a Jena OntologyModelService at startup and are available for query expansion when new records are indexed.
When EML records are annotated with terms that are not from the set of Whitelisted Ontologies, the search catalog works slightly differently than it does when annotations use terms from the whitelist:
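Such annotations cannot be searched, because the search UI only supports ECSO MeasurementType annotations at this time.
Their popovers show a plainer tooltip instead of the more helpful one.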
To add an ontology, you must:
1. Add the ontology file to src/main/resources/ontologies in D1_CN_INDEX_PROCESSOR.
2. Register it in src/main/resources/application-context-ontology-model-service.xml in D1_CN_INDEX_PROCESSOR, in the ontologyList and altEntryList properties.