Selectors for Data Package Components

Goal

Develop a generalized selector API and standard that can be used to identify granules within a data resource (data set, data package, composite object) that is managed by DataONE. The “selector” can be appended to an identifier or included with an API call to retrieve an object and the response is the sub-element or component specified by the selector.

Types of selector are likely to vary with the types of composite objects being managed by DataONE. For example, a selector may be defined to return a range by bytes from a BLOB, or perhaps a single file from within a zip archive of a set of files.

Some examples:

  • Byte range (first and last offset, or first offset and size)
  • Index range (offset into a list)
  • HTML fragment identifier
  • Time offset or range
  • Bounding box spatial range
  • Database record selection from a key

Rationale

There are practical limits to the total number of objects that may be effectively managed at the level of detail supported by the DataONE infrastructure. By managing content at the collection or package level, the total number of managed identifiers can be constrained to a more manageable range (e.g. a single collection may have >= 1e5 elements or records). Using selectors, this could be reduced to a single identifier for the collection plus support for the appropriate selector.

So a “selector service” is implemented by a node as a mechanism for retrieving some object that exists as a sub-component of a larger object that is managed by the DataONE infrastructure.

Implications for Implementation

  • Advertisement and discovery of selector services implemented by a node
  • Implementation overhead (processing, etc) for a Member Node
  • Representation formats for extracted information
  • Access control policies (managed parent object controls everything?)