axis vs. index, sparse vs. dense array semantics and syntax? #4

Open
@sneakers-the-rat


Catching up with what y'all have been doing over here - is this the right place to talk about arrays? I also see it's already in the metamodel here: https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/array.yaml so move this if this is the wrong spot!

Following the example here: https://github.com/linkml/linkml-arrays/blob/main/tests/input/temperature_dataset.yaml

it looks like an array consists of:

  • A top-level DataArray model that contains...
  • A set of linkml:axis attributes that specify class ranges for an axis index
    • The class ranges of linkml:axis attributes implement linkml:NDArray, and have a values attribute that...
      • implement linkml:elements and declare the range and unit for the axis
  • A linkml:array attribute that specifies the actual data of the array as a class range
    • The linkml:array class is both a linkml:NDArray and a linkml:RowOrderedArray that has its dimensionality specified as an annotation and...
      • has a values attribute that declares the range and unit for the array

I have a few questions about the semantics of the axis specification:

  • Are axis indices required for all arrays? It seems like the DataArray and NDArray classes are somewhat distinct, but I can't tell if an NDArray is intended to always be a part of a DataArray. If it isn't, then one would presumably specify NWB-style constraints on size and number of dimensions on the NDArray itself, but then the division of labor between axis and NDArray becomes unclear: some parts of the dimensions are specified by the axis classes, and others are specified by the NDArray.
  • Is it possible to have open-ended dimension specification? Or, put another way, is it possible to annotate only a subset of the axes? I am thinking of the NWB TimeSeries class, which ideally would specify "a first dimension that is always time, but then n other dimensions that are values over time", but the limitations in the schema language make it only possible to express 4-D timeseries arrays. The NDArray dimension specification seems like it would be able to support that by accepting something like 2.. to say "at least two dimensions" or ..4 for "up to 4 dimensions" and so on, but that sacrifices the ability to annotate some of the dimensions (i.e., in the TimeSeries example, we want to say "axis 1 is a time in seconds").
  • What is the desired API for interacting with array and axis data? Currently the generated model looks like:
class TemperatureDataset(ConfiguredBaseModel):
    
    name: str = Field(...)
    latitude_in_deg: LatitudeSeries = Field(...)
    longitude_in_deg: LongitudeSeries = Field(...)
    time_in_d: DaySeries = Field(...)
    temperatures_in_K: TemperatureMatrix = Field(...)

and without the adjoining schema one would have a hard time knowing that 3/5 of those fields are an index onto temperatures_in_K. It would also make the metaprogramming hard: for example, supporting data[x, y, z] so that the index values can be used to select elements in the array.
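To make the open-ended dimension idea from the second question concrete, here is a minimal sketch of how a spec like 2.. or ..4 could be parsed and checked. The syntax is only the one floated above, and the names NDimRange and parse_ndim are made up for illustration; nothing here is existing LinkML behavior.

```python
from typing import NamedTuple, Optional


class NDimRange(NamedTuple):
    """Inclusive bounds on the number of dimensions; None means unbounded."""

    minimum: Optional[int]
    maximum: Optional[int]

    def allows(self, ndim: int) -> bool:
        """Return True if an array with `ndim` dimensions satisfies the bounds."""
        if self.minimum is not None and ndim < self.minimum:
            return False
        if self.maximum is not None and ndim > self.maximum:
            return False
        return True


def parse_ndim(spec: str) -> NDimRange:
    """Parse '3', '2..', '..4', or '2..4' (hypothetical syntax) into bounds."""
    spec = spec.strip()
    if ".." not in spec:
        # A bare number means exactly that many dimensions.
        exact = int(spec)
        return NDimRange(exact, exact)
    low, high = spec.split("..", 1)
    return NDimRange(int(low) if low else None, int(high) if high else None)


# "at least two dimensions": time plus any number of value dimensions
timeseries_dims = parse_ndim("2..")
```

This only covers the count of dimensions. It does not solve the harder part of the question, which is annotating axis 1 as time while leaving the remaining axes open.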

It seems like the axis attributes are behaving like axis indices, and treating them that way might help clarify the meaning of generated model attributes and simplify the syntax a bit. It also seems like this specification is mostly centered on sparse arrays, so DataArray might also benefit from being clarified as a SparseArray that requires indices, with NDArray as another top-level class alongside it for compact arrays, but that might be another issue?

If they are indices, we can make their definition a little more concise by taking advantage of knowing they will be 1D series, so maybe one example that tries to stick close to the existing structure looks like this:

classes:
  TemperatureDataset:
    implements:
      - linkml:DataArray
    attributes:
      index:
        range: TemperatureDatasetIndex
      value:
        implements:
          - linkml:NDArray
        range: float
        unit:
          ucum_code: K

  TemperatureDatasetIndex:
    implements:
      - linkml:indices
    slots:
      - latitude
      - longitude
      - time
    annotations:
      strict_index: true


slots:
  latitude:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: deg

  longitude:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: deg

  time:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: d

where strict_index indicates that only those indices are allowed (rather than allowing additional dimensions to be present) and all are required, linkml:indices indicates a collection of indices for an array, and the rest is standard.
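As a rough sketch of what strict_index could mean at validation time: with it set, the array must have exactly one dimension per index; without it, extra trailing dimensions are tolerated. The helper names (shape_of, check_index) are hypothetical, nested lists stand in for a real array type to keep this dependency-free, and I'm assuming that the indexed axes come first and follow the order the indices are declared in.

```python
from typing import Dict, Sequence, Tuple


def shape_of(value) -> Tuple[int, ...]:
    """Shape of a rectangular nested list (assumes the data is not ragged)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def check_index(
    index: Dict[str, Sequence[float]], value: list, strict: bool = True
) -> None:
    """Raise ValueError if `value` is inconsistent with the declared indices."""
    shape = shape_of(value)
    if strict and len(shape) != len(index):
        raise ValueError(
            f"strict_index: expected {len(index)} dimensions, got {len(shape)}"
        )
    if len(shape) < len(index):
        raise ValueError("array has fewer dimensions than declared indices")
    # Assumption: dict order is axis order, and indexed axes are the leading ones.
    for axis, (name, coords) in enumerate(index.items()):
        if shape[axis] != len(coords):
            raise ValueError(
                f"axis {axis} ({name}) has length {shape[axis]}, "
                f"but the index has {len(coords)} entries"
            )
```

The non-strict branch is also one possible answer to the earlier question about annotating only a subset of the axes.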

So with that we might generate models that look like this (using nptyping syntax for array constraints):

class TemperatureDataset(ArrayModel):
    index: TemperatureDatasetIndex = Field(...)
    value: NDArray[
        Shape["* latitude, * longitude, * time"],
        Float
    ] = Field(..., linkml_meta={'unit': {'ucum_code': 'K'}})

class TemperatureDatasetIndex(IndicesModel):
    # also would put the units in the Fields here but abbreviated for example
    latitude: list[float] = Field(...) 
    longitude: list[float] = Field(...)
    time: list[float] = Field(...)

which gives us a clear convention for building a metamodel ArrayModel that could declare a __getitem__ method for accessing items in the array using items in the indices.
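A minimal sketch of what that __getitem__ could look like, assuming coordinates are looked up by exact match. IndexedArray is a hypothetical stand-in for the proposed ArrayModel, written as a plain dataclass with nested lists rather than pydantic and numpy so it runs on its own:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexedArray:
    """Hypothetical stand-in for the proposed ArrayModel."""

    index: Dict[str, List[float]]  # axis name -> coordinate values, in axis order
    value: list  # nested lists, one nesting level per axis

    def __getitem__(self, coords):
        """Select one element by coordinate values instead of integer positions."""
        if not isinstance(coords, tuple):
            coords = (coords,)
        if len(coords) != len(self.index):
            raise IndexError("expected one coordinate per index")
        item = self.value
        for axis_coords, coord in zip(self.index.values(), coords):
            # list.index raises ValueError when the coordinate is not present
            item = item[axis_coords.index(coord)]
        return item


data = IndexedArray(
    index={"latitude": [10.0, 20.0], "longitude": [30.0, 40.0], "time": [0.0, 1.0]},
    value=[
        [[280.0, 281.0], [282.0, 283.0]],
        [[284.0, 285.0], [286.0, 287.0]],
    ],
)
print(data[20.0, 30.0, 1.0])  # 285.0
```

Exact float matching is a simplification. A real implementation would probably want slices, nearest-neighbour lookup, or a tolerance, but the point is that a known index attribute makes this kind of method straightforward to generate.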

One could also omit the indices construction and infer that all the implements: index attributes on a class are that array's index, since they're unlikely to be reused in a meaningful way as far as I can tell.
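For the inference option, something like the following could collect a class's indices directly. This is a sketch over a plain dict as it would come out of a YAML parser; infer_indices is a made-up helper, it does not use the real LinkML tooling, and it ignores inheritance and mixins.

```python
def infer_indices(schema: dict, class_name: str) -> list:
    """Names of a class's attributes and slots that implement linkml:index."""
    cls = schema["classes"][class_name]
    top_level_slots = schema.get("slots") or {}
    found = []
    # Inline attributes carry their own definition.
    for name, definition in (cls.get("attributes") or {}).items():
        if "linkml:index" in (definition or {}).get("implements", []):
            found.append(name)
    # Slot references are resolved against the schema-level slots.
    for name in cls.get("slots") or []:
        if "linkml:index" in (top_level_slots.get(name) or {}).get("implements", []):
            found.append(name)
    return found


schema = {
    "classes": {
        "TemperatureDataset": {
            "implements": ["linkml:DataArray"],
            "slots": ["latitude", "longitude", "time"],
            "attributes": {
                "value": {"implements": ["linkml:NDArray"], "range": "float"},
            },
        },
    },
    "slots": {
        "latitude": {"implements": ["linkml:index"], "range": "float"},
        "longitude": {"implements": ["linkml:index"], "range": "float"},
        "time": {"implements": ["linkml:index"], "range": "float"},
    },
}
print(infer_indices(schema, "TemperatureDataset"))
```

One thing this makes visible: without the explicit indices class, axis order has to come from the declaration order of the slots, which may or may not be a convention worth relying on.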

anyway just some ideas!
