Description
catching up with what y'all have been doing over here - is this the right place to talk about arrays? i also see it's already in the metamodel here: https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/array.yaml so move this if this is the wrong spot!
Following the example here: https://github.com/linkml/linkml-arrays/blob/main/tests/input/temperature_dataset.yaml
it looks like an array consists of:
- A top-level `DataArray` model that contains...
  - A set of `linkml:axis` attributes that specify class ranges for an axis index
    - The class ranges of `linkml:axis` attributes implement `linkml:NDArray`, and have a `values` attribute that...
      - implements `linkml:elements` and declares the range and unit for the axis
  - A `linkml:array` attribute that specifies the actual data of the array as a class range
    - The `linkml:array` class is both a `linkml:NDArray` and a `linkml:RowOrderedArray` that has its dimensionality specified as an annotation and...
      - has a `values` attribute that declares the range and unit for the array
I have a few questions about the semantics of the axis specification:
- Are axis indices required for all arrays? It seems like the `DataArray` and `NDArray` classes are somewhat distinct, but I can't tell if an `NDArray` is intended to always be a part of a `DataArray`. If it wasn't, then one would presumably specify NWB-style array constraints on size and number of dimensions on an `NDArray`, but then the division of labor between `axis` and `NDArray` becomes unclear - some parts of the dimensions are specified by the `axis` classes, and others are specified by the `NDArray`
- Is it possible to have open-ended dimension specification? Or, another way of putting that, is it possible to annotate only a subset of the axes? I am thinking of the NWB `TimeSeries` class, which ideally would specify "a first dimension that is always time, but then n other dimensions that are values over time", but the limitations in the schema language make it only possible to express 4-D timeseries arrays. The `NDArray` dimension specification seems like it would be able to support that by accepting something like `2..` to say "at least two dimensions" or `..4` for "up to 4 dimensions" and so on, but that sacrifices the ability to annotate some of the dimensions (ie. in the `TimeSeries` example, we want to say "axis 1 is a time in seconds")
- What is the desired API for interacting with array and axis data? currently the generated model looks like:

      class TemperatureDataset(ConfiguredBaseModel):
          name: str = Field(...)
          latitude_in_deg: LatitudeSeries = Field(...)
          longitude_in_deg: LongitudeSeries = Field(...)
          time_in_d: DaySeries = Field(...)
          temperatures_in_K: TemperatureMatrix = Field(...)

  and without the adjoining schema one would have a hard time knowing that 3/5 of those fields are an index onto `temperatures_in_K`. It would also make the metaprogramming hard if we wanted to, say, do `data[x,y,z]` and use the indices to select elements in the array.
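To make the open-ended dimension question a bit more concrete, here is a rough sketch of how a `2..` / `..4` style spec could be combined with a few named leading axes, so the first axis can still be annotated as time while the rest stay open. The spec syntax, `parse_ndim_range`, and `check_shape` are all made up for illustration and are not part of LinkML or NWB.

```python
from typing import Optional, Sequence, Tuple


def parse_ndim_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse a hypothetical dimension-count spec: "3", "2..", "..4", or "2..4".

    Returns (minimum, maximum), where maximum is None when unbounded.
    """
    if ".." not in spec:
        exact = int(spec)
        return exact, exact
    low, high = spec.split("..", 1)
    minimum = int(low) if low else 0
    maximum = int(high) if high else None
    return minimum, maximum


def check_shape(
    shape: Sequence[int], ndim_spec: str, named_axes: Sequence[str] = ()
) -> bool:
    """Check that a shape satisfies the dimension-count spec and has at
    least one dimension for every named (annotated) leading axis."""
    minimum, maximum = parse_ndim_range(ndim_spec)
    ndim = len(shape)
    if ndim < max(minimum, len(named_axes)):
        return False
    return maximum is None or ndim <= maximum
```

With something like this, `check_shape((100, 3, 3), "2..", named_axes=("time",))` passes: axis 0 is the annotated time axis and the remaining axes are unconstrained.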
It seems like the `axis` attributes are behaving like axis indices, and treating them as such might help clarify the meaning of generated model attributes and simplify the syntax a bit. It also seems like this specification is mostly centered on sparse arrays, so `DataArray` might also benefit from being clarified as a `SparseArray` that requires indices, with `NDArray` as another top-level class alongside it for compact arrays, but that might be another issue?
If they are indices, we can make their definition a little more concise by taking advantage of knowing they will be 1D series, so maybe one example that tries to stick close to the existing structure looks like this:

    classes:
      TemperatureDataset:
        implements:
          - linkml:DataArray
        attributes:
          index:
            range: TemperatureDatasetIndex
          value:
            implements:
              - linkml:NDArray
            range: float
            unit:
              ucum_code: K
      TemperatureDatasetIndex:
        implements:
          - linkml:indices
        slots:
          - latitude
          - longitude
          - time
        annotations:
          strict_index: true
    slots:
      latitude:
        implements:
          - linkml:index
        range: float
        unit:
          ucum_code: deg
      longitude:
        implements:
          - linkml:index
        range: float
        unit:
          ucum_code: deg
      time:
        implements:
          - linkml:index
        range: float
        unit:
          ucum_code: d

where `strict_index` indicates that only those indices are allowed (rather than allowing additional dimensions to be present) and all are required, `linkml:indices` indicates a collection of indices for an array, and the rest is standard.
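As a rough illustration of what `strict_index` could mean at validation time, here is a small sketch. `validate_indices` is a hypothetical helper, and it assumes the indices are listed in axis order and apply to the leading axes of the array; neither assumption comes from the existing spec.

```python
from typing import Dict, List, Sequence


def validate_indices(
    value_shape: Sequence[int],
    indices: Dict[str, List[float]],
    strict_index: bool = True,
) -> bool:
    """Check an array shape against its indices.

    indices maps index name -> coordinate values, in axis order (assumed).
    With strict_index, the array must have exactly one axis per index;
    without it, extra trailing un-indexed axes are allowed.
    """
    index_lengths = [len(coords) for coords in indices.values()]
    if strict_index and len(value_shape) != len(index_lengths):
        return False
    if len(value_shape) < len(index_lengths):
        return False
    # every indexed axis needs one coordinate per element along that axis
    return all(dim == n for dim, n in zip(value_shape, index_lengths))


indices = {
    "latitude": [0.0, 1.0],
    "longitude": [10.0, 20.0, 30.0],
    "time": [1.0],
}
strict_ok = validate_indices((2, 3, 1), indices)
extra_axis_strict = validate_indices((2, 3, 1, 5), indices)
extra_axis_loose = validate_indices((2, 3, 1, 5), indices, strict_index=False)
```

Here a `(2, 3, 1)` array is accepted, while a `(2, 3, 1, 5)` array is only accepted when `strict_index` is off.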
So with that we might generate models that look like this (using `nptyping` syntax for array constraints):

    from nptyping import NDArray, Shape, Float
    from pydantic import Field

    # ArrayModel and IndicesModel are the proposed (not yet existing) base classes

    class TemperatureDatasetIndex(IndicesModel):
        # also would put the units in the Fields here but abbreviated for example
        latitude: list[float] = Field(...)
        longitude: list[float] = Field(...)
        time: list[float] = Field(...)

    class TemperatureDataset(ArrayModel):
        index: TemperatureDatasetIndex = Field(...)
        value: NDArray[
            Shape["* latitude, * longitude, * time"],
            Float
        ] = Field(..., linkml_meta={'unit': {'ucum_code': 'K'}})

which gives us a clear convention for being able to build a metamodel `ArrayModel` that could declare a `__getitem__` method for accessing items in the array using items in the indices.
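A minimal sketch of what that `__getitem__` could look like, using a plain dataclass and nested lists so it runs without pydantic or numpy. `ArrayModelSketch` is a stand-in name, and the exact-match coordinate lookup is an assumption (a real version would probably want nearest-match or slicing).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ArrayModelSketch:
    # axis name -> coordinate values, in axis order
    index: Dict[str, List[float]]
    # nested lists, one nesting level per axis
    value: Any

    def __getitem__(self, coords):
        """Select by coordinate values instead of integer positions."""
        if not isinstance(coords, tuple):
            coords = (coords,)
        item = self.value
        for axis_coords, coord in zip(self.index.values(), coords):
            # exact-match lookup; raises ValueError for unknown coordinates
            item = item[axis_coords.index(coord)]
        return item


data = ArrayModelSketch(
    index={"latitude": [10.0, 20.0], "longitude": [30.0, 40.0, 50.0]},
    value=[[1, 2, 3], [4, 5, 6]],
)
```

So `data[20.0, 40.0]` gives `5` (second latitude, second longitude), and giving fewer coordinates than axes, like `data[10.0]`, returns the remaining sub-array `[1, 2, 3]`.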
One could also omit the `indices` construction and infer that all the `implements: index` attributes on a class are that array's index, since they're unlikely to be reused in a meaningful way as far as I can tell.
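For the inference variant, something like the following could work on a schema that has already been loaded into plain dicts. `infer_index_slots` is a hypothetical helper and the dict layout just mirrors the YAML proposal above; it doesn't use any real LinkML API.

```python
from typing import Any, Dict, List


def infer_index_slots(schema: Dict[str, Any], class_name: str) -> List[str]:
    """Collect the slots/attributes of a class that implement linkml:index,
    in declaration order, treating them as that array's indices."""
    cls = schema["classes"][class_name]
    slot_defs = schema.get("slots", {})
    names = []
    # slots referenced by name and defined at the schema level
    for name in cls.get("slots", []):
        if "linkml:index" in slot_defs.get(name, {}).get("implements", []):
            names.append(name)
    # attributes defined inline on the class
    for name, attr in cls.get("attributes", {}).items():
        if "linkml:index" in (attr or {}).get("implements", []):
            names.append(name)
    return names


schema = {
    "classes": {
        "TemperatureDataset": {
            "implements": ["linkml:DataArray"],
            "slots": ["latitude", "longitude", "time"],
            "attributes": {
                "value": {"implements": ["linkml:NDArray"], "range": "float"},
            },
        }
    },
    "slots": {
        "latitude": {"implements": ["linkml:index"], "range": "float"},
        "longitude": {"implements": ["linkml:index"], "range": "float"},
        "time": {"implements": ["linkml:index"], "range": "float"},
    },
}
inferred = infer_index_slots(schema, "TemperatureDataset")
```

In this example `inferred` comes out as the three index slots in declaration order, and `value` is left out because it implements `linkml:NDArray` rather than `linkml:index`.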
anyway just some ideas!