Deterministic, parallel data iteration #68

@lorenzoh

The parallel eachobs implementation is not deterministic: observations are returned as soon as they are loaded, so they may arrive out of order. This is fast, and fine for use cases like training, where the data is shuffled anyway.

An option for deterministic, in-order iteration would still be helpful in many use cases.

This could be implemented as a wrapper around an existing iterator that does the following:

  • instead of iterating over data with the wrapped iterator, iterate over (1:nobs(data), data) to preserve ordering information
  • buffer the returned observations by index, stripping the index when an observation is passed on
  • return an observation only once all previous (by index) observations have been returned
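
The steps above can be sketched roughly as follows. This is a minimal, language-agnostic illustration written in Python, not the library's API: the function name `in_order` and its `start` parameter are hypothetical, and it assumes the wrapped iterator yields each `(index, observation)` pair exactly once, in any order.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def in_order(indexed: Iterable[Tuple[int, T]], start: int = 1) -> Iterator[T]:
    """Yield observations in index order from an iterator that may
    produce (index, observation) pairs out of order.

    Hypothetical helper: assumes indices are consecutive integers
    beginning at `start` and that every index appears exactly once.
    """
    pending: Dict[int, T] = {}  # observations that arrived early, keyed by index
    next_index = start          # index of the next observation to release

    for index, obs in indexed:
        pending[index] = obs
        # Release every observation whose predecessors have all been returned,
        # stripping the index as it is passed on.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


# Pairs arrive out of order, observations come out in index order.
print(list(in_order([(2, "b"), (1, "a"), (3, "c")])))
```

Note that `pending` is only as small as the wrapped iterator keeps it: if the first observation is slow to load and nothing limits how far ahead the workers run, nearly all other observations could end up buffered before anything is returned.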

I am unsure how much this would affect performance and memory usage, or how it would interact with buffersize. Are there alternative approaches to this implementation?
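
One possible alternative, sketched here under assumptions rather than proposed as the implementation: submit loads in order and wait on them in submission order, keeping only a bounded number in flight. The sketch is in Python for illustration only; `ordered_parallel`, `load`, `window`, and `workers` are hypothetical names, and `window` merely plays a role analogous to buffersize. Memory is bounded by `window`, at the cost of stalling whenever the oldest outstanding load is slow.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Deque, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def ordered_parallel(
    load: Callable[[T], R],
    items: Sequence[T],
    window: int = 4,
    workers: int = 4,
) -> Iterator[R]:
    """Load items in parallel but yield results in input order.

    At most `window` loads are in flight or finished-but-unreturned
    at any time (hypothetical stand-in for a buffer size).
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        in_flight: Deque["Future[R]"] = deque()
        for item in items:
            in_flight.append(pool.submit(load, item))
            # Once the window is full, block on the oldest load so that
            # results are returned strictly in submission order.
            if len(in_flight) >= window:
                yield in_flight.popleft().result()
        # Drain the remaining loads, still in submission order.
        while in_flight:
            yield in_flight.popleft().result()


# Results follow input order regardless of which load finishes first.
print(list(ordered_parallel(lambda x: x * x, range(5), window=2)))
```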

Labels: enhancement