`get_test_data()` does not consider lagged differences of lagged differences or lags of lags

@rnayebi21 was trying to calculate second differences by applying `step_lag_difference()` and then another `step_lag_difference()` to the generated output.  But `get_test_data()`'s `horizon` processing assumes that both are calculated from "original" signals (and maybe also that, for each epikey, both signals are nonmissing at the latest time value available for that epikey).  The result is a time window that is too short for predictions.  E.g., lagged differencing with `horizon = 7` followed by another with `horizon = 7` makes `get_test_data()` filter to around 8 days, when we actually need around 15 days (7 + 7 days of history plus the latest day).  Additionally, the error that eventually surfaces appears to be deeply nested and unhelpful: it comes from `stopifnot(length(values) == length(quantile_levels))`.
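
To make the 8-vs-15-day arithmetic concrete, here is a minimal base-R sketch. It does not call epipredict at all; `diff()` with `lag = 7` merely stands in for a lag-7 difference, and the squared sequence is made-up data.

```r
# Base-R illustration only; this does not call epipredict.
lag_days <- 7

# A window of 8 daily values is enough for one lag-7 difference...
x8 <- seq_len(lag_days + 1)^2
first_diff_8 <- diff(x8, lag = lag_days)
# ...but leaves nothing for a second lag-7 difference.
second_diff_8 <- diff(first_diff_8, lag = lag_days)

# A window of 15 daily values yields exactly one second difference.
x15 <- seq_len(2 * lag_days + 1)^2
second_diff_15 <- diff(diff(x15, lag = lag_days), lag = lag_days)

length(first_diff_8)   # 1
length(second_diff_8)  # 0
length(second_diff_15) # 1
```

So a window sized for a single lag-7 difference leaves the second difference with no rows at all.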

Potential resolutions:
- Modify `step_epi_shift()`, `step_lag_difference()`, and `step_growth_rate()` to tag their outputs with the shift range [given #362, maybe actually a shift set] they depend on, to check their inputs for such tags and account for them when computing the output tag, and have `get_test_data()` extract that info.  Also make sure to label `step_lag_difference()`'s operation as a lag, not a horizon (the current `horizon` naming seems like a hack to make lag_difference + lag work).  Maybe also export some related utilities so that developers can manipulate these tags when creating new steps.
  - Example A:
    - Lag-difference X by 7 to get Delta_X --- tag Delta_X with a shift range of [-7, 0] [shift set of {-7, 0}]
    - Lag-difference Delta_X to get DeltaSquared_X --- tag DeltaSquared_X with a shift range of [-14, 0] [shift set of {-14, -7, 0}]
  - Example B:
    - Lag X by 7 to get X_lag_7 --- tag X_lag_7 with a shift range of [-7, -7] [shift set of {-7}]
    - Lag-difference X_lag_7 to get Delta_X_lag_7 --- tag Delta_X_lag_7 with a shift range of [-14, -7] [shift set of {-14, -7}]
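
A rough sketch of how the shift-set bookkeeping in Examples A and B might compose. All helper names here (`shift_set_lag()`, `shift_set_lag_difference()`, `required_window()`) are hypothetical and not part of epipredict, and `required_window()` assumes the reference time is the latest available time value.

```r
# Hypothetical helpers (not part of epipredict). A shift set is an integer
# vector of offsets, relative to the reference time, that a column depends on.

# Lagging by `lag` moves every dependency `lag` steps further back.
shift_set_lag <- function(shifts, lag) {
  sort(unique(shifts - as.integer(lag)))
}

# A lag difference depends on both the input and its lagged copy.
shift_set_lag_difference <- function(shifts, lag) {
  sort(unique(c(shifts, shifts - as.integer(lag))))
}

# Time steps of history needed to evaluate the column at the latest time
# (assumption: the reference time is the latest available time value).
required_window <- function(shifts) {
  max(0L, -min(shifts)) + 1L
}

x <- 0L  # original signal: shift set {0}

# Example A: lag-difference twice.
delta_x <- shift_set_lag_difference(x, 7)
delta_squared_x <- shift_set_lag_difference(delta_x, 7)

# Example B: lag, then lag-difference.
x_lag_7 <- shift_set_lag(x, 7)
delta_x_lag_7 <- shift_set_lag_difference(x_lag_7, 7)

delta_squared_x                  # -14 -7 0
range(delta_x_lag_7)             # -14 -7
required_window(delta_squared_x) # 15
```

The shift range is then just `range()` of the shift set, so tracking sets would cover both variants.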

- Try to avoid `get_test_data()` altogether, e.g., with [this sort of approach](https://github.com/cmu-delphi/epipredict/issues/293#issuecomment-1945964168).
- Hybrid approach:
  - Tag all original inputs with the shift range `[0, 0]` [shift set of {0}].  (That probably simplifies the first resolution's logic anyway.)
  - Use the first resolution.
  - In/around `get_test_data()`, if not all relevant variables have shift ranges, then assume a shift range of `(-Inf, Inf)` [or use a special value representing this / a branch handling this case if using the shift range approach].  Bake the data, getting extra rows.  But then filter those baked results to just the latest `time_value` (or the latest `time_value` with nonmissing predictors?? not sure) per epikey.
    - Alternatively, ALWAYS apply this filter step even if we think it's supposed to be a no-op.
    - We could also check and issue better error messages if `get_test_data()` + baking causes epikeys to drop out of the data set, especially if it causes all epikeys to drop out.  (By an epikey dropping out, I mean something like `drop_na(baked_test_data, all_predictors())` yielding 0 rows with that epikey when that epikey was "present" in the "original" data, for some definitions of "present" and "original".)
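
The filter-and-check part of the hybrid approach could look roughly like the base-R sketch below. `latest_complete_rows()` is a hypothetical helper, the data frame is made up, and this picks the "latest `time_value` with nonmissing predictors" interpretation, which is only one of the options above.

```r
# Hypothetical helper, not part of epipredict. `baked` is a data frame with an
# epikey column, a `time_value` column, and predictor columns.
latest_complete_rows <- function(baked, epikey_col, predictor_cols) {
  # Drop rows with any missing predictor (a base-R analogue of drop_na()).
  ok <- stats::complete.cases(baked[, predictor_cols, drop = FALSE])
  complete <- baked[ok, , drop = FALSE]
  # Keep only the latest time_value within each epikey.
  tv <- as.numeric(complete$time_value)
  is_latest <- tv == ave(tv, complete[[epikey_col]], FUN = max)
  latest <- complete[is_latest, , drop = FALSE]
  # Epikeys present before filtering but absent afterwards.
  dropped <- setdiff(unique(baked[[epikey_col]]), unique(latest[[epikey_col]]))
  list(latest = latest, dropped = dropped)
}

# Made-up baked test data: "ny" never has a nonmissing predictor.
baked <- data.frame(
  geo_value = c("ca", "ca", "ca", "ny", "ny", "ny"),
  time_value = as.Date("2024-01-01") + c(0, 1, 2, 0, 1, 2),
  x_pred = c(NA, 1.5, 2.5, NA, NA, NA),
  stringsAsFactors = FALSE
)

res <- latest_complete_rows(baked, "geo_value", "x_pred")
if (length(res$dropped) > 0) {
  message(
    "Epikeys with no usable test rows after baking: ",
    paste(res$dropped, collapse = ", ")
  )
}
```

Returning the dropped epikeys separately leaves the caller free to decide whether a partial drop is a warning and a total drop is an error.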
