Description
@rnayebi21 was trying to calculate second differences with `step_lag_difference()` followed by another `step_lag_difference()` on the generated output. But `get_test_data()`'s `horizon` processing assumes that both are calculated from "original" signals (and maybe also that, for each epikey, these signals are both nonmissing at the latest time value available for that epikey). The result is too short a time window for predictions. E.g., lagged differencing with `horizon = 7` followed by another with `horizon = 7` will make `get_test_data()` filter to around 8 days, but we actually need around 15 days. Additionally, the eventual error message appears to be deeply nested and unhelpful; it comes from `stopifnot(length(values) == length(quantile_levels))`.
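
The 8 versus 15 day arithmetic can be sketched as below. This is an illustrative Python sketch, not epipredict code: both function names are hypothetical, and it assumes daily data where a lag difference by `k` at time `t` reads the values at `t` and `t - k`.

```python
def naive_window_days(lags):
    """Days kept if every step is assumed to read the original signal directly."""
    return max(lags) + 1


def chained_window_days(lags):
    """Days needed when each lag difference is applied to the previous step's output."""
    return sum(lags) + 1


lags = [7, 7]  # two stacked lag differences, each by 7 days
print(naive_window_days(lags))    # 8
print(chained_window_days(lags))  # 15
```

The gap between the two numbers is the missing history: the second difference at the latest time value needs the first difference 7 days earlier, which in turn needs the original signal 14 days earlier.
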
Potential resolutions:
- Modify `step_epi_shift()`, `step_lag_difference()`, and `step_growth_rate()` to tag their outputs with the shift range they depend on [given "(Gaps causing) missing test predictors gives confusing error" #362, maybe actually a shift set], to check their inputs for such tags and take them into account in that logic, and extract that info in `get_test_data()`. Also make sure to appropriately label `step_lag_difference()`'s operation as a lag, not a horizon (the current `horizon` naming seems like a hack to make lag_difference + lag work). Maybe also export some related utilities to let developers manipulate these tags if they want to create new steps.
  - Example A:
    - Lag-difference X by 7 to get Delta_X: tag Delta_X with a shift range of [-7, 0] [shift set of {-7, 0}]
    - Lag-difference Delta_X by 7 to get DeltaSquared_X: tag DeltaSquared_X with a shift range of [-14, 0] [shift set of {-14, -7, 0}]
  - Example B:
    - Lag X by 7 to get X_lag_7: tag X_lag_7 with a shift range of [-7, -7] [shift set of {-7}]
    - Lag-difference X_lag_7 by 7 to get Delta_X_lag_7: tag Delta_X_lag_7 with a shift range of [-14, -7] [shift set of {-14, -7}]
- Try to avoid `get_test_data()` altogether, e.g. with this sort of approach.
- Hybrid approach:
  - Tag all original inputs with the shift range `[0, 0]` [shift set of {0}]. (That probably simplifies the first resolution's logic anyway.)
  - Use the first resolution.
  - In/around `get_test_data()`, if not all relevant variables have shift ranges, then assume a shift range of `(-Infty, Infty)` [or a special value representing this / a branch handling this case if using the shift range approach]. Bake the data, getting extra rows. But then filter those baked results to just the latest time value (or latest time_value with nonmissing predictors?? not sure) per epikey.
    - Alternatively, ALWAYS apply this filter step even if we think it's supposed to be a no-op.
  - We could also check and issue better error messages if `get_test_data()` + baking causes epikeys to drop out of the data set, especially if it causes all epikeys to drop out of the data set. (By an epi_key dropping out, I mean something like `drop_na(baked_test_data, all_predictors())` yielding 0 rows with that epikey when that epikey was "present" in the "original" data.... for some definitions of "present" and "original"....)
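
To make the tagging idea concrete, here is a minimal Python sketch of how shift sets could propagate through lag and lag-difference steps, reproducing Examples A and B. It is not epipredict code: every name (`ORIGINAL`, `lag`, `lag_difference`, `shift_range`, `days_needed`) is hypothetical, shifts are in days, and an untagged variable is represented by `None` to stand in for the "assume an unbounded range, bake everything, filter afterwards" fallback of the hybrid approach.

```python
ORIGINAL = frozenset({0})  # tag for an original input: it depends only on time t


def lag(shift_set, k):
    """Lagging by k reads the input at t - k, so every shift moves back by k."""
    return frozenset(s - k for s in shift_set)


def lag_difference(shift_set, k):
    """x[t] - x[t - k] depends on the input at both t and t - k."""
    return frozenset(shift_set) | lag(shift_set, k)


def shift_range(shift_set):
    """Collapse a shift set to the coarser [min, max] shift range."""
    return (min(shift_set), max(shift_set))


def days_needed(shift_sets):
    """Days of history needed up to and including the latest time value.

    Returns None if any variable is untagged (None), signalling that the
    caller cannot pre-filter and should bake all rows, then filter.
    """
    if any(s is None for s in shift_sets):
        return None
    earliest = min(min(s) for s in shift_sets)
    return max(0, -earliest) + 1


# Example A: two stacked lag differences by 7
delta_x = lag_difference(ORIGINAL, 7)
delta_squared_x = lag_difference(delta_x, 7)
print(sorted(delta_squared_x), shift_range(delta_squared_x))  # [-14, -7, 0] (-14, 0)

# Example B: lag by 7, then lag-difference by 7
x_lag_7 = lag(ORIGINAL, 7)
delta_x_lag_7 = lag_difference(x_lag_7, 7)
print(sorted(delta_x_lag_7), shift_range(delta_x_lag_7))  # [-14, -7] (-14, -7)

print(days_needed([delta_squared_x, delta_x_lag_7]))  # 15
print(days_needed([delta_squared_x, None]))           # None
```

Shift sets carry strictly more information than shift ranges (Example B's set skips everything between -14 and -7), which may matter for the gap-related failures mentioned above; the range is recoverable from the set but not the other way around.
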