get_test_data() does not consider lagged differences of lagged differences or lags of lags #359

Open
@brookslogan

@rnayebi21 was trying to calculate second differences by applying step_lag_difference() and then another step_lag_difference() to the generated output. But get_test_data()'s horizon processing assumes that both are calculated from "original" signals (and maybe also that, for each epikey, both signals are nonmissing at the latest time value available for that epikey). The result is a time window that is too short for predictions. E.g., lagged differencing with horizon = 7 followed by another with horizon = 7 makes get_test_data() filter to around 8 days, but we actually need around 15 days. Additionally, the eventual error message is deeply nested and unhelpful; it comes from stopifnot(length(values) == length(quantile_levels)).
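
To make the 8-versus-15-day arithmetic concrete, here is a minimal sketch in Python rather than R. It is not epipredict code: the function names and the toy signal are made up for illustration, and it assumes daily data with no gaps.

```python
def lag_difference(values, lag):
    """Return x[t] - x[t - lag], or None where either term is unavailable."""
    out = []
    for t, current in enumerate(values):
        previous = values[t - lag] if t >= lag else None
        if current is None or previous is None:
            out.append(None)
        else:
            out.append(current - previous)
    return out


def second_difference_at_latest(n_days, lag=7):
    """Second lag difference at the latest of n_days consecutive daily values."""
    x = [float(t * t) for t in range(n_days)]  # toy signal, illustration only
    return lag_difference(lag_difference(x, lag), lag)[-1]


print(second_difference_at_latest(8))   # None: an 8-day window is too short
print(second_difference_at_latest(15))  # 98.0: a 15-day window is just enough
```

Each lag difference reaches 7 days further back than its input, so the two reaches add up to 14 days before the latest time value, plus the latest day itself.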

Potential resolutions:

  • Modify step_epi_shift(), step_lag_difference(), and step_growth_rate() to tag their outputs with the shift range they depend on [given "(Gaps causing) missing test predictors gives confusing error" #362, maybe actually a shift set], to check their inputs for such tags and account for them when computing the output tag, and have get_test_data() extract that info. Also make sure to label step_lag_difference()'s operation as a lag, not a horizon (the current horizon naming seems like a hack to make lag_difference + lag work). Maybe also export some related utilities so that developers creating new steps can manipulate these tags.

    • Example A:
      • Lag-difference X by 7 to get Delta_X --- tag Delta_X with a shift range of [-7, 0] [shift set of {-7, 0}]
      • Lag-difference Delta_X by 7 to get DeltaSquared_X --- tag DeltaSquared_X with a shift range of [-14, 0] [shift set of {-14, -7, 0}]
    • Example B:
      • Lag X by 7 to get X_lag_7 --- tag X_lag_7 with a shift range of [-7, -7] [shift set of {-7}]
      • Lag-difference X_lag_7 by 7 to get Delta_X_lag_7 --- tag Delta_X_lag_7 with a shift range of [-14, -7] [shift set of {-14, -7}]
  • Try to avoid get_test_data() altogether, e.g., with this sort of approach.

  • Hybrid approach:

    • Tag all original inputs with the shift range [0, 0] [shift set of {0}]. (That probably simplifies the first resolution's logic anyway.)
    • Use the first resolution.
    • In/around get_test_data(), if not all relevant variables have shift ranges, then assume a shift range of (-Inf, Inf) [or a special value representing this / a branch handling this case, if using shift range approach]. Bake the data, getting extra rows. But then filter those baked results to just the latest time_value (or the latest time_value with nonmissing predictors? not sure) per epikey.
      • Alternatively, ALWAYS apply this filter step even if we think it's supposed to be a no-op.
      • We could also check and issue better error messages if get_test_data() + baking causes epikeys to drop out of the data set, especially if it causes all epikeys to drop out. (By an epikey dropping out, I mean something like drop_na(baked_test_data, all_predictors()) yielding 0 rows with that epikey when that epikey was "present" in the "original" data, for some definitions of "present" and "original".)
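
The shift-set bookkeeping from Examples A and B can be sketched as follows. This is a language-agnostic illustration in Python, not a proposed epipredict API: `lag()`, `lag_difference()`, `shift_range()`, `history_days()`, and `ORIGINAL` are hypothetical names, and it assumes daily data and that every original input is tagged with the shift set {0}, as in the hybrid approach.

```python
ORIGINAL = frozenset({0})  # assumption: original inputs carry the shift set {0}


def lag(shifts, by):
    """Lagging by `by` moves every shift the input depends on back by `by`."""
    return frozenset(s - by for s in shifts)


def lag_difference(shifts, by):
    """x[t] - x[t - by] depends on the input's shifts and on those shifts lagged."""
    return frozenset(shifts) | lag(shifts, by)


def shift_range(shifts):
    """Collapse a shift set to the [min, max] range used in the examples."""
    return (min(shifts), max(shifts))


def history_days(shifts):
    """Daily time values to keep, counting back from the latest one."""
    return 1 - min(shifts)


# Example A: lag difference of a lag difference
delta_x = lag_difference(ORIGINAL, 7)
delta_squared_x = lag_difference(delta_x, 7)

# Example B: lag difference of a lag
x_lag_7 = lag(ORIGINAL, 7)
delta_x_lag_7 = lag_difference(x_lag_7, 7)

for name, shifts in [("Delta_X", delta_x), ("DeltaSquared_X", delta_squared_x),
                     ("X_lag_7", x_lag_7), ("Delta_X_lag_7", delta_x_lag_7)]:
    print(name, sorted(shifts), shift_range(shifts), history_days(shifts))
```

Under these assumptions both DeltaSquared_X and Delta_X_lag_7 call for 15 days of history, which is what a get_test_data()-style filter would need to read off the tags.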

Labels: bug (Something isn't working)
