Consider what epi_slide(.window_size = Inf) should output when min time_value differs by epikey #660

Open
@brookslogan

suppressPackageStartupMessages({
  library(dplyr)
  library(epiprocess)
})
vctrs::vec_rbind(
  tibble::tibble(geo_value = 1, time_value = 1:4 + 0, value = 1:4),
  tibble::tibble(geo_value = 2, time_value = 3:5 + 0, value = 11:13)
) %>%
  as_epi_df() %>%
  epi_slide(~ sum(.x$value), .window_size = Inf)
#> An `epi_df` object, 7 x 4 with metadata:
#> * geo_type  = hhs
#> * time_type = integer
#> * as_of     = 2025-04-08 16:57:38.919515
#> 
#> # A tibble: 7 × 4
#>   geo_value time_value value slide_value
#>       <dbl>      <dbl> <int>       <int>
#> 1         1          1     1           1
#> 2         1          2     2           3
#> 3         1          3     3           6
#> 4         1          4     4          10
#> 5         2          3    11          NA
#> 6         2          4    12          NA
#> 7         2          5    13          NA

# (We get the same result with epi_slide_sum; something like this is in our test suite.)
vctrs::vec_rbind(
  tibble::tibble(geo_value = 1, time_value = 1:4 + 0, value = 1:4),
  tibble::tibble(geo_value = 2, time_value = 3:5 + 0, value = 11:13)
) %>%
  as_epi_df() %>%
  epi_slide_sum(value, .window_size = Inf)
#> An `epi_df` object, 7 x 4 with metadata:
#> * geo_type  = hhs
#> * time_type = integer
#> * as_of     = 2025-04-08 16:57:39.021215
#> 
#> # A tibble: 7 × 4
#>   geo_value time_value value value_running_sum
#>       <dbl>      <dbl> <int>             <dbl>
#> 1         1          1     1                 1
#> 2         1          2     2                 3
#> 3         1          3     3                 6
#> 4         1          4     4                10
#> 5         2          3    11                NA
#> 6         2          4    12                NA
#> 7         2          5    13                NA

Created on 2025-04-08 with reprex v2.1.1

The NAs in the second group (geo_value 2) presumably come from completing time values 1 and 2 with NAs. Is this what we want? On one hand, it makes the set of input time_values contributing to each output time_value the same for every geo_value. On the other hand, it makes the result inconsistent with what one might expect from explicitly spelling out edf %>% group_by(geo_value) %>% epi_slide(....) %>% ungroup(), i.e., that the result would be the same as splitting by group, performing the same operation on each piece, and recombining. (We might have some other, lesser violations of this expectation with period inference somewhere, maybe in epix_slide, but in general I think we've been following this expectation elsewhere. Another violation is in the handling of explicit .ref_time_values: if we split out into geos with partial ref time availability, then we would raise an error.)
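
To make the two readings concrete, here is a rough sketch using only dplyr and tidyr, not epiprocess itself. It is an illustration of the hypothesis above, not a description of how epi_slide is implemented: the names toy, all_times, split_apply, and padded are made up for this example, and cumsum() stands in for a trailing window of infinite size.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the reprex above (plain tibble, no epi_df metadata).
toy <- bind_rows(
  tibble(geo_value = 1, time_value = 1:4 + 0, value = 1:4),
  tibble(geo_value = 2, time_value = 3:5 + 0, value = 11:13)
)

# Reading A: split by geo_value, take a running sum within each geo, recombine.
split_apply <- toy %>%
  group_by(geo_value) %>%
  arrange(time_value, .by_group = TRUE) %>%
  mutate(running_sum = cumsum(value)) %>%
  ungroup()

# Reading B (assumed mechanism): first pad every geo to the global time range
# with NA values, then take the running sum; the leading NAs propagate.
all_times <- seq(min(toy$time_value), max(toy$time_value), by = 1)
padded <- toy %>%
  complete(geo_value, time_value = all_times) %>%
  group_by(geo_value) %>%
  arrange(time_value, .by_group = TRUE) %>%
  mutate(running_sum = cumsum(value)) %>%
  ungroup() %>%
  # Keep only the rows that were present in the original data.
  semi_join(toy, by = c("geo_value", "time_value"))

split_apply$running_sum
#> [1]  1  3  6 10 11 23 36
padded$running_sum
#> [1]  1  3  6 10 NA NA NA
```

Reading B reproduces the NAs reported for geo_value 2, while reading A gives 11, 23, 36 there, which is what the explicit group_by(geo_value) spelling would suggest.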
