epi_df joins to data frames with richer keys yield invalid epi_df #671

Open
@brookslogan

  • If x is an epi_df with some set of key columns, y is a data frame whose key columns include all of x's key columns plus some more, and we join the two by x's key columns, the output should be an epi_df with y's key columns. Instead, the output is an epi_df with only x's key columns, even when this violates the one-row-per-epikeytime constraint (we don't even decay to a tibble).
  • On failure, check_ukey_unique returns the error message as a string instead of actually raising an error (which is why the reprex below needs to cat its result).
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Loading required package: epidatasets
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
target <- tibble(geo_value = 1, time_value = 1, target = 1) %>% as_epi_df()
sensors <- tibble(geo_value = 1, time_value = 1, sensor_name = c("A", "B"), sensor_value = 1:2)
joined <- left_join(target, sensors, by = c("geo_value", "time_value"))
joined
#> An `epi_df` object, 2 x 5 with metadata:
#> * geo_type  = hhs
#> * time_type = integer
#> * as_of     = 2025-06-03 10:24:05.131905
#> 
#> # A tibble: 2 × 5
#>   geo_value time_value target sensor_name sensor_value
#>       <dbl>      <dbl>  <dbl> <chr>              <int>
#> 1         1          1      1 A                      1
#> 2         1          1      1 B                      2
key_colnames(joined)
#> [1] "geo_value"  "time_value"
epiprocess:::check_ukey_unique(joined, key_colnames(joined)) %>% cat()
#> There cannot be more than one row with the same combination of geo_value
#> and time_value.  Problematic rows:
#> An `epi_df` object, 2 x 5 with metadata:
#> * geo_type  = hhs
#> * time_type = integer
#> * as_of     = 2025-06-03 10:24:05.131905
#> 
#> # A tibble: 2 × 5
#>   geo_value time_value target sensor_name sensor_value
#>       <dbl>      <dbl>  <dbl> <chr>              <int>
#> 1         1          1      1 A                      1
#> 2         1          1      1 B                      2

Created on 2025-06-03 with reprex v2.1.1
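
For illustration, here is a minimal sketch of the behavior both bullets ask for, using plain tibbles and dplyr only. `assert_key_unique()` is a hypothetical helper, not an epiprocess function, and it is not a proposal for how `check_ukey_unique` should be implemented. It raises a real error when the key columns do not identify rows uniquely. The sketch also shows that the joined data is valid under y's richer key (x's key columns plus `sensor_name`) but not under x's key alone.

```r
library(dplyr)

# Hypothetical helper (NOT part of epiprocess): raise an error when `key_cols`
# do not uniquely identify the rows of `x`; otherwise return `x` invisibly.
assert_key_unique <- function(x, key_cols) {
  dup_keys <- x %>%
    as_tibble() %>%
    count(across(all_of(key_cols)), name = "n_rows") %>%
    filter(n_rows > 1L)
  if (nrow(dup_keys) > 0L) {
    stop(
      "There cannot be more than one row with the same combination of ",
      paste(key_cols, collapse = ", "), "; found ",
      nrow(dup_keys), " duplicated key combination(s).",
      call. = FALSE
    )
  }
  invisible(x)
}

# Same data as the reprex above, but kept as plain tibbles.
target <- tibble(geo_value = 1, time_value = 1, target = 1)
sensors <- tibble(
  geo_value = 1, time_value = 1,
  sensor_name = c("A", "B"), sensor_value = 1:2
)
joined <- left_join(target, sensors, by = c("geo_value", "time_value"))

x_keys <- c("geo_value", "time_value") # x's key columns
y_keys <- c(x_keys, "sensor_name")     # y's richer key columns

# Record whether each key set passes the check or raises an error.
x_key_result <- tryCatch(
  {
    assert_key_unique(joined, x_keys)
    "ok"
  },
  error = function(e) "error"
)
y_key_result <- tryCatch(
  {
    assert_key_unique(joined, y_keys)
    "ok"
  },
  error = function(e) "error"
)
```

Under these assumptions the check errors for `x_keys` and passes for `y_keys`, which matches the expectation that the join result should carry y's key columns. A possible interim workaround, not verified here, would be to convert the epi_df to a tibble before joining and then call `as_epi_df()` on the result with `other_keys = "sensor_name"`.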
