@@ -91,10 +91,10 @@ The `{parsnip}` front-end abstracts away the differences in interface between a
91
91
92
92
### Postprocessor
93
93
94
- Postprocessors are unique to this ` {epipredict} ` .
94
+ Postprocessors are unique to ` {epipredict} ` .
95
95
A postprocessor modifies and formats the prediction after a model has been fit.
96
96
97
- Each operation within a postprocessor is called a ` layer_ ` , and the stack of layers is known as ` frosting() ` ,
97
+ Each operation within a postprocessor is called a "layer" (functions are named ` layer_* ` ) , and the stack of layers is known as ` frosting() ` ,
98
98
continuing the metaphor of baking a cake established in ` {recipes} ` .
99
99
Some example operations include:
100
100
@@ -314,23 +314,23 @@ data.
314
314
315
315
# Extending ` four_week_ahead `
316
316
317
- Now that we've recreated ` four_week_ahead ` , we can start modifying parts of it to get custom behavior.
317
+ Now that we know how to create ` four_week_ahead ` from scratch , we can start modifying the workflow to get custom behavior.
318
318
319
319
There are many ways we could modify ` four_week_ahead ` . We might consider:
320
320
321
- - Including a growth rate estimate as part of the model
322
- - Converting from rates to counts, for example if that were the
323
- preferred prediction format
324
- - Including a time component to the prediction, useful if we
325
- expect there to be a strong seasonal component
326
- - Scaling by some factor
321
+ - Converting from rates to counts
322
+ - Including a growth rate estimate as a predictor
323
+ - Including a time component as a predictor -- useful if we
324
+ expect there to be a strong seasonal component to the outcome
325
+ - Scaling by a factor
327
326
328
327
We will demo a couple of these modifications below.
329
328
330
329
## Growth rate
331
330
332
- One feature that may potentially improve our forecast is looking at the growth
333
- rate
331
+ Let's say we're interested in including growth rate as a predictor in our model because
332
+ we think it may potentially improve our forecast.
333
+ We can easily create a new growth rate column as a step in the ` epi_recipe ` .
334
334
335
335
``` {r growth_rate_recipe}
336
336
growth_rate_recipe <- epi_recipe(
@@ -341,6 +341,7 @@ growth_rate_recipe <- epi_recipe(
341
341
step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
342
342
step_epi_ahead(death_rate, ahead = 4 * 7) |>
343
343
step_epi_naomit() |>
344
+ # Calculate growth rate from death rate column.
344
345
step_growth_rate(death_rate) |>
345
346
step_training_window()
346
347
```
@@ -354,7 +355,8 @@ growth_rate_recipe |>
354
355
select(
355
356
geo_value, time_value, case_rate,
356
357
death_rate, gr_7_rel_change_death_rate
357
- )
358
+ ) |>
359
+ tail()
358
360
```
359
361
360
362
And the role:
@@ -396,8 +398,7 @@ gr_predictions <- gr_fit_workflow |>
396
398
<details >
397
399
<summary > Plot </summary >
398
400
399
- Plotting the result; this is reusing some code from the landing page to print
400
- the forecast date.
401
+ We'll reuse some code from the landing page to plot the result.
401
402
402
403
``` {r plotting}
403
404
forecast_date_label <-
@@ -416,7 +417,7 @@ result_plot <- autoplot(
416
417
) +
417
418
geom_vline(aes(xintercept = forecast_date)) +
418
419
geom_text(
419
- data = forecast_date_label %>% filter(.response_name == "death_rate"),
420
+ data = forecast_date_label |> filter(.response_name == "death_rate"),
420
421
aes(x = dates, label = "forecast\ndate", y = heights),
421
422
size = 3, hjust = "right"
422
423
) +
@@ -428,14 +429,13 @@ result_plot <- autoplot(
428
429
result_plot
429
430
```
430
431
431
- TODO ` get_test_data ` isn't actually working here...
432
+ <!-- TODO `get_test_data` isn't actually working here... -->
432
433
433
434
## Population scaling
434
435
435
- Suppose we're sending our predictions to someone who is looking to understand
436
- counts, rather than rates.
437
- Then we can adjust just the frosting to get a forecast for the counts from our
438
- rates forecaster:
436
+ Suppose we want to modify our predictions to apply to counts, rather than rates.
437
+ To do that, we can adjust _ just_ the ` frosting ` to perform post-processing on our existing rates forecaster.
438
+ Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{regional \text{ } population}{100000}$.
439
439
440
440
``` {r rate_scale}
441
441
count_layers <-
@@ -445,15 +445,18 @@ count_layers <-
445
445
layer_population_scaling(
446
446
.pred,
447
447
.pred_distn,
448
+ # `df` contains scaling values for all regions; in this case it is the state populations
448
449
df = epidatasets::state_census,
449
450
df_pop_col = "pop",
450
451
create_new = FALSE,
452
+ # `rate_rescaling` gives the denominator of the existing rate predictions
451
453
rate_rescaling = 1e5,
452
454
by = c("geo_value" = "abbr")
453
455
) |>
454
456
layer_add_forecast_date() |>
455
457
layer_add_target_date() |>
456
458
layer_threshold()
459
+
457
460
# building the new workflow
458
461
count_workflow <- epi_workflow(
459
462
four_week_recipe,
@@ -464,66 +467,67 @@ count_pred_data <- get_test_data(four_week_recipe, training_data)
464
467
count_predictions <- count_workflow |>
465
468
fit(training_data) |>
466
469
predict(count_pred_data)
470
+
467
471
count_predictions
468
472
```
469
473
470
- which are 2-3 orders of magnitude larger than the corresponding rates above.
471
- ` df ` represents the scaling value; in this case it is the state populations,
472
- while ` rate_rescaling ` gives the denominator of the rate (our fit values were
473
- per 100,000).
474
-
475
474
# Custom classifier workflow
476
475
477
- As a more complicated example of the kind of pipeline that you can build using
478
- this framework, here is an example of a hotspot prediction model, which predicts
479
- whether the case rates are increasing (` up ` ), decreasing (` down ` ) or flat
476
+ Let's work through an example of a more complicated kind of pipeline you can build using
477
+ the ` epipredict ` framework.
478
+ This is a hotspot prediction model, which predicts whether case rates are increasing (` up ` ), decreasing (` down ` ) or flat
480
479
(` flat ` ).
481
- This comes from a paper by McDonald, Bien, Green, Hu et al[ ^ 3 ] , and roughly
480
+ The model comes from a paper by McDonald, Bien, Green, Hu et al[ ^ 3 ] , and roughly
482
481
serves as an extension of ` arx_classifier() ` .
483
482
484
- First, we need to add a factor version of the ` geo_value ` , so that it can
485
- actually be used as a feature.
483
+ First, we need to add a factor version of ` geo_value ` , so that it can be used as a feature.
486
484
487
485
``` {r training_factor}
488
486
training_data <-
489
- training_data %>%
487
+ training_data |>
490
488
mutate(geo_value_factor = as.factor(geo_value))
491
489
```
492
490
493
491
Then we put together the recipe, using a combination of base ` {recipe} `
494
- functions such as ` add_role() ` and ` step_dummy() ` , and ` {epipreict } ` functions
492
+ functions such as ` add_role() ` and ` step_dummy() ` , and ` {epipredict } ` functions
495
493
such as ` step_growth_rate() ` .
496
494
497
495
``` {r class_recipe}
498
- classifier_recipe <- epi_recipe(training_data) %>%
499
- add_role(time_value, new_role = "predictor") %>%
500
- step_dummy(geo_value_factor) %>%
501
- step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
502
- step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
503
- step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
504
- # note recipes::step_cut() has a bug in it, or we could use that here
496
+ classifier_recipe <- epi_recipe(training_data) |>
497
+ # Turn `time_value` into predictor
498
+ add_role(time_value, new_role = "predictor") |>
499
+ # Turn `geo_value_factor` into predictor
500
+ step_dummy(geo_value_factor) |>
501
+ # Create and lag growth rate
502
+ step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
503
+ step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) |>
504
+ step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") |>
505
+ # Divide growth rate into 3 bins.
506
+ # Note `recipes::step_cut()` has a bug, or we could use that here.
505
507
step_mutate(
506
508
response = cut(
507
509
ahead_7_gr_7_rel_change_case_rate,
508
- breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
510
+ # Define bin thresholds.
511
+ # Divide by 7 to create weekly values.
512
+ breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
509
513
labels = c("down", "flat", "up")
510
514
),
511
515
role = "outcome"
512
- ) %>%
513
- step_rm(has_role("none"), has_role("raw")) %>%
516
+ ) |>
517
+ # Drop unused columns.
518
+ step_rm(has_role("none"), has_role("raw")) |>
514
519
step_epi_naomit()
515
520
```
516
521
522
+ This adds as predictors:
517
523
518
- Roughly, this adds as predictors:
524
+ - time value (via ` add_role() ` )
525
+ - ` geo_value ` (via ` step_dummy() ` and the previous ` as.factor() ` )
526
+ - growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
519
527
520
- 1 . the time value (via ` add_role() ` )
521
- 2 . the ` geo_value ` (via ` step_dummy() ` and the ` as.factor() ` above)
522
- 3 . the growth rate, both at prediction time and lagged by one and two weeks
523
-
524
- The outcome is created by composing several steps together: ` step_epi_ahead() `
525
- creates a column with the growth rate one week into the future, while
526
- ` step_mutate() ` creates a factor with the 3 values:
528
+ The outcome variable is created by composing several steps together. ` step_epi_ahead() `
529
+ creates a column with the growth rate one week into the future, and
530
+ ` step_mutate() ` turns that column into a factor with 3 possible values,
527
531
528
532
$$
529
533
Z_{\ell, t}=
@@ -538,47 +542,49 @@ where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
538
542
` up ` means that the ` case_rate ` is has increased by at least 25%, while ` down `
539
543
means it has decreased by at least 20%.
540
544
541
- Note that both ` step_growth_rate() ` and ` step_epi_ahead() ` assign the role
542
- ` none ` explicitly; this is because they're used as intermediate steps to create
543
- both predictors and the outcome.
544
- ` step_rm() ` drops them after they're done , along with the original ` raw ` columns
545
- ` death_rate ` and ` case_rate ` (both ` geo_value ` and ` time_value ` are retained
546
- because their roles have been reassigned) .
545
+ Note that in both ` step_growth_rate() ` and ` step_epi_ahead() ` we explicitly assign the role
546
+ ` none ` . This is because those columns are used as intermediaries to create
547
+ predictor and outcome columns .
548
+ Afterwards, ` step_rm() ` drops the temporary columns , along with the original ` role = " raw" ` columns
549
+ ` death_rate ` and ` case_rate ` . Both ` geo_value_factor ` and ` time_value ` are retained
550
+ because their roles have been reassigned.
547
551
548
552
549
- To fit a classification model like this, we will need to use a parsnip model
550
- with mode classification; the simplest example is ` multinomial_reg() ` .
551
- We don't actually need to apply any layers, so we can skip adding one to the ` epiworkflow() ` :
553
+ To fit a classification model like this, we will need to use a ` {parsnip} ` model
554
+ that has ` mode = "classification" ` .
555
+ The simplest example of a ` {parsnip} ` ` classification ` -` mode ` model is ` multinomial_reg() ` .
556
+ We don't need to do any post-processing, so we can skip adding ` layer ` s to the ` epiworkflow() ` .
557
+ So our workflow looks like:
552
558
553
559
``` {r, warning=FALSE}
554
560
wf <- epi_workflow(
555
561
classifier_recipe,
556
562
multinom_reg()
557
- ) %>%
563
+ ) |>
558
564
fit(training_data)
559
565
560
- forecast(wf) %>% filter(!is.na(.pred_class))
566
+ forecast(wf) |> filter(!is.na(.pred_class))
561
567
```
562
568
563
- And comparing the result with the actual growth rates at that point:
569
+ And comparing the result with the actual growth rates at that point in time,
564
570
``` {r growth_rate_results}
565
571
growth_rates <- covid_case_death_rates |>
566
572
filter(geo_value %in% used_locations) |>
567
573
group_by(geo_value) |>
568
574
mutate(
569
- # multiply by 7 to get to weekly equivalents
575
+ # Multiply by 7 to estimate weekly equivalents
570
576
case_gr = growth_rate(x = time_value, y = case_rate) * 7
571
577
) |>
572
578
ungroup()
573
579
574
580
growth_rates |> filter(time_value == "2021-08-01")
575
581
```
576
582
577
- So they're all increasing at significantly higher than 25% per week (36%-62%),
578
- which matches the classification.
583
+ we see that they're all significantly higher than 25% per week (36%-62%),
584
+ which matches the classification model's predictions .
579
585
580
586
581
- See the [ tooling book] ( https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html ) for a more in depth discussion of this example.
587
+ See the [ tooling book] ( https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html ) for a more in- depth discussion of this example.
582
588
583
589
584
590
[ ^ 1 ] : Think of baking a cake, where adding the frosting is the last step in the
0 commit comments