Skip to content

Commit 7312ade

Browse files
committed
second half custom_epiworkflows.Rmd
1 parent d83d9d8 commit 7312ade

File tree

1 file changed

+73
-67
lines changed

1 file changed

+73
-67
lines changed

vignettes/custom_epiworkflows.Rmd

Lines changed: 73 additions & 67 deletions
Original file line numberDiff line numberDiff line change
@@ -91,10 +91,10 @@ The `{parsnip}` front-end abstracts away the differences in interface between a
9191

9292
### Postprocessor
9393

94-
Postprocessors are unique to this `{epipredict}`.
94+
Postprocessors are unique to `{epipredict}`.
9595
A postprocessor modifies and formats the prediction after a model has been fit.
9696

97-
Each operation within a postprocessor is called a `layer_`, and the stack of layers is known as `frosting()`,
97+
Each operation within a postprocessor is called a "layer" (functions are named `layer_*`), and the stack of layers is known as `frosting()`,
9898
continuing the metaphor of baking a cake established in `{recipes}`.
9999
Some example operations include:
100100

@@ -314,23 +314,23 @@ data.
314314

315315
# Extending `four_week_ahead`
316316

317-
Now that we've recreated `four_week_ahead`, we can start modifying parts of it to get custom behavior.
317+
Now that we know how to create `four_week_ahead` from scratch, we can start modifying the workflow to get custom behavior.
318318

319319
There are many ways we could modify `four_week_ahead`. We might consider:
320320

321-
- Including a growth rate estimate as part of the model
322-
- Converting from rates to counts, for example if that were the
323-
preferred prediction format
324-
- Including a time component to the prediction, useful if we
325-
expect there to be a strong seasonal component
326-
- Scaling by some factor
321+
- Converting from rates to counts
322+
- Including a growth rate estimate as a predictor
323+
- Including a time component as a predictor -- useful if we
324+
expect there to be a strong seasonal component to the outcome
325+
- Scaling by a factor
327326

328327
We will demo a couple of these modifications below.
329328

330329
## Growth rate
331330

332-
One feature that may potentially improve our forecast is looking at the growth
333-
rate
331+
Let's say we're interested in including growth rate as a predictor in our model because
332+
we think it may potentially improve our forecast.
333+
We can easily create a new growth rate column as a step in the `epi_recipe`.
334334

335335
```{r growth_rate_recipe}
336336
growth_rate_recipe <- epi_recipe(
@@ -341,6 +341,7 @@ growth_rate_recipe <- epi_recipe(
341341
step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
342342
step_epi_ahead(death_rate, ahead = 4 * 7) |>
343343
step_epi_naomit() |>
344+
# Calculate growth rate from death rate column.
344345
step_growth_rate(death_rate) |>
345346
step_training_window()
346347
```
@@ -354,7 +355,8 @@ growth_rate_recipe |>
354355
select(
355356
geo_value, time_value, case_rate,
356357
death_rate, gr_7_rel_change_death_rate
357-
)
358+
) |>
359+
tail()
358360
```
359361

360362
And the role:
@@ -396,8 +398,7 @@ gr_predictions <- gr_fit_workflow |>
396398
<details>
397399
<summary> Plot </summary>
398400

399-
Plotting the result; this is reusing some code from the landing page to print
400-
the forecast date.
401+
We'll reuse some code from the landing page to plot the result.
401402

402403
```{r plotting}
403404
forecast_date_label <-
@@ -416,7 +417,7 @@ result_plot <- autoplot(
416417
) +
417418
geom_vline(aes(xintercept = forecast_date)) +
418419
geom_text(
419-
data = forecast_date_label %>% filter(.response_name == "death_rate"),
420+
data = forecast_date_label |> filter(.response_name == "death_rate"),
420421
aes(x = dates, label = "forecast\ndate", y = heights),
421422
size = 3, hjust = "right"
422423
) +
@@ -428,14 +429,13 @@ result_plot <- autoplot(
428429
result_plot
429430
```
430431

431-
TODO `get_test_data` isn't actually working here...
432+
<!-- TODO `get_test_data` isn't actually working here... -->
432433

433434
## Population scaling
434435

435-
Suppose we're sending our predictions to someone who is looking to understand
436-
counts, rather than rates.
437-
Then we can adjust just the frosting to get a forecast for the counts from our
438-
rates forecaster:
436+
Suppose we want to modify our predictions to apply to counts, rather than rates.
437+
To do that, we can adjust _just_ the `frosting` to perform post-processing on our existing rates forecaster.
438+
Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{regional \text{ } population}{100000}$.
439439

440440
```{r rate_scale}
441441
count_layers <-
@@ -445,15 +445,18 @@ count_layers <-
445445
layer_population_scaling(
446446
.pred,
447447
.pred_distn,
448+
# `df` contains scaling values for all regions; in this case it is the state populations
448449
df = epidatasets::state_census,
449450
df_pop_col = "pop",
450451
create_new = FALSE,
452+
# `rate_rescaling` gives the denominator of the existing rate predictions
451453
rate_rescaling = 1e5,
452454
by = c("geo_value" = "abbr")
453455
) |>
454456
layer_add_forecast_date() |>
455457
layer_add_target_date() |>
456458
layer_threshold()
459+
457460
# building the new workflow
458461
count_workflow <- epi_workflow(
459462
four_week_recipe,
@@ -464,66 +467,67 @@ count_pred_data <- get_test_data(four_week_recipe, training_data)
464467
count_predictions <- count_workflow |>
465468
fit(training_data) |>
466469
predict(count_pred_data)
470+
467471
count_predictions
468472
```
469473

470-
which are 2-3 orders of magnitude larger than the corresponding rates above.
471-
`df` represents the scaling value; in this case it is the state populations,
472-
while `rate_rescaling` gives the denominator of the rate (our fit values were
473-
per 100,000).
474-
475474
# Custom classifier workflow
476475

477-
As a more complicated example of the kind of pipeline that you can build using
478-
this framework, here is an example of a hotspot prediction model, which predicts
479-
whether the case rates are increasing (`up`), decreasing (`down`) or flat
476+
Let's work through an example of a more complicated kind of pipeline you can build using
477+
the `epipredict` framework.
478+
This is a hotspot prediction model, which predicts whether case rates are increasing (`up`), decreasing (`down`) or flat
480479
(`flat`).
481-
This comes from a paper by McDonald, Bien, Green, Hu et al[^3], and roughly
480+
The model comes from a paper by McDonald, Bien, Green, Hu et al[^3], and roughly
482481
serves as an extension of `arx_classifier()`.
483482

484-
First, we need to add a factor version of the `geo_value`, so that it can
485-
actually be used as a feature.
483+
First, we need to add a factor version of `geo_value`, so that it can be used as a feature.
486484

487485
```{r training_factor}
488486
training_data <-
489-
training_data %>%
487+
training_data |>
490488
mutate(geo_value_factor = as.factor(geo_value))
491489
```
492490

493491
Then we put together the recipe, using a combination of base `{recipe}`
494-
functions such as `add_role()` and `step_dummy()`, and `{epipreict}` functions
492+
functions such as `add_role()` and `step_dummy()`, and `{epipredict}` functions
495493
such as `step_growth_rate()`.
496494

497495
```{r class_recipe}
498-
classifier_recipe <- epi_recipe(training_data) %>%
499-
add_role(time_value, new_role = "predictor") %>%
500-
step_dummy(geo_value_factor) %>%
501-
step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
502-
step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
503-
step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
504-
# note recipes::step_cut() has a bug in it, or we could use that here
496+
classifier_recipe <- epi_recipe(training_data) |>
497+
# Turn `time_value` into predictor
498+
add_role(time_value, new_role = "predictor") |>
499+
# Turn `geo_value_factor` into predictor
500+
step_dummy(geo_value_factor) |>
501+
# Create and lag growth rate
502+
step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
503+
step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) |>
504+
step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") |>
505+
# Divide growth rate into 3 bins.
506+
# Note `recipes::step_cut()` has a bug, or we could use that here.
505507
step_mutate(
506508
response = cut(
507509
ahead_7_gr_7_rel_change_case_rate,
508-
breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
510+
# Define bin thresholds.
511+
# Divide by 7 to create weekly values.
512+
breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
509513
labels = c("down", "flat", "up")
510514
),
511515
role = "outcome"
512-
) %>%
513-
step_rm(has_role("none"), has_role("raw")) %>%
516+
) |>
517+
# Drop unused columns.
518+
step_rm(has_role("none"), has_role("raw")) |>
514519
step_epi_naomit()
515520
```
516521

522+
This adds as predictors:
517523

518-
Roughly, this adds as predictors:
524+
- time value (via `add_role()`)
525+
- `geo_value` (via `step_dummy()` and the previous `as.factor()`)
526+
- growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
519527

520-
1. the time value (via `add_role()`)
521-
2. the `geo_value` (via `step_dummy()` and the `as.factor()` above)
522-
3. the growth rate, both at prediction time and lagged by one and two weeks
523-
524-
The outcome is created by composing several steps together: `step_epi_ahead()`
525-
creates a column with the growth rate one week into the future, while
526-
`step_mutate()` creates a factor with the 3 values:
528+
The outcome variable is created by composing several steps together. `step_epi_ahead()`
529+
creates a column with the growth rate one week into the future, and
530+
`step_mutate()` turns that column into a factor with 3 possible values,
527531

528532
$$
529533
Z_{\ell, t}=
@@ -538,47 +542,49 @@ where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
538542
`up` means that the `case_rate` is has increased by at least 25%, while `down`
539543
means it has decreased by at least 20%.
540544

541-
Note that both `step_growth_rate()` and `step_epi_ahead()` assign the role
542-
`none` explicitly; this is because they're used as intermediate steps to create
543-
both predictors and the outcome.
544-
`step_rm()` drops them after they're done, along with the original `raw` columns
545-
`death_rate` and `case_rate` (both `geo_value` and `time_value` are retained
546-
because their roles have been reassigned).
545+
Note that in both `step_growth_rate()` and `step_epi_ahead()` we explicitly assign the role
546+
`none`. This is because those columns are used as intermediaries to create
547+
predictor and outcome columns.
548+
Afterwards, `step_rm()` drops the temporary columns, along with the original `role = "raw"` columns
549+
`death_rate` and `case_rate`. Both `geo_value_factor` and `time_value` are retained
550+
because their roles have been reassigned.
547551

548552

549-
To fit a classification model like this, we will need to use a parsnip model
550-
with mode classification; the simplest example is `multinomial_reg()`.
551-
We don't actually need to apply any layers, so we can skip adding one to the `epiworkflow()`:
553+
To fit a classification model like this, we will need to use a `{parsnip}` model
554+
that has `mode = "classification"`.
555+
The simplest example of a `{parsnip}` `classification`-`mode` model is `multinomial_reg()`.
556+
We don't need to do any post-processing, so we can skip adding `layer`s to the `epiworkflow()`.
557+
So our workflow looks like:
552558

553559
```{r, warning=FALSE}
554560
wf <- epi_workflow(
555561
classifier_recipe,
556562
multinom_reg()
557-
) %>%
563+
) |>
558564
fit(training_data)
559565
560-
forecast(wf) %>% filter(!is.na(.pred_class))
566+
forecast(wf) |> filter(!is.na(.pred_class))
561567
```
562568

563-
And comparing the result with the actual growth rates at that point:
569+
And comparing the result with the actual growth rates at that point in time,
564570
```{r growth_rate_results}
565571
growth_rates <- covid_case_death_rates |>
566572
filter(geo_value %in% used_locations) |>
567573
group_by(geo_value) |>
568574
mutate(
569-
# multiply by 7 to get to weekly equivalents
575+
# Multiply by 7 to estimate weekly equivalents
570576
case_gr = growth_rate(x = time_value, y = case_rate) * 7
571577
) |>
572578
ungroup()
573579
574580
growth_rates |> filter(time_value == "2021-08-01")
575581
```
576582

577-
So they're all increasing at significantly higher than 25% per week (36%-62%),
578-
which matches the classification.
583+
we see that they're all significantly higher than 25% per week (36%-62%),
584+
which matches the classification model's predictions.
579585

580586

581-
See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in depth discussion of this example.
587+
See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in-depth discussion of this example.
582588

583589

584590
[^1]: Think of baking a cake, where adding the frosting is the last step in the

0 commit comments

Comments
 (0)