Commit 5229654
Message: various requested changes
Parent: 7312ade

File tree

4 files changed: 48 additions, 26 deletions

man/autoplot-epipred.Rd

Lines changed: 1 addition & 2 deletions
(Generated file; diff not rendered.)

man/step_adjust_latency.Rd

Lines changed: 2 additions & 2 deletions
(Generated file; diff not rendered.)

vignettes/backtesting.Rmd

Lines changed: 1 addition & 1 deletion
````diff
@@ -122,7 +122,7 @@ p0 <-
 ```
 </details>
 
-```{r plot_just_revisioning, warn = FALSE, message = FALSE}
+```{r plot_just_revisioning, echo = FALSE, warn = FALSE, message = FALSE}
 p0
 ```
````

vignettes/custom_epiworkflows.Rmd

Lines changed: 44 additions & 21 deletions
````diff
@@ -19,14 +19,17 @@ library(recipes)
 library(epipredict)
 library(epiprocess)
 library(ggplot2)
+library(rlang) # for %@%
 forecast_date <- as.Date("2021-08-01")
 used_locations <- c("ca", "ma", "ny", "tx")
 library(epidatr)
 ```
 
 If you want to do custom data preprocessing or fit a model that isn't included in the canned workflows, you'll need to write a custom `epi_workflow()`.
+An `epi_workflow()` is a sub-class of a `workflows::workflow()` from the
+`{workflows}` package designed to handle panel data specifically.
 
-To get understand how to work with custom `epi_workflow()`s, let's recreate and then
+To understand how to work with custom `epi_workflow()`s, let's recreate and then
 modify the `four_week_ahead` example from the [landing
 page](../index.html#motivating-example).
 Let's first remind ourselves how to use a simple canned workflow:
````
````diff
@@ -133,9 +136,11 @@ parameters have already been calculated based on the training data set.
 Let's create an `epi_recipe()` to hold the 6 steps:
 
 ```{r make_recipe}
-four_week_recipe <- epi_recipe(
-  covid_case_death_rates |>
+filtered_data <- covid_case_death_rates |>
   filter(time_value <= forecast_date, geo_value %in% used_locations)
+four_week_recipe <- epi_recipe(
+  filtered_data,
+  reference_date = (filtered_data %@% metadata)$as_of
 )
 ```
````
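An aside on the new `reference_date = (filtered_data %@% metadata)$as_of` line in the hunk above: `%@%` is rlang's infix attribute accessor (hence the `library(rlang)` import added earlier in this commit). A minimal sketch of what it extracts, using a hypothetical stand-in object rather than a real `epi_df`:

```r
library(rlang) # provides the %@% infix attribute accessor

# Hypothetical stand-in for an `epi_df`: the real object carries its
# versioning information in a "metadata" attribute with an `as_of` date.
toy_df <- structure(
  data.frame(geo_value = "ca", time_value = as.Date("2021-07-31")),
  metadata = list(as_of = as.Date("2021-08-01"))
)

# `x %@% metadata` is equivalent to attr(x, "metadata", exact = TRUE)
(toy_df %@% metadata)$as_of
#> [1] "2021-08-01"
```

`toy_df` and its `metadata` attribute are invented for illustration; in the vignette, the `as_of` date comes from the `epi_df`'s own metadata and records the version date of the data snapshot.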

````diff
@@ -171,21 +176,28 @@ The `step_naomit()`s differ in their treatment of the data at predict time.
 For example, if we wanted to use the same lags for both `case_rate` and `death_rate`, we could
 specify them in a single step, like `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.
 
-In general, `{recipes}` `step`s assign roles (`predictor`, `outcome`) to columns either by adding new columns or adjusting existing
+In general, `{recipes}` `step`s assign roles (such as `predictor`, or `outcome`,
+see the [Roles vignette for
+details](https://recipes.tidymodels.org/articles/Roles.html)) to columns either
+by adding new columns or adjusting existing
 ones.
 `step_epi_lag()`, for example, creates a new column for each lag with the name
 `lag_x_column_name` and labels them each with the `predictor` role.
-`step_epi_ahead()` creates `ahead_x_column_name` columns and labels each with the `outcome` role.
+`step_epi_ahead()` creates `ahead_x_column_name` columns and labels each with
+the `outcome` role.
 
-We can inspect assigned roles with `prep()` to make sure that we are training on the correct columns:
+In general, to inspect the 'prepared' steps, we can run `prep()`, which fits any
+parameters used in the recipe, calculates new columns, and assigns roles[^4].
+For example, we can use `prep()` to make sure that we are training on the
+correct columns:
 
 ```{r prep_recipe}
 prepped <- four_week_recipe |> prep(training_data)
 prepped$term_info |> print(n = 14)
 ```
 
-We can inspect newly-created columns by running `bake()` on the
-recipe so far:
+`bake()` applies a prepared recipe to a (potentially new) dataset to create the dataset as handed to the `epi_workflow()`.
+We can inspect newly-created columns by running `bake()` on the recipe so far:
 
 ```{r bake_recipe}
 four_week_recipe |>
````
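Since `prep()` and `bake()` are inherited from `{recipes}`, the split the hunk above describes can be sketched with a plain, epipredict-free recipe on a built-in dataset: `prep()` estimates any parameters from the training data and finalizes roles, while `bake()` applies the prepared steps to a (possibly new) dataset. A generic illustration, not the vignette's own code:

```r
library(recipes)

# A generic recipe on a built-in dataset (no epipredict involved):
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# prep() estimates step parameters (here: per-column means and sds)
# from the training data and finalizes column roles.
prepped <- prep(rec, training = mtcars)
summary(prepped) # one row per column: variable, type, role, source

# bake() applies the prepared steps to a dataset, returning the
# transformed columns as they would be handed to a model.
baked <- bake(prepped, new_data = mtcars)
```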
```diff
@@ -243,7 +255,7 @@ On the other hand, the layers that are only supported by quantile estimating
 engines (such as `quantile_reg()`) are
 
 - `layer_quantile_distn()`: adds the specified quantiles.
-  If they differ from the ones actually fit, they will be interpolated and/or
+  If the quantile levels specified differ from the ones actually fit, they will be interpolated and/or
   extrapolated.
 - `layer_point_from_distn()`: this adds the median quantile as a point estimate,
   and, if called, should be included after `layer_quantile_distn()`.
```
````diff
@@ -272,8 +284,7 @@ However, it does not generate any predictions; predictions need to be created in
 
 ## Predicting
 
-To make a prediction, we need to first narrow a data set down to the relevant observations.
-This process removes observations that will not be used for training because, for example, they contain missing values or <!-- TODO other reasons?-->.
+To make a prediction, it helps to narrow the data set down to the relevant observations using `get_test_data()`. Not doing this will still fit, but it will predict on every day in the data-set, and not just on the `reference_date`.
 
 ```{r grab_data}
 relevant_data <- get_test_data(
````
```diff
@@ -299,7 +310,7 @@ fit_workflow |> predict(training_data)
 
 The resulting tibble is 800 rows long, however.
 Not running `get_test_data()` means that we're providing irrelevant data along with relevant, valid data.
-Passing the non-subsetted data set produces forecasts for not just the requested `forecast_date`, but for every
+Passing the non-subsetted data set produces forecasts for not just the requested `reference_date`, but for every
 day in the data set that has sufficient data to produce a prediction.
 To narrow this down, we could filter to rows where the `time_value` matches the `forecast_date`:
```

````diff
@@ -356,6 +367,7 @@ growth_rate_recipe |>
     geo_value, time_value, case_rate,
     death_rate, gr_7_rel_change_death_rate
   ) |>
+  arrange(geo_value, time_value) |>
   tail()
 ```
````

````diff
@@ -484,7 +496,8 @@ First, we need to add a factor version of `geo_value`, so that it can be used as
 
 ```{r training_factor}
 training_data <-
-  training_data |>
+  covid_case_death_rates |>
+  filter(time_value <= forecast_date, geo_value %in% used_locations) |>
   mutate(geo_value_factor = as.factor(geo_value))
 ```
````

```diff
@@ -496,7 +509,7 @@ such as `step_growth_rate()`.
 classifier_recipe <- epi_recipe(training_data) |>
   # Turn `time_value` into predictor
   add_role(time_value, new_role = "predictor") |>
-  # Turn `geo_value_factor` into predictor
+  # Turn `geo_value_factor` into predictor by adding indicators for each value
   step_dummy(geo_value_factor) |>
   # Create and lag growth rate
   step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
```
````diff
@@ -514,15 +527,15 @@ classifier_recipe <- epi_recipe(training_data) |>
   ),
   role = "outcome"
 ) |>
-  # Drop unused columns.
+  # Drop unused columns, not strictly necessary
   step_rm(has_role("none"), has_role("raw")) |>
   step_epi_naomit()
 ```
 
 This adds as predictors:
 
-- time value (via `add_role()`)
-- `geo_value` (via `step_dummy()` and the previous `as.factor()`)
+- time value as a continuous variable (via `add_role()`)
+- `geo_value` as a set of indicator variables (via `step_dummy()` and the previous `as.factor()`)
 - growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
 
 The outcome variable is created by composing several steps together. `step_epi_ahead()`
````
````diff
@@ -553,17 +566,25 @@ because their roles have been reassigned.
 To fit a classification model like this, we will need to use a `{parsnip}` model
 that has `mode = "classification"`.
 The simplest example of a `{parsnip}` `classification`-`mode` model is `multinomial_reg()`.
-We don't need to do any post-processing, so we can skip adding `layer`s to the `epiworkflow()`.
-So our workflow looks like:
+The needed layers are more or less the same as the `linear_reg()` regression layers, with the addition that we need to remove some `NA` values:
+
+```{r, warning=FALSE}
+frost <- frosting() |>
+  layer_naomit(starts_with(".pred")) |>
+  layer_add_forecast_date() |>
+  layer_add_target_date() |>
+  layer_threshold()
+```
 
 ```{r, warning=FALSE}
 wf <- epi_workflow(
   classifier_recipe,
-  multinom_reg()
+  multinom_reg(),
+  frost
 ) |>
   fit(training_data)
 
-forecast(wf) |> filter(!is.na(.pred_class))
+forecast(wf)
 ```
 
 And comparing the result with the actual growth rates at that point in time,
````
```diff
@@ -596,3 +617,5 @@ See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preproce
 [^3]: McDonald, Bien, Green, Hu, et al. “Can auxiliary indicators improve
 COVID-19 forecasting and hotspot prediction?.” Proceedings of the National
 Academy of Sciences 118.51 (2021): e2111453118. doi:10.1073/pnas.2111453118
+
+[^4]: Note that `prep()` and `bake()` are standard `{recipes}` functions, so any discussion of them there applies just as well here. For example in the [guide to creating a new step](https://www.tidymodels.org/learn/develop/recipes/#create-the-prep-method).
```
