Edited existing chapters #10

Open: wants to merge 5 commits into base: main
4 changes: 2 additions & 2 deletions _freeze/epipredict/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/flatline-forecaster/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/forecast-framework/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/tidymodels-intro/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/tidymodels-regression/execute-results/html.json

Large diffs are not rendered by default.

626 changes: 318 additions & 308 deletions _freeze/tidymodels-regression/figure-html/unnamed-chunk-21-1.svg
214 changes: 107 additions & 107 deletions _freeze/tidymodels-regression/figure-html/unnamed-chunk-24-1.svg
4 changes: 2 additions & 2 deletions epipredict.qmd
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@ Serving both populations is the main motivation for our efforts, but at the same
## Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box.
You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).
You should be able to do a limited amount of customization on them. Any serious customization happens with the framework discussed below.

For the basic forecasters, we provide:

@@ -214,7 +214,7 @@ out_gb <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
Or quantile regression, using our custom forecasting engine `quantile_reg()`:

```{r quantreg, warning = FALSE}
out_gb <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
out_qr <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
quantile_reg())
```

57 changes: 32 additions & 25 deletions flatline-forecaster.qmd
@@ -1,6 +1,6 @@
# Introducing the flatline forecaster

The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The predictive intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).
The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The prediction intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).
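The mechanics can be sketched in a few lines of base R. This is a toy illustration of the idea only, not the `epipredict` implementation, and the series values are made up:

```{r}
# Toy illustration of the flatline idea: the point forecast is the last
# observed value, and the interval comes from quantiles of the historical
# one-step-ahead flatline residuals.
y <- c(3.1, 2.8, 3.5, 4.0, 3.7)        # made-up series for one location
point <- tail(y, 1)                    # last observation, carried forward
resid <- diff(y)                       # y[t+1] - y[t]: flatline residuals
point + quantile(resid, c(0.05, 0.95)) # endpoints of a 90% interval
```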

## Example of using the flatline forecaster

@@ -27,19 +27,26 @@ jhu

### The basic mechanics of the flatline forecaster

The simplest way to create and train a flatline forecaster to predict the death rate one week into the future, is to input the `epi_df` and the name of the column from it that we want to predict in the `flatline_forecaster` function.
Suppose that our goal is to predict death rates one week ahead of the last date available for each state. Mathematically, on day $t$, we want to predict new deaths $y$ that are $h$ days ahead at many locations $j$. So for each location, we'll predict
$$
\hat{y}_{j, {t+h}} = y_{j, t}
$$
where $t$ is 2021-12-31, $h$ is 7 days, and $j$ is the state in our example.

Now, the simplest way to create and train a flatline forecaster to predict the death rate one week into the future is to input the `epi_df` and the name of the column from it that we want to predict in the `flatline_forecaster` function.

```{r}
one_week_ahead <- flatline_forecaster(jhu, outcome = "death_rate")
one_week_ahead <- flatline_forecaster(jhu,
outcome = "death_rate")
one_week_ahead
```

The result is both a fitted model object which could be used any time in the
future to create different forecasts, as well as a set of predicted values and
prediction intervals for each location 7 days after the last available time
value in the data, which is Dec 31, 2021. Note that 7 days is the default
The result is an S3 object, which contains three objects: metadata, a tibble of predictions,
and an S3 object of class `epi_workflow`. There are a few important things to note here.
First, the fitted model object contained in the `epi_workflow`
object could be used any time in the future to create different forecasts.
Next, the tibble of predicted values and prediction intervals is for each location 7 days
after the last available time value in the data, which is Dec 31, 2021. Note that 7 days is the default
number of time steps ahead of the forecast date in which forecasts should be
produced. To change this, you must change the value of the `ahead` parameter
in the list of additional arguments `flatline_args_list()`. Let's change this
@@ -55,15 +62,16 @@ five_days_ahead <- flatline_forecaster(
five_days_ahead
```

We could also specify that we want a 80% predictive interval by changing the
levels. The default 0.05 and 0.95 levels/quantiles give us 90% predictive
We could also specify that we want an 80% prediction interval by changing the
levels. The default 0.05 and 0.95 levels/quantiles give us a 90% prediction
interval.

```{r}
five_days_ahead <- flatline_forecaster(
jhu,
outcome = "death_rate",
flatline_args_list(ahead = 5L, levels = c(0.1, 0.9))
flatline_args_list(ahead = 5L,
levels = c(0.1, 0.9))
)
five_days_ahead
@@ -77,8 +85,10 @@ five_days_ahead$epi_workflow

The fitted model here was based on minimal pre-processing of the data,
estimating a flatline model, and then post-processing the results to be
meaningful for epidemiological tasks. To look deeper into the pre-processing,
model and processing parts individually, you may use the `$` operator after `epi_workflow`. For example, let's examine the pre-processing part in more detail.
meaningful for epidemiological tasks. To look deeper into the pre-processing
or post-processing parts individually, you can use the `extract_preprocessor()`
or `extract_frosting()` functions on the `epi_workflow`, respectively.
For example, let’s examine the pre-processing part in more detail.

```{r}
#| results: false
@@ -94,12 +104,10 @@ extract_preprocessor(five_days_ahead$epi_workflow)
extract_preprocessor(five_days_ahead$epi_workflow)
```


Under Operations, we can see that the pre-processing operations were to lead the
death rate by 5 days (`step_epi_ahead()`) and that the \# of recent observations
used in the training window were not limited (in `step_training_window()` as
`n_training = Inf` in `flatline_args_list()`). You should also see the
molded/pre-processed training data.
`n_training = Inf` in `flatline_args_list()`).

For symmetry, let's have a look at the post-processing.

@@ -116,18 +124,17 @@ extract_frosting(five_days_ahead$epi_workflow)
extract_frosting(five_days_ahead$epi_workflow)
```


The post-processing operations in the order the that were performed were to create the predictions and the predictive intervals, add the forecast and target dates and bound the predictions at zero.
The post-processing operations, in the order that they were performed, were to create the predictions and the prediction intervals, add the forecast and target dates, and bound the predictions at zero.

We can also easily examine the predictions themselves.

```{r}
five_days_ahead$predictions
```

The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 95% predictive interval is available for every state (`geo_value`).
The results above show a distributional forecast produced using data through the end of 2021 for January 5, 2022. A prediction for the death rate per 100K inhabitants along with an 80% prediction interval is available for every state (`geo_value`).

The figure below displays the prediction and prediction interval for three sample states: Arizona, New York, and Florida.
The figure below displays the prediction and prediction interval for three sample states: Arizona, Florida, and New York.

```{r}
#| fig-height: 5
@@ -158,7 +165,7 @@ ggplot(hist, aes(color = geo_value)) +

The vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty of what's to come, the flatline forecaster may be best relegated to being a baseline model and nothing more.

Take for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 5 to 25 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. Then, we row bind the list of results.
Take for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 1 to 28 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. And then we row bind the list of results.

```{r}
out_df <- map(1:28, ~ flatline_forecaster(
@@ -168,7 +175,7 @@ out_df <- map(1:28, ~ flatline_forecaster(
list_rbind()
```

Then, we proceed as we did before. The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`.
Then we proceed as we did before. The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`.

```{r}
#| fig-height: 5
@@ -195,9 +202,9 @@ ggplot(hist) +
theme(legend.position = "none")
```

Now, you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural.
Now you can really see the flat line trend in the predictions. And you may also observe that the further we get from the forecast date, the more unnerving a flatline prediction becomes. It feels increasingly unnatural.

So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model.
So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts, and for periods of great stability than for less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model often serves as a baseline against which a more complex model is compared, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model.

## What we've learned in a nutshell

4 changes: 2 additions & 2 deletions forecast-framework.qmd
@@ -6,7 +6,7 @@ source("_common.R")
```

Underneath the hood, the `arx_forecaster()` (and all our canned
forecasters) creates (and returns) an `epi_workflow`.
forecasters) creates and returns an `epi_workflow`.
Essentially, this is a big S3 object that wraps up the 4 modular steps
(preprocessing - postprocessing) described in the last chapter.

@@ -17,7 +17,7 @@ Essentially, this is a big S3 object that wraps up the 4 modular steps

Let's investigate how these interact with `{tidymodels}` and why it's important
to think of forecasting this way. To have something to play with, we'll continue
to examine the data and an estimated canned corecaster.
to examine the data and an estimated canned forecaster.


```{r demo-workflow}
49 changes: 22 additions & 27 deletions tidymodels-intro.qmd
@@ -10,7 +10,7 @@ R contains a universe of packages that each have their own unique interfaces (fu

If only there was a unifying interface available to help simplify and streamline the modelling process. This is the purpose of `tidymodels`, which provides a unified interface for modeling that abides by the [tidy philosphy](https://tidyverse.tidyverse.org/articles/paper.html#:~:text=Its%20primary%20goal%20is%20to,easier%20to%20learn%20the%20next) and that fits nicely into the tidyverse. From pre-processing to model training to prediction to validation, `tidymodels` provides the necessary tools to perform many modelling tasks.

It is important to understand that the `tidymodels` packages do not aim to implement the algorithms themseves, rather they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project.
It is important to understand that the `tidymodels` packages do not aim to implement the algorithms themselves, rather they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project.

```{r, echo = FALSE}
knitr::include_graphics("img/tidymodels_packages.png")
@@ -31,7 +31,7 @@ knitr::include_graphics("img/tidymodels_model_substeps.png")

## An example using the penguins dataset

We will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](LINK%20TO%20VIGNETTE).
We will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro).

### Load packages

@@ -43,7 +43,7 @@ library(tidymodels)

### Simplify dataset

To keep the focus on learning how to use `tidymodels`, we will work with a simplified version of the dataset in which we will only use the complete cases/rows in the `penguins` dataset
To keep the focus on learning how to use `tidymodels`, we will only retain the complete cases/rows in the `penguins` dataset.

```{r}
penguins <- penguins %>%
@@ -94,7 +94,7 @@ The main goal of this step is to use data transformations to make the data suita

Before training the model, a recipe can be used to carry out the pre-processing required by the model.

The `recipe()` has two main arguments: a formula (with the same format as when doing <regression in tidymodels>\[LINK TO VIGNETTE\]) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.
The `recipe()` has two main arguments: a formula (with the same format as when doing [regression in tidymodels](tidymodels-regression.qmd#sec-slr-intro)) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.

In our example, suppose that our goal is to predict penguin species from bill length, bill depth and flipper length, then our recipe function would look as follows:

@@ -112,22 +112,28 @@ Now, after saying that you are making a recipe by way of the `recipe()` function

- `step_scale()` to normalize numeric data to have a standard deviation of one

One of the advantages of having these pre-processing steps is that they help to simplify concepts that are difficult or a pain to enforce in coding. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and on the test data. Note that centering should not be done on the test data, rather on the training data to avoid data leakage (contamination of the test data by using statistics from the test data). In a recipe, the the estimation of the variable means using the training data and the application of these to center new data sets is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data (`step_scale()`).
One of the advantages of having these pre-processing steps is that they help to simplify the process of applying training statistics to test data. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and test data.

Another useful feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you wanted to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function.
:::{.callout-note}
Note that centering should not be done on the test data, but rather on the training data to avoid data leakage - contamination of the test data by using statistics from the test data.
:::

Now, let's try this all out on our example.
In a recipe, estimation of the variable means using the training data and the application of these to center new data is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data.

Another feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you want to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function.

Now, let's try this out on our example.

```{r}
penguins_recipe <- recipe(species ~ ., data = train_data) %>%
step_corr(all_predictors()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes())
step_center(all_predictors()) %>%
step_scale(all_predictors())
```
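As a quick check of this behaviour, the recipe can be estimated on the training set and then applied to held-out data. This is a sketch that assumes the test split from the sampling step is named `test_data`:

```{r}
# Estimate the centering/scaling statistics from the training data only...
penguins_prepped <- prep(penguins_recipe, training = train_data)
# ...then apply those same training statistics to new data (no leakage).
bake(penguins_prepped, new_data = test_data)
```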

To summarize, we obtained a recipe object, `penguins_recipe`, by chaining the `recipe()` and step functions together on the training data that we had ready from sampling.

Now, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. Notice that the steps shown here are in the order that they were input into the recipe and they specify the variables used in each step.
Now, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. Notice that the steps shown here are in the order that they were input into the recipe and they each indicate the variables that were used.

```{r}
penguins_recipe
@@ -139,7 +145,7 @@ Recall that in R, the same type of model could be fit using several different pa

`Tidymodels` created a single interface that supports the use of both models. Moreover, this general interface supports an even wider range of functions that perform random forests. The key part that points to the function and package to be used is the engine.

Let's see how this works in practice. In the below example, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the \[Regression in Tidymodels\]<ADD LINK HERE> chapter.
Let's see how this works in practice. In the below example, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro) chapter.

```{r}
penguins_ranger <- rand_forest(trees = 100, mode = "classification") %>%
@@ -153,17 +159,9 @@ penguins_rf <- rand_forest(trees = 100, mode = "classification") %>%
set_engine("randomForest")
```

For the remainder of this tutorial, we'll stick with using `ranger` for simplify. At this stage, we're ready to pre-process and model. The first task of those two is to apply our recipe before we train and test our model, in that we must

1. Process the recipe using the training set.

2. Use the recipe on the training set to get the finalized predictor set.

3. Use the recipe on the predictor set to get the test set.

A workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. Hence, training and testing different workflows becomes easier.
For the remainder of this tutorial, we'll stick with using `ranger` for simplicity. And at this stage, we're ready to pre-process and model. A workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. So training and testing different workflows becomes easier.

For our example, we'll try tidy model's workflows package to pair our model and our recipe together.
For our example, we'll employ the `workflows` package to pair our model and our recipe together.

```{r}
penguins_wflow <- workflow() %>%
@@ -173,7 +171,7 @@ penguins_wflow <- workflow() %>%
penguins_wflow
```

After that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the the recipe and train the model from the finalized predictors.
After that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the recipe and train the model from the finalized predictors.

```{r}
penguins_fit <- penguins_wflow %>% fit(data = train_data)
@@ -278,15 +276,12 @@ penguins_aug %>%

The diagonal shows the counts of correct predictions for each species, while the off-diagonal shows the counts of model misclassifications. As the metrics have indicated, our model performed magnificently on the test set, as there were only three misclassifications of Adelie penguins as Chinstrap.

## Concluding remarks

In this vignette, we introduced `tidymodels` and illustrated how its packages work together by way of example. Since this was an elementary example, use it as a starting point and explore what more can be done with this wonderful set of packages. And yet, however wonderful they are, you may have already noticed that there are limitations, like the glaring lack of a set of post-processing tools to refine the results. We fill this gap for epidemiological modelling with [frosting](https://cmu-delphi.github.io/epipredict/reference/add_frosting.html). This will be formally introduced in a later chapter, so stay tuned!
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧

## Attribution


This Chapter was largely adapted from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) as well as [Tidymodels - Getting Started](https://www.tidymodels.org/start/recipes/) and [Tidymodels](https://wec.wur.nl/dse/24-tidymodels.html). The diagrams are from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) and based on [R for Data Science](https://r4ds.had.co.nz/explore-intro.html).

🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧
28 changes: 20 additions & 8 deletions tidymodels-regression.qmd
@@ -15,7 +15,7 @@ library(broom)
library(performance)
```

## Simple linear regression
## Simple linear regression {#sec-slr-intro}

The key steps to perform linear regression in `tidymodels` are to first specify the model type and then to specify the model form and the data to be used to construct it.

@@ -174,7 +174,7 @@ Notice that on each plot it says what we should expect to see if the model assum

We shall now briefly walk you through what each plot means.

The first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. Since independent variables should be independent, this can throw off simple regression models (in standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line.
The first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. Since independent variables should be independent, this can throw off simple regression models (in the standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line.

For our example, we observe that there's a pretty high correlation between `body_mass_g` and `flipper_length_mm` (not quite in the red-zone of 10 and above, but close enough for concern). That is indicative of multicollinearity between them. Intuitively, it makes sense for the body mass and flipper length variables - we'd expect that as once increases, so should the other.

@@ -216,7 +216,7 @@ In general, the syntax to add an interaction term to a formula is as follows:
- `x:y` denotes an interaction term between `x` and `y`.
- `x*y` denotes the interaction between `x` and `y` as well as the main effects of `x` and `y`; that is, `x + y + x:y`.

It is important to note that this syntax is not compatible with all engines. Thus, we shall explain how to bypass this issue by adding an interaction term in a recipe later on. For now, let's start simple by adding an interaction term between `species` and `bill_length_mm`, which allows for a species-specific slope.
Let's start simple by adding an interaction term between `species` and `bill_length_mm`, which allows for a species-specific slope.

```{r}
lm_fit4 <- lm_spec %>% fit(
  bill_depth_mm ~ bill_length_mm * species,
  data = penguins
)
lm_fit4
```

It is important to note that this formula-based syntax is not compatible with all engines. Thus, we'll now go over how to bypass this problem by adding an interaction term in a recipe.

Using `recipes`, the interaction term is specified by using `step_interact()`. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a `parsnip` model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.

```{r}
rec_spec_interact <- recipe(
  bill_depth_mm ~ bill_length_mm + species,
  data = penguins
) %>%
  step_dummy(species) %>%
  step_interact(~ bill_length_mm:starts_with("species"))

lm_wf_interact <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_interact)

lm_wf_interact %>% fit(penguins)
```

To read more about formula syntax, see [?formula](https://rdrr.io/r/stats/formula.html).

## Non-linear transformations of the predictors

While in base R, you can add a non-linear transformation to a formula like so,

```{r}
lm_fit5 <- lm_spec %>%
  # let's square bill_depth_mm
  fit(bill_length_mm ~ bill_depth_mm + I(bill_depth_mm^2), data = penguins)
lm_fit5
```

you cannot do this in a formula in a recipe.

Note that, in general, if you are specifying a recipe, you should aim to keep as much of the pre-processing in your recipe specification as possible. This helps to ensure that the transformations will be applied to new data consistently.
Fortunately, we can deal with this in a similar way as we did when adding an interaction term to a recipe: we can add a non-linear transformation as a pre-processing step. The function used for this is `step_mutate()` (which acts like `dplyr`'s `mutate()`). Thus, we can perform the above square transformation in a recipe as follows:

```{r}
rec_spec_pow2 <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
  step_mutate(bill_depth_sq = bill_depth_mm^2)

lm_wf_pow2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2)

lm_wf_pow2 %>% fit(penguins)
```

There are many transformations already built into `recipes` such as `step_log()`. So, for basic transformations, there's often no need to make your own transformation from scratch. See [here](https://recipes.tidymodels.org/reference/#section-step-functions-individual-transformations) for a comprehensive list of the transformations that are offered in `recipes`.

```{r}
rec_spec_log <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
  step_log(bill_depth_mm)

lm_wf_log <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_log)

lm_wf_log %>% fit(penguins)
```

## Attribution

This Chapter was largely adapted from [Chapter 3 of ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html). Checking linear regression assumptions using the performance package is based on [this article](https://easystats.github.io/performance/reference/check_model.html) and [this blog post](https://www.r-bloggers.com/2021/07/easystats-quickly-investigate-model-performance/) on investigating model performance. The artwork used is by [Allison Horst](https://twitter.com/allison_horst).

🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧