Commit a2503d0

2025 upkeep (#1710)
* Use air
* Reformat with air
* Format on save with air
* Remove empty lines after code cell yaml
* Fix typo, closes #1665
* Use source editor
* Fix logical error, closes #1603
1 parent 3923e2a commit a2503d0

47 files changed: 125 additions and 457 deletions

.Rbuildignore

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@
 ^\.travis\.yml$
 ^\.github$
 ^CODE_OF_CONDUCT\.md$
+^[\.]?air\.toml$
+^\.vscode$

.vscode/extensions.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "recommendations": [
+    "Posit.air-vscode"
+  ]
+}

.vscode/settings.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "[r]": {
+    "editor.formatOnSave": true,
+    "editor.defaultFormatter": "Posit.air-vscode"
+  },
+  "editor.defaultFormatter": "Posit.air-vscode",
+  "quarto.visualEditor.markdownWrap": "sentence"
+}

EDA.qmd

Lines changed: 1 addition & 27 deletions
@@ -2,7 +2,6 @@

 ```{r}
 #| echo: false
-
 source("_common.R")
 ```

@@ -35,7 +34,6 @@ In this chapter we'll combine what you've learned about dplyr and ggplot2 to int
 ```{r}
 #| label: setup
 #| message: false
-
 library(tidyverse)
 ```

@@ -87,7 +85,6 @@ Since `carat` is a numerical variable, we can use a histogram:
 #| the bin centered at 0.5, approximately 15000 diamonds in the bin centered
 #| at 1, and much fewer, approximately 5000 diamonds in the bin centered at
 #| 1.5. Beyond this, there's a trailing tail.
-
 ggplot(diamonds, aes(x = carat)) +
   geom_histogram(binwidth = 0.5)
 ```
@@ -122,7 +119,6 @@ Let's take a look at the distribution of `carat` for smaller diamonds.
 #| (0.01), resulting in a very large number of skinny bars. The distribution
 #| is right skewed, with many peaks followed by bars in decreasing heights,
 #| until a sharp increase at the next peak.
-
 smaller <- diamonds |>
   filter(carat < 3)

@@ -164,7 +160,6 @@ The only evidence of outliers is the unusually wide limits on the x-axis.
 #| A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and
 #| the y-axis ranges from 0 to 12000. There is a peak around 5, and the
 #| data appear to be completely clustered around the peak.
-
 ggplot(diamonds, aes(x = y)) +
   geom_histogram(binwidth = 0.5)
 ```
@@ -179,7 +174,6 @@ To make it easy to see the unusual values, we need to zoom to small values of th
 #| appear to be completely clustered around the peak. Other than those data,
 #| there is one bin at 0 with a height of about 8, one a little over 30 with
 #| a height of 1 and another one a little below 60 with a height of 1.
-
 ggplot(diamonds, aes(x = y)) +
   geom_histogram(binwidth = 0.5) +
   coord_cartesian(ylim = c(0, 50))
@@ -193,7 +187,6 @@ We pluck them out with dplyr:

 ```{r}
 #| include: false
-
 old <- options(tibble.print_max = 10, tibble.print_min = 10)
 ```

@@ -207,7 +200,6 @@ unusual

 ```{r}
 #| include: false
-
 options(old)
 ```

@@ -248,7 +240,6 @@ If you've encountered unusual values in your dataset, and simply want to move on

 ```{r}
 #| eval: false
-
 diamonds2 <- diamonds |>
   filter(between(y, 3, 20))
 ```
@@ -274,7 +265,6 @@ It's not obvious where you should plot missing values, so ggplot2 doesn't includ
 #| linear association between the two variables. All but one of the diamonds
 #| has length greater than 3. The one outlier has a length of 0 and a width
 #| of about 6.5.
-
 ggplot(diamonds2, aes(x = x, y = y)) +
   geom_point()
 ```
@@ -283,7 +273,6 @@ To suppress that warning, set `na.rm = TRUE`:

 ```{r}
 #| eval: false
-
 ggplot(diamonds2, aes(x = x, y = y)) +
   geom_point(na.rm = TRUE)
 ```
@@ -301,7 +290,6 @@ You can do this by making a new variable, using `is.na()` to check if `dep_time`
 #| represent flights that are cancelled and not cancelled. The x-axis ranges
 #| from 0 to 25 minutes and the y-axis ranges from 0 to 10000. The number of
 #| flights not cancelled are much higher than those cancelled.
-
 nycflights13::flights |>
   mutate(
     cancelled = is.na(dep_time),
@@ -346,7 +334,6 @@ For example, let's explore how the price of a diamond varies with its quality (m
 #| 5000. The lines overlap a great deal, suggesting similar frequency
 #| distributions of prices of diamonds. One notable feature is that
 #| Ideal diamonds have the highest peak around 1500.
-
 ggplot(diamonds, aes(x = price)) +
   geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
 ```
@@ -367,7 +354,6 @@ Instead of displaying count, we'll display the **density**, which is the count s
 #| a great deal, suggesting similar density distributions of prices of
 #| diamonds. One notable feature is that all but Fair diamonds have high peaks
 #| around a price of 1500 and Fair diamonds have a higher mean than others.
-
 ggplot(diamonds, aes(x = price, y = after_stat(density))) +
   geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
 ```
@@ -386,7 +372,6 @@ A visually simpler plot for exploring this relationship is using side-by-side bo
 #| prices is right skewed for each cut (Fair, Good, Very Good, Premium, and
 #| Ideal). The medians are close to each other, with the median for Ideal
 #| diamonds lowest and that for Fair highest.
-
 ggplot(diamonds, aes(x = cut, y = price)) +
   geom_boxplot()
 ```
@@ -407,7 +392,6 @@ You might be interested to know how highway mileage varies across classes:
 #| Side-by-side boxplots of highway mileages of cars by class. Classes are
 #| on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact,
 #| and suv).
-
 ggplot(mpg, aes(x = class, y = hwy)) +
   geom_boxplot()
 ```
@@ -419,7 +403,6 @@ To make the trend easier to see, we can reorder `class` based on the median valu
 #| Side-by-side boxplots of highway mileages of cars by class. Classes are
 #| on the x-axis and ordered by increasing median highway mileage (pickup,
 #| suv, minivan, 2seater, subcompact, compact, and midsize).
-
 ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
   geom_boxplot()
 ```
@@ -431,7 +414,6 @@ You can do that by exchanging the x and y aesthetic mappings.
 #| fig-alt: |
 #| Side-by-side boxplots of highway mileages of cars by class. Classes are
 #| on the y-axis and ordered by increasing median highway mileage.
-
 ggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +
   geom_boxplot()
 ```
@@ -473,7 +455,6 @@ One way to do that is to rely on the built-in `geom_count()`:
 #| and color (D, E, F, G, G, I, and J). The sizes of the points represent
 #| the number of observations for that combination. The legend indicates
 #| that these sizes range between 1000 and 4000.
-
 ggplot(diamonds, aes(x = cut, y = color)) +
   geom_count()
 ```
@@ -497,7 +478,6 @@ Then visualize with `geom_tile()` and the fill aesthetic:
 #| observations in each tile. There are more Ideal diamonds than other cuts,
 #| with the highest number being Ideal diamonds with color G. Fair diamonds
 #| and diamonds with color I are the lowest in frequency.
-
 diamonds |>
   count(color, cut) |>
   ggplot(aes(x = color, y = cut)) +
@@ -530,7 +510,6 @@ The relationship is exponential.
 #| fig-alt: |
 #| A scatterplot of price vs. carat. The relationship is positive, somewhat
 #| strong, and exponential.
-
 ggplot(smaller, aes(x = carat, y = price)) +
   geom_point()
 ```
@@ -547,7 +526,6 @@ You've already seen one way to fix the problem: using the `alpha` aesthetic to a
 #| strong, and exponential. The points are transparent, showing clusters where
 #| the number of points is higher than other areas, The most obvious clusters
 #| are for diamonds with 1, 1.5, and 2 carats.
-
 ggplot(smaller, aes(x = carat, y = price)) +
   geom_point(alpha = 1 / 100)
 ```
@@ -569,7 +547,6 @@ You will need to install the hexbin package to use `geom_hex()`.
 #| Plot 1: A binned density plot of price vs. carat. Plot 2: A hexagonal bin
 #| plot of price vs. carat. Both plots show that the highest density of
 #| diamonds have low carats and low prices.
-
 ggplot(smaller, aes(x = carat, y = price)) +
   geom_bin2d()

@@ -591,7 +568,6 @@ For example, you could bin `carat` and then for each group, display a boxplot:
 #| roughly symmetric price distributions, and diamonds that weigh more have
 #| left skewed distributions. Cheaper, smaller diamonds have outliers on the
 #| higher end, more expensive, bigger diamonds have outliers on the lower end.
-
 ggplot(smaller, aes(x = carat, y = price)) +
   geom_boxplot(aes(group = cut_width(carat, 0.1)))
 ```
@@ -672,7 +648,6 @@ Then, we exponentiate the residuals to put them back in the scale of raw prices.
 #| to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered
 #| around low values of carat and residuals. There is a clear, curved pattern
 #| showing decrease in residuals as carat increases.
-
 library(tidymodels)

 diamonds <- diamonds |>
@@ -699,7 +674,6 @@ Once you've removed the strong relationship between carat and price, you can see
 #| cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are
 #| quite similar, between roughly 0.75 to 1.25. Each of the distributions of
 #| residuals is right skewed, with many outliers on the higher end.
-
 ggplot(diamonds_aug, aes(x = cut, y = .resid)) +
   geom_boxplot()
 ```
@@ -712,4 +686,4 @@ In this chapter you've learned a variety of tools to help you understand the var
 You've seen techniques that work with a single variable at a time and with a pair of variables.
 This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they're the foundation upon which all other techniques are built.

-In the next chapter, we'll focus on the tools we can use to communicate our results.
+In the next chapter, we'll focus on the tools we can use to communicate our results.

_common.R

Lines changed: 11 additions & 5 deletions
@@ -6,7 +6,7 @@ knitr::opts_chunk$set(
   # cache = TRUE,
   fig.retina = 2,
   fig.width = 6,
-  fig.asp = 2/3,
+  fig.asp = 2 / 3,
   fig.show = "hold"
 )

@@ -27,15 +27,17 @@ ggplot2::theme_set(ggplot2::theme_gray(12))

 # use results: "asis" when setting a status for a chapter
 status <- function(type) {
-  status <- switch(type,
+  status <- switch(
+    type,
     polishing = "should be readable but is currently undergoing final polishing",
     restructuring = "is undergoing heavy restructuring and may be confusing or incomplete",
     drafting = "is currently a dumping ground for ideas, and we don't recommend reading it",
     complete = "is largely complete and just needs final proof reading",
     stop("Invalid `type`", call. = FALSE)
   )

-  class <- switch(type,
+  class <- switch(
+    type,
     polishing = "note",
     restructuring = "important",
     drafting = "important",
@@ -45,9 +47,13 @@ status <- function(type) {
   cat(paste0(
     "\n",
     ":::: status\n",
-    "::: callout-", class, " \n",
+    "::: callout-",
+    class,
+    " \n",
     "You are reading the work-in-progress second edition of R for Data Science. ",
-    "This chapter ", status, ". ",
+    "This chapter ",
+    status,
+    ". ",
     "You can find the complete first edition at <https://r4ds.had.co.nz>.\n",
     ":::\n",
     "::::\n"

_quarto.yml

Lines changed: 1 addition & 2 deletions
@@ -82,5 +82,4 @@ format:
 include-in-header: "plausible.html"
 callout-appearance: simple

-editor: visual
-
+editor: source

air.toml

Whitespace-only changes.

base-R.qmd

Lines changed: 1 addition & 12 deletions
@@ -2,7 +2,6 @@

 ```{r}
 #| echo: false
-
 source("_common.R")
 ```

@@ -30,7 +29,6 @@ This package focuses on base R so doesn't have any real prerequisites, but we'll
 ```{r}
 #| label: setup
 #| message: false
-
 library(tidyverse)
 ```

@@ -152,7 +150,6 @@ Several dplyr verbs are special cases of `[`:

 ```{r}
 #| results: false
-
 df <- tibble(
   x = c(2, 3, 1, 1, NA),
   y = letters[1:5],
@@ -170,7 +167,6 @@ Several dplyr verbs are special cases of `[`:

 ```{r}
 #| results: false
-
 df |> arrange(x, y)

 # same as
@@ -183,7 +179,6 @@ Several dplyr verbs are special cases of `[`:

 ```{r}
 #| results: false
-
 df |> select(x, z)

 # same as
@@ -202,7 +197,6 @@ df |>

 ```{r}
 #| results: false
-
 # same as
 df |> subset(x > 1, c(y, z))
 ```
@@ -353,7 +347,6 @@ If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shake
 #| the pepper shaker containing pepper, it contains a single packet of pepper.
 #| In the middle is a photo of a single packet of pepper. On the right is a
 #| photo of the contents of a packet of pepper.
-
 knitr::include_graphics("diagrams/pepper.png")
 ```

@@ -434,7 +427,6 @@ The basic structure of a `for` loop looks like this:

 ```{r}
 #| eval: false
-
 for (element in vector) {
   # do something with element
 }
@@ -445,15 +437,13 @@ For example, in @sec-save-database instead of using `walk()`:

 ```{r}
 #| eval: false
-
 paths |> walk(append_file)
 ```

 We could have used a `for` loop:

 ```{r}
 #| eval: false
-
 for (path in paths) {
   append_file(path)
 }
@@ -525,7 +515,6 @@ Here's a quick example from the diamonds dataset:
 #| that fans out as both price and carat increases. The scatter plot
 #| shows very few diamonds bigger than 3 carats compared to diamonds between
 #| 0 to 3 carats.
-
 # Left
 hist(diamonds$carat)

@@ -543,4 +532,4 @@ This often makes life easier for programming and so becomes more important as yo

 This chapter concludes the programming section of the book.
 You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
-We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.
+We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.
