| 1 | +--- |
| 2 | +title: "Reproducibility challenge" |
| 3 | +author: "John Blischak" |
| 4 | +date: "2020-06-11" |
| 5 | +output: workflowr::wflow_html |
| 6 | +editor_options: |
| 7 | + chunk_output_type: console |
| 8 | +--- |
| 9 | + |
| 10 | +## Introduction |
| 11 | + |
| 12 | +For the reproducibility challenge, you will attempt to re-run an analysis of |
| 13 | +Spotify song genres that was inspired by the blog post |
| 14 | +[Understanding + classifying genres using Spotify audio features][blog-post] |
| 15 | +by Kaylin Pavlik ([\@kaylinquest][kaylinquest]). |
| 16 | + |
| 17 | +[kaylinquest]: https://twitter.com/kaylinquest |
| 18 | +[blog-post]: https://www.kaylinpavlik.com/classifying-songs-genres/ |
| 19 | + |
| 20 | +<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">NEW: Understanding song genres using Spotify audio features and decision trees in <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a>. Basically: <br><br>rap: speechy 🗣️<br>rock: can’t dance to it 🤟<br>EDM: high tempo ⏩<br>R&B: long songs ⏱️<br>latin: very danceable 💃<br>pop: everything else.<a href="https://t.co/q57ZDdROf7">https://t.co/q57ZDdROf7</a> <a href="https://t.co/sfxRPKvpp2">pic.twitter.com/sfxRPKvpp2</a></p>— Kaylin Pavlik (@kaylinquest) <a href="https://twitter.com/kaylinquest/status/1213138536570015745?ref_src=twsrc%5Etfw">January 3, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> |
| 21 | + |
| 22 | +The code includes a minimal machine-learning analysis with the following |
| 23 | +steps: |
| 24 | + |
| 25 | +* Import the songs data |
| 26 | +* Split the songs data into training and test sets |
| 27 | +* Build a tree model to classify songs into genres based on the song characteristics |
| 28 | +* Assess accuracy of the model on the training and test sets |
| 29 | +* Compute the accuracy of a model based on random guessing |
| 30 | + |
| 31 | +## Getting started |
| 32 | + |
| 33 | +The analysis purposefully contains various issues that make it difficult to |
| 34 | +reproduce. Open the file `spotify.Rmd` by clicking on it in the RStudio Files |
| 35 | +pane. Click the Knit button to re-run the analysis and find the first issue. |
| 36 | + |
| 37 | +## File paths |
| 38 | + |
| 39 | +The first error you will encounter is below: |
| 40 | + |
| 41 | +``` |
| 42 | +Quitting from lines 16-21 (spotify.Rmd) |
| 43 | +Error in file(file, "rt") : cannot open the connection |
| 44 | +Calls: <Anonymous> ... withVisible -> eval -> eval -> read.csv -> read.table -> file |
| 45 | +Execution halted |
| 46 | +``` |
| 47 | + |
| 48 | +The function `read.csv()` is unable to open the data file. What's wrong with the |
| 49 | +path to the file? Apply what you know about absolute and relative paths to |
| 50 | +update the path and re-run the analysis. |
| 51 | + |
| 52 | +## Undefined variable |
| 53 | + |
| 54 | +The next error you encounter is: |
| 55 | + |
| 56 | +``` |
| 57 | +Quitting from lines 27-30 (spotify.Rmd) |
| 58 | +Error in sample.int(length(x), size, replace, prob) : |
| 59 | + object 'numTrainingSamples' not found |
| 60 | +Calls: <Anonymous> ... withVisible -> eval -> eval -> sample -> sample.int |
| 61 | +Execution halted |
| 62 | +``` |
| 63 | + |
| 64 | +It looks like the variable `numTrainingSamples` isn't defined in the Rmd file. |
| 65 | +This error often occurs when you create a variable interactively in the R |
| 66 | +console but forget to define it in the script. |
| 67 | + |
| 68 | +Based on the description above the code chunk, can you define the variable |
| 69 | +`numTrainingSamples`? Hint: You can obtain the number of samples with |
| 70 | +`nrow(spotify)`. |
| 71 | + |
| 72 | +## Missing package |
| 73 | + |
| 74 | +The next error you encounter is: |
| 75 | + |
| 76 | +``` |
| 77 | +Quitting from lines 36-39 (spotify.Rmd) |
| 78 | +Error in rpart(genre ~ ., data = spotifyTraining) : |
| 79 | + could not find function "rpart" |
| 80 | +Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval |
| 81 | +Execution halted |
| 82 | +``` |
| 83 | + |
| 84 | +The function `rpart()` can't be found. This can occur when you load a package |
| 85 | +in the current R session, but forget to put the call to `library()` in the |
| 86 | +script. |
| 87 | + |
| 88 | +Based on the text above the code chunk, can you figure out which package needs |
| 89 | +to be loaded? |
| 90 | + |
| 91 | +## Renamed variable |
| 92 | + |
| 93 | +The next error you encounter is: |
| 94 | + |
| 95 | +``` |
| 96 | +Quitting from lines 61-66 (spotify.Rmd) |
| 97 | +Error in mean(spotifyTesting[, 1] == predict_random) : |
| 98 | + object 'predict_random' not found |
| 99 | +Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> mean |
| 100 | + |
| 101 | +Execution halted |
| 102 | +``` |
| 103 | + |
| 104 | +R can't find the variable named `predict_random`. Look at the surrounding code: |
| 105 | +what do you think the name of this variable should be? |
| 106 | + |
| 107 | +Renaming variables during an analysis can lead to these subtle errors. Since |
| 108 | +both the original and updated versions of the variable are defined in the current |
| 109 | +R session, the code will continue to run. But when you or someone else tries to |
| 110 | +run the code in a clean R session, the code will unexpectedly fail. |
| 111 | + |
| 112 | +## Compare results |
| 113 | + |
| 114 | +Success! The analysis now runs. Compare your prediction results to those of your |
| 115 | +partners and/or re-run the analysis. Are the results always identical? |
| 116 | +Why not? What could you do if you wanted to publish these results and allow |
| 117 | +others to exactly reproduce your findings? |
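
One possible way to make a random training/test split repeatable in R is to fix the random number generator's seed before sampling. The sketch below is not taken from `spotify.Rmd`: the `songs` data frame, the 75% training fraction, and the seed value are all assumptions chosen only for illustration.

```r
# Hypothetical stand-in for the songs data; the real analysis reads a CSV file.
songs <- data.frame(id = 1:100)

# Assumed training fraction (75%), used here only for illustration.
numTrainingSamples <- round(nrow(songs) * 0.75)

# Draw the training rows twice, resetting the seed to the same value each time.
set.seed(1)
firstSplit <- sample(nrow(songs), numTrainingSamples)
set.seed(1)
secondSplit <- sample(nrow(songs), numTrainingSamples)

# Check whether the two draws selected the same rows in the same order.
identical(firstSplit, secondSplit)
```

Because the seed is reset before each call to `sample()`, both draws select the same rows and the final comparison returns `TRUE`. Without the calls to `set.seed()`, the two splits would almost certainly differ.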