
Commit 2788f7c

a number of minor tidy ups (#624)
* Fix broken links, tidy shortcodes
* Minor text fixes
* Text fixes
* Fix the fixes
* Bump versions, remove explicit Zygote dep
* remove Zygote, add DynamicPPL.DebugUtils in performance tips
* remove config=nothing from Mooncake
1 parent 78eb54b commit 2788f7c

13 files changed: +232 / -243 lines


Manifest.toml

Lines changed: 132 additions & 129 deletions
Large diffs are not rendered by default.

Project.toml

Lines changed: 0 additions & 1 deletion
@@ -54,7 +54,6 @@ StatsFuns = "4c63d2b9-4356-54db-8cca-17b64c39e42c"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
Turing = "fce5fe82-541a-59a6-adf8-730c64b5f9a0"
UnPack = "3a884ed6-31ef-47d7-9d2a-63182c4928ed"
-Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"

[compat]
Turing = "0.39"

_quarto.yml

Lines changed: 18 additions & 18 deletions
@@ -169,23 +169,23 @@ include-in-header:
# Note that you don't need to prepend `../../` to the link, Quarto will figure
# it out automatically.

-get-started: tutorials/docs-00-getting-started
-tutorials-intro: tutorials/00-introduction
-gaussian-mixture-model: tutorials/01-gaussian-mixture-model
-logistic-regression: tutorials/02-logistic-regression
-bayesian-neural-network: tutorials/03-bayesian-neural-network
-hidden-markov-model: tutorials/04-hidden-markov-model
-linear-regression: tutorials/05-linear-regression
-infinite-mixture-model: tutorials/06-infinite-mixture-model
-poisson-regression: tutorials/07-poisson-regression
-multinomial-logistic-regression: tutorials/08-multinomial-logistic-regression
-variational-inference: tutorials/09-variational-inference
-bayesian-differential-equations: tutorials/10-bayesian-differential-equations
-probabilistic-pca: tutorials/11-probabilistic-pca
-gplvm: tutorials/12-gplvm
-seasonal-time-series: tutorials/13-seasonal-time-series
-using-turing-advanced: tutorials/docs-09-using-turing-advanced
-using-turing: tutorials/docs-12-using-turing-guide
+core-functionality: core-functionality
+get-started: getting-started
+
+tutorials-intro: tutorials/coin-flipping
+gaussian-mixture-model: tutorials/gaussian-mixture-models
+logistic-regression: tutorials/bayesian-logistic-regression
+bayesian-neural-network: tutorials/bayesian-neural-networks
+hidden-markov-model: tutorials/hidden-markov-models
+linear-regression: tutorials/bayesian-linear-regression
+infinite-mixture-model: tutorials/infinite-mixture-models
+poisson-regression: tutorials/bayesian-poisson-regression
+multinomial-logistic-regression: tutorials/multinomial-logistic-regression
+variational-inference: tutorials/variational-inference
+bayesian-differential-equations: tutorials/bayesian-differential-equations
+probabilistic-pca: tutorials/probabilistic-pca
+gplvm: tutorials/gaussian-process-latent-variable-models
+seasonal-time-series: tutorials/bayesian-time-series-analysis

usage-automatic-differentiation: usage/automatic-differentiation
usage-custom-distribution: usage/custom-distribution
@@ -204,7 +204,7 @@ dev-model-manual: developers/compiler/model-manual
contexts: developers/compiler/minituring-contexts
minituring: developers/compiler/minituring-compiler
using-turing-compiler: developers/compiler/design-overview
-using-turing-variational-inference: developers/inference/variational-inference
+dev-variational-inference: developers/inference/variational-inference
using-turing-implementing-samplers: developers/inference/implementing-samplers
dev-transforms-distributions: developers/transforms/distributions
dev-transforms-bijectors: developers/transforms/bijectors

getting-started/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -92,5 +92,5 @@ The underlying theory of Bayesian machine learning is not explained in detail in
A thorough introduction to the field is [*Pattern Recognition and Machine Learning*](https://www.springer.com/us/book/9780387310732) (Bishop, 2006); an online version is available [here (PDF, 18.1 MB)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf).
:::

-The next page on [Turing's core functionality]({{<meta using-turing>}}) explains the basic features of the Turing language.
+The next page on [Turing's core functionality]({{<meta core-functionality>}}) explains the basic features of the Turing language.
From there, you can either look at [worked examples of how different models are implemented in Turing]({{<meta tutorials-intro>}}), or [specific tips and tricks that can help you get the most out of Turing]({{<meta usage-performance-tips>}}).

tutorials/bayesian-differential-equations/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -344,7 +344,7 @@ import Mooncake
import SciMLSensitivity

# Define the AD backend to use
-adtype = AutoMooncake(; config=nothing)
+adtype = AutoMooncake()

# Sample a single chain with 1000 samples using Mooncake
sample(model, NUTS(; adtype=adtype), 1000; progress=false)

tutorials/bayesian-neural-networks/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -210,7 +210,7 @@ setprogress!(false)
```{julia}
# Perform inference.
n_iters = 2_000
-ch = sample(bayes_nn(reduce(hcat, xs), ts), NUTS(; adtype=AutoMooncake(; config=nothing)), n_iters);
+ch = sample(bayes_nn(reduce(hcat, xs), ts), NUTS(; adtype=AutoMooncake()), n_iters);
```

Now we extract the parameter samples from the sampled chain as `θ` (this is of size `5000 x 20` where `5000` is the number of iterations and `20` is the number of parameters).

tutorials/gaussian-mixture-models/index.qmd

Lines changed: 45 additions & 47 deletions
@@ -74,9 +74,9 @@ and then drawing the datum accordingly, i.e., in our example drawing
$$
x_i \sim \mathcal{N}([\mu_{z_i}, \mu_{z_i}]^\mathsf{T}, I) \qquad (i=1,\ldots,N).
$$
-For more details on Gaussian mixture models, we refer to Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Section 9.
+For more details on Gaussian mixture models, refer to Chapter 9 of Christopher M. Bishop, *Pattern Recognition and Machine Learning*.

-We specify the model with Turing.
+We specify the model in Turing:

```{julia}
using Turing
@@ -130,10 +130,11 @@ burn = 10
chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn);
```

-::: {.callout-warning collapse="true"}
+::: {.callout-warning}
## Sampling With Multiple Threads
-The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains
-will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.](https://turinglang.org/dev/docs/using-turing/guide/#sampling-multiple-chains)
+The `sample()` call above assumes that you have at least two threads available in your Julia instance.
+If you do not, the multiple chains will run sequentially, and you may notice a warning.
+For more information, see [the Turing documentation on sampling multiple chains.]({{<meta core-functionality>}}#sampling-multiple-chains)
:::

```{julia}
@@ -159,12 +160,14 @@ We consider the samples of the location parameters $\mu_1$ and $\mu_2$ for the t
plot(chains[["μ[1]", "μ[2]"]]; legend=true)
```

-It can happen that the modes of $\mu_1$ and $\mu_2$ switch between chains.
-For more information see the [Stan documentation](https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html). This is because it's possible for either model parameter $\mu_k$ to be assigned to either of the corresponding true means, and this assignment need not be consistent between chains.
+From the plots above, we can see that the chains have converged to seemingly different values for the parameters $\mu_1$ and $\mu_2$.
+However, these actually represent the same solution: it does not matter whether we assign $\mu_1$ to the first cluster and $\mu_2$ to the second, or vice versa, since the resulting sum is the same.
+(In principle it is also possible for the parameters to swap places _within_ a single chain, although this does not happen in this example.)
+For more information see the [Stan documentation](https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html), or Bishop's book, where the concept of _identifiability_ is discussed.

-That is, the posterior is fundamentally multimodal, and different chains can end up in different modes, complicating inference.
-One solution here is to enforce an ordering on our $\mu$ vector, requiring $\mu_k > \mu_{k-1}$ for all $k$.
-`Bijectors.jl` [provides](https://turinglang.org/Bijectors.jl/dev/transforms/#Bijectors.OrderedBijector) an easy transformation (`ordered()`) for this purpose:
+Having $\mu_1$ and $\mu_2$ swap can complicate the interpretation of the results, especially when different chains converge to different assignments.
+One solution here is to enforce an ordering on our $\mu$ vector, requiring $\mu_k \geq \mu_{k-1}$ for all $k$.
+`Bijectors.jl` [provides](https://turinglang.org/Bijectors.jl/stable/transforms/#Bijectors.OrderedBijector) a convenient function, `ordered()`, which can be applied to a (continuous multivariate) distribution to enforce this:

```{julia}
using Bijectors: ordered
@@ -194,15 +197,13 @@ end
model = gaussian_mixture_model_ordered(x);
```

-
-Now, re-running our model, we can see that the assigned means are consistent across chains:
+Now, re-running our model, we can see that the assigned means are consistent between chains:

```{julia}
#| output: false
chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = burn);
```

-
```{julia}
#| echo: false
let
@@ -243,6 +244,7 @@ scatter!(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset")
```

## Inferred Assignments
+
Finally, we can inspect the assignments of the data points inferred using Turing.
As we can see, the dataset is partitioned into two distinct groups.

@@ -259,23 +261,23 @@ scatter(


## Marginalizing Out The Assignments
-We can write out the marginal posterior of (continuous) $w, \mu$ by summing out the influence of our (discrete) assignments $z_i$ from
-our likelihood:
-$$
-p(y \mid w, \mu ) = \sum_{k=1}^K w_k p_k(y \mid \mu_k)
-$$
+
+We can write out the marginal posterior of (continuous) $w, \mu$ by summing out the influence of our (discrete) assignments $z_i$ from our likelihood:
+
+$$p(y \mid w, \mu ) = \sum_{k=1}^K w_k p_k(y \mid \mu_k)$$
+
In our case, this gives us:
-$$
-p(y \mid w, \mu) = \sum_{k=1}^K w_k \cdot \operatorname{MvNormal}(y \mid \mu_k, I)
-$$
+
+$$p(y \mid w, \mu) = \sum_{k=1}^K w_k \cdot \operatorname{MvNormal}(y \mid \mu_k, I)$$


### Marginalizing By Hand
-We could implement the above version of the Gaussian mixture model in Turing as follows:
+
+We could implement the above version of the Gaussian mixture model in Turing as follows.
+
First, Turing uses log-probabilities, so the likelihood above must be converted into log-space:
-$$
-\log \left( p(y \mid w, \mu) \right) = \text{logsumexp} \left[\log (w_k) + \log(\operatorname{MvNormal}(y \mid \mu_k, I)) \right]
-$$
+
+$$\log \left( p(y \mid w, \mu) \right) = \text{logsumexp} \left[\log (w_k) + \log(\operatorname{MvNormal}(y \mid \mu_k, I)) \right]$$

Where we sum the components with `logsumexp` from the [`LogExpFunctions.jl` package](https://juliastats.org/LogExpFunctions.jl/stable/).
The manually incremented likelihood can be added to the log-probability with `@addlogprob!`, giving us the following model:
@@ -300,27 +302,25 @@ using LogExpFunctions
end
```

-::: {.callout-warning collapse="false"}
+::: {.callout-warning}
## Manually Incrementing Probablity

-When possible, use of `@addlogprob!` should be avoided, as it exists outside the
-usual structure of a Turing model. In most cases, a custom distribution should be used instead.
+When possible, use of `@addlogprob!` should be avoided, as it exists outside the usual structure of a Turing model.
+In most cases, a custom distribution should be used instead.

-Here, the next section demonstrates the preferred method --- using the `MixtureModel` distribution we have seen already to
-perform the marginalization automatically.
+The next section demonstrates the preferred method: using the `MixtureModel` distribution we have seen already to perform the marginalization automatically.
:::

+### Marginalizing For Free With Distribution.jl's `MixtureModel` Implementation

-### Marginalizing For Free With Distribution.jl's MixtureModel Implementation
-
-We can use Turing's `~` syntax with anything that `Distributions.jl` provides `logpdf` and `rand` methods for. It turns out that the
-`MixtureModel` distribution it provides has, as its `logpdf` method, `logpdf(MixtureModel([Component_Distributions], weight_vector), Y)`, where `Y` can be either a single observation or vector of observations.
+We can use Turing's `~` syntax with anything that `Distributions.jl` provides `logpdf` and `rand` methods for.
+It turns out that the `MixtureModel` distribution it provides has, as its `logpdf` method, `logpdf(MixtureModel([Component_Distributions], weight_vector), Y)`, where `Y` can be either a single observation or vector of observations.

In fact, `Distributions.jl` provides [many convenient constructors](https://juliastats.org/Distributions.jl/stable/mixture/) for mixture models, allowing further simplification in common special cases.

For example, when mixtures distributions are of the same type, one can write: `~ MixtureModel(Normal, [(μ1, σ1), (μ2, σ2)], w)`, or when the weight vector is known to allocate probability equally, it can be ommited.

-The `logpdf` implementation for a `MixtureModel` distribution is exactly the marginalization defined above, and so our model becomes simply:
+The `logpdf` implementation for a `MixtureModel` distribution is exactly the marginalization defined above, and so our model can be simplified to:

```{julia}
#| output: false
@@ -334,15 +334,14 @@ end
model = gmm_marginalized(x);
```

-As we've summed out the discrete components, we can perform inference using `NUTS()` alone.
+As we have summed out the discrete components, we can perform inference using `NUTS()` alone.

```{julia}
#| output: false
sampler = NUTS()
chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial = burn);
```

-
```{julia}
#| echo: false
let
@@ -356,23 +355,22 @@ let
end
```

-`NUTS()` significantly outperforms our compositional Gibbs sampler, in large part because our model is now Rao-Blackwellized thanks to
-the marginalization of our assignment parameter.
+`NUTS()` significantly outperforms our compositional Gibbs sampler, in large part because our model is now Rao-Blackwellized thanks to the marginalization of our assignment parameter.

```{julia}
plot(chains[["μ[1]", "μ[2]"]], legend=true)
```

-## Inferred Assignments - Marginalized Model
-As we've summed over possible assignments, the associated parameter is no longer available in our chain.
-This is not a problem, however, as given any fixed sample $(\mu, w)$, the assignment probability — $p(z_i \mid y_i)$ — can be recovered using Bayes rule:
-$$
-p(z_i \mid y_i) = \frac{p(y_i \mid z_i) p(z_i)}{\sum_{k = 1}^K \left(p(y_i \mid z_i) p(z_i) \right)}
-$$
+## Inferred Assignments With The Marginalized Model
+
+As we have summed over possible assignments, the latent parameter representing the assignments is no longer available in our chain.
+This is not a problem, however, as given any fixed sample $(\mu, w)$, the assignment probability $p(z_i \mid y_i)$ can be recovered using Bayes's theorem:

-This quantity can be computed for every $p(z = z_i \mid y_i)$, resulting in a probability vector, which is then used to sample
-posterior predictive assignments from a categorial distribution.
+$$p(z_i \mid y_i) = \frac{p(y_i \mid z_i) p(z_i)}{\sum_{k = 1}^K \left(p(y_i \mid z_i) p(z_i) \right)}$$
+
+This quantity can be computed for every $p(z = z_i \mid y_i)$, resulting in a probability vector, which is then used to sample posterior predictive assignments from a categorical distribution.
For details on the mathematics here, see [the Stan documentation on latent discrete parameters](https://mc-stan.org/docs/stan-users-guide/latent-discrete.html).
+
```{julia}
#| output: false
function sample_class(xi, dists, w)
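
The prose in the hunks above leans on code that falls outside the changed lines: the `ordered()` prior, the `MixtureModel`-based marginalization, and the `sample_class` helper that the diff cuts off. Below is a minimal sketch (not the tutorial's exact code) of how those pieces fit together, assuming 2-dimensional observations stored column-wise in a matrix `x`; the names `gmm_marginalized_sketch` and `sample_class_sketch` are illustrative only.

```julia
# Sketch only: illustrative names, not the tutorial's exact code.
using Turing
using Bijectors: ordered
using LogExpFunctions: softmax
using LinearAlgebra: I

@model function gmm_marginalized_sketch(x, K=2)
    D, N = size(x)
    # Ordered prior on the component means breaks the label-switching symmetry
    # discussed above (μ[1] ≤ μ[2] ≤ …).
    μ ~ ordered(MvNormal(zeros(K), I))
    w ~ Dirichlet(K, 1.0)
    # MixtureModel's logpdf sums over components, i.e. it marginalizes out the
    # discrete assignments z_i, so NUTS can be used on its own.
    mix = MixtureModel([MvNormal(fill(μ[k], D), I) for k in 1:K], w)
    for i in 1:N
        x[:, i] ~ mix
    end
end

# Posterior-predictive assignment for one observation xi, via Bayes's theorem:
# p(z = k | xi) ∝ w[k] * pdf(dists[k], xi), computed in log space then normalized.
function sample_class_sketch(xi, dists, w)
    logp = [log(w[k]) + logpdf(dists[k], xi) for k in eachindex(dists)]
    return rand(Categorical(softmax(logp)))
end
```

A model like this can be passed straight to `sample(gmm_marginalized_sketch(x), NUTS(), 1_000)`, which is what the hunks above rely on, and `sample_class_sketch` mirrors the role of the truncated `sample_class` function the diff ends on.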

tutorials/multinomial-logistic-regression/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ chain
::: {.callout-warning collapse="true"}
## Sampling With Multiple Threads
The `sample()` call above assumes that you have at least `nchains` threads available in your Julia instance. If you do not, the multiple chains
-will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.]({{<meta using-turing>}}#sampling-multiple-chains)
+will run sequentially, and you may notice a warning. For more information, see [the Turing documentation on sampling multiple chains.]({{<meta core-functionality>}}#sampling-multiple-chains)
:::

Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points.
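
A practical aside on the callout above (not part of the commit): `MCMCThreads()` can only spread chains over the threads the Julia session was started with, so a quick check before sampling might look like this.

```julia
# Number of threads available to MCMCThreads(); start Julia with e.g.
# `julia --threads 4` (or set JULIA_NUM_THREADS) to enable parallel chains.
Threads.nthreads()
```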

tutorials/probabilistic-pca/index.qmd

Lines changed: 2 additions & 2 deletions
@@ -206,7 +206,7 @@ setprogress!(false)
```{julia}
k = 2 # k is the dimension of the projected space, i.e. the number of principal components/axes of choice
ppca = pPCA(mat_exp', k) # instantiate the probabilistic model
-chain_ppca = sample(ppca, NUTS(; adtype=AutoMooncake(; config=nothing)), 500);
+chain_ppca = sample(ppca, NUTS(; adtype=AutoMooncake()), 500);
```

The samples are saved in `chain_ppca`, which is an `MCMCChains.Chains` object.
@@ -320,7 +320,7 @@ We instantiate the model and ask Turing to sample from it using NUTS sampler. Th

```{julia}
ppca_ARD = pPCA_ARD(mat_exp') # instantiate the probabilistic model
-chain_ppcaARD = sample(ppca_ARD, NUTS(; adtype=AutoMooncake(; config=nothing)), 500) # sampling
+chain_ppcaARD = sample(ppca_ARD, NUTS(; adtype=AutoMooncake()), 500) # sampling
plot(group(chain_ppcaARD, :α); margin=6.0mm)
```

tutorials/variational-inference/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Pkg.instantiate();

This post will look at **variational inference (VI)**, an optimization approach to _approximate_ Bayesian inference, and how to use it in Turing.jl as an alternative to other approaches such as MCMC.
This post will focus on the usage of VI in Turing rather than the principles and theory underlying VI.
-If you are interested in understanding the mathematics you can checkout [our write-up]({{<meta using-turing-variational-inference>}}) or any other resource online (there are a lot of great ones).
+If you are interested in understanding the mathematics you can checkout [our write-up]({{<meta dev-variational-inference>}}) or any other resource online (there are a lot of great ones).

Let's start with a minimal example.
Consider a `Turing.Model`, which we denote as `model`.

usage/automatic-differentiation/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ import Mooncake
# Rest of your model here
end

-sample(f(), HMC(0.1, 5; adtype=AutoMooncake(; config=nothing)), 100)
+sample(f(), HMC(0.1, 5; adtype=AutoMooncake()), 100)
```

By default, if you do not specify a backend, Turing will default to [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl).
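
Because the hunk above elides the model body ("# Rest of your model here"), here is a self-contained sketch of the updated pattern. The toy model `f` is hypothetical, and the snippet assumes `AutoMooncake` is re-exported by Turing as in the code being edited.

```julia
using Turing
import Mooncake

@model function f()
    # Hypothetical toy model, included only to make the snippet runnable.
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    1.5 ~ Normal(m, sqrt(s))
end

# Zero-argument constructor, as used throughout this commit; it replaces the
# older `AutoMooncake(; config=nothing)` spelling.
sample(f(), HMC(0.1, 5; adtype=AutoMooncake()), 100)
```

The same `adtype` keyword appears with `NUTS` and with the mode-estimation calls changed elsewhere in this commit.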

usage/mode-estimation/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ We can also help the optimisation by giving it a starting point we know is close
import Mooncake

maximum_likelihood(
-    model, NelderMead(); initial_params=[0.1, 2], adtype=AutoMooncake(; config=nothing)
+    model, NelderMead(); initial_params=[0.1, 2], adtype=AutoMooncake()
)
```
