Commit 78eb54b

Replace some ReverseDiffs with Mooncake (#623)
* Use Mooncake for Optimization.jl example
* Use Mooncake in pPCA
* Update a link to AD docs page
* Fix bad shortcodes
1 parent b0afb98 commit 78eb54b

3 files changed: +20 additions, -15 deletions

tutorials/gaussian-processes-introduction/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ please let us know!
 Moving on, we generate samples from the posterior using the default `NUTS` sampler.
 We'll make use of [ReverseDiff.jl](https://github.com/JuliaDiff/ReverseDiff.jl), as it has
 better performance than [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl/) on
-this example. See Turing.jl's docs on Automatic Differentiation for more info.
+this example. See the [automatic differentiation docs]({{< meta usage-automatic-differentiation >}}) for more info.


 ```{julia}
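
The hunk above only updates the link; the sampling code that follows it in the tutorial is unchanged. For orientation, here is a minimal, self-contained sketch of how an AD backend is passed to `NUTS` in Turing, as discussed in the prose. The `toy` model and its data are stand-ins, not the GP model from the tutorial (which does not appear in this diff).

```julia
using Turing
using ReverseDiff  # loading the backend package enables AutoReverseDiff()

# Stand-in model; the tutorial's GP model is defined earlier in the file
# and is not part of this diff.
@model function toy(y)
    m ~ Normal(0, 1)
    for i in eachindex(y)
        y[i] ~ Normal(m, 1)
    end
end

# The adtype keyword selects the AD backend NUTS uses for gradients.
chain = sample(toy([0.5, 1.2]), NUTS(; adtype=AutoReverseDiff()), 500)
```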

tutorials/probabilistic-pca/index.qmd

Lines changed: 15 additions & 10 deletions
@@ -90,7 +90,7 @@ First, we load the dependencies used.

 ```{julia}
 using Turing
-using ReverseDiff
+using Mooncake
 using LinearAlgebra, FillArrays

 # Packages for visualization
@@ -108,7 +108,7 @@ You can install them via `using Pkg; Pkg.add("package_name")`.
 ::: {.callout-caution}
 ## Package usages:
 We use `DataFrames` for instantiating matrices, `LinearAlgebra` and `FillArrays` to perform matrix operations;
-`Turing` for model specification and MCMC sampling, `ReverseDiff` for setting the automatic differentiation backend when sampling.
+`Turing` for model specification and MCMC sampling, `Mooncake` for automatic differentiation when sampling.
 `StatsPlots` for visualising the resutls. `, Measures` for setting plot margin units.
 As all examples involve sampling, for reproducibility we set a fixed seed using the `Random` standard library.
 :::
@@ -194,8 +194,9 @@ Specifically:

 Here we aim to perform MCMC sampling to infer the projection matrix $\mathbf{W}_{D \times k}$, the latent variable matrix $\mathbf{Z}_{k \times N}$, and the offsets $\boldsymbol{\mu}_{N \times 1}$.

-We run the inference using the NUTS sampler, of which the chain length is set to be 500, target accept ratio 0.65 and initial stepsize 0.1. By default, the NUTS sampler samples 1 chain.
-You are free to try [different samplers](https://turinglang.org/stable/docs/library/#samplers).
+We run the inference using the NUTS sampler.
+By default, `sample` samples a single chain (in this case with 500 samples).
+You can also use [different samplers]({{< meta usage-sampler-visualisation >}}) if you wish.

 ```{julia}
 #| output: false
@@ -205,17 +206,21 @@ setprogress!(false)
 ```{julia}
 k = 2 # k is the dimension of the projected space, i.e. the number of principal components/axes of choice
 ppca = pPCA(mat_exp', k) # instantiate the probabilistic model
-chain_ppca = sample(ppca, NUTS(;adtype=AutoReverseDiff()), 500);
+chain_ppca = sample(ppca, NUTS(; adtype=AutoMooncake(; config=nothing)), 500);
 ```

-The samples are saved in the Chains struct `chain_ppca`, whose shape can be checked:
+The samples are saved in `chain_ppca`, which is an `MCMCChains.Chains` object.
+We can check its shape:

 ```{julia}
 size(chain_ppca) # (no. of iterations, no. of vars, no. of chains) = (500, 159, 1)
 ```

-The Chains struct `chain_ppca` also contains the sampling info such as r-hat, ess, mean estimates, etc.
-You can print it to check these quantities.
+Sampling statistics such as R-hat, ESS, mean estimates, and so on can also be obtained from this:
+
+```{julia}
+describe(chain_ppca)
+```

 #### Step 5: posterior predictive checks

@@ -280,7 +285,7 @@ Another way to put it: 2 dimensions is enough to capture the main structure of t
 A direct question arises from above practice is: how many principal components do we want to keep, in order to sufficiently represent the latent structure in the data?
 This is a very central question for all latent factor models, i.e. how many dimensions are needed to represent that data in the latent space.
 In the case of PCA, there exist a lot of heuristics to make that choice.
-For example, We can tune the number of principal components using empirical methods such as cross-validation based some criteria such as MSE between the posterior predicted (e.g. mean predictions) data matrix and the original data matrix or the percentage of variation explained [^3].
+For example, We can tune the number of principal components using empirical methods such as cross-validation based on some criteria such as MSE between the posterior predicted (e.g. mean predictions) data matrix and the original data matrix or the percentage of variation explained [^3].

 For p-PCA, this can be done in an elegant and principled way, using a technique called *Automatic Relevance Determination* (ARD).
 ARD can help pick the correct number of principal directions by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [^4].
@@ -315,7 +320,7 @@ We instantiate the model and ask Turing to sample from it using NUTS sampler. Th

 ```{julia}
 ppca_ARD = pPCA_ARD(mat_exp') # instantiate the probabilistic model
-chain_ppcaARD = sample(ppca_ARD, NUTS(;adtype=AutoReverseDiff()), 500) # sampling
+chain_ppcaARD = sample(ppca_ARD, NUTS(; adtype=AutoMooncake(; config=nothing)), 500) # sampling
 plot(group(chain_ppcaARD, :α); margin=6.0mm)
 ```
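
The pPCA hunks above rely on standard `MCMCChains` tooling: `describe` for summary statistics, `size` for the chain's shape, and `group` plus the StatsPlots recipe for plotting a family of parameters such as `α`. A hedged, self-contained sketch of those calls on a toy chain (the random draws and parameter names below are made up; the real chains come from the `sample` calls in the hunks):

```julia
using MCMCChains, StatsPlots

# Fake draws standing in for real sampler output:
# 500 iterations x 3 parameters x 1 chain.
draws = randn(500, 3, 1)
chn = Chains(draws, [Symbol("α[1]"), Symbol("α[2]"), Symbol("α[3]")])

describe(chn)   # summary statistics (mean, std, ESS, R-hat, ...) and quantiles
size(chn)       # (iterations, parameters, chains)

α_only = group(chn, :α)   # select every parameter named α[...]
plot(α_only)              # trace and density plots via the StatsPlots recipe
```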

usage/mode-estimation/index.qmd

Lines changed: 4 additions & 4 deletions
@@ -71,14 +71,14 @@ The above are just two examples, Optimization.jl supports [many more](https://do
 We can also help the optimisation by giving it a starting point we know is close to the final solution, or by specifying an automatic differentiation method

 ```{julia}
-using ADTypes: AutoReverseDiff
-import ReverseDiff
+import Mooncake
+
 maximum_likelihood(
-    model, NelderMead(); initial_params=[0.1, 2], adtype=AutoReverseDiff()
+    model, NelderMead(); initial_params=[0.1, 2], adtype=AutoMooncake(; config=nothing)
 )
 ```

-When providing values to arguments like `initial_params` the parameters are typically specified in the order in which they appear in the code of the model, so in this case first `` then `m`. More precisely it's the order returned by `Turing.Inference.getparams(model, Turing.VarInfo(model))`.
+When providing values to arguments like `initial_params` the parameters are typically specified in the order in which they appear in the code of the model, so in this case first `` then `m`. More precisely it's the order returned by `Turing.Inference.getparams(model, DynamicPPL.VarInfo(model))`.

 We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or costraint functions that they need to respect. For instance, here's how one can find the MLE with the constraint that the variance must be less than 0.01 and the mean must be between -1 and 1.:
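
For context, a self-contained sketch of the kind of call the hunk above modifies, with Mooncake as the AD backend. The `gdemo` model and its data are placeholders that only mirror a variance-plus-mean structure like the one described in the surrounding prose (the tutorial's actual `model` is defined earlier in that file and is not part of this diff), and the default optimiser is assumed rather than the explicit `NelderMead()` shown above, to keep the sketch free of extra imports.

```julia
using Turing
import Mooncake

# Placeholder model with a variance s² and a mean m; not the tutorial's model.
@model function gdemo(x)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))
    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s²))
    end
end

model = gdemo([1.5, 2.0])

# initial_params follows the order the parameters appear in the model
# (here s², then m); adtype selects Mooncake for the gradients.
mle = maximum_likelihood(model; initial_params=[0.1, 2.0], adtype=AutoMooncake(; config=nothing))
mle.values  # optimised parameter values
```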
