doc/hector.md

This set of benchmarks was performed in March 2010 on the HECToR phase 2a system.
Large-scale parallel benchmarks of the FFT interface were performed, using problem sizes up to 8192^3. The results presented are the time spent to compute a pair of forward and backward transforms on random signals. Both c2c and r2c/c2r transforms were tested. The underlying FFT engine is the ACML FFT (version 4.3). In all cases, the original signals were recovered to machine accuracy after the backward transforms - a good validation of the library. Up to 16384 cores were used on HECToR; each case was repeated 3 times and the fastest result was recorded. On Jaguar, a few very large tests were arranged using up to 131072 cores. Note that runtime performance can vary a lot for such communication-intensive applications, particularly on busy production systems.
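For reference, the kind of forward/backward round trip timed here can be written with 2DECOMP&FFT's Fortran API roughly as follows. This is only a minimal sketch, not the actual benchmark driver: the problem size, the processor grid and the error reporting are illustrative, and the rescaling step assumes the engine returns unnormalised transforms.

```fortran
program fft_roundtrip
   use MPI
   use decomp_2d
   use decomp_2d_fft
   implicit none

   integer, parameter :: nx = 256, ny = 256, nz = 256   ! illustrative size only
   integer, parameter :: p_row = 4, p_col = 4           ! p_row*p_col must equal the MPI rank count
   integer :: ierror
   double precision :: t1, t2
   real(mytype), allocatable :: re(:,:,:), im(:,:,:)
   complex(mytype), allocatable :: in1(:,:,:), in2(:,:,:), sp(:,:,:)

   call MPI_INIT(ierror)
   call decomp_2d_init(nx, ny, nz, p_row, p_col)
   call decomp_2d_fft_init   ! default: physical space in X-pencils, spectral space in Z-pencils

   allocate(in1(xsize(1),xsize(2),xsize(3)), in2(xsize(1),xsize(2),xsize(3)))
   allocate(sp(zsize(1),zsize(2),zsize(3)))
   allocate(re(xsize(1),xsize(2),xsize(3)), im(xsize(1),xsize(2),xsize(3)))

   call random_number(re)                 ! random test signal
   call random_number(im)
   in1 = cmplx(re, im, kind=mytype)

   t1 = MPI_WTIME()
   call decomp_2d_fft_3d(in1, sp, DECOMP_2D_FFT_FORWARD)    ! c2c forward
   call decomp_2d_fft_3d(sp, in2, DECOMP_2D_FFT_BACKWARD)   ! c2c backward
   t2 = MPI_WTIME()

   ! the transforms are unnormalised, so rescale before checking the round-trip error
   in2 = in2 / (real(nx,mytype)*real(ny,mytype)*real(nz,mytype))
   if (nrank == 0) write(*,*) 'c2c forward+backward time (s):', t2 - t1
   write(*,*) 'rank', nrank, 'max round-trip error:', maxval(abs(in2 - in1))

   call decomp_2d_fft_finalize
   call decomp_2d_finalize
   call MPI_FINALIZE(ierror)
end program fft_roundtrip
```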
<p align="center">
<img src="images/fft_hector_2a.png"><br>
<span style="font-size:smaller;">Scaling of the FFT interface on HECToR phase 2a and Jaguar.</span>
</p>
It can be seen that the FFT interface scales almost perfectly on HECToR for all the tests. As expected, r2c/c2r transforms are twice as fast as c2c transforms. On Jaguar, the scaling is less impressive at larger core counts, but the parallel efficiency is still a respectable 81% for the largest test. For one particular configuration - a 4096^3 mesh on 16384 cores - the time spent on Jaguar is almost twice that on HECToR. This is not unexpected. Jaguar's nodes each contained two 6-core chips, so for good load balance the problem size needs to contain a factor of 6, which was not the case in these tests. Also, the problem size 8192^3, while quite large for real-world applications, is really too small to distribute over 10^5 cores.
The largest tests done before were on problem size 8192^3.
The library code had to be optimised first to minimise the memory footprint. The ACML implementation of the library was optimised by using in-place transforms wherever possible. A software option was also introduced to allow the FFT input to be overwritten.<a href="#note2" id="note2ref"><sup>2</sup></a> In order to increase the problem size further, the 24-core nodes can be under-populated - by using only 16 cores per node, each core has access to about 50% more memory.
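For a rough sense of the numbers (assuming, for illustration, a 24-core node with 32 GB of memory): fully populated, each core sees 32/24 ≈ 1.33 GB, whereas with only 16 cores per node each core sees 32/16 = 2 GB - a ratio of 24/16 = 1.5, i.e. the roughly 50% extra memory per core quoted above, at the cost of reserving 50% more nodes for the same core count.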
(Table of timings on 16384 cores; only its final entry, 82.76 seconds, is visible in this excerpt.)
The table summarises all the test cases done using 16384 cores. For the under-populated cases, 24576 cores (1024 nodes, the largest possible HECToR job) had to be reserved. The figures reported are the number of seconds to perform a pair (forward + backward) of single-precision complex-to-complex FFTs. As shown, the largest problem size achieved is 12288\*8192\*8192. The scaling of the library is very good - each time the problem size is doubled, the time required only slightly more than doubles. Also shown is that when running in under-populated mode, the code is consistently about 20% faster.
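The near-perfect doubling is consistent with the N*log(N) operation count of a 3D FFT: doubling the total number of grid points N doubles the work and adds only a log(2N)/log(N) correction, which for N of the order of 8192^3 is just a few per cent - assuming the communication cost grows in roughly the same proportion.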
<hr size="1">
<aid="note1"href="#note1ref"><sup>1</sup></a>This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
<aid="note2"href="#note2ref"><sup>2</sup></a>2DECOMP&FFT's FFT interface itself does not support inplace transforms at the moment - the input and output must point to distinct memory addresses. But the sub-steps (1D FFTs) can be implemented using inplace transforms provided by the underlying FFT engines. Allowing the input to be overwritten makes it possible to reuse the memory as scratch space.
doc/jugene.md

This set of benchmarks was performed in May 2010 on JUGENE, the big IBM Blue Gene/P system.
The work was made possible with the assistance of high performance computing resources (Tier-0) provided by PRACE. 2DECOMP&FFT was ported onto the Blue Gene/P. One major improvement achieved was the implementation of the FFT interface using ESSL, a high-performance math library native to IBM systems. The FFT interface was then benchmarked on problem sizes up to 8192^3 using up to 131072 cores.
<p align="center">
<img src="images/fft_bgp.png"><br>
<span style="font-size:smaller;">Scaling of the FFT interface on Blue Gene/P JUGENE.</span>
</p>
As seen, the code scales extremely well on the system for all problem sizes. The apparent super-linear scaling for the 1024^3 case is understood to be related to the torus network configuration, which favours larger jobs.
doc/p3dfft.md

P3DFFT is probably the most well-known open-source distributed FFT library.
P3DFFT was actually ported onto HECToR (my development system) at an early stage of the 2DECOMP&FFT project. Fig. 1 shows its good scaling on the old hardware (back in early 2009, the system was a Cray XT4 using dual-core AMD Opteron processors and the Cray SeaStar interconnect).
<p align="center">
<img src="images/p3dfft_hector_phase1.png"><br>
<span style="font-size:smaller;">Figure 1. P3DFFT scaling on Cray XT4 HECToR.</span>
</p>
What motivated the author to develop a new and somewhat competing library were the following:
- P3DFFT is an FFT-only package. It is not designed as a general-purpose 2D decomposition library and its communication routines are not designed to be user callable. 2DECOMP&FFT provides a general-purpose decomposition library to support the building of a variety of applications (the applications do not necessarily need to use FFT).
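As a quick illustration of that general-purpose use (independent of the FFT interface), a pencil-based application might drive the library's documented decomposition and transposition routines along the following lines. This is a minimal sketch; the grid, the processor grid and the per-pencil computations are placeholders.

```fortran
program pencil_demo
   use MPI
   use decomp_2d
   implicit none

   integer, parameter :: nx = 128, ny = 128, nz = 128
   integer, parameter :: p_row = 4, p_col = 2   ! p_row*p_col must equal the MPI rank count
   integer :: ierror
   real(mytype), allocatable :: ux(:,:,:), uy(:,:,:), uz(:,:,:)

   call MPI_INIT(ierror)
   call decomp_2d_init(nx, ny, nz, p_row, p_col)

   ! local storage for the same global field in X-, Y- and Z-pencil orientation
   allocate(ux(xsize(1),xsize(2),xsize(3)))
   allocate(uy(ysize(1),ysize(2),ysize(3)))
   allocate(uz(zsize(1),zsize(2),zsize(3)))

   ux = 0.0_mytype
   ! ... do work along x on ux (complete x-lines are local in X-pencils) ...

   call transpose_x_to_y(ux, uy)   ! global redistribution handled by the library
   ! ... do work along y on uy ...

   call transpose_y_to_z(uy, uz)
   ! ... do work along z on uz ...

   call transpose_z_to_y(uz, uy)   ! and back again if the algorithm requires it
   call transpose_y_to_x(uy, ux)

   call decomp_2d_finalize
   call MPI_FINALIZE(ierror)
end program pencil_demo
```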
The parallel performance of 2DECOMP&FFT and P3DFFT has been studied in great detail in an [MSc thesis by E. Brachos at the University of Edinburgh](https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2010-2011/EvangelosBrachos.pdf). Fig. 2 shows a set of benchmarks on r2c/c2r transforms of size 256^3. The MPI interface of FFTW 3.3 was also examined, although it can only run in 1D slab-decomposition mode.
<p align="center">
<img src="images/Brachos.png"><br>
<span style="font-size:smaller;">Figure 2. Speedup of 2DECOMP&FFT, P3DFFT and FFTW 3.3's MPI interface.</span>
</p>
The performance difference between 2DECOMP&FFT and P3DFFT is often shown to be marginal, although the 2D processor grid that achieves optimal performance can be very different, owing to the different internal architectures of the two libraries.