Commit ec3b9ad

Slightly improve the writing
1 parent 3e60bf8 commit ec3b9ad

File tree: 1 file changed (+2, -2 lines)


doc/shared_memory.md

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ This feature has been implemented within the communication library as a black bo
 2DECOMP&FFT has two independent shared-memory implementations (they validate each other):
 
 - The first version is based on code supplied by David Tanqueray of Cray Inc., who initially applied this idea to several molecular dynamics applications. This code accesses platform-dependent information<a href="#note1" id="note1ref"><sup>1</sup></a> in order to establish the share-memory configurations (such as which MPI rank belongs to which node). It has been tested on Cray hardware only.
-- The second version is based on the open-source package FreeIPC, created by Ian Bush. former NAG colleague. FreeIPC is basically a Fortran wrapper for the System V IPC API and it provides a system-independent way to gather shared-memory information. This makes it possible to write more portable shared-memory code.
+- The second version is based on the open-source package FreeIPC, created by Ian Bush, a former NAG colleague. FreeIPC is basically a Fortran wrapper for the System V IPC API and it provides a system-independent way to gather shared-memory information. This makes it possible to write more portable shared-memory code.
 
 Fig. 1 below demonstrates the typical benefit of shared-memory programming. The data was collected on HECToR phase 2a system (Cray XT4 with quad-core AMD Opteron processors) from a series of simulations using 256 MPI ranks over a range of problem sizes. When the problem size is small (so is the message size), the communication routines were called more times so that the total amount of data moving within the system remains a constant. It can be seen that when the problem size is smaller, the overhead of setting up communications is much higher and the shared-memory code can improve communication efficiency by up to 30%. As the problem size increases, the benefit of using shared-memory code becomes smaller. In fact for large message size (> 32Kb in this example), the shared-memory code is actually slower due to the extra memory copying operations required to assemble/disassemble the shared-memory buffers.
 
@@ -19,7 +19,7 @@ Fig. 1 below demonstrates the typical benefit of shared-memory programming. The
 </span>
 </p>
 
-The HECToR upgrade to phase 2b (world's first production Cray XE6) presented a unique opportunity to demonstrate the benefit of shared-memory programming in real applications. The 24-core nodes were introduced to HECToR several months before the arrival of new Gemini interconnect. During the transitional period, communication intensive applications often produced more network traffic than the old SeaStar interconnect could handle. Fig.2 shows the benchmark of 2DECOMP&FFT's FFT interface with a 2592^3 problem size<a href="#note2" id="note2ref"><sup>2</sup></a>. With the slow SeaStar interconnect, the scaling is very poor when using more than several thousands of cores. However, switching on the shared-memory code significantly improved the code performance (sometimes by more than 40%) and a parallel efficiency of more than 90% was observed through out the scale. The new Gemini interconnect offers significant improvement in terms of both network bandwidth and latency. As a result, significant performance gain is to be expected for communication intensive code: the FFT benchmark was almost twice as fast in some cases. However, the shared-memory code on Gemini (not shown in the figure) offered absolutely no benefit when the network was fast enough to handle all the messages efficiently.
+The HECToR upgrade to phase 2b (world's first production Cray XE6) presented a unique opportunity to demonstrate the benefit of shared-memory programming in real applications. The 24-core nodes were introduced to HECToR several months before the arrival of new Gemini interconnect. During the transitional period, communication intensive applications often produced more network traffic than the old SeaStar interconnect could handle. Fig.2 shows the benchmark of 2DECOMP&FFT's FFT interface with a 2592^3 problem size<a href="#note2" id="note2ref"><sup>2</sup></a>. With the slow SeaStar interconnect, the scaling was poor when using more than few thousands cores. However, switching on the shared-memory code significantly improved the application performance (by as far as 40%) and parallel efficiency of more than 90% was observed through out the scale. The new Gemini interconnect offered significant improvement in terms of both network bandwidth and latency. As a result, significant performance gain was to be expected for communication intensive codes. The FFT benchmark was almost twice as fast in some cases. However, the shared-memory code on Gemini (not shown in the figure) offered absolutely no benefit when the network was fast enough to handle all the messages efficiently.
 
 <p align="center">
 <img src="images/shm2.png"><br>
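
For context when reading the two implementation notes in the diff above: the same two ingredients, discovering which MPI ranks share a node and staging data in a node-local shared buffer, can also be expressed with standard MPI-3 calls (MPI_Comm_split_type and MPI_Win_allocate_shared) rather than platform-dependent queries or FreeIPC's System V IPC wrapper. The C sketch below is purely illustrative and is not part of 2DECOMP&FFT; it assumes an MPI-3 library, and the 4-integer slot size is an arbitrary example value.

```c
/* Illustrative sketch only, not 2DECOMP&FFT code. Assumes an MPI-3 library.
 * It (1) groups ranks by node, mirroring the "which MPI rank belongs to
 * which node" information mentioned above, and (2) lets the ranks of one
 * node assemble data into a shared buffer that a single leader rank could
 * then move off-node, which is the general idea behind the shared-memory
 * optimisation described in this document. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* 1. One communicator per shared-memory node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* 2. A shared buffer with one slot per node-local rank
     *    (4 integers per rank, purely as an example). */
    const MPI_Aint slot = 4;
    int *my_slot = NULL;
    MPI_Win win;
    MPI_Win_allocate_shared(slot * sizeof(int), sizeof(int),
                            MPI_INFO_NULL, node_comm, &my_slot, &win);

    /* Every rank fills its own slot; in a real library this is where a
     * message would be assembled before a single off-node transfer. */
    MPI_Win_fence(0, win);
    for (int i = 0; i < slot; i++)
        my_slot[i] = world_rank;
    MPI_Win_fence(0, win);

    /* The node leader (node_rank == 0) can address every slot directly. */
    if (node_rank == 0) {
        for (int r = 0; r < node_size; r++) {
            int *remote;
            MPI_Aint seg_size;
            int disp_unit;
            MPI_Win_shared_query(win, r, &seg_size, &disp_unit, &remote);
            printf("node leader sees %d from local rank %d\n", remote[0], r);
        }
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The two implementations described in this document predate MPI-3, which is presumably why the platform-specific and FreeIPC routes were taken at the time.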
