Speaker
Description
As a conventional iterative ptychography reconstruction method, the Difference Map (DM) is well suited to processing large sets of X-ray diffraction patterns on the multi-GPU heterogeneous systems at HEPS (High Energy Photon Source). However, it assumes the GPU memory is large enough to hold all the patterns transferred from RAM for CUDA FFT/IFFT, while the intermediate data generated during the iterations occupies roughly as much memory again as the raw data. Moreover, the resolution of the Eiger detectors at the beamlines is improving rapidly, so each diffraction frame is larger than ever. To keep up with the data acquisition rates, a new approach that distributes the ptycho-DM computation across MPI processes and CUDA devices under a finite amount of computing resources needs to be considered.
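As a back-of-envelope illustration of why the data cannot simply stay resident on one GPU (all numbers below are hypothetical, not the actual beamline configuration), the working set is roughly twice the raw data size once the intermediate iterate data is counted:

```python
# Hypothetical numbers for illustration only.
import numpy as np

n_frames    = 20_000                     # assumed scan length
frame_shape = (1024, 1024)               # assumed Eiger frame size
frame_bytes = int(np.prod(frame_shape)) * np.dtype(np.float32).itemsize

raw_gib     = n_frames * frame_bytes / 2**30
working_gib = 2 * raw_gib                # raw data + intermediate iterate data
print(f"raw: {raw_gib:.0f} GiB, working set: {working_gib:.0f} GiB")
# -> raw: 78 GiB, working set: 156 GiB, far beyond e.g. a 40 GiB A100
```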
Inspired by PyNX, we use the pycuda API to query the amount of GPU memory allocated by SLURM, then divide that value by 2.5 to obtain the maximum subset size of raw diffraction data that a single GPU can process, given the memory overhead described above. The raw diffraction data is first split in RAM with K-means, clustering the patterns into subsets according to the scan positions recorded during acquisition. An averaging adjustment then equalizes the number of patterns across subsets, and every pair of neighbouring subsets shares several patterns in common. Next, each subset is sent by an MPI process from RAM to a GPU and iterated on to reconstruct a probe and a sub-object; the results are then copied back to RAM and the GPU memory is freed for the next batch of subsets. Once all subsets of the raw diffraction data have completed the reconstruction workflow, the probes are merged into one and the sub-objects are stitched together into the full object.
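A minimal sketch of this sizing-and-splitting step, assuming NumPy, scikit-learn, and pycuda are available (helper names such as `max_frames_per_gpu` and `split_by_position` are illustrative, not the actual Daisy API):

```python
import numpy as np
import pycuda.autoinit          # creates a context on the SLURM-allocated GPU
import pycuda.driver as cuda
from sklearn.cluster import KMeans

def max_frames_per_gpu(frame_nbytes):
    """Largest subset (in frames) one GPU can hold, per the divide-by-2.5 rule."""
    free_bytes, _total = cuda.mem_get_info()
    return int(free_bytes / 2.5) // frame_nbytes

def split_by_position(positions, n_subsets, overlap=4):
    """K-means split of scan positions, balanced, with shared boundary frames."""
    labels = KMeans(n_clusters=n_subsets, n_init=10).fit_predict(positions)
    subsets = [np.where(labels == k)[0] for k in range(n_subsets)]
    target = len(positions) // n_subsets
    for k in range(n_subsets - 1):        # push any surplus to the neighbour
        if len(subsets[k]) > target:
            subsets[k], surplus = subsets[k][:target], subsets[k][target:]
            subsets[k + 1] = np.concatenate([surplus, subsets[k + 1]])
    # Duplicate a few frames so neighbouring subsets overlap for stitching.
    return [np.concatenate([subsets[k], subsets[k + 1][:overlap]])
            if k + 1 < n_subsets else subsets[k]
            for k in range(n_subsets)]
```

In the real pipeline, each MPI rank would then transfer its subsets to its GPU, run the DM iterations, and return the probe and sub-object for merging.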
This distributed parallel DM implementation was profiled on 2 NVIDIA A100 GPUs with NVIDIA Nsight Systems version 2022.3. Nsight reports high GPU utilization, indicating that the reconstruction algorithm runs efficiently on the A100s.
Beyond the CUDA-accelerated approach, we also integrate the Hygon DCU into the heterogeneous system for ptycho-DM reconstruction, using OpenCL together with the MPI parallel computing toolkit. This heterogeneous system consists of 2 Hygon DCU Z100 cards (32 GB each), 250 GB of DDR4 RAM, and a Hygon 7390 CPU (32 cores, 2.7 GHz). The Hygon DCU relies on hipFFT and rocFFT, which differ substantially from CUDA's cuFFT. One optimization is to make the number of threads per block a multiple of 64 on the DCU rather than 32, because NVIDIA GPUs have a warp size of 32 while the DCU uses a 64-wide wavefront.
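A hedged pyopencl sketch of this block-size rule (the kernel and helper are illustrative; `PREFERRED_WORK_GROUP_SIZE_MULTIPLE` is a standard OpenCL query that reports the warp width of 32 on NVIDIA GPUs and the 64-wide wavefront on AMD-derived hardware such as the DCU):

```python
import pyopencl as cl

ctx = cl.create_some_context()
device = ctx.devices[0]

# Stand-in kernel; a real DM update kernel would go here.
prog = cl.Program(ctx, """
__kernel void scale(__global float *a, const float s) {
    a[get_global_id(0)] *= s;
}
""").build()

# 32 on NVIDIA GPUs (warp), 64 on the Hygon DCU (wavefront).
wavefront = prog.scale.get_work_group_info(
    cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device)

def aligned_block_size(requested):
    """Round the per-block thread count up to a wavefront multiple."""
    return -(-requested // wavefront) * wavefront

print(f"wavefront={wavefront}, 200 threads -> block of {aligned_block_size(200)}")
```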
This distributed parallel computing module for ptycho-DM reconstruction is part of the Daisy software, which organizes and processes data at HEPS, IHEP.
Consider for long presentation: Yes