Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training
Published in MICRO, 2023
Jungwoo Kim, Seonjin Na, Sanghyeon Lee, Sunho Lee, and Jaehyuk Huh, "Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training", the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2023
During training tasks for machine learning models on neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. For NPUs, the software-managed on-chip scratchpad memory (SPM) plays a crucial role in mitigating the long latency and limited bandwidth of such off-chip DRAM accesses. As the backward pass computation must be optimized to improve the effectiveness of SPM, this study identifies a new data reuse pattern specific to the backward computation. In each layer, the backward pass includes independent input and weight gradient computations that share the same output gradient. Conventional sequential processing does not exploit this potential inter-operation data reuse within the SPM. Based on this new data reuse opportunity in the backward pass, this study proposes a novel dataflow transformation scheme called interleaved gradient order, consisting of three techniques to enhance the utilization of NPU scratchpad memory. The first technique interleaves the input and weight gradient computations into a single fused operation to reduce redundant output gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize the potential data locality. However, since no single order is best for all tensors, we propose a selection algorithm that finds the most suitable order based on the tensor dimensions. The final technique further improves data reuse by choosing the best partitioning and mapping scheme for the two gradient computations on single-core and multi-core NPUs. The simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques improve performance by 29.3% and 14.5% for edge and server NPUs, respectively. Furthermore, with a quad-core server NPU, the proposed techniques reduce the execution time by 23.7%.
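To make the shared-output-gradient reuse pattern concrete, below is a minimal NumPy sketch of the interleaving idea for a single fully connected layer. This is an illustrative assumption, not the paper's NPU dataflow: the paper targets tiled SPM execution on NPU cores, while the shapes, tile size, and names (`N`, `C_IN`, `C_OUT`, `TILE`) here are hypothetical. The point is that the input gradient (`dY @ W.T`) and the weight gradient (`X.T @ dY`) both consume the same output gradient `dY`, so fusing the two loops lets each `dY` tile be fetched once and reused by both computations instead of being streamed from off-chip memory twice.

```python
import numpy as np

# Hypothetical shapes for one fully connected layer: X is (N, C_IN),
# W is (C_IN, C_OUT), so the output gradient dY is (N, C_OUT).
N, C_IN, C_OUT, TILE = 256, 128, 64, 32

X  = np.random.randn(N, C_IN).astype(np.float32)    # layer input saved from the forward pass
W  = np.random.randn(C_IN, C_OUT).astype(np.float32)
dY = np.random.randn(N, C_OUT).astype(np.float32)   # output gradient from the next layer

# Conventional sequential order: two separate passes, each streaming
# the whole of dY from off-chip memory once.
dX_seq = dY @ W.T       # input-gradient pass reads dY
dW_seq = X.T @ dY       # weight-gradient pass reads dY again

# Interleaved gradient order (sketch): fuse the two passes so each dY
# tile is brought into on-chip memory once and reused by both gradient
# computations before moving on to the next tile.
dX_int = np.zeros_like(X)
dW_int = np.zeros_like(W)
for n0 in range(0, N, TILE):
    dY_tile = dY[n0:n0 + TILE]                   # one SPM-resident tile of dY
    dX_int[n0:n0 + TILE] = dY_tile @ W.T         # input-gradient tile
    dW_int += X[n0:n0 + TILE].T @ dY_tile        # weight-gradient accumulation

# Both orders compute the same gradients; only the dY traffic differs.
assert np.allclose(dX_seq, dX_int, atol=1e-4)
assert np.allclose(dW_seq, dW_int, atol=1e-4)
```

In this toy form the fused loop halves the off-chip reads of `dY`; the paper's remaining techniques (tile-order selection and partitioning/mapping across cores) then decide how such tiles are ordered and distributed.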