© 2022
Mohammadreza Soltaniyeh
ALL RIGHTS RESERVED

HARDWARE-SOFTWARE TECHNIQUES FOR ACCELERATING SPARSE COMPUTATION
By
MOHAMMADREZA SOLTANIYEH
A dissertation submitted to the
School of Graduate Studies
Rutgers, The State University of New Jersey
In partial fulfillment of the requirements
For the degree of
Doctor of Philosophy
Graduate Program in Computer Science
Written under the direction of
Santosh Nagarakatte
And approved by
Dissertation Director: Prof. Santosh Nagarakatte
Linear algebra kernels are widely used in various fields such as machine learning, data science, physical science, and graph analysis. Many of these applications work with sparse data (i.e., only a small fraction of the data is nonzero). Sparse data are often stored in a compressed format (i.e., a sparse format) that keeps only the nonzero elements, along with additional metadata to identify where the nonzero elements are located. Using compressed formats eliminates the need to store and process zeros, making the storage and computation of sparse kernels more efficient.
sparse formats. Our results show that REAP achieves both high frequency and promising speedups compared to state-of-the-art FPGA accelerators for SpMV and SpGEMM while supporting various sparse formats and precision configurations. Second, we present a hardware accelerator for sparse CNN inference tasks. We formulate the convolution operation as general matrix-matrix multiplication (GEMM) using an image-to-column (IM2COL) transformation. With a dynamically reconfigurable GEMM unit and a novel IM2COL hardware unit, our design supports the various layers in CNNs with high performance. Moreover, our design exploits sparsity in both the weights and the feature maps. We use software to perform group-wise pruning, followed by a preprocessing step that places the pruned weights into our hardware-friendly sparse format for efficient, high-performance computation. We evaluated our accelerator using a cycle-level simulator and an HLS implementation realized on an Alveo FPGA board. Our ASIC design is on average 2.16×, 1.74×, and 1.63× faster than Gemmini, Eyeriss, and SparsePE, respectively, which are prior hardware accelerators for dense and sparse CNNs. In addition, our hardware accelerator is 78× and 12× more energy-efficient than CPU and GPU implementations, respectively.
I would also like to thank the other members of my Ph.D. committee, Prof. Yipeng Huang and Prof. Joe Devietti, for their thoughtful feedback about my research that has helped shape this dissertation.
Additionally, I would like to thank the Memory Solution team at Samsung Semiconductor, where I did two internships during my Ph.D. I particularly want to thank my mentors and manager, Veronica Lagrange Moutinho dos Reis, Matt Bryson, and Xuebin Yao, from whom I learned a lot.
Last but not least, I would like to thank all my family for their unconditional love and support throughout the years. I would also like to thank Mahdi, Alireza, Kathryn, Shahab, Mahyar, Vahid, Shiva, and many other friends with whom I shared many good times.
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1.1  Dissertation Statement  . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
1.1.1  Synergistic CPU-FPGA Acceleration for SpMV and SpGEMM . . .  
1.1.2  
1.1.3  An End-to-end FPGA Prototype of SPOTS for Sparse CNNs . . . .  
1.2  Papers Related to this Dissertation . . . . . . . . . . . . . . . . . . . . . .  
1.3  Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . .  
2.1  …Algebra Kernels  . . . . . . . . . . . . . . . . . . . . . . . . . .  
2.1.1  Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
2.2  Background on Sparse Formats and Sparse Linear Algebra Kernels . . . . . 
2.2.1  A Background on Sparse Formats  . . . . . . . . . . . . . . . . . .  
2.2.2  A Background on SpMV and SpGEMM kernels . . . . . . . . . . .  
2.3  Synergistic CPU-FPGA Acceleration . . . . . . . . . . . . . . . . . .  
2.3.1  Instantiation of REAP to Accelerate SpMV . . . . . . . . . . . . .  
2.3.2  Instantiation of REAP to Accelerate SpGEMM . . . . . . . . . .  
2.4  Experimental Methodology of REAP . . . . . . . . . . . . . . . . . . . . .  
2.5  Experimental Evaluation of REAP for SpMV and SpGEMM . . . . . . . .  
2.6  Related Work on Accelerators for SpMV and SpGEMM Kernels . . . . . .  
2.7  Summary  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
3.1  Overview of SPOTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
3.1.1  Novelties of SPOTS . . . . . . . . . . . . . . . . . . . . . . . . . .  
3.2  Background on CNNs, IM2COL, and Sparsity-Awareness in CNNs . . . . .  
3.2.1  Convolutional Neural Networks  . . . . . . . . . . . . . . . .  
3.2.2  Transforming Convolution to General Matrix-Matrix Multiplication  
3.2.3  Sparsity-Awareness in CNNs . . . . . . . . . . . . . . . . . .  
3.3  …
3.3.1  …Processing Elements  . . . . . . . . . . . . . . . . . . . . .  
3.3.2  Benefits and Challenges of Convolution with IM2COL  . . . . . . .  
3.3.3  Drawbacks of Prior Sparsity-aware Designs . . . . . . . . . . .
3.4.2 The GEMM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.3 Handling Sparsity in CNNs . . . . . . . . . . . . . . . . . . . . . . 73
Chapter 4: An FPGA Accelerator for Sparse Convolutional Neural Networks . . 85
4.1 Background on High Level Synthesis for FPGAs . . . . . . . . . . . . . . 86
4.2.2 The GEMM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2.3 Sparsity-aware Design . . . . . . . . . . . . . . . . . . . . . 99
Chapter 5: …Sparse Convolutional Neural Networks . . . . . . . . . . . 107
5.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2  …
5.2.1  …
5.2.2  …Prototypes with CPUs and GPUs . . . . . . . . . . . . . . . . 114
5.2.3  Performance Sensitivity to Different Layers’ Shapes  …
Chapter 6: Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . 123
6.2.1  …
6.2.2  …
6.2.3  …
2.1  The CPU and FPGA configurations. . . . . . . . . . . . . . . . . . . .  39
…  FPGA resource utilization for SpMV and SpGEMM designs. . . . . . . . . .  39
…  List of matrices from SparseSuite [27] used in our evaluation. . . . . .  41
…  …for three sparse formats and different input precisions. . . . . . . .  42
…  …accelerate sparse linear algebra kernels. . . . . . . . . . . . . . . .  51
3.1  …  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  82
2.1  …CSR, CSC, DIA, ELL, and RLC. . . . . . . . . . . . . . . . . . . . .  19
2.2  …matrix multiplication. . . . . . . . . . . . . . . . . . . . . . . .  21
2.3  The floorplan of XCU200 FPGA. . . . . . . . . . . . . . . . . . . . . . . .  25  
2.4  …FPGA tasks in REAP. . . . . . . . . . . . . . . . . . . . . . . . . .  27
2.5  An illustration of RIR bundle generation for SpMV. . . . . . . . . . . . . .  30  
2.6  An illustration of the FPGA design for SpMV. . . . . . . . . . . . . . . . .  31  
2.7  An illustration of RIR bundle generation for SpGEMM. . . . . . . . . . . .  34  
2.8  Overall architecture of the FPGA design for SpGEMM.  . . . . . . . . .  34  
2.9  An illustration of the multiply unit in action for SpGEMM design.  . . . . .  35 
2.12 Comparing REAP’s speedup with the recent FPGA accelerators for SpMV and SpGEMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.13 REAP execution breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . 47
…  An illustration of a convolution layer along with its inputs.  . . . . .  58
…  An illustration of an FC layer and its matrix form. . . . . . . . . . . .  59
…  A comparison of various pruning methods for CNNs. . . . . . . . . . . . .  61
…  …VGGNet, and GoogleNet for a CPU system. . . . . . . . . . . . . . . . .  63
…  …systolic array-based GEMM unit. . . . . . . . . . . . . . . . . . . . .  67
…  An illustration of patch generation using the PUs in the IM2COL unit.  .  68
…  An illustration of the GEMM unit in our accelerator. . . . . . . . . . .  71
…  …state-of-the-art sparse formats. . . . . . . . . . . . . . . . . . . . .  75
our sparsity-aware architecture. . . . . . . . . . . . . . . . . . . . 76
3.11 An illustration of different configurations of the systolic array GEMM unit. 77
4.5 A high-level overview of IM2COL and patch units. . . . . . . . . . . . 92
4.6 An illustration of the IM2COL unit in our FPGA design for sparse CNNs. . 93
a variety of filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1 An illustration showing the sparsity ratio in the filters and input feature maps …
…SparsePE, Eyeriss, and Gemmini for four CNNs: AlexNet, VGGNet, ResNet, and GoogleNet.
5.6 The load imbalance percentage in the pruned weights for AlexNet, VGGNet, ResNet, and GoogleNet. . . . . . . . . . . . . . . . . . . . . . . 120
First, a sparse format can reduce the total storage by orders of magnitude compared to a dense representation. In addition, operations on zeros can be skipped to reduce the overall computation.
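The storage saving can be made concrete with a small example. The 4×4 matrix, array names, and helper below are purely illustrative (our own sketch, not code from this dissertation), using the Compressed Sparse Row layout discussed next:

```c
/* A hypothetical 4x4 matrix with 5 nonzeros:
 *
 *   [ 5 0 0 0 ]
 *   [ 0 8 0 0 ]
 *   [ 0 0 3 0 ]
 *   [ 0 6 0 4 ]
 *
 * CSR keeps only the nonzeros plus index metadata, instead of all
 * 16 entries of the dense representation. */
static const double vals[]    = {5, 8, 3, 6, 4}; /* nonzero values, row by row */
static const int    col_idx[] = {0, 1, 2, 1, 3}; /* column of each nonzero     */
static const int    row_ptr[] = {0, 1, 2, 3, 5}; /* row i occupies the range
                                                    [row_ptr[i], row_ptr[i+1]) */

/* Total stored nonzeros for an n_rows-row CSR matrix. */
static int csr_nnz(const int *row_ptr, int n_rows) {
    return row_ptr[n_rows] - row_ptr[0];
}
```

Here only 5 values, 5 column indices, and 5 row pointers are stored instead of 16 dense entries; for large matrices with a tiny fraction of nonzeros, this gap grows to the orders of magnitude mentioned above.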
Sparse formats store only the nonzero elements and use various techniques to encode the positions of the nonzero elements within the matrix [15, 26, 72, 91, 104, 131]. In most standard sparse formats, determining the position of a nonzero element requires a series of indirect memory accesses, which introduces irregular memory access patterns. For example, to access an element A[i, j] in the Compressed Sparse Row (CSR) [104] format, one needs to consult the row pointer to obtain the location where row i begins (i.e., row_pointer[i]), search the column indices until the beginning of the next row to check whether the item is present (i.e., search from col[row_pointer[i]] to col[row_pointer[i + 1]]), and then access the data element at the matched index. In addition, skipping operations involving zeros requires extra work to determine whether the nonzero elements match before performing the operation. The overhead of accessing the indices of the sparse structure, not including the matching cost, can exceed the work of the mathematical operations by 2×–5× [59, 61].
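The access sequence described above can be sketched as follows. This is a minimal illustrative helper with our own naming, not the dissertation's implementation, but it makes the chain of indirections visible:

```c
/* Look up A[i][j] in a CSR matrix; returns 0.0 for entries that are
 * not stored. Note the chain of indirect, data-dependent accesses:
 * row_ptr -> col_idx -> vals. */
double csr_get(const double *vals, const int *col_idx,
               const int *row_ptr, int i, int j) {
    int start = row_ptr[i];       /* 1st indirection: where row i begins  */
    int end   = row_ptr[i + 1];   /* ... and where the next row begins    */
    for (int k = start; k < end; k++) {
        if (col_idx[k] == j)      /* 2nd indirection: scan column indices */
            return vals[k];       /* 3rd indirection: fetch the value     */
    }
    return 0.0;                   /* implicit zero: never stored          */
}
```

A linear scan is shown for clarity; since column indices within a row are typically sorted, a real implementation could binary-search instead, but the indirect, input-dependent memory accesses remain either way.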
GPUs use a Single Instruction Multiple Data (SIMD) model to perform more useful work per instruction. Additionally, they rely on caching techniques to maximize memory performance by exploiting temporal and spatial data locality. SIMD and caching techniques are effective for most dense computations, but they are far less effective for sparse problems. Sparse formats introduce many indirect and irregular memory accesses, which limit the benefits of both caching and the SIMD model. As a result, applying the same formulations and optimizations used for the dense version to the sparse version results in poor performance.
For many years, Moore's law [92] and Dennard scaling [32] helped general-purpose processors become faster and more energy efficient transparently. Dennard scaling is now obsolete, and Moore's law has slowed down and is expected to end in the coming years. In response, there has been increased interest in designing specialized hardware, such as ASICs and FPGAs, for various applications, including sparse problems. Specialization can be applied at different levels of sparse computation. At the algorithmic level, new algorithms and methods can be developed that perform better in sparse scenarios [15, 46, 137]. For sparse data storage, sparse formats can be customized based on the operation's memory access pattern and the sparsity pattern of the inputs [38, 61, 97, 120]. Lastly, custom hardware can be designed to perform operations such as extracting or matching the nonzero elements more efficiently [7, 42, 52, 61]. These customizations aim to improve the performance and energy efficiency of sparse computation by extracting parallelism at a finer granularity and by optimizing the memory hierarchy to accommodate the irregular memory accesses arising in sparse problems. Sparse Matrix-Vector Multiplication (SpMV), Sparse General Matrix-Matrix Multiplication (SpGEMM), and sparse neural networks are among the most studied sparse problems for such specialized hardware.
This dissertation presents a number of hardware-software techniques to improve the performance and energy efficiency of sparse linear algebra kernels, including SpMV, SpGEMM, and sparse convolutional neural networks. Our general strategy is to use software methods to reorganize the sparse data into a hardware-friendly format that allows the hardware to perform the computation with a high degree of parallelism. The software provides the flexibility to support multiple sparse formats, and the hardware delivers performance and energy efficiency. We develop an intermediate representation that allows the software to communicate regularized data and scheduling decisions to the hardware. Moreover, most of the software and hardware execution is overlapped. We applied these hardware-software techniques to three sparse problems: sparse matrix-vector multiplication, sparse general matrix-matrix multiplication, and sparse convolutional neural networks. The different characteristics of these three applications raise different questions, some of which we answer with the following contributions:
1. A CPU-FPGA system to improve the performance of SpMV and SpGEMM kernels in comparison to CPU-only and FPGA-only designs, while supporting multiple sparse formats and data precisions.