Citation: LIU Biao, CHEN Changlin, ZHANG Yufei, et al. High-efficiency data loading and output buffering strategy for sparse convolutional computing[J]. Journal of National University of Defense Technology, 2023, 45(5): 212-221.
|
|
|
DOI: 10.11887/j.cn.202305025
Received: 2022-06-08
Foundation items: National Natural Science Foundation of China (61804181, 62074166); National Key R&D Program of China (2019YFB2205102)
|
High-efficiency data loading and output buffering strategy for sparse convolutional computing |
LIU Biao, CHEN Changlin, ZHANG Yufei, LIU Sitong, TANG Liqin, YU Hongqi |
(College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China)
|
Abstract: |
In view of problems such as inefficient data loading, low utilization of multiply-accumulate resources, and complex output-buffer addressing logic in existing neural network accelerators when processing sparse neural networks, a high-efficiency data loading and output buffering strategy for sparse convolutional computing is proposed. The strategy performs an all-to-all multiply-accumulate operation on the non-zero input feature map data and the non-zero weights belonging to the same input channel, which reduces the difficulty of pairing non-zero data and improves the utilization of multiply-accumulate resources. By using input-stationary computation and dense cyclic loading of input feature map data, it significantly reduces the number of off-chip data fetches. It also optimizes the output buffer design, eliminating the output-buffer address access contention and storage congestion found in existing solutions. Experimental results show that, compared with a fine-grained systolic accelerator with a similar architecture, the processing element area of the proposed architecture is reduced by 21.45%, the data loading speed is increased by 117.71% on average, and the average multiplier utilization is increased by 11.25%, reaching 89%.
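The all-to-all pairing idea in the abstract can be illustrated with a minimal functional sketch. This is a software model, not the hardware dataflow; the function name, single input channel, stride 1, and "valid" output size are assumptions made for illustration:

```python
import numpy as np

def sparse_all_to_all_conv(ifmap, weights):
    """Sketch of all-to-all sparse convolution for one input channel.

    Every non-zero activation is paired with every non-zero weight of the
    same channel, so no index-matching logic is needed to find valid
    (activation, weight) pairs; each product is simply accumulated into
    the output position it contributes to (stride 1, 'valid' output).
    """
    H, W = ifmap.shape
    K, _ = weights.shape
    out = np.zeros((H - K + 1, W - K + 1))
    # Enumerate the non-zero activations and non-zero weights once each.
    nz_acts = [(y, x, ifmap[y, x])
               for y in range(H) for x in range(W) if ifmap[y, x] != 0]
    nz_wts = [(ky, kx, weights[ky, kx])
              for ky in range(K) for kx in range(K) if weights[ky, kx] != 0]
    # All-to-all pairing: every non-zero activation meets every non-zero weight.
    for (y, x, a) in nz_acts:
        for (ky, kx, w) in nz_wts:
            oy, ox = y - ky, x - kx  # output coordinate this pair feeds
            if 0 <= oy < out.shape[0] and 0 <= ox < out.shape[1]:
                out[oy, ox] += a * w
    return out
```

Because only non-zero operands are enumerated, every multiply performed is a useful one, which is the sense in which all-to-all pairing raises multiplier utilization on sparse data.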
Keywords: neural network accelerator; sparse convolutional neural network; input stationary; all-to-all calculation
|
|
|
|
|