High Performance Computing and Artificial Intelligence

1 Optimizing Yinyang K-means algorithm on many-core CPUs

ZHOU Tianyang , WANG Qinglin , LI Rongchun , MEI Songzhu , YIN Shangfei , HAO Ruochen , LIU Jie

2024, 46(1):93-102. DOI: 10.11887/j.cn.202401010

[Abstract](14778) [HTML](641) [PDF 1.09 M](2698)

Abstract:
Traditional Yinyang K-means algorithm is computationally expensive when dealing with large-scale clustering problems. An efficient parallel acceleration implementation of Yinyang K-means algorithm was proposed on the basis of the architectural characteristics of typical many-core CPUs. This implementation was based on a new memory data layout, used vector units in many-core CPUs to accelerate distance calculation in Yinyang K-means, and targeted memory access optimization for NUMA(non-uniform memory access) characteristics. Compared with the open source multi-threaded version of Yinyang K-means algorithm, this implementation can achieve the speedup of up to 5.6 and 8.7 approximately on ARMv8 and x86 many-core CPUs, respectively. Experiments show that the optimization successfully accelerate Yinyang K-means algorithm in many-core CPUs.

2 Parallel optimization of convolution algorithm on multi-core DSP

XU Jinwei , WANG Qinglin , LI Yalin , JIANG Jingfei , GAO Lei , LI Rongchun , LI Dongsheng

2024, 46(1):103-112. DOI: 10.11887/j.cn.202401011

[Abstract](9189) [HTML](645) [PDF 1.74 M](2938)

Abstract:
According to the characteristics of the heterogeneous multi-core DSP(digital signal processing) chip independently developed by National University of Defense Technology and the characteristics of the convolution algorithm, a high-performance multi-core parallel convolution implementation scheme for multi-core DSP architecture was proposed. A feature graph level multi-core parallel scheme is proposed for 1×1 convolution. For convolutions with kernels larger than 1, a window level multi-core parallel optimization design was proposed, and an element-wise vectorization based intra-core parallel optimization implementation was proposed. The experimental results show that the proposed parallel optimization method can reach a maximum single core computing efficiency of 64.95%. When the bandwidth is limited, the parallel expansion efficiency of multi-core can still reach 48.36% ~ 88.52%. Compared with E5-2640 CPU, the execution performance on the typical network ResNet50 achieves 5.39x performance acceleration.

3 Quantization and pruning optimization method for attention mechanism

HE Yuanhong , JIANG Jingfei , XU Jinwei

2024, 46(1):113-120. DOI: 10.11887/j.cn.202401012

[Abstract](11276) [HTML](1322) [PDF 4.50 M](2685)

Abstract:
To address the significant computation and memory overhead of models based on attention mechanism, model compression techniques, such as collaborative optimization of quantization and pruning, were studied. A symmetric linear fixed point quantization method was proposed for four activation matrices of query, key, value and probability in the attention mechanism. Meanwhile, a probability matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the pruning accuracy loss. Experimental results on different datasets show that for the typical attention-based model BERT, this optimization method can achieve 4 bit or 8 bit fixed point quantization and 0.93~0.98 sparsity with little or no accuracy loss, which greatly reduces the model computation and lays a strong foundation for accelerating the inference of quantized sparse models.

4 Efficient RNN inference engine on very long vector processor

SU Huayou , CHEN Kangkang , YANG Qianming

2024, 46(1):121-130. DOI: 10.11887/j.cn.202401013

[Abstract](4808) [HTML](729) [PDF 3.14 M](2914)

Abstract:
With the increasing depth and the inconsistent length of processing sequences, the performance optimization of RNN(recurrent neural network) on different processors makes it difficult to researchers. An efficient RNN acceleration engine was implemented for the self-developed long vector processor FT-M7032. This engine proposed a row-first matrix vector multiplication algorithm and a data-aware multi-core parallel method to improve the computational efficiency of matrix vector multiplication. It proposed a two-level kernel fusion optimization method to reduce the overhead of temporary data transmission. Optimized handwritten assembly codes for multiple operators were integrated to further tap the performance potential of long vector processors. Experiments show that the RNN engine for long-vector processors is efficient, when compared with the multi-core ARM CPU and Intel Golden CPU, the RNN-like model long short term memory networks can achieve a performance acceleration of up to 62.68 times and 3.12 times, respectively.

5 Optimizing operator computation of MiniGo on high-performance heterogeneous accelerator

QIAO Peng , HE Zhouyu , LI Rongchun , JIANG Jingfei

2024, 46(1):131-140. DOI: 10.11887/j.cn.202401014

[Abstract](8023) [HTML](753) [PDF 4.79 M](3020)

Abstract:
An efficient parallel computing method based on the characteristics of the high-performance heterogeneous accelerator and the training mode of MiniGo was proposed. The on-chip computing resources were reasonably planned to achieve pipelining parallel optimization between heterogeneous devices. The shared memory programming was designed according to the existence of shared storage segments between heterogeneous devices to reduce data transmission costs. According to the characteristics of multiple computing resources in a digital signal processing cluster, combined with the computing-memory access feature of the operators, different optimization strategies were designed. At the same time, this method provides an easy-use high-performance operator library for TensorFlow. The experimental results show that this method realizes the multi-core parallel computing of operators. The speedup of convolution was 24.69 compared with that was achieved on a single core. Compared with the cropped version of the 8-core FT2000+ CPU, the speedup of training and self-play execution on this method were 3.83 and 1.5, respectively.

6 High-throughput LDPC decoder on GPU for 5G new radio

LI Rongchun , ZHOU Xin , QIAO Peng , WANG Qinglin

2024, 46(1):141-148. DOI: 10.11887/j.cn.202401015

[Abstract](6258) [HTML](773) [PDF 1.07 M](2251)

Abstract:
A GPU(graphic processing unit) based 5G software radio quasi cyclic LDPC (low-density parity check) code decoder was proposed. In order to save on chip and off chip bandwidth, code word shortening and punching techniques, two-stage quantization, and data packaging schemes were adopted to improve the utilization of data bandwidth. The experiment was based on the Nvidia RTX 2080Ti GPU platform to achieve parallel decoding of minimum and approximate decoding algorithms under high bit rates. By analyzing the optimal thread settings on the GPU, the decoding throughput of the 5/6 (2 080,1 760) LDPC algorithm is improved to 1.38 Gbit/s, and the decoding throughput performance is better than other GPU based LDPC decoders.

Home

About Journal

Guide for Authors

Editorial Board

Publication Statement

Open Access Statement

Contact Us

Journal Subscription

Rss

Chinese