<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005">
<channel xmlns:cfi="http://www.microsoft.com/schemas/rss/core/2005/internal" cfi:lastdownloaderror="None">
<title cf:type="text"><![CDATA[Editorial department of the Journal of National University of Defense Technology -->High-Performance Computing and Artificial Intelligence]]></title>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[Optimizing Yinyang <i>K</i>-means algorithm on many-core CPUs]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401010]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[The traditional Yinyang <i>K</i>-means algorithm is computationally expensive when dealing with large-scale clustering problems. An efficient parallel implementation of the Yinyang <i>K</i>-means algorithm was proposed on the basis of the architectural characteristics of typical many-core CPUs. This implementation was based on a new memory data layout, used the vector units in many-core CPUs to accelerate distance calculation in Yinyang <i>K</i>-means, and optimized memory access for NUMA (non-uniform memory access) characteristics. Compared with the open-source multi-threaded version of the Yinyang <i>K</i>-means algorithm, this implementation achieves speedups of up to approximately 5.6 and 8.7 on ARMv8 and x86 many-core CPUs, respectively. Experiments show that the optimization successfully accelerates the Yinyang <i>K</i>-means algorithm on many-core CPUs.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[ZHOU Tianyang, WANG Qinglin, LI Rongchun, MEI Songzhu, YIN Shangfei, HAO Ruochen, LIU Jie]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>ZHOU Tianyang, WANG Qinglin, LI Rongchun, MEI Songzhu, YIN Shangfei, HAO Ruochen, LIU Jie</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401010]]></guid><cfi:id>6</cfi:id><cfi:read>true</cfi:read></item>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[Parallel optimization of convolution algorithm on multi-core DSP]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401011]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[Based on the characteristics of the heterogeneous multi-core DSP (digital signal processing) chip independently developed by National University of Defense Technology and the characteristics of the convolution algorithm, a high-performance multi-core parallel convolution implementation scheme for the multi-core DSP architecture was proposed. A feature-map-level multi-core parallel scheme was proposed for 1×1 convolution. For convolutions with kernels larger than 1, a window-level multi-core parallel optimization design and an intra-core parallel optimization implementation based on element-wise vectorization were proposed. The experimental results show that the proposed parallel optimization method reaches a maximum single-core computing efficiency of 64.95%. When the bandwidth is limited, the multi-core parallel scaling efficiency still reaches 48.36% ~ 88.52%. Compared with the E5-2640 CPU, the implementation achieves a 5.39x speedup on the typical network ResNet50.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[XU Jinwei, WANG Qinglin, LI Yalin, JIANG Jingfei, GAO Lei, LI Rongchun, LI Dongsheng]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>XU Jinwei, WANG Qinglin, LI Yalin, JIANG Jingfei, GAO Lei, LI Rongchun, LI Dongsheng</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401011]]></guid><cfi:id>5</cfi:id><cfi:read>true</cfi:read></item>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[Quantization and pruning optimization method for attention mechanism]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401012]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[To address the significant computation and memory overhead of attention-based models, model compression techniques, in particular the collaborative optimization of quantization and pruning, were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices of the attention mechanism: query, key, value and probability. In addition, a probability matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for the typical attention-based model BERT, this optimization method achieves 4-bit or 8-bit fixed-point quantization and 0.93~0.98 sparsity with little or no accuracy loss, which greatly reduces the model computation and lays a strong foundation for accelerating the inference of quantized sparse models.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[HE Yuanhong, JIANG Jingfei, XU Jinwei]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>HE Yuanhong, JIANG Jingfei, XU Jinwei</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401012]]></guid><cfi:id>4</cfi:id><cfi:read>true</cfi:read></item>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[Efficient RNN inference engine on very long vector processor]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401013]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[With the increasing depth of networks and the inconsistent length of processed sequences, optimizing the performance of RNN (recurrent neural network) on different processors has become difficult for researchers. An efficient RNN acceleration engine was implemented for the self-developed long vector processor FT-M7032. This engine adopted a row-first matrix-vector multiplication algorithm and a data-aware multi-core parallel method to improve the computational efficiency of matrix-vector multiplication. It also adopted a two-level kernel fusion optimization method to reduce the overhead of temporary data transmission. Optimized handwritten assembly codes for multiple operators were integrated to further tap the performance potential of long vector processors. Experiments show that the RNN engine for long vector processors is efficient: compared with the multi-core ARM CPU and Intel Golden CPU, the RNN-like model long short-term memory network achieves speedups of up to 62.68 times and 3.12 times, respectively.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[SU Huayou, CHEN Kangkang, YANG Qianming]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>SU Huayou, CHEN Kangkang, YANG Qianming</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401013]]></guid><cfi:id>3</cfi:id><cfi:read>true</cfi:read></item>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[Optimizing operator computation of MiniGo on high-performance heterogeneous accelerator]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401014]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[An efficient parallel computing method based on the characteristics of the high-performance heterogeneous accelerator and the training mode of MiniGo was proposed. The on-chip computing resources were planned to achieve pipelined parallel optimization between heterogeneous devices. Shared memory programming was designed to exploit the shared storage segments between heterogeneous devices and reduce data transmission costs. According to the characteristics of the multiple computing resources in a digital signal processing cluster, combined with the computing and memory access features of the operators, different optimization strategies were designed. This method also provides an easy-to-use high-performance operator library for TensorFlow. The experimental results show that this method realizes the multi-core parallel computing of operators. The speedup of convolution was 24.69 compared with that achieved on a single core. Compared with the cropped version of the 8-core FT2000+ CPU, the speedups of training and self-play execution with this method were 3.83 and 1.5, respectively.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[QIAO Peng, HE Zhouyu, LI Rongchun, JIANG Jingfei]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>QIAO Peng, HE Zhouyu, LI Rongchun, JIANG Jingfei</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401014]]></guid><cfi:id>2</cfi:id><cfi:read>true</cfi:read></item>
<item>
<title xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="text"><![CDATA[High-throughput LDPC decoder on GPU for 5G new radio]]></title>
<link><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401015]]></link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html"><![CDATA[A GPU (graphics processing unit) based quasi-cyclic LDPC (low-density parity-check) code decoder for 5G software radio was proposed. To save on-chip and off-chip bandwidth, codeword shortening and puncturing techniques, two-stage quantization, and data packing schemes were adopted to improve the utilization of data bandwidth. The experiment was based on the Nvidia RTX 2080Ti GPU platform and achieved parallel decoding with the min-sum approximate decoding algorithm under high bit rates. By analyzing the optimal thread settings on the GPU, the decoding throughput of the 5/6 (2 080, 1 760) LDPC algorithm is improved to 1.38 Gbit/s, which is better than that of other GPU-based LDPC decoders.]]></description>
<pubDate>2024/1/28 0:00:00</pubDate>
<category><![CDATA[High-Performance Computing and Artificial Intelligence]]></category>
<author><![CDATA[LI Rongchun, ZHOU Xin, QIAO Peng, WANG Qinglin]]></author>
<atom:author xmlns:atom="http://www.w3.org/2005/Atom">
<atom:name>LI Rongchun, ZHOU Xin, QIAO Peng, WANG Qinglin</atom:name>
</atom:author>
<guid><![CDATA[http://journal.nudt.edu.cn/gfkjdxxben/article/abstract/202401015]]></guid><cfi:id>1</cfi:id><cfi:read>true</cfi:read></item>
</channel>
</rss>