Development and Optimization Strategies of High-Performance Computing Technology

As aircraft develop toward high speed, wide speed range, high maneuverability and low observability, flight vehicles in hypersonic or supersonic flows face extreme problems such as flow instability and thermal protection, which directly affect overall vehicle performance. Wide-speed-range flow control technology concerns the precise control of fluid flow across a broad range of speeds, from subsonic and transonic to supersonic and even hypersonic. Within this research area, aerodynamic characteristics are an important direction of study and provide an essential basis for vehicle design, optimization and control. Research on flow control and aerodynamic characteristics over a wide speed range is crucial for improving the performance, stability and reliability of flight vehicles and has become one of the core topics of current aerospace technology. This special topic covers flow control technology and aerodynamic characteristics in the wide speed range, aiming to provide new ideas and technical support for the design and application of hypersonic vehicles and to advance aerospace propulsion technology.

Keywords: none

    1  Survey on converged networks of high-performance computing network and data center network
    LU Pingjing DONG Dezun LAI Mingche QI Xingyun XIONG Zeyu CAO Jijun XIAO Liquan
    2023, 45(4):1-10. DOI: 10.11887/j.cn.202304001
    Abstract:
    With the ongoing convergence of high-performance computing, big data processing, cloud computing and artificial intelligence computing, converging the high-performance computing network with the data center network has become an important trend. The current research status of converged networks was analyzed, and representative converged networks were described in detail to comprehensively show the latest technologies and trends. The challenges faced by converged networks were identified, and their development trends were outlined, including the coexistence of convergence and differentiation in converged network protocols, performance acceleration of converged networks based on in-network computing, and performance optimization of converged networks for emerging applications.
    2  Study on development mode of China's supercomputing technology
    SU Nuoya
    2021, 43(3):86-97. DOI: 10.11887/j.cn.202103012
    Abstract:
    Supercomputing is an important means of addressing major challenges in national security, economic construction, scientific progress, social development, and national defense, and it is a strategic high-technology domain in the scientific and technological development of many countries. Through investigation and empirical research, the roles of the government and of enterprises as market players in catching up with the world's supercomputing pioneers were analyzed. To meet the requirements of strategic development, the Chinese government built a solid accumulation of knowledge and talent through long-term funding under conditions of very limited financial capacity. To implement the strategy of scientific and technological innovation, it led cluster innovation across the country to reach the summit of world supercomputing and built national supercomputing infrastructure. To implement the strategy of comprehensive development, supercomputing applications were fully cultivated. At the same time, in accordance with the principle that enterprises are the main market players, enterprises achieved market breakthroughs by participating in competitive supercomputer development and by absorbing technology spillovers from research institutions. This development model of supercomputing technology can provide experience for other high-tech fields.
    3  Parallel optimization of convolution algorithm on multi-core DSP
    XU Jinwei WANG Qinglin LI Yalin JIANG Jingfei GAO Lei LI Rongchun LI Dongsheng
    2024, 46(1):103-112. DOI: 10.11887/j.cn.202401011
    Abstract:
    According to the characteristics of the heterogeneous multi-core DSP (digital signal processing) chips independently developed by the National University of Defense Technology and the characteristics of convolution algorithms, a high-performance multi-core parallel convolution implementation scheme for the multi-core DSP architecture was proposed. A feature-map-level multi-core parallel scheme was designed for 1×1 convolution. For convolutions with kernels larger than 1×1, a window-level multi-core parallel optimization design was proposed, together with an intra-core parallel optimization based on element-wise vectorization. The experimental results show that the proposed parallel optimization method reaches a maximum single-core computing efficiency of 64.95%. When the bandwidth is limited, the multi-core parallel scaling efficiency still reaches 48.36% to 88.52%. Compared with the E5-2640 CPU, execution of the typical network ResNet50 achieves a 5.39× speedup.
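    The paper's source is not included here; as a rough illustration of the feature-map-level multi-core decomposition it describes for 1×1 convolution, the Python sketch below splits the output feature maps across worker threads that stand in for DSP cores. The function names, the thread-based parallelism and the matrix formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def conv1x1_core(inp, weight, oc_slice):
    """Compute one core's share of output feature maps for a 1x1 convolution.

    inp:      (C_in, H, W) input feature maps
    weight:   (C_out, C_in) 1x1 kernels
    oc_slice: slice of output channels assigned to this core
    """
    c_in, h, w = inp.shape
    # A 1x1 convolution reduces to a matrix product over the channel dimension.
    return weight[oc_slice] @ inp.reshape(c_in, h * w)

def conv1x1_multicore(inp, weight, n_cores=8):
    """Feature-map-level decomposition: each 'core' owns a block of output channels."""
    c_out = weight.shape[0]
    bounds = np.linspace(0, c_out, n_cores + 1, dtype=int)
    slices = [slice(bounds[i], bounds[i + 1]) for i in range(n_cores)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        parts = list(pool.map(lambda s: conv1x1_core(inp, weight, s), slices))
    h, w = inp.shape[1:]
    return np.concatenate(parts, axis=0).reshape(c_out, h, w)

if __name__ == "__main__":
    x = np.random.rand(64, 56, 56).astype(np.float32)
    k = np.random.rand(128, 64).astype(np.float32)
    y = conv1x1_multicore(x, k)
    ref = (k @ x.reshape(64, -1)).reshape(128, 56, 56)
    assert np.allclose(y, ref, atol=1e-4)
```

    The appeal of this decomposition is that each core writes a disjoint block of output channels, so no synchronization is needed beyond collecting the per-core results.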
    4  Efficient RNN inference engine on very long vector processor
    SU Huayou CHEN Kangkang YANG Qianming
    2024, 46(1):121-130. DOI: 10.11887/j.cn.202401013
    Abstract:
    With increasing network depth and inconsistent lengths of the processed sequences, optimizing the performance of RNNs (recurrent neural networks) on different processors is difficult for researchers. An efficient RNN acceleration engine was implemented for the self-developed long-vector processor FT-M7032. The engine adopts a row-first matrix-vector multiplication algorithm and a data-aware multi-core parallel method to improve the computational efficiency of matrix-vector multiplication, and a two-level kernel fusion optimization method to reduce the overhead of transferring temporary data. Optimized hand-written assembly codes for multiple operators were integrated to further tap the performance potential of the long-vector processor. Experiments show that the RNN engine for the long-vector processor is efficient: compared with a multi-core ARM CPU and an Intel Golden CPU, the representative RNN model, long short-term memory networks, achieves speedups of up to 62.68 times and 3.12 times, respectively.
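    As a hedged sketch of the row-first matrix-vector multiplication mentioned above, the loop nest below walks each row contiguously and lets each hypothetical core own a disjoint block of output rows; the kernel fusion and assembly-level optimizations of the engine are not reproduced, and all names are illustrative assumptions.

```python
import numpy as np

def gemv_row_first(mat, vec, n_cores=4):
    """Row-first matrix-vector product: each (hypothetical) core walks its rows
    contiguously, so matrix accesses stay unit-stride and every core writes a
    disjoint block of the output vector."""
    rows, cols = mat.shape
    out = np.zeros(rows, dtype=mat.dtype)
    bounds = np.linspace(0, rows, n_cores + 1, dtype=int)
    for core in range(n_cores):              # stands in for the per-core loop
        for r in range(bounds[core], bounds[core + 1]):
            acc = 0.0
            for c in range(cols):             # unit-stride walk along the row
                acc += mat[r, c] * vec[c]
            out[r] = acc
    return out

if __name__ == "__main__":
    m, v = np.random.rand(8, 5), np.random.rand(5)
    assert np.allclose(gemv_row_first(m, v), m @ v)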
    5  Optimizing operator computation of MiniGo on high-performance heterogeneous accelerator
    QIAO Peng HE Zhouyu LI Rongchun JIANG Jingfei
    2024, 46(1):131-140. DOI: 10.11887/j.cn.202401014
    Abstract:
    An efficient parallel computing method based on the characteristics of the high-performance heterogeneous accelerator and the training mode of MiniGo was proposed. The on-chip computing resources were reasonably planned to achieve pipelined parallel execution between heterogeneous devices. Shared-memory programming was designed around the shared storage segments between heterogeneous devices to reduce data transmission costs. According to the characteristics of the multiple computing resources in a digital signal processing cluster, combined with the computation and memory-access features of the operators, different optimization strategies were designed. At the same time, this method provides an easy-to-use high-performance operator library for TensorFlow. The experimental results show that this method realizes multi-core parallel computing of operators; the speedup of convolution was 24.69 compared with that achieved on a single core. Compared with the cropped version of the 8-core FT2000+ CPU, the speedups of training and self-play execution with this method were 3.83 and 1.5, respectively.
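    A minimal sketch, assuming a producer/consumer view of the pipelining between host and accelerator described above: one thread stages batches into a bounded queue that stands in for the shared storage segment while another consumes them, so staging and computation overlap. The queue, the thread roles and the placeholder operator are assumptions, not the paper's code.

```python
import queue
import threading
import numpy as np

def host_producer(batches, shared_q):
    """Host side: stage each batch into (simulated) shared memory while the
    accelerator is still busy with the previous one."""
    for batch in batches:
        shared_q.put(np.ascontiguousarray(batch))  # contiguous layout for the device side
    shared_q.put(None)                             # end-of-stream marker

def device_consumer(shared_q, results):
    """Accelerator side: pull the next staged batch and run the operator on it."""
    while True:
        batch = shared_q.get()
        if batch is None:
            break
        results.append(batch.sum())                # placeholder for the real operator

if __name__ == "__main__":
    data = [np.random.rand(256, 256) for _ in range(8)]
    q, out = queue.Queue(maxsize=2), []            # maxsize=2 gives double buffering
    t1 = threading.Thread(target=host_producer, args=(data, q))
    t2 = threading.Thread(target=device_consumer, args=(q, out))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```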
    6  Optimizing parallel matrix transpose algorithm on multi-core digital signal processors
    PEI Xiangdong WANG Qinglin LIAO Linyu LI Rongchun MEI Songzhu LIU Jie PANG Zhengbin
    2023, 45(1):57-66. DOI: 10.11887/j.cn.202301006
    Abstract:
    Matrix transpose is one of the most common matrix operations and is widely employed in fields such as signal processing, scientific computing, and deep learning. With the popularization of the Phytium heterogeneous multi-core DSPs (digital signal processors) developed by the National University of Defense Technology, there is a strong demand for high-performance matrix transpose implementations on these DSPs. Based on the architecture of the multi-core DSPs and the characteristics of matrix transpose operations, a parallel matrix transpose algorithm (called ftmMT) for matrices with different element sizes (8 B, 4 B, and 2 B) was proposed. The main optimizations in ftmMT include vectorization based on vector Load/Store functions, core-level parallelization based on matrix blocking, and overlapping of vectorized computation and memory access through an implicit ping-pong scheme. The experimental results show that ftmMT significantly improves the performance of matrix transpose operations and achieves a speedup of up to 8.99 times over the open-source transpose library HPTT running on a CPU.
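    The blocking idea behind ftmMT can be illustrated with a simple tiled transpose; the NumPy sketch below (with a hypothetical tile size) only shows the blocking itself, omitting the vector Load/Store, core-level distribution and implicit ping-pong stages.

```python
import numpy as np

def blocked_transpose(a, tile=64):
    """Blocking-friendly transpose: move the matrix tile by tile so each tile
    fits in fast local memory; the DMA/vector machinery of ftmMT is abstracted away."""
    rows, cols = a.shape
    out = np.empty((cols, rows), dtype=a.dtype)
    for i in range(0, rows, tile):
        for j in range(0, cols, tile):
            block = a[i:i + tile, j:j + tile]
            out[j:j + tile, i:i + tile] = block.T  # transpose one tile at a time
    return out

if __name__ == "__main__":
    m = np.random.rand(300, 500)
    assert np.array_equal(blocked_transpose(m), m.T)
```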
    7  Design and implementation of high speed parallel Gardner algorithm
    HU Wanru WANG Zhugang MEI Ruru CHEN Xuan ZHANG Ying
    2023, 45(2):95-104. DOI: 10.11887/j.cn.202302011
    Abstract:
    With the gradual increase in space exploration missions and the growing scarcity of space channel spectrum resources, the traditional Gardner timing synchronization algorithm can no longer meet the demands of high-throughput, high-reliability, high-speed data transmission systems. To improve the throughput and enlarge the correctable error range of the Gardner timing synchronization algorithm, a high-speed parallel Gardner algorithm was proposed. To ensure interpolation accuracy and reduce multiplier consumption, a parallel piecewise parabolic interpolation filter was designed. To facilitate the parallel pipeline design and the selection of the optimal sampling point, a counting module and a timing cache adjustment module were built. To improve the equivalent throughput rate, the pipelined parallel loop filter structure and the pipelined parallel numerically controlled oscillator structure were reconstructed. Results show that the equivalent throughput rate of the algorithm reaches 1 739.13 Msps, the digital signal processor resource consumption is reduced by 44%, and a timing error of 2×10⁻³ can be corrected.
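    For readers unfamiliar with the underlying detector, the classic serial Gardner timing error computation at two samples per symbol is sketched below; the paper's parallel piecewise parabolic interpolator, loop filter and NCO structures are not reproduced, and the array layout is an assumption.

```python
import numpy as np

def gardner_ted(samples):
    """Gardner timing error detector at two samples per symbol:
    e[k] = mid[k] * (strobe[k+1] - strobe[k]),
    where strobe[k] are on-symbol samples and mid[k] the half-symbol samples
    between them. This is the serial form that the paper parallelizes."""
    strobes = samples[0::2]            # on-time (symbol) samples
    mids = samples[1::2]               # samples halfway between symbols
    n = min(len(mids), len(strobes) - 1)
    return mids[:n] * (strobes[1:n + 1] - strobes[:n])

if __name__ == "__main__":
    sig = np.cos(np.pi * np.arange(20) / 2)   # placeholder waveform at 2 samples/symbol
    errors = gardner_ted(sig)
```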
    8  Prediction method of port blocking failure in high performance interconnection networks
    XU Jiaqing HU Xiaotao YANG Hanzhi WANG Qiang ZHANG Lei TANG Fuqiao
    2022, 44(5):1-12. DOI: 10.11887/j.cn.202205001
    Abstract:
    With the increase of system scale, chip power consumption and link rate, the overall failure rate of high-performance interconnection networks will continue to rise, and traditional operation and maintenance methods will become difficult to sustain, which poses great challenges to the overall reliability and availability of HPC (high performance computing) systems. An unsupervised prediction model for serious network failures such as port blocking was proposed. In this model, symptomatic rules were extracted from the history of the switch port status registers and formed into feature vectors, and the K-means clustering algorithm was used to learn and classify these vectors. During prediction, the DES (double exponential smoothing) algorithm was used to forecast the future port state from the port's current and historical states; the resulting feature vector was then classified by the K-means model to predict whether a port blocking failure would occur. Topology information was used to build independent sub-models for ToR switch ports and Spine switch ports respectively, further improving prediction accuracy. The experimental results show that the prediction model maintains a recall rate of 88.2% and reaches an accuracy rate of 65.2%, providing effective early warning and guidance for operation and maintenance personnel in a real system.
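    A minimal sketch of the DES-plus-K-means pipeline described above, assuming synthetic feature vectors and a hypothetical two-cluster split into normal/abnormal port states; the real model extracts its features from switch port status registers and builds separate ToR and Spine sub-models, which this illustration omits.

```python
import numpy as np
from sklearn.cluster import KMeans

def double_exp_smoothing(series, alpha=0.5, beta=0.3, steps=1):
    """DES forecast: track level and trend, then extrapolate 'steps' ahead."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + steps * trend

# Offline: cluster historical port-status feature vectors (placeholder data here).
history = np.random.rand(200, 4)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

# Online: forecast each feature of one port with DES, then classify the predicted vector.
port_series = np.random.rand(20, 4)               # recent samples of one port, per feature
predicted = np.array([double_exp_smoothing(port_series[:, j]) for j in range(4)])
label = model.predict(predicted.reshape(1, -1))[0]
# Which cluster label corresponds to "about to block" must be identified from labeled history.
```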
    9  Accelerating parallel reduction and scan primitives on ReRAM-based architectures
    JIN Zhou DUAN Yiru YI Enxin JI Haonan LIU Weifeng
    2022, 44(5):80-91. DOI: 10.11887/j.cn.202205009
    Abstract:
    Reduction and scan are two critical primitives in parallel computing, so accelerating them is of great importance. However, the Von Neumann architecture suffers from the performance and energy bottleneck known as the "memory wall" due to unavoidable data migration. Recently, NVM (non-volatile memory) such as ReRAM (resistive random access memory) has enabled in-situ computing without data movement, and its crossbar architecture can naturally perform a parallel GEMV (matrix-vector multiplication) operation in one step. ReRAM-based architectures have demonstrated great success in many areas, e.g. accelerating machine learning and graph computing applications. Parallel acceleration methods were proposed for the reduction and scan primitives on a ReRAM-based PIM (processing in memory) architecture, focusing on formulating the computation as GEMV and on the mapping onto the ReRAM crossbar, and a software-hardware co-design was realized to reduce power consumption and improve performance. Compared with a GPU, the proposed reduction and scan algorithms achieved speedups of two orders of magnitude on average; the segmented cases achieved speedups of up to five orders of magnitude (four on average). Meanwhile, the power consumption decreased by 79%.
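    The mapping of reduction and scan onto GEMV, as described above, can be written out directly: a sum reduction is a product with an all-ones row, and an inclusive scan is a product with a lower-triangular all-ones matrix. The NumPy sketch below only mirrors the mathematical formulation; crossbar tiling, segmentation and the analog behavior of ReRAM are not modeled.

```python
import numpy as np

def reduction_as_gemv(x):
    """Sum reduction as a matrix-vector product: a crossbar column programmed
    with all-ones conductances accumulates the whole input in one step."""
    ones_row = np.ones((1, x.size))
    return (ones_row @ x)[0]

def inclusive_scan_as_gemv(x):
    """Inclusive prefix sum as GEMV with a lower-triangular all-ones matrix:
    row i of L sums x[0..i], so one crossbar pass yields the whole scan."""
    L = np.tril(np.ones((x.size, x.size)))
    return L @ x

if __name__ == "__main__":
    v = np.arange(1.0, 9.0)
    assert np.isclose(reduction_as_gemv(v), v.sum())
    assert np.allclose(inclusive_scan_as_gemv(v), np.cumsum(v))
```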
    10  Node priority optimization in distributed heterogeneous clusters
    HU Yahong QIU Yuanyuan MAO Jiafa
    2022, 44(5):102-113. DOI: 10.11887/j.cn.202205011
    Abstract:
    Node priority is often used to evaluate the performance of heterogeneous cluster nodes, so providing suitable weights for the priority evaluation indexes is of great importance. The AHP (analytic hierarchy process) was used to establish the evaluation index system of node priority and to calculate the initial weight of each index. A BP (back propagation) neural network was then used to optimize the weights obtained with AHP: its input was the node performance index values collected during cluster execution, and its output was the corresponding node priority. After training, the weight matrix of the network was used to calculate the optimized weights. The experimental results show that the cluster node priority evaluation model based on AHP and BP evaluates node performance more accurately. Compared with the default resource allocation algorithm of Spark and with a comparison algorithm using unoptimized weights, cluster performance is improved effectively by using the optimized node priorities: when running the same kind of load with different amounts of data, the average cluster performance increases by 16.64% and 9.76%, respectively; when running different loads with the same amount of data, it increases by 12.49% and 6.54%, respectively.
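    As a hedged illustration of the AHP step described above, the snippet below derives index weights from a pairwise-comparison matrix via its principal eigenvector and reports a consistency index; the comparison matrix is hypothetical, and the subsequent BP-network refinement of these weights is not shown.

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive index weights from an AHP pairwise-comparison matrix via its
    principal eigenvector; also return the consistency index CI = (lambda_max - n) / (n - 1)."""
    n = pairwise.shape[0]
    eigvals, eigvecs = np.linalg.eig(pairwise)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()
    ci = (eigvals[k].real - n) / (n - 1)
    return w, ci

if __name__ == "__main__":
    # Hypothetical 3-index comparison (e.g. relative importance of CPU, memory, disk);
    # not the paper's actual index system or data.
    A = np.array([[1.0, 3.0, 5.0],
                  [1/3, 1.0, 2.0],
                  [1/5, 1/2, 1.0]])
    weights, ci = ahp_weights(A)
    print(weights.round(3), round(ci, 4))
```

    In the paper's model these AHP weights serve only as initial values; the BP network then adjusts them using performance data collected from the running cluster.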