Inner-out subdomain dividing heterogeneous parallel algorithm for high order CFD solver

doi:10.11887/j.cn.202002004

Authors: WANG Wei, XU Chuanfu, CHE Yonggang
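Affiliation: Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Author biography: WANG Wei (born 1988), male, from Changsha, Hunan; engineer; master's degree. E-mail: wangw111@icloud.com
CLC number: TN95
Funding: National Key Research and Development Program of China (2017YFB0202403); National Natural Science Foundation of China (61561146395, 61772542)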

Abstract:

An Offload-mode heterogeneous parallel algorithm based on inner-out subdomain division of each task is proposed for the computational fluid dynamics (CFD) program CNS. Exploiting the characteristics of finite difference computation on structured meshes and of the fourth-order Runge-Kutta method, a region of ghost grid points is introduced and a ghost-region-shrinking computing scheme is designed. The scheme significantly reduces the data transfer overhead between heterogeneous computing resources. Under load balance, the computation and MPI communication on the CPU side fully overlap with the computation on the accelerator side, which improves heterogeneous cooperative parallelism. The ghost-region parameters that guarantee computational correctness are derived, and the conditions for load balance are analyzed. On a server with CPUs (Intel Haswell Xeon E5-2670, 12 cores × 2) and accelerators (Xeon Phi 7120A × 2), the algorithm runs on average 5.9× as fast as a heterogeneous algorithm that moves each task block to the accelerator as a whole. Relative to the MPI/OpenMP two-level parallel algorithm running on 24 CPU cores alone, it achieves a speedup of 1.27× with one accelerator and 1.45× with two. The performance bottlenecks and remaining problems are discussed and analyzed.
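The ghost-region-shrinking idea can be illustrated with a minimal 1D sketch in Python. This is not the CNS code and does not reproduce the paper's derivation: the advection equation, the second-order central stencil (half-width `R = 1`), the ghost width `R * STAGES`, and every function name below are assumptions made for illustration only. The sketch copies an inner subdomain together with its ghost points, runs one full Runge-Kutta step with no data exchange between stages (the valid region shrinks by `R` points per side at each stage), and checks the result against a step computed on the whole periodic domain.

```python
import math

R = 1               # stencil half-width of the central difference (assumed)
STAGES = 4          # classical fourth-order Runge-Kutta has four stages
GHOST = R * STAGES  # assumed ghost width: one stencil half-width per stage


def trim(v, n):
    """Drop n points from each end of v."""
    return v[n:len(v) - n]


def axpy(a, x, y):
    """Return y + a * x elementwise."""
    return [yi + a * xi for xi, yi in zip(x, y)]


def combine(u, k1, k2, k3, k4, dt):
    """Final RK4 update from the four stage slopes."""
    return [ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]


def rhs_local(v, c, dx):
    """RHS of u_t = -c u_x on a local array; the output loses R points per side."""
    return [-c * (v[i + 1] - v[i - 1]) / (2.0 * dx) for i in range(R, len(v) - R)]


def rhs_periodic(u, c, dx):
    """Same RHS on the whole periodic domain (reference)."""
    n = len(u)
    return [-c * (u[(i + 1) % n] - u[i - 1]) / (2.0 * dx) for i in range(n)]


def rk4_periodic(u, c, dx, dt):
    """One RK4 step on the full periodic domain."""
    k1 = rhs_periodic(u, c, dx)
    k2 = rhs_periodic(axpy(dt / 2.0, k1, u), c, dx)
    k3 = rhs_periodic(axpy(dt / 2.0, k2, u), c, dx)
    k4 = rhs_periodic(axpy(dt, k3, u), c, dx)
    return combine(u, k1, k2, k3, k4, dt)


def rk4_shrinking(local, c, dx, dt):
    """One RK4 step on inner + ghost points; no exchange between stages."""
    k1 = rhs_local(local, c, dx)                                  # shrunk by R
    k2 = rhs_local(axpy(dt / 2.0, k1, trim(local, R)), c, dx)     # shrunk by 2R
    k3 = rhs_local(axpy(dt / 2.0, k2, trim(local, 2 * R)), c, dx) # shrunk by 3R
    k4 = rhs_local(axpy(dt, k3, trim(local, 3 * R)), c, dx)       # shrunk by 4R
    return combine(trim(local, 4 * R), trim(k1, 3 * R),
                   trim(k2, 2 * R), trim(k3, R), k4, dt)


# Demo: inner subdomain [A, B) of a periodic grid with N points.
N, A, B = 64, 20, 40
c, dx = 1.0, 1.0 / N
dt = 0.5 * dx / c
u = [math.sin(2.0 * math.pi * i / N) for i in range(N)]

ref = rk4_periodic(u, c, dx, dt)                 # whole-domain reference
local = u[A - GHOST:B + GHOST]                   # inner points plus ghost points
inner_new = rk4_shrinking(local, c, dx, dt)      # valid exactly on [A, B)

max_err = max(abs(x - y) for x, y in zip(inner_new, ref[A:B]))
print(len(inner_new), max_err)
```

In this toy setting a ghost width smaller than `R * STAGES` would leave the result shorter than the inner subdomain, which is why the ghost width has to match the stencil and the number of stages. The parameters derived in the paper for CNS may differ.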

Cite this article

WANG Wei, XU Chuanfu, CHE Yonggang. Inner-out subdomain dividing heterogeneous parallel algorithm for high order CFD solver[J]. Journal of National University of Defense Technology, 2020, 42(2): 31-40.

History

Received: 2019-10-10
Published online: 2020-04-29
Publication date: 2020-04-28
