A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters

A data and knowledge-driven stability assurance scheme is proposed to address frequent hardware failures, persistently high training-task failure rates, and difficult cross-domain fault localization in ultra-large intelligent computing clusters with over...

Bibliographic Details
Main Authors: NIU Hongweihua, HUANG Yongbao, DING Guoqiang, HUANG Bao, ZHAO Zhiwen, XU Yang, WANG Tao, ZHANG Ruiling, WANG Xuan, ZHANG Yixiang
Format: Article
Language: Chinese (zho)
Published: Beijing Xintong Media Co., Ltd 2025-07-01
Series: Dianxin kexue
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/