A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters
A data- and knowledge-driven stability assurance scheme was proposed to address frequent hardware failures, persistently high training-task failure rates, and the difficulty of cross-domain problem localization in ultra-large intelligent computing clusters with over...
| Main Authors: | NIU Hongweihua, HUANG Yongbao, DING Guoqiang, HUANG Bao, ZHAO Zhiwen, XU Yang, WANG Tao, ZHANG Ruiling, WANG Xuan, ZHANG Yixiang |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Beijing Xintong Media Co., Ltd, 2025-07-01 |
| Series: | Dianxin kexue |
| Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/ |
Similar Items
- An intelligent fault diagnosis model for bearings with adaptive hyperparameter tuning in multi-condition and limited sample scenarios
  by: Jianqiao Li, et al.
  Published: (2025-03-01)
- A Temporal Convolutional Network–Bidirectional Long Short-Term Memory (TCN-BiLSTM) Prediction Model for Temporal Faults in Industrial Equipment
  by: Jinyin Bai, et al.
  Published: (2025-02-01)
- A diffusion enhanced CRF and BiLSTM framework for accurate entity recognition
  by: Yunfei Qiu, et al.
  Published: (2025-06-01)
- Construction of Knowledge Graph for Marine Diesel Engine Faults Based on Deep Learning Methods
  by: Xiaohe Tian, et al.
  Published: (2025-03-01)
- Multi scale convolutional neural network combining BiLSTM and attention mechanism for bearing fault diagnosis under multiple working conditions
  by: Zhao Dengfeng, et al.
  Published: (2025-04-01)