A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters
A data- and knowledge-driven stability assurance scheme was proposed to address frequent hardware failures, persistently high training-task failure rates, and difficult cross-domain fault localization in ultra-large intelligent computing clusters with over...
| Main Authors: | NIU Hongweihua, HUANG Yongbao, DING Guoqiang, HUANG Bao, ZHAO Zhiwen, XU Yang, WANG Tao, ZHANG Ruiling, WANG Xuan, ZHANG Yixiang |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Beijing Xintong Media Co., Ltd, 2025-07-01 |
| Series: | Dianxin kexue |
| Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/ |
Similar Items
- Deep Learning for Real-Time PPE Usage Monitoring Using Wearable IMU Sensors
  by: Pedro Carvalho da Fonseca Guimaraes, et al.
  Published: (2025-01-01)
- Music audio emotion regression using the fusion of convolutional neural networks and bidirectional long short-term memory models
  by: Yi Qiu, et al.
  Published: (2025-07-01)
- A Method Based on CNN–BiLSTM–Attention for Wind Farm Line Fault Distance Prediction
  by: Ming Zhang, et al.
  Published: (2025-07-01)
- Monthly precipitation prediction based on quadratic decomposition and improved parrot algorithm
  by: Weijie Zhang, et al.
  Published: (2025-07-01)
- A text mining-based approach for comprehensive understanding of Chinese railway operational equipment failure reports
  by: Xiaorui Yang, et al.
  Published: (2025-07-01)