A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters
| Main Authors: | , , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Beijing Xintong Media Co., Ltd, 2025-07-01 |
| Series: | Dianxin kexue |
| Subjects: | |
| Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/ |
| Summary: | A data- and knowledge-driven stability assurance scheme was proposed for ultra-large intelligent computing clusters with more than ten thousand computing cards, to address frequent hardware failures, persistently high training-task failure rates, and the difficulty of localizing cross-domain problems. Cluster performance data were collected using integrated collection technology for heterogeneous resources and distributed real-time big data ETL techniques. Fault diagnosis was performed with an enhanced SA-BiLSTM deep learning model, and the explainability of the model's outputs was improved through knowledge graph analysis and matching, which was used to generate fault diagnosis reports. When the deep learning model extracted time series features, the features extracted at different scales were fused by weighting, thereby improving the accuracy of the fault diagnosis model. In fault diagnosis simulation experiments conducted on an 18,000-card cluster, the loss value gradually converged and stabilized at 0.047, with an accuracy of 98.4%. Practice has shown that the proposed stability assurance scheme can effectively support large-scale model training and enhance the reliability of intelligent computing clusters, providing a solid foundation for building larger intelligent computing clusters and training large models in the future. |
|---|---|
| ISSN: | 1000-0801 |
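
The summary mentions weighted fusion of time series features extracted at different scales, but this record does not give the model's details. The following is only a minimal sketch of the general idea, not the authors' SA-BiLSTM implementation: it assumes that each "scale" is a trailing moving average over a different window and that the fusion weights come from a softmax over fixed scores (in a trained model these would be learned). The names `moving_average`, `softmax`, and `fuse_multiscale` are hypothetical.

```python
import math


def moving_average(series, window):
    """Trailing moving average; early positions average the samples seen so far."""
    out = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_multiscale(series, windows, scores):
    """Weighted sum of per-scale features, one fused value per time step.

    Assumption: one feature sequence per window size, fused with
    softmax(scores) as the weights.
    """
    if len(windows) != len(scores):
        raise ValueError("need exactly one score per window")
    weights = softmax(scores)
    features = [moving_average(series, w) for w in windows]
    return [
        sum(wt * feat[t] for wt, feat in zip(weights, features))
        for t in range(len(series))
    ]


if __name__ == "__main__":
    # Hypothetical metric samples; equal scores give equal weights.
    print(fuse_multiscale([1.0, 2.0, 3.0, 4.0], windows=[1, 2], scores=[0.0, 0.0]))
```

With equal scores the two scales contribute equally; raising one score shifts the fused signal toward that scale, which is the behavior a learned weighting would exploit.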