A data- and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters
A data- and knowledge-driven stability assurance scheme was proposed to address frequent hardware failures, persistently high training-task failure rates, and the difficulty of localizing cross-domain problems in ultra-large intelligent computing clusters with over...
| Main Authors: | , , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Beijing Xintong Media Co., Ltd, 2025-07-01 |
| Series: | Dianxin kexue |
| Subjects: | |
| Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/ |