Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection

Virtual Machine (VM) fault tolerance ensures high availability in cloud computing environments. Proactive fault tolerance strategies avert service disruptions by detecting potential failures before they occur and migrating the VMs to healthy hosts. In this paper, we propose Virtual Machine Proactive...

Bibliographic Details
Main Authors: Pratheek Senevirathne, Samindu Cooray, Jerome Dinal Herath, Dinuni Fernando
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
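Subjects: Adaptive learning; anomaly detection; cloud computing; fault tolerance; large language models; log analysis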
Online Access: https://ieeexplore.ieee.org/document/10767421/
author Pratheek Senevirathne
Samindu Cooray
Jerome Dinal Herath
Dinuni Fernando
collection DOAJ
description Virtual Machine (VM) fault tolerance ensures high availability in cloud computing environments. Proactive fault tolerance strategies avert service disruptions by detecting potential failures before they occur and migrating the VMs to healthy hosts. In this paper, we propose Virtual Machine Proactive Fault Tolerance using Log-based Anomaly Detection (VMFT-LAD), a semi-supervised, real-time log anomaly detection model capable of detecting failures ahead of time to provide effective VM fault tolerance. VMFT-LAD leverages the efficiency of the Matrix Profile for anomaly detection and the log inference capability of Large Language Models (LLMs) to identify potential VM failures early, while minimizing false positives. Our improved Matrix Profile enables VMFT-LAD to continuously learn and identify potential failures, including unforeseen fault types, with minimal human intervention. Additionally, its semi-supervised nature eliminates the need for labeled failure data. Extensive evaluations on several datasets, using two distinct criteria to validate anomaly detection and early failure detection capabilities, demonstrate VMFT-LAD’s outstanding performance. VMFT-LAD achieves a Numenta Anomaly Benchmark (NAB) standard score of 90.74 for predicting failures in advance, with a high early detection rate of 96.28% and a low false positive rate of 0.02%, enabling accurate and timely VM migration before failures occur.
format Article
id doaj-art-21c1554b805849ec9b08b75e9f246ddc
institution Kabale University
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-21c1554b805849ec9b08b75e9f246ddc
2024-12-11T00:04:58Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
12
178951
178970
10.1109/ACCESS.2024.3506833
10767421
Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection
Pratheek Senevirathne (0) https://orcid.org/0009-0004-6380-9277
Samindu Cooray (1) https://orcid.org/0009-0005-3401-2784
Jerome Dinal Herath (2)
Dinuni Fernando (3) https://orcid.org/0000-0001-5597-4185
School of Computing, University of Colombo, Colombo, Sri Lanka
School of Computing, University of Colombo, Colombo, Sri Lanka
School of Computing, University of Colombo, Colombo, Sri Lanka
School of Computing, University of Colombo, Colombo, Sri Lanka
Virtual Machine (VM) fault tolerance ensures high availability in cloud computing environments. Proactive fault tolerance strategies avert service disruptions by detecting potential failures before they occur and migrating the VMs to healthy hosts. In this paper, we propose Virtual Machine Proactive Fault Tolerance using Log-based Anomaly Detection (VMFT-LAD), a semi-supervised, real-time log anomaly detection model capable of detecting failures ahead of time to provide effective VM fault tolerance. VMFT-LAD leverages the efficiency of the Matrix Profile for anomaly detection and the log inference capability of Large Language Models (LLMs) to identify potential VM failures early, while minimizing false positives. Our improved Matrix Profile enables VMFT-LAD to continuously learn and identify potential failures, including unforeseen fault types, with minimal human intervention. Additionally, its semi-supervised nature eliminates the need for labeled failure data. Extensive evaluations on several datasets, using two distinct criteria to validate anomaly detection and early failure detection capabilities, demonstrate VMFT-LAD’s outstanding performance. VMFT-LAD achieves a Numenta Anomaly Benchmark (NAB) standard score of 90.74 for predicting failures in advance, with a high early detection rate of 96.28% and a low false positive rate of 0.02%, enabling accurate and timely VM migration before failures occur.
https://ieeexplore.ieee.org/document/10767421/
Adaptive learning
anomaly detection
cloud computing
fault tolerance
large language models
log analysis
title Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection
title_sort virtual machine proactive fault tolerance using log based anomaly detection
topic Adaptive learning
anomaly detection
cloud computing
fault tolerance
large language models
log analysis
url https://ieeexplore.ieee.org/document/10767421/