Liguori, Pietro (2022) Fault Injection For Cloud Computing Systems: From Failure Mode Analysis To Runtime Failure Detection. [Doctoral thesis]


Item Type: Doctoral thesis
Resource language: English
Title: Fault Injection For Cloud Computing Systems: From Failure Mode Analysis To Runtime Failure Detection
Date: 14 March 2022
Number of Pages: 250
Institution: Università degli Studi di Napoli Federico II
Department: Ingegneria Elettrica e delle Tecnologie dell'Informazione
Doctoral programme: Information technology and electrical engineering
Doctoral cycle: 34
Doctoral programme coordinator:
Cotroneo, Domenico UNSPECIFIED
Natella, Roberto UNSPECIFIED
Keywords: Cloud Computing; Reliability; Fault Injection; Failure Mode Analysis; Runtime Failure Detection
MIUR scientific-disciplinary sectors: Area 09 - Industrial and information engineering > ING-INF/05 - Information processing systems
Date Deposited: 22 May 2022 21:33
Last Modified: 28 Feb 2024 11:01

Collection description

Nowadays, cloud computing systems are considered an attractive solution for running services with high-reliability requirements, such as in the telecom and healthcare domains, and have gained huge attention over the past decades because of continuously increasing demand. These systems consist of processes distributed across a data center, which cooperate by message passing and remote procedure calls. They are very complex, as they typically consist of software components of millions of lines of code, which run across dozens of computing nodes. It is very difficult to avoid software bugs when implementing the rich set of services of cloud computing systems. As a result, many high-severity failures have occurred in the cloud infrastructures of popular providers, causing outages of several hours and the unrecoverable loss of user data. The high-reliability requirements of such systems are therefore still far from being met.

Fault-injection techniques, i.e., the deliberate insertion of faults into an operational system to determine its response, offer an effective way to improve the reliability of these systems. They are also important for identifying the failure modes of the infrastructure, in order to improve the detection and recovery capabilities of the entire system. Although fault injection has reached such a level of maturity that it is routinely used in many real-world systems, its adoption in cloud computing infrastructures raises several issues that have to be addressed.

First, the user needs to define realistic faults to emulate in the experiments when targeting complex, distributed systems. Defining a fault model becomes more difficult when injecting software faults (i.e., design and/or programming defects), since they depend on a variety of technical and organizational factors, including the programming language, the software development process, the maturity of the system, the expertise of the developers, and the application domain.

Second, executing fault-injection experiments in cloud systems is not trivial. Given the complexity of such systems (millions of lines of code), a fault-injection campaign can easily reach thousands of experiments, because the number of realistic fault types to inject combines with the number of points at which they can be injected. To assess the effects of the injection, failure data should be collected during every experiment while guaranteeing independence among the executions (e.g., by cleaning up the system, restarting the services, reverting the database, etc.). In light of these considerations, the execution of fault-injection experiments should ideally be fully automated and supported by a complete fault-injection workflow.

Finally, the identification of failure symptoms, a key step towards improving the reliability of cloud systems, often relies on the knowledge, experience, and intuition of human analysts, since existing fault-injection solutions provide limited support for understanding what happened during an experiment. Unfortunately, manual analysis is too difficult and time-consuming, because of i) the high volume of messages generated by large distributed systems, which the human analyst needs to scrutinize; ii) the non-determinism of distributed systems, in which the timing and order of messages can change unpredictably even when there is no failure, which introduces noise into the analysis and increases the effort needed to pinpoint the failure (i.e., to discriminate the anomalies caused by a fault from genuine variations of the system); iii) the use of “off-the-shelf” software components, either proprietary or open-source (such as application frameworks, middleware, data stores, etc.), whose events and protocols can be difficult to understand and to analyze manually.

The first contribution of this thesis is a fault-injection tool-suite for cloud systems. The tool-suite is designed to be programmable and highly usable, and supports fault-injection campaigns with customized fault types. It has been used to empirically analyze the impact of high-severity failures in a large-scale, industry-applied case study, and in subsequent analyses that aim to better understand the nature of failures in these systems and to design a runtime monitoring strategy capable of improving failure detection.

As for the nature of failures, these systems are known to fail in complex and unexpected ways. For instance, recent outage reports showed that failures escape fault-tolerance mechanisms because of unexpected combinations of events and of interactions among hardware and software components that were not anticipated by the system designers. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, such as API error codes or error entries in the logs. This behavior hinders timely detection and recovery, lets failures propagate silently through the system, and makes tracing the root cause more difficult and recovery actions more costly (e.g., reverting a database state). Therefore, understanding how the system can fail (i.e., failure mode analysis) and promptly identifying failures at runtime (i.e., runtime failure detection) are crucial activities for improving the fault-tolerance mechanisms and defining proper recovery strategies for cloud systems.

As for failure mode analysis, the thesis proposes a novel algorithm for identifying failure symptoms and analyzing error propagation. The algorithm adopts a probabilistic model and proved very accurate in identifying anomalies, i.e., failure symptoms, in noisy execution traces of the system: it significantly reduces false alarms (i.e., genuine variations are not mistaken for failure symptoms) without discarding true anomalies (i.e., actual anomalies caused by a fault are not missed). In order to analyze failures from the set of anomalies and find recurring failure patterns, the thesis adopts two machine learning approaches: one based on unsupervised learning algorithms and the other based on deep learning. The former combines clustering with the proposed anomaly detection algorithm in order to automatically identify the failure classes among large sets of fault-injection experiments. It achieved high accuracy (90% purity) under different conditions, but at the cost of manually setting the weights of the features, which requires deep knowledge of the system internals and effort to tune them for the specific workload. The latter, instead, overcomes the challenges of noise and complexity in the feature space by leveraging deep learning for unsupervised machine learning. It saves the manual effort spent on feature engineering by using an autoencoder to automatically transform the raw failure data into a compact set of features. The results demonstrate that the proposed approach can identify clusters with accuracy similar, or in some cases even superior, to the fine-tuned clustering, at a low computational cost.

The empirical analysis showed that cloud systems often exhibit non-fail-stop behavior, in which they continue to execute despite inconsistencies in the state of the virtual resources, due to missing or incorrect error handlers. Based on these results, the thesis proposes a lightweight approach to runtime verification tailored to the monitoring and analysis of cloud computing systems. The approach defines a set of monitoring rules from correct executions of the system in order to specify the desired system behavior. The rules are then synthesized into a runtime monitor that verifies whether the system’s behavior follows the desired one. Any runtime violation of the monitoring rules triggers a timely notification to avoid undesired consequences, e.g., non-logged failures, non-fail-stop behavior, failure propagation across sub-systems, etc. The approach proves very effective, achieving a failure detection rate of over 90% and improving the fault-tolerance mechanisms of the system.
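
The abstract notes that a campaign grows with the combination of fault types and injection points, and that each experiment must run independently of the others. The following is only a minimal sketch of that idea, not the thesis tool-suite: the names `Experiment`, `plan_campaign`, `run_campaign`, `reset_system` and `inject_and_run` are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Experiment:
    """One fault-injection experiment: a fault type applied at one point."""
    fault_type: str
    injection_point: str


def plan_campaign(fault_types: Iterable[str],
                  injection_points: Iterable[str]) -> List[Experiment]:
    """Enumerate every (fault type, injection point) combination."""
    return [Experiment(f, p) for f, p in product(fault_types, injection_points)]


def run_campaign(experiments: Iterable[Experiment],
                 reset_system: Callable[[], None],
                 inject_and_run: Callable[[Experiment], str]) -> Dict[Experiment, str]:
    """Run each experiment from a clean state and collect its outcome.

    reset_system stands for the clean-up, service restart and database
    revert mentioned in the abstract; both callables are assumptions that
    the caller must supply.
    """
    results: Dict[Experiment, str] = {}
    for exp in experiments:
        reset_system()                      # guarantee independence between runs
        results[exp] = inject_and_run(exp)  # inject the fault, record failure data
    return results


if __name__ == "__main__":
    # Illustrative values only: 3 fault types x 4 points = 12 experiments.
    plan = plan_campaign(["wrong_return", "exception", "delay"],
                         ["p1", "p2", "p3", "p4"])
    print(len(plan))
```

Under these assumptions the campaign size is the product of the two lists, which is why a few dozen fault types and a few hundred injection points already yield thousands of experiments.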

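
The abstract reports clustering accuracy as purity. As a hedged illustration of that metric (the standard definition, not code from the thesis), purity assigns each cluster to its most frequent true class and counts the fraction of items that match; the function name `cluster_purity` and the sample labels below are made up for this sketch.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_purity(cluster_ids: Sequence[Hashable],
                   true_classes: Sequence[Hashable]) -> float:
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(true_classes):
        raise ValueError("cluster_ids and true_classes must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each true class appears inside each cluster.
    by_cluster: Dict[Hashable, Counter] = {}
    for cluster, label in zip(cluster_ids, true_classes):
        by_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    # Hypothetical outcome of ten experiments grouped into three clusters.
    clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
    classes = ["timeout", "timeout", "crash", "crash", "crash",
               "crash", "crash", "silent", "silent", "timeout"]
    print(cluster_purity(clusters, classes))  # 0.8
```

In this made-up example the clusters contribute 2, 4 and 2 majority items out of 10, so the purity is 0.8.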

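
The abstract describes monitoring rules defined from correct executions and checked by a monitor. The sketch below illustrates one simple way such rules could look, assuming rules of the form "every occurrence of event a is later followed by event b" mined from fault-free traces; this rule form, the event names and the functions `follows`, `mine_rules` and `check_trace` are assumptions for illustration and are not taken from the thesis. It checks a completed trace rather than a live stream.

```python
from itertools import permutations
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, str]  # (a, b): every a must be followed later by some b


def follows(trace: Sequence[str], a: str, b: str) -> bool:
    """True if every occurrence of a is followed later by at least one b."""
    last_b = max((i for i, e in enumerate(trace) if e == b), default=-1)
    return all(i < last_b for i, e in enumerate(trace) if e == a)


def mine_rules(correct_traces: Sequence[Sequence[str]]) -> Set[Rule]:
    """Keep (a, b) if the rule holds in every correct trace that contains a."""
    events = set().union(*correct_traces)
    rules: Set[Rule] = set()
    for a, b in permutations(sorted(events), 2):
        triggered = [t for t in correct_traces if a in t]
        if triggered and all(follows(t, a, b) for t in triggered):
            rules.add((a, b))
    return rules


def check_trace(trace: Sequence[str], rules: Set[Rule]) -> List[Rule]:
    """Return the rules violated by a trace, in sorted order."""
    return sorted((a, b) for a, b in rules
                  if a in trace and not follows(trace, a, b))


if __name__ == "__main__":
    # Hypothetical event names for a VM creation workflow.
    correct = [
        ["create_vm", "allocate_port", "vm_active"],
        ["create_vm", "allocate_port", "vm_active", "delete_vm"],
    ]
    rules = mine_rules(correct)
    # The VM never becomes active, yet no error is reported.
    print(check_trace(["create_vm", "allocate_port"], rules))
```

With these two hypothetical correct traces, the mined rules are `create_vm -> allocate_port`, `create_vm -> vm_active` and `allocate_port -> vm_active`; the truncated trace violates the two rules that expect `vm_active`, which is the kind of silent, non-fail-stop behavior the abstract targets.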