RELIABLE EVENT DISSEMINATION FOR TIME-SENSIBLE APPLICATIONS OVER WIDE-AREA NETWORKS
Esposito, Christiancarmine (2009) RELIABLE EVENT DISSEMINATION FOR TIME-SENSIBLE APPLICATIONS OVER WIDE-AREA NETWORKS. [Doctoral thesis] (Unpublished)
Introduction

Context

In recent decades we have witnessed a massive proliferation of the Internet, which has come to pervade our daily activities and has been adopted throughout the world. The emergence of the Internet as a general communication channel is considerably affecting the scale of current software systems and deeply transforming the architecture of future critical systems. In fact, a report produced by Carnegie Mellon University's Software Engineering Institute (SEI) in June 2006 envisioned how future software systems are going to be architected, introducing the so-called Ultra Large Scale (ULS) systems, which are defined as federations of heterogeneous and independent systems glued together by a middleware solution. Such systems are characterized by (i) billions of lines of code, (ii) many users, (iii) large amounts of data stored, accessed, manipulated, and refined, (iv) many connections and interdependencies, and (v) extremely high geographic distribution.

Traditionally, a critical system consists of a monolithic, "closed world" architecture, i.e., several computing nodes interconnected by a dedicated network with limited or no connectivity towards the outside world. An example of such a traditional architecture is Supervisory Control And Data Acquisition (SCADA), which is used in several current critical systems such as the control rooms of power plants or air traffic control systems. However, future critical systems will shift to an innovative federated, "open world" architecture, namely the Large scale Complex Critical Infrastructure (LCCI), which belongs to the class of ULS systems introduced above.
Specifically, an LCCI consists of a dynamic Internet-scale hierarchy or constellation of interacting heterogeneous, inconsistent, and changing systems, which cooperate to perform critical functionalities. Many of the ideas behind LCCIs are increasingly "in the air" in several current projects that aim to develop innovative critical systems. For example, EUROCONTROL has funded a project to devise a novel framework for Air Traffic Management (ATM) in Europe, called Single European Sky ATM Research (SESAR). The current European airspace is fragmented into several areas, each one managed by a single control system. This traditional ATM approach has been shown to be unsuitable for handling future air traffic, so it is going to be replaced by a more integrated approach. In fact, SESAR aims to develop a seamless infrastructure that allows control systems to cooperate with each other in order to have a wider view of the airspace, no longer limited to their assigned fragment.

As previously stated, traditional critical systems have been characterized by the use of dedicated machines and networks, so hardware and software faults were considered the only threats to the reliability and effectiveness of the system, while communication failures were assumed to be very unlikely. Therefore, in the last decades research has devoted considerable effort to dealing with the former two kinds of faults, paying less attention to communication failures. As evidence of this lack of attention, the main standardized and mature commercial middleware used in building critical systems either does not address them at all, as with the Java Message Service (JMS), or provides only very basic mechanisms, as with the recent OMG standard called Data Distribution Service (DDS).
However, LCCIs cannot use dedicated networks due to their geographical extension; instead they adopt wide-area networks, which exhibit an availability between 95 percent and a little over 99 percent and do not provide any guarantees on the offered Quality-of-Service (QoS). So, when a federated architecture is adopted to devise critical systems, communication failures have a high probability of occurring, even greater than that of hardware and software failures, and guaranteeing efficient data distribution becomes the pivotal factor in accomplishing the mission of LCCIs. The aim of this thesis is to bring a significant contribution in addressing this issue, with the goal of enabling the definition of novel strategies to support effective communication among several critical systems interconnected over wide-area networks.

Problem Statement

Almost all critical systems fall within the wider class of Monitor and Control (M&C) systems, i.e., the environment is continuously monitored and the system responds appropriately to avoid threats that may lead to the loss of human lives and/or money. For example, an Air Traffic Management (ATM) system keeps track of all the flights in a given portion of the airspace (i.e., the sensing part of the system) and may change the routes of those aircraft that risk colliding (i.e., the responding part of the system). Therefore, one of the main measures to assess the effectiveness of a critical system is timeliness, i.e., a threat has to be detected in time to perform proper actions to avoid it. For example, a collision has to be detected a certain time before its likely occurrence so that the aircraft have time to change their routes and prevent the collision.
So, critical operations treat the right answer delivered too late as the wrong answer, which means that the adopted middleware has to cope with timing failures and guarantee that deliveries occur within given deadlines, i.e., on-time information dissemination is required. For example, a radar scans a given area of the airspace a hundred times per second, and a control system usually combines the data received from several radars to view the position of all the aircraft in a given portion of the airspace. If a message produced by a radar reaches an Area Control Centre (ACC) with a delay greater than 0.6 seconds, it is not usable, since the current state of the flights no longer matches the content of the received message, and the control system that receives it has an out-of-date view of the position of the aircraft. This can have disastrous consequences: when late radar data are fused with timely data, the collision detection process can generate several false positives and false negatives.

As previously asserted, message deliveries over wide-area networks exhibit non-negligible bursty loss patterns, i.e., a message has a considerable probability P of being lost during delivery, and runs of consecutive dropped messages have an average length ABL greater than two. The critical nature of LCCIs demands that messages be delivered to all destinations despite the faulty behaviour of the network, so the adopted middleware has to provide some means to tolerate the message losses imposed by the network in order to achieve reliable message distribution. However, the reliability gain is always achieved at the expense of worsening the predictability of the delivery time, leading to timing failures. Since LCCIs require that messages be delivered in a timely manner to all interested consumers despite the occurrence of several failures, a trade-off between the achievable degrees of reliability and timeliness is needed.
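To make the two loss-pattern parameters concrete, the following minimal sketch estimates P and ABL from an observed delivery trace. It is an illustration only: the function name `loss_stats` and the sample trace are hypothetical, and ABL is read here as the mean length of maximal runs of consecutive losses.

```python
from itertools import groupby


def loss_stats(trace):
    """Return (P, ABL) for a delivery trace.

    trace: sequence of booleans, True = message lost, False = delivered.
    P   = fraction of messages that were lost.
    ABL = mean length of maximal runs of consecutive losses
          (0.0 when the trace contains no loss).
    """
    trace = list(trace)
    if not trace:
        raise ValueError("trace must not be empty")
    # Length of each maximal run of consecutive losses.
    bursts = [sum(1 for _ in run) for lost, run in groupby(trace) if lost]
    p = sum(trace) / len(trace)
    abl = sum(bursts) / len(bursts) if bursts else 0.0
    return p, abl


# Hypothetical trace (1 = lost, 0 = delivered), not measured data.
trace = [bool(x) for x in [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]]
p, abl = loss_stats(trace)
print(f"P = {p:.3f}, ABL = {abl:.1f}")
```

In this sample trace, five of twelve messages are lost in two bursts of lengths three and two, so ABL is above the threshold of two mentioned in the text.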
The ultra large scale of LCCIs worsens the already tough challenge of combining reliability and timeliness, since several solutions for tolerating message drops exhibit severe scalability limitations. In addition, since LCCIs are spread over several networking domains due to their geographic distribution, network conditions, i.e., propagation latency and loss pattern, are not uniform over the infrastructure; rather, the overall LCCI is composed of several portions, each characterized by a particular configuration of the network behaviour. Therefore, a "one solution fits all" approach does not work in the case of LCCIs; the adopted middleware has to autonomously choose the message delivery strategy appropriate to the experienced network conditions in order to support reliable and timely data distribution. Last, wide-area networks do not exhibit a stable behaviour: network conditions change continuously. Therefore, the adopted delivery strategy also has to provide self-configuring capabilities in order to adapt to any changes in the behaviour of the underlying network and to provide almost the same degree of reliability and timeliness, masking fluctuations in the network conditions.

Open Issues

The publish/subscribe interaction model is an asynchronous messaging paradigm where consumers, namely subscribers, receive only those messages produced by the so-called publishers in which they have expressed interest through a subscription predicate. Middleware services based on this model are suitable for ULS systems since they exhibit strong decoupling properties that favour scalability. As previously mentioned, LCCIs require reliable data distribution, so the adopted publish/subscribe service needs to provide means to cope with message losses. However, since its inception, the publish/subscribe community has been more focused on scalable architectures, efficient delivery, and expressive subscriptions than on reliable event dissemination.
However, this status quo is changing as more and more publish/subscribe services are used in application domains that impose stringent reliability requirements. For example, a middleware compliant with the recent specification standardized by OMG for publish/subscribe services, namely Data Distribution Service (DDS), has been used in the novel combat management system proposed by Thales, namely TACTICOS, to manage all the C4I functionalities on several kinds of warships. Most of the research efforts on reliable publish/subscribe services have focused on how to maintain connectivity among publishers and subscribers after the occurrence of failures, while less interest has been devoted to dealing with message losses. This is due to the consideration that a publish/subscribe service can be built on top of a given multicast protocol, so achieving loss-tolerance has not been felt to be challenging, since it can be resolved by using a reliable multicast protocol. However, the loss-tolerance issues raised by LCCIs are far from being completely addressed, even using one of the best protocols available in the reliable multicast literature. In fact, most current reliable multicast approaches adopt reactive techniques to tolerate message losses, i.e., a dropped message is somehow detected by one of the destinations and a retransmission is triggered so that the message can be recovered. Such techniques guarantee a high degree of loss-tolerance, but give no assurances on the timeliness of deliveries. In fact, the number of retransmissions needed to successfully deliver a message depends on the number of consecutive messages dropped by the network. However, since ABL is not known a priori, it is hard to forecast how many retransmissions are needed to deliver a message, so the time to deliver a message in case of drops is not predictable and timeliness is not achieved.
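The dependence of reactive recovery time on the burst length can be sketched with a deliberately simplified model. The assumptions are mine, not the thesis's: the sender retransmits at a fixed timeout until one copy gets through, each copy takes a fixed one-way latency, and the function names and the 50 ms and 200 ms figures are hypothetical. Only the 0.6 second deadline comes from the radar example in the text.

```python
def reactive_delivery_time_ms(drops, latency_ms, timeout_ms):
    """Delivery time when the first `drops` transmissions of a message are lost.

    Simplified model (an assumption): one retransmission every `timeout_ms`
    until a copy arrives, and each copy needs `latency_ms` to reach the receiver.
    """
    if drops < 0:
        raise ValueError("drops must be non-negative")
    return drops * timeout_ms + latency_ms


def max_tolerable_drops(deadline_ms, latency_ms, timeout_ms):
    """Largest number of consecutive drops that still meets the deadline.

    Returns -1 when even a loss-free delivery would miss the deadline.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    if latency_ms > deadline_ms:
        return -1
    return (deadline_ms - latency_ms) // timeout_ms


# Hypothetical network figures; the 600 ms deadline mirrors the radar example.
DEADLINE_MS, LATENCY_MS, TIMEOUT_MS = 600, 50, 200
for drops in range(4):
    t = reactive_delivery_time_ms(drops, LATENCY_MS, TIMEOUT_MS)
    status = "on time" if t <= DEADLINE_MS else "late"
    print(f"{drops} consecutive drops -> {t} ms ({status})")
```

Under these assumed figures the deadline survives at most two consecutive drops, so any burst longer than that turns a recoverable loss into a timing failure, which is the unpredictability the paragraph above describes.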
On the other hand, proactive techniques, i.e., those in which the necessary countermeasures against possible message drops are taken before they happen, are the only feasible means to guarantee both timeliness and reliability. In fact, such approaches minimize the time to recover a lost message, so communications over faulty networks do not suffer from performance fluctuations. However, current proactive approaches exhibit scalability and reliability limitations that prevent their use in the context of LCCIs. In addition, current reliable publish/subscribe services that treat message losses suffer from two main drawbacks: (i) the event dissemination strategy is not chosen with respect to the specific conditions experienced by the network, and (ii) the same strategy is adopted all over the service even if it is segmented into portions with different network conditions.

Thesis Contribution

The aim of this PhD thesis is to bring a significant contribution to the area of reliable multicast, with the goal of enabling the definition of novel strategies to provide both reliability and timeliness in large scale critical systems. The work in this dissertation results in the design of (i) innovative proactive techniques for Application-Layer Multicast (ALM) and (ii) a hierarchical cluster-based overlay network that optimizes the adopted reliability strategy according to the behaviour of the given routing domain. During the three years of the doctorate, the main proactive methods have been investigated:

- Forward Error Correction (FEC): additional data is coded from the message, so that the destination can recover lost packets by decoding the received packets. FEC techniques exhibit the drawback that the coding actions are concentrated on the message senders, leading to evident scalability problems.
In fact, the redundancy degree is tailored to the destination that experiences the worst loss pattern, leading to unneeded traffic towards destinations that experience better loss patterns. This dissertation overcomes this issue by proposing a decentralized FEC technique where only a subset of the interior nodes of a multicast tree performs coding actions. This distributes the coding duties over several parts of the dissemination infrastructure, and experimental data showed that scalability is improved with respect to traditional FEC without affecting the achievable reliability degree.

- Multi-tree Dissemination: studies have demonstrated that the topology of the Internet is characterized by an intrinsic degree of path redundancy, i.e., there are several distinct paths from a source to a given destination. Multi-tree dissemination takes advantage of this redundancy by sending several copies of a message through several paths towards the given destination. Reliability is achieved if the multicast forest, i.e., the set of multicast trees built by the system among the subscribed applications, satisfies the essential requirement of Path Diversity: the paths connecting a source to a destination must not contain overlapping network devices. In fact, if path diversity does not hold, the failure of a single overlapping device can lead to the loss of all the copies of a message, and a destination can experience message drops. This dissertation resolves this matter by proposing a novel algorithm to build diverse trees. However, multi-tree dissemination is affected by a serious problem: a source originates additional traffic that wastes network resources and can cause congestion, which worsens the loss patterns imposed by the network. To deal with this problem, multi-tree dissemination is teamed with Network Coding. Network coding has been used in multicast solutions to achieve optimal throughput.
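The way coding can reduce the redundant traffic of multi-tree dissemination can be illustrated with a toy XOR code, the simplest linear combination used in network-coding examples. This is not the scheme proposed in the thesis: the three disjoint trees, the function names, and the payloads are hypothetical, and the combination is computed at the source for brevity, whereas network coding in general lets interior nodes combine packets.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(a, b):
    """Packets sent over three hypothetical disjoint trees: a, b, and a XOR b."""
    return {"a": a, "b": b, "a^b": xor_bytes(a, b)}


def decode(received):
    """Recover (a, b) from any two of the three packets.

    `received` maps packet labels to payloads that actually arrived.
    Returns None when fewer than two packets got through.
    """
    a, b, c = received.get("a"), received.get("b"), received.get("a^b")
    if a is not None and b is not None:
        return a, b
    if a is not None and c is not None:
        return a, xor_bytes(a, c)  # b = a XOR (a XOR b)
    if b is not None and c is not None:
        return xor_bytes(b, c), b  # a = b XOR (a XOR b)
    return None


# Hypothetical payloads of equal length.
msg_a, msg_b = b"radar-01", b"radar-02"
packets = encode(msg_a, msg_b)
for lost in packets:
    arrived = {k: v for k, v in packets.items() if k != lost}
    print(f"lost {lost!r}: recovered = {decode(arrived) == (msg_a, msg_b)}")
```

Sending full copies of both messages over two trees costs four packet transmissions; the coded variant uses three while still tolerating the loss of any single packet.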
Experimental data show that network coding helps to mitigate the inefficient use of network resources that affects multi-tree dissemination.

The last contribution of this thesis is a hierarchical architecture for an event dissemination solution for LCCIs. This architecture is a hybrid peer-to-peer topology, characterized by two different layers: 1) the lower layer is composed of pure peer-to-peer clusters, and 2) the higher layer consists of a network of the coordinators of each cluster of the LCCI. The clusters can be defined at deployment time of each system of the LCCI (e.g., each cluster consists of an entire system), or at run time, using a given proximity measure to group the closest peers (e.g., each system may be made of several clusters). On the one hand, each system administration can choose a reliability strategy tailored to the routing domain of the managed system, without considering the choices made in other systems. To assist in this choice, this thesis presents a study of how the different reliability means available for ALMs perform under different network conditions. On the other hand, the network of coordinators uses a single reliability technique that is able to guarantee event delivery even under the worst loss pattern. Since communications among coordinators have to be as reliable and timely as possible, only the previously introduced proactive techniques are used.

Thesis Organization

- Chapter 1 provides the basic concepts of ULS systems and LCCIs and a description of the LCCI example that motivates the studies conducted in this thesis, concluding by emphasizing the problem of data dissemination in the context of LCCIs. It also gives an overview of publish/subscribe middleware, focusing on the issues of supporting reliable event dissemination.
In addition, it describes the related literature on reliable publish/subscribe services, highlighting their relation to LCCIs.
- Chapter 2 provides an overview of the loss-tolerance approaches available in the literature and analyzes their pros and cons.
- Chapter 3 describes the hybrid peer-to-peer topology used to architect a data dissemination service suitable for LCCIs, and how to choose the reliability strategy considering the network conditions experienced by the publishers/subscribers.
- Chapter 4 is finally devoted to the experimental campaigns performed to assess the effectiveness of the approaches proposed in this dissertation.