Non-Intrusive and Efficient Detection of Latent Reliability Bottlenecks within Cloud Storage Services
Large-scale storage systems, common in cloud computing, employ replication to ensure reliability. As modern cloud platforms grow in scale, however, replica servers may inadvertently depend on deep, common infrastructure components, e.g., switches and DNS servers. Such unexpected common dependencies are defined as Latent Reliability Bottlenecks (LRBs), and they can cause correlated failures that undermine the replication effort. While significant effort has gone into localizing faults after they occur, this paper proposes a novel system, SONDE, that offers non-intrusive and efficient LRB detection before failures occur, in three steps: 1) automatically collecting service components and their dependency information, 2) constructing a fault tree model from this information, and 3) efficiently analyzing the fault tree to identify LRBs and rank them by severity. SONDE's novelty lies in Steps 1 and 3. In Step 1, SONDE's automatic dependency collection mechanism is accurate and efficient, and requires neither human intervention nor the deployment of additional agents. In Step 3, SONDE introduces a high-performance fault tree analysis engine that leverages the Z3 SMT solver, making LRB analysis scalable to cloud-scale systems. We evaluate SONDE by detecting LRBs in a realistic storage service and on large-scale datasets. For example, SONDE can detect 100% of the critical LRBs in a 70,656-node system within 5 minutes.
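To make Steps 2 and 3 concrete, the following is a minimal sketch of fault tree analysis on a toy replicated service. It is an illustration only, not SONDE's engine: it enumerates minimal cut sets by brute force rather than with an SMT solver, and the dependency data, the function names, and the use of cut-set size as a severity proxy are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical dependency data (Step 1 output): each replica lists the
# components it depends on. Names are invented for illustration.
DEPENDENCIES = {
    "replica1": {"server1", "switch_a", "dns1"},
    "replica2": {"server2", "switch_a", "dns1"},
    "replica3": {"server3", "switch_b", "dns1"},
}


def service_fails(failed, dependencies):
    """Evaluate the fault tree (Step 2).

    Top event: the service fails only if every replica fails (AND gate).
    A replica fails if any one of its dependencies has failed (OR gate).
    """
    return all(deps & failed for deps in dependencies.values())


def minimal_cut_sets(dependencies, max_size=3):
    """Brute-force search for minimal cut sets (stand-in for Step 3).

    A cut set is a set of components whose joint failure takes down the
    service; it is minimal if no proper subset is also a cut set.
    Sizes are explored in increasing order, so results come out ranked
    by size (assumed here as a simple severity proxy).
    """
    components = sorted(set().union(*dependencies.values()))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of a smaller cut set: they are not minimal.
            if any(prev <= candidate for prev in found):
                continue
            if service_fails(candidate, dependencies):
                found.append(candidate)
    return found


cut_sets = minimal_cut_sets(DEPENDENCIES)
for cut_set in cut_sets:
    print(len(cut_set), sorted(cut_set))
```

In this toy topology the shared DNS server appears as a single-component cut set, which is the kind of common dependency the abstract calls an LRB. Brute-force enumeration grows combinatorially with the number of components, which is why a solver-based approach is needed at cloud scale.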