Finding latent performance bugs in systems implementations

C Killian, K Nagaraj, S Pervez, R Braud… - Proceedings of the …, 2010 - dl.acm.org
C Killian, K Nagaraj, S Pervez, R Braud, JW Anderson, R Jhala
Proceedings of the eighteenth ACM SIGSOFT international symposium on …, 2010 - dl.acm.org
Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation.
Our tool, MACEPC, targets the MACE toolkit, so it can test our systems implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. MACEPC found significant, previously unknown, long-standing performance bugs in each of the systems, and led to fixes that significantly improved the end-to-end performance of the systems.
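The three-step process the abstract describes (random simulation, anomaly identification, analysis of anomalous executions) can be illustrated with a toy sketch. This is not MACEPC's algorithm or the MACE API. The simulated protocol, the injected "slow_retry" bug, the median-absolute-deviation threshold, and all function names below are assumptions made purely for illustration.

```python
import random
import statistics


def simulate(seed, drop_rate=0.1, slow_retry_prob=0.05):
    """Toy random simulation (hypothetical): deliver 20 messages over a lossy link.

    Returns (steps, trace). A dropped message is retried; occasionally the
    retry is very slow, standing in for a latent performance bug that the
    recovery mechanism masks.
    """
    rng = random.Random(seed)
    steps, trace = 0, []
    for msg in range(20):
        while True:
            steps += 1
            if rng.random() < drop_rate:
                trace.append(("drop", msg))
                if rng.random() < slow_retry_prob:
                    steps += 50  # assumed cost of the buggy retry path
                    trace.append(("slow_retry", msg))
                continue  # retry the same message
            break  # message delivered
    return steps, trace


def find_anomalies(results, k=3.0):
    """Flag runs whose duration exceeds the median by more than k MADs.

    `results` is a list of (seed, steps, trace) tuples.
    """
    times = [steps for _, steps, _ in results]
    median = statistics.median(times)
    mad = statistics.median([abs(t - median) for t in times]) or 1
    return [r for r in results if (r[1] - median) / mad > k]


def suspicious_events(anomalous, normal):
    """Return event kinds seen in a larger fraction of anomalous runs than normal runs."""
    def rate(runs, kind):
        hits = sum(kind in {e for e, _ in trace} for _, _, trace in runs)
        return hits / max(len(runs), 1)

    kinds = {e for _, _, trace in anomalous for e, _ in trace}
    return sorted(k for k in kinds if rate(anomalous, k) > rate(normal, k))


if __name__ == "__main__":
    results = [(seed, *simulate(seed)) for seed in range(200)]
    anomalous = find_anomalies(results)
    normal = [r for r in results if r not in anomalous]
    print(len(anomalous), "anomalous runs; suspicious events:",
          suspicious_events(anomalous, normal))
```

The outlier test uses the median and median absolute deviation rather than the mean and standard deviation so that a few very slow runs do not inflate the threshold that is meant to detect them.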