testing-distributed-systems

List of resources on testing distributed systems​

A list of resources on testing distributed systems, curated by Andrey Satarin (@asatarin). If you are interested in my other work, check out the [talks] page. For questions or suggestions, you can reach out to me on Twitter (@asatarin) or LinkedIn.

Overview of testing approaches​

Research Papers​

Technologies for Testing Distributed Systems by Colin Scott​

Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.

Testing in a Distributed World by Ines Sombra (RICON 2014)​

A great overview of techniques for testing distributed systems from a practitioner. The video has aged well and is still an extremely good overview of the landscape. Additional materials can be found in this GitHub repo.

Resilience In Complex Adaptive Systems​

These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.

Jepsen​

A state-of-the-art approach to testing stateful distributed systems.

Elle, a transactional consistency checker for black-box databases:

Some notable Jepsen analyses:

Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB and others.

Formal Methods​

Companies using TLA+ to verify correctness of algorithms:

Lineage-driven Fault Injection​

Netflix adopted lineage-driven fault injection techniques for testing microservices.

Chaos Engineering​

Netflix pioneered the discipline of chaos engineering.

Fuzzing​

There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:
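As a rough illustration of this first flavor, the sketch below delivers the same set of messages to a replica in many random orders and checks that the final state does not depend on the order. Everything here is hypothetical and only for illustration: the two toy replicas, the `(version, value)` message shape, and the `fuzz_delivery_order` helper are assumptions, not taken from any of the listed tools.

```python
import random


def lww_replica(messages):
    """Hypothetical replica: last-writer-wins register keyed by version."""
    value, version = None, -1
    for msg_version, msg_value in messages:
        # Keep the write with the highest version, regardless of arrival order.
        if msg_version > version:
            value, version = msg_value, msg_version
    return value


def naive_replica(messages):
    """Hypothetical buggy replica: whichever message arrives last wins."""
    value = None
    for _, msg_value in messages:
        value = msg_value
    return value


def fuzz_delivery_order(replica, messages, trials=200, seed=0):
    """Deliver the messages in random orders; return a failing order or None."""
    rng = random.Random(seed)  # a fixed seed makes any failure reproducible
    expected = replica(sorted(messages))  # in-order delivery is the reference
    for _ in range(trials):
        shuffled = list(messages)
        rng.shuffle(shuffled)
        if replica(shuffled) != expected:
            return shuffled
    return None


MESSAGES = [(1, "a"), (2, "b"), (3, "c")]
```

Calling `fuzz_delivery_order(lww_replica, MESSAGES)` finds no failing order, while the same call with `naive_replica` returns a reordering that changes the final value. Real tools in this space control the scheduler or network of the system under test rather than shuffling a list, but the seed-driven exploration of orderings is the same idea.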

And input fuzzing, where message contents or user inputs are fuzzed:
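A minimal sketch of input fuzzing, assuming a made-up wire format (one length byte followed by a UTF-8 payload): a valid message is mutated at random and fed to the parser, and any exception other than the documented `ValueError` is recorded as a finding. The `parse_message`, `mutate`, and `fuzz_parser` names are hypothetical and not part of any tool listed here.

```python
import random


def parse_message(data: bytes) -> str:
    """Hypothetical parser: 1-byte length prefix followed by a UTF-8 payload."""
    if len(data) < 1:
        raise ValueError("empty message")
    length, payload = data[0], data[1:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    # UnicodeDecodeError is a subclass of ValueError, so bad payloads
    # are also reported as ValueError.
    return payload.decode("utf-8")


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one to three random byte-level edits: overwrite, delete, or insert."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, 3)):
        choice = rng.random()
        if buf and choice < 0.5:
            buf[rng.randrange(len(buf))] = rng.randrange(256)
        elif buf and choice < 0.75:
            del buf[rng.randrange(len(buf))]
        else:
            buf.insert(rng.randint(0, len(buf)), rng.randrange(256))
    return bytes(buf)


def fuzz_parser(parser, seed_input: bytes, trials=1000, seed=0):
    """Return a list of (input, exception) pairs for unexpected failures."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        candidate = mutate(seed_input, rng)
        try:
            parser(candidate)
        except ValueError:
            pass  # documented rejection of malformed input
        except Exception as exc:  # anything else counts as a bug
            crashes.append((candidate, exc))
    return crashes
```

For this toy parser, `fuzz_parser(parse_message, b"\x05hello")` is expected to return an empty list. Production fuzzers typically add coverage feedback and a corpus of interesting inputs, which this sketch leaves out.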

Microservices​

An amazing and comprehensive overview of different strategies for testing systems built with microservices, by Cindy Sridharan.

A series of blog posts specifically on testing in production (best practices, pitfalls, etc.):

Game Days​

Performance and Benchmarking​

See also benchmarking tools.

Test Case Reduction​

Misc​

Specific approaches in different distributed systems​

Amazon Web Services​

See also formal methods section.

Netflix​

Automated failure injection (see also Lineage-driven Fault Injection):

Random/manual failure injection testing:

See also Chaos Engineering.

Twitter​

Cassandra​

ScyllaDB​

They published a series of blog posts on testing ScyllaDB:

VoltDB​

A series of posts on testing at VoltDB:

Additional resources:

MemSQL​

CockroachLabs (CockroachDB)​

PingCap (TiDB)​

See also formal methods section.

MongoDB​

See also formal methods section.

Cloudera​

FoundationDB​

Wallaroo Labs​

There is also a talk by Sean T. Allen on testing a stream processing system at Wallaroo Labs (formerly Sendence).

Google​

Microsoft​

See also formal methods section.

Dropbox​

  • Mysteries of Dropbox: Property-Based Testing of a Distributed Synchronization Service. An example of how to use QuickCheck to test synchronization in Dropbox and similar tools (Google Drive). John Hughes gave a talk on this. See also QuickCheck.
  • Data Checking at Dropbox. If you have lots of data, you have to verify that it doesn't bit rot and protect it against rare bugs (e.g. race conditions) to guarantee long-term durability. This talk explains the intricacies of building data consistency checkers at Dropbox scale.
  • Dropbox's Exabyte Storage System (aka Magic Pocket), a talk by James Cowling. Describes a number of strategies for achieving extremely high durability. These include:
    • guards against faulty disks,
    • guards against software defects,
    • guards against black swan events,
    • operational safeguards to reduce blast radius,
    • safeguards against deletes with multi-stage soft-delete,
    • a comprehensive, in-depth testing strategy with increasing scale,
    • redundancy across various axes in the software and hardware stacks,
    • continuous data integrity validation on many levels,
    • etc.
  • Testing sync at Dropbox. A comprehensive overview of two test frameworks at Dropbox for the new sync engine implementation. CanopyCheck is a single-threaded, fully deterministic randomized testing framework with minimization for the synchronization planner component of the engine. The other framework, Trinity, focuses on concurrency and a larger surface area of components. Great discussion of the tradeoffs between determinism and strength of test oracles versus width of coverage and size of the system under test.
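
The general idea of deterministic randomized testing with minimization, mentioned in the last item, can be sketched in a few lines. This is not Dropbox's implementation: the toy `buggy_sync` system, its deliberately planted bug, the reference `model`, and the greedy `minimize` routine are all assumptions made up for illustration.

```python
import random


def buggy_sync(ops):
    """Hypothetical system under test: tracks which file names exist.

    Planted bug: a delete that immediately follows a create of the
    same name is silently ignored.
    """
    files, prev = set(), None
    for op, name in ops:
        if op == "create":
            files.add(name)
        elif op == "delete" and prev != ("create", name):
            files.discard(name)
        prev = (op, name)
    return files


def model(ops):
    """Trivially correct reference model used as the test oracle."""
    files = set()
    for op, name in ops:
        if op == "create":
            files.add(name)
        else:
            files.discard(name)
    return files


def fails(ops):
    return buggy_sync(ops) != model(ops)


def generate(seed, length=20):
    """Same seed, same operation sequence: the test is fully deterministic."""
    rng = random.Random(seed)
    return [(rng.choice(["create", "delete"]), rng.choice("ab"))
            for _ in range(length)]


def minimize(ops):
    """Greedily drop single operations while the failure still reproduces."""
    changed = True
    while changed:
        changed = False
        for i in range(len(ops)):
            candidate = ops[:i] + ops[i + 1:]
            if fails(candidate):
                ops, changed = candidate, True
                break
    return ops


def find_minimal_failure(max_seeds=1000):
    """Try seeds in order; return (seed, minimized ops) or None."""
    for seed in range(max_seeds):
        ops = generate(seed)
        if fails(ops):
            return seed, minimize(ops)
    return None
```

Because every sequence is derived from a seed, a failure can be replayed exactly, and the minimizer shrinks a 20-operation failure down to the two operations that matter (a create followed by a delete of the same name). Stronger oracles and smarter shrinking strategies are where real frameworks differ from this sketch.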

Atomix Copycat​

Onyx​

LinkedIn​

Druid.io​

Salesforce​

InfluxDB​

Shopify​

Confluent (Kafka)​

See also formal methods section.

Elastic (Elasticsearch)​

YugabyteDB​

FaunaDB​

Hazelcast​

Basho (Riak)​

CoreOS (etcd)​

Red Planet Labs​

Coil (TigerBeetle)​

Single node systems​

These examples are not about distributed systems, but they demonstrate concurrency testing and the level of sophistication required in distributed systems.

SQLite​

SQLite is not a distributed system by any stretch of the imagination, but it provides a good example of comprehensive testing of a database implementation.

Sled​

Clickhouse​

Tools​

Network Simulation​

QuickCheck​

Benchmarking​

Linkbench​

YCSB​