Despite its overarching abstractions, it is semantically non-uniform and its complicated transaction and job scheduling heuristics ordered around a dependently networked object system create pathological failure cases with little debugging context that would otherwise not necessarily occur on systems with less layers of indirection. The use of bus APIs complicate communication with the service manager and lead to duplication of the object model for little gain. Further, the unit file options often carry implicit state or are not sufficiently expressive. There is an imbalance with regards to features of an eager service manager and that of a lazy loading service manager, having rusty edge cases of both with non-generic, manager-specific facilities. The approach to logging and the circularly dependent architecture seem to imply that lots of prior art has been ignored or understudied.
Great paper from Ben Maurer of Facebook in ACM Queue.
A “move-fast” mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook’s infrastructure provides safety valves.This is full of interesting techniques. * Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert. * Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills. * Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm — this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker). * Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed? * Learning from Failure: the DERP (!) methodology,
(tags: ben-maurer facebook reliability algorithms codel circuit-breakers derp failure ops cubism horizon-charts charts dependencies soa microservices uptime deployment configuration change-management)