The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
“We use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.”
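The lineage idea is worth spelling out: instead of replicating data, each dataset remembers the chain of transformations that produced it, so a lost partition can simply be recomputed from its ancestors. A minimal Python sketch of that recovery scheme (the `RDD` class here is a toy illustration, not Spark's actual API):

```python
class RDD:
    """Toy dataset that tracks its lineage instead of replicating data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, only set on the root RDD
        self.parent = parent  # lineage pointer to the parent RDD
        self.fn = fn          # transformation used to derive this RDD

    def map(self, fn):
        # A transformation creates a new RDD node in the lineage graph;
        # no data is materialized or copied here.
        return RDD(parent=self, fn=fn)

    def compute(self):
        # Recovery by recomputation: walk back to the source data, then
        # replay each recorded transformation in order.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]
```

Losing the in-memory result of `compute()` costs nothing durable: calling it again replays the same lineage and reproduces the same data, which is the recovery path the paper relies on.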
Gor, a very nice-looking tool to log and replay HTTP traffic, specifically designed to “tee” live traffic from production to staging for pre-release testing.
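The basic “tee” setup looks roughly like this (a sketch based on Gor's documented flags; the staging hostname is a placeholder, and you should check `gor --help` for the version you install):

```shell
# On the production host: sniff live traffic on port 80 and forward
# a copy of each request to the staging environment.
gor --input-raw :80 --output-http "http://staging.example.com"
```

Because the replayed requests are copies, production responses are unaffected; staging just sees the same request stream.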
Well-written description of the pros and cons. I’m a rebaser, fwiw. (via Darrell)
To sum up, if you want perfect performance you need to:

- Ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes. In practice, the load is usually well distributed as long as there are a large number of connections (or flows).
- Have enough spare CPU capacity to actually pick up the packets from the kernel.
- To make things harder, keep both the RX queues and the receiver processes on a single NUMA node.
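The SO_REUSEPORT part of this is easy to demonstrate: each receiver process opens its own socket bound to the same port, and the kernel hashes incoming flows across the sockets. A minimal sketch (assuming Linux; the port number is arbitrary):

```python
import socket

def make_reuseport_socket(port):
    # Each worker process creates its own UDP socket with SO_REUSEPORT
    # set *before* bind(); the kernel then load-balances incoming
    # packets across all sockets bound to this port.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s
```

Without SO_REUSEPORT the second `bind()` to the same address would fail with `EADDRINUSE`; with it, any number of workers can share the port, which is what spreads the receive load across CPUs.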