Links for 2014-11-04

  • Zookeeper: not so great as a highly-available service registry

    Turns out ZK isn’t a good choice for service discovery if you need that discovery system to keep working while partitioned from the rest of the ZK cluster:

    I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).
    So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping. In other words, if you want a highly-available service discovery system in the face of network partitions, you want an AP service discovery system rather than a CP one — and ZK is a CP system. Another risk, noted on the Netflix Eureka mailing list at https://groups.google.com/d/msg/eureka_netflix/LXKWoD14RFY/tA9UnerrBHUJ :
    ZooKeeper, while tolerant against single node failures, doesn’t react well to long partitioning events. For us, it’s vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that’s OK with us.
    I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I’m not entirely clear what happens to the ZK cluster itself in this scenario though. Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at http://engineering.pinterest.com/post/77933733851/zookeeper-resilience-at-pinterest ; sounds like this was mainly around load and the “thundering herd” overload problem. Their workaround was to decouple ZK availability from their services’ availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
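    The partition experiment quoted above is easy to reproduce. A sketch of the iptables commands involved (the peer addresses here are hypothetical, and this needs root on the ZK node being isolated):

    ```shell
    # On one ZK node, drop all packets from the other two ensemble members
    # (10.0.1.11 and 10.0.2.12 are placeholder peer addresses).
    iptables -A INPUT -s 10.0.1.11 -j DROP
    iptables -A INPUT -s 10.0.2.12 -j DROP

    # The isolated node now sees only 1 of 3 ensemble members -- no quorum --
    # and will refuse both reads and writes until connectivity returns.

    # Undo the simulated partition:
    iptables -D INPUT -s 10.0.1.11 -j DROP
    iptables -D INPUT -s 10.0.2.12 -j DROP
    ```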

    (tags: zookeeper service-discovery ops ha cap ap cp service-registry availability ec2 aws network partitions eureka smartstack pinterest)

  • Why We Didn’t Use Kafka for a Very Kafka-Shaped Problem

    A good story of when Kafka _didn’t_ fit the use case:

    We came up with a complicated process of app-level replication for our messages into two separate Kafka clusters. We would then do end-to-end checking of the two clusters, detecting dropped messages in each cluster based on messages that weren’t in both. It was ugly. It was clearly going to be fragile and error-prone. It was going to be a lot of app-level replication and horrible heuristics to see when we were losing messages and at least alert us, even if we couldn’t fix every failure case. Despite us building a Kafka prototype for our ETL — having an existing investment in it — it just wasn’t going to do what we wanted. And that meant we needed to leave it behind, rewriting the ETL prototype.
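    The end-to-end checking they describe boils down to comparing the sets of message IDs seen in each cluster. A minimal sketch of that kind of check (hypothetical IDs; not OnLive's actual code):

    ```python
    def find_dropped(cluster_a_ids, cluster_b_ids):
        """Given the message IDs consumed from each of two Kafka clusters,
        return the IDs that made it into one cluster but not the other."""
        a, b = set(cluster_a_ids), set(cluster_b_ids)
        return {
            "missing_from_a": b - a,
            "missing_from_b": a - b,
        }

    # Example: message "m3" was dropped somewhere on the way to cluster B.
    report = find_dropped(["m1", "m2", "m3"], ["m1", "m2"])
    ```

    Even this toy version hints at why it was fragile: in-flight messages look identical to dropped ones, so the real thing needs time windows and heuristics on top.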

    (tags: cassandra java kafka scala network-partitions availability multi-region multi-az aws replication onlive)

  • Madhumita Venkataramanan: My identity for sale (Wired UK)

    If the data aggregators know everything about you — including biometric data, healthcare history, where you live, where you work, what you do at the weekend, what medicines you take, etc. — and can track you as an individual, does it really matter that they don’t know your _name_? They legally track, and sell, everything else.

    As the data we generate about ourselves continues to grow exponentially, brokers and aggregators are moving on from real-time profiling — they’re cross-linking data sets to predict our future behaviour. Decisions about what we see and buy and sign up for aren’t made by us any more; they were made long before. The aggregate of what’s been collected about us previously — which is near impossible for us to see in its entirety — defines us to companies we’ve never met. What I am giving up without consent, then, is not just my anonymity, but also my right to self-determination and free choice. All I get to keep is my name.

    (tags: wired privacy data-aggregation identity-theft future grim biometrics opt-out healthcare data data-protection tracking)

  • Linux kernel’s Transparent Huge Pages feature causing 300ms-800ms pauses

    Bad news for low-latency apps. See also its impact on Redis: http://antirez.com/news/84
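    For reference, the usual mitigation (and the one the antirez post recommends) is to disable THP at runtime; a sketch, assuming the sysfs layout of kernels of that era:

    ```shell
    # Disable transparent huge pages system-wide (requires root).
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # Verify -- the active setting is shown in brackets, e.g. "[never]":
    cat /sys/kernel/mm/transparent_hugepage/enabled
    ```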

    (tags: redis memory defrag huge-pages linux kernel ops latency performance transparent-huge-pages)

  • Please grow your buffers exponentially

    Although in some cases a 1.5x growth factor is considered good practice. YMMV, I guess.
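    The argument for geometric growth is that it keeps the number of reallocations logarithmic in the final buffer size, so appends stay amortized O(1). A minimal sketch of the policy (the 1.5x factor is one common choice, not something from the linked post):

    ```python
    def grow_capacity(cap, needed, factor=1.5):
        """Grow a buffer capacity geometrically until it covers `needed`.

        Geometric growth means reaching a size of n items triggers only
        O(log n) reallocations, instead of one per fixed-size increment.
        """
        while cap < needed:
            cap = max(cap + 1, int(cap * factor))  # always make progress
        return cap
    ```

    For example, growing from a capacity of 8 to cover 100 elements takes seven resizes (12, 18, 27, 40, 60, 90, 135), versus 92 if you grew one slot at a time.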

    (tags: malloc memory coding buffers exponential jemalloc firefox heap allocation)

  • How I created two images with the same MD5 hash

    I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance bringing it in at about $0.65 plus tax.
    Bottom line: MD5 collisions are now feasible for pretty much anyone to generate.
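    Verifying a collision, once you have one, is trivial; a sketch using Python's hashlib (the helper names are mine, and a real colliding pair is two *different* byte strings with identical digests):

    ```python
    import hashlib

    def md5_hex(data: bytes) -> str:
        """Return the MD5 digest of `data` as a hex string."""
        return hashlib.md5(data).hexdigest()

    def is_collision(a: bytes, b: bytes) -> bool:
        """True only when two different inputs share the same MD5 digest."""
        return a != b and md5_hex(a) == md5_hex(b)
    ```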

    (tags: crypto images md5 security hashing collisions ec2 via:hn)


3 Comments

  1. Nix
    Posted November 5, 2014 at 02:06 | Permalink

    That THP comment is somewhat out of date. There are still problems, e.g. http://lwn.net/Articles/592011/, but the huge performance drops under swap-inducing memory pressure are definitely being chewed at and are much less bad than they were. 75f30861a12a6b09b759dfeeb9290b681af89057 is the most recent improvement, which landed in 3.16.

  2. Posted November 5, 2014 at 17:34 | Permalink

    thanks Nix — very good point. I guess we’re stuck with what’s on our LTS platforms :(

  3. Nix
    Posted November 6, 2014 at 01:14 | Permalink

    Oh. LTS. What fun…

    (I just hit a bug because I was using O_PATH and it’s not available on, get this, 2.6.32. I didn’t know I was working in a computer museum until today…)