Some good factoids about Loggly’s Kafka usage and scale
Some good details from Boyan Dimitrov at Hailo on the orchestration, deployment, and provisioning infrastructure they’ve built
A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation (via Patrick McFadin)
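For a feel of what frequency estimation over a multiset looks like in code, here is a minimal sketch of a Count-Min sketch. To be clear, this is a different, well-known structure that I'm swapping in for illustration, not necessarily the one linked above; the class name, the `width`/`depth` defaults and the SHA-256-based hashing are all my own assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts in fixed memory (illustrative only)."""

    def __init__(self, width=1024, depth=4):
        # width and depth are arbitrary defaults chosen for this example.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Bump one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.rows[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    sketch.add(word)
```

The estimate never undercounts: `sketch.estimate("a")` is at least 3 here, and only exceeds it if another item collides with "a" in every row.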
Another GC-coordination strategy, similar to Blade (qv), with some real-world examples using Cassandra
Good overview of the state of the art in NLP nowadays. I find word2vec particularly interesting:
Embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building diffs off of the basic framework. Trained on large corpora, the vector representations encode concepts in a large dimensional space (usually 200-300 dim).

Quite similar to some tokenization approaches we experimented with in SpamAssassin, so I don’t find this too surprising….
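A rough sketch of the two ideas in "skip-gram, negative-sampling", in case they're unfamiliar: pair each word with its neighbours inside a window, then score the real pair against a few random "negative" words. This is a toy, not the word2vec code itself; the function names, the window size and the plain-list vectors are my own assumptions.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs at most `window` positions apart."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for a single (center, context) pair."""
    # Reward a high score for the context word that was actually observed...
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # ...and a low score for each randomly drawn negative word.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss


pairs = list(skipgram_pairs("the quick brown fox".split(), window=1))
```

Training nudges the vectors to reduce this loss over every pair in the corpus, which is how words that share contexts end up close together in the vector space.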