Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm. The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. […] The amount of information to serve in Voldemort for one year of BBVA’s credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 “m1.large” instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
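To make the bucketing idea above concrete, here is a minimal sketch of random-restart hill climbing for histogram boundaries, with an exact dynamic-programming version for comparison. The excerpt does not say which error measure or neighbourhood move the authors used, so this assumes a sum-of-squared-errors objective (a "V-optimal" histogram) and a move that shifts one boundary by one position. The names `make_cost`, `hill_climb` and `exact_dp` are hypothetical and do not come from the linked case study.

```python
import random
from itertools import accumulate


def make_cost(values):
    """Return sse(i, j): squared error of values[i:j] around its mean, in O(1)."""
    s1 = [0.0] + list(accumulate(values))
    s2 = [0.0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    return sse


def total_cost(cuts, n, sse):
    """Sum the per-bucket error for buckets delimited by the sorted cut positions."""
    bounds = [0, *cuts, n]
    return sum(sse(a, b) for a, b in zip(bounds, bounds[1:]))


def hill_climb(values, k, restarts=20, seed=0):
    """Approximate k-bucket split (k >= 2): best local optimum over random starts."""
    rng = random.Random(seed)
    n = len(values)
    sse = make_cost(values)
    best_cuts, best_cost = None, float("inf")
    for _ in range(restarts):
        # Random start: k-1 distinct cut positions, so every bucket is non-empty.
        cuts = sorted(rng.sample(range(1, n), k - 1))
        cost = total_cost(cuts, n, sse)
        improved = True
        while improved:
            improved = False
            for idx in range(k - 1):
                for step in (-1, 1):
                    lo = cuts[idx - 1] if idx > 0 else 0
                    hi = cuts[idx + 1] if idx < k - 2 else n
                    pos = cuts[idx] + step
                    if not lo < pos < hi:
                        continue  # the move would empty a neighbouring bucket
                    trial = cuts[:idx] + [pos] + cuts[idx + 1:]
                    trial_cost = total_cost(trial, n, sse)
                    if trial_cost < cost - 1e-12:  # accept strict improvements only
                        cuts, cost, improved = trial, trial_cost, True
        if cost < best_cost:
            best_cuts, best_cost = cuts, cost
    return best_cuts, best_cost


def exact_dp(values, k):
    """Exact minimum cost via dynamic programming, O(k * n^2) time."""
    n = len(values)
    sse = make_cost(values)
    inf = float("inf")
    # dp[b][j]: minimum cost of splitting values[:j] into b non-empty buckets
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(b - 1, j))
    return dp[k][n]
```

Calling `hill_climb(values, k)` returns the cut positions and their cost, and `exact_dp(values, k)` gives the optimal cost to compare against. The hill climber can never beat the exact answer. How close it gets depends on the data and on the number of restarts, which is the trade-off the excerpt describes.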
‘Splout is a scalable, open-source, easy-to-manage SQL big data view. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Splout serves a read-only, partitioned SQL view which is generated and indexed by Hadoop.’ Some FAQs: ‘What’s the difference between Splout SQL and Dremel-like solutions such as BigQuery, Impala or Apache Drill? Splout SQL is not a “fast analytics” Dremel-like engine. It is more thought to be used for serving datasets under web / mobile high-throughput, many lookups, low-latency applications. Splout SQL is more like a NoSQL database in the sense that it has been thought for answering queries under sub-second latencies. It has been thought for performing queries that impact a very small subset of the data, not queries that analyze the whole dataset at once.’
impressively high-quality newbie’s guide from the Goonswarm Federation — as themittani.com describes it, ‘frankly a work of art: a 1950’s Pulp Scifi magazine full of internet spaceships and sociopathy.’