Good preso from Percona Live 2015 on the messiness of MySQL's handling of UTF-8 and utf8mb4
(tags: utf-8 utf8mb4 mysql storage databases slides booking.com character-sets)
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel-friendly, making it useful in map-reduce and parallel streaming applications. The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. Quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests, even though t-digests are more compact when stored on disk. A super-nice feature is that it’s mergeable, so it's amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
(tags: data-structures algorithms java t-digest statistics quantiles percentiles aggregation digests estimation ranking)
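To make the "mergeable" point concrete, here is a minimal Python sketch of a t-digest-style structure. It is not the linked Java implementation or its API. The class name `ToyDigest`, the `delta` and `buffer_size` parameters, and the `4 * n * q * (1 - q) / delta` size limit are all illustrative assumptions: a simplified stand-in for the idea of clustering values in one dimension, keeping centroids small near the tails and letting them grow in the middle.

```python
import random


class ToyDigest:
    """Toy t-digest-like sketch: a sorted list of [mean, weight] centroids."""

    def __init__(self, delta=100, buffer_size=500):
        self.delta = delta              # hypothetical compression parameter
        self.buffer_size = buffer_size  # raw points held before compressing
        self.centroids = []
        self.buffer = []
        self.total = 0.0

    def add(self, x, w=1.0):
        self.buffer.append([float(x), float(w)])
        self.total += w
        if len(self.buffer) >= self.buffer_size:
            self._compress()

    def merge(self, other):
        # Merging is just "treat the other digest's centroids as weighted points".
        self.buffer.extend([m, w] for m, w in other.centroids + other.buffer)
        self.total += other.total
        self._compress()

    def _compress(self):
        points = sorted(self.centroids + self.buffer)
        self.buffer = []
        if not points:
            return
        merged = [points[0][:]]
        seen = 0.0  # total weight of centroids already closed
        for mean, w in points[1:]:
            cur = merged[-1]
            # Approximate quantile of the candidate merged centroid.
            q = (seen + (cur[1] + w) / 2) / self.total
            # Assumed size limit: tiny near q=0 and q=1, largest at the median.
            limit = 4 * self.total * q * (1 - q) / self.delta
            if cur[1] + w <= limit:
                cur[0] += (mean - cur[0]) * w / (cur[1] + w)  # weighted mean
                cur[1] += w
            else:
                seen += cur[1]
                merged.append([mean, w])
        self.centroids = merged

    def quantile(self, q):
        self._compress()
        if not self.centroids:
            raise ValueError("empty digest")
        target = q * self.total
        cum = 0.0
        prev_mid = prev_mean = None
        for mean, w in self.centroids:
            mid = cum + w / 2  # rank at the centre of this centroid
            if target <= mid:
                if prev_mid is None:
                    return mean
                # Linear interpolation between neighbouring centroid centres.
                frac = (target - prev_mid) / (mid - prev_mid)
                return prev_mean + frac * (mean - prev_mean)
            prev_mid, prev_mean = mid, mean
            cum += w
        return self.centroids[-1][0]


# Two "hosts" each digest half of the data, then the digests are merged.
rng = random.Random(42)
data = [rng.random() for _ in range(20000)]
host_a, host_b = ToyDigest(), ToyDigest()
for v in data[:10000]:
    host_a.add(v)
for v in data[10000:]:
    host_b.add(v)
host_a.merge(host_b)
print(len(host_a.centroids), host_a.quantile(0.5), host_a.quantile(0.99))
```

The design point the sketch tries to show is that a digest is only a bag of weighted means, so combining two of them is the same operation as adding more points. That is why per-host digests can be built independently and merged afterwards.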