Links for 2020-01-22

  • A Review of Netflix’s Metaflow

    Metaflow looks nice, and is used by $work’s data scientists

    (tags: metaflow data-science data batch architecture)

  • XGBoost

    ‘an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.’

    (tags: python xgboost gradient-boosting ml machine-learning mpi)

  • Historic S3 data corruption due to a faulty load balancer

    This came up in a discussion of using hashes for end-to-end data resiliency on the og-aws Slack. Turns out AWS support staff wrote it up at the time:

    We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20 [2008]. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream. When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied. When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted.

    Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.

    One of the things we’ll do is improve our logging of requests with MD5s, so that we can look for anomalies in their 400 error rates. Doing this will allow us to provide more proactive notification on potential transmission issues in the future, for customers who use MD5s and those who do not.

    In addition to taking the actions noted above, we encourage all of our customers to take advantage of mechanisms designed to protect their applications from incorrect data transmission. For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 as part of the PUT response code in the ETag. By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content MD5 header wasn’t specified in the PUT request.

    Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted. This is a best practice that we’ll emphasize more heavily in our documentation to help customers build applications that can handle this situation.

    (tags: aws s3 outages postmortems load-balancing data-corruption corruption failure md5 hashing hashes)

  • Expert reaction to World Health Organisation Q&A on e-cigarettes

    It does seem that scaremongering about vaping is hurting efforts to get people off cigarettes:

    “Practically all the factual statements in it are wrong. There is no evidence that vaping is ‘highly addictive’ – less than 1% of non-smokers become regular vapers. Vaping does not lead young people to smoking – smoking among young people is at all time low. There is no evidence that vaping increases risk of heart disease or that could have any effect at all on bystanders’ health. The US outbreak of lung injuries is due to contaminants in illegal marijuana cartridges and has nothing to do with nicotine vaping. There is clear evidence that e-cigarettes help smokers quit.

    “The authors of this document should take responsibility for using blatant misinformation to prevent smokers from switching to a much less risky alternative.”

    (tags: cigarettes smoking vaping addiction health medicine scaremongering who cancer)

  • The No Code Movement

    ‘No code is the best way to write secure and reliable applications. Write nothing; deploy nowhere.’

    (tags: coding no nocode funny true)

  • Star-Tree Index: Powering Fast Aggregations on Pinot | LinkedIn Engineering

    An interesting new indexing technique for multi-dimensional data set queries, where you can predefine the _order_ of query dimensions:

    With such huge improvements for both latency and throughput, the Star-Tree index only costs about 12% extra storage space compared to data without indexing techniques and 6% extra compared to data with inverted index.

    (tags: star-tree sql querying search pinot linkedin algorithms databases indexing indexes)

  • Boing Boing is 20 (or 33) years old today.

    Wow. Happy birthday from this happy mutant

    (tags: boing-boing blogs history 1990s zines)
