June 30, 2014 - Justin Mason's Weblog

Facebook Doesn’t Understand The Fuss About Its Emotion Manipulation Study

This is quite unethical, and I’m amazed it was published at all. Kashmir Hill at Forbes nails it:
While many users may already expect and be willing to have their behavior studied — and while that may be warranted with “research” being one of the 9,045 words in the data use policy — they don’t expect that Facebook will actively manipulate their environment in order to see how they react. That’s a new level of experimentation, turning Facebook from a fishbowl into a petri dish, and it’s why people are flipping out about this.
Shocking stuff. We need a new social publishing platform, built on ethical, open systems.

(tags: ethics facebook privacy academia depression feelings emotion social-publishing social experimentation papers)
Building a Smarter Application Stack – DevOps Ireland

This sounds like a very interesting Dublin meetup (Engine Yard on Thursday night):
This month, we’ll have Tomas Doran from Yelp talking about Docker, service discovery, and deployments. ‘There are many advantages to a container based, microservices architecture – however, as always, there is no silver bullet. Any serious deployment will involve multiple host machines, and will have a pressing need to migrate containers between hosts at some point. In such a dynamic world hard coding IP addresses, or even host names is not a viable solution. This talk will take a journey through how Yelp has solved the discovery problems using Airbnb’s SmartStack to dynamically discover service dependencies, and how this is helping unify our architecture, from traditional metal to EC2 ‘immutable’ SOA images, to Docker containers.’

(tags: meetups talks dublin deployment smartstack ec2 docker yelp service-discovery)
Smart Integration Testing with Dropwizard, Flyway and Retrofit

Retrofit in particular looks neat. Mind you, having worked with in-memory SQL databases for integration testing before, I’d never do that again: too many interop glitches compared to “real world” MySQL/Postgres.

(tags: testing integration-testing retrofit flyway dropwizard logentries)
Twitter’s TSAR

TSAR = “Time Series AggregatoR”. Twitter’s new event processor-style architecture for internal metrics. It’s notable that Twitter and Google are now both apparently moving towards a programming model in which the same code is designed to run equally well in realtime streaming and batch modes (Summingbird, MillWheel, Flume).

(tags: analytics architecture twitter tsar aggregation event-processing metrics streaming hadoop batch)
‘Robust De-anonymization of Large Sparse Datasets’ [pdf]

Paper by Arvind Narayanan and Vitaly Shmatikov, 2008. ‘We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.’

(tags: anonymisation anonymization sanitisation databases data-dumps privacy security papers)
HSE data releases may be de-anonymisable

Although the data has been kept anonymous, the increasing sophistication of computer-driven data-mining techniques has led to fears patients could be identified. A HSE spokesman confirmed yesterday that the office responded to requests for data from a variety of sources, including researchers, the universities, GPs, the media, health insurers and pharmaceutical companies. An average of about two requests a week was received. […] The information provided by the HPO has significant patient identifiers removed, such as name and date of birth. According to the HSE spokesman, individual patient information is not provided and, where information is sought for a small group of patients, this is not provided where the number involved is under five. “In such circumstances, it is highly unlikely that anyone could be identified. Nevertheless, we will have another look at data releases from the office,” he said.
I’d say this could be readily reversible, from the sounds of it.

(tags: anonymisation sanitisation data-dumps hse health privacy via:tjmcintyre)
Beautiful algorithm visualisations from Mike Bostock

This is a few days old, but unmissable. I swear, the ‘Wilson’s algorithm transformed into a tidy tree layout’ viz brought tears to my eyes ;)

(tags: dataviz algorithms visualization visualisation mazes trees sorting animation mike-bostock)
ByteArrayOutputStream is really, really slow sometimes in JDK6

This leads us to the bug. The size of the array is determined by Math.max(buf.length << 1, newcount). Ordinarily, buf.length << 1 returns double buf.length, which would always be much larger than newcount for a 2 byte write. Why was it not? The problem is that for all integers larger than Integer.MAX_INTEGER [i.e. Integer.MAX_VALUE] / 2, shifting left by one place causes overflow, setting the sign bit. The result is a negative integer, which is always less than newcount. So for all byte arrays larger than 1073741824 bytes (i.e. one GB), any write will cause the array to resize, and only to exactly the size required.
Ouch.
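
For the curious, here is a minimal sketch of that arithmetic in Java. It only mirrors the sizing expression quoted above with plain ints (no gigabyte arrays are allocated), and it is not the JDK source. The class and method names are my own, and overflowSafeCapacity is just one hypothetical way to avoid the wraparound, not necessarily how later JDKs fixed it.

```java
public class BaosGrowthDemo {
    // Mirrors the expression quoted above: Math.max(buf.length << 1, newcount).
    // Illustrative only; this is not copied from the JDK.
    static int jdk6StyleCapacity(int currentLength, int newCount) {
        return Math.max(currentLength << 1, newCount);
    }

    // Hypothetical overflow-aware variant: double in long arithmetic, then clamp
    // to Integer.MAX_VALUE. Real JVMs may refuse arrays quite that large.
    static int overflowSafeCapacity(int currentLength, int newCount) {
        long doubled = (long) currentLength * 2L;
        long wanted = Math.max(doubled, (long) newCount);
        return (int) Math.min(wanted, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        int small = 1024;
        int oneGiB = 1 << 30; // 1073741824

        // Small buffer: doubling wins, so a 2 byte write grows it to 2048.
        System.out.println(jdk6StyleCapacity(small, small + 2));

        // At 1 GiB the shift wraps around to a negative number...
        System.out.println(oneGiB << 1);

        // ...so Math.max falls back to newCount: the buffer grows by only 2 bytes,
        // and every later write triggers another full copy.
        System.out.println(jdk6StyleCapacity(oneGiB, oneGiB + 2));

        // The long-based variant keeps growing geometrically until it hits the clamp.
        System.out.println(overflowSafeCapacity(oneGiB, oneGiB + 2));
    }
}
```

Once the buffer passes 1 GiB, each tiny write copies the whole array again, which is where the "really, really slow" comes from.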

(tags: bugs java jdk6 bytearrayoutputstream impala performance overflow)
Cory Doctorow on Thomas Piketty’s ‘Capital in the 21st Century’

Quite a leftie analysis.

(tags: history capitalism economics piketty capital finance taxation growth money cory-doctorow thomas-piketty)
ThreadSanitizer

Google’s purify/valgrind-like concurrency checking tool: ‘As a bonus, ThreadSanitizer finds some other types of bugs: thread leaks, deadlocks, incorrect uses of mutexes, malloc calls in signal handlers, and more. It also natively understands atomic operations and thus can find bugs in lock-free algorithms. […] The tool is supported by both Clang and GCC compilers (only on Linux/Intel64). Using it is very simple: you just need to add a -fsanitize=thread flag during compilation and linking. For Go programs, you simply need to add a -race flag to the go tool (supported on Linux, Mac and Windows).’

(tags: concurrency bugs valgrind threadsanitizer threading deadlocks mutexes locking synchronization coding testing)
