yet another laughable UI kludge in OS X. ridiculous
“Dear Mr Tilman, this is the only way I can help you. saluti, Giorgio Moroder”. I love it — someone call Tufte
Probably not deliberate, but pretty damn inept.
Over the past weeks several movie studios have been trying to suppress the availability of TPB-AFK [the Pirate Bay documentary] by asking Google to remove links to the documentary from its search engine. The links are carefully hidden in standard DMCA takedown notices for popular movies and TV-shows. The silent attacks come from multiple Hollywood sources including Viacom, Paramount, Fox and Lionsgate and are being sent out by multiple anti-piracy outfits. Fox, with help from six-strikes monitoring company Dtecnet, asked Google to remove a link to TPB-AFK on Mechodownload. Paramount did the same with a link on the Warez.ag forums. Viacom sent at least two takedown requests targeting links to the Pirate Bay documentary on Mrworldpremiere and Rapidmoviez. Finally, Lionsgate jumped in by asking Google to remove a copy of TPB-AFK from a popular Pirate Bay proxy.
This is about the best tech journalism I’ve ever read on Flickr. nice one Mat Honan
An anarchist critique of the Freeman movement from the WSM:
This has been a very brief overview of the Freeman movement that has tried to capture with broad strokes its nature and possible responses. There is room for much more work, including a more in-depth analysis of the various flaws in the approach to the law. The greatest danger however is allowing a movement to develop within anarchist circles that ignores the principle of mutual aid and implicitly promotes private ownership of resources, that by granting absolute right to individuals gives them the ability to ignore their responsibilities to the wider community and ecology that sustains them. In more traditional terms, the movement is one all about negative freedoms, ignoring positive freedom as a concept.
Another leftie view on the Freeman movement
Well, apparently tomorrow, but close enough. Happy birthday to bradfitz’ greatest creation and its wonderful slab allocator!
I am loving this. Particularly this:
At trial in East Texas Cheng took the stand to tell Newegg’s story. Alcatel-Lucent’s corporate representative, at the heart of its massive licensing campaign, couldn’t even name the technology or the patents it was suing Newegg over. “Successful defendants have their litigation managed by people who care,” said Cheng. “For me, it’s easy. I believe in Newegg, I care about Newegg. Alcatel Lucent, meanwhile, they drag out some random VP—who happens to be a decorated Navy veteran, who happens to be handsome and has a beautiful wife and kids—but the guy didn’t know what patents were being asserted. What a joke.” “Shareholders of public companies that engage in patent trolling should ask themselves if they’re really well-served by their management teams,” Cheng added. “Are they properly monetizing their R&D? Surely there are better ways to make money than to just rely on litigating patents. If I was a shareholder, I would take a hard look as to whether their management was competent.”
Kyle “aphyr” Kingsbury expands on his slides demonstrating the real-world failure scenarios that arise during some kinds of partitions (specifically, the TCP-hang, no clear routing failure, network partition scenario). Great set of blog posts clarifying CAP
Welcome to the Galapagos of Chinese “open” source. I call it “gongkai” (??). Gongkai is the transliteration of “open” as applied to “open source”. I feel it deserves a term of its own, as the phenomenon has grown beyond the so-called “shanzhai” (??) and is becoming a self-sustaining innovation ecosystem of its own. Just as the Galapagos Islands is a unique biological ecosystem evolved in the absence of continental species, gongkai is a unique innovation ecosystem evolved with little western influence, thanks to political, language, and cultural isolation. Of course, just as the Galapagos was seeded by hardy species that found their way to the islands, gongkai was also seeded by hardy ideas that came from the west. These ideas fell on the fertile minds of the Pearl River delta, took root, and are evolving. Significantly, gongkai isn’t a totally lawless free-for-all. It’s a network of ideas, spread peer-to-peer, with certain rules to enforce sharing and to prevent leeching. It’s very different from Western IP concepts, but I’m trying to have an open mind about it.
Michael “Release It!” Nygard’s slides from a recent O’Reilly event, discussing large-scale service reliability design patterns
Good interview with Alan Maguire, the satirist behind the very funny @NotTheRTEGuide on Twitter:
I’ve always been a huge fan of TV Go Home and Charlie Brooker in general and it seemed like Irish TV and culture was a good target for the kind of barbed surrealism that he does. (I’m not claiming I’m in his league or anything but he’s the main influence). I was really surprised that there hadn’t been a parody RTÉ Guide already. TV listings are 140-ish characters already and the RTÉ Guide has a kind of weird place in Irish culture where everybody knows it but nobody our age really has any idea of what’s in it anymore. We associate it with a small-c conservatism, or I did at least and I play that up occasionally with the account.
‘based on my observations while I was a Site Reliability Engineer at Google.’ – by Rob Ewaschuk; very good, and matching the similar recommendations and best practices at Amazon for that matter
Page in the AWS docs which describes their derived metrics and how they are computed — these are visible in the AWS Management Console, and alarmable, but not viewable in the Cloudwatch UI. grr. (page-joshea!)
Bloody hell. This is stupidity of the highest order, and a canonical example of “filter creep” by a government — secret state censorship of 1200 websites due to a single investment scam site.
The Federal Government has confirmed its financial regulator has started requiring Australian Internet service providers to block websites suspected of providing fraudulent financial opportunities, in a move which appears to also open the door for other government agencies to unilaterally block sites they deem questionable in their own portfolios. The instrument through which the ISPs are blocking the Interpol list of sites is Section 313 of the Telecommunications Act. Under the Act, the Australian Federal Police is allowed to issue notices to telcos asking for reasonable assistance in upholding the law. [...] Tonight Senator Conroy’s office revealed that the incident that resulted in Melbourne Free University and more than a thousand other sites being blocked originated from a different source — financial regulator the Australian Securities and Investment Commission. On 22 March this year, ASIC issued a media release warning consumers about the activities of a cold-calling investment scam using the name ‘Global Capital Wealth’, which ASIC said was operating several fraudulent websites — www.globalcapitalwealth.com and www.globalcapitalaustralia.com. In its release on that date, ASIC stated: “ASIC has already blocked access to these websites.”
“Twitter / gavinsblog: For sake of clarity here is helpful pie chart of the 95.4% of fixed charge notices not terminated #missingthepoint” Paging Edward Tufte: classic example of an obfuscatory pie-chart, diagramming the wrong thing misleadingly. By presenting it like this, it appears that the 95.4% of cases where fixed charge notices were issued by the guards are relevant to the discussion of the other classes; in reality, that means that 4.6% of cases, 37,000 cases, were terminated, some for good reasons, others for not, and it’s the difference between those two classes that are relevant. In my opinion, 2 separate pie charts would be better; one to show the dismissed-versus-undismissed count (which IMO could have been omitted entirely), and one to show the good-vs-not-so-good termination reason counts (which is the meat of the issue).
background white paper on the BDB-JE innards and design, from 2006. Still pretty accurate and good info
This Court has developed a new awareness and understanding of a category of vexatious litigant. As we shall see, while there is often a lack of homogeneity, and some individuals or groups have no name or special identity, they (by their own admission or by descriptions given by others) often fall into the following descriptions: Detaxers; Freemen or Freemen-on-the-Land; Sovereign Men or Sovereign Citizens; Church of the Ecumenical Redemption International (CERI); Moorish Law; and other labels – there is no closed list. In the absence of a better moniker, I have collectively labelled them as Organized Pseudolegal Commercial Argument litigants [“OPCA litigants”], to functionally define them collectively for what they literally are. These persons employ a collection of techniques and arguments promoted and sold by ‘gurus’ (as hereafter defined) to disrupt court operations and to attempt to frustrate the legal rights of governments, corporations, and individuals. Over a decade of reported cases have proven that the individual concepts advanced by OPCA litigants are invalid. What remains is to categorize these schemes and concepts, identify global defects to simplify future response to variations of identified and invalid OPCA themes, and develop court procedures and sanctions for persons who adopt and advance these vexatious litigation strategies. One participant in this matter [...] appears to be a sophisticated and educated person, but is also an OPCA litigant. One of the purposes of these Reasons is, through this litigant, to uncover, expose, collate, and publish the tactics employed by the OPCA community, as a part of a process to eradicate the growing abuse that these litigants direct towards the justice and legal system we otherwise enjoy in Alberta and across Canada. I will respond on a point-by-point basis to the broad spectrum of OPCA schemes, concepts, and arguments advanced in this action by [him].Via Ronan Lupton
This classic came up in discussions yesterday…
In the Linux Kernel community Rusty Russell came up with a API rating scheme to help us determine if our API is sensible, or not. It’s a rating from -10 to 10, where 10 is perfect is -10 is hell. Unfortunately there are too many examples at the wrong end of the scale.
hooray! Command-line gmailish goodness returns. And with a signed gem, to boot
on the mechanical-sympathy mailing list. Some really interesting discussion on handling insane quantities of TCP connections using low volumes of hardware:
This talk has some good points and I think the subject is really interesting. I would take the suggested approach with serious caution. For starters the Linux kernel is nowhere near as bad as it made out. Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning. I’ve heard they have since taken this to over 5 million concurrent connections. BTW Open Onload is an open source implementation. Writing a network stack is a serious undertaking. In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases. It is a great exercise in data structures and lock-free programming. If you need very high-end performance I’d talk to the Solarflare or Mellanox guys before writing my own. There are some errors and omissions in this talk. For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache. A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections. Creating and destroying 1 million connections per second is a major issue. A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection.
This program creates an EBS snapshot for an Amazon EC2 EBS volume. To help ensure consistent data in the snapshot, it tries to flush and freeze the filesystem(s) first as well as flushing and locking the database, if applicable. Filesystems can be frozen during the snapshot. Prior to Linux kernel 2.6.29, XFS must be used for freezing support. While frozen, a filesystem will be consistent on disk and all writes will block. There are a number of timeouts to reduce the risk of interfering with the normal database operation while improving the chances of getting a consistent snapshot. If you have multiple EBS volumes in a RAID configuration, you can specify all of the volume ids on the command line and it will create snapshots for each while the filesystem and database are locked. Note that it is your responsibility to keep track of the resulting snapshot ids and to figure out how to put these back together when you need to restore the RAID setup.Handy!
Another good writeup on iostat and EBS, from Ilya Grigorik
Great post from AndrewC@EBS on interpreting iostat output on EBS volumes — from 2009, but still looks reasonable enough
This is so damn spot on.
Functional silos (and a standalone DevOps team is a great example of one) decouple actions from responsibility. Functional silos allow people to ignore, or at least feel disconnected from, the consequences of their actions. DevOps is a cultural change that encourages, rewards and exposes people taking responsibility for what they do, and what is expected from them. As Werner Vogels from Amazon Web Services says, “you build it, you run it”. So a “DevOps team” is a risky and ultimately doomed strategy. Sure there are some technical roles, specifically related to the enablement of DevOps as an approach and these roles and tools need to be filled and built. Self service platforms, collaboration and communication systems, tool chains for testing, deployment and operations are all necessary. Sure someone needs to deliver on that stuff. But those are specific technical deliverables and not DevOps. DevOps is about people, communication and collaboration. Organizations ignore that at their peril.
including on paid-for, losslessly-compressed digital audio music files:
Why isn’t UMG’s watermark talked about more? Maybe people think the audio quality problems are due to some kind of lossy compression, as I did, and ignore it completely, or blame the streaming service/distributor. The problem here is that the UMG watermark degrades the audio to about the equivalent of a 96 kbit MP3. My guess is that if consumers were informed about what is going on, they would care. Especially those who pay full retail price for digital downloads advertised as lossless audio.
Aphyr’s epic RICON talk, exploring distributed-database failure modes through music. and what a lot of fail there is! Bottom line: CRDTs win
we are proud to announce the first production drop of Impala, which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.Along with some great benchmark numbers against Hive. nifty stuff
Insightful response, worth bookmarking. (the original post is at http://damienkatz.net/2013/05/dynamo_sure_works_hard.html ).
while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity. You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result a disaster — I am speaking from personal experience), you will have to accept that you’re bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency. The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc…), or even more complex dependent failures. The reality, of course, is that availability is neither the sole, nor the principal concern of every system. It’s perfect fine to trade off availability for other goals — you just need to be aware of that trade off.
Another good clarification about CAP which resurfaced during last week’s discussion:
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition [...]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.(sorry, catching up on old interesting things posted last week…)
nicely done, very readable
Looks like many Aussie network operators were legally required to block 1,200 websites (presumably, one target and 1199 false positives), in secret. Quoting http://lists.ausnog.net/pipermail/ausnog/2013-April/017993.html : “You get a notice to block. You block or either get fined, go to jail or lose your carrier licence. It is a blunt instrument and it is a condition of being at ‘the big boys table’ i.e. you’re a carrier or a carriage service provider.”
good info on the system metrics recorded by BDB-JE’s EnvironmentStats code, particularly where cache and cleaner activity are concerned. Particularly useful for Voldemort
nice, readable intro to SpaceSaving (which I’ve linked to before) — a simple stream-processing cardinality top-K estimation algorithm with bounded error.
good interview — lots of food for thought!
As it is, we’ve seen no discernible increase in piracy on any of our titles, despite them being DRM-free for nearly a year.
fantastic in-depth presentation on EBS usage; lots of good advice here if you’re using EBS volumes with/without PIOPS
Github get good results using Judy arrays to replace a Ruby hash. However: the whole blog post is a bit dodgy to me. It feels like there are much better ways to fix the problem: 1. the big one: don’t do GC-heavy activity in the front-end web servers. Split that language-classification code into a separate service. Write its results to a cache and don’t re-query needlessly. 2. why isn’t this benchmarked against a C/C++ hash? it’s only 36000 entries, loaded once at startup. lookups against that should be blisteringly fast even with the basic data structures, and that would also be outside the Ruby heap so avoid the GC overhead. Feels like the use of a Judy array was a “because I want to” decision. 3. personally, I’d have preferred they spend time fixing their uptime problems…. See also https://news.ycombinator.com/item?id=5639013 for more kvetching.
Mozilla’s experience with Kanban. We’ve had good results in Amazon, too. good intro links in this post — might start talking about it in Swrve…
Thunberg’s admission that [the E-Sports Entertainment Association client software] ran Bitcoin-mining software without explicit user consent is startling. Aside from potentially opening the company up to huge legal liability, the move is likely to engender distrust among some of the company’s most loyal fans. The nonchalance of some of Thunberg’s comments may only add insult to the betrayal many users are likely to feel. “But for the record, I told jag he shouldn’t be lazy and run the miner in a separate process,” he wrote in a post, referring to one of his software engineers with the screen name Jaguar, who didn’t take steps to conceal the Bitcoin miner. “Rookie move.” In the later post he wrote: “100% of the funds are going into the s14 prize pot, so at the very least your melted gpus contributed to a good cause.”
Interesting, first time I’d heard of it; the Model-View-View Model pattern.
very nice single-purpose site — figure out who represents any given Irish postal address
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures: TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it’s difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn’t really possible. GEOMETRY When data has more than one dimension (e.g. maps, database tables). DYNAMIC OPTIMALITY Is there one binary search tree that’s as good as all others? We still don’t know, but we’re close. MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache. HASHING Hashing is the most used data structure in computer science. And it’s still an active area of research. INTEGERS Logarithmic time is too easy. By careful analysis of the information you’re dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible. DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes. STRINGS Searching for phrases in giant text (think Google or DNA). SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).(via Tim Freeman)
At least in terms of StackOverflow rep:
For the first part of the study, the researchers compared the age of users with their reputation scores. They found that an individual’s reputation increases with age, at least into a user’s 40s. There wasn’t enough data to draw meaningful conclusions for older programmers. The researchers then looked at the number of different subjects that users asked and answered questions about, which reflects the breadth of their programming interests. The researchers found that there is a sharp decline in the number of subjects users weighed in on between the ages of 15 and 30 – but that the range of subjects increased steadily through the programmers’ 30s and into their early 50s. Finally, the researchers evaluated the knowledge of older programmers (ages 37 and older) compared to younger programmers (younger than 37) in regard to relatively recent technologies – meaning technologies that have been around for less than 10 years. For two smartphone operating systems, iOS and Windows Phone 7, the veteran programmers had a significant edge in knowledge over their younger counterparts. For every other technology, from Django to Silverlight, there was no statistically significant difference between older and younger programmers. “The data doesn’t support the bias against older programmers – if anything, just the opposite,” Murphy-Hill says.Damn right ;)
Test Double is a generic term for any case where you replace a production object for testing purposes. There are various kinds of double that Gerard lists: Dummy objects are passed around but never actually used. Usually they are just used to fill parameter lists. Fake objects actually have working implementations, but usually take some shortcut which makes them not suitable for production (an InMemoryTestDatabase is a good example). Stubs provide canned answers to calls made during the test, usually not responding at all to anything outside what’s programmed in for the test. Spies are stubs that also record some information based on how they were called. One form of this might be an email service that records how many messages it was sent. Mocks are pre-programmed with expectations which form a specification of the calls they are expected to receive. They can throw an exception if they receive a call they don’t expect and are checked during verification to ensure they got all the calls they were expecting.
Oh for god’s sake. I know a few people who’ve made a trip to Mayo explicitly because the Greenway was there to visit. This is shocking, backwards stuff:
The success of [Mayo's] Great Western Greenway [trail] has overtaken that of others, such as the Great Southern Trail group, which has been working hard to install a walking and cycling route on sections of the former Limerick-Tralee railway line. On February 2nd, to mark the 50th anniversary of its closure, about 150 members and supporters of the Great Southern Trail set out from the old railway station at Abbeyfeale, Co Limerick, along the most recently developed section to cross the Kerry county boundary. The trailers were greeted by a barricade on the border, manned by more than 30 farmers, including the Listowel Fine Gael town councillor Denis Stack. A stand-off continued for three hours, with the Garda mediating in vain. The farmers were trying to lay claim to the land occupied by the disused railway line, even though Minister for Transport Leo Varadkar had made it clear that CIÉ “is the owner of the property [and] will object to any application by others to register these lands”.(via Rossa McMahon)
like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text. [it] is written in portable C, and it has zero runtime dependencies. You can download a single binary, scp it to a far away machine, and expect it to work.Nice tool. Needs to get into the Debian/Ubuntu apt repos pronto ;)
Hmm, this seems nifty as a compositional building block for Java code to enable concurrency without thread-safety and sync problems.
Functional reactive programming offers efficient execution and composition by providing a collection of operators capable of filtering, selecting, transforming, combining and composing Observable’s. The Observable data type can be thought of as a “push” equivalent to Iterable which is “pull”. With an Iterable, the consumer pulls values from the producer and the thread blocks until those values arrive. By contrast with the Observable type, the producer pushes values to the consumer whenever values are available. This approach is more flexible, because values can arrive synchronously or asynchronously.
Daniel Lemire comments on the recent cases of bugs in spreadsheets causing major impact:
There are several critical problems with a tool like Excel that need to be widely known: * Spreadsheets do not support testing. For anything that matters, you should validate and test your code automatically and systematically; * Spreadsheets make code reviews impractical. To visually inspect the code, you need to click and each and every cell. In practice, this means that you cannot reasonably ask someone to read over your formulas to make sure that there is no mistake; * Spreadsheets encourage redundancies. Spreadsheets encourage copy-and-paste. Though copying and pasting is sometimes the right tool, it also creates redundancies. These redundancies make it very difficult to update a spreadsheet: are you absolutely sure that you have changed the formula throughout?Agreed on all three, particularly on the impossibility of testing. IMO, everyone who may be in a job where automation via spreadsheet is likely, needs training in SDE fundamentals: unit testing, the important of open source and open data for reproducibility, version control, and code review. We are all computer scientists now.
implemented using the LMAX Disruptor library — very impressive performance figures. I presume in real-world usage, these latencies are dwarfed by hardware costs, though
Google Drive and GMail have a built-in scripting engine. I had no idea
How the Irish media are partly to blame for the catastrophic property bubble, from a paper entitled _The Role Of The Media In Propping Up Ireland’s Housing Bubble_, by Dr Julien Mercille, in the _Social Europe Journal_:
“The overall argument is that the Irish media are part and parcel of the political and corporate establishment, and as such the news they convey tend to reflect those sectors’ interests and views. In particular, the Celtic Tiger years involved the financialisation of the economy and a large property bubble, all of it wrapped in an implicit neoliberal ideology. The media, embedded within this particular political economy and itself a constitutive element of it, thus mostly presented stories sustaining it. In particular, news organisations acquired direct stakes in an inflated real estate market by purchasing property websites and receiving vital advertising revenue from the real estate sector. Moreover, a number of their board members were current or former high officials in the finance industry and government, including banks deeply involved in the bubble’s expansion.”
Ugh. low-end ISPs MITM’ing DNS queries:
Some ISP’s are now using a technology called ‘Transparent DNS proxy’. Using this technology, they will intercept all DNS lookup requests (TCP/UDP port 53) and transparently proxy the results. This effectively forces you to use their DNS service for all DNS lookups. If you have changed your DNS settings to an open DNS service such as Google, Comodo or OpenDNS expecting that your DNS traffic is no longer being sent to your ISP’s DNS server, you may be surprised to find out that they are using transparent DNS proxying.(via Nelson)
As kragen says, ‘a decentralized way to sync a folder of large files, using BitTorrent instead of an untrustworthy central server’. Windows, OSX, and Linux supported
250 million tweets per day, 30-node HBase cluster, 400TB of storage, Kafka and 0mq. This is from 2011, hence this dated line: ‘for a distributed application they thought AWS was too limited, especially in the network. AWS doesn’t do well when nodes are connected together and they need to talk to each other. Not low enough latency network. Their customers care about latency.’ (Nowadays, it would be damn hard to build a lower-latency network than that attached to a cc2.8xlarge instance.)
Great presentation from Google on HTML5 CSS+JS render speed, 3G/4G network latency, etc. (via John G)
a Presentation from Simon Willnauer on optimization work performed on Lucene in 2011. The most interesting stuff here is the work done to replace an O(n^2) FuzzyQuery fuzzy-match algorithm with a FSM trie is extremely cool — benchmarked at 214 times faster!
Miguel de Icaza says it’s witchcraft — I’m inclined to agree:
Code Digger analyzes possible execution paths through your .NET code. The result is a table where each row shows a unique behavior of your code. The table helps you understand the behavior of the code, and it may also uncover hidden bugs. Through the new context menu item “Generate Inputs / Outputs Table” in the Visual Studio editor, you can invoke Code Digger to analyze your code. Code Digger computes and displays input-output pairs. Code Digger systematically hunts for bugs, exceptions, and assertion failures.
Sixteen years ago, journalists had a much easier job assembling “balanced” stories about MMR in south Wales. When I wrote about the measles outbreak last week, I suggested that it was related to Andrew Wakefield’s discredited 1998 Lancet research, but the Swansea contagion seems more likely to be the result of a separate scare a year earlier in the South Wales Evening Post. Before 1997, uptake of MMR in the distribution area of the Post was 91%, and 87.2% in the rest of Wales. After the Post’s campaign, uptake in the distribution area fell to 77.4% (it was 86.8% in the rest of Wales). That’s almost a 14% drop where the Post had influence, compared with less than 3% elsewhere. In the dry wording of the BMJ, “the [South West Evening Post] campaign is the most likely explanation”. In other words, what we can see in Swansea is the local effect of local reporting‚ in all probability, just a taster of what happens when the news irresponsibly creates unfounded terror. [...] The 1997 coverage focused on a group of families who blamed MMR for various ailments in their children, including learning difficulties, digestive problems and autism‚ none of which have been found to have any connection with the vaccine. The Post’s coverage was at the time deemed a success, and in 1998 it won a prize for investigative reporting in the BT Wales Press Awards. That year, the SWEP ran at least 39 stories related to the alleged dangers of MMR. And yes, it’s true that the paper never directly endorsed non-vaccination. What it did do was publicise the idea of “vaccine damage” as a risk, one that parents would then likely weigh up against the risk of contracting measles, mumps or rubella. And this went beyond the reporting of parental anxieties‚ it was part of the Post’s editorial line. One article is entitled “Young bodies cannot take it”. The all-important “journalistic balance” was constantly available, thanks to campaigning parents and their solicitor Richard Barr. (It was Barr who engaged Wakefield for a lawsuit, leading to the “fishing expedition” research that became the Lancet paper.) They were happy to provide a quote on the dangers of the “triple jab”, which health authorities were then obliged to rebut politely. The Post also seemed to downplay the risk of measles, reporting on 6 July 1998 that “not a single child has been hit by the illness‚ despite a 13% drop in take-up levels”. It’s not parents who should feel embarrassed by the Swansea measles outbreak: some may have acted from overt dread at the prospect of harming their child, and some simply from omission, but all were encouraged by a press that focused on non-existent risks and downplayed the genuine horror of the diseases MMR prevents. The shame belongs to journalists: those of the South West Evening Post who allowed themselves to be recruited in the service of a speculative lawsuit, and any who let a specious devotion to “balance” overrule a duty to tell the truth.
mostly a DynamoDB puff-piece from last week’s Amazon Cloud Connect, but contains some good real-world figures for a 20-billion-GUID deduping table use-case at end. ($4,150 per month, to cut to the chase)
Wow, this is a great software-quality story — I knew Excel was the most widely used programming environment out there, but this is a factor I’d overlooked:
In his remarks on the final panel, Frank Partnoy mentioned something I missed when it came out a few weeks ago: the role of Microsoft Excel in the “London Whale” trading debacle. [..] To summarize: JPMorgan’s Chief Investment Office needed a new value-at-risk (VaR) model for the synthetic credit portfolio (the one that blew up) and assigned a quantitative whiz [...] to create it. The new model “operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” The internal Model Review Group identified this problem as well as a few others, but approved the model, while saying that it should be automated and another significant flaw should be fixed. After the London Whale trade blew up, the Model Review Group discovered that the model had not been automated and found several other errors. Most spectacularly, “After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR …” I write periodically about the perils of bad software in the business world in general and the financial industry in particular, by which I usually mean back-end enterprise software that is poorly designed, insufficiently tested, and dangerously error-prone. But this is something different. [...] While Excel the program is reasonably robust, the spreadsheets that people create with Excel are incredibly fragile. There is no way to trace where your data come from, there’s no audit trail (so you can overtype numbers and not know it), and there’s no easy way to test spreadsheets, for starters. The biggest problem is that anyone can create Excel spreadsheets — badly. Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way. This is why the JPMorgan VaR model is the rule, not the exception: manual data entry, manual copy-and-paste, and formula errors. This is another important reason why you should pause whenever you hear that banks’ quantitative experts are smarter than Einstein, or that sophisticated risk management technology can protect banks from blowing up. At the end of the day, it’s all software. While all software breaks occasionally, Excel spreadsheets break all the time. But they don’t tell you when they break: they just give you the wrong number.
Good (albeit draft) write-up of the implications of CAP, allow_mult, and last_write_wins conflict-resolution policies in Riak:
As Brewer’s CAP theorem established, distributed systems have to make hard choices. Network partition is inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from “stop accepting any writes until the outage is resolved” (thus maintaining absolute consistency) to “allow any writes and worry about consistency later” (to maximize availability). Riak leans toward the availability end of the spectrum, but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data.
Yahoo! sucks. shutting down in days? ArchiveTeam Warrior to the rescue; install the VM!
Krugman on the Reinhart-Rogoff Excel-bug fiasco.
What the Reinhart-Rogoff affair shows is the extent to which austerity has been sold on false pretenses. For three years, the turn to austerity has been presented not as a choice but as a necessity. Economic research, austerity advocates insisted, showed that terrible things happen once debt exceeds 90 percent of G.D.P. But “economic research” showed no such thing; a couple of economists made that assertion, while many others disagreed. Policy makers abandoned the unemployed and turned to austerity because they wanted to, not because they had to. So will toppling Reinhart-Rogoff from its pedestal change anything? I’d like to think so. But I predict that the usual suspects will just find another dubious piece of economic analysis to canonize, and the depression will go on and on.
‘Stochastic monte-carlo epidemic SIR model to reveal herd immunity’. Fantastic demo of this important medical concept (via Colin Whittaker)
compute an image-similarity metric, to discover mostly-identical-but-slightly-tweaked images:
SIMILAR computes the normalized cross correlation similarity metric between two equal dimensioned images. The normalized cross correlation metric measures how similar two images are, not how different they are. The range of ncc metric values is between 0 (dissimilar) and 1 (similar). If mode=g, then the two images will be converted to grayscale. If mode=rgb, then the two images first will be converted to colorspace=rgb. Next, the ncc similarity metric will be computed for each channel. Finally, they will be combined into an rms value.(via Dan O’Neill)
a first-person game prototype in which players navigate a 3D space while picking up orbs that reduce the speed of light in increments. Custom-built, open-source relativistic graphics code allows the speed of light in the game to approach the player’s own maximum walking speed. Visual effects of special relativity gradually become apparent to the player, increasing the challenge of gameplay. These effects, rendered in realtime to vertex accuracy, include the Doppler effect (red- and blue-shifting of visible light, and the shifting of infrared and ultraviolet light into the visible spectrum); the searchlight effect (increased brightness in the direction of travel); time dilation (differences in the perceived passage of time from the player and the outside world); Lorentz transformation (warping of space at near-light speeds); and the runtime effect (the ability to see objects as they were in the past, due to the travel time of light). Players can choose to share their mastery and experience of the game through Twitter. A Slower Speed of Light combines accessible gameplay and a fantasy setting with theoretical and computational physics research to deliver an engaging and pedagogically rich experience.
Good overview of the current state of eventually-consistent data store research, covering CALM and CRDTs, from Peter Bailis and Ali Ghodsi
the basics of running a service stack (web, app servers, data stores) on AWS. some good benchmark figures in the final slides
The Bottom Half Of The Internet — “Racism; typos; filth; spam; ignorance; rage – that’s all the bottom half of the internet is good for, right? Rob Manuel wants you to question the internet dictum, most beloved of high-profile columnists, that you should ignore all of the comments all of the time. The ‘war on comments’, he reckons, might just be an echo of a fourth estate that’s having trouble adjusting to the idea of an unwashed public disagreeing with their sacred opinions. Sous les pavés, la plage.” On Tuesday, le cool Dublin & Pilcrow present SPIEL. Rob Manuel is the flashy animator behind B3ta and he’s joined by Ed Melvin, who wants to educate you on ‘The Unreal Engines’ of virtual currencies and economies.
this product from JInspired appears to support runtime profiling of java apps with < 5% performance impact
ex-Nokia product design guru Jan Chipchase on Google Glass
Debunking this prolife talking point:
‘Our maternity services are amongst the best in the world’. This phrase has been much hackneyed since the heartbreaking death of Savita Halappanavar was revealed in mid October. James Reilly and other senior politicians are particularly guilty of citing this inaccurate position. So what is the state of Irish maternity services and how do our figures compare with other comparable countries? Let’s start with the statistics.The bottom line:
Eight deaths per 100,000 is not bad, but it ranks our maternity services far from the best in world and below countries such as Slovakia and Poland.
Founded in 2010, Kaggle is an online platform for data-mining and predictive-modeling competitions. A company arranges with Kaggle to post a dump of data with a proposed problem, and the site’s community of computer scientists and mathematicians — known these days as data scientists — take on the task, posting proposed solutions. [...] On one level, of course, Kaggle is just another spin on crowdsourcing, tapping the global brain to solve a big problem. That stuff has been around for a decade or more, at least back to Wikipedia (or farther back, Linux, etc). And companies like TaskRabbit and oDesk have thrown jobs to the crowd for several years. But I think Kaggle, and other online labor markets, represent more than that, and I’ll offer two arguments. First, Kaggle doesn’t incorporate work from all levels of proficiency, professionals to amateurs. Participants are experts, and they aren’t working for benevolent reasons alone: they want to win, and they want to get better to improve their chances of winning next time. Second, Kaggle doesn’t just create the incidental work product, it creates a new marketplace for work, a deeper disruption in a professional field. Unlike traditional temp labor, these aren’t bottom of the totem pole jobs. Kagglers are on top. And that disruption is what will kill Joy’s Law. Because here’s the thing: the Kaggle ranking has become an essential metric in the world of data science. Employers like American Express and the New York Times have begun listing a Kaggle rank as an essential qualification in their help wanted ads for data scientists. It’s not just a merit badge for the coders; it’s a more significant, more valuable, indicator of capability than our traditional benchmarks for proficiency or expertise. In other words, your Ivy League diploma and IBM resume don’t matter so much as my Kaggle score. It’s flipping the resume, where your work is measurable and metricized and your value in the marketplace is more valuable than the place you work.
a good reference, with lots of sample output. Not clear if it takes 1.6/1.7 differences into account, though
You’ve probably heard that countries with a high debt:GDP ratio suffer from slow economic growth. The specific number 90 percent has been invoked frequently. That’s all thanks to a study conducted by Carmen Reinhardt and Kenneth Rogoff for their book This Time It’s Different. But the results have been difficult for other researchers to replicate. Now three scholars at the University of Massachusetts have done so in “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff” and they find that the Reinhart/Rogoff result is based on opportunistic exclusion of Commonwealth data in the late-1940s, a debatable premise about how to weight the data, and most of all a sloppy Excel coding error. Read Mike Konczal for the whole rundown, but I’ll just focus on the spreadsheet part. At one point they set cell L51 equal to AVERAGE(L30:L44) when the correct procuedure was AVERAGE(L30:L49). By typing wrong, they accidentally left Denmark, Canada, Belgium, Austria, and Australia out of the average. When you run the math correctly “the average real GDP growth rate for countries carrying a public debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not -0.1 percent.”
How GroupOn are warming up a failover warm MySQL spare, using Percona stuff and a “tee” of the live in-flight queries. (via Dave Doran)
Interesting evidence; it appears Irish music promoters are getting “rebates” from the massive TicketMaster “booking fee”, on each ticket sold. This sounds like a cartel to me, and we need to regulate this. Where is the National Consumer Agency and Competition Authority?
The matter is something which should be of concern to every gig-going music fan, regardless of whether they go to Stradbally or not. For years, many have asked about TicketMaster’s quasi-monopoly position in the marketplace and why this is so. We’ve always been told that promoters preferred to deal with one company rather than several and that TM’s systems and nationwide reach yadda yadda yadda was the bees’ knees etc. Other companies have tried to compete but no-one has been able to beat TM at this game. But why would promoters go elsewhere when they’re getting a slice of the TM fees back as rebates? Those past off-the-record attempts by and briefings from promoters blaming TM for those fees can now be seen as hypocritical. They’re sticking with TM because they’re receiving a take of the fees paid by punters who have no other choice in service provider if they want to get their hands on tickets. You wonder what the acts make of this cash-grab – perhaps some whip-smart agent is already making a claim for a percentage of the rebates because there would be no rebates in the first place without the act. Surely this is an issue for the Competition Authority and National Consumers Association too, given the manner in which the rebates are made and TM’s deals with the promoters? While promoters under TM deals are free to sell a certain proportion of their tickets with another provider, it’s usually only a very small percentage of the total and unlikely to trouble TM’s bottom line. Also, given that the rebates are volume-driven, it’s better for the promoters to keep the largest possible chunk of their business with TM. It seems that we have a new suspect in the blame game about why ticket prices are so high.
Hooray, Eoin’s activism gets some coverage!
THE SCALE OF Dublin’s dumping problem is laid bare in a blog that has seen contributors send in photos of chairs, fridges and heaps of rubbish strewn on city streets. Eoin Parker, one of organisers behind DublinLitterBlog.com, spoke to TheJournal.ie about the problem, saying that the blog was set up following the privatisation of waste management by Dublin City Council in 2012.
To our knowledge, Ked is the first scripting language to emerge from The People’s Republic of Cork. Below is an account of what we know so far about the mysterious Corkonian language. Any suggested updates or contributions are encouraged.Genius.
A sobering examination by NAMAwinelake into the quagmire of Ireland’s publicly-funded national broadcaster:
It seems that RTE has become a disaster zone, with libels and incompetence overseen by incapable management, and this is reflected in that organisation’s financial results. RTE still employs nearly 2,000 people and supports jobs and industry across independent producers and suppliers; it is a major business. But the time has come to call a halt to delusional management that is sinking the organization deeper into a quagmire which will ultimately need to be bailed out by the State. And Noel Curran is fobbing us off with flying a kite about a reduction in 65-year old Pat Kenny’s salary from €630,000 to €570,000?!
wow, Pinterest have a pretty hardcore architecture. Sharding to the max. This is scary stuff for me:
a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.yeah, so, eek ;)
Dr. Jen Gunter again:
Dr. Knowles’ testimony confirms for me that the law played a role, because her statements indicate the standard of care for treatment of chorioamnionitis is less aggressive in Ireland. This can only be because of the law as there is no medical evidence to support delaying delivery when chorioamnionitis is diagnosed. Standard of care is not to wait until a woman is sick enough to need a termination, the idea is to treat her, you know, before she gets sick enough. An elevated white count and ruptured membranes at 17 weeks is typically enough to make the diagnosis, so Dr. Knowles needs to testify as to what in Savita’s medical record made it safe to not recommend a delivery. By the way, I also disagree with Dr. Knowles about her interpretation of Savita’s medical record, the chart doesn’t have “subtle indicators” of infection, it screams chorioamnionitis long before Wednesday morning. In North America the standard of care with chorioamnionitis is to recommend delivery as soon as the diagnosis is made, not wait until women enter the antechamber of death in the hopes that we can somehow snatch them back from the brink. If Irish law, or the interpretation thereof, had nothing to do with Savita’s death no expert would be mentioning sick enough at all.
Boundary implement week-on-week trend display. Pity they use silly “giant number” dashboard boxes showing comparisons of the current datapoint with the previous week’s datapoint; there’s no indication of smoothing being applied, and “giant number” dashboards are basically useless anyway compared to a time-series graph, for unsmoothed time-series data. Also, no prediction bands. :(
real-time service outage information on a map, from Ireland’s power network
Geir Magnusson explains how Gilt Groupe is using Project Voldemort to scale out their e-commerce transactional system. The initial SQL solution had to be replaced because it could not handle the transactional spikes the site is experiencing daily due to its particular way of selling their inventory: each day at noon. Magnusson explains why they chose Voldemort and talks about the architecture.via Filippo
a comment on Dr. Jen Gunter’s blog puts it all together
No holds barred:
Speaking today, spokesman Charles Stanley-Smith said; “This idea is insane. This area has suffered from dumping due to a lack of enforcement – yet the council now propose to effectively withdraw services altogether. As numerous studies such as ‘the broken window hypothesis’ indicate, where a small problem is left un-tackled it is likely to become far worse rather than better. In other words, rather than increase enforcement to solve the problem, Dublin City Council is going to remove enforcement. How will this deal with the problem? Imagine if that logic were applied to crime; would the removal of police services in an area help resolve criminal behaviour – or increase it? The answer is obvious.”
Written by Google, this library is a flexible, efficient, and powerful Java client library for accessing any resource on the web via HTTP. It features a pluggable HTTP transport abstraction that allows any low-level library to be used, such as java.net.HttpURLConnection, Apache HTTP Client, or URL Fetch on Google App Engine. It also features efficient JSON and XML data models for parsing and serialization of HTTP response and request content. The JSON and XML libraries are also fully pluggable, including support for Jackson and Android’s GSON libraries for JSON.Not quite as simple an API as Python’s requests, sadly, but still an improvement on the verbose Apache HttpComponent API. Good support for unit testing via a built-in mock-response class. Still in beta
Former IMF chief of mission to Ireland, Ashoka Mody, above left with Ajai Chopra in 2010. Melancholy of eye and large of loafer, Ashoka was involved in negotiating Ireland’s EU/IMF bailout. [...] This morning Ashok gave an interview to Gavin Jennings on Morning Ireland, in which he admitted Ireland’s bailout was riddled with mistakes, namely the non-burning of the senior bondholders and the program of austerity. Jennings: “So, if imposing austerity on Ireland was wrong, or a mistake; if not allowing any burning of bondholders, whether official, sovereign or private was a mistake; you were centrally involved in that program. I know Ajai Chopra was very much the public face of the IMF mission to Ireland. But you were centrally involved in constructing this bailout. How much responsibility do you take for those errors.” Mody: “Yes, so, obviously, I have to take the responsibility in…but I’m in very good company in taking responsibility in this. There were many parties involved. And my role really was to bring such matters to the attention of people who finally made these decisions.”Great.
A professional OB/GYN analyses the horrors coming to light in the Savita inquest. Here’s one particular gem:
Fetal survival with ruptured membranes at 17 weeks is 0%, this is from prospective study. [...but] “real and substantial risk” to the woman’s life is what is required by the Irish constitution to terminate a pregnancy, *whether or not the foetus is viable*.So the foetus had 0% chance of survival — but still termination was not considered an option. Bloody hell.
Lots of talk about “charging regimes”, “income-generating public sector bodies” etc., but not a single mention of open data or free access. Terrible stuff. :( (via conoro)
With Ack: in this mode, as far as compression is concerned, the data gets compressed at the producer, decompressed and compressed on the broker before it sends the ack to the producer. The producer throughput with Snappy compression was roughly 22.3MB/s as compared to 8.9MB/s of the GZIP producer. Producer throughput is 150% higher with Snappy as compared to GZIP. No ack, similar to Kafka 0.7 behavior: In this mode, the data gets compressed at the producer and it doesn’t wait for the ack from the broker. The producer throughput with Snappy compression was roughly 60.8MB/s as compared to 18.5MB/s of the GZIP producer. Producer throughput is 228% higher with Snappy as compared to GZIP. The higher compression savings in this test are due to the fact that the producer does not wait for the leader to re-compress and append the data; it simply compresses messages and fires away. Since Snappy has very high compression speed and low CPU usage, a single producer is able to compress the same amount of messages much faster as compared to GZIP.
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
Boundary on their TSD-on-Riak store.
Dietrich Featherston, Engineer at Boundary, walks through the process of designing Kobayashi, the time-series analytics database behind our network metrics. He goes through the false-starts and lessons learned in effectively using Riak as the storage layer for a large-scale OLAP database. The system is ultimately capable of answering complex, ad-hoc queries at interactive latencies.
A few days old, but already an instant Streisand-Effect classic:
Sometimes people borrow [Colin Purrington's free guide about making scientific posters] without giving him credit. This happens fairly regularly, and when he finds out about it, he sends an e-mail asking them to take it down. Usually they do. But when he sent an e-mail to the Consortium for Plant Biotechnology Research, asking that a roughly 1,200-word, near-verbatim, uncredited chunk from his guide be removed from the consortium’s materials, the response was unexpected. Rather than apologise, a lawyer sent him a cease-and-desist letter accusing him of plagiarizing the consortium’s materials and demanding that he take down his guide or face a lawsuit seeking damages up to $150,000.
Great benchmarking from Piotr Kozikowski at the LiveRamp team, into performance of the upcoming Kafka 0.8 release
an excellent writeup on Kafka 0.8’s use and operation, including details of the new replication features
‘A €10 silver coin being offered for sale to the public in honour of James Joyce by the Central Bank tomorrow contains a misquote from the author. The line used on the coin from Chapter 3 of Ulysses includes a superfluous conjunction – a rogue ‘that’.’ [..] The coin reads:
“Ineluctable modality of the visible: at least that if no more, thought through my eyes. Signatures of all things *that* I am here to read.”(Incorrect ‘that’ emphasised)
Via Mulley. Magnet doing well, with UPC coming second; UPC have dropped a fair bit in the past month. Would love to see it broken down by region…
In practice there are two gotchas that are so painful I am looking for a replacement with a different featureset than couchdb provides. The location tracking project icecondor.com uses couchdb to store 20,000 new records per day. It has more write traffic than read traffic and runs on modest hardware. Those two gotchas are: 1. View Index updates. While I have a vague understanding of why view index updates are slow and bulky and important, in practice it is unworkable. Every write sets up a trap for the first reader to come along after the write. The more writes there are, the bigger the trap for the first reader which has to wait on the couchdb process that refreshes the view index on an as-needed basis. I believe this trade-off was made to keep writes fast. No need to update the view index until all writes are actually complete, right? Write traffic is heavier than read traffic and the time needed for that index refresh causes the webapp to crash because its not setup to handle timeouts from a database query. The workaround is as hackish as one can imagine - cron jobs to hit every map/reduce query to keep indexes fresh. 2. Append only database file Append only is in theory a great way to ensure on-disk reliability. A system crash during an append should only affect that append. Its a crash during an update to existing parts of the file that risks the integrity of more than whats being updated. With so many layers of caching and optimizations in the kernel and the filesystem and now in the workings of SSD drives, I’m not sure append-only gives extra protection anymore. What it does do is a create a huge operational headache. The on-disk file can never grow beyond half the available storage space. Record deletion uses new disk space and if the half-full mark approaches, vacuuming must be done. The entire database is rewritten to the filesystem, leaving out no longer needed records. If the data file should happen to grow beyond half the partition, the system has esentially crashed because there is no way to compact the file and soon the partition will be full. This is a likely scenario when there is a lot of record deletion activity. The system in question does a lot of writes of temporary data that is followed up by deletes a few days later. There is also a lot of permanent storage that hardly gets used. Rewriting every byte of the records that are long-lived due to compaction is an enormous amount of wasted I/O – doubly so given SSD drives have a short write-cycle lifespan.
Jonathan Ellis on some CouchDB negatives:
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project: Writes are serialized. Not serialized as in the isolation level, serialized as in there can only be one write active at a time. Want to spread writes across multiple disks? Sorry. CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes. Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less. CouchDB is simple. Gloriously simple. Why is that a negative? It’s competing with systems (in the popular imagination, if not in its author’s mind) that have been maturing for years. The reason PostgreSQL et al have those features is because people want them. And if you don’t, you should at least ask a DBA with a few years of non-MySQL experience what you’ll be missing. The majority of CouchDB fans don’t appear to really understand what a good relational database gives them, just as a lot of PHP programmers don’t get what the big deal is with namespaces. A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce. MapReduce is a great approach to trivially parallelizing certain classes of problem. The problem is, it’s tedious and error-prone to write raw MapReduce code. This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively). Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages. It’s a little verbose, and you might be bored with it, but it’s much better than writing low-level mapreduce code.
Good write up of CouchDB replication
There really isn’t a separate “protocol” per se for replication. Instead, replication uses CouchDB’s REST API and data model. It’s therefore a bit difficult to talk about replication independently of the rest of CouchDB. In this document I’ll focus on the algorithm used, and link to documentation of the APIs it invokes. The “protocol” is simply the set of those APIs operating over HTTP.
A good writeup of how to detect cases of copyright infringement for photography, art and other visual media.
Von Glitschka, Modern Dog and myriad others make clear that the support of the creative community is absolutely vital in raising awareness of copyright infringements. Sites like www.youthoughtwewouldntnotice.com name and shame clear breaches of copyright, while the Modern Dog case shows that there is no better IP tracing system than the eyes and ears of the design community itself. “It’s the industry at large that has kept me aware of infringements,” states Von. “Without that I would miss most of them because I don’t go looking – they find me via the eyes of others.”
an [LGPL] open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user’s data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools. The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure to accelerate data accesses and reduce the query response time. Most of the commonly used indexes are variants of the B-tree, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations, but are somewhat slower to update after a modification of an individual record. A key innovation in FastBit is the Word-Aligned Hybrid compression (WAH) for the bitmaps.[...] Another innovation in FastBit is the multi-level bitmap encoding methods.
The bit array data structure is implemented in Java as the BitSet class. Unfortunately, this fails to scale without compression. JavaEWAH is a word-aligned compressed variant of the Java bitset class. It uses a 64-bit run-length encoding (RLE) compression scheme. We trade-off some compression for better processing speed. We also have a 32-bit version which compresses better, but is not as fast. In general, the goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
the classic Etsy pro-metrics “measure everything” post. Some good basic rules and mindset
Test-driven infrastructure, using Chef — slides from Big Ruby 2013. Tools used: foodcritic (lol), Chefspec, minitest-chef-handler, fauxhai, cucumber chef. This is really good to see — TDD applied to ops. Video at: http://confreaks.com/videos/2309-bigruby2013-testing-your-automation-ttd-for-chef-cookbooks
Great investigative journalism, interviewing the legal team behind the current big patent-troll shakedown; that on scanning documents with a button press, using a scanner attached to a network. They express whole-hearted belief in the legality of their actions, unsurprisingly — they’re exactly what you think they’d be like (via Nelson)
Pretty good April Fools from this year — a patch to delete the entirety of Hadoop’s codebase:
To avoid any bias to the existing code and make the same mistakes we should just delete trunk completely. Attached it is a script that deletes everything.
+1. RVM is atrocious code — some of the worst bash script I’ve seen. And it’s not just installing as a command, it requires that it be sourced and hooks into your login shell. If you then use “set -e”, it crashes; “set -u”, it crashes; reset $HOME, crash. It’s dire.
Next April 11th, at the IIEA in North Gt Georges St:
Rick Falkvinge, founder of the Swedish Pirate Party, will examine the case for reform of copyright and patent law in the EU. Legalised file sharing, free sampling and shortened copyright protection times are the main elements of a proposal co-authored by Mr. Falkvinge which was submitted to the European Parliament in 2012. He will question whether, in the context of ever-increasing online activity, existing legal frameworks pose a threat to users’ civil liberties.
yeah yeah, Mongo. bookmarking for the good data on EBS+PIOPS
These notes are intended to help users and system administrators maximize TCP/IP performance on their computer systems. They summarize all of the end-system (computer system) network tuning issues including a tutorial on TCP tuning, easy configuration checks for non-experts, and a repository of operating system specific instructions for getting the best possible network performance on these platforms.Some tips for maximizing HPC network performance for the intra-DC case; recommended by the LinkedIn Kafka operations page.
good docs from EC2
an open source virtualized Ethernet networking stack. I am developing Snabb Switch in response to several exciting trends: x86 has risen to be a powerful networking platform. Virtualization and SDN are pulling more networking into servers. Optimized user-space software is out-performing kernel-space software. Snabb Switch’s simple and fast software-only data plane makes developing networking software easier than ever before.Written in LuaJIT but aiming to be very fast. cool stuff, worth watching
what, is this the first time our spam filtering approach of hashing a giant feature space is hitting mainstream machine learning? that can’t be right!
Joel On Software weighs in (via Tony Finch):
The fastest growing industry in the US right now, even during this time of slow economic growth, is probably the patent troll protection racket industry.
Cap’n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers.Basically, marshalling like writing an aligned C struct to the wire, QNX messaging protocol-style. Wasteful on space, but responds to this by suggesting compression (which is a fair point tbh). C++-only for now. I’m not seeing the same kind of support for optional data that protobufs has though. Overall I’m worried there’s some useful features being omitted here…
Shared read-only data is easy to scale by using well-understood replication techniques. However, sharing mutable data at a large scale is a dicult problem, because of the CAP impossibility result . Two approaches dominate in practice. One ensures scalability by giving up consistency guarantees, for instance using the Last-Writer-Wins (LWW) approach . The alternative guarantees consistency by serialising all updates, which does not scale beyond a small cluster . Optimistic replication allows replicas to diverge, eventually resolving conflicts either by LWW-like methods or by serialisation . In some (limited) cases, a radical simplication is possible. If concurrent updates to some datum commute, and all of its replicas execute all updates in causal order, then the replicas converge.1 We call this a Commutative Replicated Data Type (CRDT). The CRDT approach ensures that there are no conflicts, hence, no need for consensus-based concurrency control. CRDTs are not a universal solution, but, perhaps surprisingly, we were able to design highly useful CRDTs. This new research direction is promising as it ensures consistency in the large scale at a low cost, at least for some applications.
‘The CRDT toolbox provides a collection of basic Conflict-free replicated data types as well as a common interface for defining your own CRDTs’. – in Eric Moritz’ github. Also includes some more links to CRDT background reading.
implementing CRDTs in Riak and Voldemort
What do you get if you take one accountant with “a fondness for spreadsheets, finance and business” and mix with “a life-long passion for video games”? Well it’s obvious isn’t it? A turn-based RPG made and played entirely in Microsoft Excel.(via Paul Moloney)
With serverspec, you can write RSpec tests for checking your servers are provisioned correctly. Serverspec tests your servers’ actual state through SSH access, so you don’t need to install any agent softwares on your servers and can use any provisioning tools, Puppet, Chef, CFEngine and so on.(via Dave Doran)
Joshua’s old tip on watching videos at 2x speed using Perian
This seems pretty significant. Is the tide turning in the Texas Eastern District against patent trolls, at last? And does it establish sufficient precedent?
A federal judge has thrown out a patent claim against Rackspace, ruling that mathematical algorithms can’t be patented. The ruling in the Eastern Disrict stemmed from a 2012 complaint filed by Uniloc USA asserting that processing of floating point numbers by the Linux operating system was a patent violation. Chief Judge Leonard Davis based the ruling on U.S. Supreme Court case law that prohibits the patenting of mathematical algorithms. According to Rackspace, this is the first reported instance in which the Eastern District of Texas has granted an early motion to dismiss finding a patent invalid because it claimed unpatentable subject matter. Red Hat, which supplies Linux to Rackspace, provided Rackspace’s defense. Red Hat has a policy of standing behind customers through its Open Source Assurance program.See https://news.ycombinator.com/item?id=5455869 for more discussion.
A distributed, fault-tolerant “cron” is something which comes up frequently — it makes for a great fault-tolerance building block. This one sounds like it’s too closely tied into Mesos, though (IMO).
Chronos is our replacement for cron. It is a distributed and fault-tolerant scheduler which runs on top of Mesos. It’s a framework and supports custom mesos executors as well as the default command executor. Thus by default, Chronos executes SH (on most systems BASH) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transfering files and executing them on a remote machine in the background and using asynchroneous callbacks to notify Chronos of job completion or failures.
Having a bad day on the Internet is nothing new. These are the types of events we deal with on a regular basis, and most large network operators are very good at responding quickly to deal with situations like this. In our case, we worked with Cloudflare to quickly identify the attack profile, rolled out global filters on our network to limit the attack traffic without adversely impacting legitimate users, and worked with our other partner networks (like NTT) to do the same. If the attacks had stopped here, nobody in the “mainstream media” would have noticed, and it would have been just another fun day for a few geeks on the Internet. The next part is where things got interesting, and is the part that nobody outside of extremely technical circles has actually bothered to try and understand yet. After attacking Cloudflare and their upstream Internet providers directly stopped having the desired effect, the attackers turned to any other interconnection point they could find, and stumbled upon Internet Exchange Points like LINX (in London), AMS-IX (in Amsterdam), and DEC-IX (in Frankfurt), three of the largest IXPs in the world. An IXP is an “interconnection fabric”, or essentially just a large switched LAN, which acts as a common meeting point for different networks to connect and exchange traffic with each other. One downside to the way this architecture works is that there is a single big IP block used at each of these IXPs, where every network who interconnects is given 1 IP address, and this IP block CAN be globally routable. When the attackers stumbled upon this, probably by accident, it resulted in a lot of bogus traffic being injected into the IXP fabrics in an unusual way, until the IXP operators were able to work with everyone to make certain the IXP IP blocks weren’t being globally re-advertised. Note that the vast majority of global Internet traffic does NOT travel over IXPs, but rather goes via direct private interconnections between specific networks. The IXP traffic represents more of the “long tail” of Internet traffic exchange, a larger number of smaller networks, which collectively still adds up to be a pretty big chunk of traffic. So, what you actually saw in this attack was a larger number of smaller networks being affected by something which was an completely unrelated and unintended side-effect of the actual attacks, and thus *poof* you have the recipe for a lot of people talking about it. :) Hopefully that clears up a bit of the situation.
Excellent data, this. I’d heard a few of these prices, but these graphs really hit home. $26k for a caesarean section at the 95th percentile!? talk about out of control price gouging.
A nice set of practical web/UI/tpyography design guidelines, naming specific sources (via Rob C)
‘13 Security Gotchas You Should Know About’
‘One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it – if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning – we should be capturing it. ‘ …. ‘There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.’ I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don’t sync up.
From JPL’s Laboratory for Reliable Software (LaRS). Great reference; there’s some really useful recommendations here, and good explanations of familiar ones like “prefer composition over inheritance”. Many are supported by FindBugs, too. Here’s the full list:
compile with checks turned on; apply static analysis; document public elements; write unit tests; use the standard naming conventions; do not override field or class names; make imports explicit; do not have cyclic package and class dependencies; obey the contract for equals(); define both equals() and hashCode(); define equals when adding fields; define equals with parameter type Object; do not use finalizers; do not implement the Cloneable interface; do not call nonfinal methods in constructors; select composition over inheritance; make fields private; do not use static mutable fields; declare immutable fields final; initialize fields before use; use assertions; use annotations; restrict method overloading; do not assign to parameters; do not return null arrays or collections; do not call System.exit; have one concept per line; use braces in control structures; do not have empty blocks; use breaks in switch statements; end switch statements with default; terminate if-else-if with else; restrict side effects in expressions; use named constants for non-trivial literals; make operator precedence explicit; do not use reference equality; use only short-circuit logic operators; do not use octal values; do not use floating point equality; use one result type in conditional expressions; do not use string concatenation operator in loops; do not drop exceptions; do not abruptly exit a finally block; use generics; use interfaces as types when available; use primitive types; do not remove literals from collections; restrict numeric conversions; program against data races; program against deadlocks; do not rely on the scheduler for synchronization; wait and notify safely; reduce code complexity
a barely-averted disaster… phew.
while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored… or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen — backups protect against this, but RAID, replication, and mirrors do not. Another risk: they didn’t have a rate limit on project-deletion, which resulted in the “anongit” mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: ‘If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).’
Metrics rule the roost — I guess there’s been a long history of telemetry in space applications.
To make software more visible, you need to know what it is doing, he said, which means creating “metrics on everything you can think of”…. Those metrics should cover areas like performance, network utilization, CPU load, and so on. The metrics gathered, whether from testing or real-world use, should be stored as it is “incredibly valuable” to be able to go back through them, he said. For his systems, telemetry data is stored with the program metrics, as is the version of all of the code running so that everything can be reproduced if needed. SpaceX has programs to parse the metrics data and raise an alarm when “something goes bad”. It is important to automate that, Rose said, because forcing a human to do it “would suck”. The same programs run on the data whether it is generated from a developer’s test, from a run on the spacecraft, or from a mission. Any failures should be seen as an opportunity to add new metrics. It takes a while to “get into the rhythm” of doing so, but it is “very useful”. He likes to “geek out on error reporting”, using tools like libSegFault and ftrace. Automation is important, and continuous integration is “very valuable”, Rose said. He suggested building for every platform all of the time, even for “things you don’t use any more”. SpaceX does that and has found interesting problems when building unused code. Unit tests are run from the continuous integration system any time the code changes. “Everyone here has 100% unit test coverage”, he joked, but running whatever tests are available, and creating new ones is useful. When he worked on video games, they had a test to just “warp” the character to random locations in a level and had it look in the four directions, which regularly found problems. “Automate process processes”, he said. Things like coding standards, static analysis, spaces vs. tabs, or detecting the use of Emacs should be done automatically. SpaceX has a complicated process where changes cannot be made without tickets, code review, signoffs, and so forth, but all of that is checked automatically. If static analysis is part of the workflow, make it such that the code will not build unless it passes that analysis step. When the build fails, it should “fail loudly” with a “monitor that starts flashing red” and email to everyone on the team. When that happens, you should “respond immediately” to fix the problem. In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build. They found that “100% of software engineers don’t like Justin Bieber”, and will work quickly to fix the build problem.
‘the story of ketchup is a story of globalization and centuries of economic domination by a world superpower. But the superpower isn’t America, and the century isn’t ours. Ketchup’s origins in the fermented sauces of China and Southeast Asia mean that those little plastic packets under the seat of your car are a direct result of Chinese and Asian domination of a single global world economy for most of the last millenium.’