CloudBurst : ‘Highly Sensitive Short Read Mapping with MapReduce’. The current state of the art in DNA sequence read-mapping algorithms.
CloudBurst uses well-known seed-and-extend algorithms to map reads to a reference genome. It can map reads with any number of differences or mismatches. [..] Given an exact seed, CloudBurst attempts to extend the alignment into an end-to-end alignment with at most k mismatches or differences by either counting mismatches of the two sequences, or with a dynamic programming algorithm to allow for gaps. CloudBurst uses [Hadoop] to catalog and extend the seeds. In the map phase, the map function emits all length-s k-mers from the reference sequences, and all non-overlapping length-s kmers from the reads. In the shuffle phase, read and reference kmers are brought together. In the reduce phase, the seeds are extended into end-to-end alignments. The power of MapReduce and CloudBurst is the map and reduce functions run in parallel over dozens or hundreds of processors.
JM_SOUGHT: the next generation ;)
(tags: bioinformatics mapreduce hadoop read-alignment dna sequencing sought antispam algorithms)
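The map / shuffle / reduce flow quoted above can be sketched in a single process. This is only an illustration, not CloudBurst's actual code: it handles mismatches only (no gaps, so no dynamic programming), and every function name, the `("ref"|"read", id, offset)` record layout, and the choice of seed length `s = read_length // (k + 1)` are assumptions made for the example.

```python
from collections import defaultdict


def map_reference(ref_id, ref, s):
    """Map phase: emit every length-s k-mer of a reference sequence."""
    for pos in range(len(ref) - s + 1):
        yield ref[pos:pos + s], ("ref", ref_id, pos)


def map_read(read_id, read, s):
    """Map phase: emit the non-overlapping length-s k-mers of a read."""
    for off in range(0, len(read) - s + 1, s):
        yield read[off:off + s], ("read", read_id, off)


def shuffle(pairs):
    """Shuffle phase: bring read and reference k-mers with the same key together."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_seed(values, refs, reads, k):
    """Reduce phase: extend each shared exact seed into an end-to-end alignment."""
    ref_hits = [(i, pos) for kind, i, pos in values if kind == "ref"]
    read_hits = [(i, off) for kind, i, off in values if kind == "read"]
    for ref_id, ref_pos in ref_hits:
        for read_id, read_off in read_hits:
            read, ref = reads[read_id], refs[ref_id]
            start = ref_pos - read_off  # where the whole read would begin
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= k:
                yield read_id, ref_id, start, mismatches


def map_reads(refs, reads, k):
    """Run the three phases sequentially; a real cluster would run them in parallel."""
    s = min(len(r) for r in reads.values()) // (k + 1)  # assumed seed length
    if s == 0:
        raise ValueError("reads are too short for this many mismatches")
    pairs = []
    for ref_id, ref in refs.items():
        pairs.extend(map_reference(ref_id, ref, s))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, s))
    alignments = set()  # the same alignment can be found from several seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, refs, reads, k))
    return sorted(alignments)


refs = {"chr": "ACGTACGTTAGCCGATAGGCTA"}
reads = {"r1": "TAGCCGAT", "r2": "TAGCCGTT"}  # r2 differs from the reference at one base
alignments = map_reads(refs, reads, k=1)
print(alignments)
```

Splitting each read into k + 1 non-overlapping seeds means that a read with at most k mismatches must share at least one seed exactly with the reference, which is why the reduce step only needs to examine k-mers that were grouped together by the shuffle.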
Expensive lessons in Python performance tuning : some good advice for large-scale Python performance: prun and guppy for profiling, namedtuples for memory efficiency, and picloud for trivial EC2-based scale-out. (via Nelson)
(tags: picloud prun guppy namedtuples python optimization performance tuning profiling)
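As a rough illustration of the namedtuple point (not taken from the linked post), the snippet below compares the shallow size of one record stored as a dict with the same record stored as a namedtuple. The `Point` type and its fields are made up for the example, and `sys.getsizeof` reports only the container itself, not the values it references, so treat the numbers as indicative on CPython rather than exact.

```python
import sys
from collections import namedtuple

# Hypothetical record type, defined only for this comparison.
Point = namedtuple("Point", ["x", "y", "z"])

as_dict = {"x": 1.0, "y": 2.0, "z": 3.0}
as_tuple = Point(x=1.0, y=2.0, z=3.0)

# Shallow sizes in bytes; exact values vary by Python version and platform.
dict_bytes = sys.getsizeof(as_dict)
tuple_bytes = sys.getsizeof(as_tuple)

print(f"dict: {dict_bytes} bytes, namedtuple: {tuple_bytes} bytes")
print(as_tuple.x, as_tuple[0])  # fields stay readable by name or by index
```

The saving per record is small, but it adds up when millions of records are held in memory at once.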
On Patents : Notch comes up with a perfect analogy for software patents.
I am mostly fine with the concept of “selling stuff you made”, so I’m also against copyright infringement. I don’t think it’s quite as bad as theft, and I’m not sure it’s good for society that some professions can get paid over and over long after they did the work (say, in the case of a game developer), whereas others need to perform the job over and over to get paid (say, in the case of a hairdresser or a lawyer). But yeah, “selling stuff you made” is good. But there is no way in hell you can convince me that it’s beneficial for society to not share ideas. Ideas are free. They improve on old things, make them better, and this results in all of society being better. Sharing ideas is how we improve. A common argument for patents is that inventors won’t invent unless they can protect their ideas. The problem with this argument is that patents apply even if the infringer came up with the idea independently. If the idea is that easy to think of, why do we need to reward the person who happened to be first?
Of course, in reality it’s even worse, since you don’t actually have to be first to invent, just first to file without sufficient people noticing, and people are actively dissuaded from noticing (since it makes their lives riskier if they know about the existence of patents)…
(tags: business legal ip copyright patents notch minecraft patent-trolls)
Marsh’s Library : Dublin museum of antiquarian books, open to the public. Apparently well worth a visit (I suspect I will be making my way there soon) to check out their new “Marvels of Science” exhibit. They also have a beautiful website with some great photos; exemplary.
(tags: museum dublin ireland libraries books science)
‘Poisoning Attacks against Support Vector Machines’, Battista Biggio, Blaine Nelson, Pavel Laskov : The perils of auto-training SVMs on unvetted input.
We investigate a family of poisoning attacks against Support Vector Machines (SVM). Such attacks inject specially crafted training data that increases the SVM’s test error. Central to the motivation for these attacks is the fact that most learning algorithms assume that their training data comes from a natural or well-behaved distribution. However, this assumption does not generally hold in security-sensitive settings. As we demonstrate, an intelligent adversary can, to some extent, predict the change of the SVM’s decision function due to malicious input and use this ability to construct malicious data. The proposed attack uses a gradient ascent strategy in which the gradient is computed based on properties of the SVM’s optimal solution. This method can be kernelized and enables the attack to be constructed in the input space even for non-linear kernels. We experimentally demonstrate that our gradient ascent procedure reliably identifies good local maxima of the non-convex validation error surface, which significantly increases the classifier’s test error.
Via Alexandre Dulaunoy.
(tags: papers svm machine-learning poisoning auto-learning security via:adulau)