a software-based, large-scale regex matcher designed to match multiple patterns at once (up to tens of thousands of patterns at once) and to ‘stream’ (that is, match patterns across many different ‘stream writes’ without holding on to all the data you’ve ever seen). To my knowledge this makes it unique. RE2 is software-based but doesn’t scale to large numbers of patterns; nor does it stream (although it could). It occupies a fundamentally different niche to Hyperscan; we compared the performance of RE2::Set (the RE2 multiple pattern interface) to Hyperscan a while back. Most backtracking matchers (such as libpcre) are one pattern at a time and are inherently incapable of streaming, due to their requirement to backtrack into arbitrary amounts of old input.
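To make the ‘streaming’ idea concrete, here is a minimal sketch of a matcher that carries only an automaton state from one write to the next and never keeps old input. It is not Hyperscan's algorithm or API: it swaps in a textbook Aho-Corasick automaton, which handles literal strings only, not regexes. The names `StreamMatcher` and `write` are hypothetical, chosen for illustration.

```python
from collections import deque


class StreamMatcher:
    """Toy streaming multi-literal matcher (Aho-Corasick), illustration only."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transitions: char -> next state
        self.fail = [0]    # failure link for each state
        self.out = [[]]    # patterns that end at each state
        for p in patterns:
            s = 0
            for ch in p:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s].append(p)
        # Breadth-first pass fills in failure links, shallowest states first.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[t] = self.out[t] + self.out[self.fail[t]]
                queue.append(t)
        self.state = 0    # the only matching state kept between writes
        self.offset = 0   # total characters consumed so far

    def write(self, chunk):
        """Consume one stream write; return (end_offset, pattern) pairs."""
        matches = []
        for ch in chunk:
            while self.state and ch not in self.goto[self.state]:
                self.state = self.fail[self.state]
            self.state = self.goto[self.state].get(ch, 0)
            self.offset += 1
            for p in self.out[self.state]:
                matches.append((self.offset, p))
        return matches


m = StreamMatcher(["hers", "she", "he"])
first = m.write("us")    # nothing complete yet
second = m.write("he")   # "she" spans the first two writes
third = m.write("rs")    # "hers" spans the last two writes
```

The matcher remembers a single integer state and a running offset between writes, which is why patterns that straddle write boundaries are still found. A backtracking matcher cannot work this way, because it may need to revisit input that a streaming caller has already discarded.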
Extremely thought-provoking thread on the horrors of Facebook/YouTube content moderation, from Andrew Strait:
My time doing this work convinced me there is no ultimate mitigation measure for the mental harm it causes. Automation is not a silver bullet – it requires massive labeled data sets by moderators on a continuing basis to ensure accuracy and proper model fit. There are steps to make this process less worse, but IMO it all comes back to a basic question – what technologies are worth the incredible human suffering and cost that moderators will inevitably experience? Is image search worth it? Is YouTube? Is Facebook? I don’t have an answer. But these platforms create the need for this kind of horrific work and that must be considered at the forefront of design and deployment of any platform, not as an afterthought.