The Scunthorpe problem: the limits of automatic categorization-based web filtering

Cyber news
December 19, 2023

Content filtering or web filtering has been around for over 20 years, and is an effective way of securing employees' web traffic.

However, given the sheer number of websites (close to 2 billion worldwide in 2023), categorizing URLs is problematic, both in terms of accuracy and in terms of keeping the information up to date: the site or content behind a URL may change over the years, and its category with it.

Automatic categorization, based on semantic content analysis, therefore seems to be a solution that is both effective and scalable. In fact, it's the approach adopted by most of the world's web filtering vendors, and it leads to some of the funny situations we'll be looking at in this article.

The Scunthorpe problem

The Scunthorpe problem is named after the town of Scunthorpe in the UK. In 1996, the town became famous in spite of itself because of the automated filtering applied by certain online services, notably spam filters and parental controls. These filters censored the name "Scunthorpe" because it contains the letter sequence "c-u-n-t", which is considered offensive (we'll let you look up the meaning).

This had both comical and problematic consequences for Scunthorpe residents, who had difficulty communicating online or registering on certain sites.

Among other amusing situations, the official Facebook page of the town of Bitche, in Moselle (France), was taken down by Facebook between March and April 2021 because the town's name resembles an English insult.

Similarly, e-mails containing the word "specialist", frequently used in CVs, were filtered out and sent to the spam folder because the word contains the character string "cialis". Cialis is a brand of erectile dysfunction medication (tadalafil, for those asking on behalf of a friend) that spammers frequently advertise. The words "socialist" and "socialism" were blocked by the same filter for the same reason.
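To make the mechanism concrete, here is a minimal sketch of a naive substring filter that reproduces this kind of false positive. The blocklist and the function name are hypothetical, chosen for illustration; this is not the code of any real spam filter.

```python
from typing import List

# Hypothetical blocklist, for illustration only.
BLOCKED_SUBSTRINGS = ["cialis"]


def naive_filter(text: str) -> List[str]:
    """Return every blocked string found anywhere in the text, ignoring case."""
    lowered = text.lower()
    return [term for term in BLOCKED_SUBSTRINGS if term in lowered]


if __name__ == "__main__":
    for sample in ["Senior security specialist", "Socialist party", "Quarterly report"]:
        hits = naive_filter(sample)
        print(sample, "->", hits if hits else "clean")
```

Because the test is a plain substring search, any word that happens to embed a blocked string is flagged, whatever it actually means.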

The challenges of automatic content categorization

The Scunthorpe problem is an example of the complex challenges faced by filtering solutions that rely on automatic categorization. These solutions, often based on pattern-matching algorithms, seek to block terms or content deemed inappropriate. However, they also generate false positives, unwittingly censoring innocent terms that contain potentially problematic sequences of letters or characters.

The use of regular expressions, a common method in the development of automated categorization systems, leads to errors when apparently offensive strings appear inside innocuous words. More broadly, the nuances of language, wordplay and cultural context easily elude algorithms. Language is also constantly evolving, which makes keeping filters up to date a permanent challenge. With more and more anglicisms and mixed languages within the same content, automatic analyzers have their work cut out to distinguish illegitimate content from legitimate content.
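One common mitigation, which the article does not describe and which is shown here only as an assumption, is to match whole words with regex word boundaries rather than raw substrings. The sketch below uses Python's standard `re` module; the blocklist and function name are hypothetical.

```python
import re
from typing import List

# Hypothetical blocklist, for illustration only.
BLOCKED_WORDS = ["cialis"]

# \b anchors the match on word boundaries, so "specialist" no longer matches.
WHOLE_WORD = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def whole_word_hits(text: str) -> List[str]:
    """Return blocked words that appear as standalone words in the text."""
    return [match.group(0).lower() for match in WHOLE_WORD.finditer(text)]


if __name__ == "__main__":
    print(whole_word_hits("Hiring a security specialist"))
    print(whole_word_hits("Buy Cialis now"))
    print(whole_word_hits("Pharmacology note: Cialis (tadalafil) interactions"))
```

Word boundaries remove the "specialist" false positive, but the spam message and the pharmacology note are still treated identically: the pattern has no notion of context, which is exactly the limit discussed here.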

Distinguishing legitimate content from vocabulary that suggests illegal content

Few professions are as demanding as healthcare. Doctors, nurses and auxiliary staff all share a need for immediate access to relevant, life-saving information. They also work with a vocabulary that may suggest illicit content, or content whose consultation is regulated by law (medicines, chemical compounds, narcotics, illnesses, drug addiction, suicide, etc.).

In these conditions, the IT manager or CISO of these establishments has the difficult task of implementing filtering solutions for illicit content, without penalizing the work of care providers. Our experience with healthcare establishments shows that the solutions adopted are not always entirely satisfactory.

All too often, care teams find themselves facing blocked sites whose content they need to consult in order to do their work. All too often, we hear of makeshift workarounds (what the French call "système D") being put in place to bypass filtering: use of personal equipment or Internet connections, installation of an independent Internet box, etc. These workarounds may solve the immediate problem, but they expose organizations to significant legal and security risks.

The main reason for these difficulties lies at the very heart of filtering solutions: the database of sites (or URLs) and their categorization. Most filtering solutions use global databases with a single categorization common to all countries, without taking into account the cultural, legal and linguistic specificities of each country. Similarly, if you limit yourself to automatic categorization based on semantic analysis of websites, it's impossible to tell the difference between a site dealing with the sale of illegal marijuana and one selling therapeutic products for legal use in healthcare establishments.

The human operator: the only solution to limit false positives

Today, the only satisfactory approach that limits false positives involves validation of site categorization by a human operator, aided of course by an initial automatic pre-classification - AI-based or otherwise.
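As a rough illustration of this workflow, the sketch below separates the category suggested by an automatic pre-classifier from the category validated by a human reviewer. Everything in it is hypothetical: the class, the function names, the keyword rule standing in for a real classifier, the example URL, and the choice that only validated categories trigger blocking.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class SiteRecord:
    url: str
    suggested_category: str                    # automatic pre-classification
    validated_category: Optional[str] = None   # set by a human reviewer


def pre_classify(url: str, page_text: str) -> SiteRecord:
    """Stand-in for an automatic classifier: a single keyword rule."""
    suggested = "drugs" if "cannabis" in page_text.lower() else "uncategorized"
    return SiteRecord(url=url, suggested_category=suggested)


def validate(record: SiteRecord, reviewer_category: str) -> SiteRecord:
    """Record the category chosen by the human reviewer."""
    record.validated_category = reviewer_category
    return record


def is_blocked(record: SiteRecord, blocked_categories: Set[str]) -> bool:
    """Assumption of this sketch: only human-validated categories block a site."""
    return record.validated_category in blocked_categories


if __name__ == "__main__":
    site = pre_classify(
        "https://example.org/therapeutic-products",  # hypothetical URL
        "Medical cannabis products for hospital pharmacies",
    )
    print(site.suggested_category, is_blocked(site, {"drugs"}))
    validate(site, "health")
    print(site.validated_category, is_blocked(site, {"drugs"}))
```

The automatic step only proposes a category; the reviewer's decision is what the filtering policy actually applies, so a therapeutic site that was pre-classified as "drugs" can be corrected before anyone is blocked.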

This choice ensures that legitimate sites remain accessible to teams, without impacting their work, while sites with illicit content are blocked.