The Scunthorpe problem: the limitations of web filtering based on automatic categorization

December 19, 2023

Content filtering or web filtering has been around for over 20 years and provides effective security for employee web traffic.

However, given the sheer number of websites (nearly 2 billion worldwide in 2023, by some estimates), categorizing URLs is difficult, both in terms of accuracy and in terms of keeping the information up to date: the site or content behind a URL can change over the years, and its category with it.

Automatic categorization based on semantic content analysis therefore appears to be an effective solution that can be scaled up. It is also an approach adopted by most web filtering providers around the world, resulting in some amusing situations, which we will look at in this article.

The Scunthorpe problem

The Scunthorpe problem takes its name from a town in the United Kingdom. In 1996, the town became unwittingly famous because of the automated filtering applied by certain online services, notably spam filters and parental controls. The filters censored the term "Scunthorpe" because it contains the letter sequence "c-u-n-t," which was considered offensive (we'll leave you to look up the meaning).

This had comical consequences, but also caused problems for the residents of Scunthorpe, who encountered difficulties communicating online or registering on certain websites.

Among other amusing situations, we can cite the official Facebook page of the town of Bitche in Moselle, France, which Facebook took down between March and April 2021 because the town's name resembles an English word with a very different meaning.

Similarly, emails containing the word "specialist," which is commonly found in resumes, were filtered and sent to the spam folder because they contained the character string "cialis." Cialis is a brand of medication used to treat erectile dysfunction (tadalafil, for those asking on behalf of a friend), and its name is often used by spammers. The words "socialist" and "socialism" were blocked by the same filter for the same reason.
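The mechanism behind these examples can be reproduced in a few lines. The sketch below is an illustration only, not the code of any real filter: the blocklist and the function name `naive_filter` are assumptions chosen to match the examples above. It flags a text as soon as a blocked string appears anywhere inside it, which is exactly what produces the false positives.

```python
# Hypothetical blocklist, chosen only to reproduce the examples in this article.
BLOCKED_SUBSTRINGS = ["cunt", "cialis", "bitch"]


def naive_filter(text: str) -> list[str]:
    """Return every blocked string found anywhere inside the text."""
    lowered = text.lower()
    return [term for term in BLOCKED_SUBSTRINGS if term in lowered]


for sample in ["Scunthorpe", "Ville de Bitche", "Heart specialist", "socialism"]:
    print(sample, "->", naive_filter(sample))
# Scunthorpe -> ['cunt']
# Ville de Bitche -> ['bitch']
# Heart specialist -> ['cialis']
# socialism -> ['cialis']
```

None of the four samples is offensive, yet all four are flagged, because the check ignores word boundaries and context entirely.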

The challenges of automatic content categorization

The Scunthorpe problem illustrates the complex challenges faced by filtering solutions based on automated categorization. These solutions, often built on pattern matching algorithms, seek to block terms or content deemed inappropriate. However, they can also generate false positives, unintentionally censoring innocent terms that contain potentially problematic letter or character sequences.

Regular expressions, a common building block of automated categorization systems, produce errors when seemingly offensive terms appear inside harmless words. More broadly, the nuances of language, wordplay, and cultural context easily escape algorithms. Language is also constantly evolving, which makes keeping filters up to date an ongoing challenge. With more and more Anglicisms and mixed-language content, automatic analyzers have a hard time distinguishing illegitimate content from legitimate content.
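A common refinement is to match whole words only. The sketch below uses Python's `re` module with `\b` word boundaries; the word list and the name `whole_word_filter` are hypothetical and serve only as an illustration. It removes the substring false positives, but it also shows what pattern matching still cannot do: it flags a legitimate professional text and misses a trivially disguised spelling.

```python
import re

# Hypothetical word list, for illustration only.
BLOCKED_WORDS = ["cunt", "cialis", "bitch"]

# \b anchors the match on word boundaries, so "specialist" no longer matches "cialis".
WORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def whole_word_filter(text: str) -> list[str]:
    """Return the blocked words that appear as whole words in the text."""
    return [match.group(0).lower() for match in WORD_PATTERN.finditer(text)]


print(whole_word_filter("Scunthorpe"))                      # []
print(whole_word_filter("Heart specialist"))                # []
print(whole_word_filter("Buy Cialis now"))                  # ['cialis']
print(whole_word_filter("Cialis: notes for pharmacists"))   # ['cialis']
print(whole_word_filter("Buy c1alis now"))                  # []
```

The last two calls are the interesting ones: a text aimed at pharmacists is treated like spam, while the spam variant with a digit slips through. Deciding between the two requires context that a pattern alone does not carry.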

Distinguishing legitimate content from content that suggests illegality through semantics

Few professions are as demanding as that of healthcare workers. Doctors, nurses, and auxiliary staff all share a need for immediate access to relevant information that can save lives. They also work with a vocabulary that may suggest illegal content, or content whose consultation is regulated by law (medications, chemical compounds, drugs, illnesses, addiction, suicide, etc.).

Under these circumstances, the IT manager or CISO of these establishments has the difficult task of implementing solutions to filter illegal content without hindering the work of healthcare professionals. Our experience in the healthcare sector shows that the solutions chosen are not always entirely satisfactory.

All too often, teams find the websites they need for their work blocked. We regularly hear of workarounds put in place to bypass filtering solutions: use of personal equipment or internet connections, installation of an independent internet box, etc. These workarounds may solve the immediate problem, but they expose organizations to significant legal and security risks.

The main reason for these difficulties lies at the very heart of filtering solutions: the database of websites (or URLs) and their categorization. Most filtering solutions use global databases with a single categorization system that is common to all countries, without taking into account the cultural, legal, and linguistic specificities of each country. Similarly, a solution that relies solely on automatic categorization based on semantic analysis of websites cannot distinguish between a site dealing with the illegal sale of marijuana and a site selling therapeutic products for legal use by healthcare establishments.

Human operators: the only solution for limiting false positives

The only approach that is currently satisfactory and limits false positives involves validation of the site's categorization by a human operator, assisted, of course, by an initial automatic pre-classification—whether AI-based or not.

This choice ensures that legitimate sites can be accessed by teams without impacting their work, while blocking sites with illegal content.
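The workflow described in this section can be sketched as a simple review queue. This is a minimal illustration under stated assumptions, not a description of any vendor's implementation: the class `ReviewQueue`, its methods, the category labels, and the example URLs are all hypothetical. The point it shows is that the automatic suggestion is only stored as pending, and a category is published only once an operator confirms or corrects it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewQueue:
    """Holds automatic suggestions until a human operator validates them."""

    pending: dict[str, str] = field(default_factory=dict)    # url -> suggested category
    published: dict[str, str] = field(default_factory=dict)  # url -> validated category

    def submit(self, url: str, suggested: str) -> None:
        """Record the automatic pre-classification; nothing is published yet."""
        self.pending[url] = suggested

    def validate(self, url: str, corrected: Optional[str] = None) -> str:
        """The operator confirms the suggestion, or overrides it with `corrected`."""
        suggested = self.pending.pop(url)
        final = corrected if corrected is not None else suggested
        self.published[url] = final
        return final


queue = ReviewQueue()

# Both sites mention cannabis, so the automatic step suggests the same category.
queue.submit("https://pharmacy.example/therapeutic-cannabis", "illegal-drugs")
queue.submit("https://shop.example/buy-weed", "illegal-drugs")

# The operator corrects the first suggestion and confirms the second.
queue.validate("https://pharmacy.example/therapeutic-cannabis", corrected="health")
queue.validate("https://shop.example/buy-weed")

print(queue.published)
```

Keeping pending and published entries separate means an unreviewed suggestion never reaches the filtering policy, which is the property the human validation step is meant to guarantee.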
