August 14, 2025

Fighting AI and other Bots

Image showing AI scanned by Cloudflare

At the moment AI crawlers and other malicious bots are the scourge of the internet. They ignore the conventions and implicit rules that were built on trust over many years. Cloudflare has called out Perplexity for this abuse in “Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives”. You can also read more on the impact of these bots in “The crawl before the fall… of referrals: understanding AI’s impact on content providers”.

At the end of July 2025 I was a victim of abusive crawling by an army of bots that ignored the robots.txt files on my sites, which are hosted on a small VPS (Virtual Private Server) at vultr (Affiliate link). The site that was hit the hardest was https://webtrees.greylingfamily.org.za. The bots were iterating through the calendar on the site day by day. The calendar is excluded from crawling by robots.txt. A single bot crawling the calendar would follow about 130000 links. A single bot is not a huge problem, other than causing a lot of useless hits in the stats. There were, however, more than 1000 bots doing the crawling at any one time, which caused the site to run out of memory and crash. These bots also ignored the crawl delay of 10 seconds set in the robots.txt file.
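To illustrate what the bots were ignoring, here is a minimal sketch of how a well-behaved crawler would read such rules, using Python's standard `urllib.robotparser`. The robots.txt content, the `/calendar/` path and the `ExampleBot` name are assumptions made for the example; the real file on the site may look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the real file and calendar path may differ.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://webtrees.greylingfamily.org.za"


def may_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A polite crawler checks every URL and waits between requests.
    print("calendar allowed:", may_fetch(BASE + "/calendar/2025-07-31"))
    print("home page allowed:", may_fetch(BASE + "/"))
    print("crawl delay (s):", parser.crawl_delay("ExampleBot"))
```

A crawler that honours these rules skips the calendar entirely and makes at most one request every 10 seconds. The bots described above did neither.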

Previously I used some simple measures to fight bots on the VPS. The sites on the VPS are all behind Cloudflare’s CDN. Years ago I did the basic settings on Cloudflare without thinking too much about it. I also added nginx ultimate bad bot blocker to the nginx configuration. I then used fail2ban to block the IP addresses of the bots caught by the bad bot blocker using IPFW on FreeBSD, and to add those addresses to the IP blocklist on Cloudflare via their API. This worked very well until mid-July 2025.

I first noticed a problem when the php-fpm pool for https://webtrees.greylingfamily.org.za started crashing with memory errors. A quick look at the logs showed multiple connections to the calendar function. All the connections showed User Agents that looked legitimate at first glance, making blocking them with “nginx ultimate bad bot blocker” a game of Whack-a-Mole. I then used whob to see where the connections were coming from. I quickly saw they came from all over, but mostly from Brazil.
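The kind of log check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the nginx "combined" log format and a `/calendar/` path fragment, and the sample lines are made up with IP addresses from the documentation ranges.

```python
import re
from collections import Counter

# Assumes the nginx "combined" log format; adjust if your log_format differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_calendar_hits(lines, needle="/calendar/"):
    """Count requests whose path contains `needle`, per client IP and per User Agent."""
    ips, agents = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("path"):
            ips[match.group("ip")] += 1
            agents[match.group("agent")] += 1
    return ips, agents


# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [30/Jul/2025:10:00:00 +0200] "GET /calendar/day/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '198.51.100.7 - - [30/Jul/2025:10:00:01 +0200] "GET /calendar/day/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"',
    '203.0.113.5 - - [30/Jul/2025:10:00:02 +0200] "GET /index.php HTTP/1.1" 200 128 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

if __name__ == "__main__":
    ips, agents = count_calendar_hits(SAMPLE)
    for ip, hits in ips.most_common(10):
        print(ip, hits)
```

When the hits are spread thinly over a very large number of addresses and ordinary-looking User Agents, as they were here, per-IP and per-agent blocking stops being practical.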

I came to the conclusion that I probably wouldn’t be able to do the blocking on my server and decided to look at the possibilities on Cloudflare. A look at the security event log showed that the traffic came from Brazil, Singapore, Vietnam and India.

As a first step I created a WAF custom rule to block those countries. The rule is simple and the rule expression looks like this:

(ip.src.country eq "VN") or (ip.src.country eq "BR") or (ip.src.country eq "IN") or (ip.src.country eq "SG")

The Cloudflare interface allows you to create a rule with a graphical expression builder or edit it directly in a textbox.
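If the list of countries changes often, the expression for the textbox can be generated rather than typed. The helper below is a hypothetical convenience, not part of Cloudflare's tooling. It only builds the same string shown above from a list of two-letter country codes.

```python
import re


def country_block_expression(codes):
    """Build a Cloudflare rule expression matching any of the given country codes."""
    for code in codes:
        # Guard against typos: expect two uppercase letters, e.g. "BR".
        if not re.fullmatch(r"[A-Z]{2}", code):
            raise ValueError(f"not a two-letter country code: {code!r}")
    return " or ".join(f'(ip.src.country eq "{code}")' for code in codes)


if __name__ == "__main__":
    print(country_block_expression(["VN", "BR", "IN", "SG"]))
```

The output can be pasted straight into the expression textbox.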

This provided immediate relief, which allowed me to look at the other security options on Cloudflare.

I went to the Security-Settings page and enabled all the “Bot Fighting” options excluding the “AI Instruct AI bot traffic with robots.txt”. I switched on the “AI Labyrinth” as well. The “AI Labyrinth” is probably not for everyone, but it is another way to fight AI without using my own resources.

I then created a rule to give a “Managed Challenge” to any connections from anywhere outside of South Africa. Again the rule is simple and the expression looks like this:

(ip.src.country ne "ZA")

After a few days I realised this rule was probably casting too wide a net and disabled it. In its place I created a rule to give a “Managed Challenge” to all connections to the calendar function. The rule expression looks like this:

(http.request.uri.path contains "/calendar/")

The fail2ban setup on the server still runs and updates the Cloudflare “IP access rules” using the Cloudflare API.
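For readers who want to build something similar, here is a rough sketch of the request a fail2ban action could send to create a blocking IP access rule. It is not my actual fail2ban action. The endpoint and payload follow my understanding of the Cloudflare v4 API for IP Access Rules and should be checked against the current API documentation. The zone ID, token and IP address are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_block_request(zone_id, api_token, ip, note="fail2ban"):
    """Build (but do not send) a request that adds a 'block' IP access rule."""
    payload = {
        "mode": "block",
        "configuration": {"target": "ip", "value": ip},
        "notes": note,
    }
    return urllib.request.Request(
        # Assumed endpoint for zone-level IP Access Rules; verify before use.
        f"{API_BASE}/zones/{zone_id}/firewall/access_rules/rules",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    request = build_block_request("ZONE_ID", "API_TOKEN", "203.0.113.5")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect what would be submitted before letting fail2ban call it automatically.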

I also run a few other sites. Grumpy Old Techie and The Grumpy Old Techie’s Photos are also on the same small VPS. I decided to add “Managed Challenges” to the login pages of these sites. The rule expression looks like this:

(http.request.uri.path contains "identification.php") or (http.request.uri.path contains "/wp-login.php") or (http.request.uri.path contains "/wp-admin/" and http.request.uri.path ne "/wp-admin/admin-ajax.php")
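To check which paths this expression catches before deploying it, the logic can be mirrored in plain Python. This is only an approximation of how Cloudflare evaluates the rule. It assumes `contains` is a case-sensitive substring match and that the path excludes the query string. The sample paths are made up.

```python
def needs_challenge(path: str) -> bool:
    """Mirror the login-page rule: True if the path would get a Managed Challenge."""
    return (
        "identification.php" in path
        or "/wp-login.php" in path
        or ("/wp-admin/" in path and path != "/wp-admin/admin-ajax.php")
    )


if __name__ == "__main__":
    for sample in [
        "/wp-login.php",
        "/wp-admin/options.php",
        "/wp-admin/admin-ajax.php",
        "/identification.php",
        "/2025/08/some-post/",
    ]:
        print(sample, needs_challenge(sample))
```

The exception for `/wp-admin/admin-ajax.php` matters because WordPress themes and plugins call that endpoint from the public pages, so challenging it could break the site for normal visitors.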

I also did the same configuration as before on the Security-Settings page for grumpyoldtechie.com.

The rule blocking Brazil, Singapore, Vietnam and India still blocks about 3000 connections a day trying to access the calendar. This is down from about 150000 a day when the problem started, a drop of roughly 98%.

At the moment these measures seem to work reasonably well to keep bots at bay without being too extreme.

© Arnold Greyling 2025