Handling Google's Incessant Errors

I try to run a pretty tight ship here at Impossibly Stupid. When someone comes to my site looking for a file that can't be found (AKA, a 404 error), I know that either they're up to no good (e.g., scanning for a vulnerability) or I have a problem that needs to be fixed. But what do you do when, through no fault of your own, a bad request comes – over and over again – from the world's largest search engine?

You probably see the same sort of thing in your logs. Oftentimes it's endless requests for “common” files you've never hosted:

[Fri Jun 03 19:22:45.324293 2016] [core:info] [pid 13859] [client 66.249.64.105:63196] AH00128: File does not exist: /apple-app-site-association
[Fri Jun 03 20:45:45.364060 2016] [core:info] [pid 13729] [client 66.249.64.45:35691] AH00128: File does not exist: /.well-known/apple-app-site-association
[Fri Jun 03 21:22:05.342801 2016] [core:info] [pid 13608] [client 66.249.64.45:51209] AH00128: File does not exist: /apple-app-site-association
[Fri Jun 03 21:35:26.077749 2016] [core:info] [pid 13621] [client 66.249.64.45:51298] AH00128: File does not exist: /apple-app-site-association
[Fri Jun 03 23:57:07.558024 2016] [core:info] [pid 14179] [client 66.249.64.168:46975] AH00128: File does not exist: /apple-app-site-association
[Sat Jun 04 00:31:23.280907 2016] [core:info] [pid 13729] [client 66.249.64.168:34644] AH00128: File does not exist: /.well-known/apple-app-site-association
[Sat Jun 04 00:37:06.128225 2016] [core:info] [pid 14175] [client 66.249.64.225:38670] AH00128: File does not exist: /.well-known/apple-app-site-association
[Sat Jun 04 01:47:57.351926 2016] [core:info] [pid 14175] [client 66.249.64.225:49894] AH00128: File does not exist: /apple-app-site-association

But the worst are the randomly generated names that Google intentionally fabricates to “test” your server:

66.249.64.235 - - [26/May/2016:11:23:55 -0400] "GET /yrjclqajwyshc.html HTTP/1.1" 204 266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.10 - - [27/May/2016:11:15:02 -0400] "GET /ysveybimgdu.html HTTP/1.1" 204 266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.3 - - [02/Jun/2016:10:20:53 -0400] "GET /iqswwijkbkk.html HTTP/1.1" 204 266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.243 - - [03/Jun/2016:10:11:18 -0400] "GET /qfmtujzxykv.html HTTP/1.1" 204 266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

If you look closely at what I'm sending back, though, you'll see it's not the standard 404 error, but a much more logging-friendly 204 “no content” response. After all, it's not a problem on my end that no such files exist. I do it with this very simple Apache rewrite rule, added to my .htaccess file:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^[a-z]{8,16}\.html$ http://www.google.com/ [R=204,L,CO=google:stop_your_404_probing:impossiblystupid.com]

Notice that I also add a cookie to the response containing a little message for the guys back at Google. I'm certain nobody has ever read it. Still, I think I'm going to have to start using this same technique for the other errant requests, because simply adding placeholder files to stop the 404s isn't nearly as satisfying.
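
For the record, here's a minimal, untested sketch of how that might look for the apple-app-site-association requests from the first log excerpt. It assumes the rule lives in the same .htaccess file at the document root, where mod_rewrite matches the request path without its leading slash. Since 204 is not a redirect status, Apache drops the substitution anyway, so the sketch puts a bare hyphen in its place. The pattern and the decision to skip the cookie are my own assumptions here, not something I've battle-tested.

```apache
# Sketch only: answer requests for Apple's app association file with a 204
# instead of a 404, whether at the site root or under .well-known/.
RewriteEngine on
# Only apply the rule when no such file actually exists on disk.
RewriteCond %{REQUEST_FILENAME} !-f
# "-" means no substitution; R=204 forces the status and stops rewriting.
RewriteRule ^(\.well-known/)?apple-app-site-association$ - [R=204,L]
```

If a real association file ever does get published at either path, the RewriteCond should let it through untouched.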