Do crawler bots no longer obey the robots.txt rules set in a site's root directory?

The robots.txt rules I set in the site's root directory don't seem to have any effect on GPTBot or Facebook's crawler.

User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
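One way to sanity-check the rules themselves is to feed them to Python's standard-library `urllib.robotparser` and ask whether a given user agent may fetch a URL. This is only a sketch: the helper name `is_allowed` and the `example.com` URL are made up for illustration, and the result shows what a compliant parser would conclude, not what any real crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a parser honoring robots_txt would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/?s=search/index"  # hypothetical URL
    for ua in ("GPTBot", "ChatGPT-User", "meta-externalagent"):
        print(ua, is_allowed(ROBOTS_TXT, ua, url))
```

If every agent comes back as not allowed, the file is fine and the problem is on the crawler's side (ignoring the file, or working from a stale cached copy).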

It has been more than 30 hours since I set up robots.txt. If none of them honor robots.txt, the only option left is to handle it in the nginx configuration.

The crawlers have completely saturated my 10 Mbps of bandwidth.

20.171.207.130 - - [17/Oct/2025:09:16:41 +0800] "GET /?s=search/index/cid/323/bid/24/scid/85C4/peid/27/ov/new-asc.html HTTP/1.1" 200 38211 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
117.50.153.198 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/cid/316/scid/85C4/poid/33/bid/8/ov/new-asc/peid/7.html HTTP/1.1" 200 38340 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"
57.141.0.25 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/444/peid/17/bid/12.html HTTP/1.1" 200 637932 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.12 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/631/peid/29/bid/28/ov/price-asc.html HTTP/1.1" 200 637644 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.74 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/C4/cid/608/peid/7/ov/new-asc.html HTTP/1.1" 200 635769 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.63 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/peid/29/bid/24/scid/C4/cid/570/ov/access-desc.html HTTP/1.1" 200 618851 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
117.50.153.198 - - [17/Oct/2025:09:16:43 +0800] "GET /?s=search/index/cid/321/bid/29/scid/85C4/ov/new-desc/peid/7/poid/33.html HTTP/1.1" 200 38368 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"
57.141.0.34 - - [17/Oct/2025:09:16:43 +0800] "GET /?s=search/index/poid/33/peid/18/ov/new-desc/scid/9EBB198E982B/bid/8/cid/367.html HTTP/1.1" 200 467003 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
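To see which client is actually eating the bandwidth, the access log can be tallied by User-Agent. The sketch below assumes the log layout shown above (IP, two dashes, bracketed timestamp, quoted request, status, bytes, quoted referer, quoted User-Agent); the function name `bytes_by_user_agent` and the three sample entries (copied from the log excerpt) are just for illustration.

```python
import re
from collections import Counter

# Pattern for the access-log layout shown above:
# ip - - [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)


def bytes_by_user_agent(lines):
    """Sum response bytes per User-Agent string; unparseable lines are skipped."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        size = match.group("bytes")
        totals[match.group("ua")] += 0 if size == "-" else int(size)
    return totals


META_UA = "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"

# Three entries copied from the log excerpt.
SAMPLE = [
    '20.171.207.130 - - [17/Oct/2025:09:16:41 +0800] '
    '"GET /?s=search/index/cid/323/bid/24/scid/85C4/peid/27/ov/new-asc.html HTTP/1.1" '
    '200 38211 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
    'GPTBot/1.2; +https://openai.com/gptbot)"',
    '57.141.0.25 - - [17/Oct/2025:09:16:42 +0800] '
    '"GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/444/peid/17/bid/12.html HTTP/1.1" '
    '200 637932 "-" "' + META_UA + '"',
    '57.141.0.12 - - [17/Oct/2025:09:16:42 +0800] '
    '"GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/631/peid/29/bid/28/ov/price-asc.html HTTP/1.1" '
    '200 637644 "-" "' + META_UA + '"',
]

if __name__ == "__main__":
    for ua, total in bytes_by_user_agent(SAMPLE).most_common():
        print(total, ua)
```

In the excerpt, each meta-externalagent response is roughly 600 KB while the GPTBot and Firefox responses are under 40 KB, so a per-agent byte total is likely more telling than a request count.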

Addendum 1, 13 hours 8 minutes ago

At the nginx level I now just `return 403;`.

Peace and quiet at last.
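The post does not show the actual nginx rule, so the following is not the author's configuration. It is a rough Python analogue of the same idea (reject by User-Agent with a 403) written as WSGI middleware, for cases where the block has to live in the application instead. The names `block_user_agents`, `hello_app`, and `status_for` are made up for this sketch.

```python
def block_user_agents(app, blocked_substrings):
    """Wrap a WSGI app so that requests whose User-Agent contains any
    blocked substring (case-insensitive) get a 403 instead of reaching app."""
    needles = tuple(s.lower() for s in blocked_substrings)

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(needle in ua for needle in needles):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only for this demonstration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app with a minimal fake request and return its status line."""
    seen = []
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/", "HTTP_USER_AGENT": user_agent}
    list(app(environ, lambda status, headers: seen.append(status)))
    return seen[0]


if __name__ == "__main__":
    guarded = block_user_agents(hello_app, ["gptbot", "meta-externalagent"])
    print(status_for(guarded, "Mozilla/5.0 (compatible; GPTBot/1.2)"))
    print(status_for(guarded, "Mozilla/5.0 Firefox/75.0"))
```

Rejecting in nginx is cheaper, since the request never reaches the application; this version only makes sense when the web server configuration is out of reach.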

25 replies 2025-10-17 18:12:55 +08:00

Configuration

17 hours 43 minutes ago

1. It's a gentlemen's agreement.
2. The UA can be forged.

keer

17 hours 40 minutes ago

@Configuration Seen that way, they aren't being gentlemen at all.

SuperGeorge

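17 hours 38 minutes ago

Calling out YisouSpider: robots.txt has no effect on it whatsoever. Even after I blacklisted both its UA and its IP ranges it keeps crawling like mad, and the 403 status-code alerts have never stopped.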

iugo

17 hours 33 minutes ago

Reference: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

1. The UA is `meta-externalagent`.
2. Check whether the IP is one of the crawler IPs that Meta declares.

As for OpenAI's crawler, no comment.
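The two checks above can be sketched with Python's standard `ipaddress` module. The CIDR range below is a placeholder chosen only so that the example runs against the IPs in the log excerpt; it is NOT Meta's published list, which has to be obtained as described in the linked documentation. The function names are likewise hypothetical.

```python
import ipaddress

# PLACEHOLDER range for illustration only. It is NOT Meta's published list;
# substitute the ranges obtained as described in Meta's crawler documentation.
EXAMPLE_CRAWLER_RANGES = ["57.141.0.0/24"]


def ip_in_ranges(ip, cidr_ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def looks_like_declared_crawler(ip, user_agent, cidr_ranges):
    """Check 1: the UA names meta-externalagent. Check 2: the IP is in a declared range."""
    return "meta-externalagent" in user_agent.lower() and ip_in_ranges(ip, cidr_ranges)


if __name__ == "__main__":
    ua = "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
    for ip in ("57.141.0.25", "117.50.153.198"):
        print(ip, looks_like_declared_crawler(ip, ua, EXAMPLE_CRAWLER_RANGES))
```

A request that passes check 1 but fails check 2 is using the crawler's name from an undeclared address, which is the forged-UA case mentioned earlier in the thread.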

1up

17 hours 31 minutes ago

They have never obeyed it.

Goooooos

17 hours 30 minutes ago

AI crawlers these days don't even consider themselves crawlers; they do whatever they like.

liuidetmks

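17 hours 23 minutes ago

Once a request is identified as coming from an AI crawler, could we serve it randomly scrambled fake text?

Search engines at least send traffic back to the site; AI is pure bloodsucking.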
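A minimal sketch of the scrambled-text idea, assuming crawlers are recognized purely by User-Agent substring (which, as noted earlier in the thread, can be forged). The function names and the marker list are made up for illustration, and whether serving garbage is wise for a given site is a separate question.

```python
import random


def scramble_words(text, seed=None):
    """Return text with its whitespace-separated words in random order."""
    words = text.split()
    rng = random.Random(seed)  # seed is only here to make demos repeatable
    rng.shuffle(words)
    return " ".join(words)


def render_for(user_agent, real_text, ai_markers=("gptbot", "meta-externalagent")):
    """Serve scrambled text to user agents matching ai_markers, real text to everyone else."""
    ua = user_agent.lower()
    if any(marker in ua for marker in ai_markers):
        return scramble_words(real_text)
    return real_text


if __name__ == "__main__":
    page = "robots.txt is a gentlemen's agreement between sites and crawlers"
    print(render_for("Mozilla/5.0 (compatible; GPTBot/1.2)", page))
    print(render_for("Mozilla/5.0 Firefox/75.0", page))
```

Note that this still spends bandwidth on the crawler, so it does nothing for the saturated link described in the original post.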

bgm004

17 hours 13 minutes ago

AI crawlers are just like Xunlei back in the day.

picone

17 hours 9 minutes ago

I noticed this too and now simply return 403 based on the UA. They really are out of control.

laobaiguolai

17 hours 3 minutes ago

I use Cloudflare; their detection and blocking are pretty decent.

opengps

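17 hours 3 minutes ago

I happen to have just done some related work. With search-engine crawlers, at least the UA is explicit. It can easily be forged, but if you don't want them, you can block the official crawlers by UA. (Ethically speaking, an official crawler at least shouldn't blatantly forge its UA.)
As a bonus, here are a few UA keywords for the main AI crawlers I've noticed recently: "mj12bot","openai","gptbot","claudebot","semrushbot","siteauditbot"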
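The keyword list above can be used as a case-insensitive substring match against the User-Agent header. This is a small sketch; `matched_keyword` is a hypothetical helper, and the list is exactly the one given in the reply, not an authoritative registry.

```python
# Keywords exactly as listed in the reply above.
AI_CRAWLER_KEYWORDS = (
    "mj12bot", "openai", "gptbot", "claudebot", "semrushbot", "siteauditbot",
)


def matched_keyword(user_agent, keywords=AI_CRAWLER_KEYWORDS):
    """Return the first keyword found in user_agent (case-insensitive), or None."""
    ua = user_agent.lower()
    for keyword in keywords:
        if keyword in ua:
            return keyword
    return None


if __name__ == "__main__":
    agents = [
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "GPTBot/1.2; +https://openai.com/gptbot)",
        "meta-externalagent/1.1 "
        "(+https://developers.facebook.com/docs/sharing/webmasters/crawler)",
    ]
    for ua in agents:
        print(matched_keyword(ua), "<-", ua)
```

Run against the user agents from the original post's log, this list catches GPTBot but not `meta-externalagent`, so that string would need to be added to cover Meta's crawler as well.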