A weird crawler, please help me analyze it.

    sanp • 2014-08-05 12:47:25 +08:00 • 4465 views
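    This is an excerpt from my access log. /{xxx} stands for a path on my site; everything else is the raw log, unmodified.
    The crawler's IP is not fixed: after I block one, a new IP shows up a little later.
    It started crawling my site about two years ago. The site was offline for roughly a year in between, and now that it is back up, the crawler is still here. I have no idea what it is. It quite possibly kept crawling the whole time the site was down. Can anyone help analyze this?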

    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a0" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a4" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a9" "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a6" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_4) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/10.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:57 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.04 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a7" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:43:00 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:43:01 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.16) Gecko/20120427 Firefox/15.0a1"
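One rough way to quantify what the excerpt above shows (each IP cycling through many User-Agent strings within a second or two) is to parse the log and count distinct values per IP. The following is a minimal sketch, not something posted in the thread: the regular expression assumes the standard combined log format, and `LOG_PATTERN`, `summarize`, and `SAMPLE` are hypothetical names chosen for illustration. `SAMPLE` holds three lines copied from the excerpt.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Return {ip: {"hits": int, "agents": set, "referers": set}}."""
    stats = defaultdict(lambda: {"hits": 0, "agents": set(), "referers": set()})
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        entry = stats[match["ip"]]
        entry["hits"] += 1
        entry["agents"].add(match["agent"])
        entry["referers"].add(match["referer"])
    return dict(stats)


# Three lines copied from the log excerpt in the post.
SAMPLE = [
    '89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1"',
    '89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"',
    '94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a6" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_4) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.65 Safari/535.11"',
]

if __name__ == "__main__":
    for ip, entry in summarize(SAMPLE).items():
        print(ip, entry["hits"], len(entry["agents"]), len(entry["referers"]))
```

A real browser normally keeps one User-Agent for a whole session, so an IP whose distinct-agent count is close to its hit count is a reasonable candidate for closer inspection.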
    15 replies • last reply 2014-08-05 15:25:55 +08:00
    liangdi  #1  2014-08-05 12:48:57 +08:00
    Where's the log?
    sanp (OP)  #2  2014-08-05 12:50:30 +08:00
    @liangdi I pressed Enter and it got posted automatically... I've just edited the post.
    plprapper  #3  2014-08-05 13:16:23 +08:00
    That's no crawler, that's a mangy dog that just won't go away...
    liangdi  #4  2014-08-05 13:26:59 +08:00
    Probably a content scraper. What kind of site do you run, OP?
    sintrb  #5  2014-08-05 13:28:21 +08:00
    Poor crawler...
    popbones  #6  2014-08-05 14:20:01 +08:00
    IP : 89.248.162.170
    Host : server156950.santrex.net
    Country : Netherlands

    IP : 94.102.49.31
    Host : ?
    Country : Netherlands
    captainhcg  #7  2014-08-05 14:34:48 +08:00
    http://www.projecthoneypot.org/ip_94.102.49.213
    Looks like it posts spam comments. Is your site built on WordPress or a similar framework?
    ChanneW  #8  2014-08-05 14:35:14 +08:00
    How can you tell it isn't really Google?
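One hint, offered here as a hedged aside rather than an answer from the thread: the only thing tying these requests to Google is the Referer header, which the client can set to anything. The HTTP specification (RFC 7231, section 5.5.2) says a user agent must not include the fragment part of a URI in Referer, so a value such as `http://www.google.com/#q=a3` is not what a standards-following browser would normally send after a click on a search result. A minimal sketch of that check, where `referer_has_fragment` is a hypothetical helper name:

```python
from urllib.parse import urlsplit


def referer_has_fragment(referer: str) -> bool:
    """Return True if the Referer value carries a '#fragment' part."""
    return urlsplit(referer).fragment != ""


if __name__ == "__main__":
    # Referer value taken from the log excerpt in the post.
    print(referer_has_fragment("http://www.google.com/#q=a3"))
    # A Referer without a fragment, for comparison.
    print(referer_has_fragment("http://www.google.com/search?q=a3"))
```

This is a heuristic, not proof: a missing fragment says nothing about whether a request is genuine, and the check only concerns the Referer claim, not whether the client is Googlebot.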
    avrillavigne  #9  2014-08-05 15:04:17 +08:00
    http://antivirus.neu.edu.cn/ssh/lists/base_30days.txt Northeastern University (NEU) has put it on their blacklist - -
    vicacheung  #10  2014-08-05 15:06:38 +08:00
    @sanp Can topics be edited now?
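    sanp (OP)  #11  2014-08-05 15:22:05 +08:00
    @liangdi It's a utility site for looking up data, and they are scraping it by enumerating every page. What puzzles me is that the site was down for over a year, and now that I've reopened it, they are still here.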
    sanp (OP)  #12  2014-08-05 15:22:40 +08:00
    @vicacheung You can edit right after posting.
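    sanp (OP)  #13  2014-08-05 15:24:02 +08:00
    @plprapper Exactly. With an ordinary crawler, blocking it is enough. Block this one and another IP shows up shortly afterward. It also crawls very frequently, basically nonstop.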
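Because single-IP blocks keep being sidestepped, it may help to check whether the rotating addresses cluster in a few network ranges. The sketch below is only an illustration under stated assumptions: the /24 prefix length is a guess, not something known about this crawler's hosting, and `group_by_network` and `SEEN` are hypothetical names. `SEEN` lists the two IPs from the log plus the address in the Project Honey Pot link posted earlier in this thread.

```python
import ipaddress
from collections import defaultdict


def group_by_network(ips, prefix_len=24):
    """Group IPv4 addresses by their enclosing network of the given prefix length."""
    groups = defaultdict(list)
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


# Two IPs from the access log plus the one in the Project Honey Pot URL.
SEEN = ["89.248.162.170", "94.102.49.31", "94.102.49.213"]

if __name__ == "__main__":
    for network, members in group_by_network(SEEN).items():
        print(network, members)
```

Blocking a whole range can also shut out unrelated visitors hosted in it, so a grouping like this is better treated as a starting point for investigation than as a ready-made block list.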
    sanp (OP)  #14  2014-08-05 15:25:03 +08:00
    @captainhcg I'm not using WordPress. This crawler enumerates the site's pages and then just keeps crawling them.
    sanp (OP)  #15  2014-08-05 15:25:55 +08:00
    @avrillavigne It has indeed been blacklisted in quite a few places on the internet. I just don't understand why it keeps crawling nonstop.