
Wrote this quite a few months ago; the code is pretty rough.
It doesn't crawl all of a blog's content. It was originally written for a website, and crawling everything would make users wait too long.
    # -*- coding=utf-8 -*-
    from threading import Thread
    import Queue
    import re
    import sys

    import requests

    api_url = 'http://%s.tumblr.com/api/read?&num=50&start='
    UQueue = Queue.Queue()

    def getpost(uid, queue):
        # read the post count from the first API page, then queue one URL per 50-post offset
        url = 'http://%s.tumblr.com/api/read?&num=50' % uid
        page = requests.get(url).content
        total = int(re.findall('<posts start="0" total="(.*?)">', page)[0])
        offsets = [i * 50 for i in range(1000) if i * 50 - total < 0]
        ul = api_url % uid
        for i in offsets:
            queue.put(ul + str(i))

    # search for the URL of the maximum size of a picture, which sits between
    # '<photo-url max-width="1280">' and '</photo-url>'
    extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)
    extractvideore = re.compile('/tumblr_(.*?)" type="video/mp4"')

    video_links = []
    pic_links = []
    vhead = 'https://vt.tumblr.com/tumblr_%s.mp4'

    class Consumer(Thread):
        def __init__(self, l_queue):
            super(Consumer, self).__init__()
            self.queue = l_queue

        def run(self):
            session = requests.Session()
            while 1:
                link = self.queue.get()
                print 'start parse post: ' + link
                try:
                    content = session.get(link).content
                    videos = extractvideore.findall(content)
                    video_links.extend([vhead % v for v in videos])
                    pic_links.extend(extractpicre.findall(content))
                except:
                    print 'url: %s parse failed\n' % link
                if self.queue.empty():
                    break

    def main():
        task = []
        for i in range(min(10, UQueue.qsize())):
            task.append(Consumer(UQueue))
        for t in task:
            t.start()
        for t in task:
            t.join()  # was `t.join` (missing parens), which never actually waited

    def write():
        videos = [i.replace('/480', '') for i in video_links]
        pictures = pic_links
        with open('pictures.txt', 'w') as f:
            for i in pictures:
                f.write('%s\n' % i)
        with open('videos.txt', 'w') as f:
            for i in videos:
                f.write('%s\n' % i)

    if __name__ == '__main__':
        #name = sys.argv[1]
        #name = name.strip()
        name = 'mzcyx2011'
        getpost(name, UQueue)
        main()
        write()

1 miketeam Mark
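A note on the worker loop: with a blocking `get()` followed by an `empty()` check, a thread can hang forever if another worker grabs the last URL between the two calls. A minimal sketch of a non-blocking drain, in Python 3 syntax (`queue` instead of `Queue`); `worker` and `parse` are illustrative names, not from the script above:

```python
import queue
import threading

def worker(q, results, parse):
    # drain the queue without blocking: get_nowait() raises Empty when done
    while True:
        try:
            link = q.get_nowait()
        except queue.Empty:
            break
        results.append(parse(link))

q = queue.Queue()
for i in range(0, 200, 50):
    q.put('http://example.tumblr.com/api/read?&num=50&start=%d' % i)

results = []
threads = [threading.Thread(target=worker, args=(q, results, lambda u: u))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread exits cleanly once the queue is drained, regardless of which thread took the last item.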
2 TKKONE OP PRO Forgot to dedupe! In the write function:
    videos = list(set(videos))
    pictures = list(set(pictures))
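`list(set(...))` removes duplicates but scrambles the original order. If order matters, a first-seen dedup keeps it (a generic sketch, not from the thread; the sample links are made up):

```python
def dedupe(items):
    # keep the first occurrence of each link, preserving order
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

links = ['a.mp4', 'b.mp4', 'a.mp4', 'c.mp4', 'b.mp4']
deduped = dedupe(links)  # ['a.mp4', 'b.mp4', 'c.mp4']
```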
3 sammiriam 2016-10-29 01:08:48 +08:00 Mark, will look at this tomorrow when I get up
4 wjm2038 2016-10-29 01:15:12 +08:00 via Android mark |
5 weipang 2016-10-29 06:22:08 +08:00 via iPhone But I don't know how to use it
6 cszhiyue 2016-10-29 07:58:37 +08:00 |
7 TKKONE OP PRO @cszhiyue Downloading in Python doesn't make much sense; it's slow. That's why I write the URLs out to files, so they can be downloaded with Xunlei
8 aksoft 2016-10-29 09:17:48 +08:00 A basic necessity! Selling Nutri-Express!
13 programdog 2016-10-29 09:53:59 +08:00 Thanks, OP
15 freaks 2016-10-29 10:06:58 +08:00 via Android There are plenty of online parsers like this (⊙o⊙)!
16 0915240 2016-10-29 10:13:11 +08:00 olddrivertaketakeme |
17 Nicksxs 2016-10-29 10:23:13 +08:00 Isn't it blocked by the GFW? Do you download on a VPS?
19 exoticknight 2016-10-29 10:51:17 +08:00 This is nice, but I think crawling the names of the tumblrs that provide the resources matters more
21 TKKONE OP PRO @exoticknight Nothing I can do about the names
22 guokeke 2016-10-29 11:54:29 +08:00 via Android Mark |
23 cevincheung 2016-10-29 11:58:33 +08:00 So then you can just wget it?
24 exalex 2016-10-29 12:09:26 +08:00 Could you briefly describe what the crawler produces...
25 guonning 2016-10-29 16:51:33 +08:00 via Android Bookmarked
26 LeoEatle 2016-10-29 20:34:31 +08:00 What should I change name to? Could you share a list? :)
27 yangonee 2016-10-29 21:12:00 +08:00 Looking for a name_list
28 lycos 2016-10-29 23:48:36 +08:00 via iPad mark |
29 leetom 2016-10-30 00:07:26 +08:00 @cszhiyue Halfway through a download it does this:
    Traceback (most recent call last):
      File "turmla.py", line 150, in <module>
        for square in tqdm(pool.imap_unordered(download_base_dir, urls), total=len(urls)):
      File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/site-packages/tqdm/_tqdm.py", line 713, in __iter__
        for obj in iterable:
      File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/multiprocessing/pool.py", line 668, in next
        raise value
    Exception: Unexpected response.
30 thinks 2016-10-30 10:22:00 +08:00 via Android Mark. Heh, these old drivers start the ride at the slightest excuse.
31 sangmong 2016-10-30 21:59:24 +08:00 mark |
32 errorlife 2016-10-31 01:58:11 +08:00 Does nobody know about www.tumblrget.com?
35 Nutlee 2016-10-31 09:52:35 +08:00 Strategic mark
36 iewgnaw 2016-10-31 11:12:14 +08:00 Isn't there a ready-made API for this?
38 znoodl 2016-10-31 12:19:27 +08:00 via iPhone I scraped it with golang too... gave up after it got blocked
39 Layne 2016-10-31 13:01:29 +08:00 A quiet upvote :)
40 itqls 2016-10-31 14:57:08 +08:00 Always up to something
41 weaming 2016-10-31 17:37:37 +08:00 Stirring things up, stirring things up
43 GreatMartial 2016-11-01 14:55:39 +08:00 via Android @tumbzzc OP, I want to visit your website; I want to become your fan
44 TKKONE OP PRO @GreatMartial Not suitable for minors hahaha
45 firefox12 2016-11-01 15:26:01 +08:00 |
46 Doggy 2016-11-05 10:54:13 +08:00
    import urllib

    i = 0
    with open('pictures.txt', 'r') as fobj:
        for eachline in fobj:
            pngurl = eachline.strip()
            filename = './/getpic//test-{0}.jpg'.format(i)
            print '[-]parsing:{0}'.format(filename)
            urllib.urlretrieve(pngurl, filename)
            i += 1
47 dickeny 2016-11-06 22:05:57 +08:00
    for i in range(0, total, 50):
        queue.put(ul + str(i))
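This suggestion generates the same offsets as the original `i * 50` list comprehension, minus the arbitrary 1000-iteration cap; a quick check with a made-up post count:

```python
total = 237  # hypothetical post count from the API

# original: multiples of 50 below total, capped at 1000 iterations
orig = [i * 50 for i in range(1000) if i * 50 - total < 0]
# suggested: stepped range, no arbitrary cap
suggested = list(range(0, total, 50))
```

Both yield `[0, 50, 100, 150, 200]` here, one page-start offset per 50 posts.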
48 hard2reg 2016-11-27 23:58:38 +08:00 After reading this I feel I learned Python for nothing... Other people's crawlers have multithreading, queues, classes. My crawlers are all... while if for ....
50 dr3am 2017-10-31 17:43:43 +08:00 Asking for OP's website
52 giveupAK47 2018-09-22 18:22:54 +08:00 May I ask what your blog is? I'd like to study crawlers more deeply.