The Tumblr crawler you all wanted

# -*- coding: utf-8 -*-
# Python 2 script: collects photo and video URLs from one Tumblr blog
# through its /api/read endpoint and writes them to text files.
from threading import Thread
import Queue
import re
import sys

import requests

api_url = 'http://%s.tumblr.com/api/read?&num=50&start='
UQueue = Queue.Queue()


def getpost(uid, queue):
    # Read the first page to get the total post count, then queue one
    # API URL per page of 50 posts.
    url = 'http://%s.tumblr.com/api/read?&num=50' % uid
    page = requests.get(url).content
    total = re.findall('<posts start="0" total="(.*?)">', page)[0]
    total = int(total)
    ul = api_url % uid
    for start in range(0, total, 50):
        queue.put(ul + str(start))


# Search for the URL of the maximum-size version of a picture, which starts
# with '<photo-url max-width="1280">' and ends with '</photo-url>'.
extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)
extractvideore = re.compile('/tumblr_(.*?)" type="video/mp4"')

video_links = []
pic_links = []
vhead = 'https://vt.tumblr.com/tumblr_%s.mp4'


class Consumer(Thread):

    def __init__(self, l_queue):
        super(Consumer, self).__init__()
        self.queue = l_queue

    def run(self):
        session = requests.Session()
        while True:
            try:
                # Non-blocking get: a blocking get() could hang forever if
                # another thread took the last link first.
                link = self.queue.get_nowait()
            except Queue.Empty:
                break
            print 'start parse post: ' + link
            try:
                content = session.get(link).content
                videos = extractvideore.findall(content)
                video_links.extend([vhead % v for v in videos])
                pic_links.extend(extractpicre.findall(content))
            except Exception:
                print 'url: %s parse failed\n' % link


def main():
    task = []
    for i in range(min(10, UQueue.qsize())):
        t = Consumer(UQueue)
        task.append(t)
    for t in task:
        t.start()
    for t in task:
        t.join()  # wait for every worker to finish


def write():
    videos = [i.replace('/480', '') for i in video_links]
    pictures = pic_links
    with open('pictures.txt', 'w') as f:
        for i in pictures:
            f.write('%s\n' % i)
    with open('videos.txt', 'w') as f:
        for i in videos:
            f.write('%s\n' % i)


if __name__ == '__main__':
    # name = sys.argv[1]
    # name = name.strip()
    name = 'mzcyx2011'
    getpost(name, UQueue)
    main()
    write()
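To make it easier to see what the script collects, here is a minimal Python 3 sketch of its two core steps: working out the page offsets and applying the two regexes. The patterns and the video URL template are copied from the post. `SAMPLE_PAGE`, `page_offsets`, and `extract_links` are hypothetical names made up for this illustration, and the XML fragment is invented, so real API responses may look different.

```python
import re

# Same patterns and URL template as the script in the post.
PHOTO_RE = re.compile(
    r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)
VIDEO_RE = re.compile(r'/tumblr_(.*?)" type="video/mp4"')
VIDEO_URL = 'https://vt.tumblr.com/tumblr_%s.mp4'

# Hypothetical API fragment, invented for illustration; real responses differ.
SAMPLE_PAGE = (
    '<posts start="0" total="120">'
    '<photo-url max-width="1280">https://example.com/a_1280.jpg</photo-url>'
    '<photo-url max-width="500">https://example.com/a_500.jpg</photo-url>'
    '<source src="https://example.com/video_file/tumblr_abc123/480" type="video/mp4">'
    '</posts>'
)


def page_offsets(total, page_size=50):
    """Return the `start` values needed to cover `total` posts."""
    return list(range(0, total, page_size))


def extract_links(page):
    """Return (pictures, videos) found in one API page."""
    pictures = PHOTO_RE.findall(page)
    # Build the video URL, then drop the '/480' suffix as write() does.
    videos = [(VIDEO_URL % v).replace('/480', '') for v in VIDEO_RE.findall(page)]
    return pictures, videos


if __name__ == '__main__':
    print(page_offsets(120))
    print(extract_links(SAMPLE_PAGE))
```

Only the 1280-pixel photo URL is kept, because the lookbehind in the photo pattern requires that exact `max-width` value.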

Addendum 1 2016-10-29 09:47:26 +08:00

Usage:
Just change the value of name and you're done

Addendum 2 2016-10-31 10:06:21 +08:00

I won't be providing a name list, otherwise I'd get banned

52 replies 2018-09-22 18:22:54 +08:00

miketeam

Mark

TKKONE

PRO

2016-10-29 00:14:18 +08:00

Forgot to deduplicate! Add this in the write function:
videos=list(set(videos))
pictures=list(set(pictures))
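One side effect of `list(set(...))` is that the links come out in arbitrary order. If the order in which posts were parsed matters, a minimal Python 3 sketch like the following would deduplicate while keeping first-seen order. The helper name `dedupe_keep_order` is made up for this example and is not part of the original script.

```python
def dedupe_keep_order(links):
    """Drop repeated links while keeping the order of first appearance.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    return list(dict.fromkeys(links))


if __name__ == '__main__':
    sample = ['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg', 'b.jpg']
    print(dedupe_keep_order(sample))
```

Since the worker threads finish pages in no fixed order anyway, the plain `set` version is probably good enough when the files are only fed to a download tool.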

sammiriam

2016-10-29 01:08:48 +08:00

mark, I'll take a look after I get up tomorrow

wjm2038

2016-10-29 01:15:12 +08:00 via Android

mark

weipang

2016-10-29 06:22:08 +08:00 via iPhone

But I don't know how to use it

cszhiyue

2016-10-29 07:58:37 +08:00

Added a download feature:
https://gist.github.com/zhiyue/f7121aefc00640cb13bb0eded10c5312.js

TKKONE

PRO

2016-10-29 09:06:11 +08:00 via iPhone

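@cszhiyue Downloading with Python doesn't make much sense, it's slow. That's why I write the links to files, which you can then download with Xunlei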

aksoft

2016-10-29 09:17:48 +08:00

A real necessity! Nutri-Express for sale!

TKKONE

PRO

2016-10-29 09:29:49 +08:00 via iPhone

@weipang Changing name is all it takes, then just run it

TKKONE

PRO

2016-10-29 09:32:50 +08:00 via iPhone

@aksoft My personal site currently has more than 5,000 parsed blogs

liuxingou

2016-10-29 09:35:29 +08:00

@tumbzzc

Exactly. Parsing out the URLs and letting a download tool fetch them is the most efficient way.

aksoft

2016-10-29 09:42:02 +08:00

@tumbzzc Where is it?

programdog

2016-10-29 09:53:59 +08:00

Thanks, OP

TKKONE

PRO

2016-10-29 10:01:52 +08:00 via iPhone

@aksoft At the very bottom

freaks

2016-10-29 10:06:58 +08:00 via Android

There are already plenty of online parsers like this (⊙o⊙)!

0915240

2016-10-29 10:13:11 +08:00

olddrivertaketakeme

Nicksxs

2016-10-29 10:23:13 +08:00

Isn't it blocked by the firewall? Do you download on a VPS?

cszhiyue

2016-10-29 10:49:04 +08:00

@tumbzzc I download with 8 processes and it doesn't feel slow at all. What makes it slow?

exoticknight

2016-10-29 10:51:17 +08:00

This is nice, but I think crawling the names of the Tumblr blogs that provide the content matters more

TKKONE

PRO

2016-10-29 10:53:20 +08:00 via iPhone

@freaks My site runs on an overseas VPS and also parses online

TKKONE

PRO

2016-10-29 11:52:06 +08:00 via iPhone

@exoticknight Nothing I can do about the names

guokeke

2016-10-29 11:54:29 +08:00 via Android

Mark

cevincheung

2016-10-29 11:58:33 +08:00

And then you can just wget them?

exalex

2016-10-29 12:09:26 +08:00

Could you briefly describe what the crawler produces...

guonning

2016-10-29 16:51:33 +08:00 via Android

Bookmarked

LeoEatle

2016-10-29 20:34:31 +08:00

What should I change name to? Could you give us a list : )

yangonee

2016-10-29 21:12:00 +08:00

Requesting the name_list

lycos

2016-10-29 23:48:36 +08:00 via iPad

mark

leetom

2016-10-30 00:07:26 +08:00

@cszhiyue

This happens halfway through the download:

Traceback (most recent call last):
  File "turmla.py", line 150, in <module>
    for square in tqdm(pool.imap_unordered(download_base_dir, urls), total=len(urls)):
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/site-packages/tqdm/_tqdm.py", line 713, in __iter__
    for obj in iterable:
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
Exception: Unexpected response.

thinks

2016-10-30 10:22:00 +08:00 via Android

Mark. Sigh, the veteran drivers hit the road at the drop of a hat.

sangmong

2016-10-30 21:59:24 +08:00

mark

errorlife

2016-10-31 01:58:11 +08:00

Does nobody know about www.tumblrget.com?

mozutaba

2016-10-31 04:13:10 +08:00

@errorlife It doesn't work

errorlife

2016-10-31 09:13:51 +08:00

@mozutaba Use a proxy =。=

Nutlee

2016-10-31 09:52:35 +08:00

Strategic Mark

iewgnaw

2016-10-31 11:12:14 +08:00

Isn't there a ready-made API?

TKKONE

PRO

2016-10-31 11:53:43 +08:00

@iewgnaw Isn't this using the API already?

znoodl

2016-10-31 12:19:27 +08:00 via iPhone

I crawled it with golang too... then it got blocked by the firewall and I gave up

Layne

2016-10-31 13:01:29 +08:00

Quietly giving this a like :)

itqls

2016-10-31 14:57:08 +08:00

Stirring up trouble all day long

weaming

2016-10-31 17:37:37 +08:00

Trouble, trouble

TKKONE

PRO

2016-10-31 17:42:37 +08:00

@itqls
@weaming
Don't you two go stirring up trouble

GreatMartial

2016-11-01 14:55:39 +08:00 via Android

@tumbzzc OP, I want to visit your site, I want to be your fan

TKKONE

PRO

2016-11-01 14:58:01 +08:00

@GreatMartial Not suitable for kids, hahaha

firefox12

2016-11-01 15:26:01 +08:00

The script with the download feature:
Traceback (most recent call last):
  File "./1.py", line 138, in <module>
    getpost(name, UQueue)
  File "./1.py", line 27, in getpost
    total = re.findall('<posts start="0" total="(.*?)">', page)[0]
IndexError: list index out of range
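The IndexError means `re.findall` returned an empty list, i.e. the response contained no `<posts start="0" total="...">` tag. The thread does not say why; a wrong blog name or a blocked request would be plausible causes, but that is a guess. A minimal Python 3 sketch of a more defensive parse might look like this. `parse_total` is a hypothetical helper, and matching digits only (`\d+`) instead of `(.*?)` is a small variation on the original pattern.

```python
import re

# Digits-only variant of the pattern used in getpost().
TOTAL_RE = re.compile(r'<posts start="0" total="(\d+)">')


def parse_total(page):
    """Return the post count from an API page, or raise a descriptive error."""
    match = TOTAL_RE.search(page)
    if match is None:
        # No <posts> tag: report it instead of failing with an IndexError.
        raise ValueError('no <posts start="0" total="..."> tag in response; '
                         'check the blog name and that the API is reachable')
    return int(match.group(1))


if __name__ == '__main__':
    print(parse_total('<tumblr><posts start="0" total="120"></posts></tumblr>'))
```

Printing the first few hundred characters of the response when this error is raised would show whether the server returned an error page instead of the expected XML.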

Doggy

2016-11-05 10:54:13 +08:00

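import urllib

i = 0  # running index for the output file names (starting at 0 is an assumption)
with open('pictures.txt', 'r') as fobj:
    for eachline in fobj:
        pngurl = eachline.strip()
        filename = './/getpic//test-{0}.jpg'.format(i)
        print '[-]parsing:{0}'.format(filename)
        # Python 2 API; the getpic directory must already exist
        urllib.urlretrieve(pngurl, filename)
        i += 1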