A beginner's crawler written with aiohttp fails with "Too many open files" when there are too many tasks (more than 1000). How can this be fixed? - V2EX

    xiongshengyao  2018-03-19 09:49:10 +08:00  8931 views

    Full code

    import time
    import asyncio
    import aiohttp
    from bs4 import BeautifulSoup as bs

    BASE_URL = "http://www.biqudu.com"
    TITLE2URL = dict()
    CONTENT = list()


    async def fetch(url, callback=None, **kwarags):
        headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        sem = asyncio.Semaphore(5)
        with (await sem):
            async with aiohttp.ClientSession() as session:
                async with session.get(url, headers=headers) as res:
                    page = await res.text()
                    if callback:
                        callback(page, **kwarags)
                    else:
                        return page


    def parse_url(page):
        soup = bs(page, "lxml")
        dd_a_doc = soup.select("dd > a")
        for a_doc in dd_a_doc:
            article_page_url = a_doc['href']
            article_title = a_doc.get_text()
            if article_page_url:
                TITLE2URL[article_title] = article_page_url


    def parse_body(page, **kwarags):
        title = kwarags.get('title', '')
        print("{}".format(title))
        soup = bs(page, "lxml")
        content_doc = soup.find("div", id="content")
        content_text = content_doc.get_text().replace('readx();', '').replace(' ', "\r\n")
        content = "%s\n%s\n\n" % (title, content_text)
        CONTENT.append(content)


    def main():
        t0 = time.time()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(fetch(BASE_URL+"/43_43074/", callback=parse_url))
        tasks = [fetch(BASE_URL + page_url, callback=parse_body, title=title) for title, page_url in TITLE2URL.items()]
        loop.run_until_complete(asyncio.gather(*tasks[:500]))
        loop.close()
        elapsed = time.time() - t0
        print("cost {}".format(elapsed))


    if __name__ == "__main__":
        main()

    Error message

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 797, in _wrap_create_connection
        return (yield from self._loop.create_connection(*args, **kwargs))
      File "/usr/lib/python3.5/asyncio/base_events.py", line 695, in create_connection
        raise exceptions[0]
      File "/usr/lib/python3.5/asyncio/base_events.py", line 662, in create_connection
        sock = socket.socket(family=family, type=type, proto=proto)
      File "/usr/lib/python3.5/socket.py", line 134, in __init__
        _socket.socket.__init__(self, family, type, proto, fileno)
    OSError: [Errno 24] Too many open files

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 58, in <module>
        main()
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 52, in main
        loop.run_until_complete(asyncio.gather(*tasks[:500]))
      File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
        return future.result()
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
        result = coro.send(None)
      File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 18, in fetch
        async with session.get(url, headers=headers) as res:
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 690, in __aenter__
        self._resp = yield from self._coro
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 267, in _request
        conn = yield from self._connector.connect(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 402, in connect
        proto = yield from self._create_connection(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 749, in _create_connection
        _, proto = yield from self._create_direct_connection(req)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 860, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 832, in _create_direct_connection
        req=req, client_error=client_error)
      File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 804, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.biqudu.com:443 ssl:True [Too many open files]

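    The approaches I can think of so far are:

    • Raise the Linux limit on the maximum number of open files
    • Slice the task list and run it in several batches

    But both of these feel like clumsy workarounds. Is there a better way?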
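
For reference, the second idea in the post (running the tasks in slices) could look roughly like the sketch below. It is only an illustration under stated assumptions: `fake_fetch`, `run_in_batches` and `demo` are hypothetical names, `fake_fetch` stands in for the real `fetch`, the batch size of 100 is arbitrary, and `asyncio.run` assumes Python 3.7 or later (the post itself runs on 3.5).

```python
import asyncio


async def fake_fetch(i):
    # Hypothetical stand-in for the real fetch(); it only simulates I/O latency.
    await asyncio.sleep(0.001)
    return i


async def run_in_batches(coro_factories, batch_size=100):
    """Run zero-argument coroutine factories one batch at a time.

    Only `batch_size` coroutines are created and awaited at once, so at most
    that many requests would be in flight at any moment.
    """
    results = []
    for start in range(0, len(coro_factories), batch_size):
        batch = [make() for make in coro_factories[start:start + batch_size]]
        results.extend(await asyncio.gather(*batch))
    return results


def demo():
    # The default argument binds the current value of i to each lambda.
    factories = [lambda i=i: fake_fetch(i) for i in range(1000)]
    return asyncio.run(run_in_batches(factories, batch_size=100))
```

A drawback of this shape is that each batch waits for its slowest request before the next batch starts, which is one reason the replies below point toward a shared limiter instead.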

    15 replies    2018-09-21 10:46:49 +08:00
    #1  WuMingyu  2018-03-19 10:04:40 +08:00 via iPhone
    Consider having multiple requests share a single ClientSession; each ClientSession holds at least one connection.
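
A minimal sketch of what sharing one session might look like, assuming the standard `aiohttp.ClientSession` API. The function names `fetch` and `crawl` and the way the session is passed as the first argument are illustrative choices, not the original poster's final code. `aiohttp` is imported inside `crawl` only so that `fetch` has no import-time dependency on it.

```python
import asyncio

HEADERS = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}


async def fetch(session, url):
    # The session is passed in, so every request reuses the same connection pool.
    async with session.get(url, headers=HEADERS) as res:
        return await res.text()


async def crawl(urls):
    import aiohttp  # third-party dependency, imported lazily (see note above)

    # One ClientSession for the whole run instead of one per request.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

With a single session, connections are pooled by the session's connector rather than opened independently by every task.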
    #2  WuMingyu  2018-03-19 10:07:33 +08:00 via iPhone
    #3  xiongshengyao (OP)  2018-03-19 10:10:54 +08:00
    @WuMingyu OK, I'll give it a try.
    #4  janxin  2018-03-19 10:14:20 +08:00
    Have you looked into the Linux limit on open file handles?
    #5  zhengwenk  2018-03-19 10:15:35 +08:00
    For "Too many open files", shouldn't you just raise the maximum number of open files?
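
For readers who want to see what this limit is, the sketch below inspects and (optionally) raises it from inside the Python process using the standard-library `resource` module, which is available on Unix only. The helper names `nofile_limits` and `raise_soft_nofile` are hypothetical. On the shell, `ulimit -n` reports the same soft limit.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Try to raise the soft limit to `target`, never above the hard limit.

    Returns the limits in effect afterwards. setrlimit may still raise
    ValueError or OSError if the platform rejects the requested value.
    """
    soft, hard = nofile_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return nofile_limits()
```

This changes the limit only for the current process and its children, and it does not address why so many sockets are open at once, which is the point the later replies make.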
    #6  xiongshengyao (OP)  2018-03-19 10:16:30 +08:00
    @janxin I looked into it and it does work, but it doesn't feel like an elegant solution…
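    #7  ipwx  2018-03-19 10:17:14 +08:00
    Actually, I'm puzzled by these two lines of yours:

    sem = asyncio.Semaphore(5)
    with (await sem):

    Are you trying to use the Semaphore to limit concurrency? If every fetch creates its own Semaphore, what is actually limiting the concurrency?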
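
The point can be checked with a small experiment that needs only the standard library. In this sketch, `Counter`, `worker` and the two `peak_with_*` functions are hypothetical names, `asyncio.sleep` stands in for a network request, and `asyncio.run` assumes Python 3.7 or later. It also uses `async with sem` rather than `with (await sem)`, since the latter form was deprecated in Python 3.7 and removed in 3.9.

```python
import asyncio


class Counter:
    """Tracks how many workers are inside the guarded section at once."""

    def __init__(self):
        self.current = 0
        self.peak = 0


async def worker(sem, counter):
    async with sem:
        counter.current += 1
        counter.peak = max(counter.peak, counter.current)
        await asyncio.sleep(0.01)  # simulated network wait
        counter.current -= 1


async def peak_with_private_semaphores(n=50, limit=5):
    counter = Counter()
    # A new Semaphore per task, as in the posted fetch(): nothing is shared,
    # so no task ever has to wait.
    await asyncio.gather(
        *(worker(asyncio.Semaphore(limit), counter) for _ in range(n))
    )
    return counter.peak


async def peak_with_shared_semaphore(n=50, limit=5):
    counter = Counter()
    sem = asyncio.Semaphore(limit)  # one semaphore shared by every task
    await asyncio.gather(*(worker(sem, counter) for _ in range(n)))
    return counter.peak
```

With private semaphores all 50 workers enter the guarded section together; with the shared one, no more than 5 are inside at any time.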
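    #8  xiongshengyao (OP)  2018-03-19 10:17:41 +08:00
    @zhengwenk That treats the symptom, not the cause… This time there are just over 1,000 links; if next time I crawl 10,000, I'd have to change it again… I've fixed it following reply #1…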
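    #9  ipwx  2018-03-19 10:18:57 +08:00
    Also, on the point @WuMingyu raised: why does every fetch use its own ClientSession?

    In fact, either the Semaphore or the ClientSession can limit concurrency. A Semaphore limits how many tasks run at the same time, while a ClientSession can cap the maximum number of connections (you have to pass a parameter for that, of course). Either way, you must use the same object across all tasks.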
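
One way to read "you have to pass a parameter" is aiohttp's `TCPConnector(limit=...)`, which caps the number of simultaneous connections for the session that owns the connector (the default limit is 100). The sketch below assumes that reading; `fetch_all`, `crawl` and the value 5 are illustrative, and `aiohttp` is imported inside `crawl` only to keep `fetch_all` free of an import-time dependency.

```python
import asyncio


async def fetch_all(session, urls):
    async def get_text(url):
        # With a limited connector, session.get waits here until a
        # connection slot is free.
        async with session.get(url) as res:
            return await res.text()

    return await asyncio.gather(*(get_text(url) for url in urls))


async def crawl(urls, max_connections=5):
    import aiohttp  # third-party dependency, imported lazily (see note above)

    # The connector owns the connection pool; limit caps open connections.
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await fetch_all(session, urls)
```

All tasks can still be created up front; the connector simply makes the excess requests wait, so the number of open sockets stays bounded regardless of how many URLs are queued.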
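    #10  lfzyx  2018-03-19 10:19:37 +08:00
    There's nothing elegant or inelegant about it. Every distribution ships with a different default open files limit, and on cloud hosts the provider has usually already raised it to 65535 or even higher.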
    #11  xiongshengyao (OP)  2018-03-19 10:23:21 +08:00
    @ipwx
    @lfzyx
    @zhengwenk
    @janxin
    @WuMingyu
    Thanks, everyone. I now have an idea of how to solve this. Thank you all!
    #12  CSM  2018-03-19 10:27:05 +08:00 via Android
    The replies above are right. The aiohttp documentation says a single ClientSession is enough for one application. You can pass the session to fetch as a parameter.
    #13  bestehen  2018-06-17 19:04:00 +08:00
    @xiongshengyao It looks like you only have 500 tasks here, so where does 1000 come from? *tasks[:500]))
    #14  handan  2018-09-20 17:50:30 +08:00
    May I ask whether you ever found a good solution to this problem?
    #15  xiongshengyao (OP)  2018-09-21 10:46:49 +08:00
    @handan See reply #1.