Crawler black magic: how I scraped indeed's job data

    syl371 · 2020-03-20 15:27:49 +08:00 · 2718 clicks
    This topic was created 2,032 days ago, so some of the information in it may have changed since then.

    I've recently been learning Node.js crawling and the request module, so I wanted to build a crawler project of my own. After a lot of poking around I settled on indeed as the target site: crawl indeed's job postings and build my own job search engine on top of them. It's already live. The features are still fairly basic, but I'll post the link anyway (job search engine) to show that the crawler is actually useful. Below is a walkthrough of how the whole crawler works.

    Choosing the entry page

    As everyone knows, a crawler needs an entry page: starting from it, you keep following links until you have crawled the whole site. I hit a snag at this very first step. Normally you would pick the home page or the listing pages as the entry point, but indeed limits its listing pages: you can't crawl a complete list, at most the first 100 pages. That didn't stop me, though. I found that indeed has a Browse Jobs page, and from it you can reach every list indexed by location and by job category. Here is the parser code for that page.

    start: async (page) => {
      const host = URL.parse(page.url).hostname;
      const tasks = [];
      try {
        const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
        $('#states > tbody > tr > td > a').each((i, ele) => {
          const url = URL.resolve(page.url, $(ele).attr('href'));
          tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
        });
        $('#categories > tbody > tr > td > a').each((i, ele) => {
          const url = URL.resolve(page.url, $(ele).attr('href'));
          tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
        });
        const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
        res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
        return 1;
      } catch (err) {
        console.error(`${host}-start parse ${page.url} ${err}`);
        return 0;
      }
    }

    The HTML is parsed with cheerio, and the search-by-location and search-by-category links are inserted into the database.

    Crawler architecture

    A quick overview of the crawler's architecture. The database is MongoDB. Every page waiting to be crawled is stored as one page record with fields such as _id, url, done, type and host; the _id is generated with md5(url), which prevents duplicates. Each type has a corresponding HTML parsing method, and most of the business logic lives in those parsers; the code above is one example.
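
    The parsers read and write MongoDB through handles such as global.com.task and global.com.job, which the post never shows being created. Here is a minimal sketch of that wiring, assuming the official mongodb driver, a local instance and a database named indeed (all assumptions, not the author's code):

    const { MongoClient } = require('mongodb');

    // A minimal sketch, not from the original post: the connection string,
    // database name and module layout are all assumptions.
    const init = async () => {
      const client = await MongoClient.connect('mongodb://127.0.0.1:27017');
      const db = client.db('indeed');
      // One handle per collection the parsers use. Because _id is md5(url),
      // re-inserting a known link violates the unique _id index and is simply
      // skipped thanks to insertMany(..., { ordered: false }).
      global.com = {
        task: db.collection('task'),
        city: db.collection('city'),
        category: db.collection('category'),
        company: db.collection('company'),
        job: db.collection('job')
      };
      return client;
    };

    module.exports = init;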

    Pages are downloaded with the request module, wrapped in a thin layer that turns the callback into a promise so it can be called with async/await. The code is below.

    const req = require('request');
    const request = req.defaults({
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
      },
      timeout: 30000,
      encoding: null
    });

    const fetch = (url) => new Promise((resolve) => {
      console.log(`down ${url} started`);
      request(encodeURI(url), (err, res, body) => {
        if (res && res.statusCode === 200) {
          console.log(`down ${url} 200`);
          resolve(body);
        } else {
          console.error(`down ${url} ${res && res.statusCode} ${err}`);
          if (res && res.statusCode) {
            resolve(res.statusCode);
          } else {
            // ESOCKETTIMEOUT and other errors with no response resolve with 600
            resolve(600);
          }
        }
      });
    });

    There is some basic anti-anti-crawling handling: the User-Agent is set to a common desktop browser string, and the timeout to 30 seconds. The encoding: null option makes request return the raw buffer instead of decoded content. The advantage is that whether a page is gbk- or utf-8-encoded, you only need to specify the charset when parsing the HTML; if you set encoding: 'utf-8' here instead, a gbk page would come out as mojibake.
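
    As a small illustration of that point (my sketch, not from the post): with the buffer in hand, the charset decision moves to parse time via iconv-lite, so the same fetch wrapper works for utf-8 and gbk pages alike. The 'gbk' case is purely hypothetical here, since indeed serves utf-8.

    const iconv = require('iconv-lite');

    // Decode the raw buffer with the page's real charset at parse time.
    const decodePage = (buf, charset = 'utf-8') => iconv.decode(buf, charset);

    // const html = decodePage(page.con);              // utf-8 page (indeed)
    // const htmlGbk = decodePage(gbkPage.con, 'gbk');  // hypothetical gbk site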

    request is callback-based by default, so it is wrapped in a promise: on success it resolves with the page buffer, on failure with the error status code, and on timeout with 600. Anyone familiar with Node.js should find this straightforward.

    The complete parser code

    const URL = require('url');
    const md5 = require('md5');
    const cheerio = require('cheerio');
    const iconv = require('iconv-lite');

    const json = (data) => {
      let res;
      try {
        res = JSON.parse(data);
      } catch (err) {
        console.error(err);
      }
      return res;
    };

    const rules = [
      /\/jobs\?q=.*&sort=date&start=\d+/,
      /\/jobs\?q=&l=.*&sort=date&start=\d+/
    ];

    const fns = {
      start: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        try {
          const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
          $('#states > tbody > tr > td > a').each((i, ele) => {
            const url = URL.resolve(page.url, $(ele).attr('href'));
            tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
          });
          $('#categories > tbody > tr > td > a').each((i, ele) => {
            const url = URL.resolve(page.url, $(ele).attr('href'));
            tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
          });
          const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-start parse ${page.url} ${err}`);
          return 0;
        }
      },
      city: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        const cities = [];
        try {
          const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
          $('#cities > tbody > tr > td > p.city > a').each((i, ele) => {
            // https://www.indeed.com/l-Charlotte,-NC-jobs.html
            let tmp = $(ele).attr('href').match(/l-(?<loc>.*)-jobs.html/u);
            if (!tmp) {
              tmp = $(ele).attr('href').match(/l=(?<loc>.*)/u);
            }
            const { loc } = tmp.groups;
            const url = `https://www.indeed.com/jobs?l=${decodeURIComponent(loc)}&sort=date`;
            tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
            cities.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
          });
          let res = await global.com.city.insertMany(cities, { ordered: false }).catch(() => {});
          res && console.log(`${host}-city insert ${res.insertedCount} from ${cities.length} cities`);
          res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-city insert ${res.insertedCount} from ${tasks.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-city parse ${page.url} ${err}`);
          return 0;
        }
      },
      category: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        const categories = [];
        try {
          const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
          $('#titles > tbody > tr > td > p.job > a').each((i, ele) => {
            const { query } = $(ele).attr('href').match(/q-(?<query>.*)-jobs.html/u).groups;
            const url = `https://www.indeed.com/jobs?q=${decodeURIComponent(query)}&sort=date`;
            tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
            categories.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
          });
          let res = await global.com.category.insertMany(categories, { ordered: false }).catch(() => {});
          res && console.log(`${host}-category insert ${res.insertedCount} from ${categories.length} categories`);
          res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-category insert ${res.insertedCount} from ${tasks.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-category parse ${page.url} ${err}`);
          return 0;
        }
      },
      search: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        const durls = [];
        try {
          const con = iconv.decode(page.con, 'utf-8');
          const $ = cheerio.load(con, { decodeEntities: false });
          const list = con.match(/jobmap\[\d+\]= {.*}/g);
          const jobmap = [];
          if (list) {
            // eslint-disable-next-line no-eval
            list.map((item) => eval(item));
          }
          for (const item of jobmap) {
            const cmplink = URL.resolve(page.url, item.cmplnk);
            const { query } = URL.parse(cmplink, true);
            let name;
            if (query.q) {
              // eslint-disable-next-line prefer-destructuring
              name = query.q.split(' #')[0].split('#')[0];
            } else {
              const tmp = cmplink.match(/q-(?<text>.*)-jobs.html/u);
              if (!tmp) {
                // eslint-disable-next-line no-continue
                continue;
              }
              const { text } = tmp.groups;
              // eslint-disable-next-line prefer-destructuring
              name = text.replace(/-/g, ' ').split(' #')[0];
            }
            const surl = `https://www.indeed.com/cmp/_cs/cmpauto?q=${name}&n=10&returnlogourls=1&returncmppageurls=1&caret=8`;
            const burl = `https://www.indeed.com/viewjob?jk=${item.jk}&from=vjs&vjs=1`;
            const durl = `https://www.indeed.com/rpc/jobdescs?jks=${item.jk}`;
            tasks.push({ _id: md5(surl), type: 'suggest', host, url: surl, done: 0 });
            tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
            durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
          }
          $('a[href]').each((i, ele) => {
            const tmp = URL.resolve(page.url, $(ele).attr('href'));
            const [url] = tmp.split('#');
            const { path, hostname } = URL.parse(url);
            for (const rule of rules) {
              if (rule.test(path)) {
                if (hostname == host) {
                  // tasks.push({ _id: md5(url), type: 'list', host, url: decodeURI(url), done: 0 });
                }
                break;
              }
            }
          });
          let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-search insert ${res.insertedCount} from ${tasks.length} tasks`);
          res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
          res && console.log(`${host}-search insert ${res.insertedCount} from ${durls.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-search parse ${page.url} ${err}`);
          return 0;
        }
      },
      suggest: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        const companies = [];
        try {
          const con = page.con.toString('utf-8');
          const data = json(con);
          for (const item of data) {
            const id = item.overviewUrl.replace('/cmp/', '');
            const cmpurl = `https://www.indeed.com/cmp/${id}`;
            const joburl = `https://www.indeed.com/cmp/${id}/jobs?clearPrefilter=1`;
            tasks.push({ _id: md5(cmpurl), type: 'company', host, url: cmpurl, done: 0 });
            tasks.push({ _id: md5(joburl), type: 'jobs', host, url: joburl, done: 0 });
            companies.push({ _id: id, name: item.name, url: cmpurl });
          }
          let res = await global.com.company.insertMany(companies, { ordered: false }).catch(() => {});
          res && console.log(`${host}-suggest insert ${res.insertedCount} from ${companies.length} companies`);
          res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-suggest insert ${res.insertedCount} from ${tasks.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-suggest parse ${page.url} ${err}`);
          return 0;
        }
      },
      // list: () => {},
      jobs: async (page) => {
        const host = URL.parse(page.url).hostname;
        const tasks = [];
        const durls = [];
        try {
          const con = iconv.decode(page.con, 'utf-8');
          const tmp = con.match(/window._initialData=(?<text>.*);<\/script><script>window._sentryData/u);
          let data;
          if (tmp) {
            const { text } = tmp.groups;
            data = json(text);
            if (data.jobList && data.jobList.pagination && data.jobList.pagination.paginationLinks) {
              for (const item of data.jobList.pagination.paginationLinks) {
                // eslint-disable-next-line max-depth
                if (item.href) {
                  item.href = item.href.replace(/\u002F/g, '/');
                  const url = URL.resolve(page.url, decodeURI(item.href));
                  tasks.push({ _id: md5(url), type: 'jobs', host, url: decodeURI(url), done: 0 });
                }
              }
            }
            if (data.jobList && data.jobList.jobs) {
              for (const job of data.jobList.jobs) {
                const burl = `https://www.indeed.com/viewjob?jk=${job.jobKey}&from=vjs&vjs=1`;
                const durl = `https://www.indeed.com/rpc/jobdescs?jks=${job.jobKey}`;
                tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
                durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
              }
            }
          } else {
            console.log(`${host}-jobs ${page.url} has no _initialData`);
          }
          let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-search insert ${res.insertedCount} from ${tasks.length} tasks`);
          res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
          res && console.log(`${host}-search insert ${res.insertedCount} from ${durls.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-jobs parse ${page.url} ${err}`);
          return 0;
        }
      },
      brief: async (page) => {
        const host = URL.parse(page.url).hostname;
        try {
          const con = page.con.toString('utf-8');
          const data = json(con);
          data.done = 0;
          data.views = 0;
          data.host = host;
          // format publish date
          if (data.vfvm && data.vfvm.jobAgeRelative) {
            const str = data.vfvm.jobAgeRelative;
            const tmp = str.split(' ');
            const [first, second] = tmp;
            if (first == 'Just' || first == 'Today') {
              data.publishDate = Date.now();
            } else {
              const num = first.replace(/\+/, '');
              if (second == 'hours') {
                const date = new Date();
                const time = date.getTime();
                // eslint-disable-next-line no-mixed-operators
                date.setTime(time - num * 60 * 60 * 1000);
                data.publishDate = date.getTime();
              } else if (second == 'days') {
                const date = new Date();
                const time = date.getTime();
                // eslint-disable-next-line no-mixed-operators
                date.setTime(time - num * 24 * 60 * 60 * 1000);
                data.publishDate = date.getTime();
              } else {
                data.publishDate = Date.now();
              }
            }
          }
          await global.com.job.updateOne({ _id: data.jobKey }, { $set: data }, { upsert: true }).catch(() => { });
          const tasks = [];
          const url = `https://www.indeed.com/jobs?l=${data.jobLocationModel.jobLocation}&sort=date`;
          tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
          const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
          res && console.log(`${host}-brief insert ${res.insertedCount} from ${tasks.length} tasks`);
          return 1;
        } catch (err) {
          console.error(`${host}-brief parse ${page.url} ${err}`);
          return 0;
        }
      },
      detail: async (page) => {
        const host = URL.parse(page.url).hostname;
        try {
          const con = page.con.toString('utf-8');
          const data = json(con);
          const [jobKey] = Object.keys(data);
          await global.com.job.updateOne({ _id: jobKey }, { $set: { content: data[jobKey], done: 1 } }).catch(() => { });
          return 1;
        } catch (err) {
          console.error(`${host}-detail parse ${page.url} ${err}`);
          return 0;
        }
      },
      run: (page) => {
        if (page.type == 'list') {
          page.type = 'search';
        }
        const fn = fns[page.type];
        if (fn) {
          return fn(page);
        }
        console.error(`${page.url} parser not found`);
        return 0;
      }
    };

    module.exports = fns;

    Every parser inserts some new links, and every new link record carries a type field. The type tells you which parser handles that link, which is how every page eventually gets parsed. For example, the start method inserts records with type city and category; a page record with type city is handled by the city method, which in turn inserts links with type search, and so on, until the final brief and detail methods fetch each job's summary and full description respectively.
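
    For completeness, here is a minimal sketch of the glue that could drive these parsers (my assumption, not the author's exact implementation; the file paths are made up): take a pending task record, download it with the fetch wrapper, attach the buffer as page.con, dispatch through fns.run by type, and mark the record done.

    const fetch = require('./fetch');   // the promise-wrapped request above (path assumed)
    const fns = require('./parsers');   // the parser module above (path assumed)

    // A single-worker crawl loop, sketched only; a real worker would also want
    // retry limits and concurrency control.
    const crawl = async () => {
      for (;;) {
        const page = await global.com.task.findOne({ done: 0 }); // next pending page
        if (!page) break;                                        // nothing left to do
        const body = await fetch(page.url);
        if (Buffer.isBuffer(body)) {
          page.con = body;                 // raw buffer, decoded inside the parser
          const ok = await fns.run(page);  // dispatch on page.type
          if (ok) {
            await global.com.task.updateOne({ _id: page._id }, { $set: { done: 1 } });
          }
        } else {
          // fetch() resolved with a status code (600 means timeout); leave done: 0 to retry later
          console.error(`down ${page.url} failed with ${body}`);
        }
      }
    };

    crawl().catch(console.error);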

    These HTML parsers really are the heart of a crawler: with them in place, you can extract whatever structured content you want.

    Data indexing

    This part is straightforward. With the structured data collected above, create a mapping in Elasticsearch and write a small program that periodically adds the job data to the ES index. Because the job descriptions are fairly large, I didn't add the content field to the index; it takes too much memory and my server doesn't have enough, >_<.
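
    A minimal sketch of what that indexing program could look like, assuming the 7.x @elastic/elasticsearch client, an index named jobs, and guessed document fields (none of this schema is shown in the post):

    const { Client } = require('@elastic/elasticsearch');

    const es = new Client({ node: 'http://127.0.0.1:9200' });

    // Sketch only: index name, field names and schedule are assumptions.
    // The large `content` field is deliberately left out of the index, as in the post.
    const indexJobs = async () => {
      const body = [];
      const cursor = global.com.job.find({ done: 1 });
      while (await cursor.hasNext()) {
        const job = await cursor.next();
        body.push({ index: { _index: 'jobs', _id: job._id } });
        body.push({
          title: job.jobTitle,                                                 // assumed field
          company: job.companyName,                                            // assumed field
          location: job.jobLocationModel && job.jobLocationModel.jobLocation,
          publishDate: job.publishDate
        });
      }
      if (body.length) {
        const { body: res } = await es.bulk({ refresh: true, body });
        console.log(`indexed ${res.items.length} jobs, errors: ${res.errors}`);
      }
    };

    // re-index periodically, e.g. hourly
    setInterval(() => indexJobs().catch(console.error), 60 * 60 * 1000);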

    DEMO

    Finally, here is the link again for everyone to check out: job search engine

    12 replies    2020-03-21 11:26:01 +08:00
    justseemore    #1    2020-03-20 15:29:32 +08:00
    @Police officers, this is the guy right here.

    zdd2389    #2    2020-03-20 15:31:10 +08:00
    Why are you all so upstanding? The first thing I did when learning to crawl was go scrape "action movies".

    syl371 (OP)    #3    2020-03-20 15:33:39 +08:00
    @zdd2389 I've got "action movies" too, haha.

    crella    #4    2020-03-20 16:52:06 +08:00 via Android
    I have to say JS code goes around in far too many circles; a rookie like me really can't follow it.

    A few suggestions, just from my own experience crawling job boards:

    1. Purge or compact stale postings, say anything more than ten days old.

    2. Some job boards "refresh" previously published postings every day. How do you tell those apart from postings genuinely published today? I set up two tables: stale postings keep only their job_id, while postings in the main table record both a "publish date" and an "update date" (see the sketch after this list).

    3. Postings published by the same company on the same day are best displayed next to each other.

    4. Keep a company blacklist and don't index postings from companies on it.
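
    A minimal sketch of the two-collection idea from point 2 above (my illustration, not code from the thread; the collection and field names are made up):

    // Illustration only: "jobs" holds live postings with publishDate/updateDate,
    // "staleJobs" keeps just the ids of expired ones.
    const TEN_DAYS = 10 * 24 * 60 * 60 * 1000;

    // Upsert one crawled posting. A listing the site merely "refreshed" already
    // exists, so only updateDate moves; a genuinely new one also gets publishDate.
    // (fields must not itself contain publishDate/updateDate.)
    const upsertPosting = (db, jobId, fields) =>
      db.collection('jobs').updateOne(
        { _id: jobId },
        {
          $set: { ...fields, updateDate: Date.now() },
          $setOnInsert: { publishDate: Date.now() }
        },
        { upsert: true }
      );

    // Expire postings published more than ten days ago: keep only the id.
    const expireOldPostings = async (db) => {
      const cutoff = Date.now() - TEN_DAYS;
      const old = await db.collection('jobs').find({ publishDate: { $lt: cutoff } }).toArray();
      for (const job of old) {
        await db.collection('staleJobs').insertOne({ _id: job._id }).catch(() => {});
        await db.collection('jobs').deleteOne({ _id: job._id });
      }
    };

    module.exports = { upsertPosting, expireOldPostings };
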
    crella    #5    2020-03-20 16:53:34 +08:00 via Android
    @crella Point 2 above requires crawling every day; otherwise not even a god could tell whether a posting was really published today. (doge)

    syl371 (OP)    #6    2020-03-20 20:00:20 +08:00
    @crella Thanks for the suggestions. Aren't you going to post your own site so we can admire it? Haha.

    crella    #7    2020-03-20 22:08:57 +08:00 via Android
    @syl371 "Admire" is far too kind. I only thought about all this because I was put off by the shady tricks of job boards and companies while hunting for a job myself.

    locoz    #8    2020-03-20 23:28:27 +08:00
    Pretty interesting little project; doing the whole thing end to end already puts you ahead of more than 90% of people, heh. Any interest in posting a copy to bbs.nightteam.cn as well? It's a very focused crawler community.

    solonF    #9    2020-03-21 00:39:36 +08:00
    (indeed is itself basically a crawler)

    stephCurry    #10    2020-03-21 10:01:37 +08:00 via iPhone
    @zdd2389 I'm the exact opposite: I got into crawlers to go after scammers.

    syl371 (OP)    #11    2020-03-21 11:25:34 +08:00
    @locoz Posted.

    syl371 (OP)    #12    2020-03-21 11:26:01 +08:00
    @solonF Right, indeed itself does a lot of crawling.