
New to Markdown: a beginner's first blog post, writing up my first Python project

  •   p2pCoder
    zgbgx · January 24, 2018 · 3,862 views

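    About the project

    I wrote this project in early July 2017. It uses Python to crawl data from Wangdaizhijia (网贷之家, wdzj.com) and Renrendai (人人贷) and then analyzes it.
    Wangdaizhijia is the largest P2P lending data platform in China, and Renrendai ranks among the top twenty P2P platforms in the country.
    Source code

    Data crawling

    Packet capture analysis

    For packet capture I mainly used the Network panel of Chrome DevTools. Wangdaizhijia returns all of its data as JSON over AJAX, while Renrendai mixes AJAX responses with data rendered directly into HTML pages.

    Request example

    [Image: QQ 截图 20180123205633.png] The capture shows the request method (GET or POST), the request headers, and the request parameters. [Image: QQ 截图 20180123205843.png] The response shows the format of the returned data (JSON in this example), its structure, and the actual values. Note: this is the API endpoint that Wangdaizhijia uses today. The endpoints were different when the crawler was written, so the Wangdaizhijia part of the crawler no longer works.

    Building requests

    Requests are built from the results of the capture analysis. This project uses Python's requests library to simulate HTTP requests. The code: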

        import requests


        class SessionUtil():
            def __init__(self, headers=None, cookie=None):
                self.session = requests.Session()
                if headers is None:
                    # Default headers, copied from a browser AJAX request
                    headersStr = {
                        "Accept": "application/json, text/javascript, */*; q=0.01",
                        "X-Requested-With": "XMLHttpRequest",
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                        "Accept-Encoding": "gzip, deflate, sdch, br",
                        "Accept-Language": "zh-CN,zh;q=0.8",
                    }
                    self.headers = headersStr
                else:
                    self.headers = headers
                self.cookie = cookie

            # Send a GET request and return the response body as text
            def getReq(self, url):
                return self.session.get(url, headers=self.headers).text

            def addCookie(self, cookie):
                self.headers['cookie'] = cookie

            # Send a POST request and return the response body as text
            def postReq(self, url, param):
                return self.session.post(url, param).text

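    Among the request headers, the only key field I set was "User-Agent". Neither Wangdaizhijia nor Renrendai had any anti-crawling measures, so I did not even need to set a "Referer" header to avoid cross-origin errors.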
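    To see which headers would actually be sent, a request can be prepared without sending it. The following is a minimal sketch, not part of the project: the URL and the Referer value are placeholders, and it only relies on `requests.Session.prepare_request`.

```python
import requests

# Hypothetical endpoint, used only to build (not send) a request.
URL = "http://example.com/api/data"

session = requests.Session()
# Headers set on the session are merged into every request it prepares.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "X-Requested-With": "XMLHttpRequest",
})

# Per-request headers (here an optional Referer) are layered on top.
request = requests.Request("GET", URL, headers={"Referer": "http://example.com/"})
prepared = session.prepare_request(request)

# Inspect what would go over the wire without making a network call.
for name, value in prepared.headers.items():
    print(name, "=", value)
```

    If a site later starts rejecting requests, comparing this output with the headers shown in the browser's Network panel is a quick first check.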

    Crawler example

    Here is one of the crawlers:

        import json
        import traceback

        from databaseUtil import DatabaseUtil
        from sessionUtil import SessionUtil
        from dictUtil import DictUtil
        from logUtil import LogUtil

        # Fields of platOuterVo; each one is stored in a column of the same name.
        FIELDS = [
            'actualCapital', 'aliasName', 'association', 'associationDetail',
            'autoBid', 'autoBidCode', 'bankCapital', 'bankFunds',
            'bidSecurity', 'bindingFlag', 'businessType', 'companyName',
            'credit', 'creditLevel', 'delayScore', 'delayScoreDetail',
            'displayFlg', 'drawScore', 'drawScoreDetail', 'equityVoList',
            'experienceScore', 'experienceScoreDetail', 'fundCapital', 'gjlhhFlag',
            'gjlhhTime', 'gruarantee', 'inspection', 'juridicalPerson',
            'locationArea', 'locationAreaName', 'locationCity', 'locationCityName',
            'manageExpense', 'manageExpenseDetail', 'newTrustCreditor', 'newTrustCreditorCode',
            'officeAddress', 'onlineDate', 'payment', 'paymode',
            'platBackground', 'platBackgroundDetail', 'platBackgroundDetailExpand', 'platBackgroundExpand',
            'platEarnings', 'platEarningsCode', 'platName', 'platStatus',
            'platUrl', 'problem', 'problemTime', 'recordId',
            'recordLicId', 'registeredCapital', 'riskCapital', 'riskFunds',
            'riskReserve', 'riskcontrol', 'securityModel', 'securityModelCode',
            'securityModelOther', 'serviceScore', 'serviceScoreDetail', 'startInvestmentAmout',
            'term', 'termCodes', 'termWeight', 'transferExpense',
            'transferExpenseDetail', 'trustCapital', 'trustCreditor', 'trustCreditorMonth',
            'trustFunds', 'tzjPj', 'vipExpense', 'withTzj',
            'withdrawExpense',
        ]


        def handleData(returnStr):
            # Parse the JSON response and return the platform detail object
            jsonData = json.loads(returnStr)
            platData = jsonData.get('data').get('platOuterVo')
            return platData


        def storeData(jsonOne, conn, cur, platId):
            # Insert one platform record. Placeholders replace the original
            # string concatenation (%s is the style used by common MySQL drivers).
            columns = FIELDS + ['platId']
            values = [jsonOne.get(name) for name in FIELDS] + [platId]
            sql = 'insert into problemPlatDetail (%s) values (%s)' % (
                ','.join(columns), ','.join(['%s'] * len(columns)))
            cur.execute(sql, values)
            conn.commit()


        conn, cur = DatabaseUtil().getConn()
        session = SessionUtil()
        logUtil = LogUtil("problemPlatDetail.log")
        cur.execute('select platId from problemPlat')
        data = cur.fetchall()
        mylist = [str(row.get('platId')) for row in data]
        print(mylist)
        for i in mylist:
            url = 'http://wwwservice.wdzj.com/api/plat/platData30Days?platId=' + i
            try:
                data = session.getReq(url)
                platData = handleData(data)
                dictObject = DictUtil(platData)
                storeData(dictObject, conn, cur, i)
            except Exception:
                traceback.print_exc()
        cur.close()
        conn.close()

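    Throughout the process we build a request and then parse its response. JSON responses are parsed with the json library and HTML pages with BeautifulSoup (for HTML pages with a complex structure, lxml is recommended). The parsed results are stored in a MySQL database.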
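    The two parsing paths can be sketched as follows. This is an illustration only: both payloads are made up, and the HTML branch uses the standard library's `html.parser` in place of BeautifulSoup so that the example has no third-party dependencies.

```python
import json
from html.parser import HTMLParser

# Hypothetical AJAX response, shaped like the data.platOuterVo payload.
ajax_body = '{"data": {"platOuterVo": {"platName": "demo", "term": "12"}}}'
plat = json.loads(ajax_body)["data"]["platOuterVo"]

# Hypothetical server-rendered page carrying the same fields in table cells.
html_body = "<table><tr><td class='name'>demo</td><td class='term'>12</td></tr></table>"


class CellParser(HTMLParser):
    """Collect the text of every <td>, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = dict(attrs).get("class")

    def handle_data(self, data):
        if self._current is not None:
            self.cells[self._current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None


parser = CellParser()
parser.feed(html_body)
print(plat["platName"], parser.cells)
```

    Either branch ends with a plain dict, so the same storage code can handle records from both kinds of response.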

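    Crawler code

    Crawler code repository (note: the crawler code runs under both Python 2 and Python 3; I deployed it on an Alibaba Cloud server and ran it with Python 2)

    Data analysis

    The analysis mainly uses Python's numpy, pandas, and matplotlib, supplemented by Haizhi BDP (海致 BDP).

    Time series analysis

    Reading the data

    The data is usually loaded into a pandas DataFrame for analysis. The following example reads the problem-platform data: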

        import pandas as pd

        problemPlat = pd.read_csv('problemPlat.csv', parse_dates=True)  # problem platforms

    Data structure [Image: QQ 截图 20180123212641.png]
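    Date-based slicing only works when the DataFrame index holds timestamps. The sketch below shows one way to get there; the CSV content and the column names (`problemTime`, `id`, `platName`) are assumptions, since the real file layout is only visible in the screenshot.

```python
import io
import pandas as pd

# Hypothetical CSV; the real problemPlat.csv columns are not shown in the post.
csv_text = """problemTime,id,platName
2015-01-03,1,A
2015-01-20,2,B
2015-02-11,3,C
"""

# parse_dates=True parses the index, so the date column is named via index_col.
frame = pd.read_csv(io.StringIO(csv_text), index_col="problemTime", parse_dates=True)

print(type(frame.index).__name__)
# Partial-string indexing works once the index holds timestamps.
print(frame.loc["2015-01"])
```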

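    Time series analysis

    E.g. the number of problem platforms over time

        # Monthly count of problem platforms, 2012 through 2017.
        # resample(..., how='count') was removed from pandas; chain .count() instead.
        # Recent pandas versions spell the month-end alias 'ME'.
        problemPlat['id']['2012':'2017'].resample('M').count().plot(title='P2P platforms with problems')

    Chart [Image: QQ 截图 20180123212803.png]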
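    The monthly count above can be tried on a few made-up dates. In this sketch the data is hypothetical, and the month-start alias "MS" is used instead of the month-end alias because it is spelled the same across pandas versions.

```python
import pandas as pd

# Hypothetical dates on which platforms ran into problems.
dates = pd.to_datetime(["2015-01-03", "2015-01-20", "2015-02-11", "2015-04-02"])
problems = pd.Series([1, 2, 3, 4], index=dates, name="id")

# Bucket the rows by calendar month (labelled by month start) and count them.
# Months without any rows still appear, with a count of 0.
monthly = problems.resample("MS").count()
print(monthly)
```

    Keeping the empty months matters for a trend plot: dropping them would make a quiet period look like a gap in the data.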

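    Regional analysis

    Done with Haizhi BDP (drawing map distributions in Python takes fairly involved tooling, which I had not yet learned at the time)

    Number of problem platforms by province

    [Image: 下载.png]

    Transaction volume by province

    [Image: 全年成交额全国各省对比.png]

    Size distribution analysis

    E.g. the distribution of platform transaction volume nationwide in June. Code: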

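        # Probability distribution of June transaction volume.
        # hist(normed=True) was removed from matplotlib; density=True replaces it.
        juneData['amount'].hist(density=True)
        juneData['amount'].plot(kind='kde', style='k--')

    Kernel density plot [Image: QQ 截图 20180123213700.png] Kernel density of the logarithm of transaction volume

        # Probability distribution after taking the base-10 logarithm
        np.log10(juneData['amount']).hist(density=True)
        np.log10(juneData['amount']).plot(kind='kde', style='k--')

    Chart [Image: QQ 截图 20180123213901.png] After taking the base-10 logarithm, the distribution looks much closer to the usual pyramid shape.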
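    The effect of the log transform can also be checked numerically. The sketch below uses synthetic log-normal "amounts" (not the crawled data) and compares sample skewness before and after `np.log10`; a skewness near 0 indicates a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "amounts" (log-normal); not the author's data.
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=5000))

log_amount = np.log10(amount)

# Compare the asymmetry of the raw and the log-transformed values.
print("skew(raw)   =", round(amount.skew(), 2))
print("skew(log10) =", round(log_amount.skew(), 2))
```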

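    Correlation analysis

    E.g. how the correlation coefficient between Lufax (陆金所) volume and the volume of all platforms changes over time

        lujinData = platVolume[platVolume['wdzjPlatId'] == 59]
        # pd.rolling_corr was removed from pandas; use Series.rolling(...).corr(...)
        corr = lujinData['amount'].rolling(50, min_periods=50).corr(allPlatDayData['amount'])
        corr.plot(title='Rolling correlation between Lufax volume and all-platform volume')

    Chart [Image: QQ 截图 20180123214114.png]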
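    A rolling correlation of this kind can be reproduced on synthetic series. In the sketch below both series are invented (one is a noisy share of the other), and the 50-day window mirrors the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
days = pd.date_range("2017-01-01", periods=200, freq="D")

# Synthetic daily totals; the platform's volume is a noisy share of the market's.
market = pd.Series(1000 + rng.normal(0, 50, size=200).cumsum(), index=days)
platform = 0.1 * market + rng.normal(0, 2, size=200)

# 50-day rolling Pearson correlation; the first 49 values are NaN
# because min_periods=50 requires a full window.
corr = platform.rolling(window=50, min_periods=50).corr(market)
print(corr.dropna().describe())
```

    Both series need to share the same date index: pandas aligns them by index label before computing each window.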

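    Category comparison

    Car-loan platforms versus all platforms, by transaction volume

        import matplotlib.pyplot as plt

        # Daily totals for car-loan platforms, plotted next to the all-platform totals
        carFinanceDayData = carFinanceData.resample('D').sum()['amount']
        fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(14, 7))
        carFinanceDayData.plot(ax=axes[0], title='Car-loan platform volume')
        allPlatDayData['amount'].plot(ax=axes[1], title='Volume of all P2P platforms')

    [Image: QQ 截图 20180123214359.png]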
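    The daily aggregation step behind this comparison can be illustrated on a handful of rows. The records below are hypothetical; the point is only what `resample('D').sum()` does with several rows on one day and with days that have none.

```python
import pandas as pd

# Hypothetical per-platform records: several rows can share one calendar day.
records = pd.DataFrame(
    {"amount": [10.0, 5.0, 7.0, 3.0]},
    index=pd.to_datetime([
        "2017-06-01 09:00", "2017-06-01 15:00",
        "2017-06-02 10:00", "2017-06-04 12:00",
    ]),
)

# One row per day; days without records get a sum of 0.
daily = records.resample("D").sum()["amount"]
print(daily)
```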

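    Trend forecasting

    E.g. forecasting the Lufax transaction volume trend (done with the Facebook Prophet library)

        from prophet import Prophet  # the package was named fbprophet before version 1.0

        # .copy() avoids assigning new columns to a slice of platVolume
        lujinAmount = platVolume[platVolume['wdzjPlatId'] == 59].copy()
        lujinAmount['y'] = lujinAmount['amount']
        lujinAmount['ds'] = lujinAmount['date']
        m = Prophet(yearly_seasonality=True)
        m.fit(lujinAmount)
        future = m.make_future_dataframe(periods=365)  # forecast one year ahead
        forecast = m.predict(future)
        m.plot(forecast)

    Forecast chart [Image: QQ 截图 20180123214653.png]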

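    Data analysis code

    Data analysis code repository (note: the analysis code only runs under Python 3). Sample output after running the code (the code and the charts can be viewed without installing a Python environment)

    Afterword

    This is the first project I wrote after moving from Java web development to data work, and it is also my first Python project. I did not hit many pitfalls along the way. Overall, crawling, data analysis, and Python itself all have a very low barrier to entry.
    If you want to get started with Python crawlers, I recommend 《 Python 网络数据采集》 (Web Scraping with Python)
    [Image: s29086659.jpg]
    If you want to get started with Python data analysis, I recommend 《利用 Python 进行数据分析》 (Python for Data Analysis)
    [Image: 30adcbef76094b360e72e763a9cc7cd98c109d58.jpg]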

    1 reply    2018-01-25 11:04:05 +08:00
    #1  superlead    January 25, 2018
    Nice, very good~