http://www.hkexnews.hk/APP/SEHKAppMainIndex_c.htm
Fetching this site with requests works fine, but with Scrapy I get the garbled text below:
愠敦卅偌桴牯敦琢牯步确敷献桫敷┲が灬楣渥春偈整畲楮┲あ损污潮汩晴潮潮慢钟胥薷诧骄慢搠渠印ㄢ污灭楮束扬摩污示楮汩摩瘾呒印㈢污灭楮朱㈢牯隙蘼潮呒印㈢扬灸∠慢祟扯敲慢慢∠社搠栽∵挽慧支敲楦∠栽∵杨∵愠敦卅奥瑭扳扳桴愠敦卅奥瑭扳扳桴愠敦卅奥瑭扳扳桴搠潮獯汩∠来扬欢锼扯摹慢印㈢搽∠浩戢摩瘠≤灬楮搠渠敲扬
1 leefsfd 2018-12-28 14:16:42 +08:00 It's an encoding problem.
2 liaowy 2018-12-28 14:55:00 +08:00 resp.text.encode('latin-1').decode('utf-8')
3 carlclone 2018-12-28 14:59:32 +08:00 You don't even know this and you're writing crawlers....
4 lihongjie0209 2018-12-28 15:07:32 +08:00
In [10]: response.body.decode(encoding="utf-8")
Out[10]: '\ufeff<html lang="en" class="news-hkex">\r\n <head>\r\n <META http-equiv="Content-Type" content="text/html; charset=utf-16">\r\n <meta name="MS.LOCALE" content="ZH-TW">\r\n <title>:: HKEX :: HKEXnews ::</title>\r\n <meta http-equiv="Pragma" content="no-cache">\r\n <meta http-equiv="Cache-Control" content="no-cache">\r\n <meta charset="UTF-8">\r\n <meta name="viewport" content="width=device-width, initial-scale=1">\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n <link href="/ncms/css/main.css" rel="stylesheet"><script language="Javascript" src="http://www.v2ex.com/script/hkex_common.js"></script><script language="Javascript" src="http://www.v2ex.com/script/hkex_setting.js"></script><script type="text/Javascript" src="http://www.v2ex.com/ncms/script/hkex_app.js"></script><script type="text/Javascript" src="http://www.v2ex.com/ncms/script/hkex_settings.js"></script><script type="text/Javascript" src="http://www.v2ex.com/ncms/script/hkex_widget.js"></script><script type="text/Javascript" src="http://www.v2ex.com/ncms/script/vendor.js"></script><script type="text/Javascript">\n \t var pageDefaultTitle = "申版本,聆後 料集及相料";\n \t var pageDefaultSubTitle = "";\n \t var pageDefaultBanner = "/ncms/media/HKEXnews/top_banner_bg.png";\n \t var pageDefaultTabletBanner = "/ncms/media/HKEXnews/top_banner_bg_tablet.png";\n \t var overrideBreadcrumb = [{\n \t \t\ttitle: "申版本,聆後料集及相料",\n \t \t\turl: "http://www2.hkexnews.hk/New-Listings/Application-Proof-and-PHIP?sc_lang=zh-HK"\n \t },{\n \t \t\ttitle: "新上市",\n \t \t\turl: window.location.href\n \t \t\t}]\n \t var overridePageTools = {};\n \t overridePageTools.showlastupdate = 0;\n \t overridePageTools.showprint = 0;\n\t\t </script><link rel="stylesheet" href="/css/hkex_css.css" type="text/css"><script type="text/Javascript">\n\t\t\t\t\t\tfunction PrintFriendlyUTF()
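The output in #4 starts with '\ufeff' (a UTF-8 byte order mark) while the page's META tag declares charset=utf-16. One plausible reading is that the body is really UTF-8 and trusting the declared UTF-16 produces the CJK-looking garbage. A minimal standard-library sketch of that effect follows; the sample HTML is a made-up stand-in for the real page, and the choice of "utf-16-be" for the wrong decode is an assumption for illustration.

```python
# Hypothetical sample standing in for the page: UTF-8 bytes with a BOM,
# but a meta tag that claims UTF-16.
html = (
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'
    "<title>新上市</title>"
)
body = html.encode("utf-8-sig")  # "utf-8-sig" prepends the UTF-8 BOM

# Trusting the declared charset: each pair of bytes is read as one
# UTF-16 code unit, so ASCII markup turns into CJK-looking characters.
wrong = body.decode("utf-16-be", errors="replace")

# Plain "utf-8" decodes correctly but keeps the BOM as '\ufeff'.
with_bom = body.decode("utf-8")

# "utf-8-sig" decodes correctly and drops the leading BOM.
right = body.decode("utf-8-sig")

print(wrong[:20])
print(right)
```

If this reading is right, the fix is to decode response.body as UTF-8 instead of relying on the encoding derived from the META tag.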
5 lanqing 2018-12-28 16:04:24 +08:00 On the surface, it looks like the data returned by the Scrapy request was decoded as latin-1 (ISO-8859-1), so encode('latin-1').decode('utf-8') should be enough. Both #2 and #4 work.
7 Ewig OP @lanqing Have you tried it with Scrapy? When I tried, it printed nothing. This is what I wrote: print(response.text.encode('latin-1').decode('utf-8'))
9 Ewig OP @lihongjie0209
def parse(self, response):
    print(response.body.decode(encoding="utf-8"))
    linkList = response.body.decode(encoding="utf-8").xpath('//td[@class="pming_black12 ms-rteTableOddCol-BlueTable_CHI"]/a/@href')
    nameList = response.body.decode(encoding="utf-8").xpath('//td[@class="pming_black12 ms-rteTableOddCol-BlueTable_CHI"]/a/text()')
Writing it like this in Scrapy doesn't seem to work: it says this is a string and has no xpath. What is the correct way to write it?
10 lihongjie0209 2018-12-29 11:00:21 +08:00 @Ewig You'd better work on the fundamentals; I can't teach you step by step.
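11 atencheung 2018-12-29 14:05:24 +08:00 For encoding problems, this is how I usually handle it:
import requests
response = requests.get('http://xxxxxx')
response.encoding = "gb2312"  # set this to the encoding the target site actually uses
print(response.text)
That usually returns the text correctly, without garbled characters.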
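Whatever value is assigned to response.encoding has to be a codec name Python recognizes. A small sketch for checking a name before using it; is_known_codec is a hypothetical helper, not part of requests, and "gbk2312" is included as an example of a misspelled name.

```python
import codecs


def is_known_codec(name: str) -> bool:
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# "gb2312" and "gbk" are real codecs; "gbk2312" is not.
for name in ("gb2312", "gbk", "utf-8", "gbk2312"):
    print(name, is_known_codec(name))
```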