|
Checking Search Engine Accuracy with Python
File information
2009-11-10
Author: 磁針石 (xurongzhong#gmail.com)
Blog: oychw.cublog.cn
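The home page of Tencent Soso is http://www.soso.com/. Searching for a keyword such as "武岡" returns a page that contains markup like the following:

    a href="http://www.wugang.gov.cn/" id="res0"

This means that www.wugang.gov.cn is the first search result (res0; results are numbered from 0). Because the rank is embedded in the id, it is easy to extract with a regular expression.

Store the keywords to search for and the expected addresses in c:\word.txt, one keyword and one address per line, separated by whitespace:

    武岡 www.wugang.gov.cn
    武岡 www.wugangren.com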
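The extraction step can be tried offline before any request is sent. The following is a minimal sketch in Python 3 (the original listing targets Python 2); the function name `find_rank` and the `sample` fragment are illustrative assumptions modelled on the markup quoted above, not actual Soso output.

```python
import re

def find_rank(html, address):
    """Return the 0-based rank of `address` in a results page, or None."""
    # re.escape keeps the dots in the address from matching arbitrary characters.
    pattern = re.escape(address) + r'.*?id="res([0-9]+)"'
    m = re.search(pattern, html, re.IGNORECASE)
    return int(m.group(1)) if m else None

# Hypothetical page fragment, assumed to follow the id="resN" convention.
sample = (
    '<a href="http://www.wugang.gov.cn/" id="res0">...</a>\n'
    '<a href="http://www.wugangren.com/" id="res1">...</a>\n'
)

print(find_rank(sample, "www.wugang.gov.cn"))
print(find_rank(sample, "www.wugangren.com"))
print(find_rank(sample, "www.example.com"))
```

Since `.` does not match a newline by default, a match stays on the line that contains the address, which helps avoid picking up the id of a later result.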
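The results are written to c:\out.csv. The code (Python 2) is as follows:

    import re
    import urllib
    import urllib2

    # Raw strings keep the backslashes in the Windows paths literal.
    f = open(r"c:\out.csv", 'w')
    for line in open(r"c:\word.txt"):
        word, address = line.split()
        print "\n--------" + word, address,
        # Percent-encode the keyword so non-ASCII bytes are valid in the URL.
        url = "http://www.soso.com/q?pid=s.idx&w=" + urllib.quote(word)
        response = urllib2.urlopen(url)
        html = response.read()
        # Escape the address so its dots are matched literally, then
        # capture the number N from the id="resN" that follows it.
        pattern = re.escape(address) + '.*?res([0-9]+)'
        m = re.search(pattern, html, re.IGNORECASE)
        if m:
            result = m.group(1)
            print "-----------ok",
        else:
            # The address is absent, or no result id follows it.
            result = "Not found!"
            print "-----------!!!!!!!!----- fail",
        f.write(word + "," + address + "," + result + "\n")
    f.close()

For larger data sets, multiple threads or processes are needed. In practice, however, Tencent does not allow too many searches from a single IP address, so IP spoofing would still need to be investigated.

Related files: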
File: 新建文件夾.rar
Size: 4 KB
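The remark about using multiple threads for larger data sets could be approached roughly as follows. This is only a sketch in Python 3 under stated assumptions: `check_all`, `fake_lookup`, `max_workers`, and `delay` are hypothetical names, and the lookup is a stand-in so the example runs without network access. A small pool and a pause per task merely reduce the request rate; they do not get around a per-IP limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def check_all(pairs, lookup, max_workers=4, delay=0.0):
    """Run lookup(word, address) for every pair on a small thread pool.

    Results come back in the same order as `pairs`.
    """
    def task(pair):
        word, address = pair
        if delay:
            time.sleep(delay)  # crude pause before each lookup
        return word, address, lookup(word, address)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, pairs))

# Stand-in for the real network lookup (assumption: it returns a rank or None).
def fake_lookup(word, address):
    return 0 if address == "www.wugang.gov.cn" else None

pairs = [("武岡", "www.wugang.gov.cn"), ("武岡", "www.wugangren.com")]
for row in check_all(pairs, fake_lookup, max_workers=2):
    print(row)
```

Passing the lookup in as a parameter keeps the threading logic separate from the fetching and parsing, so the real request function can be swapped in later.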
This article is from the ChinaUnix blog. To view the original, go to: http://blog.chinaunix.net/u/21908/showart_2090370.html |
|