Chinaunix
Title:
Can LWP not correctly decode a gb2312-encoded web page?
Author:
sjdy521
Time:
2013-06-11 21:08
Here is the situation:
I use LWP to request a gb2312-encoded web page, for example
http://ip138.com/ips138.asp?ip=8.8.8.8&action=2
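use LWP::UserAgent;
my $response = LWP::UserAgent->new->get("http://ip138.com/ips138.asp?ip=8.8.8.8&action=2");
# The method is decoded_content (not decode_content); it returns the body as Perl characters.
my $content = $response->is_success ? $response->decoded_content : undef;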
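In theory this should return a Unicode string decoded from gb2312, but in practice it does not; the body appears to have been treated as ISO-8859-1.
If decoded_content really cannot recognize gb2312, then the module itself has a bug. But LWP is so widely used that such a bug should have surfaced long ago, so I am not sure whether the mistake is mine. I am posting this in the hope that others can take a look.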
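To make the symptom concrete, here is a minimal sketch that uses only the core Encode module and no network access. The four sample bytes are my own illustration (the GB2312 encoding of a two-character Chinese word), not data taken from the page above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative sample: the GB2312 bytes for a two-character Chinese word.
my $bytes = "\xD6\xD0\xCE\xC4";

# Decoding with the correct charset yields two Chinese characters.
my $right = decode('gb2312', $bytes);

# Decoding as ISO-8859-1 turns every byte into one Latin-1 character instead.
my $wrong = decode('ISO-8859-1', $bytes);

printf "gb2312:     %d chars, first is U+%04X\n", length($right), ord($right);
printf "ISO-8859-1: %d chars, first is U+%04X\n", length($wrong), ord($wrong);
```

The first line reports 2 characters starting at U+4E2D, the second reports 4 characters starting at U+00D6, which is the kind of garbage I see in the decoded page.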
I traced through the source code and reached some conclusions; the key pieces of code follow.
First, the decoded_content function (in HTTP::Message):
if ($self->content_is_text || (my $is_xml = $self->content_is_xml)) {
    my $charset = lc(
        $opt{charset} ||
        $self->content_type_charset ||
        $opt{default_charset} ||
        $self->content_charset ||
        "ISO-8859-1"
    );
    # (excerpt; the rest of the block is omitted)
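The fallback chain quoted above can be sketched as a standalone function. This is only an illustration: pick_charset and its argument names are hypothetical and are not part of HTTP::Message; they simply stand in for the five operands of the `||` chain.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the `||` chain; not part of HTTP::Message.
sub pick_charset {
    my (%arg) = @_;
    return lc(
        $arg{charset}         ||   # explicit charset option
        $arg{header_charset}  ||   # charset parameter of the Content-Type header
        $arg{default_charset} ||   # caller-supplied default
        $arg{sniffed_charset} ||   # what content_charset detects in the body
        "ISO-8859-1"               # final fallback
    );
}

print pick_charset(), "\n";                             # iso-8859-1
print pick_charset(sniffed_charset => 'EUC-CN'), "\n";  # euc-cn
print pick_charset(default_charset => 'cp936',
                   sniffed_charset => 'EUC-CN'), "\n";  # cp936
```

The first call shows why an undefined result from body sniffing silently ends in ISO-8859-1: every earlier operand is false, so the last one wins.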
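The page does not put a charset parameter in the HTTP Content-Type header; instead it declares the encoding inside the HTML document with <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
HTTP::Message also supports detecting the encoding from this meta tag. Walking through the chain above:
$opt{charset} is not set, $self->content_type_charset finds no encoding, and $opt{default_charset} is not set. $self->content_charset should then return gb2312, but in fact
$response->content_charset still returns undef. Tracing into content_charset(), the key code is: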
elsif ($self->content_is_html) {
    # look for <META charset="..."> or <META content="...">
    # http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
    require IO::HTML;
    # Use relaxed search to match previous versions of HTTP::Message:
    my $encoding = IO::HTML::find_charset_in($cref, { encoding => 1,
                                                      need_pragma => 0 });
    return $encoding->mime_name if $encoding;
}
I found that IO::HTML::find_charset_in does return the correct Encode object, but $encoding->mime_name comes back empty, whereas $encoding->name is correctly "euc-cn".
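The same observation can be reproduced without LWP or IO::HTML, using only the core Encode module. This is a small sketch; whether mime_name is defined depends on the Encode version installed, so the script prints what it finds rather than assuming a result.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "gb2312" is an alias; Encode resolves it to its canonical encoding object.
my $enc = find_encoding('gb2312')
    or die "gb2312 is not supported by this Encode installation";

# mime_name may be undef when the canonical name has no MIME mapping.
my $mime = $enc->mime_name;

print "canonical name: ", $enc->name, "\n";
print "mime_name:      ", (defined $mime ? $mime : "(undef)"), "\n";
```

On my installation the canonical name is euc-cn and mime_name is undefined, which matches what content_charset returns.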
Tracing further into mime_name, it actually calls Encode::MIME::Name::get_mime_name. The key code:
our %MIME_NAME_OF = (
'AdobeStandardEncoding' => 'Adobe-Standard-Encoding',
'AdobeSymbol' => 'Adobe-Symbol-Encoding',
'ascii' => 'US-ASCII',
'big5-hkscs' => 'Big5-HKSCS',
'cp1026' => 'IBM1026',
'cp1047' => 'IBM1047',
'cp1250' => 'windows-1250',
'cp1251' => 'windows-1251',
'cp1252' => 'windows-1252',
'cp1253' => 'windows-1253',
'cp1254' => 'windows-1254',
'cp1255' => 'windows-1255',
'cp1256' => 'windows-1256',
'cp1257' => 'windows-1257',
'cp1258' => 'windows-1258',
'cp37' => 'IBM037',
'cp424' => 'IBM424',
'cp437' => 'IBM437',
'cp500' => 'IBM500',
'cp775' => 'IBM775',
'cp850' => 'IBM850',
'cp852' => 'IBM852',
'cp855' => 'IBM855',
'cp857' => 'IBM857',
'cp860' => 'IBM860',
'cp861' => 'IBM861',
'cp862' => 'IBM862',
'cp863' => 'IBM863',
'cp864' => 'IBM864',
'cp865' => 'IBM865',
'cp866' => 'IBM866',
'cp869' => 'IBM869',
'cp936' => 'GBK',
'euc-jp' => 'EUC-JP',
'euc-kr' => 'EUC-KR',
#'gb2312-raw' => 'GB2312', # no, you're wrong, I18N::Charset
'hp-roman8' => 'hp-roman8',
'hz' => 'HZ-GB-2312',
'iso-2022-jp' => 'ISO-2022-JP',
'iso-2022-jp-1' => 'ISO-2022-JP',
'iso-2022-kr' => 'ISO-2022-KR',
'iso-8859-1' => 'ISO-8859-1',
'iso-8859-10' => 'ISO-8859-10',
'iso-8859-13' => 'ISO-8859-13',
'iso-8859-14' => 'ISO-8859-14',
'iso-8859-15' => 'ISO-8859-15',
'iso-8859-16' => 'ISO-8859-16',
'iso-8859-2' => 'ISO-8859-2',
'iso-8859-3' => 'ISO-8859-3',
'iso-8859-4' => 'ISO-8859-4',
'iso-8859-5' => 'ISO-8859-5',
'iso-8859-6' => 'ISO-8859-6',
'iso-8859-7' => 'ISO-8859-7',
'iso-8859-8' => 'ISO-8859-8',
'iso-8859-9' => 'ISO-8859-9',
#'jis0201-raw' => 'JIS_X0201',
#'jis0208-raw' => 'JIS_C6226-1983',
#'jis0212-raw' => 'JIS_X0212-1990',
'koi8-r' => 'KOI8-R',
'koi8-u' => 'KOI8-U',
#'ksc5601-raw' => 'KS_C_5601-1987',
'shiftjis' => 'Shift_JIS',
'UTF-16' => 'UTF-16',
'UTF-16BE' => 'UTF-16BE',
'UTF-16LE' => 'UTF-16LE',
'UTF-32' => 'UTF-32',
'UTF-32BE' => 'UTF-32BE',
'UTF-32LE' => 'UTF-32LE',
'UTF-7' => 'UTF-7',
'utf8' => 'UTF-8',
'utf-8-strict' => 'UTF-8',
'viscii' => 'VISCII',
);
sub get_mime_name($) { $MIME_NAME_OF{$_[0]} };
As you can see, Encode::MIME::Name has no mime_name entry for euc-cn at all. I also checked IANA's official character set registry:
http://www.iana.org/assignments/character-sets/character-sets.xml
and euc-cn is indeed not listed there either.
Could the HTTP::Message source code really be at fault?
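One possible way around the missing mapping, sketched under my own assumptions: fall back to Encode's canonical name whenever no MIME name is registered. charset_label is a hypothetical helper of mine; it is not part of LWP, HTTP::Message, or Encode, and this is not a patch to any of them.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Hypothetical helper: prefer the MIME name, but fall back to the
# canonical Encode name when no MIME name is mapped for the encoding.
sub charset_label {
    my ($label) = @_;
    my $enc = find_encoding($label) or return undef;
    return $enc->mime_name || $enc->name;
}

my $label = charset_label('gb2312');
# Illustrative sample: GB2312 bytes for a two-character Chinese word.
my $text = decode($label, "\xD6\xD0\xCE\xC4");
printf "label=%s, decoded %d characters\n", $label, length $text;
```

Either name is accepted by Encode::decode, so the bytes decode to two characters whichever branch is taken; the printed label depends on the installed Encode version.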
Author:
cronas
Time:
2013-06-12 10:53
my %opts = (
    charset_strict  => 1,
    default_charset => 'cp936',
);
$content = $response->decoded_content( %opts );
Would setting the charset yourself work?
Author:
sjdy521
Time:
2013-06-12 14:32
Reply to #2 cronas:
Setting it yourself certainly works, but that is not the crux of the problem.