Encoding problem with HttpWebResponse
Keywords: problem, encoding, HttpWebResponse | Updated: 2023-09-27 17:47:22
Here is a snippet of code:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
WebRequest.DefaultWebProxy = null;//Ensure that we will not loop by going again in the proxy
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
string charSet = response.CharacterSet;
Encoding encoding;
if (String.IsNullOrEmpty(charSet))
    encoding = Encoding.Default;
else
    encoding = Encoding.GetEncoding(charSet);
StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
return resStream.ReadToEnd();
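The problem is that when I test it with http://www.google.fr
every "é" is displayed incorrectly. I tried changing ASCII to UTF8, but the output is still wrong. I have tested the HTML file in a browser, and the browser displays the HTML text correctly, so I am fairly sure the problem lies in the method I use to download the HTML file.
What should I change?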
Update 1: the code and the test file have been changed.
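If the charset is not specified in the server's Content-Type header (which is not the same thing as the "charset" meta tag in the HTML), it defaults to "ISO-8859-1". I compare HttpWebResponse.CharacterSet with the charset attribute in the HTML. If they differ, I read the page a second time, this time with the encoding named in the HTML.
See the code: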
string strWebPage = "";
// create request
System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);
// get response
System.Net.HttpWebResponse objResponse;
objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
// get charset and encoding from the server's header
string Charset = objResponse.CharacterSet;
Encoding encoding = Encoding.GetEncoding(Charset);
// read response
using (StreamReader sr =
       new StreamReader(objResponse.GetResponseStream(), encoding))
{
    strWebPage = sr.ReadToEnd();
    // Close and clean up the StreamReader
    sr.Close();
}
// Check real charset meta-tag in HTML
int CharsetStart = strWebPage.IndexOf("charset=");
if (CharsetStart > 0)
{
    CharsetStart += 8;
    int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '"', ';' }, CharsetStart);
    // skip the re-read when no terminator (or an empty value) was found
    if (CharsetEnd > CharsetStart)
    {
        string RealCharset =
            strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);
        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare accordingly)
        if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);
            // read the web page again, but with correct encoding this time
            // create request
            System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
            // get response
            System.Net.HttpWebResponse objResponse2;
            objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();
            // read response
            using (StreamReader sr =
                   new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
            {
                strWebPage = sr.ReadToEnd();
                // Close and clean up the StreamReader
                sr.Close();
            }
        }
    }
}
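The IndexOf/IndexOfAny search above is sensitive to quoting and letter case. As a hedged alternative, here is a minimal sketch that pulls the declared charset out of an HTML string with a regular expression. The class name MetaCharset and the method name Find are hypothetical, not part of the answer above, and the pattern matches the first "charset=" it sees anywhere in the text, including inside scripts, so treat it as a heuristic rather than a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class MetaCharset
{
    // Matches charset=VALUE with optional quotes around the value, e.g.
    // <meta charset="utf-8"> or content="text/html; charset=utf-8".
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the first declared charset name, or null when none is found.
    public static string Find(string html)
    {
        if (string.IsNullOrEmpty(html)) return null;
        Match m = CharsetPattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

MetaCharset.Find(strWebPage) could stand in for the manual index arithmetic; the result still needs a case-insensitive comparison against the header charset before deciding to decode again.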
Firstly, the simpler way to write your code is to use a StreamReader and ReadToEnd:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
{
    using (Stream resStream = response.GetResponseStream())
    {
        StreamReader reader = new StreamReader(resStream, Encoding.???);
        return reader.ReadToEnd();
    }
}
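Then it is "just" a matter of finding the right encoding. How did you create the file? If it was with Notepad, then you probably want Encoding.Default,
but that is obviously not portable, since it is the default encoding of your PC.
In a well-run web server, the response will indicate the encoding in its headers. Having said that, response headers sometimes claim one thing while the HTML claims another.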
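When the header is missing or names an encoding the runtime does not know, Encoding.GetEncoding throws an ArgumentException. The following is a minimal sketch of a lookup with a fallback; the class EncodingResolver and its Resolve method are hypothetical names introduced here, and the choice of fallback is left to the caller as an assumption about the site being fetched.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingResolver
{
    // Maps a charset name (from a header or a meta tag) to an Encoding,
    // returning the supplied fallback when the name is absent or unknown.
    public static Encoding Resolve(string charSet, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charSet)) return fallback;
        try
        {
            // Some servers send the value in quotes, e.g. charset="utf-8".
            return Encoding.GetEncoding(charSet.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // The name is not a known or supported encoding.
            return fallback;
        }
    }
}
```

A call such as EncodingResolver.Resolve(response.CharacterSet, Encoding.UTF8) would then replace the if/else in the question. On .NET Core and later, legacy code pages such as windows-1252 are only available after registering CodePagesEncodingProvider, so an unregistered name would fall through to the fallback here.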
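In case you don't want to download the page twice, I slightly modified Alex's code using "How do I put a WebResponse into a memory stream?". Here is the result: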
public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();
        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }
    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();
    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '"', ';' }, CharsetStart);
        // skip the re-read when no terminator (or an empty value) was found
        if (CharsetEnd > CharsetStart)
        {
            string RealCharset =
                strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);
            // real charset meta-tag in HTML differs from supplied server header???
            if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
            {
                // get correct encoding
                Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);
                // reset stream position to beginning
                memoryStream.Seek(0, SeekOrigin.Begin);
                // reread the buffered bytes with the correct encoding
                StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);
                strWebPage = sr2.ReadToEnd();
                // Close and clean up the StreamReader
                sr2.Close();
            }
        }
    }
    // dispose the first stream reader object
    sr.Close();
    return strWebPage;
}
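Once the body is held in memory, the two decoding passes can be reduced to one by sniffing the charset from the raw bytes first. The sketch below is one way to do that under two assumptions that are not in the answer above: the declaration sits within the first 1024 bytes, and the page uses an ASCII-compatible encoding (so not UTF-16). BufferedDecoder and Decode are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class BufferedDecoder
{
    // Decodes a buffered response body, preferring a charset declared near
    // the start of the document over the encoding taken from the header.
    public static string Decode(byte[] body, Encoding headerEncoding)
    {
        // In ASCII-compatible encodings the declaration itself is plain
        // ASCII, so a lossy ASCII decode of the head is enough to find it.
        int headLength = Math.Min(body.Length, 1024);
        string head = Encoding.ASCII.GetString(body, 0, headLength);
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?\s*([\w\-]+)",
                              RegexOptions.IgnoreCase);

        Encoding encoding = headerEncoding;
        if (m.Success)
        {
            try { encoding = Encoding.GetEncoding(m.Groups[1].Value); }
            catch (ArgumentException) { /* unknown name: keep the header encoding */ }
        }
        return encoding.GetString(body);
    }
}
```

With the MemoryStream from the code above, the call would be BufferedDecoder.Decode(memoryStream.ToArray(), encoding). Whether the HTML declaration should win over the header is a judgment call; the Google case described elsewhere in this thread is one where the header was right and the meta tag was wrong.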
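There are some good solutions here, but they all seem to parse the charset out of the Content-Type string by hand. Here is a solution using System.Net.Mime.ContentType, which should be more reliable and shorter.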
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var encoding = System.Text.Encoding.Default;
var contentType = new System.Net.Mime.ContentType(client.ResponseHeaders[HttpResponseHeader.ContentType]);
if (!String.IsNullOrEmpty(contentType.CharSet))
{
    encoding = System.Text.Encoding.GetEncoding(contentType.CharSet);
}
string result = encoding.GetString(data);
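To see what ContentType does with a header value in isolation, here is a small sketch with no network access. The wrapper ContentTypeDemo.CharSetOf is a hypothetical name; it also guards the cases where the header is absent or unparsable, since the ContentType constructor throws on null, empty, or malformed input.

```csharp
using System;
using System.Net.Mime;

// Hypothetical helper, for illustration only.
public static class ContentTypeDemo
{
    // Returns the charset parameter of a Content-Type header value, or
    // null when the header is missing, malformed, or has no charset.
    public static string CharSetOf(string headerValue)
    {
        if (string.IsNullOrWhiteSpace(headerValue)) return null;
        try
        {
            return new ContentType(headerValue).CharSet;
        }
        catch (FormatException)
        {
            // The header value could not be parsed as a media type.
            return null;
        }
    }
}
```

For example, ContentTypeDemo.CharSetOf("text/html; charset=ISO-8859-1") yields "ISO-8859-1", while a bare "text/html" yields null, which is the case where the snippet above keeps Encoding.Default.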
Here is code that downloads the page only once.
String FinalResult = "";
HttpWebRequest Request = (HttpWebRequest)System.Net.WebRequest.Create( URL );
HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
Stream ResponseStream = Response.GetResponseStream();
StreamReader Reader = new StreamReader( ResponseStream );
bool NeedEncodingCheck = true;
while( true )
{
    string NewLine = Reader.ReadLine(); // it may not work for zipped HTML.
    if( NewLine == null )
    {
        break;
    }
    FinalResult += NewLine;
    FinalResult += Environment.NewLine;
    if( NeedEncodingCheck )
    {
        int Start = NewLine.IndexOf( "charset=" );
        if( Start > 0 )
        {
            Start += "charset=\"".Length;
            int End = NewLine.IndexOfAny( new[] { ' ', '"', ';' }, Start );
            // Caveat: bytes the first reader has already buffered are not
            // handed over to the new reader, so text can go missing here.
            Reader = new StreamReader( ResponseStream, Encoding.GetEncoding(
                NewLine.Substring( Start, End - Start ) ) ); // Replace Reader with new encoding.
            NeedEncodingCheck = false;
        }
    }
}
Reader.Close();
Response.Close();
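I studied the same problem with the help of WireShark, a great protocol analyzer. I think the HttpWebResponse class has some design shortcomings. In fact, the whole message entity is downloaded the first time you call the GetResponse() method of HttpWebRequest, but the framework has nowhere to keep that data, in the HttpWebResponse class or elsewhere, so you end up having to fetch the response stream a second time.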
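There are still some problems when requesting the web page "www.google.fr" through a WebRequest.
I checked the raw request and response with Fiddler. The problem comes from the Google servers. The HTTP response header is set to charset=ISO-8859-1 and the text itself is encoded in ISO-8859-1, while the HTML says charset=UTF-8. This is inconsistent and leads to encoding errors.
After many tests, I managed to find a workaround. Just add: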
myHttpWebRequest.UserAgent = "Mozilla/5.0";
to your code, and the Google response will magically become UTF-8 throughout.
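The badly displayed "é" from the question is what any mismatch between the real and the assumed encoding produces. The sketch below reproduces both directions offline; MojibakeDemo and DecodeAs are hypothetical names, and the strings use \u escapes so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Encodes text with the encoding the server really used, then decodes
    // the bytes with the encoding the client assumed.
    public static string DecodeAs(string text, Encoding actual, Encoding assumed)
    {
        byte[] bytes = actual.GetBytes(text);
        return assumed.GetString(bytes);
    }
}
```

With latin1 = Encoding.GetEncoding("ISO-8859-1"), the call MojibakeDemo.DecodeAs("\u00E9", Encoding.UTF8, latin1) returns the two characters "Ã©", because the UTF-8 bytes C3 A9 are each read as one ISO-8859-1 character. The opposite call, DecodeAs("\u00E9", latin1, Encoding.UTF8), yields the replacement character U+FFFD, which is what trusting the UTF-8 meta tag on an ISO-8859-1 body would produce.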