Parsing an HTML document with HtmlAgilityPack
Keywords: HtmlAgilityPack, HTML, document, parsing | Updated: 2023-09-27 18:11:23
I am trying to parse the following HTML snippet with HtmlAgilityPack:
<td bgcolor="silver" width="50%" valign="top">
<table bgcolor="silver" style="font-size: 90%" border="0" cellpadding="2" cellspacing="0"
width="100%">
<tr bgcolor="#003366">
<td>
<font color="white">Info
</td>
<td>
<font color="white">
<center>Price
</td>
<td align="right">
<font color="white">Hourly
</td>
</tr>
<tr>
<td>
<a href='test1.cgi?type=1'>Bookbags</a>
</td>
<td>
$156.42
</td>
<td align="right">
<font color="green">0.11%</font>
</td>
</tr>
<tr>
<td>
<a href='test2.cgi?type=2'>Jeans</a>
</td>
<td>
$235.92
</td>
<td align="right">
<font color="red">100%</font>
</td>
</tr>
</table>
</td>
My code looks like this:
private void ParseHtml(HtmlDocument htmlDoc)
{
    var ItemsAndPrices = new Dictionary<string, int>();
    var findItemPrices = from links in htmlDoc.DocumentNode.Descendants()
                         where links.Name.Equals("table") &&
                               links.Attributes["width"].Equals("100%") &&
                               links.Attributes["bgcolor"].Equals("silver")
                         select new
                         {
                             // select item and price
                         };
}
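In this case, I would like to select the items, Jeans and Bookbags, together with their associated prices, and store them in a dictionary.
E.g. Jeans at price $235.92
Does anyone know how to do this properly with HtmlAgilityPack and LINQ?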
Here is what I came up with:
var ItemsAndPrices = new Dictionary<string, string>();
var findItemPrices = from links in htmlDoc.DocumentNode.Descendants("tr").Skip(1) // skip the header row
                     select links;
foreach (var a in findItemPrices)
{
    var values = (from tds in a.Descendants("td")
                  select tds.InnerText.Trim()).ToList();
    ItemsAndPrices.Add(values[0], values[1]); // item name -> price text
}
The only thing I changed is your <string, int>, because $156.42 is not an int.
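A side note on the filter in the question: `links.Attributes["width"].Equals("100%")` appears to compare the HtmlAttribute object itself with a string rather than comparing its Value, and the indexer returns null when the attribute is missing. The sketch below shows one way the same filter could be written with `GetAttributeValue`. The class name `TableFilterSketch`, the method name `FindSilverTables`, and the sample HTML in `Main` are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TableFilterSketch
{
    // Hypothetical helper: returns every <table> whose width and bgcolor match.
    public static HtmlNode[] FindSilverTables(HtmlDocument htmlDoc)
    {
        return htmlDoc.DocumentNode
            .Descendants("table")
            // GetAttributeValue returns the fallback ("") when the attribute
            // is absent, so no null HtmlAttribute is ever dereferenced.
            .Where(t => t.GetAttributeValue("width", "") == "100%"
                     && t.GetAttributeValue("bgcolor", "") == "silver")
            .ToArray();
    }

    public static void Main()
    {
        // Illustrative input only: one matching table, one without attributes.
        var doc = new HtmlDocument();
        doc.LoadHtml("<table width='100%' bgcolor='silver'><tr><td>x</td></tr></table>" +
                     "<table><tr><td>y</td></tr></table>");
        Console.WriteLine(FindSilverTables(doc).Length);
    }
}
```

Comparing string values this way keeps the LINQ style of the question while avoiding the object-to-string comparison.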
Try this, a regular-expression solution:
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
static Dictionary<string, string> GetProduct(string name, string html)
{
    Dictionary<string, string> output = new Dictionary<string, string>();
    // Matches one line of content, optionally preceded by line breaks.
    string clfr = @"[\r\n]*[^\r\n]+";
    // Group 1: the link target; group 2: the price line two lines below the link.
    string pattern = string.Format(@"href='([^']+)'>{0}</a>.*{1}{1}[\r\n]*([^\$][^\r\n]+)", name, clfr);
    Match products = Regex.Match(html, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
    if (products.Success)
    {
        GroupCollection details = products.Groups;
        output.Add("Name", name);
        output.Add("Link", details[1].Value);
        output.Add("Price", details[2].Value.Trim());
    }
    // Empty when the product was not found.
    return output;
}
Usage:
var ProductNames = new string[2] { "Jeans", "Bookbags" };
for (int i = 0, len = ProductNames.Length; i < len; i++)
{
    var product = GetProduct(ProductNames[i], html);
    if (product.Count != 0)
    {
        Console.WriteLine("{0} at price {1}", product["Name"], product["Price"]);
    }
}
Output:
Jeans at price $235.92
Bookbags at price $156.42
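Note: the Dictionary value type cannot be int, because $235.92 and $156.42 are not valid int values. To convert a price to an int, you can remove the dollar sign and the decimal point (which leaves the amount in cents, e.g. 23592) and call int.Parse().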
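The conversion described in the note can be sketched as follows. This is a minimal illustration, not part of the original answer: the class name `PriceParsing` and its method names are hypothetical, and the decimal variant assumes prices always use `$` as the currency symbol and `.` as the decimal separator.

```csharp
using System;
using System.Globalization;

public static class PriceParsing
{
    // Strips "$" and "." so that "$235.92" becomes "23592" (an amount in cents).
    public static int ToCents(string price)
    {
        string digits = price.Trim().Replace("$", "").Replace(".", "");
        return int.Parse(digits, CultureInfo.InvariantCulture);
    }

    // Alternative: keep the fractional part by parsing into a decimal.
    // A copy of the invariant number format is used with "$" as the currency
    // symbol, so the result does not depend on the machine's current culture.
    public static decimal ToDecimal(string price)
    {
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.CurrencySymbol = "$";
        return decimal.Parse(price.Trim(), NumberStyles.Currency, format);
    }

    public static void Main()
    {
        Console.WriteLine(ToCents("$235.92"));             // prints 23592
        Console.WriteLine(ToDecimal("$156.42") == 156.42m); // prints True
    }
}
```

Storing cents in an int works only while every price has exactly two decimal places; a decimal avoids that assumption.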
Assuming there can be other rows and you don't specifically want only Bookbags and Jeans, I would do it like this:
// Requires: using System.Globalization; using System.Linq;
var table = htmlDoc.DocumentNode
    .SelectSingleNode("//table[@bgcolor='silver' and @width='100%']");
var query =
    from row in table.Elements("tr").Skip(1) // skip the header row
    let columns = row.Elements("td").Take(2) // take only the first two columns
        .Select(col => col.InnerText.Trim())
        .ToList()
    select new
    {
        Info = columns[0],
        // Parses using the current culture's currency symbol and separators.
        Price = Decimal.Parse(columns[1], NumberStyles.Currency),
    };
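Since the question asks for a dictionary, the query above could be materialized along these lines. This is a sketch under stated assumptions: `PriceTableSketch` and `ReadPrices` are hypothetical names, the null check and `TryParse` are defensive additions not in the original answer, and `$` is assumed to be the only currency symbol in the table.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

public static class PriceTableSketch
{
    // Hypothetical helper: maps each item name to its price for every data row.
    public static Dictionary<string, decimal> ReadPrices(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var prices = new Dictionary<string, decimal>();
        var table = doc.DocumentNode
            .SelectSingleNode("//table[@bgcolor='silver' and @width='100%']");
        if (table == null) return prices; // no matching table in the input

        // Culture-independent format that accepts "$" as the currency symbol.
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.CurrencySymbol = "$";

        foreach (var row in table.Elements("tr").Skip(1)) // skip the header row
        {
            var cells = row.Elements("td")
                .Select(td => td.InnerText.Trim())
                .ToList();
            if (cells.Count < 2) continue; // not an item/price row

            if (decimal.TryParse(cells[1], NumberStyles.Currency, format, out var price))
                prices[cells[0]] = price; // the indexer overwrites duplicate names
        }
        return prices;
    }
}
```

Calling `ReadPrices(html)["Jeans"]` on the snippet from the question would be expected to give 235.92, provided the table is found and the rows have the shape shown there.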