Parsing a web page and extracting data
Keywords: extract, data, web page | Updated: 2023-09-27 18:31:28
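My task is to create a web scraper (or screen scraper, whichever you prefer to call it). I have already found HtmlAgilityPack, but given the following HTML sample, how would I extract things like the phone number?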
<td valign="top" class="clsContent" style="width: 250px; padding-right: 21px">
<span class=clsLabelB>Web: </span><a href='http://www.marriott.com/hotels/travel/sandm-san-diego-marriott-del-mar/' target=_blank>http://www.marriott.com/hotels/travel/sandm-san-diego-marriott-del-mar/</a><br />
<div style='padding-top:7px'>
<table cellpadding=0 cellspacing=0>
<tr>
<td valign=top class=clsLabelB nowrap>Phone: </td>
<td valign=top>+1 858-523-1700</td>
</tr>
<tr>
<td valign=top class=clsLabelB nowrap>Fax: </td>
<td valign=top>+1 858-523-1355</td>
</tr>
<tr>
<td valign=top class=clsLabelB nowrap>Toll Free: </td><td valign=top>800-228-9290</td>
</tr>
</table>
</div>
<p><span class=clsLabelB>Chain: </span><a href='/Hotels/Companies/Marriott-International'>Marriott International</a><br />
<span class=clsLabelB>Chain Website: </span><a href='http://www.marriott.com' target=_blank>http://www.marriott.com</a>
<p><span class=clsLabelB>Description: </span>Contemporary high-rise hotel - Convenient to area companies, beaches, golf, shopping, San Diego Zoo and Sea World.<br />
<div style='padding-top:7px'>
<table cellpadding=0 cellspacing=0>
<tr>
<td valign=top class=clsLabelB width=170px nowrap>Year Renovated: </td>
<td valign=top>2003</td>
</tr>
</table>
</div>
<div style='padding-top:7px'>
<table cellpadding=0 cellspacing=0>
<tr>
<td valign=top class=clsLabelB width=170px nowrap>Check in Time: </td>
<td valign=top>4:00 PM</td>
</tr>
<tr>
<td valign=top class=clsLabelB width=170px nowrap>Check out Time: </td>
<td valign=top>12:00 PM</td>
</tr>
<tr>
<td valign=top class=clsLabelB width=170px nowrap>Number of Floors: </td>
<td valign=top>11</td>
</tr>
<tr>
<td valign=top class=clsLabelB width=170px nowrap>Total Number of Rooms: </td>
<td valign=top>284</td>
</tr>
</table>
</div>
</td>
I have no sample code to show at the moment because I am completely stuck on this. Any help would be greatly appreciated.
Try this. Here is some sample code:
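using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.html");
// Find the label cell whose text contains "Phone", then take the first <td> that follows it in the same row.
string phone_number = doc.DocumentNode.SelectSingleNode("//td[contains(text(), 'Phone')]/following-sibling::td[1]").InnerText;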
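If you need more than the phone number, the same label-then-sibling idea can be applied to every row at once. The following is only a sketch under a few assumptions: the page is saved as file.html (the same placeholder name as above), every label cell carries exactly class=clsLabelB as in the sample, and the `fields` dictionary and `Program` class are names made up for illustration. It relies only on HtmlAgilityPack's `DocumentNode`, `SelectNodes`, `SelectSingleNode`, `InnerText`, and `HtmlEntity.DeEntitize`.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("file.html"); // placeholder path, as in the snippet above

        // Hypothetical container: label text -> value text.
        var fields = new Dictionary<string, string>();

        // Label cells in the sample are <td class=clsLabelB ...>.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var labels = doc.DocumentNode.SelectNodes("//td[@class='clsLabelB']");
        if (labels != null)
        {
            foreach (HtmlNode label in labels)
            {
                // The value is the first <td> after the label within the same row.
                HtmlNode value = label.SelectSingleNode("following-sibling::td[1]");
                if (value == null) continue;

                // "Phone: " becomes "Phone"; entities such as &amp; are decoded.
                string key = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':');
                fields[key] = HtmlEntity.DeEntitize(value.InnerText).Trim();
            }
        }

        foreach (var pair in fields)
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```

With the HTML from the question, this should give entries such as Phone, Fax, and Toll Free, which you can then read with `fields["Phone"]`. The Web, Chain, and Description labels are `<span>` elements rather than `<td>` cells, so this XPath skips them on purpose; they would need their own query.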