C# regex pattern to extract URLs from a given string - not only full HTML URLs but bare links as well

Updated: 2023-09-27 18:34:07

I need a regular expression that does the following:

Extract all strings that start with http://
Extract all strings that start with www.

So I need to extract both of these.

For example, take the string below:

house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue

From the string above, I should get:

www.monstermmorpg.com
http://www.monstermmorpg.com
http://www.monstermmorpg.commerged

I am looking for a regex or any other way to do this. Thank you.

C# 4.0


You can handle this with a fairly simple regular expression, or with a more traditional string split + LINQ approach.

Regular expression

var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
foreach(Match m in linkParser.Matches(rawString))
    MessageBox.Show(m.Value);

The pattern explained:

\b         - matches a word boundary (the position between a word character and a non-word character such as a space or period)
(?:        - begins a group; the ?: means the group's contents are not captured
https?://  - matches http:// or https:// (the '?' after the "s" makes it optional)
|          - OR
www\.      - matches the literal string www. (the \. means a literal ".")
)          - ends the group
\S+        - matches one or more non-whitespace characters
\b         - matches the closing word boundary

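Basically, the pattern looks for strings that start with http://, https://, or www. (the (?:https?://|www\.) part) and then matches every character up to the next whitespace. The closing \b leaves out any trailing non-word characters, such as a final period or comma.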
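To make the effect of the closing \b concrete, here is a minimal console sketch. The class name LinkPatternDemo, the helper ExtractLinks, and the example.com sample text are illustrative assumptions and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPatternDemo
{
    // Same pattern as above: a scheme or "www." prefix, then non-whitespace
    // characters, ending at the last word boundary before the whitespace.
    private static readonly Regex LinkParser = new Regex(
        @"\b(?:https?://|www\.)\S+\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every match in order of appearance.
    public static List<string> ExtractLinks(string text)
    {
        var links = new List<string>();
        foreach (Match m in LinkParser.Matches(text))
            links.Add(m.Value);
        return links;
    }

    public static void Main()
    {
        // Made-up sample text with punctuation directly after each link.
        var text = "Visit www.example.com, then http://example.org/path. Done";
        foreach (var link in ExtractLinks(text))
            Console.WriteLine(link);
    }
}
```

With this input the sketch should print www.example.com and http://example.org/path, without the trailing comma and period. The same rule also drops a trailing slash, so https://google.com/ comes back as https://google.com.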

Traditional string option

var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
// Requires: using System.Linq;
var links = rawString.Split("\t\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"));
foreach (string s in links)
    MessageBox.Show(s);

Using Nikita's reply, I can get the URL in a string very easily:

using System.Text.RegularExpressions;
string myString = "test =) https://google.com/";
Match url = Regex.Match(myString, @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");
string finalUrl = url.ToString();
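Regex.Match returns only the first match, so a string with several URLs needs Regex.Matches instead. The sketch below is one way to do that. The names AllUrlsDemo and FindUrls are hypothetical. The pattern is a variation, not the one posted above: it uses non-capturing groups and leaves the space out of the path character class, on the assumption that a URL should end at the first whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllUrlsDemo
{
    // Variation on the pattern above: non-capturing groups, and no space in
    // the path character class so that a match stops at whitespace.
    private const string UrlPattern =
        @"https?://(?:[\w-]+\.)+[\w-]+(?:/[\w./?%&=-]*)?";

    // Hypothetical helper: collects every URL instead of only the first.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(text, UrlPattern))
            urls.Add(m.Value);
        return urls;
    }

    public static void Main()
    {
        // Made-up sample text containing two URLs.
        var text = "test =) https://google.com/ and http://example.org/a?b=1";
        foreach (var url in FindUrls(text))
            Console.WriteLine(url);
    }
}
```

For the sample text this should print https://google.com/ and http://example.org/a?b=1. Unlike the \b-based pattern, this one keeps the trailing slash because the slash is part of the optional path group.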

This does not work for HTML that contains URLs.

For example:

<table><tr><td class="sub-img car-sm" rowspan ="1"><img src="https://{s3bucket}/abc/xyzxyzxyz/subject/jkljlk757cc617-a560-48f5-bea1-f7c066a24350_202008210836495252.jpg?X-Amz-Expires=1800&X-Amz-Algorithm=abcabcabc&X-Amz-Credential=AKIAVCAFR2PUOE4WV6ZX/20210107/ap-south-1/s3/aws4_request&X-Amz-Date=20210107T134049Z&X-Amz-SignedHeaders=host&X-Amz-Signature=3cc6301wrwersdf25fb13sdfcfe8c26d88ca1949e77d9e1d9af4bba126aa5fa91a308f7883e"></td><td class="icon"></td></tr></table>

For that, you need to use the regular expression below:

Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
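When the URLs of interest sit inside HTML attributes, another option is to anchor on the attribute itself and take everything up to the closing quote. This is a different technique from the regex above, sketched under the assumption that the values are double-quoted href or src attributes. The names HtmlAttributeUrlDemo and ExtractAttributeUrls and the example.com markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlAttributeUrlDemo
{
    // Captures the double-quoted value of an href or src attribute when it
    // starts with http:// or https://; everything up to the closing quote
    // is kept, including the query string.
    private static readonly Regex AttributeUrl = new Regex(
        @"(?:href|src)\s*=\s*""(https?://[^""]+)""",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured attribute values in order.
    public static List<string> ExtractAttributeUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in AttributeUrl.Matches(html))
            urls.Add(m.Groups[1].Value);
        return urls;
    }

    public static void Main()
    {
        // Made-up markup, shortened from the kind of signed URL shown above.
        var html = "<td><img src=\"https://example.com/a/b.jpg?X-Amz-Expires=1800&X-Amz-Date=20210107T134049Z\"></td>";
        foreach (var url in ExtractAttributeUrls(html))
            Console.WriteLine(url);
    }
}
```

This sketch ignores single-quoted and unquoted attribute values. For arbitrary markup, a real HTML parser is likely to be more robust than any regex.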