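Regex to match sentences that contain decimals and names
Keywords: sentences, decimals, Regex | Updated: 2023-09-27 18:23:51
I feel like I'm close on this, but as soon as I move the punctuation capture to the end of the sentence, it captures incorrectly.
The sentence scenario is as follows: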
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a sentence with odd spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.
This should result in the following captures:
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it.
This is a sentence with odd spacing.
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle.
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
This is my expression:
.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))
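There are currently two problems:
- The punctuation ends up at the start of the capture instead of the end
- It handles one name per sentence correctly, but not two (for bonus points I'd like it to capture "Mr. D. J. Smith" correctly, but I can't see how that could work without breaking sentences that end in a single letter)
The data going into it will be normalised to some extent, so we know it will end with a full stop and be on a single line, but any pointers are welcome.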
I agree with @spender's suggestion to use a parser to handle all the punctuation rules.
However, the following will work for your scenario.
foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);
Output
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it.
This is a sentence with odd spacing.
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle.
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
Ideone demo
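For anyone who wants to run the snippet as-is, here is a minimal self-contained sketch of the same approach. The class name `SentenceSplitter` and the method `Split` are illustrative names only and are not part of the original answer; the pattern itself is unchanged, and the comments describe how I read each piece of it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the names are assumptions made for this example.
public static class SentenceSplitter
{
    // Pattern from the answer, piece by piece:
    //   .*?                              lazily take as little text as possible
    //   (?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))  reject a stop that directly follows a single
    //                                    capital initial, one of the listed titles, or a digit
    //   [!?.;]+                          one or more closing punctuation marks
    //   \s*                              skip whitespace before the next sentence
    //                                    (outside group 1, so it is not captured)
    private static readonly Regex SentencePattern =
        new Regex(@"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*");

    // Returns the text of group 1 for every match, i.e. one entry per sentence.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }

    public static void Main()
    {
        string s = "This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. "
                 + "This is a sentence with odd spacing. "
                 + "This is one with lots of exclamation marks at the end!!!!"
                 + "This is another with a decimal 10.00 in the middle. "
                 + "Why is it so hard to find sentence endings?"
                 + "Last sentence without a space at the start.";

        foreach (string sentence in Split(s))
            Console.WriteLine(sentence);
    }
}
```

Two limitations follow directly from the lookbehind, and they line up with the concern raised in the question: a sentence that ends in a digit ("It costs 10.") or in a single capital letter ("It was I.") is not treated as finished and gets merged with the sentence after it, and any trailing text with no closing punctuation is not matched at all. The lookbehind is also variable-length, which .NET accepts but some other regex engines do not.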