Regex to match sentences with decimals and names

Keywords: sentence, decimal, Regex | Updated: 2023-09-27 18:23:51

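I feel I'm close with this, but as soon as I move the capture of the punctuation to the end of the sentence, it captures incorrectly.

The sample text is as follows: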

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a  sentence      with odd   spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.

This should result in the following captures:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.

This is my expression:

.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))

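There are currently two problems:

  1. The punctuation ends up at the start of each capture.
  2. It handles one name per sentence correctly, but not two. (For bonus points, I'd like it to capture "Mr. D. J. Smith" correctly, but I can't see how it could do that without failing to match sentences that end in a single letter.)

The data going into this will be normalised to some extent, so we know it will end with a full stop and be on a single line, but any pointers are welcome.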

I agree with @spender's suggestion to use a parser to handle all the punctuation rules.

However, the following will work for your scenario.

foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);

Output

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
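
To make it easier to see why the pattern stops where it does, here is a minimal, self-contained sketch of the same expression written with RegexOptions.IgnorePatternWhitespace so each part can carry a comment. The helper name SplitSentences and the variable sample are hypothetical, introduced only for illustration, and the sketch assumes C# 9 or later (top-level statements) with the input on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): returns the text captured
// by group 1 for every match, i.e. each sentence without the trailing whitespace.
static List<string> SplitSentences(string text)
{
    const string pattern = @"
        (                  # group 1: one sentence
          .*?              # as little text as possible
          (?<!             # the terminator must NOT be preceded by:
            (?: \b[A-Z]    #   a single capital letter at a word start (an initial)
              | Mrs?       #   the title Mr or Mrs
              | Dr | Rev   #   the title Dr or Rev
              | \d )       #   a digit, so 10.00 is not split
          )
          [!?.;]+          # one or more terminator characters
        )
        \s*                # whitespace between sentences, kept outside group 1";

    var sentences = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnorePatternWhitespace))
        sentences.Add(m.Groups[1].Value);
    return sentences;
}

// Shortened version of the sample text from the question.
string sample = "This has a name like Mr. D. Smith and Mr J. Smith in it. "
              + "Lots of exclamation marks!!!!A decimal 10.00 in the middle.";

foreach (string sentence in SplitSentences(sample))
    Console.WriteLine(sentence);
```

Because the lookbehind rejects any terminator that follows a lone capital letter, a name such as "Mr. D. J. Smith" should stay inside its sentence. The trade-off is the one the question anticipates: a sentence that really does end in a single capital letter ("Plan B.") or in a digit ("It costs 10.") is not treated as finished, and if no later terminator qualifies, that trailing text is not returned at all. That is another reason a proper parser is the safer choice for input that is not normalised.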
