[Translation] Java Regex Series: (2) Quantifiers

Category: Java · Published: 8 years ago


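Translation notes

greedy: matches as much as possible (maximal match);

reluctant: also called lazy; matches as little as possible (minimal match);

possessive: takes everything it can and never gives any of it back; some translations render this as "dominant";

These three are modes applied to a quantifier. They can be understood as the matching strategy used when part of a pattern repeats.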

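Quantifiers

A quantifier specifies how many times a part of a regular expression may repeat. For convenience, this article covers the three kinds of quantifier described in the Pattern API specification: greedy, reluctant, and possessive. At first glance X?, X?? and X?+ look alike, since each promises to match "X, once or not at all". The sections below explain the subtle differences in how they behave.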

Greedy	Reluctant	Possessive	Meaning
`X?`	`X??`	`X?+`	`X`, once or not at all
`X*`	`X*?`	`X*+`	`X`, zero or more times
`X+`	`X+?`	`X++`	`X`, one or more times
`X{n}`	`X{n}?`	`X{n}+`	`X`, exactly `n` times
`X{n,}`	`X{n,}?`	`X{n,}+`	`X`, at least `n` times
`X{n,m}`	`X{n,m}?`	`X{n,m}+`	`X`, at least `n` but not more than `m` times

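Let's start by creating three basic regular expressions: the letter "a" followed by ?, *, or +, all of them greedy. First, see what happens when they are tested against the empty string "":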

Enter your regex: a?
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search: 
No match found.

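Zero-Length Matches

In the example above, the first two regexes match successfully because a? and a* both allow zero occurrences of a. The start and end indices are both 0, unlike any match seen so far. The empty string "" has length 0, so the only place a match can occur is index 0. This kind of match is called a zero-length match.

Zero-length matches can occur in several situations: in an empty input string, at the beginning of the input, after the last character, or between any two characters. They are easy to spot because the start and end indices are equal.

Here are a few more examples of zero-length matches. Change the input to the single letter "a" and something interesting appears: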
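
The transcripts in this article come from an interactive test harness. As a rough, non-interactive stand-in, the sketch below collects every match together with its start and end index, so a zero-length match shows up as an entry whose two indices are equal. The class name ZeroLengthDemo and the helper findAll are hypothetical names chosen for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthDemo {
    // Hypothetical helper: returns each match as "text"@start-end.
    public static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add("\"" + m.group() + "\"@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // Empty input: a? and a* yield one zero-length match, a+ yields none.
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println(regex + " -> " + findAll(regex, ""));
        }
    }
}
```

An entry such as ""@0-0 is a zero-length match: the matched text is empty and both indices are 0.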

Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

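All three quantifiers find the letter "a", but the first two also find a zero-length match at index 1, that is, after the last character of the input. The matcher sees the "a" as sitting in the cell between index 0 and index 1, and the harness keeps looping until no more matches are found. Depending on the quantifier, the "nothing" at the end of the input may or may not produce a match.

Now change the input to five letters "a" in a row: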

Enter your regex: a?
Enter input string to search: aaaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a*
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a+
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.

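The regex a? finds a separate match for each letter, because it matches zero or one "a" at a time; it then reports a zero-length match at index 5. The regex a* finds two matches: first all five consecutive letters "a", then a zero-length match at index 5, the end of the input. And a+ matches every "a" in a single match and ignores the "nothing" at the end.

Next, consider what the first two regexes do when they run into a letter other than "a", for example the "b" characters in "ababaaaab".

Here is the example: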
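
The number of matches each quantifier produces on "aaaaa" can be checked with a small counting loop. This is only a sketch: MatchCountDemo and countMatches are hypothetical names, and the loop simply calls Matcher.find() until it returns false.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchCountDemo {
    // Hypothetical helper: counts how many times find() succeeds.
    public static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "aaaaa";
        System.out.println(countMatches("a?", input)); // 6: five letters plus one zero-length match
        System.out.println(countMatches("a*", input)); // 2: one long match plus one zero-length match
        System.out.println(countMatches("a+", input)); // 1: one long match only
    }
}
```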

Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

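The letter "b" appears at indices 1, 3, and 8, and the output shows zero-length matches at exactly those positions (plus one more at index 9, the end of the input). The regex a? is not specifically looking for "b"; it only looks for the presence or absence of "a". If the quantifier allows zero occurrences, every position in the input that is not an "a" shows up as a zero-length match. The remaining a's are matched according to the rules discussed earlier.

To match a pattern exactly n times, specify the number inside a pair of braces: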
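
To pick out only the zero-length matches, one option is to keep the matches whose start index equals their end index. The sketch below does that for the input "ababaaaab"; ZeroLengthPositions and zeroLengthPositions are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthPositions {
    // Hypothetical helper: indices where a zero-length match was reported.
    public static List<Integer> zeroLengthPositions(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            if (m.start() == m.end()) { // start == end means nothing was consumed
                positions.add(m.start());
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        System.out.println(zeroLengthPositions("a?", input)); // [1, 3, 8, 9]
        System.out.println(zeroLengthPositions("a*", input)); // [1, 3, 8, 9]
        System.out.println(zeroLengthPositions("a+", input)); // []
    }
}
```

Indices 1, 3, and 8 are where the letter "b" sits, and 9 is the end of the input.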

Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.

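The regex a{3} matches three consecutive letters "a". The first test fails because the input does not contain enough a's. In the second test the input contains exactly three a's, so there is one match. The third test also produces one match, because the input begins with three a's; the letter that follows is irrelevant to that match. If the pattern appears again later in the input, the remaining text produces further matches: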

Enter your regex: a{3}
Enter input string to search: aaaaaaaaa
I found the text "aaa" starting at index 0 and ending at index 3.
I found the text "aaa" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

To require a pattern to appear at least n times, add a comma after the number, for example:

Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.

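With the same nine-letter input there is now only one match, because a run of nine a's also satisfies "at least three a's".

To set an upper limit on the number of occurrences, add a second number inside the braces:

Enter your regex: a{3,6} // at least 3 and at most 6 a's in a row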
Enter input string to search: aaaaaaaaa
I found the text "aaaaaa" starting at index 0 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

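Here the first match stops at the upper limit of six characters. The second match takes the remaining letters, which happen to be three a's, exactly the minimum required. If the input were one character shorter, there would be no second match, because two a's are not enough to satisfy the pattern.

Capturing Groups and Character Classes with Quantifiers

So far, quantifiers have only been tested on input made of a single repeated character. In fact, a quantifier attaches to only one character at a time, so the regex "abc+" means "the letter a, followed by the letter b, followed by one or more of the letter c". It does not mean "abc" one or more times. Quantifiers can, however, also attach to character classes and capturing groups: [abc]+ means "a or b or c, one or more times" (any sequence made of those three letters), while (abc)+ means the group "abc" as a whole, one or more times, for example abcabcabc.

To see this concretely, require the group (dog) to appear three times in a row: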
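
The three brace forms can be compared side by side on the same input. The following sketch lists the matched substrings for a{3}, a{3,} and a{3,6}; BoundedRepeatDemo and matchedTexts are hypothetical names, and the eight-letter input is an extra case added here to show the "one character shorter" situation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedRepeatDemo {
    // Hypothetical helper: the text of every match, in order.
    public static List<String> matchedTexts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> texts = new ArrayList<>();
        while (m.find()) {
            texts.add(m.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa";
        String eight = "aaaaaaaa";
        System.out.println(matchedTexts("a{3}", nine));    // [aaa, aaa, aaa]
        System.out.println(matchedTexts("a{3,}", nine));   // [aaaaaaaaa]
        System.out.println(matchedTexts("a{3,6}", nine));  // [aaaaaa, aaa]
        System.out.println(matchedTexts("a{3,6}", eight)); // [aaaaaa], two a's are left over
    }
}
```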

Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.

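The first example finds two matches, each made of "dog" repeated three times, because the quantifier applies to the whole capturing group. Remove the parentheses and the match fails, because the quantifier {3} now applies only to the letter "g".

Similarly, a quantifier can apply to an entire character class: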

Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.

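In the first example the quantifier {3} applies to the whole character class; in the second it applies only to the letter "c".

Differences Among Greedy, Reluctant, and Possessive Quantifiers

There are subtle differences among greedy, reluctant, and possessive quantifiers.

A greedy quantifier makes the matcher eat the entire input string before attempting the first match. If that first attempt (against the whole input) fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no characters left to back off from. Depending on the quantifier, the last thing it tries to match against is 1 or 0 characters.

A reluctant quantifier takes the opposite approach: it starts at the beginning of the input and reluctantly eats one character at a time, attempting a match after each one. The last thing it tries is the entire input string.

A possessive quantifier always eats the entire input string and tries for a match once, and only once. Unlike a greedy quantifier, a possessive quantifier never backs off, even if doing so would allow the overall match to succeed.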
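
One way to see what a quantifier is attached to is to test inputs that each pattern does accept. The sketch below uses Pattern.matches, which requires the whole input to match; QuantifierScopeDemo and fullMatch are hypothetical names, and the inputs "doggg" and "abccc" are additional examples chosen for this illustration.

```java
import java.util.regex.Pattern;

public class QuantifierScopeDemo {
    // Hypothetical helper: true when the entire input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // {3} applies to the whole group (dog)
        System.out.println(fullMatch("(dog){3}", "dogdogdog")); // true
        // {3} applies only to the letter g
        System.out.println(fullMatch("dog{3}", "dogdogdog"));   // false
        System.out.println(fullMatch("dog{3}", "doggg"));       // true
        // {3} applies to the whole character class [abc]
        System.out.println(fullMatch("[abc]{3}", "cab"));       // true
        // {3} applies only to the letter c
        System.out.println(fullMatch("abc{3}", "abccc"));       // true
    }
}
```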

Consider the following examples:

Enter your regex: .*foo  // greedy quantifier (the default in Java)
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

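The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion first eats the entire input string. At that point the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher backs off one letter at a time until the rightmost "foo" has been given back, at which point the match succeeds and the search ends.

The second example is reluctant, so it starts by consuming nothing. Because "foo" does not appear at the beginning of the input, it is forced to swallow the first letter (an "x"), which triggers the first match, from index 0 to 4. The harness then continues from index 4 until the input is exhausted, and finds a second match from index 4 to 13.

The third example fails to find a match because the quantifier is possessive. The entire input string is consumed by .*+, leaving nothing to satisfy the "foo" at the end of the expression. Use a possessive quantifier when you want to seize all of something without ever backing off; in cases where the match is not found immediately, it outperforms the equivalent greedy quantifier.

Original article: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html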
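
The three modes can be run against the same input in a few lines of Java. This is a sketch only: QuantifierModesDemo and matchedTexts are hypothetical names, and the final x*+foo case is an extra example, not part of the original tutorial, showing that a possessive quantifier can still succeed when the quantified part cannot swallow the text that must follow it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModesDemo {
    // Hypothetical helper: the text of every match, in order.
    public static List<String> matchedTexts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> texts = new ArrayList<>();
        while (m.find()) {
            texts.add(m.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(matchedTexts(".*foo", input));  // greedy: [xfooxxxxxxfoo]
        System.out.println(matchedTexts(".*?foo", input)); // reluctant: [xfoo, xxxxxxfoo]
        System.out.println(matchedTexts(".*+foo", input)); // possessive: []
        // Extra case (an assumption for illustration): x*+ stops at the first
        // non-x character, so "foo" is still available to match.
        System.out.println(matchedTexts("x*+foo", "xxxfoo")); // [xxxfoo]
    }
}
```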
