Understanding Regular Expressions in Groovy

Updated: 2025-09-07 14:15:57 Author: Allocator

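This article summarizes the features of regular expressions in Groovy. As an extension of Java, Groovy adds the =~ and ==~ operators. Escape characters need care: in single- or double-quoted strings the backslash of a regex metacharacter must be doubled, while slashy strings need no extra escaping. Compared with Java, Groovy is more concise, but it is still built on the Pattern and Matcher classes, which handle pattern matching and the capture of substrings.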

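Background

Our project uses Gradle as its build automation tool. In my spare time I reviewed how the tool is used and the Groovy syntax that its configuration files rely on. While studying Groovy I came across something rather interesting: its regular expressions. This article summarizes the features of regular expressions in Groovy and how they differ from regular expressions in Java.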

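Groovy Regular Expression Syntax

Groovy is an extension of the Java language. It can use the JDK seamlessly, and its own SDK (the GDK) extends Java further. Regular expressions in Groovy still rely on the classes in the JDK's java.util.regex package. Personally, I think this part can be seen as syntactic sugar that simply makes regular expressions more convenient to use in Groovy.

Here is the simplest possible regular expression: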

def reg1 = ~'he*llo'
def reg2 = /he*llo/
println "reg1 type is ${reg1.class}"
println "reg2 type is ${reg2.class}"
println "hello".matches(reg1)
println "hello".matches(reg2)

Output:

reg1 type is class java.util.regex.Pattern
reg2 type is class java.lang.String
true
true

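The example above writes the regular expression in two ways: with ~ followed by a string, and as a slashy string (delimited by forward slashes).

Groovy supports defining a regular expression with ~. As the output shows, reg1 is a Pattern rather than a string, whereas reg2, a slashy string without the ~ prefix, is still a plain java.lang.String; writing ~/he*llo/ would make it a Pattern as well. Note the space between = and ~ in the example. It is there because Groovy also has the =~ operator. =~ is the find operator: it follows a string, expects a regular expression on its right, and returns a java.util.regex.Matcher object. Another operator that is easy to confuse with it is ==~, the match operator. It is also followed by a regular expression, but it returns a Boolean, and it returns true only if the string on the left matches the regular expression on the right in full. For example: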

def val1 = "hello" =~ "he*llo"
println val1.class
print val1.matches()

Output:

class java.util.regex.Matcher
true

The match operator simplifies the code above:

def val1 = "hello" ==~ "he*llo"
println val1.class
print val1

Output:

class java.lang.Boolean
true
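The difference between the two operators shows up when the pattern matches only part of the input. The following is a minimal sketch; the input string "say hello" and the variable names are assumptions chosen for illustration and do not come from the original examples.

```groovy
// Hypothetical input: the pattern he*llo matches only the tail of this string.
def text = "say hello"

// =~ builds a Matcher; in a boolean context Groovy asks whether a match can be found.
def found = (text =~ /he*llo/) ? true : false

// ==~ requires the whole string to match the pattern.
def exact = text ==~ /he*llo/

println "found: ${found}"
println "exact: ${exact}"
```

Here found is true because "hello" occurs inside the input, while exact is false because the leading "say " is not covered by the pattern.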

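Raw Characters and Escaping

Regular expressions contain special character classes that are used to match text, such as \w, which stands for [a-zA-Z_0-9]. These usually begin with \, so escape characters come into play.

First, recall how the choice of quotes affects a Groovy string: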

def val1 = "test value"
println 'value is ${val1}'
println "value is ${val1}"

Output:

value is ${val1}
value is test value

When the regular expression is built from a double-quoted string, a single backslash has to be written as \\. For example:

def reg1 = "hello \\w*"
def reg2 = /hello \w*/
println "hello world" ==~ reg1
println "hello world" ==~ reg2

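Both lines print true. With a slashy string, of course, no extra backslash is needed for escaping. It is tempting to think that a single-quoted Groovy string is a raw string, one that means exactly what it shows, so let us try a single-quoted string for the regex match:

def reg1 = 'hello \w*' // change this to 'hello \\w*' and it runs correctly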
println "hello world" ==~ reg1

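The result, however, is an error: a single-quoted string still needs the escape. On reflection, single quotes in Groovy only turn off ${} interpolation; backslash escape sequences are still processed, and \w is not a valid one, so the script fails to compile. Interpolation and regex backslashes are two separate issues. Therefore, whether you use single or double quotes, the backslash of a regex metacharacter must be escaped. To avoid the escaping, use a slashy string in place of the single or double quotes.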
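As a quick check of the escaping rules, the sketch below spells the same regex text in three ways and compares the resulting strings. The variable names are assumptions made for this illustration.

```groovy
// Three spellings of the same regex text: hello \w*
def single = 'hello \\w*'   // single quotes: no interpolation, backslash still doubled
def dbl    = "hello \\w*"   // double quotes: interpolation enabled, backslash doubled
def slashy = /hello \w*/    // slashy string: backslash written once

// All three hold the same nine characters.
assert single == slashy
assert dbl == slashy
println single.length()

// Any of them can be used on the right-hand side of ==~.
println("hello world" ==~ single)
```

The doubled backslash exists only in the source code; the string that reaches the regex engine contains a single backslash in every case.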

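Pattern and Matcher

Regular expressions in Groovy are still backed by these two Java classes, so let us go back to them. In JDK 1.8 they are the core of the java.util.regex package. Pattern represents the "pattern" of a regular expression, which is an abstract concept. When programming we write the regex as a string, but that string is only a description of the pattern; it has to be compiled into a Pattern before it can do any work. In Groovy, you can think of the ~ operator as compiling a string into a Pattern object. Let us review some important concepts and methods of this class:

Pattern.matches and Pattern.matcher

Matcher matcher(CharSequence input)

This method returns a Matcher object that matches the given input against the pattern.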

def reg = ~/^hello \w*world$/
def str = "hello world"
def matcher = reg.matcher(str)
println matcher.class

The printed type is java.util.regex.Matcher. In Groovy, the same Matcher can be created in a single step with the =~ operator:

def matcher = "hello world"=~/^hello \w*world$/
println matcher.class

static boolean matches(String regex, CharSequence input)

This method compiles the given regular expression and tries to match the given input against it. It is a static method in Java and serves as a quick way to check whether a string matches a regular expression. Groovy has a simple equivalent:

println "hello world"==~/^hello \w*world$/

The result is true. Groovy thus uses the two operators =~ and ==~ to build a Matcher object and to quickly check whether a string matches a regular expression, respectively.
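To make the correspondence concrete, the following sketch runs the Java-style calls and the Groovy operators side by side on the same input. The variable names are assumptions; Pattern.compile, Pattern.matcher, and Pattern.matches are the standard java.util.regex methods.

```groovy
import java.util.regex.Pattern

def regex = /^hello \w*world$/
def input = "hello world"

// Java-style calls, available unchanged in Groovy
def javaMatcher = Pattern.compile(regex).matcher(input)
def javaResult  = Pattern.matches(regex, input)

// Groovy operator shorthand for the same two steps
def groovyMatcher = input =~ regex
def groovyResult  = input ==~ regex

assert javaMatcher.matches() == groovyMatcher.matches()
assert javaResult == groovyResult
println javaResult
```

Both routes should give the same answer, since the operators are only a shorter spelling of the same JDK machinery.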

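Capturing Groups in Matcher

First of all, a Matcher is the engine that interprets a Pattern. As I understand it, sometimes we use regular expressions not just to check whether a string matches a pattern but for more flexible, advanced operations, such as retrieving the substrings that matched. That is when the Matcher object is needed. In Java you get it by calling Pattern's matcher method; in Groovy the =~ operator is enough. Abstractly, this object stores all the match information between one regular expression and one input string. The concept of a capturing group comes from the () in regular expressions: parentheses denote a group. Capturing groups are numbered by counting their opening parentheses from left to right (when parentheses are nested, the outer group therefore gets the smaller number), and group 0 stands for the entire expression.

Consider this example: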

(A(BC))
group 0: (A(BC))
group 1: (A(BC))
group 2: (BC)

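To number the groups of an expression, start from the left and add 1 to the group number at each opening parenthesis. Groups can be used to capture the parts of the input string that matched the corresponding group of the pattern.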

def str = "hello world hello"
def reg = /((el)(l))o/
def matcher = str =~ reg
def num = 0
// each call to find() advances to the next match in the input
while (matcher.find()) {
    println "the ${num} match subsequence"
    num++
    // number of capturing groups in the pattern, not counting group 0
    def groupnum = matcher.groupCount()
    println "group count ${matcher.groupCount()}"
    println "group string ${matcher.group()}"
    println "group 0 string ${matcher.group(0)}"
    for (id in 1..groupnum) {
        println "group ${id} string ${matcher.group(id)}"
        println "start index is ${matcher.start(id)} and end index is ${matcher.end(id)}"
    }
}

Output:

the 0 match subsequence
group count 3
group string ello
group 0 string ello
group 1 string ell
start index is 1 and end index is 4
group 2 string el
start index is 1 and end index is 3
group 3 string l
start index is 3 and end index is 4
the 1 match subsequence
group count 3
group string ello
group 0 string ello
group 1 string ell
start index is 13 and end index is 16
group 2 string el
start index is 13 and end index is 15
group 3 string l
start index is 15 and end index is 16

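One important point is that groupCount does not include group 0, so the number of groups equals the number of pairs of capturing parentheses. Also, matcher.group() and matcher.group(0) return the same thing: the complete subsequence that matched the pattern. When an argument is passed, the method returns the subsequence captured by the group with that number. The start and end methods return the offsets of the matched string (or of the string matched by a capturing group); the end offset is always the position of the last character plus 1.

Note that in Groovy, because the GDK implements getAt for Matcher, the contents of the capturing groups can also be accessed by index, as in this example: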
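The numbering rule for nested groups can be checked directly against the (A(BC)) pattern discussed earlier. This is a small sketch; the input "ABC" and the variable names are assumptions for illustration.

```groovy
// Apply the nested pattern (A(BC)) to a hypothetical input.
def m = "ABC" =~ /(A(BC))/
assert m.find()

// Two pairs of parentheses, so two capturing groups (group 0 is not counted).
def count = m.groupCount()

// Collect group 0 (the whole match) and every numbered group.
def groups = (0..count).collect { m.group(it) }

println count
println groups
```

Group 1 is opened by the first parenthesis and so spans the whole of "ABC", while group 2, opened second, captures only "BC".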

def reg = ~/h(el)(lo)/
def str = 'hello world hello nihao'
def matcher = str=~reg
println "first matched substring"
println matcher[0]
println matcher[0][0]
println matcher[0][1]
println matcher[0][2]
println "second matched substring"
println matcher[1]
println matcher[1][0]
println matcher[1][1]
println matcher[1][2]

Output:

first matched substring
[hello, el, lo]
hello
el
lo
second matched substring
[hello, el, lo]
hello
el
lo

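The output shows how indexed access works. The first index on the matcher selects the match; the value returned is an ArrayList containing the whole matched substring followed by the substring for each capturing group number. The second index is equivalent to calling matcher.group(index) for that match.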
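Because each indexed match is a list, it can be unpacked with Groovy's multiple assignment. The sketch below also shows the case of a pattern without capturing groups, where indexing yields a plain String instead of a list. The variable names are assumptions for this illustration.

```groovy
def text = 'hello world hello nihao'

// With capturing groups, matcher[0] is a list: [whole match, group 1, group 2].
def matcher = text =~ /h(el)(lo)/
def (whole, first, second) = matcher[0]
println "${whole} ${first} ${second}"

// Without capturing groups, matcher[0] is just the matched String.
def plain = text =~ /el/
def firstPlain = plain[0]
println firstPlain
```

Keeping this difference in mind avoids accidentally indexing into the characters of a String when the pattern has no groups.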

Resetting a Matcher

Resetting a matcher involves two methods, find() and reset(). The find method can take an index from which to start searching for matches again. For example:

def reg = /el/
def str = "hello world hello"
def matcher = str=~reg
while(matcher.find()){
    println matcher.group()
}
matcher.find(0) // reset the matcher and search again from the start
// but this call already consumes the first match, so the next find() looks for the following one
println "reset the matcher"
while(matcher.find()){
    println matcher.group()
}

Output:

el
el
reset the matcher
el

It is better, of course, to use reset for this:

def reg = /el/
def str = "hello world hello"
def matcher = str=~reg
while(matcher.find()){
    println matcher.group()
}
matcher.reset()
println "reset the matcher"
while(matcher.find()){
    println matcher.group()
}

Output: