Advanced Regular Expression Techniques (Python Edition)

2019-11-14 17:27:46


Contributed by: a reader

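Regular expressions are a Swiss Army knife for searching information for specific patterns. They form a huge toolbox, and some of their features are often overlooked or underused. Today I will show you some advanced uses of regular expressions.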

For example, this is a regular expression we might use to detect US phone numbers:

r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and whitespace to make it more readable.

(r'^'
 r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
 r'(\()?'        # optional opening parenthesis
 r'\d{3}'        # the area code
 r'(?(2)\))'     # if there was an opening parenthesis, close it
 r'[-\s.]?'      # followed by '-', '.' or a space
 r'\d{3}'        # first 3 digits
 r'[-\s.]?'      # followed by '-', '.' or a space
 r'\d{4}$')      # last 4 digits

Let's put it into a code snippet:

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    match = re.match(r'^'
                     r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                     r'(\()?'        # optional opening parenthesis
                     r'\d{3}'        # the area code
                     r'(?(2)\))'     # if there was an opening parenthesis, close it
                     r'[-\s.]?'      # followed by '-', '.' or a space
                     r'\d{3}'        # first 3 digits
                     r'[-\s.]?'      # followed by '-', '.' or a space
                     r'\d{4}$',      # last 4 digits
                     number)
    if match:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Output:

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
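The commented layout above can also be expressed with the `re.VERBOSE` flag, which tells the engine to ignore whitespace and `#` comments inside the pattern itself. The following is a minimal sketch of that idea; the names `PHONE` and `is_valid` are hypothetical helpers introduced for illustration and are not part of the original article.

```python
import re

# Same phone-number pattern, written as a single verbose pattern.
# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
PHONE = re.compile(r"""
    ^
    (1[-\s.])?      # optional '1-', '1.' or '1 '
    (\()?           # optional opening parenthesis
    \d{3}           # the area code
    (?(2)\))        # if there was an opening parenthesis, close it
    [-\s.]?         # followed by '-', '.' or a space
    \d{3}           # first 3 digits
    [-\s.]?         # followed by '-', '.' or a space
    \d{4}           # last 4 digits
    $
""", re.VERBOSE)


def is_valid(number):
    """Hypothetical helper: True if number matches the phone pattern."""
    return PHONE.match(number) is not None


for number in ["123 555 6789", "(123-555-6789"]:
    print(number, "is valid" if is_valid(number) else "is not valid")
```

Compared with concatenating several raw strings, the verbose form keeps the pattern in one literal, at the cost of having to escape any literal space or `#` that the pattern needs to match.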
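Regular expressions are a great feature of Python, but debugging them is hard, and it is very easy to get them wrong.
Fortunately, Python can print the parse tree of a regular expression if you pass the re.DEBUG flag (which is simply the integer 128) to re.compile or re.match.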
import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    match = re.match(r'^'
                     r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                     r'(\()?'        # optional opening parenthesis
                     r'\d{3}'        # the area code
                     r'(?(2)\))'     # if there was an opening parenthesis, close it
                     r'[-\s.]?'      # followed by '-', '.' or a space
                     r'\d{3}'        # first 3 digits
                     r'[-\s.]?'      # followed by '-', '.' or a space
                     r'\d{4}$',      # last 4 digits
                     number, re.DEBUG)
    if match:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
Parse tree (the exact output format depends on the Python version; the "max_repeat 0 2147483648" nodes over category_space correspond to \s*, so this tree appears to come from a variant of the pattern that also allows whitespace between the parts):

at at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space
123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
Greedy and non-greedy

Before I explain this concept, I want to show an example. We want to find anchor tags in a piece of HTML text:
import re

html = 'Hello <a  title="pypix">Pypix</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
　　結(jié)果將在意料之中：
['<a  title="pypix">Pypix</a>']
Let's change the input and add a second anchor tag:
import re

html = 'Hello <a  title="pypix">Pypix</a>' \
       'Hello <a  title"example">Example</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
　　結(jié)果看起來再次對了。但是不要上當(dāng)了！如果我們在同一行遇到兩個錨標(biāo)簽后，它將不再正確工作：
['<a  title="pypix">Pypix</a>Hello <a  title"example">Example</a>']
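This time the pattern matched the first opening tag, the last closing tag, and everything in between, producing one match instead of two separate ones. That is because the default matching mode is "greedy".
In greedy mode, quantifiers such as * and + match as many characters as possible.
When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.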
import re

html = 'Hello <a  title="pypix">Pypix</a>' \
       'Hello <a  title"example">Example</a>'

m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)
　　現(xiàn)在結(jié)果是正確的。
['<a  title="pypix">Pypix</a>', '<a  title"example">Example</a>']
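The same greedy versus non-greedy distinction applies to the other quantifiers as well. As a small sketch on made-up sample text (the `text` string and variable names are assumptions for illustration), the following compares a greedy `+`, its lazy form `+?`, and a negated character class, which is a common alternative that avoids backtracking past the delimiter:

```python
import re

# Hypothetical sample text with two quoted words on one line.
text = 'say "hello" and "goodbye"'

# Greedy: + grabs as much as possible, so it spans both quoted words.
greedy = re.findall(r'".+"', text)

# Non-greedy: +? stops at the first closing quote it can.
lazy = re.findall(r'".+?"', text)

# Alternative: forbid the delimiter inside the match instead of using a lazy quantifier.
negated = re.findall(r'"[^"]+"', text)

print(greedy)   # ['"hello" and "goodbye"']
print(lazy)     # ['"hello"', '"goodbye"']
print(negated)  # ['"hello"', '"goodbye"']
```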
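Lookahead and lookbehind assertions

A lookahead assertion checks what follows the current match position without consuming it. An example explains this best.
The following pattern first matches foo, then checks whether bar comes right after it: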
import re

strings = ["hello foo",      # prints False
           "hello foobar"]   # prints True

for string in strings:
    match = re.search(r'foo(?=bar)', string)
    if match:
        print('True')
    else:
        print('False')
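This may not seem very useful, since we could simply check for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.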
import re

strings = ["hello foo",      # prints True
           "hello foobar",   # prints False
           "hello foobaz"]   # prints True

for string in strings:
    match = re.search(r'foo(?!bar)', string)
    if match:
        print('True')
    else:
        print('False')
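Lookbehind assertions are similar, but they look at what precedes the current match position. Use (?<= for a positive assertion and (?<! for a negative one.
The following pattern matches a bar that does not come right after foo.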
import re

strings = ["hello bar",      # prints True
           "hello foobar",   # prints False
           "hello bazbar"]   # prints True

for string in strings:
    match = re.search(r'(?<!foo)bar', string)
    if match:
        print('True')
    else:
        print('False')
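The examples so far cover positive lookahead, negative lookahead and negative lookbehind. For completeness, here is a small sketch of a positive lookbehind, `(?<=...)`, on made-up sample text (the `text` string is an assumption for illustration). Note that Python's `re` module requires the lookbehind pattern to have a fixed width.

```python
import re

# Hypothetical sample text: only numbers directly preceded by "$" should match.
text = "Total: $500 plus $10 shipping, 5 days"

# (?<=\$) asserts that a dollar sign comes right before the digits,
# without making the dollar sign part of the match.
amounts = re.findall(r'(?<=\$)\d+', text)

print(amounts)  # ['500', '10']
```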
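Conditional (if-then-else) patterns

Regular expressions provide conditional tests. In Python's re module the format is:
(?(id/name)then|else)
The condition is a group number or name that refers to a previously captured group. The then branch is used if that group matched; otherwise the optional else branch is used.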
For example, we can use this regular expression to check for matching opening and closing angle brackets:
import re

strings = ["<pypix>",   # prints True
           "<foo",      # prints False
           "bar>",      # prints False
           "hello"]     # prints True

for string in strings:
    match = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if match:
        print('True')
    else:
        print('False')
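In the example above, 1 refers to the group (<), which may also be empty because it is followed by a question mark. The closing angle bracket is matched if and only if the condition holds.
In some other regex flavors, such as PCRE, the condition can also be a lookaround assertion, for example (?(?=regex)then|else), but Python's re module does not support that form.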
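The conditional syntax also accepts an else branch. The sketch below uses a deliberately simple e-mail pattern of the kind shown in the Python `re` documentation: if the opening `<` was captured, a closing `>` is required, otherwise the string must end right there. The names `EMAIL` and `check` are hypothetical helpers, and the pattern is far too loose to validate real e-mail addresses.

```python
import re

# Group 1 captures an optional "<".
# (?(1)>|$) means: if group 1 matched, require ">", otherwise require end of string.
EMAIL = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')


def check(text):
    """Hypothetical helper: True if text matches from its start."""
    return EMAIL.match(text) is not None


for candidate in ["<user@host.com>", "user@host.com",
                  "<user@host.com", "user@host.com>"]:
    print(candidate, check(candidate))
```

The first two candidates match and the last two do not, because their brackets are unbalanced.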
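Non-capturing groups

Groups, enclosed in parentheses, are captured into an array-like list of groups that can be referenced later. But we can also choose not to capture them.
Let's look at a very simple example first: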
import re

string = 'Hello foobar'
pattern = re.search(r'(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
　　現(xiàn)在我們改動一點點，在前面加上另外一個分組 (H.*)：
import re

string = 'Hello foobar'
pattern = re.search(r'(H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello (plus a trailing space)
print("b* => {0}".format(pattern.group(2)))  # prints b* => foo
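The group list has changed, and depending on how we use those values in our code, this may break our script. We would now have to find every place in the code where the groups are used and adjust the indexes accordingly. If we are really not interested in the contents of the newly added group, we can make it non-capturing, like this: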
import re

string = 'Hello foobar'
pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
By adding ?: at the start of the group, it is no longer captured in the group list, so the other values do not have to shift.
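Non-capturing groups are especially handy when parentheses are needed only to scope an alternation. As a small sketch on a made-up URL (the `url` string and variable names are assumptions for illustration), compare what `groups()` returns in each case:

```python
import re

# Hypothetical sample text.
url = 'Download from ftp://example.org/file.txt'

# Capturing: the scheme alternation occupies group 1.
capturing = re.search(r'(https?|ftp)://(\S+)', url)

# Non-capturing: the alternation is only grouped, so the path is group 1.
non_capturing = re.search(r'(?:https?|ftp)://(\S+)', url)

print(capturing.groups())      # ('ftp', 'example.org/file.txt')
print(non_capturing.groups())  # ('example.org/file.txt',)
```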
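Named groups

Like the previous example, this is another way to keep us out of that trap. We can give groups names and then refer to them by name instead of by index. The format is (?P<name>pattern). We can rewrite the previous example like this: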
import re

string = 'Hello foobar'
pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
　　現(xiàn)在我們可以添加另外一個分組了，而不會影響模式數(shù)組里其他的已存在的組：
import re

string = 'Hello foobar'
pattern = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
print("h* => {0}".format(pattern.group('hi')))     # prints h* => Hello (plus a trailing space)
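Named groups can be used in a few more places than `group('name')`. The sketch below, on made-up sample strings (the `DATE` pattern and all variable names are assumptions for illustration), shows `groupdict()`, the `\g<name>` reference in a replacement string, and the `(?P=name)` backreference inside a pattern:

```python
import re

# Hypothetical date pattern with three named groups.
DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() returns all named groups as a dictionary.
m = DATE.search('Posted on 2019-11-14')
print(m.groupdict())  # {'year': '2019', 'month': '11', 'day': '14'}

# \g<name> refers to a named group inside a replacement string.
reordered = DATE.sub(r'\g<day>/\g<month>/\g<year>', 'Posted on 2019-11-14')
print(reordered)  # Posted on 14/11/2019

# (?P=name) refers back to a named group inside the pattern itself,
# here used to find an accidentally doubled word.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'this is is a test')
print(doubled.group('word'))  # is
```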
　　使用回調(diào)函數(shù)
　　在 Python 中 re.sub() 可以用來給正則表達式替換添加回調(diào)函數(shù)。　　讓我們來看看這個例子，這是一個 e-mail 模板：
import re

template = "Hello [first_name] [last_name], \
 Thank you for purchasing [product_name] from [store_name]. \
 The total cost of your purchase was [product_price] plus [ship_price] for shipping. \
 You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. \
 Sincerely, \
 [store_manager_name]"

# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

# Naive approach: replace one placeholder at a time.
# The non-greedy .*? keeps the match inside a single pair of brackets.
result = re.compile(r'\[(.*?)\]')
print(result.sub('John', template, count=1))
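Notice that all the placeholders have one thing in common: they are enclosed in a pair of square brackets. We can capture them with a single regular expression and let a callback function handle each specific replacement.
So using a callback function is the better approach: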
import re

template = "Hello [first_name] [last_name], \
 Thank you for purchasing [product_name] from [store_name]. \
 The total cost of your purchase was [product_price] plus [ship_price] for shipping. \
 You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. \
 Sincerely, \
 [store_manager_name]"

# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

def multiple_replace(dic, text):
    # Build one alternation that matches every "[key]" placeholder literally.
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # The callback strips the brackets and looks up the replacement value.
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))
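A variation on the same idea uses one generic placeholder pattern with a capturing group instead of building an alternation from the dictionary keys. This is only a sketch under stated assumptions: `PLACEHOLDER`, `fill_template` and the sample `values` are hypothetical names, and leaving unknown placeholders untouched is a design choice made here, not something the original code does.

```python
import re

# One generic pattern: a word-like key between square brackets.
PLACEHOLDER = re.compile(r'\[(\w+)\]')


def fill_template(template, values):
    """Replace each [key] with values[key]; unknown keys are left as they are."""
    def replace(match):
        key = match.group(1)
        # Fall back to the original text when the key is not in the dictionary.
        return values.get(key, match.group(0))
    return PLACEHOLDER.sub(replace, template)


values = {"first_name": "John", "last_name": "Doe"}
filled = fill_template("Hello [first_name] [last_name], order [order_id]", values)
print(filled)  # Hello John Doe, order [order_id]
```

Because the pattern no longer depends on the dictionary, it can be compiled once and reused with different sets of values.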
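Don't reinvent the wheel

Perhaps even more important is knowing when not to use regular expressions. In many cases you can find an alternative tool.

Parsing [X]HTML

An answer on Stack Overflow explains brilliantly why you should not parse [X]HTML with regular expressions.
You should use an HTML parser instead, and Python has plenty of choices:
ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a full-featured, fast, C-based library
The latter two handle even malformed HTML gracefully, which is a blessing given how many ugly sites are out there.
An ElementTree example: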
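from xml.etree import ElementTree

# ElementTree.parse expects well-formed markup (e.g. XHTML).
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('h1') would only look at direct children of the root.
for element in tree.iter('h1'):
    print(ElementTree.tostring(element, encoding='unicode'))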
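To connect this back to the earlier anchor-tag example, here is a minimal sketch that collects anchors with the standard library's `html.parser.HTMLParser` instead of a regular expression. The class name `AnchorCollector` and the sample HTML are assumptions for illustration; the sketch does not handle nested anchors or other edge cases.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper that records the attributes and text of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._current = None  # the anchor being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        # Only collect text while we are inside an anchor.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.anchors.append(self._current)
            self._current = None


# Hypothetical input: two anchors on the same line.
html = ('Hello <a title="pypix">Pypix</a>'
        'Hello <a title="example">Example</a>')

parser = AnchorCollector()
parser.feed(html)
parser.close()

for anchor in parser.anchors:
    print(anchor["attrs"].get("title"), anchor["text"])
```

Unlike the greedy pattern shown earlier, the parser reports the two anchors separately without any special care about quantifiers.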
Other

There are many other tools worth considering before you reach for regular expressions.
Thanks for reading!