2016-05-27

Regular Expressions in Python

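We have already worked out how to fetch a page's content, but one step remains: the text we want is buried in a jumble of markup, so how do we extract and organize it? This is where a very powerful tool comes in: regular expressions.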

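What is a regular expression?

A regular expression is a formula for operating on strings: a pattern built from predefined special characters and combinations of them. A single pattern string describes, and matches, a whole family of strings that follow a given syntactic rule.

Put simply, the rules for searching, matching, and processing strings are packed into a "rule string", and that rule string expresses the filtering logic to apply to other strings.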

Why use regular expressions?

Because we want to pull just the parts we care about out of the page content that comes back.

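How does a regular expression match?

1. Characters are taken from the pattern and from the text in turn and compared.
2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
3. If the pattern contains quantifiers or boundaries, the process differs slightly.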
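
The steps above can be sketched with literal patterns only. The sample strings are made up for illustration.

```python
import re

# Every character of the pattern matches in order, so the match succeeds.
ok = re.match(r'abc', 'abcdef')

# The third character differs ('c' vs 'x'), so the match fails and None is returned.
fail = re.match(r'abc', 'abxdef')

# A quantifier changes the process: 'b+' may consume several characters.
many = re.match(r'ab+c', 'abbbc')

print(ok.group())    # abc
print(fail)          # None
print(many.group())  # abbbc
```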

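Basic regular expression syntax

.      matches any character except a newline
^      matches the start of the string
$      matches the end of the string
[]     matches any one character from the specified character class
?      repeats the preceding character 0 or 1 times
*      repeats the preceding character 0 or more times
{m}    repeats the preceding character exactly m times
{m,n}  repeats the preceding character m to n times
\d     matches a digit; equivalent to [0-9]
\D     matches any non-digit character; equivalent to [^0-9]
\s     matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S     matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w     matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]
\W     matches any character that is not alphanumeric or underscore; equivalent to [^a-zA-Z0-9_]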
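
A short sketch that exercises several entries from the table. The sample strings and variable names are invented for illustration only.

```python
import re

text = 'order 42 shipped on 2016-05-27'

digits = re.findall(r'\d+', text)                 # runs of digits
date = re.findall(r'\d{4}-\d{2}-\d{2}', text)     # {m} repeats exactly m times
words = re.findall(r'\w+', 'a_b c-d')             # \w includes the underscore
vowels = re.findall(r'[aeiou]', 'regex')          # a character class
optional = re.findall(r'colou?r', 'color colour') # ? makes the 'u' optional
no_newline = re.findall(r'n.', 'on\nno')          # . does not match '\n'
at_start = re.search(r'^order', text) is not None # ^ anchors at the start
at_end = re.search(r'27$', text) is not None      # $ anchors at the end

print(digits)      # ['42', '2016', '05', '27']
print(date)        # ['2016-05-27']
print(words)       # ['a_b', 'c', 'd']
print(vowels)      # ['e', 'e']
print(optional)    # ['color', 'colour']
print(no_newline)  # ['no']
```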

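Points worth remembering

\b   matches the start or end of a word
\d+  matches one or more consecutive digits
\d*  matches digits repeated any number of times (possibly zero)
^    matches the start of the string being searched
$    matches the end
^\d{5,12}$  matches a string of 5 to 12 digits
^\s{5,12}$  matches a string of 5 to 12 whitespace characters
^\w{5,12}$  matches a string of 5 to 12 letters, digits, or underscores (Chinese characters also count when matching Unicode strings)
\W \D \B \S  mean exactly the opposite of \w \d \b \s
[^x]    matches any character except x
[^aed]  matches any character except a, e, and d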
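
A few of these points can be checked directly. This is a minimal sketch in Python 3; the helper name `is_5_to_12_digits` and the sample strings are hypothetical.

```python
import re

def is_5_to_12_digits(s):
    """Return True if s consists of 5 to 12 digits (hypothetical helper)."""
    return re.match(r'^\d{5,12}$', s) is not None

# \b matches at word boundaries, so 'concat' is not counted.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# [^aed] matches any single character other than a, e, or d.
not_aed = re.findall(r'[^aed]', 'decade')

# With Python 3 str patterns, \w also matches Chinese characters.
unicode_words = re.findall(r'\w+', '汉字 abc_1')

print(is_5_to_12_digits('12345'))  # True
print(is_5_to_12_digits('1234'))   # False
print(whole_words)                 # ['cat', 'cat']
print(not_aed)                     # ['c']
print(unicode_words)               # ['汉字', 'abc_1']
```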

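Some further usage: flags

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, here and below)
re.M (re.MULTILINE): multi-line mode; '^' and '$' also match at the start and end of each line
re.S (re.DOTALL): dot-matches-all mode; '.' also matches a newline
re.L (re.LOCALE): makes the predefined classes \w \W \b \B \s \S depend on the current locale (bytes patterns only in Python 3)
re.U (re.UNICODE): makes the predefined classes \w \W \b \B \s \S \d \D follow Unicode character properties (already the default for str patterns in Python 3)
re.X (re.VERBOSE): verbose mode; the pattern may span several lines, whitespace in it is ignored, and comments may be added.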
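
The flags can be tried out as follows. This is a sketch with made-up sample strings; the decimal-number pattern is an illustrative choice, written once compactly and once in verbose mode to show that the two forms behave the same.

```python
import re

# re.I: case-insensitive matching.
ci = re.findall(r'python', 'Python PYTHON python', re.I)

# re.M: ^ matches at the start of every line, not just the string.
single = re.findall(r'^\w+', 'one two\nthree four')
multi = re.findall(r'^\w+', 'one two\nthree four', re.M)

# re.S: '.' also matches a newline.
dot_default = re.search(r'a.b', 'a\nb')
dot_all = re.search(r'a.b', 'a\nb', re.S)

# re.X: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d*   # fractional digits
""", re.X)
compact = re.compile(r'\d+\.\d*')

sample = 'pi is 3.14, e is 2.718'
print(ci)                       # ['Python', 'PYTHON', 'python']
print(single, multi)            # ['one'] ['one', 'three']
print(dot_default is None)      # True
print(verbose.findall(sample))  # ['3.14', '2.718']
```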

To match a literal . * or \, write \. \* \\
*      0 or more times
+      1 or more times
?      0 or 1 times
{n}    exactly n times
{n,}   n or more times
{n,m}  n to m times

\d --> [0-9]
\w --> [a-zA-Z0-9_]
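
A small sketch of escaping and of the quantifiers listed above, using invented sample strings. `re.escape` is the standard-library helper that escapes special characters for you.

```python
import re

# Escaped metacharacters match themselves literally.
dots = re.findall(r'\.', 'a.b.c')
backslashes = re.findall(r'\\', r'a\b')

# re.escape builds a pattern that matches the given text literally.
literal = re.search(re.escape('1.2*3'), 'x1.2*3y')

# Quantifiers are greedy: each match takes as many characters as allowed.
exactly_two = re.findall(r'a{2}', 'aaaaa')
two_or_more = re.findall(r'a{2,}', 'a aa aaa')
one_or_two = re.findall(r'a{1,2}', 'aaa')

print(dots)             # ['.', '.']
print(literal.group())  # 1.2*3
print(exactly_two)      # ['aa', 'aa']
print(two_or_more)      # ['aa', 'aaa']
print(one_or_two)       # ['aa', 'a']
```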

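A small example

\(?0\d{2}\)?[- ]?\d{8}|0\d{2}[- ]?\d{8} matches a phone number with a 3-digit area code.
The area code may or may not be wrapped in parentheses, and the area code and the local number may be separated by a hyphen or a space,
or by nothing at all. Because each parenthesis is optional on its own, the pattern also accepts unbalanced input such as (010 12345678. As an exercise, try using alternation to extend it to support 4-digit area codes as well.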
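
One possible way to try the pattern, and one possible answer to the exercise. The extended pattern assumes that a 4-digit area code is followed by a 7- or 8-digit local number; that assumption is not stated in the text, and the sample numbers are invented.

```python
import re

# The pattern from the text, used with fullmatch so the whole string must fit.
original = re.compile(r'\(?0\d{2}\)?[- ]?\d{8}|0\d{2}[- ]?\d{8}')

# Sketch of an extension: a second branch for 4-digit area codes
# (assumed here to be followed by 7 or 8 digits).
extended = re.compile(r'\(?0\d{2}\)?[- ]?\d{8}|\(?0\d{3}\)?[- ]?\d{7,8}')

for number in ['(010)88886666', '022-22334455', '02912345678', '0755-1234567']:
    print(number,
          original.fullmatch(number) is not None,
          extended.fullmatch(number) is not None)
```

Only the last number is rejected by the original pattern and accepted by the extended one.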

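Common re functions

import re
print(re.match('www', 'www.baidu.com').span())   # matches at the start of the string
print(re.search('com', 'www.baidu.com').span())  # matches away from the start

re.match only matches at the beginning of the string; if the beginning does not fit the pattern, the match fails and the function returns None. re.search scans the whole string until it finds the first match.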
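
Because a failed match returns None, calling .span() or .group() on the result directly can raise an AttributeError. A minimal sketch of checking the result first, reusing the URL from the text:

```python
import re

url = 'www.baidu.com'

at_start = re.match(r'com', url)   # 'com' is not at the start, so this is None
anywhere = re.search(r'com', url)  # search finds it later in the string

for name, m in (('match', at_start), ('search', anywhere)):
    if m is None:
        print(name, 'found nothing')
    else:
        print(name, 'found', m.group(), 'at', m.span())
```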

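Searching with regular expressions in Python

The re module provides several functions for searching an input string. The ones discussed here are:

re.match()
re.search()
re.findall()

match matches at the start of the string
search matches anywhere in the string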
>>> match = re.search(r'dog', 'dog cat dog')
>>> match.group(0)
'dog'

The search function I use most in Python is findall():

>>> re.findall(r'dog', 'dog cat dog')
['dog', 'dog']
>>> re.findall(r'cat', 'dog cat dog')
['cat']
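
Two related behaviours are worth a quick sketch: when the pattern contains groups, findall returns the groups rather than the whole match, and re.finditer yields match objects when positions are needed. The key=value sample string is invented.

```python
import re

# With two groups in the pattern, findall returns a list of tuples.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')

# finditer yields match objects, so positions are available too.
spans = [m.span() for m in re.finditer(r'dog', 'dog cat dog')]

print(pairs)  # [('a', '1'), ('b', '22')]
print(spans)  # [(0, 3), (8, 11)]
```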


>>> contactInfo = 'Doe, John: 555-1212'
>>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
>>> match.group(0)
'Doe, John: 555-1212'
>>> match.group(1)
'Doe'
>>> match.group(2)
'John'
>>> match.group(3)
'555-1212'
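
The same contact-info match can also be written with named groups, which some readers find easier to follow than numeric indexes. The group names `last`, `first`, and `phone` are chosen here for illustration.

```python
import re

contact_info = 'Doe, John: 555-1212'

# (?P<name>...) gives each group a name in addition to its number.
m = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contact_info)

print(m.group('last'))   # Doe
print(m.group('phone'))  # 555-1212
print(m.groupdict())     # all named groups as a dict
```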

Non-ASCII patterns work the same way; here rp stands for the page text being searched:

re.search(u'四川汶川', rp, re.I)

Worked examples

Example 1

import re
pattern = re.compile('hello')
match = pattern.match('hello world')
print(match.group())

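Example 2

import re
match = re.findall('hello', 'hello world')
print(match)

The re module provides regular expression support:
the string form of a pattern is compiled into a Pattern instance;
the Pattern instance is then used to process text and obtain match results.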
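
A brief sketch of the compile-then-use workflow described above: one compiled Pattern instance is reused for several operations. The sample strings are made up.

```python
import re

# Compile once, then reuse the Pattern instance.
pattern = re.compile(r'hello')

at_start = pattern.match('hello world')              # anchored at the start
later = pattern.search('say hello')                  # found anywhere
count = len(pattern.findall('hello hello hello'))    # all occurrences

print(at_start.group())  # hello
print(later.span())      # (4, 9)
print(count)             # 3
```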

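Example 3

import re
word = 'http://www.baidu.com python_1.2'
key = re.findall('h.', word)
print(key)
An unescaped . matches any single character, so this prints ['ht', 'ho']

Example 4

import re
word = 'http://www.baidu.com python_1.2'
key = re.findall(r'\.', word)
print(key)
An escaped \. matches only a literal dot, so this prints ['.', '.', '.']

Example 5

import re
word = 'http://www.baidu.com python_1.2'
key = re.findall(r'\d\.\d', word)
print(key)
Matches a digit, a literal dot, and another digit, so this prints ['1.2']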

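Example 6

import re
word = 'httphttp://www.baidu.com python_1.2'
key = re.findall('http*', word)
print(key)
Finds every http and prints ['http', 'http']. Note that * applies only to the preceding p, so the pattern means "htt followed by zero or more p".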
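
To make the scope of the quantifier concrete, here is a small sketch comparing `http*` with a grouped version. The sample strings are invented.

```python
import re

# * binds to the single preceding character, here the final 'p'.
star_on_p = re.findall(r'http*', 'htt http httpp')

# To repeat the whole word, wrap it in a (non-capturing) group.
repeated_word = re.findall(r'(?:http)+', 'httphttp://')

# A plain literal finds each occurrence separately.
each_word = re.findall(r'http', 'httphttp://')

print(star_on_p)      # ['htt', 'http', 'httpp']
print(repeated_word)  # ['httphttp']
print(each_word)      # ['http', 'http']
```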

Example 7

import re
word = 'httphttp://www.baidu.com python_1.2'
key = re.findall('t{2}', word)
print(key)
Matches every run of exactly two consecutive t characters, so this prints ['tt', 'tt']

Example 8

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
html = '''
    <div class="one"><div class="aaa" title="白帽子" onclick=......
    '''
# The non-greedy group (.*?) captures only the text inside title="..."
title = re.findall(r'<div class="aaa" title="(.*?)" onclick', html)

for i in title:
    print(i)

Extracts the content of the title attribute
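
Why the non-greedy `(.*?)` matters shows up as soon as a line holds more than one tag. A sketch with a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two title attributes on one line.
html = '<div title="a" onclick="x"></div><div title="b" onclick="y"></div>'

# Greedy: .* runs to the LAST '" onclick', swallowing everything in between.
greedy = re.findall(r'title="(.*)" onclick', html)

# Non-greedy: .*? stops at the FIRST '" onclick' after each title.
lazy = re.findall(r'title="(.*?)" onclick', html)

print(len(greedy))  # 1
print(lazy)         # ['a', 'b']
```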

Some code exercising the syntax above

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

pattern = re.compile(r'hello')
match = pattern.match('hello world')
if match:
    print(match.group())

m = re.match(r'aaa', 'aaaaaaa efe')
print(m.group())


import re
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

# Attributes of the match object
print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)

# Methods of the match object
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))


p = re.compile(r'=')  # split on a special character
a = p.split('cookie=fwefwvb,password=fwefwefwefw')
print(a)


p = re.compile(r'\d+')  # split on runs of digits
a = p.split('dwedw1fwefwvb2fwefwe4fwefw')
print(a)
print(a[3])
b = p.findall('dwedw1fwefwvb2fwefwe4fwefw')  # the digits themselves
print(b)
print(b[1])
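
Two options of re.split that build on the examples above, sketched with invented strings: a capturing group keeps the separators in the result, and maxsplit limits how many splits are made.

```python
import re

# A capturing group in the pattern keeps the separators in the output.
pieces = re.split(r'(\d+)', 'ab1cd22ef')

# maxsplit=1 splits only at the first separator.
first_only = re.split(r'=', 'cookie=a=b', maxsplit=1)

# A character class with + treats any run of separators as one.
fields = re.split(r'[\s,;]+', 'a,b;; c')

print(pieces)      # ['ab', '1', 'cd', '22', 'ef']
print(first_only)  # ['cookie', 'a=b']
print(fields)      # ['a', 'b', 'c']
```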


#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

#!/usr/bin/python
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)
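
re.sub can do more than delete text. A short sketch, with invented inputs, of backreferences in the replacement, a function as the replacement, and re.subn, which also reports how many substitutions were made:

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2016-05-27')

# The replacement may be a function that receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1b22')

# subn returns the new string together with the substitution count.
cleaned, count = re.subn(r'\D', '', '2004-959-559')

print(reordered)       # 27/05/2016
print(doubled)         # a2b44
print(cleaned, count)  # 2004959559 2
```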

'''
# Open the output file once, so earlier results are not overwritten
with open('module/test.txt', 'r') as f, open('module/test2.txt', 'w') as p:
    for line in f:
        payload = line.strip()
        if len(payload) == 0:
            break  # stop at the first blank line
        re_telephone = re.match(r'^(\d{3})-(\d{3,20})$', payload)
        if re_telephone:  # skip lines that do not look like a phone number
            print(re_telephone.group(2))
            p.write(re_telephone.group(2) + '\n')
'''
'''
test = '010-12345'
if re.match(r'\d{3}-\d{3,8}$', test):
    print('ok')
else:
    print('fail')

a = re.split(r'[\s\,]+', 'a,b,ccc   dd')
print(a)
a = re.split(r'[\s\,\;]+', 'a,b,ccc;;;   dd')
print(a)
reg = 'Cookie=aaaaa;falg=ddddd'
a = re.split(r'[\s\,\;]+', reg)
print(a[1])
a[1] = re.split(r'[\s\,\=]', a[1])
print(a[1])
print(a[1][1])

reg = 'Cookie=aaaaa;falg=ddddd'
a = re.split(r'[\s\,\;,\=]+', reg)
print(a[3])

print(re.match(r'^(\d+)(0*)$', '12300').groups())  # greedy

aa = re.match(r'^(\d+?)(0*)$', '12300').groups()  # non-greedy
print(aa)
print(aa[1])
'''
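
The greedy versus non-greedy pair and the cookie string from the disabled block above can be run as a small live sketch. Building a dict from findall is one possible approach, not the only one.

```python
import re

# Greedy \d+ takes every digit, leaving nothing for (0*).
greedy = re.match(r'^(\d+)(0*)$', '12300').groups()

# Non-greedy \d+? takes as few digits as possible, so (0*) gets the zeros.
lazy = re.match(r'^(\d+?)(0*)$', '12300').groups()

# Pair up key=value items instead of indexing into a split list.
cookies = dict(re.findall(r'(\w+)=(\w+)', 'Cookie=aaaaa;falg=ddddd'))

print(greedy)           # ('12300', '')
print(lazy)             # ('123', '00')
print(cookies['falg'])  # ddddd
```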
