Regular Expressions
Introduction
Regular expression was initially a term borrowed from automata theory in theoretical computer science. Broadly, it refers to patterns against which substrings are to be matched.
The comic should have already given you an idea of what regular expressions could be useful for. It should not be surprising that many programming languages, text processing tools, data validation tools and search engines make extensive use of them.
The key idea is that a regular expression is a pattern which matches a set of target strings.
\w+@\w+\.(com|org|net|in)
is a regex that matches email addresses ending in .com, .net, .org or .in.
Concepts
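Regex syntax varies from language to language, so there are several flavors of it. Here we will look at Perl regex, since most other flavors are variations on it.
Before we dive into the syntax, these are the things a pattern is made of: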
Literals: They are the simplest things to match. When they appear in a pattern, we just match them as they are. It could be something like a or a1.
Metacharacters: They do not mean what they look like. They usually refer to something else. For example, \d refers to any digit.
Vertical Bar: The | is the symbol for boolean OR. It gives the option to match any one of the things it delimits.
Quantifiers: They specify how many times the pattern concerned needs to be matched.
Grouping and Capturing: Parentheses can be used to group parts of a regex, or to capture parts for later use.
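The five building blocks above can be tried out in a few lines of Perl. This is only a minimal sketch: the sample strings and the names @samples, %concept and %matched are made up for illustration, and each pattern is just one possible instance of its concept.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ('a1', 'cat', 'dog', 'room 42', 'aaa');

# One pattern per concept from the list above.
my %concept = (
    literal       => qr/a1/,          # the characters a and 1, as they are
    metacharacter => qr/\d/,          # \d stands for any digit
    vertical_bar  => qr/cat|dog/,     # either "cat" or "dog"
    quantifier    => qr/a{3}/,        # three a's in a row
    capturing     => qr/room (\d+)/,  # parentheses capture the digits
);

# For every concept, collect the samples that its pattern matches.
my %matched;
for my $name (sort keys %concept) {
    $matched{$name} = [ grep { $_ =~ $concept{$name} } @samples ];
    print "$name: @{ $matched{$name} }\n";
}

# In list context a match returns what the parentheses captured.
my ($room) = 'room 42' =~ $concept{capturing};
print "captured: $room\n";
```

Note that the patterns are not anchored, so each one matches if it is found anywhere inside a sample.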
Syntax
Let us look at the metacharacters in detail.
Metacharacter | Description |
^ | Start of a string |
$ | End of a string |
\t | Tab |
\n | Newline |
\r | Carriage return |
\s | Any whitespace character |
\S | Any non-whitespace character |
\d | Any digit |
\D | Any non-digit |
\w | Any word character |
\W | Any non-word character |
\b | Any word boundary |
\B | Any non-word-boundary |
. | Any single character, usually barring a newline |
By the way, if you want to match a metacharacter literally, you need to use \ to escape it. For example, \. will match only the . character.
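To see a few of these metacharacters at work, and the difference that escaping makes, here is a small sketch in Perl. The sample string and all variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'pi is 3.14';   # hypothetical sample string

my $starts_with_pi = $text =~ /^pi/    ? 1 : 0;  # ^ anchors at the start
my $ends_with_14   = $text =~ /14$/    ? 1 : 0;  # $ anchors at the end
my $has_whitespace = $text =~ /\s/     ? 1 : 0;  # \s is any whitespace
my $is_as_a_word   = $text =~ /\bis\b/ ? 1 : 0;  # \b marks word boundaries

# An unescaped . matches any character, an escaped \. only a literal dot.
my $dot_accepts_x     = '3x14' =~ /3.14/  ? 1 : 0;
my $escaped_accepts_x = '3x14' =~ /3\.14/ ? 1 : 0;
my $escaped_finds_dot = $text  =~ /3\.14/ ? 1 : 0;

print "anchors: $starts_with_pi $ends_with_14\n";
print "whitespace: $has_whitespace, word boundary: $is_as_a_word\n";
print "dot: $dot_accepts_x, escaped: $escaped_accepts_x, literal: $escaped_finds_dot\n";
```

Only $escaped_accepts_x comes out as 0, because \. refuses the x that the bare . happily accepts.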
Now, let us look at constructs that give us more flexibility.
Expression | Meaning |
[abc] | Matches any one of a, b, or c |
[^abc] | Matches any one character other than a, b, or c |
[a-d] | Matches any one of the characters in the range a-d |
a* | Matches a zero or more times |
a? | Matches a zero or one time |
a+ | Matches a one or more times |
a|b | Matches either a or b |
a{3} | Matches exactly 3 of a |
a{3,} | Matches 3 or more of a |
a{3,5} | Matches 3, 4 or 5 of a (inclusive range) |
( ) | Captures everything inside the parentheses |
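The rows of this table can be checked mechanically. The sketch below is one way to do it in Perl; the test strings are made up, and every pattern is wrapped in ^ and $ (an addition made here) so that the whole string has to fit the expression rather than just a part of it.

```perl
use strict;
use warnings;

# Each entry: [pattern, string, whether a match is expected].
my @cases = (
    [ qr/^[abc]$/,  'b',      1 ],
    [ qr/^[^abc]$/, 'b',      0 ],  # b is one of the excluded characters
    [ qr/^[a-d]$/,  'd',      1 ],
    [ qr/^ba*$/,    'b',      1 ],  # zero a's is fine for *
    [ qr/^ba?$/,    'baa',    0 ],  # ? allows at most one a
    [ qr/^ba+$/,    'b',      0 ],  # + needs at least one a
    [ qr/^(a|b)$/,  'b',      1 ],
    [ qr/^a{3}$/,   'aaa',    1 ],
    [ qr/^a{3,}$/,  'aaaaaa', 1 ],
    [ qr/^a{3,5}$/, 'aaaaaa', 0 ],  # six a's is outside 3 to 5
);

my $all_as_expected = 1;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $got = $string =~ $pattern ? 1 : 0;
    $all_as_expected = 0 if $got != $expected;
    printf "%-20s %-10s %s\n", $pattern, "'$string'", $got ? 'match' : 'no match';
}
print $all_as_expected ? "all rows behave as described\n" : "something is off\n";
```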
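We are now ready to explain why \w+@\w+\.(com|org|net|in) does what it claims.
Firstly, what should an email address look like? That's right, it should have a structure like user@domain.extension.
The user and domain parts each consist of letters, digits or underscores, with at least one character in each. So, we use \w+ for both.
We restrict the extension to org, com, net or in by using the | inside a group. The dot before the extension is escaped as \. so that it matches only a literal dot.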
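Here is a rough sketch of how the email pattern behaves in Perl. The addresses are hypothetical, and the second pattern with ^ and $ is an assumption added here to show the effect of anchoring; the text itself uses the unanchored form.

```perl
use strict;
use warnings;

# The pattern from the text. The @ is written as \@ so that Perl does not
# try to read it as the start of an array name; it matches the same text.
my $unanchored = qr/\w+\@\w+\.(com|org|net|in)/;

# Assumption made here, not in the text: ^ and $ force the whole string
# to be an address instead of merely containing one.
my $anchored = qr/^\w+\@\w+\.(com|org|net|in)$/;

# Hypothetical addresses, for illustration only.
my @addresses = (
    'alice@example.com',
    'bob@example.in',
    'carol@example.io',
    'dave@example.info',
    'first.last@example.com',
);

my (%loose, %whole);
for my $address (@addresses) {
    $loose{$address} = $address =~ $unanchored ? 1 : 0;
    $whole{$address} = $address =~ $anchored   ? 1 : 0;
    printf "%-24s unanchored=%d anchored=%d\n",
        $address, $loose{$address}, $whole{$address};
}
```

The last two addresses show why anchoring matters: without it, the pattern is satisfied by the .in at the start of .info, and by the last@example.com part of first.last@example.com.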
The Brilliant staff have emails like calvin@brilliant.org or support@brilliant.org, i.e., a single alphanumeric word (sometimes with an underscore or period) or a name, followed by the website address.
Kenji wants to build an app which only the Brilliant staff could use. Which of the following regexes would be the best for him to use?
\w+@brilliant\.org
\w+@brilliant.org
\w*@brilliant\.org
+@brilliant\.org
Regular Expressions in Action - Perl Implementation
Perl is the language most famous for its use of regular expressions, and for good reasons.
We use the =~ operator to denote a match or an assignment, depending upon the context. The !~ operator reverses the sense of the match.
There are basically two regex operators in Perl:
- Matching: m//
- Substitution: s///
The purpose of the // is to enclose the regex. However, other delimiters such as {}, "", etc. could be used instead.
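Alternative delimiters are mostly useful when the pattern itself contains slashes. A short sketch, with a made-up path string and hypothetical variable names:

```perl
use strict;
use warnings;

my $path = '/usr/local/bin';   # hypothetical string containing slashes

my $with_slashes = $path =~ m/\/usr\/local/ ? 1 : 0;  # every / must be escaped
my $with_braces  = $path =~ m{/usr/local}   ? 1 : 0;  # no escaping needed
my $with_quotes  = $path =~ m"/usr/local"   ? 1 : 0;  # quotes as delimiters

# The same freedom applies to s///: copy $path, then edit the copy.
(my $shorter = $path) =~ s{/local}{};

print "$with_slashes $with_braces $with_quotes $shorter\n";
```

All three matches succeed, and $shorter ends up as /usr/bin while $path is left alone.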
Matching
To use the matching operator, we bind the string on the left to the pattern on the right using the =~ and m// operators.
The following sets $true to 1 if and only if $foo matches the regular expression foo:
    $true = ($foo =~ m/foo/);
It is not hard to see that the opposite case is handled by !~:
    $false = ($foo !~ m/foo/);
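As a runnable sketch of the two operators side by side, the loop below tries both on a pair of made-up inputs. The hash names are hypothetical, and the ? 1 : 0 is only there to turn Perl's true and false values into plain numbers for printing.

```perl
use strict;
use warnings;

my (%contains_foo, %lacks_foo);
for my $foo ('seafood', 'bar') {                    # hypothetical inputs
    $contains_foo{$foo} = ($foo =~ m/foo/) ? 1 : 0; # true when "foo" occurs in $foo
    $lacks_foo{$foo}    = ($foo !~ m/foo/) ? 1 : 0; # true when it does not
    print "$foo: contains=$contains_foo{$foo} lacks=$lacks_foo{$foo}\n";
}
```

For any given string exactly one of the two comes out true, since !~ is just the negation of =~.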
Capturing
As promised, the () can be used for capturing parts of a regex. When the pattern inside a pair of parentheses matches, the matched text goes into special variables like $1, $2, etc., in that order.
Here's how one would extract the hours, minutes, seconds from a time string:
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours   = $1;
        $minutes = $2;
        $seconds = $3;
    }
In list context, the list ($1, $2, $3, ...) would be returned.
A simpler way to do the same would be
    my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
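The list-context form can be run as is once $time has a value. In this sketch the time string is made up, and the second match is there to show what happens on failure: the list of captures is simply empty.

```perl
use strict;
use warnings;

my $time = '09:41:07';   # hypothetical input

# A successful match in list context hands back the three captures.
my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
print "h=$hours m=$minutes s=$seconds\n";

# A failed match in list context hands back an empty list.
my @parts = ('not a time' =~ m/(\d+):(\d+):(\d+)/);
print "captures on failure: ", scalar(@parts), "\n";
```

Checking that the list is non-empty before using the pieces is a reasonable habit, since $hours and friends would otherwise be undefined.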
Substitution
This is our favorite search and replace feature. Almost the same syntax rules apply here, except that there is an extra clause between the second pair of delimiters, which tells us what to replace the match with.
Here is a self-explanatory piece of code:
    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;                  # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;                 # strip single quotes,
                                          # $y contains "quoted words"
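Captured groups can also be rearranged in the replacement. The following sketch, which uses a made-up date string and hypothetical variable names, turns yyyy-mm-dd into dd.mm.yyyy and also looks at the value that s/// itself returns.

```perl
use strict;
use warnings;

my $date = '2024-03-15';   # hypothetical ISO-style date

# Reorder the captured pieces: year, month, day becomes day.month.year.
my $changed = ($date =~ s/^(\d{4})-(\d\d)-(\d\d)$/$3.$2.$1/) ? 1 : 0;
print "$date (changed: $changed)\n";

# When the pattern does not match, the string is left as it was.
my $other     = 'no date here';
my $untouched = ($other =~ s/^(\d{4})-(\d\d)-(\d\d)$/$3.$2.$1/) ? 1 : 0;
print "$other (changed: $untouched)\n";
```

This is the same trick as the if around the substitution in the code above: s/// is true only when it actually replaced something.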
Modifiers
Modifiers can be appended to the end of a regex operation to modify its matching behavior.
Here is a list of some important modifiers:
Modifier | Description |
i | Case-insensitive matching |
s | Allows . to match newlines |
x | Allows whitespace and comments in the regex, for clarity |
g | Globally find all matches |
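The first three modifiers can be compared against the unmodified behavior in a few lines. This is a sketch with made-up strings and hypothetical variable names.

```perl
use strict;
use warnings;

# i: ignore case.
my $case_sensitive   = 'PERL' =~ /perl/  ? 1 : 0;
my $case_insensitive = 'PERL' =~ /perl/i ? 1 : 0;

# s: let . match a newline as well.
my $two_lines            = "first\nsecond";
my $dot_stops_at_newline = $two_lines =~ /first.second/  ? 1 : 0;
my $dot_crosses_newline  = $two_lines =~ /first.second/s ? 1 : 0;

# x: whitespace and comments inside the regex are ignored.
my $time       = '12:34:56';
my $spaced_out = $time =~ /
    (\d\d) :   # hours
    (\d\d) :   # minutes
    (\d\d)     # seconds
/x ? 1 : 0;

print "i: $case_sensitive vs $case_insensitive\n";
print "s: $dot_stops_at_newline vs $dot_crosses_newline\n";
print "x: $spaced_out\n";
```

In each pair, the version without the modifier fails and the version with it succeeds.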
Here's how one might want to use the g modifier:
    $x = "I batted 4 for 4";
    $x =~ s/4/four/;     # doesn't do it all:
                         # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;    # does it all:
                         # $x contains "I batted four for four"
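The g modifier is useful for matching as well as for substitution. As a small sketch (variable names are hypothetical), a global match in list context returns every match, and a global substitution reports how many replacements it made:

```perl
use strict;
use warnings;

my $x = 'I batted 4 for 4';

# In list context, a /g match returns all the matches it finds.
my @digits = $x =~ /\d/g;

# s///g returns the number of replacements made.
my $count = ($x =~ s/4/four/g);

print "digits found: @digits\n";
print "replacements: $count\n";
print "$x\n";
```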