World Wide Web Programming 17 - More on Regular Expressions
Reusing groups of Characters
A group of characters captured in a regular expression can be reused later in the same expression. For example, you may want to check that a list of values never contains the same value twice in a row.
• 009, 007, 001, 002, 004, 003 would be accepted, but 007, 007, 001, 002, 002, 003 would contain 2 errors
Any pattern written between parentheses can be referenced by a \ followed by the position of the group, counting from 1. For example, /(\d+), \1/ matches a number of one or more digits followed by a comma, a space and the same number again. Example:
• var myString = "007, 007, 1, 02, 02, 003, 002, 04";
  var myRegExp = /(\d+), \1/g;
  myString = myString.replace(myRegExp, "ERROR");
• This changes the string into ERROR, 1, ERROR, 003, 002, 04
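As a minimal sketch, the backreference example can be run as shown below. The second pattern, with \b word boundaries, is an addition that is not in the slides: it assumes that partial overlaps such as "12, 2" should not be reported as repeats.

```javascript
// Backreference: \1 must repeat exactly what group 1 captured.
const values = "007, 007, 1, 02, 02, 003, 002, 04";
const flagged = values.replace(/(\d+), \1/g, "ERROR");
console.log(flagged); // ERROR, 1, ERROR, 003, 002, 04

// Assumption (not in the slides): without \b, "12, 2" matches from its "2".
const tricky = "12, 2, 7, 70";
const loose = tricky.replace(/(\d+), \1/g, "ERROR");
const strict = tricky.replace(/\b(\d+), \1\b/g, "ERROR");
console.log(loose);  // 1ERROR, ERROR0
console.log(strict); // 12, 2, 7, 70
```

Whether the stricter form is wanted depends on how the values are produced; for fixed-width values such as 007 the two patterns behave the same.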
Using the "or" character
The | character lets you match either of two alternative patterns. For example, to replace ' by " but only when it surrounds a word, you would use
• /\B'|'\B/g
• This finds a position that is not a word boundary (\B) followed by a ', OR a ' followed by such a position
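The sketch below applies that pattern to a sample sentence (the sentence is made up for illustration). The apostrophe inside "it's" has a word character on both sides, so neither alternative matches it.

```javascript
// Replace ' by " only where the quote sits at the outer edge of a word.
const sentence = "He said 'hello' and it's fine";
const requoted = sentence.replace(/\B'|'\B/g, '"');
console.log(requoted); // He said "hello" and it's fine
```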
Another example of OR
Try to match tags and words into an array. We'll use the myString.match(myRegExp) function, which returns an array.
The regular expression should first try to find a tag
• Starts with <
• Has at least one character other than >, \r or \n
• Ends with >
• This gives us <[^>\r\n]+>
It has to find either a tag OR a word, represented as a run of non-tag characters (no <, >, \r or \n)
• [^<>\r\n]+
The total regular expression becomes
• /<[^>\r\n]+>|[^<>\r\n]+/g
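A small sketch of the tag-or-text pattern in use, with a made-up input string. Note that the "word" alternative matches any run of non-tag characters, so spaces stay attached to the text pieces.

```javascript
// Split a line of HTML into tags and the text runs between them.
const html = "<b>Hello</b> world";
const pieces = html.match(/<[^>\r\n]+>|[^<>\r\n]+/g);
console.log(pieces); // [ '<b>', 'Hello', '</b>', ' world' ]
```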
Regular Expressions in PHP
It is more or less like in JavaScript, though we more generally use the string version "abc" rather than /abc/. The basics are the same; we can use
• "xyz" "abc|xyz"
• "[xyz]" "[0-9]" "[^xyz]"
• "x+" "x*" "x?"
• "ab{3}" "ab{3,}" "ab{3,5}" "x(yz)*"
• "." "^ab" "ab$"
You can also use the character classes "[[:alnum:]]", "[[:digit:]]" and "[[:alpha:]]"
Regular expressions examples
To find a currency amount (0; 1000; 10,000; 10000.00; 10,000.00; …)
• We have either 0 or a number not starting with 0
• Can have up to 2 digits after the decimal point
• Can be negative
• Can have commas…
We start with the string 0 or another number not starting with 0
• "^0$"
• Or "^[1-9][0-9]*$"
• Combined: "^(0|[1-9][0-9]*)$"
We add support for the - sign
• "^(0|-?[1-9][0-9]*)$"
To find an optional decimal value
• "(\.[0-9]{1,2})?"
• Which gives us "^(0|-?[1-9][0-9]*)(\.[0-9]{1,2})?$"
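To add support for commas
• "^(0|-?[1-9](([0-9]*)|([0-9]{0,2}(,[0-9]{3})*)))(\.[0-9]{1,2})?$"
• The group before the first comma is [0-9]{0,2} so that, together with the leading [1-9], it holds 1 to 3 digits and 1,000 is accepted as well as 10,000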
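The sketch below tries the comma-aware pattern on a few values. It is written with [0-9]{0,2} for the digits before the first comma, on the assumption that amounts such as 1,000 are meant to be valid; the sample values beyond those listed in the slides are made up.

```javascript
// Currency pattern: 0, or an optionally negative number without a leading 0,
// with optional thousands separators and up to two decimals.
const currency =
  /^(0|-?[1-9](([0-9]*)|([0-9]{0,2}(,[0-9]{3})*)))(\.[0-9]{1,2})?$/;

const accepted = ["0", "1000", "10,000", "10000.00", "10,000.00", "1,000", "-42.5"];
const rejected = ["007", "10,00", "1.234"];

for (const s of accepted.concat(rejected)) {
  console.log(s, currency.test(s));
}
```

Every value in `accepted` tests true and every value in `rejected` tests false.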
Validating an email address
We want to match an address made of a user name and a domain separated by @
• User can be "[-a-zA-Z0-9._]+"
• The domain name can also have a whole set of subdomains: "(\.[-a-zA-Z0-9]+)"
It becomes
• "^[-a-zA-Z0-9._]+@[-a-zA-Z0-9.]+(\.[-a-zA-Z0-9]+)+$"
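As a rough sketch, the same pattern can be tried in JavaScript. The addresses are made-up examples and `isEmail` is a hypothetical helper name; the pattern is a simple format check and does not cover every address that mail systems accept.

```javascript
// Format check only: user part, @, domain with at least one dot-separated part.
const emailPattern = /^[-a-zA-Z0-9._]+@[-a-zA-Z0-9.]+(\.[-a-zA-Z0-9]+)+$/;
const isEmail = (s) => emailPattern.test(s);

console.log(isEmail("john.doe@mail.example.com")); // true
console.log(isEmail("john.doe.example.com"));      // false (no @)
console.log(isEmail("user@localhost"));            // false (no dot in the domain)
console.log(isEmail("user name@example.com"));     // false (space in the user part)
```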
Using PHP functions
int ereg(string pattern, string str [, array regs])
For example, to convert a date in the format MM-DD-YYYY into DD-MM-YYYY you can do
if (ereg("([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})", $date, $regs))
    echo $regs[2]."-".$regs[1]."-".$regs[3];
else
    echo "Invalid date format: ".$date;
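You can also use eregi(…), which ignores the character case
Note that the ereg family was deprecated in PHP 5.3 and removed in PHP 7.0; current code should use the PCRE functions presented later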
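For comparison, here is a sketch of the same capture-and-reorder idea in JavaScript. The function name `swapMonthAndDay` is hypothetical, and the ^ and $ anchors are an assumption added here so that extra text around the date is rejected; the ereg() call above is not anchored.

```javascript
// regs[0] is the whole match, regs[1..3] are the captured month, day and year.
function swapMonthAndDay(date) {
  const regs = /^([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})$/.exec(date);
  if (regs === null) {
    return "Invalid date format: " + date;
  }
  return regs[2] + "-" + regs[1] + "-" + regs[3];
}

console.log(swapMonthAndDay("12-25-2004")); // 25-12-2004
console.log(swapMonthAndDay("2004/12/25")); // Invalid date format: 2004/12/25
```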
More PHP functions
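string ereg_replace(string pattern, string replacement, string str)
• Replaces each match of pattern by replacement in str
string eregi_replace(…)
• Case-insensitive version of ereg_replace()
array split(string pattern, string str [, int limit])
• Returns an array of the pieces of str delimited by matches of pattern
• limit can be set to decide the maximum number of elements in the array
array spliti(…)
• Case-insensitive version of split()
string sql_regcase(string str)
• Returns a regular expression that matches str regardless of case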
Pleasing Perl lovers…
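PHP has a set of PCRE (Perl-compatible regular expression) functions. You can use the same regexps as in JavaScript, written between delimiters: /php/. You can also use any of the following modifiers
• i for case-insensitive matching
• x to ignore whitespace in the pattern and also anything written between # and the end of the line, which allows writing patterns like
/         # begin pattern
\b        # find a word boundary
web       # "web" is to be matched
\b        # followed by another word boundary
/xi
• e, only used by preg_replace(): after the normal substitution of \\ references in the replacement string, the result is evaluated as PHP code and used as the replacement (deprecated in PHP 5.5 and removed in PHP 7.0, where preg_replace_callback() takes its place)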
You can also use any of the following escape sequences
• \d \D \s \S \w \W \b \B \A \Z \z
For more details on PCRE… http://www.pcre.org
Example of the e attribute
$html_string = "<b>Bold Text</b> and <u>underlined text</u>";
$new_html = preg_replace("/(<\/?)(\w+)([^>]*>)/e", "'\\1'.strtoupper('\\2').'\\3'", $html_string);
The new string would hold "<B>Bold Text</B> and <U>underlined text</U>"
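The same tag-uppercasing idea can be sketched in JavaScript with a replacement callback, which plays the role of the evaluated replacement string. The sample markup (with b and u tags) is an assumption about the original example, and the pattern is the usual three-group form: tag opener, tag name, rest of the tag.

```javascript
// Groups: 1 = "<" or "</", 2 = tag name, 3 = attributes and closing ">".
const htmlString = "<b>Bold Text</b> and <u>underlined text</u>";
const newHtml = htmlString.replace(
  /(<\/?)(\w+)([^>]*>)/g,
  (match, open, name, rest) => open + name.toUpperCase() + rest
);
console.log(newHtml); // <B>Bold Text</B> and <U>underlined text</U>
```

A callback keeps the captured text as data instead of pasting it into code to be evaluated.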
PCRE related PHP functions (1/4)
int preg_match(string pattern, string str [, array match])
• $match[0] would hold the text that matched the full pattern
• $match[1] would hold the text that matched the first captured parenthesized subpattern
• and so on
PCRE related PHP functions (2/4)
int preg_match_all(string pattern, string str, array matches [, int order])
Order can be
• PREG_PATTERN_ORDER
  In this case $matches[0] is an array of full pattern matches, $matches[1] an array of the strings matched by the first parenthesized subpattern, and so on
• PREG_SET_ORDER
  In this case $matches[0] is an array holding the first set of matches, $matches[1] an array holding the second set of matches, and so on…
I recommend you use PREG_PATTERN_ORDER
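To make the two layouts concrete, the sketch below builds both from the same matches in JavaScript. It only illustrates the shape of the arrays and is not PHP's implementation; the input text and the two-group pattern are made up.

```javascript
// All matches of a two-group pattern, collected in both layouts.
const text = "a1 b2 c3";
const pairRe = /([a-z])(\d)/g;

// Set order: one row per match -> [full match, group 1, group 2]
const setOrder = Array.from(text.matchAll(pairRe), (m) => [m[0], m[1], m[2]]);

// Pattern order: one row per group, obtained by transposing the rows above
const patternOrder = setOrder.length === 0
  ? []
  : setOrder[0].map((_, i) => setOrder.map((row) => row[i]));

console.log(setOrder);     // [['a1','a','1'], ['b2','b','2'], ['c3','c','3']]
console.log(patternOrder); // [['a1','b2','c3'], ['a','b','c'], ['1','2','3']]
```

Pattern order is convenient when you want every value of one group at once; set order is convenient when you loop over matches one at a time.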
PCRE related PHP functions (3/4)
string preg_replace(string pattern, string replacement, string str [, int limit])
• If limit is omitted or equal to -1, all occurrences are replaced
• The replacement can use the references \\0, \\1, …: \\0 stands for the text matched by the whole pattern and \\1 to \\9 for the captured substrings (current PHP versions accept references up to \\99)
PCRE related PHP functions (4/4)
array preg_split(string pattern, string subject [, int limit [, int flags]])
• flags can be PREG_SPLIT_NO_EMPTY, which makes preg_split() return only non-empty pieces…
string preg_quote(string s [, string delimiter])
• Puts a backslash in front of every character which is part of the regular expression syntax (.\+*?[^]$(){}=!|:)
• If you add a delimiter, it will also be escaped
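The quoting idea can be sketched in JavaScript as below. `quoteRegExp` is a hypothetical helper, not a built-in, and it escapes only the characters listed above plus an optional delimiter; it is not a port of PHP's implementation.

```javascript
// Put a backslash in front of each character from the list above.
function quoteRegExp(s, delimiter = "") {
  const special = ".\\+*?[^]$(){}=!|:" + delimiter;
  return Array.from(s, (ch) => (special.includes(ch) ? "\\" + ch : ch)).join("");
}

console.log(quoteRegExp("1+1=2?"));   // 1\+1\=2\?
console.log(quoteRegExp("a/b", "/")); // a\/b

// The quoted text now matches literally: "." no longer means "any character".
const literal = new RegExp("^" + quoteRegExp("a.b") + "$");
console.log(literal.test("a.b"), literal.test("axb")); // true false
```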