
Five Regular-Expression Concepts a PHP Developer Should Know July 14, 2014

Posted by Tournas Dimitrios in PHP.

A regular expression, also called regex or regexp, is a codified method of searching a string (or text) for a matching pattern. The concept was invented by the American mathematician Stephen Cole Kleene (January 5, 1909 – January 25, 1994). Regexes are used for advanced, context-sensitive searches and text modifications. Although regexes are widespread in the Unix world, there is no such thing as ‘the standard regular expression language’; there are several different dialects. PHP supports two types of regular expressions: POSIX-extended and Perl-Compatible Regular Expressions (PCRE). The former is deprecated as of PHP 5.3, and using any of its functions generates a notice of level E_DEPRECATED. The latter (the PCRE functions) is highly encouraged: these functions support UTF-8 encoded text, and they are more powerful and faster than the POSIX ones.

Excerpt taken from a StackOverflow.com discussion : POSIX (Portable Operating System Interface for Unix) is simply a set of standards that define how to develop programs for UNIX (and its variants). Being POSIX-compliant for an OS means that it supports those standards (e.g., APIs), and thus can either natively run UNIX programs, or at least porting an application from UNIX to the target OS is easy/easier than if it did not support POSIX. Of course, the level of compliance is not necessarily 100% and can vary (e.g., not all features are supported or may be implemented differently).

Who is this article for? Certainly not for a beginner in the world of “manipulating text programmatically”, nor for those who don’t like to experiment. The reader should have basic knowledge of regex syntax (he/she should be fluent in reading and writing regex patterns, so to speak). We will “talk” about greediness, internal option settings, lookbehinds, lookaheads, non-capturing sequences and backreferences. Each of these concepts is explained briefly and accompanied by a practical example. This article aims to spark your interest in writing more advanced regex patterns.

For those who need a refresher, the end of this article (paragraph: “A quick overview of all PCRE functions”) has practical examples for all PCRE functions. There are also tables of metacharacters and pattern modifiers showing the most used “tools” (learn these tools by heart if you plan to master PCRE). Of course, the official documentation is “the way to go” for an in-depth review of the subject.

Whatever is done with regular expressions can also be achieved with native PHP functions (after all, PHP has close to one hundred functions for text manipulation). There is one crucial difference, though: regular expressions do it more efficiently, not in terms of runtime performance but in terms of the amount of code that has to be written. The following example illustrates what I’m talking about :

/*  */

$contents = array(
    ' ALPHA ΒΑΝΚ ' => ' 0,6320 -0,0530 -7,74% 0,6260 0,6910 46.264.300 30.254.097 -',
    //    ......  truncated  .............  ///
    );

// Using PHP's string manipulation functions
foreach ($contents as $bank_name => $stock_values) {
    $stock_values =  str_replace('%', '', $stock_values);
    //first strip '-' and then strip whitespaces from beginning and end of string
    $cleaned_stock_values = trim(rtrim($stock_values, '-'));
    // the text is sliced and stored into an array
    $values = explode(' ', $cleaned_stock_values );
    $final_contents[$bank_name] =  $values;

}

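// Using PCRE functions (this loop overwrites the entries stored above with identical results)
foreach ($contents as $bank_name => $stock_values) {
    // -1 means "no limit"; passing null for the limit is deprecated in recent PHP versions
    $final_contents[$bank_name] = preg_split('/(-$)|[\s]|[%]/', $stock_values, -1, PREG_SPLIT_NO_EMPTY);
}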

var_dump($final_contents); 

/*
array (size=1)
  ' ALPHA ΒΑΝΚ ' =>
    array (size=7)
      0 => string '0,6320' (length=6)
      1 => string '-0,0530' (length=7)
      2 => string '-7,74' (length=5)
      3 => string '0,6260' (length=6)
      4 => string '0,6910' (length=6)
      5 => string '46.264.300' (length=10)
      6 => string '30.254.097' (length=10)
*/

/* */

Both approaches in the above code snippet achieve the same goal, but the regular expression solution is evidently more compact (and more readable, I would say). For the curious spirits out there, the above example is an associative array whose values are strings of stock-market values which have to be sliced into separate entities. First, each string has to be cleaned of unwanted characters (ie : %, - and white-spaces), and then the same string has to be transformed into an array. The outcome of this transformation can be used by Google Charts and other custom code to draw statistical conclusions and to create attractive, interactive charts.

Five regular-expression concepts a PHP developer should know :

First concept : Greediness :

By default, regular expressions in PHP are what’s known as greedy. This means a quantifier always tries to match as many characters as possible. This characteristic is sometimes undesirable (for instance, when parsing HTML/XML text). Adding a question mark after a quantifier (ie: +?, *?) makes that quantifier non-greedy: it will match the minimal number of characters, not the maximal number. A capital “U” after the closing pattern delimiter inverts the default for the whole pattern: plain quantifiers become non-greedy, and a quantifier followed by a question mark becomes greedy. A practical example will illustrate this feature :

//
$html = '<a><img src="logo.png" alt="" />My image </a>' ;
preg_match_all('/<img.*>/', $html, $matches1); // greedy
preg_match_all('/<img.*?>/', $html, $matches2); // nongreedy
preg_match_all('/<img.*>/U', $html, $matches3); // nongreedy
echo '$matches1 content';
var_dump($matches1);
echo '$matches2 content';
var_dump($matches2);
echo '$matches3 content';
var_dump($matches3);

/*
$matches1 content
array (size=1)
  0 =>
    array (size=1)
      0 => string '<img src="logo.png" alt="" />My image </a>' (length=42)
$matches2 content
array (size=1)
  0 =>
    array (size=1)
      0 => string '<img src="logo.png" alt="" />' (length=29)
$matches3 content
array (size=1)
  0 =>
    array (size=1)
      0 => string '<img src="logo.png" alt="" />' (length=29)
*/
//

Greedy matching is also known as maximal matching and non-greedy matching can be called minimal matching, because these options match either the maximum or minimum number of characters possible.
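
To make the effect of the “U” modifier more tangible, here is a small sketch on a made-up string (the markup and the variable names are my own, not part of the example above). It counts how many matches each variant produces, and adds a negated character class as a common alternative that does not depend on greediness at all :

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs up to the last </b>, so the whole string is a single match.
$greedy = preg_match_all('/<b>.*<\/b>/', $html, $m1);

// Non-greedy: .*? stops at the first </b>, giving one match per element.
$lazy = preg_match_all('/<b>.*?<\/b>/', $html, $m2);

// With U the defaults are swapped: a plain .* is now non-greedy ...
$ungreedy = preg_match_all('/<b>.*<\/b>/U', $html, $m3);

// ... and .*? becomes greedy again.
$inverted = preg_match_all('/<b>.*?<\/b>/U', $html, $m4);

// A negated character class side-steps the question: a tag is "<",
// anything that is not ">", and then ">".
$tags = preg_match_all('/<[^>]+>/', $html, $m5);

printf("%d %d %d %d %d\n", $greedy, $lazy, $ungreedy, $inverted, $tags);
// prints "1 2 2 1 4"
```

Because “U” silently flips the meaning of every quantifier, many developers prefer the explicit question mark, which keeps the behaviour visible at the place where it matters.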

Second concept : Internal Option Settings :

PCRE has a long list of pattern modifiers that customize (ie override) the default behaviour of the whole pattern. The most used modifiers are :
i -> case insensitivity , u -> pattern and subject strings are treated as UTF-8 , U -> greediness is inverted (we have already described this feature) , m -> multi-line mode …… visit the documentation page for an extensive list. There are cases, though, where only a fragment of the pattern needs to be customized. This can be achieved with a sequence of option letters enclosed between “(?” and “)”, which applies from that point onwards. The documentation page has a list of all “internal option setting” characters. A practical example will illustrate this feature :

$silly_text = <<<EOT
Hey my name is mr Dean and my name starts with a capital letter D .
Let us think of other things that starts with D .
My favorite candies are chocolate cookies
My favorite candies are chocolate Cookies
My favorite candies are chocolate cookies
My favorite candies are chocolate Cookies
My favorite candies are chocolate cookies
EOT;

$pattern = array('/(?i)dean/', '/D/', '/(?i)cookies/', );
$replacement = array('Bean', 'B', 'Biscuits');
$result = preg_replace($pattern, $replacement, $silly_text);
print $result ;

The important thing to notice is that both pattern and replacement are arrays with an equal number of elements; if there are fewer replacements than patterns, an empty string is used for the missing ones. The above example changes the behaviour of individual patterns (ie it makes them case insensitive) by setting the internal option “(?i)”. The reverse is also possible: an option is unset by preceding its letter with a hyphen, for example “(?-i)”. Other times, we might want to combine multiple options, for example “(?im)”, to achieve multiple adjustments.
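
The following sketch shows how far an internal option reaches. The subjects and variable names are invented for this illustration; each call simply reports whether the pattern matches (1) or not (0) :

```php
<?php
// (?i) switches case-insensitivity on from that point onwards.
$a = preg_match('/a(?i)b/', 'aB');   // 1: "a" is case-sensitive, "b" is not
$b = preg_match('/a(?i)b/', 'AB');   // 0: "A" does not match the case-sensitive "a"

// (?-i) switches the option off again.
$c = preg_match('/(?i)ab(?-i)c/', 'ABc');  // 1
$d = preg_match('/(?i)ab(?-i)c/', 'ABC');  // 0: "C" has to be lowercase

// (?i:...) limits the option to a single (non-capturing) group.
$e = preg_match('/(?i:hello) World/', 'HELLO World');  // 1
$f = preg_match('/(?i:hello) World/', 'HELLO world');  // 0

// Options can be combined: i (ignore case) and s (the dot also matches newlines).
$g = preg_match('/(?is)start.*end/', "START\nEND");    // 1

var_dump([$a, $b, $c, $d, $e, $f, $g]);
```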

Third concept : Lookaheads and lookbehinds :

Lookahead and lookbehind, collectively called “lookarounds”, allow us to create regular expressions that would be difficult to write with basic regex patterns. Emulating the same functionality with basic patterns would result in very long-winded expressions. A lookaround specifies a sequence of characters that must (or must not) appear before or after a segment of the pattern, without becoming part of the match itself. There are four types of lookarounds :

  • Negative lookbehinds : (?<!sequence)expression : Specifies a group of characters that must not exist before the matched string.
  • Positive lookbehinds : (?<=sequence)expression : Specifies a group of characters that must exist before the matched string.
  • Negative lookaheads : expression(?!sequence) : Specifies a group of characters that must not exist after the matched string.
  • Positive lookaheads : expression(?=sequence) : Specifies a group of characters that must exist after the matched string.
/*  */
/*** a simple string ***/
$string = 'I\'m a Linux: php developer/programmer and .....';

var_dump(preg_match_all('/php(\s)+(?!programmer)/i', $string, $matches)); // 1
var_dump(preg_match('/php(\s)+(?=developer)/i', $string, $matches2)); // 1
var_dump(preg_match('/(?<=Linux:\s)php/', $string, $matches3)); // 1
var_dump(preg_match('/(?<!Linux:\s)php/', $string, $matches4)); // 0
//
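
The four lookaround types can also be seen side by side in a slightly more practical sketch. The price string below is made up for this illustration; note that the currency markers are checked but never end up in the matches :

```php
<?php
// Hypothetical sample text for this illustration.
$text = 'Costs $30, 40 EUR and $5';

// Positive lookbehind: digits that are preceded by a dollar sign.
preg_match_all('/(?<=\$)\d+/', $text, $dollars);

// Positive lookahead: digits that are followed by " EUR".
preg_match_all('/\d+(?= EUR)/', $text, $euros);

// Negative lookbehind: whole numbers that are NOT preceded by a dollar sign.
preg_match_all('/(?<!\$)\b\d+/', $text, $notDollars);

// Negative lookahead: whole numbers that are NOT followed by " EUR".
// The second \b stops the engine from backtracking to a shorter number.
preg_match_all('/\b\d+\b(?! EUR)/', $text, $notEuros);

var_dump($dollars[0]);    // '30', '5'
var_dump($euros[0]);      // '40'
var_dump($notDollars[0]); // '40'
var_dump($notEuros[0]);   // '30', '5'
```

The word boundaries in the negative variants are deliberate: without them the engine is free to match only part of a number (for instance the “0” of “30”) in order to satisfy the assertion.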

Fourth concept : Non-capturing sequences :

By placing part of a regular expression inside round brackets (parentheses), we can group that part of the regular expression together. This is usually done to apply alternation or quantifiers to that specific part [group] of the regex. Regex engines store the text matched by each group, which can then be read [referenced] with $1, $2, $3 etc. (the number is the position of the group, counted by its opening parenthesis from left to right). There are cases, though, where specific groups should be excluded from being stored. This is achieved by placing “?:” right after the opening parenthesis of that group. This is mainly done for three reasons :

  1. performance : large blocks of text which will not be referenced anyway should not be stored (there is no reason to consume memory resources).
  2. maintaining a reasonable sequence numbering (there is no reason to consume sequence numbers for groups that won’t be referenced).
  3. the number of groups that can be referenced is limited (as we will see in the next section).
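/* First example : every group is captured */
$phone = '(123)-345-6789-3456';
$phone2 = '345-6789-3456';
$pattern = '/(\(\d{3}\)-)?(\d{3})-(\d{4})-(\d{4})/';
// single quotes make it obvious that $1 and $2 are regex references, not PHP variables
$replace = '$1$2-****-****';
var_dump(preg_replace($pattern, $replace, $phone));  // string '(123)-345-****-****'
var_dump(preg_replace($pattern, $replace, $phone2)); // string '345-****-****'

/* Second example : the leading words are captured as group 1, so the phone groups shift to $2 and $3 */
$phone = 'This is a fake phone number (123)-345-6789-3456';
$phone2 = 'This is a fake phone number 345-6789-3456';
$pattern2 = '/(\w+\s+)+(\(\d{3}\)-)?(\d{3})-(\d{4})-(\d{4})/';
$replace2 = '$2$3-****-****';
var_dump(preg_replace($pattern2, $replace2, $phone));  // string '(123)-345-****-****'
var_dump(preg_replace($pattern2, $replace2, $phone2)); // string '345-****-****'

/* Third example : (?:...) keeps the leading words out of the numbering, so $1 and $2 can be used again */
$phone = 'This is a fake phone number (123)-345-6789-3456';
$phone2 = 'This is a fake phone number 345-6789-3456';
$pattern3 = '/(?:\w+\s+)+(\(\d{3}\)-)?(\d{3})-(\d{4})-(\d{4})/';
$replace3 = '$1$2-****-****';
var_dump(preg_replace($pattern3, $replace3, $phone));  // string '(123)-345-****-****'
var_dump(preg_replace($pattern3, $replace3, $phone2)); // string '345-****-****'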
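
The renumbering is easiest to see in the matches array of preg_match(). A minimal sketch with an invented subject string (the variable names are mine) :

```php
<?php
// Hypothetical subject for this illustration.
$subject = 'Dear Mrs. Smith';

// Capturing group for the title: it occupies index 1, the name lands in index 2.
preg_match('/(Mr|Mrs|Ms)\.? (\w+)/', $subject, $withCapture);

// Non-capturing group for the title: the name moves up to index 1.
preg_match('/(?:Mr|Mrs|Ms)\.? (\w+)/', $subject, $withoutCapture);

var_dump($withCapture);    // 'Mrs. Smith', 'Mrs', 'Smith'
var_dump($withoutCapture); // 'Mrs. Smith', 'Smith'
```

The group is still needed for the alternation (Mr|Mrs|Ms); the “?:” only tells the engine that nobody will ask for its content later.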

 

Fifth concept : Numeric Backreferences :

Backreferences match the same text as previously matched by a capturing group (an expression inside round brackets). To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening parentheses of all capturing groups (the first group has backreference number one, the second group has number two, etc.). Keep in mind that non-capturing sequences (groups that start with “?:”) are not counted. The following code demonstrates simple regexes that use backreferences :

$phone = '(123)345-6789-3456';
$phone2 = '123 345 6789 3456';
// the optional separators [\s-]? allow both notations to match
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})[\s-]?(\d{4})/';
$duplicate_string = 'Remove duplicate duplicate words that that have been forgotten';
// \1 to \4 in the replacement insert the text captured by groups 1 to 4
var_dump(preg_replace($pattern, '(\1)\2-\3-\4', $phone));
var_dump(preg_replace($pattern, '(\1)\2-\3-\4', $phone2));
// \1 inside the pattern must repeat the word captured by the first group
var_dump(preg_replace('/\b(\w+)\s+\1\b/', '\1', $duplicate_string));

/*
string '(123)345-6789-3456' (length=18)
string '(123)345-6789-3456' (length=18)
string 'Remove duplicate words that have been forgotten' (length=47)
*/

// Non-capturing sequences in action
$subject = 'Hello *John* how are you my name is *Peter*';
// here we have two groups (so, two backreferences)
var_dump(preg_replace('/(\*)(\w+)\*/', '<b>\2</b>', $subject));

// the first group is excluded from being stored, so counting starts at the second group
var_dump(preg_replace('/(?:\*)(\w+)\*/', '<b>\1</b>', $subject));
// both calls return 'Hello <b>John</b> how are you my name is <b>Peter</b>'

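In the replacement string of preg_replace(), references run from \0 (the whole match) through \99; the $n notation is equivalent, and ${n} avoids ambiguity when a digit follows the reference. Inside a pattern, \1 through \9 are always backreferences, while larger numbers can be confused with octal escapes, so the unambiguous \g{n} notation is the safer choice there. When using multiple backreferences, a regular expression can quickly become confusing and hard to understand. An alternative is to use named groups. A named group is specified with (?P<name>pattern), where name is the name of the group and pattern is the regular expression of the group itself. Inside the pattern, the group can then be referred to with (?P=name).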
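
A short sketch of named groups, on subjects I made up for this purpose. The first call shows that every named capture is available under its name as well as its number; the second one uses a named backreference inside the pattern :

```php
<?php
// Named groups: each capture can be read by name and by number.
preg_match(
    '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/',
    'Posted on 2014-07-14',
    $date
);
echo $date['year'], ' / ', $date['month'], ' / ', $date['day'], "\n";
// prints "2014 / 07 / 14"

// Named backreference: (?P=word) has to repeat whatever "word" captured.
$deduped = preg_replace(
    '/\b(?P<word>\w+)\s+(?P=word)\b/',
    '$1', // in this sketch the replacement string refers to the group by its number
    'Remove duplicate duplicate words that that were missed'
);
echo $deduped, "\n";
// prints "Remove duplicate words that were missed"
```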

 

 A quick overview of all PCRE functions :

PHP offers eight PCRE functions for searching and modifying strings [text] using Perl-compatible regular expressions (a ninth function returns the error code of the last PCRE regex execution, if there was an error of course). These functions can be divided into four categories : replacing, matching, splitting and filtering.

  • preg_replace() : replaces all occurrences of a pattern with a replacement, and returns the modified result. It takes three basic and two optional parameters.
    //
    $subject = 'Hello there amigo. I said, hello there amigo. Once again hello amigo';
    var_dump(preg_replace('/(amigo)/', 'my friend', $subject));
    //
    
  • preg_filter() : is identical to preg_replace() except that it only returns the (replaced) subjects in which there was a match. The subject can be a string or an array; the difference is easiest to see with an array of values.
    //
    $subject = array('apple', 'pear', 'banana', 'strawberry', 'cherry', 'computer' , 'tv');
    $pattern = array('/(apple|pear|banana)/', '/strawberry/', '/cherry/');
    $replace = array('$0s', 'strawberries', 'cherries');
    echo "preg_filter returns\n";
    var_dump(preg_filter($pattern, $replace, $subject));
    echo "preg_replace returns\n";
    var_dump(preg_replace($pattern, $replace, $subject));
    /*
    preg_filter returns
    array (size=5)
      0 => string 'apples' (length=6)
      1 => string 'pears' (length=5)
      2 => string 'bananas' (length=7)
      3 => string 'strawberries' (length=12)
      4 => string 'cherries' (length=8)
    preg_replace returns
    array (size=7)
      0 => string 'apples' (length=6)
      1 => string 'pears' (length=5)
      2 => string 'bananas' (length=7)
      3 => string 'strawberries' (length=12)
      4 => string 'cherries' (length=8)
      5 => string 'computer' (length=8)
      6 => string 'tv' (length=2)
      */
      //
    
  • preg_replace_callback() : calls the given callback function for every match of the pattern and replaces the match with the value the callback returns. The callback receives an indexed array : index 0 (zero) holds the whole match, index 1 the first captured group, index 2 the second captured group etc…
    function convDate( $matches ) {
      // $matches[1] = month, $matches[2] = day, $matches[3] = two-digit year
      $time = mktime(0, 0, 0, (int) $matches[1], (int) $matches[2], (int) $matches[3]);
      return date("l d F Y", $time);
    }
    
    $text = " The session was made between 9/18/13 and 7/22/14";
    $converted = preg_replace_callback('/([0-9]+)\/([0-9]+)\/([0-9]+)/', 'convDate', $text);
    var_dump($converted);
    //string ' The session was made between Wednesday 18 September 2013 and Tuesday 22 July 2014' (length=82)
    
  • preg_grep() : searches all elements of an array, returning only those elements that match the regex pattern.
    
    $subject = array('Getting started with PHP',
                     'Learning Mysql',
                     'All about the Linux kernel',
                     'Advanced PHP design patterns',
                     'A list of my favourit Linux books',
                     'Building web API\'s with php modules');
    var_dump(preg_grep('/(php)/i', $subject));
    /*
    array (size=3)
      0 => string 'Getting started with PHP' (length=24)
      3 => string 'Advanced PHP design patterns' (length=28)
      5 => string 'Building web API's with php modules' (length=35)
    */
    
  • preg_split() : breaks a string apart based on a pattern. The basic idea is the same as preg_match_all() except that, instead of returning matched pieces of the subject string, it returns an array of pieces that didn’t match the specified pattern.
    /*  */
    $subject = 'Apples and pears or potatos And tomatos OR onions and garlics';
    var_dump(preg_split('/(and|or)/i', $subject));
    /*
    array (size=6)
      0 => string 'Apples ' (length=7)
      1 => string ' pears ' (length=7)
      2 => string ' potatos ' (length=9)
      3 => string ' tomatos ' (length=9)
      4 => string ' onions ' (length=8)
      5 => string ' garlics' (length=8)
    */
    //
    
  • preg_match() : takes two to five parameters, of which only the first two are mandatory : the regular expression and the string to search. It is usually used to check whether a string contains a sequence of characters (as defined by the regex). It stops at the first match and returns the integer 1 if the pattern matched, 0 if it did not, and false on failure. The optional third parameter receives an array with the details of that first match : index 0 holds the text matched by the whole pattern, index 1 the first captured group, and so on.
  • preg_match_all() : in some sense, this function extends the previous one (ie preg_match()) by continuing the search until the end of the subject. It returns the number of full matches, and the third parameter receives all matched occurrences.
    preg_match("/\d/", "1 and 2 and 3 and 4", $matches1);
    preg_match("/welcome|\d/", "1 and welcome 2 and welcome 3 and welcome 4", $matches2);
    preg_match_all("/\d/", "1 and 2 and 3 and 4", $matches3);
    preg_match_all('/\d|wel(com)e/', "1 and welcome 2 and welcome 3 and welcome 4", $matches4);
    
    var_dump($matches1);
    var_dump($matches2);
    var_dump($matches3);
    var_dump($matches4);
    
    /*
      // $matches1
    array (size=1)
      0 => string '1' (length=1)
    
      // $matches2
    array (size=1)
      0 => string '1' (length=1)
    
      // $matches3
    array (size=1)
      0 =>
        array (size=4)
          0 => string '1' (length=1)
          1 => string '2' (length=1)
          2 => string '3' (length=1)
          3 => string '4' (length=1)
    
      // $matches4
    array (size=2)
      0 =>
        array (size=7)
          0 => string '1' (length=1)
          1 => string 'welcome' (length=7)
          2 => string '2' (length=1)
          3 => string 'welcome' (length=7)
          4 => string '3' (length=1)
          5 => string 'welcome' (length=7)
          6 => string '4' (length=1)
      1 =>
        array (size=7)
          0 => string '' (length=0)
          1 => string 'com' (length=3)
          2 => string '' (length=0)
          3 => string 'com' (length=3)
          4 => string '' (length=0)
          5 => string 'com' (length=3)
          6 => string '' (length=0)
    */
    

    Let’s improve the presentation of the array a little (by passing a fourth parameter to the function) :

    //
    $record = <<<EOT
    Male 1987-11-29 New York
    Female 1988-07-13 Tennessee
    Female 1990-04-14 New York
    
    EOT;
    $pattern = '/(Male|Female) (\d+)-(\d+)-(\d+) ([\w\s]+)\n/';
    preg_match_all($pattern, $record, $matches);
    var_dump($matches);
    /*
    array (size=6)
      0 =>
        array (size=3)
          0 => string 'Male 1987-11-29 New York
    ' (length=26)
          1 => string 'Female 1988-07-13 Tennessee
    ' (length=29)
          2 => string 'Female 1990-04-14 New York
    ' (length=27)
      1 =>
        array (size=3)
          0 => string 'Male' (length=4)
          1 => string 'Female' (length=6)
          2 => string 'Female' (length=6)
      2 =>
        array (size=3)
          0 => string '1987' (length=4)
          1 => string '1988' (length=4)
          2 => string '1990' (length=4)
      3 =>
        array (size=3)
          0 => string '11' (length=2)
          1 => string '07' (length=2)
          2 => string '04' (length=2)
      4 =>
        array (size=3)
          0 => string '29' (length=2)
          1 => string '13' (length=2)
          2 => string '14' (length=2)
      5 =>
        array (size=3)
          0 => string 'New York ' (length=9)
          1 => string 'Tennessee ' (length=10)
          2 => string 'New York' (length=8)
    */
    
    preg_match_all($pattern, $record, $matches, PREG_SET_ORDER);
    var_dump($matches);
    
    /*
    array (size=3)
      0 =>
        array (size=6)
          0 => string 'Male 1987-11-29 New York
    ' (length=26)
          1 => string 'Male' (length=4)
          2 => string '1987' (length=4)
          3 => string '11' (length=2)
          4 => string '29' (length=2)
          5 => string 'New York ' (length=9)
      1 =>
        array (size=6)
          0 => string 'Female 1988-07-13 Tennessee
    ' (length=29)
          1 => string 'Female' (length=6)
          2 => string '1988' (length=4)
          3 => string '07' (length=2)
          4 => string '13' (length=2)
          5 => string 'Tennessee ' (length=10)
      2 =>
        array (size=6)
          0 => string 'Female 1990-04-14 New York
    ' (length=27)
          1 => string 'Female' (length=6)
          2 => string '1990' (length=4)
          3 => string '04' (length=2)
          4 => string '14' (length=2)
          5 => string 'New York' (length=8)
    */
    //
    
  • preg_quote() : inserts a backslash before every character that has special significance in regular expression syntax, such as . \ + * ? [ ^ ] $ ( ) { } = ! < > | : and - . The pattern delimiter is not escaped by default; it can be passed as an optional second parameter.
    //
    var_dump(preg_quote('abcdefg $ ^ * . " \' ( ) + = - { } [ ] | \\'));
    /*
    string 'abcdefg \$ \^ \* \. " ' \( \) \+ \= \- \{ \} \[ \] \| \\' (length=56)
    */
    //
    
  • preg_last_error() : returns the error code of the last PCRE regex execution. The return value is an integer that corresponds to one of the predefined PREG_* constants (for example PREG_NO_ERROR); a full list of these constants is outlined on the documentation page.
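
The return values described in the list above are easy to mix up, so here is a small sketch that puts them next to each other. The subjects are invented for this illustration, and the invalid UTF-8 byte is just one convenient way to provoke a failure :

```php
<?php
// preg_match() returns 1 on a match, 0 on no match and false on failure.
$found = preg_match('/(\d+)-(\d+)/', 'Room 12-34', $matches);
// $matches[0] is the full match, $matches[1] and $matches[2] are the groups.

$missing = preg_match('/\d+/', 'no digits here');

// "\xFF" is not valid UTF-8, so matching with the u modifier fails.
$failed = preg_match('/./u', "\xFF");
$errorCode = preg_last_error();

var_dump($found, $matches, $missing, $failed);

if ($failed === false && $errorCode === PREG_BAD_UTF8_ERROR) {
    echo "The subject is not valid UTF-8\n";
}
```

Since 0 and false are both “falsy”, comparing the result with === is the only reliable way to tell “no match” from “something went wrong”.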

Meta Characters :

Character   Description
^           Marks the start of a string (similar to “\A”)
$           Marks the end of a string (similar to “\Z”)
.           Matches any single character except for the newline
\           Quotes the next metacharacter
|           Boolean OR
()          Group elements
[abc]       Item in range (a, b or c)
[^abc]      NOT in range (every character except a, b or c)
a?          Zero or one of a. Equals a{0,1}
a*          Zero or more of a
a+          One or more of a
a{2}        Exactly two of a
a{0,5}      Up to five of a
a{5,}       Minimum five of a
a{5,10}     Between five and ten of a
\A          Start of subject
\b          Matches a word boundary
\B          Matches anything but a word boundary
\w          Any alphanumeric character plus underscore. Equals [A-Za-z0-9_]
\W          Any character that is not alphanumeric or underscore
\s          Any white-space character
\S          Any non white-space character
\r          The carriage return character
\n          The new line character
\d          Any digit. Equals [0-9]
\D          Any non digit. Equals [^0-9]
\z          Marks the end of a string
\Z          End of subject or newline at end
. \ + * ? [ ^ ] $ ( ) { } = ! < > | : These characters have special meaning for the regex engine and must be escaped if we want to use them as literal characters
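
The anchors in the table look alike but behave differently once newlines are involved. A minimal sketch with made-up subjects; each call reports 1 (match) or 0 (no match) :

```php
<?php
$lines = "first\nsecond";

// Without the m modifier, ^ and $ refer to the subject as a whole ...
$plain = preg_match('/^second$/', $lines);      // 0
// ... with m they also match right after and right before a line break ...
$multi = preg_match('/^second$/m', $lines);     // 1
// ... while \A and \z always mean the very start and the very end.
$strict = preg_match('/\Asecond\z/m', $lines);  // 0

// $ and \Z tolerate a single trailing newline, \z does not.
$dollar = preg_match('/^abc$/', "abc\n");       // 1
$upperZ = preg_match('/^abc\Z/', "abc\n");      // 1
$lowerZ = preg_match('/^abc\z/', "abc\n");      // 0

var_dump([$plain, $multi, $strict, $dollar, $upperZ, $lowerZ]);
```

For input validation this difference matters: a pattern that ends in $ accepts a value with a trailing newline, one that ends in \z does not.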

Pattern Modifiers (flags) : 

Modifier    Description
i           Ignore case
m           Multiline mode (^ and $ also match at line breaks)
S           Extra analysis of pattern
u           Pattern and subject are treated as UTF-8
U           Invert greediness (see above section for description)
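
The “u” modifier deserves a quick demonstration, because without it the engine works on bytes, not on characters. The sketch below uses escape sequences for the Greek letters so that it does not depend on the encoding of the source file (the variable names are mine) :

```php
<?php
// "\u{0392}" is the Greek capital letter Beta, which takes two bytes in UTF-8.
$beta = "\u{0392}";

$asBytes = preg_match('/^.$/', $beta);   // 0: without u the dot sees two separate bytes
$asChars = preg_match('/^.$/u', $beta);  // 1: with u the dot matches one whole character

// The same effect on a whole word: four Greek capital letters, eight bytes.
$word = "\u{0392}\u{0391}\u{039D}\u{039A}";
$byteCount = preg_match_all('/./', $word, $m1);   // 8
$charCount = preg_match_all('/./u', $word, $m2);  // 4

var_dump([$asBytes, $asChars, $byteCount, $charCount]);
```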

Delimiters :

Alternative delimiters
  • /
  • @
  • #
  • `
  • ~
  • %
  • &

Any character that is not alphanumeric, a backslash or white-space can be used as a delimiter; the ones listed above are the most used (in my experience).
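
Choosing a different delimiter is mostly a matter of readability. In the following sketch (the URL is a placeholder) both patterns do exactly the same thing, but the second one needs no escaped slashes :

```php
<?php
// Hypothetical URL, used only for illustration.
$url = 'http://example.com/blog/post';

// With / as the delimiter, every literal slash has to be escaped.
preg_match('/^https?:\/\/([^\/]+)\//', $url, $withSlashes);

// With # as the delimiter, the same pattern reads much more naturally.
preg_match('#^https?://([^/]+)/#', $url, $withHashes);

var_dump($withSlashes[1]); // string 'example.com'
var_dump($withHashes[1]);  // string 'example.com'
```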

Final thoughts :

Usually, if you are trying to match a known, simple value with no real logic behind the searching, you should try to use the standard string-matching functions to keep things simple. On the other side of the coin, if you have special rules that accompany your search or a pattern that has to cover a broad range of matching, use regular expressions.
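
As a rough illustration of that rule of thumb (the input line is invented, loosely modelled on the stock data at the top of this article) :

```php
<?php
// Hypothetical input line for this illustration.
$line = 'ALPHA BANK 0,6320 -7,74%';

// A known, fixed value: a plain string function is enough.
$mentionsBank = strpos($line, 'BANK') !== false;

// A value described by rules (optional minus sign, digits, a comma,
// digits, a percent sign): this is where a pattern pays off.
$hasPercentage = preg_match('/-?\d+,\d+%/', $line, $m) === 1;

var_dump($mentionsBank);  // boolean true
var_dump($hasPercentage); // boolean true
var_dump($m[0]);          // string '-7,74%'
```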
