Five Regular-Expression Concepts a PHP Developer Should Know

Posted by Tournas Dimitrios in PHP, July 14, 2014.

A regular expression, also called regex or regexp, is a codified method of searching a string (or text) for a matching pattern. The concept was introduced by the American mathematician Stephen Cole Kleene (January 5, 1909 – January 25, 1994). Regexes are used for advanced context-sensitive searches and text modifications. Although regexes are widespread in the Unix world, there is no such thing as ‘the standard regular expression language’; there are several different dialects. PHP supports two types of regular expressions: POSIX-extended and Perl-Compatible Regular Expressions (PCRE). The former is deprecated as of PHP 5.3, and using any of its functions generates a notice of level E_DEPRECATED (the POSIX functions were later removed entirely in PHP 7.0). It is therefore highly encouraged to use the latter (the PCRE functions), which support UTF-8 encoded text through the u pattern modifier and are more powerful and faster than the POSIX ones.
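
As a small illustration of the UTF-8 support mentioned above, here is a minimal sketch (not taken from the original article) that counts what the dot metacharacter matches with and without the u pattern modifier. The sample text is an assumption: the Greek letters are written as explicit UTF-8 byte escapes so that the result does not depend on the encoding of the source file.

```php
<?php
// Sample text "ALPHA " followed by four Greek capital letters (Beta, Alpha, Nu, Kappa),
// each encoded as two bytes in UTF-8. The text itself is only an illustration.
$text = "ALPHA \xCE\x92\xCE\x91\xCE\x9D\xCE\x9A";

// Without the u modifier, "." matches one byte at a time.
preg_match_all('/./', $text, $bytes);

// With the u modifier, pattern and subject are treated as UTF-8,
// so "." matches one whole character at a time.
preg_match_all('/./u', $text, $chars);

echo count($bytes[0]), "\n"; // 14 (6 ASCII bytes + 4 letters * 2 bytes)
echo count($chars[0]), "\n"; // 10 (6 ASCII characters + 4 Greek letters)
```

Whenever a pattern has to deal with non-ASCII text, adding the u modifier is usually the safer choice, provided the subject string really is valid UTF-8.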

Excerpt taken from a StackOverflow.com discussion: POSIX (Portable Operating System Interface for Unix) is simply a set of standards that define how to develop programs for UNIX (and its variants). Being POSIX-compliant for an OS means that it supports those standards (e.g., APIs), and thus can either natively run UNIX programs, or at least porting an application from UNIX to the target OS is easy/easier than if it did not support POSIX. Of course, the level of compliance is not necessarily 100% and can vary (e.g., not all features are supported or may be implemented differently).

Who is this article for? Certainly not for beginners in the world of “manipulating text programmatically”, nor for those who don’t like to experiment. The reader should have a basic knowledge of regex syntax (that is, be comfortable reading and writing regex patterns). We will “talk” about greediness, internal option settings, lookbehinds, lookaheads, non-capturing sequences and backreferences. Each of these concepts will be explained briefly and accompanied by a practical example. This article aims to spark your interest in writing more advanced regex patterns.

For those who need a refresher, the end of this article (section “A quick overview of all PCRE functions”) has practical examples for all PCRE functions, along with tables of the most commonly used metacharacters and pattern modifiers (learn these “tools” by heart if you plan to master PCRE). Of course, the official documentation is “the way to go” for an in-depth review of the subject.

Whatever can be done with regular expressions can also be achieved with native PHP functions (after all, PHP has close to one hundred functions for text manipulation), with one crucial difference: regular expressions do it more efficiently. Not in terms of runtime performance, but in terms of the amount of code you have to write. The following example illustrates the point:


$contents = array(
    ' ALPHA ΒΑΝΚ ' => ' 0,6320 -0,0530 -7,74% 0,6260 0,6910 46.264.300 30.254.097 -',
    //    ......  truncated  .............  ///
);

// Using PHP's string manipulation functions
$final_contents = array();
foreach ($contents as $bank_name => $stock_values) {
    $stock_values = str_replace('%', '', $stock_values);
    // first strip the trailing '-', then strip whitespace from both ends of the string
    $cleaned_stock_values = trim(rtrim($stock_values, '-'));
    // the text is sliced and stored into an array
    $values = explode(' ', $cleaned_stock_values);
    $final_contents[$bank_name] = $values;
}

// Using PCRE functions
$final_contents = array();
foreach ($contents as $bank_name => $stock_values) {
    // split on a trailing '-', on whitespace and on '%'; a limit of -1 means "no limit"
    $final_contents[$bank_name] = preg_split('/(-$)|[\s]|[%]/', $stock_values, -1, PREG_SPLIT_NO_EMPTY);
}

// var_dump($final_contents) gives the same result for both approaches:
array (size=1)
  ' ALPHA ΒΑΝΚ ' =>
    array (size=7)
      0 => string '0,6320' (length=6)
      1 => string '-0,0530' (length=7)
      2 => string '-7,74' (length=5)
      3 => string '0,6260' (length=6)
      4 => string '0,6910' (length=6)
      5 => string '46.264.300' (length=10)
      6 => string '30.254.097' (length=10)


Both approaches in the above code snippet achieve the same goal, but the regular-expression solution is evidently more compact (and, I would say, more readable). For the curious: the example uses an associative array whose values are strings of stock-market figures that have to be split into separate items. First, each string has to be cleaned of unwanted characters (i.e. %, - and whitespace), and then it has to be transformed into an array. The result of this transformation can be used by Google Charts and other custom code to derive statistics and create attractive, interactive charts.
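
Before such values can be fed to a charting library, the strings usually have to become numbers. The following is only a sketch under stated assumptions: the helper to_number() is hypothetical (it does not appear in the original code), it assumes that '.' is a thousands separator and ',' the decimal separator, as the sample row suggests, and it uses scalar type declarations, which require PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of the original article: turns a value such as
// '46.264.300' or '-7,74' into a float. Assumes '.' separates thousands and
// ',' is the decimal separator.
function to_number(string $value): float
{
    // Replacements run in order: first drop every '.', then turn ',' into '.'.
    $normalized = str_replace(array('.', ','), array('', '.'), $value);
    return (float) $normalized;
}

// The sample row from the example above.
$row = ' 0,6320 -0,0530 -7,74% 0,6260 0,6910 46.264.300 30.254.097 -';

// Same pattern as in the PCRE approach: split on a trailing '-', whitespace and '%'.
$parts = preg_split('/(-$)|[\s]|[%]/', $row, -1, PREG_SPLIT_NO_EMPTY);

// Convert every extracted string into a float.
$numbers = array_map('to_number', $parts);

var_dump($numbers);
```

If the real data ever mixes number formats, a locale-aware parser would probably be a better fit than this simple string replacement.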