regexp – Marco Gonçalves

The problem was to find and replace text inside HTML without breaking the markup. Take this example string:

<img title="My image" alt="My image" src="/gfx/this is my image.gif">This is my string

Suppose you want to replace the string “my” with another string, or enclose it in another tag (let’s assume <strong>), but only where “my” appears outside the HTML tags. After the transformation it would look like:

<img title="My image" alt="My image" src="/gfx/this is my image.gif">This is <strong>my</strong> string

With PHP’s regular expression functions, the typical solution, a find and replace with word boundaries, fails here:

preg_replace('/\b(my)\b/i',
             '<strong>$1</strong>',
             $html_string);

You will end up with messed-up HTML, because the words inside the attributes get wrapped too:

<img title="<strong>My</strong> image" alt="<strong>My</strong> image" src="/gfx/this is <strong>my</strong> image.gif">This is <strong>my</strong> string

Now think of the wonderful mess you would get when replacing words like “form” or “alt”, which can appear in a text node, as an HTML tag name, or as an attribute…

So how to fix this? I figured that the only thing common to all tags is the opening and closing characters, < and >. So you search for the word you want to replace together with everything up to the next closing character (the > sign), and within that match you look for an opening character (<). If you find one, the word sits in a text node and can be replaced. If you don’t, the word is inside a tag, so you abort the replacement. Here is the code:

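// $matches[0]: the word plus everything up to the next '>'
// $matches[1]: the word itself, $matches[2]: the text after it
function checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
        // No '<' before the next '>': we are inside a tag, so change nothing.
        return $matches[0];
    } else {
        // A tag opens after the word: the word is in a text node, so wrap it.
        return '<strong>'.$matches[1].'</strong>'.$matches[2];
    }
}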

preg_replace_callback('/(\bmy\b)(.*?>)/i',
                      'checkOpenTag',
                      $html_string);

If you are going to use this kind of code to search for several words in an HTML text (e.g. a glossary implementation), test for performance and consider a caching system.
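For the several-words case, one option is to fold all the words into a single alternation so the HTML is scanned once rather than once per word. The following is only a minimal sketch under assumptions: the names buildGlossaryPattern and highlightGlossary are hypothetical, it assumes PHP 7 or later, it adds the s modifier so a tag may span several lines, and it re-runs the replacement on the remainder of each match so that every occurrence in a text segment is handled.

```php
<?php
// Hypothetical helper (not from the original post): one pattern for all words.
function buildGlossaryPattern(array $words): string
{
    // Escape regex metacharacters in each word before joining them.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    return '/\b(' . implode('|', $quoted) . ')\b(.*?>)/is';
}

// Hypothetical helper: wrap every listed word that is outside a tag.
function highlightGlossary(string $html, array $words): string
{
    $pattern = buildGlossaryPattern($words);
    $callback = function (array $m) use (&$callback, $pattern) {
        if (strpos($m[0], '<') === false) {
            // No '<' before the next '>': inside a tag, leave untouched.
            return $m[0];
        }
        // Wrap the word, then process the rest of the match recursively.
        return '<strong>' . $m[1] . '</strong>'
             . preg_replace_callback($pattern, $callback, $m[2]);
    };
    return preg_replace_callback($pattern, $callback, $html);
}

echo highlightGlossary('<form alt="form">The form has an alt text</form>',
                       ['form', 'alt']);
// <form alt="form">The <strong>form</strong> has an <strong>alt</strong> text</form>
```

Whether this is actually faster than one pass per word depends on the input, so it is still worth measuring, and the built pattern or the finished HTML are the obvious candidates for caching.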

That’s it. Remember: although this solution worked fine for me, it may work terribly for you, so proceed at your own risk (aka liability disclaimer).

UPDATE 19-04-14
A comment on this post warns that only the first occurrence is replaced in each HTML segment. So here is an updated version of the PHP example with this issue corrected:

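<?php

class replaceIfNotInsideTags {

  // Word currently being replaced, set by replace().
  private $word;

  private function checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
      // No '<' before the next '>': the word is inside a tag, leave it alone.
      return $matches[0];
    } else {
      // Wrap the word, then process the rest of the segment for more occurrences.
      return '<strong>'.$matches[1].'</strong>'.$this->doReplace($matches[2]);
    }
  }

  private function doReplace($html) {
    // preg_quote() escapes any regex metacharacters in the word.
    return preg_replace_callback('/(\b'.preg_quote($this->word, '/').'\b)(.*?>)/i',
                                 array($this, 'checkOpenTag'),
                                 $html);
  }

  public function replace($html, $word) {
    $this->word = $word;

    return $this->doReplace($html);
  }
}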

$html = '<p>my bird is my life is my dream</p>';

$obj = new replaceIfNotInsideTags();
echo $obj->replace($html, 'my');

?>
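The same "is there a > before the next <" test can also be written as a negative lookahead, which is a different technique from the callback used in this post. The sketch below assumes PHP 7 or later and uses a hypothetical function name, wrapOutsideTags. Because the lookahead does not consume the text after the word, a plain preg_replace handles every occurrence, including text after the last tag that has no > following it.

```php
<?php
// Hypothetical helper, not from the original post.
function wrapOutsideTags(string $html, string $word): string
{
    // (?![^<>]*>) rejects a match when a '>' comes before any '<',
    // which is the case when the word sits inside a tag.
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b(?![^<>]*>)/i';
    return preg_replace($pattern, '<strong>$1</strong>', $html);
}

$html = '<img title="My image" alt="My image" src="/gfx/this is my image.gif">This is my string';
echo wrapOutsideTags($html, 'my');
// <img title="My image" alt="My image" src="/gfx/this is my image.gif">This is <strong>my</strong> string
```

Like the callback version, this relies on < and > appearing only as tag delimiters, so an attribute value containing a literal > would confuse it.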
