Saturday 9 February 2013

Regular Expression Length Validation

There's no need to use a regular expression as length validation if length validation is the only thing you are doing. For that, you would use strlen(). However, you will likely find yourself faced with length and content validation at the same time. A primary example of this would be username validation. Let's set out some really basic rules for a username:
  • Must be between 3 and 20 characters in length.
  • Must start with a letter and must end with a letter or number.
  • Can only contain one period or hyphen.
  • Other than a period or hyphen, it must only contain alphanumeric characters.
These conditions could be achieved by using strlen to check the length and then a regular expression to check the other conditions.

Strlen and preg_match
if(strlen($username) >= 3 && strlen($username) <= 20)
   {
        if(preg_match("/^[a-z][a-z0-9]*[.-]?[a-z0-9]*$/i", $username))
            {
                 //continue
            }
   }

However, since you are already using a regular expression, you may as well add length validation to the pattern.

To do this, you need to use a lookahead at the beginning of the pattern. As we have a specific first character requirement we will place the lookahead after the first character class. This is an optimisation. If we did the lookahead straight away and then checked the first character to find it did not match we would have wasted some time processing a lookahead. If we had checked the first character straight away the pattern would have not matched and the lookahead would not have been processed. A small optimisation but one nonetheless.

To match all the username conditions above, including the length, with one regular pattern we would do the following.

Just preg_match
if(preg_match("/^[a-z](?=[a-z0-9.-]{2,19}$)[a-z0-9]*[.-]?[a-z0-9]*$/i", $username))
    {
         //continue
    }

Lookahead's are never captured but if they do not match the regular expression will stop. The lookahead here is checking to see if the allowed characters are repeated between 2 and 19 times (we've already matched the first character) and then the string ends. If the string is not repeated at least 2 times or is repeated more than 19 times the lookahead will fail and therefore the match will fail. And that is length validation.

A breakdown of the expression:
/
^                      #start of string
[a-z]                  #match a single letter
(?=                    #lookahead for:
    [a-z0-9.-]{2,19}       #match the allowed characters between 2 and 19 times
    $                      #end of string
)
[a-z0-9]*              #match a letter or number between 0 and infinity times
[.-]?                  #match a dot or a dash between 0 and 1 times
[a-z0-9]*              #match a letter or number between 0 and infinity times
$                      #end of string
/i                     #turn case insensitivity on

It's as simple as that.

No comments:

Post a Comment