
Introduction to Regular Expressions (aka regex) (Part 1)

A regular expression is a pattern that describes a set of strings; it is a type of shorthand for a search pattern. It can be used to find text that matches a pattern within a larger text, to replace the matching text, or to split the matching text into groups. The power of regular expressions for extracting specific text from documents lies in their ability to replace many lines of code with as little as one line.
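As a small illustration of finding, replacing and splitting, here is a minimal sketch using Ruby's built-in String methods scan, gsub and split. The sample strings and variable names are made up for this example.

```ruby
# Sample text invented for this illustration.
text = "cat, bat and rat"

# Find: collect every three-letter word made of c, b or r followed by "at".
found = text.scan(/[cbr]at/)             # => ["cat", "bat", "rat"]

# Replace: swap each match for another word.
replaced = text.gsub(/[cbr]at/, "hat")   # => "hat, hat and hat"

# Split: break a string apart wherever the pattern matches.
pieces = "2024-01-15".split(/-/)         # => ["2024", "01", "15"]
```

Each call takes one line, where a hand-written loop over the characters would take several.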

Some terms used in regular expressions:

  1. Literal - A literal is any character we use in a search or matching expression. For example, to find "ind" in "windows", the "ind" is a literal string. Each character plays a part in the search; it is literally the string we want to find.

  2. Metacharacter - A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, the character ^ (caret) is a metacharacter.

  3. Target string - This term describes the string that we will be searching, i.e. the string in which we want to find our match or search pattern.

  4. Search expression - Most commonly called the regular expression. This term describes the search expression that we will be using to search our target string, that is, the pattern we use to find what we want.

  5. Escape sequence - An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression an escape sequence involves placing \ (backslash) in front of the metacharacter that we want to use as a literal.
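To make the difference between a metacharacter and its escaped, literal form concrete, here is a minimal sketch. The sample strings and variable names are chosen only for illustration.

```ruby
# Unescaped, the dot is a metacharacter and matches any character except newline.
any_char = "a.b".scan(/./)          # => ["a", ".", "b"]

# Escaped with a backslash, it matches only a literal dot.
literal_dot = "a.b".scan(/\./)      # => ["."]

# The caret is escaped the same way when a literal "^" is wanted.
literal_caret = "2^10".scan(/\^/)   # => ["^"]
```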

Note: Ruby's scan method is used in the examples below just to show how each regular expression works. The scan method returns an array of the matched data.

Regular expression basics:

Ranges and Negation

  • [] matches any single character listed inside the brackets. It is also called a character class.

e.g

"ruby".scan(/[r]/) will return ["r"].

"Rubyr".scan(/[r]/i) will return ["R", "r"] because the 'i' modifier makes the search case-insensitive.

  • ^ (caret) as the first character inside square brackets negates the expression. It matches any single character except those written inside the square brackets.

e.g

"ruby".scan(/[^r]/) will return ["u", "b", "y"]

"Ruby".scan(/[^r]/i) will return ["u", "b", "y"]

  • - (dash) inside square brackets works as a 'range separator'. It creates a range when it is placed between 2 characters.

e.g [0123456789] can be written as [0-9]
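A short sketch of ranges in use, including a range combined with negation. The sample string "room 42b" and the variable names are invented for this example.

```ruby
sample = "room 42b"

# [0-9] matches any single digit.
digits = sample.scan(/[0-9]/)          # => ["4", "2"]

# [a-z] matches any single lowercase letter.
letters = sample.scan(/[a-z]/)         # => ["r", "o", "o", "m", "b"]

# A leading ^ negates the range: any single character that is NOT a digit.
non_digits = "42b".scan(/[^0-9]/)      # => ["b"]
```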

Position

  • The ^ (caret) outside square brackets checks for the start of a line. In a string without line breaks, this is the start of the string.

e.g

"Ruby".scan(/^[r]/i) or "Ruby".scan(/^r/i) will return ["R"]

"Programming".scan(/^[r]/i) will return [] as "r" is not present at the start of the string.

  • \A checks for start of string.

e.g "ruby".scan(/\Ar/) will return ["r"]

  • \Z checks for end of string. If the string ends with a line break, \Z also matches just before that final line break.

e.g "ruby".scan(/y\Z/) will return ["y"]

  • $ checks for end of line. In a string without line breaks, this is the end of the string.

e.g "interaction".scan(/n$/) will return ["n"] as 'n' is at the end of the string.

Difference between ^ and \A:

^ matches at the start of each line: at the very beginning of the string and also right after a line break character inside the string. \A, by contrast, matches only at the start of the string.

e.g "ab\nac".scan(/^a/) will return ["a", "a"] whereas "ab\nac".scan(/\Aa/) will return ["a"]

Difference between $ and \Z:

$ matches at the end of each line: at the very end of the string and also just before each line break character. \Z, by contrast, matches only at the end of the string (or just before a final line break).

e.g "ab\ncb".scan(/b$/) will return ["b", "b"] whereas "ab\ncb".scan(/b\Z/) will return ["b"]
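The anchors behave slightly differently when the string ends with a line break. The sketch below compares them on a made-up string with a trailing "\n". It also uses \z (lowercase), which is not covered above and is included here only for contrast: in Ruby it matches at the absolute end of the string.

```ruby
# Sample string invented for this comparison; note the trailing line break.
text = "ab\ncb\n"

# $ matches before every line break, so both lines ending in "b" are found.
line_ends = text.scan(/b$/)      # => ["b", "b"]

# \Z matches at the end of the string, tolerating one final line break.
string_end = text.scan(/b\Z/)    # => ["b"]

# \z matches only at the very end, which here comes after the "\n".
strict_end = text.scan(/b\z/)    # => []

# ^ matches at the start of every line, \A only at the start of the string.
line_starts  = text.scan(/^[ac]/)    # => ["a", "c"]
string_start = text.scan(/\A[ac]/)   # => ["a"]
```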

  • The . (dot) matches any single character except newline.

e.g

"ruby\nruby".scan(/./) will return ["r", "u", "b", "y", "r", "u", "b", "y"]

"ruby\nruby".scan(/./m) uses the 'm' modifier, so the dot will match the newline as well and it will return ["r", "u", "b", "y", "\n", "r", "u", "b", "y"]

Iteration metacharacters:

  • The ? (question mark) matches when the preceding character occurs 0 or 1 times only.

e.g /ruby?is/ will match "rubyis" as well as "rubis".

  • The * (asterisk or star) matches when the preceding character occurs 0 or more times.

e.g /ruby*is/ will match "rubyis", "rubyyyis" and "rubis".

  • The + (plus) matches when the preceding character occurs 1 or more times.

e.g /ruby+is/ will match "rubyis" and "rubyyis" but it will not match "rubis".

  • {n} matches when the preceding character or character range occurs exactly n times.

e.g /ruby{2}is/ will match "rubyyis" as it checks for exactly 2 occurrences of the character 'y'.

  • {n,m} matches when the preceding character occurs at least n times but not more than m times.

e.g /ruby{1,2}is/ will match "rubyis" and "rubyyis" but will not match "rubis" or "rubyyyis" as it checks that "y" occurs between 1 and 2 times.

  • {n,} matches when the preceding character occurs at least n times.

e.g /ruby{2,}is/ will match "rubyyis", "rubyyyis" and so on but it will not match "rubyis" or "rubis".
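The iteration metacharacters above can be compared side by side. The following is a minimal sketch that runs each pattern against the same four words and collects the ones that match. The names words, patterns and results are made up for this example, and it assumes Ruby 2.4 or later for String#match? and Hash#transform_values.

```ruby
# Test words with 0, 1, 2 and 3 occurrences of "y".
words = ["rubis", "rubyis", "rubyyis", "rubyyyis"]

# One pattern per iteration metacharacter, keyed by a short label.
patterns = {
  "y?"     => /ruby?is/,
  "y*"     => /ruby*is/,
  "y+"     => /ruby+is/,
  "y{2}"   => /ruby{2}is/,
  "y{1,2}" => /ruby{1,2}is/,
  "y{2,}"  => /ruby{2,}is/
}

# For each pattern, keep only the words it matches.
results = patterns.transform_values do |regex|
  words.select { |word| word.match?(regex) }
end

results.each { |label, matched| puts "#{label.ljust(7)} #{matched.inspect}" }
# y?      ["rubis", "rubyis"]
# y*      ["rubis", "rubyis", "rubyyis", "rubyyyis"]
# y+      ["rubyis", "rubyyis", "rubyyyis"]
# y{2}    ["rubyyis"]
# y{1,2}  ["rubyis", "rubyyis"]
# y{2,}   ["rubyyis", "rubyyyis"]
```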

Part 1 thus covers the very basic information needed to get started with regular expressions.

Part 2 will cover modifiers, shorthand characters and grouping.