Python Regular Expressions or regex Matching, Searching, Replacing

Python Regular expressions

Description:

Python Regular expressions, also called REs, or regexes, or regex patterns, provide a powerful way to search and manipulate strings. Python Regular Expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Python Regular Expressions use a sequence of characters and symbols to define a pattern of text. Such a pattern is used to locate a chunk of text in a string by matching up the pattern against the characters in the string. Python Regular Expressions are useful for finding phone numbers, email addresses, dates, and any other data that has a consistent format.



Python Regular Expressions Match Using Special Characters:

A regular expression pattern is made of simple characters, such as abc, or a combination of simple and special characters, such as ab*c. Simple patterns are constructed of characters for which you want to find a direct text match. For example, the pattern abc matches character combinations in strings only when the characters “abc” occur together and exactly in that order. Such a match would succeed in the strings “Hi, do you know your abc’s?” and “The latest airplane designs advanced from slabcraft.” In both cases, the match is with the substring “abc”. There is no match in the string “Grab crab” because, while it contains the substring “ab c”, it does not contain the exact substring “abc”.

Some characters are metacharacters, also called special characters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the regular expression by repeating them or changing their meaning. When the search requires something more than a direct match, such as finding one or more b’s or finding white space, the pattern includes special characters. For example, the pattern ab*c matches any character combination in which a single “a” is followed by zero or more ‘b’s (* means 0 or more occurrences of the preceding item) and then immediately followed by “c”. In the string “cbbabbbbcdebckkkkkaakkakakaaass,” the pattern matches the substring “abbbbc”. Below you will find a list and description of the special characters that can be used in Python regular expressions.
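The two patterns discussed above can be tried with a short sketch like the following. It uses re.search() from the re module, which is introduced later in this article; the variable names are only illustrative.

```python
import re

# Literal pattern: matches only where "abc" appears exactly, in that order.
literal_hit = re.search(r"abc", "Hi, do you know your abc's?")
literal_miss = re.search(r"abc", "Grab crab")  # "ab c" is not "abc"

# Pattern with a special character: 'a', zero or more 'b', then 'c'.
star_hit = re.search(r"ab*c", "cbbabbbbcdebckkkkkaakkakakaaass")

print(literal_hit.group())  # abc
print(literal_miss)         # None
print(star_hit.group())     # abbbbc
```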

Special Character → [xyz]

Description → Square brackets [] specify a character class, also called a “character set,” which is a set of characters that you want to match. Place the characters you want to match between the square brackets. This pattern type matches any one of the characters in the brackets, including escape sequences. Special characters like the dot (.) and asterisk (*) are not special inside a character set, so they do not need to be escaped. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a hyphen (-).

Example → The pattern [abc] will match any one of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. The pattern [akm$] will match any of the characters ‘a’, ‘k’, ‘m’, or ‘$’. The character ‘$’ is usually a special character, but inside a character class it is stripped of its special nature. The pattern [a-d], which accomplishes the same match as [abcd], matches the ‘b’ in “brisket” and the ‘c’ in “city”.
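A small sketch of these character classes, assuming the sample string "make $5" (chosen only for illustration):

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "cab")
ranged = re.findall(r"[a-c]", "cab")

# Inside a class, '$' is an ordinary character.
dollar_hits = re.findall(r"[akm$]", "make $5")
print(dollar_hits)  # ['m', 'a', 'k', '$']

# search() returns the first match only.
first_in_brisket = re.search(r"[a-d]", "brisket").group()  # 'b'
first_in_city = re.search(r"[a-d]", "city").group()        # 'c'
```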

Special Character → . (a period)

Description → Matches any single character except newline ‘\n’.

Example → The pattern .n matches the substrings ‘an’ and ‘on’ in the string “nay, an apple is on the tree”, but not ‘nay’.

Special Character → ^

Description → Matches the start of the string and, in multiline mode, also matches immediately after each newline.

Example → The pattern ^A does not match the character ‘A’ in the string “an A” but does match the character ‘A’ in the string “An E”.

The character ‘^’ has a different meaning inside a character set. If the first character of the set is ‘^’, the set is complemented: any character that is not listed in the set will be matched. The character ‘^’ has no special meaning if it is not the first character in the character set. You can still define a range of characters by using a hyphen; everything that works in a normal character set also works here. For example, the pattern [^abc] is the same as the pattern [^a-c]. Both initially match the character ‘r’ in the string “brisket” and the character ‘h’ in the string “chop.”
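The two roles of ‘^’ can be compared in a brief sketch. The multi-line sample text is an assumption added for illustration, and re.MULTILINE is the multiline mode mentioned in the description.

```python
import re

# '^' as an anchor: the match must begin at the start of the string.
anchored_hit = re.search(r"^A", "An E")
anchored_miss = re.search(r"^A", "an A")  # None: 'A' is not at the start

# In multiline mode, '^' also matches right after each newline.
line_starts = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)
print(line_starts)  # ['one', 'three']

# '^' as the first character of a set: match anything NOT listed.
first_other_brisket = re.search(r"[^abc]", "brisket").group()  # 'r'
first_other_chop = re.search(r"[^a-c]", "chop").group()        # 'h'
```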

Special Character → $

Description → Matches the end of the string or just before the newline at the end of the string.

Example → The pattern t$ does not match the character ‘t’ in the string “eater” but does match it in the string “eat”.

Special Character → *

Description → Matches the preceding expression 0 or more times.

Example → The pattern bo* matches the substring ‘boooo’ in the string “A ghost booooed” and matches the character ‘b’ in the string “A bird warbled” but nothing in the string “A goat grunted”.

Special Character → +

Description → Matches the preceding expression 1 or more times.

Example → The pattern a+ matches the character ‘a’ in the string “candy” and all the a’s in “caaaaaaandy”, but nothing in “cndy”.

Special Character → ?

Description → Matches the preceding expression 0 or 1 time.

Example → The pattern e?le? matches the substring ‘el’ in the string “angel”, the substring ‘le’ in the string “angle”, and the character ‘l’ in the string “oslo”.

If ? is used immediately after any of the quantifiers *, +, ?, or {}, it makes that quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible). For example, applying the pattern \d+ to the string “123abc” matches the substring “123”, but applying the pattern \d+? to that same string matches only the character “1”.
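The difference between greedy and non-greedy matching can be seen in a short sketch. The second pair of patterns, applied to a hypothetical snippet of markup, is an added illustration and not part of the original example.

```python
import re

greedy_digits = re.search(r"\d+", "123abc").group()   # '123'
lazy_digits = re.search(r"\d+?", "123abc").group()    # '1'

# Hypothetical markup sample: greedy runs to the last '>', lazy stops at the first.
markup = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", markup).group()   # '<b>bold</b>'
lazy_tag = re.search(r"<.+?>", markup).group()    # '<b>'
```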

Special Character → \d

Description → Matches any decimal digit [0-9]

Example → The pattern \d or [0-9] matches the character ‘2’ in the string “B2 is the suite number.”

Special Character → \D

Description → Matches any non-digit character. Equivalent to [^0-9].

Example → The pattern \D or [^0-9] matches the character ‘B’ in the string “B2 is the suite number.”

Special Character → \w

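Description → Matches a single “word” character: a letter, a digit, or an underscore. For ASCII text it is equivalent to [a-zA-Z0-9_]; with Python 3 string patterns, \w also matches letters and digits from other scripts unless the re.ASCII flag is used. Note that although “word” is the mnemonic for this, it only matches a single word character, not a whole word.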

Example → The pattern \w matches the character ‘a’ in the string “apple”, the character ‘5’ in the string “$5.28” and the character ‘3’ in the string “3D.”

Special Character → \W

Description → Matches any non-word character. Equivalent to [^A-Za-z0-9_].

Example → The pattern \W or [^A-Za-z0-9_] matches the character ‘%’ in the string “50%.”

Special Character → \s

Description → Matches a single whitespace character, including space, tab, newline, carriage return, form feed, and vertical tab. For ASCII text it is equivalent to [ \t\n\r\f\v].

Example → The pattern \s\w* matches the substring ‘ bar’ (including the leading space) in the string “foo bar.”

Special Character → \S

Description → Matches any non-whitespace character. For ASCII text it is equivalent to [^ \t\n\r\f\v].

Example → The pattern \S* matches the substring ‘foo’ in the string “foo bar.”

Special Character → \b

Description → Matches a word boundary.

There are three different positions that qualify as word boundaries when the special character \b is placed:

  • Before the first character in the string, if the first character in the string is a word character.
  • After the last character in the string, if the last character in the string is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

The special character \b allows you to search for a complete word using a regular expression of the form \bword\b; the pattern won’t match when the word is contained inside another word. Note that a word is defined as a sequence of word characters. The \b special character matches the empty string, but only at the beginning or end of a word.

Example → The pattern \bm matches the character ‘m’ in the string “moon”.

The pattern oo\b does not match the substring ‘oo’ in the string “moon”, because the substring ‘oo’ is followed by ‘n’, which is a word character. The pattern oon\b matches the substring ‘oon’ in the string “moon”, because ‘oon’ is the end of the string and thus not followed by a word character. The pattern \bfoo\b matches in the strings ‘foo’, ‘foo.’, ‘(foo)’, and ‘bar foo baz’ but not in ‘foobar’ or ‘foo3’. The pattern \w\b\w will never match anything, because a word boundary can never be both preceded and followed by a word character.
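The word-boundary examples above can be checked with a short sketch; the list name samples and the compiled pattern name whole_word are illustrative.

```python
import re

samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
whole_word = re.compile(r"\bfoo\b")

# Keep only the strings in which 'foo' appears as a complete word.
matched = [s for s in samples if whole_word.search(s)]
print(matched)  # ['foo', 'foo.', '(foo)', 'bar foo baz']

no_boundary = re.search(r"oo\b", "moon")          # None: 'oo' is followed by 'n'
at_end = re.search(r"oon\b", "moon").group()      # 'oon'
never = re.search(r"\w\b\w", "moon and stars")    # None for any input
```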

Special Character → \B

Description → Matches a non-word boundary. This matches the following cases when the special character \B is placed:

  • Before the first character of the string, if the first character is not a word character.
  • After the last character of the string, if the last character is not a word character.
  • Between two word characters
  • Between two non-word characters

The beginning and end of a string are considered non-word characters. The ‘\B’ special character matches an empty string, only when it is not at the beginning or end of the word.

Example → The pattern \B.. matches the substring ‘oo’ in “noonday”, and the pattern y\B. matches the substring ‘ye’ in the string “possibly yesterday.”

Special Character → \

Description → Matches according to the following rules: A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally. A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally.

Example → The pattern ‘b’ without a preceding ‘\’ generally matches lowercase ‘b’s wherever they occur. But a ‘\b’ by itself does not match any character; it forms the special word boundary character. The pattern a* relies on the special character ‘*’ to match 0 or more a’s. By contrast, the pattern a\* removes the specialness of the ‘*’ to enable matches with strings like ‘a*’.

Special Character → {m,n}

Description → Where m and n are non-negative integers and m <= n. Matches at least m and at most n occurrences of the preceding expression. If n is omitted, i.e. {m,}, then it matches at least m occurrences of the preceding expression. Write the quantifier without a space after the comma: in Python, a{1, 3} is treated as literal text rather than as a repetition.

Example → The pattern a{1,3} matches nothing in the string “cndy”, but matches the character ‘a’ in the string “candy”. The pattern a{1,3} matches the first two a’s in the string “caandy,” and the first three a’s in the string “caaaaaaandy”. Notice that when matching “caaaaaaandy”, the match is “aaa”, even though the original string had more a’s in it. The pattern a{2,} will match substrings “aa”, “aaa”, “aaaa”, “aaaaa”, “aaaaaa”, “aaaaaaa” but not “a”.
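The bounded quantifiers can be verified with a brief sketch that reuses the strings from the example above:

```python
import re

none_found = re.search(r"a{1,3}", "cndy")                  # None
one_a = re.search(r"a{1,3}", "candy").group()              # 'a'
two_a = re.search(r"a{1,3}", "caandy").group()             # 'aa'
capped = re.search(r"a{1,3}", "caaaaaaandy").group()       # 'aaa' (at most three)

# With an open upper bound, the greedy match takes every consecutive 'a'.
long_word = "caaaaaaandy"
at_least_two = re.search(r"a{2,}", long_word).group()
too_few = re.search(r"a{2,}", "candy")                     # None: only one 'a'
```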

Special Character → {m}

Description → Matches exactly m occurrences of the preceding expression. Here m must be a positive integer.

Example → The pattern a{2} doesn’t match the character ‘a’ in the string “candy,” but it does match all of the a’s in the string “caandy,” and the first two a’s in the string “caaandy.”

Special Character → A|B

Description → A|B matches ‘A’, or ‘B’ (if there is no match for ‘A’), where A and B are regular expressions.

Example → The pattern green|red matches the substring ‘green’ in the string “green apple” and matches the substring ‘red’ in the string “red apple.” The order of ‘A’ and ‘B’ matters. For example, the pattern a*|b matches the empty string in the string “b”, but the pattern b|a* matches the character “b” in the same string.
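A minimal sketch of alternation and of why the order of the alternatives matters:

```python
import re

colour_green = re.search(r"green|red", "green apple").group()  # 'green'
colour_red = re.search(r"green|red", "red apple").group()      # 'red'

# a* succeeds immediately with an empty match, so 'b' is never tried.
empty_first = re.search(r"a*|b", "b").group()   # ''
# With 'b' listed first, the character itself is matched.
b_first = re.search(r"b|a*", "b").group()       # 'b'
```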


Using r Prefix for Python Regular Expressions

Consider the regular expression r'^$'. This regular expression matches an empty line. The ‘^’ indicates the start of a line, and the ‘$’ indicates the end of a line. Having nothing between the special characters ‘^’ and ‘$’ therefore matches an empty line. The ‘r’ prefix tells Python that the expression is a raw string, which is handy for regular expressions. In a raw string, escape sequences are not parsed. For example, '\n' is a single newline character, whereas r'\n' is two characters: a backslash and an ‘n’. Using an expression like r'[\w]' instead of '[\\w]' results in easier-to-read expressions.
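The effect of the r prefix can be checked directly. The sketch below also shows one case where leaving it out changes the pattern: in an ordinary string, '\b' is the backspace character, not the word-boundary escape. The sample string "a cat" is an assumption for illustration.

```python
import re

newline_len = len('\n')    # 1: a single newline character
raw_len = len(r'\n')       # 2: a backslash and an 'n'
same_pattern = r'[\w]' == '[\\w]'   # True: two spellings of one pattern

# Without r, '\b' becomes a backspace character and the search fails.
plain = re.search('\bcat\b', 'a cat')
raw = re.search(r'\bcat\b', 'a cat')
```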

Match Using Parentheses in Python Regular Expressions:

Special Character → (….)

Description → Matches whatever regular expression pattern is inside the parentheses and causes that part of the matched substring to be remembered. Once remembered, the substring can be recalled for other use. Parts of a regular expression pattern bounded by parentheses are called groups, and they contain the matched substring. The parentheses are also called capturing parentheses or a capturing group. Parentheses indicate the start ‘(‘ and end ‘)’ of a group. The number of groups created equals the number of pairs of parentheses in the regular expression: a single pair of parentheses gives one group in your match, two pairs give two groups, and so on.

If you use a repetition operator (+ or *) on a capturing group, the group gets “overwritten” each time it is repeated, meaning that only the last repetition is captured. The contents of a group can be retrieved after a match has been performed. group(0) is always the entire match, and the capturing groups are numbered from 1: group(1), group(2), and so on. To match the literals ‘(‘ or ‘)’, use \( or \), or enclose them in a character class: [(], [)].

Parentheses not only group substrings but also create backreferences. A backreference in a regular expression identifies a previously matched and remembered group and matches the same text again, i.e., a backreference matches a substring already found in a group. You simply add a backslash character and the number of the group to match again. For example, to match again the content matched by the first group, include \1 in your regular expression pattern. Always write patterns that contain backreferences as raw strings; otherwise Python interprets '\1' as a character escape before the re module sees it.

Example → The pattern Chapter (\d+)\.\d* illustrates additional escaped and special characters and indicates that part of the pattern should be remembered. It matches precisely the characters ‘Chapter’ followed by a space, followed by one or more numeric characters (\d means any numeric character and + means 1 or more times), followed by a decimal point (the dot is itself a special character; preceding it with \ means the pattern must look for the literal character ‘.’), followed by any numeric character 0 or more times (\d means numeric character, * means 0 or more times). In addition, parentheses are used to remember the first matched numeric characters.

To match a substring without causing the matched part to be remembered, within the parentheses preface the pattern with ?:. For example, (?:\d+) matches one or more numeric characters but does not remember the matched characters.
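A short sketch of capturing groups, a backreference, and a non-capturing group. All sample strings are assumptions chosen for illustration.

```python
import re

# Capturing group: remember the chapter number.
m = re.search(r"Chapter (\d+)\.\d*", "Open Chapter 4.3, paragraph 6")
whole_match = m.group(0)      # 'Chapter 4.3'
chapter_number = m.group(1)   # '4'

# Backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "Paris in the the spring").group()  # 'the the'

# Non-capturing group: (?:...) groups without remembering.
only_second = re.search(r"(?:\d+)-(\d+)", "10-20").groups()  # ('20',)

# A repeated capturing group keeps only its last repetition.
last_digit = re.match(r"(\d)+", "123").group(1)  # '3'
```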



Regular Expression Methods

Now that we have looked at some simple python Regular Expressions, how do we actually use them in Python? In Python, methods to use and apply regular expressions can be accessed by importing the re module. The re module provides an interface to the Python regular expression engine.

Compiling Python Regular Expressions Using compile() Method of re Module

Python Regular Expressions can be compiled into a pattern object, which has methods for various operations such as searching for pattern matches, finding all pattern matches or performing string substitutions. When you have to use the same regular expression again and again on different strings, then it is an excellent idea to construct a regular expression as a Python object. This can be accomplished through the use of the re.compile() method.

re.compile(pattern[,flags])

where pattern is the regular expression and the optional flags argument is used to enable various special features and syntax variations. For example, the flag re.A enables ASCII-only matching, re.I enables case-insensitive matching (so expressions like [A-Z] will also match lowercase letters), and re.M enables “multi-line matching.” When the re.M flag is enabled, the meaning of ‘^’ and ‘$’ changes. The special character ‘^’ matches at the beginning of the string and also at the beginning of each line (immediately following each newline), and the special character ‘$’ matches at the end of the string and also at the end of each line (immediately preceding each newline). By default, the special character ‘^’ matches only at the beginning of the string, and the special character ‘$’ matches only at the end of the string and immediately before the newline (if any) at the end of the string.

The compile() method returns a regular expression as a Python object, which can be used for matching patterns by using its match(), search(), sub(), findall() and other methods. The main difference between search() and match() is that search() scans the entire string and returns a match object for the first location where the pattern matches, while match() returns a match object only if zero or more characters at the beginning of the string match the pattern.
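As a rough illustration of compile() with flags and of the difference between match() and search(), consider the sketch below. The sample text and variable names are assumptions.

```python
import re

text = "Python is fun\nI like python\npython rocks"

# re.I ignores case; re.M lets '^' match at the start of every line.
line_start = re.compile(r"^python", re.I | re.M)
starts = line_start.findall(text)
print(starts)  # ['Python', 'python']

word = re.compile(r"fun")
at_beginning = word.match(text)          # None: 'fun' is not at the start
anywhere = word.search(text).group()     # 'fun'
```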


Regular Expression Match Objects:

The match() and search() methods supported by a compiled regular expression object return None if no match is found. If they are successful, a match object instance is returned, containing information about the match, such as the substring it has matched and where the match starts and ends. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement. In order to build and use regular expressions, perform the following steps:

Step 1: Import re regular expression module.

Step 2: Compile regular expression pattern using re.compile() method. This method returns the regular expression pattern as an object.

Step 3: Invoke an appropriate method supported by the compiled regular expression object which returns a matched object instance containing information about matched strings.

Step 4: Call methods (group() method is appropriate for most cases) associated with the matched object to display the results.

For Example,
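A minimal sketch of the four steps, assuming the hypothetical sample strings "egg is nutritious food" and "leg day":

```python
import re                                    # Step 1: import the module

pattern = re.compile(r"(e)g")                # Step 2: compile the pattern

# Step 3: match() succeeds only if the pattern is found at the start.
match_object = pattern.match("egg is nutritious food")

# Step 4: group() with no argument (or 0) returns the entire match.
print(match_object.group())    # eg
print(match_object.group(0))   # eg

# 'eg' occurs here, but not at the beginning, so match() returns None.
no_match = pattern.match("leg day")
print(no_match)                # None
```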

Import the re module. Compile the regular expression pattern ‘(e)g’, which matches the characters eg; because the match() method is used, they must be found at the beginning of the string. Pass the string from which you want to extract the regular expression pattern as an argument to the match() method. The result is a match object, assigned here to match_object, that holds the matched string. To obtain the strings that were matched, use the group() method associated with match_object. Groups are always numbered starting with 0. Group 0 is always present and represents the entire match of the regular expression itself, so the group() method of a match object has 0 as its default argument. If the string contains eg but not at its beginning, match() returns None, and calling group() on that result raises an AttributeError. The ‘r’ at the start of the pattern string designates a Python “raw” string. It is highly recommended that you make a habit of writing pattern strings with an ‘r’ prefix.
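The position methods of a match object, span(), start() and end(), can be tried with a sketch like this one. The pattern (ab)* and the sample string "ababab" are used for illustration.

```python
import re

pattern = re.compile(r"(ab)*")
match_object = pattern.match("ababab")

whole = match_object.group()       # 'ababab'
position = match_object.span()     # (0, 6): start and end index as a tuple
start_index = match_object.start() # 0
end_index = match_object.end()     # 6
```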

The regular expression pattern (ab)* matches zero or more repetitions of ab. Pass the string from which you want to extract the regular expression pattern as an argument to the match() method. A match object also records the starting and ending index of the matched substring, which can be retrieved as a tuple using the span() method. The starting position alone can be obtained with the start() method, and the ending position with the end() method. Since the match() method only checks whether the regular expression matches at the start of a string, start() will always return zero for a successful match().
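Nested groups and the group() and groups() methods can be explored with the pattern (a(b)c)d in a sketch such as the following:

```python
import re

pattern = re.compile(r"(a(b)c)d")
match_object = pattern.match("abcd")

entire = match_object.group(0)       # 'abcd'
outer = match_object.group(1)        # 'abc'
inner = match_object.group(2)        # 'b'
several = match_object.group(1, 2)   # ('abc', 'b'): multiple numbers give a tuple
all_groups = match_object.groups()   # ('abc', 'b')
```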


The regular expression pattern ‘(a(b)c)d’ matches the string ‘abcd’. Pass the string from which you want to extract the regular expression pattern as an argument to the match() method. By passing an integer argument greater than zero to the group() method, you can extract part of the matched expression instead of the entire expression. The group() method with 0 as its argument returns the entire matched text, while the group() method with an argument greater than zero returns only a part of the matched text. Each of these numbers corresponds to a specific group. Groups are numbered from left to right, starting from number 1. For example, group(0) returns the entire matched string ‘abcd’, while group(1) returns ‘abc’ and group(2) returns ‘b’. To determine the number of a group, count the opening parentheses from left to right. The group() method can also be passed multiple group numbers at a time, in which case it returns a tuple containing the corresponding values for those groups. The groups() method returns a tuple (‘abc’, ‘b’) containing all the subgroups of the match.
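The findall() method can be sketched as follows; the sentence being searched is a hypothetical sample.

```python
import re

pattern = re.compile(r"\d+")
numbers = pattern.findall("Salary is 100000 and bonus is 5000")
print(numbers)  # ['100000', '5000']
```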

The regular expression pattern ‘\d+’ matches one or more digits of a number. Pass the string from which you want to extract the regular expression pattern as an argument to the findall() method. The findall() method returns the numbers as a list of strings, such as [‘100000’, ‘5000’], with each string representing one match.
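When a pattern contains more than one group, findall() returns a list of tuples. A sketch with made-up addresses on the reserved example.com and example.org domains:

```python
import re

pattern = re.compile(r"([\w\.]+)@([\w\.]+)")
text = "Contact alice.smith@example.com or bob@mail.example.org for details"

# Each tuple holds (user name, domain name) for one address.
pairs = pattern.findall(text)
for user_name, domain_name in pairs:
    print(f"User name: {user_name}, domain name: {domain_name}")
```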

The regular expression pattern ([\w\.]+)@([\w\.]+) matches the user name and the domain name of an email address, which are to the left and right of the @ symbol. This regular expression pattern has two pairs of parentheses representing two groups, one for the user name and one for the domain name. The dot (.) character is also matched in the user name and domain name substrings. The findall() method returns all the occurrences of the matching pattern as a list of tuples, with each tuple having the user name and domain name as its string items, matching their corresponding groups. Iterate through the tuples in the list using a for loop to display each user name and domain name. Including parentheses in a regular expression pattern causes the corresponding matched group to be remembered. For example, a(b)c matches the characters ‘abc’ and remembers ‘b’. To recall this matched substring group, use a backreference like \1.
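Backreferences can also be used in the replacement text of sub(). A minimal sketch that swaps two words, assuming the sample string "Hello World":

```python
import re

pattern = re.compile(r"(\w+)\s(\w+)")

# \1 is the first word, \2 the second; the replacement reverses them.
swapped = pattern.sub(r"\2 \1", "Hello World")
print(swapped)  # World Hello
```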

The regular expression ‘(\w+)\s(\w+)’ matches a word followed by a whitespace character and another word. There are two pairs of parentheses in the pattern, each capturing one word. The sub() method is used to switch the words in the string. For the replacement text, use r'\2 \1', where \1 in the replacement is replaced by the substring matched by the first group and \2 is replaced by the substring matched by the second group.
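A plain-text replacement with sub() looks like this; the input string "1,2,3" is an assumed sample.

```python
import re

pattern = re.compile(r",")

# Every comma in the string is replaced with a dollar sign.
replaced = pattern.sub("$", "1,2,3")
print(replaced)  # 1$2$3
```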



In the above code, comma ‘,’ is replaced with dollar ‘$’ sign.
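Testing the result of search() with an if statement can be sketched as follows. The pattern spelling tree:\w\w\w (the text ‘tree:’ followed by three word characters) is an assumption about how the 3-letter word is expressed.

```python
import re

pattern = re.compile(r"tree:\w\w\w")
match_object = pattern.search("Example for tree:oak")

# A match object is truthy; None (no match) is falsy.
if match_object:
    result = f"Matched string is {match_object.group()}"
else:
    result = "No match found"
print(result)  # Matched string is tree:oak
```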

The search() method searches for the pattern ‘tree:’ followed by a 3-letter word. The code pattern.search(“Example for tree:oak”) returns the search result as an object, which is assigned to the match_object variable. Then use an if statement to test match_object. If it evaluates to Boolean True, the search has succeeded and the matched string is displayed using match_object.group(). Otherwise, if it evaluates to Boolean False (None, to be more specific), the search did not succeed and there is no matching string.

Example: Write a Python program to check the validity of a password given by the user. The password should satisfy the following criteria:

  1. At least 1 letter between a and z
  2. At least 1 number between 0 and 9
  3. At least 1 letter between A and Z
  4. At least 1 character from $, #, @
  5. Minimum length of password: 6
  6. Maximum length of password: 12

Pattern r'[a-z]' checks for at least one lowercase letter between a and z. Pattern r'[A-Z]' checks for at least one uppercase letter between A and Z. Pattern r'\d' checks for at least one number between 0 and 9. Pattern r'[$#@]' checks for at least one character from $, #, @. The password length is checked against the minimum and maximum number of characters. If all the conditions are satisfied, then a “Valid Password” message is printed.
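One possible way to write this check is sketched below. The function name is_valid_password and the three candidate passwords are hypothetical; a real program would read the password from the user with input().

```python
import re

def is_valid_password(password):
    """Return True if the password meets all six criteria listed above."""
    # Criteria 5 and 6: length between 6 and 12 characters.
    if not 6 <= len(password) <= 12:
        return False
    # Criteria 1 to 4: each pattern must be found somewhere in the password.
    required = (r"[a-z]", r"\d", r"[A-Z]", r"[$#@]")
    return all(re.search(p, password) for p in required)

for candidate in ("Abc@123", "abc123", "Ab@1"):
    verdict = "Valid Password" if is_valid_password(candidate) else "Invalid Password"
    print(candidate, "->", verdict)
```

Checking each requirement with its own small pattern keeps the rules easy to read and to change, at the cost of scanning the password several times.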


Named Groups in Python Regular Expressions

Python Regular Expressions use groups to capture strings of interest. As the regular expression becomes complex, it gets difficult to keep track of the number of groups in the regular expression. In order to overcome this problem Python provides named groups. Instead of referring to the groups by numbers, you can reference them by a name. The syntax for a named group is,

(?P<name>RE)

where the group begins with ?, followed by the uppercase letter P, which stands for Python-specific extension; name is the name of the group written within angle brackets, and RE is the regular expression. Named groups behave exactly like capturing groups, and additionally associate a name with a group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group’s name.
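A named group can be sketched like this; the group name word and the sample string are chosen for illustration.

```python
import re

pattern = re.compile(r"(?P<word>\b\w+\b)")
match_object = pattern.search("(((( Lots of punctuation )))")

by_name = match_object.group("word")   # 'Lots'
by_number = match_object.group(1)      # 'Lots': the number still works
as_dict = match_object.groupdict()     # {'word': 'Lots'}
```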

In the above code, the regular expression has a group that matches a word boundary followed by one or more word characters, that is, a-z, A-Z, 0-9 and _, followed by a word boundary. The name given to this group is word, specified within angle brackets. Pass the string from which you want to extract the pattern as an argument to the search() method. By passing the group name ‘word’ as an argument to the group() method, you can extract the matched substring. A named group can still be retrieved by its integer number instead of its name.


Searching in Python Regular Expressions:

Compiling a pattern allows it to be reused later on in a program. However, note that Python caches recently-used expressions.

Flags in Python Regular Expressions:

For some special cases we need to change the behavior of the regular expression; this is done using flags. Flags can be set in two ways: through the flags keyword or directly in the expression. The flags keyword is accepted by re.search and by most other functions in the re module.
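Both ways of setting a flag are sketched below with case-insensitive matching; the sample string is an assumption.

```python
import re

# 1. Through the flags keyword.
keyword_match = re.search(r"pattern", "A PATTERN", flags=re.IGNORECASE)

# 2. Directly in the expression, with an inline flag at the start.
inline_match = re.search(r"(?i)pattern", "A PATTERN")

print(keyword_match.group())  # PATTERN
print(inline_match.group())   # PATTERN
```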



Replacing in Python Regular Expressions:

Replacements can be made on strings using re.sub.

Replacing strings

Using group references

Replacements with a small number of groups can be made by referring to each group as \1, \2, and so on in the replacement string.
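A sketch of both kinds of replacement with re.sub; the sample strings are made up for illustration.

```python
import re

# Replacing strings: every 't' followed by two digits becomes 'foo'.
plain = re.sub(r"t[0-9][0-9]", "foo", "my name t13 is t44 what t99 ever t44")
print(plain)  # my name foo is foo what foo ever foo

# Using group references: swap the two captured digits.
swapped_digits = re.sub(r"t([0-9])([0-9])", r"t\2\1", "t13 t19 t81 t25")
print(swapped_digits)  # t31 t91 t18 t52
```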

However, a group reference followed directly by a digit is ambiguous: \10 is read as a reference to group 10, not as group 1 followed by the literal character ‘0’. To be explicit, use the \g<i> notation, for example \g<1>0.
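The \g<i> notation can be tried with a short sketch. The pattern below has a single group, so the ambiguous form \10 fails with re.error, while \g<1>0 works as intended; the sample string is an assumption.

```python
import re

# Explicit form: group 1 followed by a literal '0'.
explicit = re.sub(r"(\d)", r"\g<1>0", "a1 b2")
print(explicit)  # a10 b20

# Ambiguous form: \10 asks for group 10, which this pattern does not have.
try:
    re.sub(r"(\d)", r"\10", "a1 b2")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
```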

Using a replacement function
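Instead of a replacement string, re.sub also accepts a function that receives each match object and returns the replacement text. A minimal sketch, with the hypothetical helper double_number:

```python
import re

def double_number(match):
    """Return the matched number doubled, as a string."""
    return str(int(match.group()) * 2)

doubled_text = re.sub(r"\d+", double_number, "3 apples and 12 pears")
print(doubled_text)  # 6 apples and 24 pears
```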


About the Author: Fawad

My name is Shahzada Fawad and I am a Programmer. Currently, I am running my own YouTube channel "Expertstech", and managing this Website. My Hobbies are * Watching Movies * Music * Photography * Travelling * gaming and so on...
