Intro to Regex

Regex and NLP

Regular Expressions (regex) in Python are powerful tools used to search, match, and manipulate text. In NLP (Natural Language Processing), regex is often used during the pre-processing stage to clean and extract useful patterns from text. It allows developers to identify and handle structures like emails, phone numbers, dates, or specific word patterns quickly and efficiently.

In rule-based chatbots, regex is often the primary mechanism used to interpret user input. Rather than “understanding” language, the chatbot matches patterns in text to predefined rules, allowing it to classify intent, validate input, and trigger responses.
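As a rough illustration of the idea above, the sketch below maps a few intents to patterns and returns the first intent whose pattern appears in the message. The intent names, the patterns, and the `classify_intent` helper are all hypothetical and chosen only for this example; a real chatbot would need a far larger and more carefully ordered rule set.

```python
import re

# Hypothetical intent rules: each intent name maps to one compiled pattern.
INTENT_RULES = {
    "greeting": re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE),
    "goodbye": re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE),
    "order_status": re.compile(r"\border\b.*\bstatus\b", re.IGNORECASE),
}


def classify_intent(message: str) -> str:
    """Return the first intent whose pattern is found in the message."""
    for intent, pattern in INTENT_RULES.items():
        if pattern.search(message):
            return intent
    # No rule matched, so the bot would fall back to a default response.
    return "unknown"


print(classify_intent("Hello there!"))
print(classify_intent("Can you check my order status?"))
print(classify_intent("Tell me a joke"))
```

Because the rules are checked in insertion order, a message that fits several patterns is assigned to the first one, so the ordering of the rules is itself a design decision.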

Why Regex?

Regex becomes extremely helpful in various text processing scenarios, such as:

Detecting and extracting emails or phone numbers from user input
Removing unwanted symbols or extra whitespaces
Identifying patterns such as hashtags, mentions, or URLs in social media content
Matching time formats, dates, or currency patterns
Filtering out profane or restricted words from text

How does it work?

A regex is a sequence of characters that defines a search pattern. When the pattern is applied to a string, the regex engine reads the pattern and scans the text for matches. Each character in a pattern has a specific meaning: some match literal characters, while others act as wildcards, quantifiers, or groups. With a pattern in hand, the engine can search, replace, split, or extract substrings. Conceptually, the engine moves through the text one character at a time, transitioning between states according to the pattern until a match succeeds or the input is exhausted.
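The four tasks mentioned above (searching, extracting, replacing, and splitting) each correspond to a function in Python's `re` module. The snippet below is a minimal sketch on a made-up sentence; the sample text is an assumption used only for illustration.

```python
import re

text = "Order 66 shipped on 2024-05-01,  order 67 ships 2024-05-09."

# Searching: find the first run of digits.
first = re.search(r"\d+", text)
print(first.group())  # 66

# Extracting: collect every date-shaped substring.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2024-05-01', '2024-05-09']

# Replacing: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", text)
print(cleaned)

# Splitting: break the text on a comma followed by optional whitespace.
parts = re.split(r",\s*", text)
print(parts)
```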

Applying Regex

Regex Method Table

Function	What it Does
`re.match`	Matches only at the start of the string
`re.search`	Finds the first match anywhere in the string
`re.fullmatch`	Matches only if the entire string fits the pattern
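To make the difference between the three functions concrete, the sketch below applies the same pattern to the same input with each of them. Each call returns a match object on success and `None` on failure.

```python
import re

pattern = r"\d+"
text = "abc 123"

# re.match fails: the string does not start with a digit.
print(re.match(pattern, text))       # None

# re.search succeeds: digits appear later in the string.
print(re.search(pattern, text).group())  # 123

# re.fullmatch fails: the whole string is not made of digits.
print(re.fullmatch(pattern, text))   # None

# re.fullmatch succeeds when the entire string fits the pattern.
print(re.fullmatch(pattern, "123").group())  # 123
```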

Character Matching

Example Situation: You want to validate whether an input contains only alphabetic characters.

Why this utility is the right one for the situation: Character matching allows you to match specific letters or characters within text, making it ideal for validation checks.

Code Examples:

import re
re.match(r"^[a-zA-Z]+$", "Hello")  # Match only letters
re.match(r"^[a-z]+$", "world")       # Match only lowercase
re.match(r"^[A-Z]+$", "TEST")        # Match only uppercase
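The calls above return a match object or `None`, so a validation check usually converts that result into a boolean. The helper below is a minimal sketch; the name `is_alphabetic` is made up for this example. It uses `re.fullmatch` rather than `^...$` because `$` also matches just before a trailing newline, which can let input such as "Hello\n" pass.

```python
import re


def is_alphabetic(text: str) -> bool:
    """Return True when text consists only of ASCII letters."""
    return re.fullmatch(r"[a-zA-Z]+", text) is not None


print(is_alphabetic("Hello"))    # True
print(is_alphabetic("Hello1"))   # False
print(is_alphabetic(""))         # False

# The anchored form accepts a trailing newline; fullmatch does not.
print(re.match(r"^[a-zA-Z]+$", "Hello\n") is not None)  # True
print(is_alphabetic("Hello\n"))                          # False
```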

Alternation

Example Situation: You want to accept either "yes" or "no" from user input.

Why this utility is the right one for the situation: Alternation lets you match one out of multiple possible patterns.

Code Examples:

re.match(r"yes|no", "yes")
re.match(r"cat|dog|bird", "dog")
re.search(r"apple|orange|banana", "I like orange juice")
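One caveat for the yes/no situation: `re.match(r"yes|no", ...)` only checks the start of the string, so it also accepts input such as "yesterday". The sketch below shows one possible way to accept only the two words themselves; the helper name `parse_yes_no` and the choice to ignore case and surrounding spaces are assumptions made for this example.

```python
import re


def parse_yes_no(reply: str):
    """Return 'yes' or 'no' if the reply is exactly one of them, else None."""
    match = re.fullmatch(r"(yes|no)", reply.strip(), re.IGNORECASE)
    return match.group(1).lower() if match else None


print(parse_yes_no(" Yes "))      # yes
print(parse_yes_no("no"))         # no
print(parse_yes_no("yesterday"))  # None

# For comparison, the unanchored pattern matches the "yes" in "yesterday".
print(re.match(r"yes|no", "yesterday") is not None)  # True
```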

Character Sets and Negated Character Sets

Example Situation: You want to detect any digit or symbol in a string.

Why this utility is the right one for the situation: Character sets allow you to match specific ranges or types of characters, while negated sets match everything except what's specified.

Code Examples:

re.search(r"[0-9]", "My phone number is 555-1234")
re.search(r"[^a-zA-Z0-9]", "password!@#")
re.findall(r"[aeiou]", "education")
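To show what these calls actually return, the sketch below prints the results of a character set and a negated set, then uses a negated set with `re.sub` to strip symbols. The cleanup sentence is an invented sample.

```python
import re

# A character set matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "education")
print(vowels)  # ['e', 'u', 'a', 'i', 'o']

# A negated set matches any character NOT listed.
symbols = re.findall(r"[^a-zA-Z0-9]", "password!@#")
print(symbols)  # ['!', '@', '#']

# Negated sets are handy for cleanup: drop everything except
# letters, digits, and spaces.
cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", "Hello, world!")
print(cleaned)  # Hello world
```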

Wildcards

Example Situation: You want to match any character in a specific position.

Why this utility is the right one for the situation: The dot (.) wildcard matches any single character except a newline (unless the re.DOTALL flag is set).

Code Examples:

re.match(r"h.t", "hat")
re.match(r"c.r", "car")
re.search(r".at", "I have a cat")

Note:

re.match() checks for a pattern only at the beginning of a string, while re.search() scans the entire string for a match.

Ranges

Example Situation: You want to match characters between a certain range like a–z or 0–9.

Why this utility is the right one for the situation: Ranges offer a concise way to represent multiple characters.

Code Examples:

re.match(r"[a-z]", "g")
re.match(r"[A-Z]", "T")
re.search(r"[0-5]", "There are 3 items")

Shorthand Character Classes

Example Situation: You want to quickly check if a string contains a digit, a word character, or whitespace.

Why this utility is the right one for the situation: Shorthand character classes simplify common character set patterns.

Code Examples:

re.search(r"\d", "Room 101")     # Matches digit
re.search(r"\w", "hello_world") # Matches word character
re.search(r"\s", "hello world")  # Matches whitespace

re.search(r"\D", "A1")           # Matches non-digit
re.search(r"\W", "abc!")         # Matches non-word character
re.search(r"\S", " 123")         # Matches non-whitespace

Grouping

Example Situation: You want to extract the separate parts of a phone number, such as the two digit groups in 555-1234.

Why this utility is the right one for the situation: Grouping lets you extract and operate on specific portions of a match.

Code Examples:

re.match(r"(\d{3})-(\d{4})", "555-1234")
re.search(r"(Mr|Ms|Dr)\.\s\w+", "Dr. Smith")
re.findall(r"(\w+)@(\w+\.com)", "email@example.com")
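The examples above create groups but do not read them back. The sketch below shows how the captured parts can be retrieved from a match object; the group names `prefix` and `line` in the named-group variant are hypothetical labels chosen for this example.

```python
import re

match = re.match(r"(\d{3})-(\d{4})", "555-1234")
print(match.group(0))  # 555-1234 (the whole match)
print(match.group(1))  # 555
print(match.group(2))  # 1234
print(match.groups())  # ('555', '1234')

# Named groups make the extracted parts easier to refer to.
named = re.match(r"(?P<prefix>\d{3})-(?P<line>\d{4})", "555-1234")
print(named.group("prefix"))  # 555
print(named.group("line"))    # 1234

# With more than one group, findall returns a list of tuples.
pairs = re.findall(r"(\w+)@(\w+\.com)", "email@example.com")
print(pairs)  # [('email', 'example.com')]
```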

Quantifiers Fixed

Example Situation: You want to match a word with exactly 4 letters.

Why this utility is the right one for the situation: A fixed quantifier {n} requires exactly n repetitions of the preceding element.

Code Examples:

re.match(r"^\w{4}$", "test")
re.match(r"\d{5}", "90210")
re.search(r"\w{3}\d{2}", "abc12")

Quantifiers Optional

Example Situation: You want to match both "color" and "colour".

Why this utility is the right one for the situation: The ? quantifier makes the preceding character or group optional, so it matches zero or one occurrence.

Code Examples:

re.match(r"colou?r", "color")
re.match(r"colou?r", "colour")
re.search(r"Nov(ember)?", "November")

Quantifiers One or More

Example Situation: You want to match a sequence of one or more digits.

Why this utility is the right one for the situation: The + quantifier requires at least one occurrence of the preceding element and allows any number more.

Code Examples:

re.match(r"\d+", "123")
re.search(r"[a-z]+", "hello")
re.findall(r"\w+", "Find all words in this sentence.")
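To compare the three kinds of quantifier side by side, the sketch below runs each against a short sample string and prints what `re.findall` returns. The sample strings are invented for illustration.

```python
import re

numbers = "7 42 2024"

# Fixed: exactly two digits at a time, so "2024" is split into two matches.
fixed = re.findall(r"\d{2}", numbers)
print(fixed)  # ['42', '20', '24']

# One or more: each full run of digits is a single match.
one_or_more = re.findall(r"\d+", numbers)
print(one_or_more)  # ['7', '42', '2024']

# Optional: the "u" may or may not be present.
optional = re.findall(r"colou?r", "color colour colr")
print(optional)  # ['color', 'colour']
```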

Anchors

Example Situation: You want to check if a string starts or ends with a specific word.

Why this utility is the right one for the situation: Anchors match positions rather than characters: ^ matches at the start of a string and $ matches at the end.

Code Examples:

re.match(r"^Hello", "Hello world")
re.search(r"world$", "Hello world")
re.match(r"^\d+$", "12345")
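As a small sketch of the start/end check described in this section, the two helpers below wrap the anchors in reusable functions. The names `starts_with` and `ends_with` are hypothetical; `re.escape` is used so that the word being checked is treated as literal text even if it contains regex symbols.

```python
import re


def starts_with(word: str, text: str) -> bool:
    """Return True when text begins with the literal word."""
    return re.search(r"^" + re.escape(word), text) is not None


def ends_with(word: str, text: str) -> bool:
    """Return True when text ends with the literal word."""
    return re.search(re.escape(word) + r"$", text) is not None


print(starts_with("Hello", "Hello world"))  # True
print(starts_with("Hello", "Say Hello"))    # False
print(ends_with("world", "Hello world"))    # True
print(ends_with("world", "Hello world!"))   # False
```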

Additional Tools

Other useful regex tools include:

Lookaheads and Lookbehinds: Match patterns based on what comes before or after without including it in the match.
Non-capturing groups: Written as (?:...), these group part of a pattern without capturing the matched text as a separate group.
Flags: Such as re.IGNORECASE, re.MULTILINE, and re.DOTALL to alter matching behavior.

These tools provide even more flexibility when parsing complex language structures.
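The sketch below gives one short example of each tool listed above. The sample sentences are invented for illustration, and the patterns are only one way to write each check.

```python
import re

# Lookahead: match digits only when followed by " USD"; the unit is not returned.
ahead = re.findall(r"\d+(?= USD)", "Pay 30 USD or 25 EUR")
print(ahead)  # ['30']

# Lookbehind: match digits only when preceded by a dollar sign.
behind = re.findall(r"(?<=\$)\d+", "Costs $15 or 20 credits")
print(behind)  # ['15']

# Capturing vs non-capturing: findall returns group contents when a
# capturing group is present, and whole matches otherwise.
captured = re.findall(r"(Mr|Ms|Dr)\.\s\w+", "Dr. Smith met Ms. Lee")
whole = re.findall(r"(?:Mr|Ms|Dr)\.\s\w+", "Dr. Smith met Ms. Lee")
print(captured)  # ['Dr', 'Ms']
print(whole)     # ['Dr. Smith', 'Ms. Lee']

# re.IGNORECASE: letters match regardless of case.
any_case = re.findall(r"hello", "Hello HELLO hello", re.IGNORECASE)
print(any_case)  # ['Hello', 'HELLO', 'hello']

# re.MULTILINE: ^ matches at the start of every line, not only the string.
line_starts = re.findall(r"^\w+", "first line\nsecond line", re.MULTILINE)
print(line_starts)  # ['first', 'second']

# re.DOTALL: the dot also matches a newline.
print(re.search(r"a.b", "a\nb"))                          # None
print(re.search(r"a.b", "a\nb", re.DOTALL) is not None)   # True
```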


Conclusion

In this lecture, we introduced Natural Language Processing (NLP) as a critical component of chatbot development, enabling machines to understand and work with human language. We explored the key stages of text pre-processing, including noise removal, tokenization, and normalization techniques like stemming and lemmatization, which prepare raw text for intelligent analysis.

We then examined how Regular Expressions (regex) are used within NLP workflows to detect patterns, clean data, and extract meaningful elements from user input. Through worked examples, we covered the core regex tools, from basic character matching and wildcards to quantifiers, grouping, and anchors, equipping you with essential skills for building rule-based language systems.

Together, these concepts form the foundation of rule-based chatbot logic, where understanding structure, syntax, and pattern recognition is vital to creating accurate, flexible, and efficient conversational flows.