16: Regular Expressions

16.1: Regular Expressions
This page introduces regular expressions in Python, showcasing their complexity and power over basic string methods. It describes regular expressions as a mini programming language for string searching and parsing. Examples using the `search()` function demonstrate simple usage, like finding lines with "From:", while also highlighting advanced matching capabilities through special characters, such as the caret for matching line beginnings.
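The difference the caret makes can be sketched as follows. The sample lines are hypothetical stand-ins for lines read from a mail file; only `re.search()` and the "From:" pattern come from the summary above.

```python
import re

# Hypothetical sample lines standing in for lines read from a mail file
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: From: is mentioned here too",
    "X-Authentication-Warning: set sender",
]

# Without the caret, "From:" may appear anywhere in the line
anywhere = [line for line in lines if re.search("From:", line)]

# With the caret, "From:" must appear at the very start of the line
at_start = [line for line in lines if re.search("^From:", line)]

print(anywhere)
print(at_start)
```

Only the first sample line survives the anchored search, because the second line contains "From:" but does not begin with it.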
16.2: Character matching in regular expressions
This page discusses special characters in regular expressions, particularly the period (.), which matches any single character. It provides examples with regex patterns like "F..m:" and "^From:.+@" for string searches. The text explains the asterisk (*) and plus (+) symbols, which match zero or more and one or more repetitions of the preceding character, respectively, and highlights their greedy behavior and ways to manage it.
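A minimal sketch of the wildcard and of greedy matching, using a made-up sample line. The non-greedy form `.+?` is shown on the assumption that it is the way of managing greediness the page has in mind; it is a standard feature of Python's `re` module.

```python
import re

# Hypothetical sample line containing several "@" characters
line = "From: stephen@uct.ac.za, csev@umich.edu, and cwen@iupui.edu"

# Each period matches exactly one character, so "F..m:" matches "From:"
wildcard = re.search("F..m:", line).group()

# ".+" is greedy: it expands as far as it can, up to the last "@"
greedy = re.search("^From:.+@", line).group()

# ".+?" is non-greedy: it stops at the first "@" that lets the match succeed
non_greedy = re.search("^From:.+?@", line).group()

print(wildcard)
print(greedy)
print(non_greedy)
```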
16.3: Extracting data using regular expressions
This page describes how to extract email addresses from strings in Python using the `findall()` method with regular expressions. It illustrates initial regex patterns for basic email structures, followed by refined patterns that exclude unwanted characters and ensure valid email formats. The content emphasizes generating clean outputs when processing lines from a file.
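The progression from a basic pattern to a refined one might look like the sketch below. The sample line and both patterns are illustrative assumptions, not necessarily the exact ones used on the page.

```python
import re

# Hypothetical header line in which the addresses are wrapped in punctuation
line = "From <stephen.marquard@uct.ac.za>; for <source@collab.sakaiproject.org>;"

# Basic pattern: runs of non-whitespace characters on both sides of "@".
# It also picks up the surrounding "<", ">" and ";" characters.
basic = re.findall(r"\S+@\S+", line)

# Refined pattern: the match must start with a letter or digit
# and end with a letter, which trims the unwanted punctuation.
refined = re.findall(r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]", line)

print(basic)
print(refined)
```

`findall()` returns a list of all non-overlapping matches, so a line with no address yields an empty list rather than an error.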
16.4: Combining searching and extracting
This page explains how to use Python regular expressions to extract specific floating-point numbers and integers from text lines, particularly those starting with "X-". It demonstrates the process through examples, such as extracting revision numbers and timestamps from email logs. The advantages of regex over traditional string-splitting methods are emphasized, showcasing how it simplifies the code.
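Combining the search and the extraction in one pattern can be sketched as below. The sample lines and the exact patterns are assumptions made for illustration; the key idea is that parentheses mark the part of the match that `findall()` returns.

```python
import re

# Hypothetical sample lines in the style of an email log
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-Plane is behind schedule: two weeks",
    "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772",
]

numbers = []
revisions = []
for line in lines:
    # The line must start with "X-", then non-blank characters, a colon and
    # a space; only the parenthesized number is returned.
    for value in re.findall(r"^X-\S*: ([0-9.]+)", line):
        numbers.append(float(value))
    # The line must start with "Details:" and the digits must follow "rev="
    for value in re.findall(r"^Details:.*rev=([0-9]+)", line):
        revisions.append(int(value))

print(numbers)
print(revisions)
```

The third sample line starts with "X-" but is skipped, because a space follows "X-Plane" where the pattern requires a colon.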
16.5: Escape Character
This page explains that special characters in regular expressions perform functions such as matching line positions or serving as wildcards. To match one of them literally, prefix it with a backslash. The example given finds money amounts using the regex '\$[0-9.]+', where '\$' matches a dollar sign and '[0-9.]+' matches one or more digits or periods. It also notes that characters inside square brackets are treated as ordinary characters, while outside the brackets they keep their special meanings.
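A short sketch of the escaped dollar sign, using a made-up sentence. The unescaped variant is included only to show what goes wrong without the backslash.

```python
import re

# Hypothetical sample sentence
text = "We just received $10.00 for cookies."

# "\$" matches a literal dollar sign; inside the brackets the period
# is an ordinary character, so "[0-9.]+" matches digits and periods.
amounts = re.findall(r"\$[0-9.]+", text)

# Without the backslash, "$" means "end of the string", and nothing
# can follow the end of the string, so there is no match.
unescaped = re.findall(r"$[0-9.]+", text)

print(amounts)
print(unescaped)
```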
16.6: Bonus section for Unix / Linux users
This page discusses Unix's grep tool, used for regular expression file searching since the 1960s. It highlights how grep allows users to find lines starting with specific strings, comparing it to Python's search() function. The summary emphasizes key differences, particularly grep's lack of support for the "\S" non-blank character and its more complex notation for matching non-space characters.
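The notation difference can be illustrated in Python itself. The sketch below assumes that the "more complex notation" is the character set `[^ ]` (any character except a space), and the sample line is hypothetical. On text that contains only spaces as whitespace the two patterns find the same thing, although `\S` also excludes tabs and newlines, which `[^ ]` does not.

```python
import re

# Hypothetical sample line
line = "From: stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Python shorthand for "any non-whitespace character"
with_shorthand = re.findall(r"\S+@\S+", line)

# Set notation for "any character except a space", assumed here to be
# the form needed where "\S" is not supported
with_set = re.findall(r"[^ ]+@[^ ]+", line)

print(with_shorthand)
print(with_set)
```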
16.7: Debugging
This page highlights Python's built-in documentation features, including the interactive help system with the help() function and the dir() command for listing module methods. It emphasizes that while the documentation is not comprehensive, it provides quick and useful references for users needing immediate information without internet access.
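As a small sketch of these two tools applied to the `re` module: `dir()` returns the names a module defines, and `help()` prints the documentation for one of them. In an interactive session `help()` may open a pager instead of printing directly.

```python
import re

# dir() returns a list of the names defined in the module
names = dir(re)
print("search" in names, "findall" in names)

# help() prints the built-in documentation for a single function
help(re.search)
```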
16.E: Regular Expressions (Exercises)
This page outlines two programming exercises: the first involves creating a program to simulate the Unix grep command by counting the number of lines that match a user-defined regular expression in "mbox.txt." The second task requires writing a program to extract numbers from lines that state "New Revision: [number]," calculate their average, and print this information from given files.
16.G: Regular Expressions (Glossary)
This page defines key programming concepts: "brittle code" as code that breaks easily when input deviates from the expected format; "greedy matching" as the tendency of the + and * characters in a regular expression to match the longest possible string; "grep" as a Unix command for searching text files with regular expressions; "regular expression" as a language for expressing complex search patterns using special characters; and "wild card" as a character (the period) that matches any character in regular expressions.
16.S: Regular Expressions (Summary)
This page introduces regular expressions and their role as search strings with special characters for matching criteria and extracted content. It details operators and character sequences, including line markers, wildcards, whitespace, and character sets. It covers character ranges, inverted matches, and grouping with parentheses, as well as digit and non-digit matches, providing a foundational understanding of regular expressions.
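Several of the operators listed in this summary can be tried side by side. The strings below are made up for illustration; each line exercises one operator with `re.findall()`.

```python
import re

# Character set: any one of the listed characters
vowels = re.findall(r"[aeiou]", "regex")

# Inverted set: one or more characters that are not digits
not_digits = re.findall(r"[^0-9]+", "ab12cd")

# Parentheses: only the grouped parts are returned, as tuples
page_range = re.findall(r"(\d+)-(\d+)", "pages 10-12")

# \d matches a digit, \D matches a non-digit
digits = re.findall(r"\d+", "ab12cd")
non_digits = re.findall(r"\D+", "ab12cd")

# \s matches a whitespace character; ^ anchors to the start of the line
spaces = re.findall(r"\s", "a b\tc")
first_word = re.findall(r"^\S+", "From: csev@umich.edu")

print(vowels, not_digits, page_range)
print(digits, non_digits, spaces, first_word)
```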
