- Feature Name: regex-1.0
- Start Date: 2016-05-11
- RFC PR: rust-lang/rfcs#1620
- Rust Issue: N/A
# Summary

This RFC proposes a 1.0 API for the `regex` crate and, therefore, a move out of
the `rust-lang-nursery` organization and into the `rust-lang` organization.
Since the API of `regex` has largely remained unchanged since its inception two
years ago, significant emphasis is placed on retaining the existing API. Some
minor breaking changes are proposed.
# Motivation
Regular expressions are a widely used tool and most popular programming languages either have an implementation of regexes in their standard library, or there exists at least one widely used third party implementation. It therefore seems reasonable for Rust to do something similar.
The `regex` crate specifically serves many use cases, most of which are somehow
related to searching strings for patterns. Describing regular expressions in
detail is beyond the scope of this RFC, but briefly, these core use cases are
supported in the main API:

1. Testing whether a pattern matches some text.
2. Finding the location of a match of a pattern in some text.
3. Finding the location of a match of a pattern---and locations of all its
   capturing groups---in some text.
4. Iterating over successive non-overlapping matches of (2) and (3).
The expected outcome is that the `regex` crate should be the preferred default
choice for matching regular expressions when writing Rust code. This is already
true today; this RFC formalizes it.
# Detailed design

## Syntax

### Evolution

The public API of a regex library includes the syntax of a regular expression.
A change in the semantics of the syntax can cause otherwise working programs to
break, yet we'd still like the option to expand the syntax if necessary. Thus,
this RFC proposes:

1. Any change that causes a previously invalid regex pattern to become valid
   is not a breaking change. For example, the escape sequence `\y` is not a
   valid pattern today, but could become one in a future release without a
   major version bump.
2. Any change that causes a previously valid regex pattern to become invalid
   is a breaking change.
3. Any change that causes a valid regex pattern to change its matching
   semantics is a breaking change. (For example, changing `\b` from "word
   boundary assertion" to "backspace character.")

Bug fixes and Unicode upgrades are exceptions to both (2) and (3).
Another interesting exception to (2) is that compiling a regex can fail if the
entire compiled object would exceed some predefined, user-configurable size.
In particular, future changes to the compiler could cause certain instructions
to use more memory, or indeed, the representation of the compiled regex could
change completely. This could cause a regex that fit under the size limit to
no longer fit, and therefore fail to compile. These cases are expected to be
extremely rare in practice. Notably, the default size limit is 10MB.
### Concrete syntax

The syntax is exhaustively documented in the current public API documentation:
http://doc.rust-lang.org/regex/regex/index.html#syntax

To my knowledge, the evolution as proposed in this RFC has been followed since
`regex` was created. The syntax has largely remained unchanged, with few
additions.
### Expansion concerns

There are a few possible avenues for expansion, and we take measures to make
sure they are possible with respect to API evolution.

- Escape sequences are often blessed with special semantics. For example, `\d`
  is a Unicode character class that matches any digit and `\b` is a word
  boundary assertion. We may one day like to add more escape sequences with
  special semantics. For this reason, any unrecognized escape sequence makes a
  pattern invalid.
- If we wanted to expand the syntax with various look-around operators, then it
  would be possible since most common syntax is considered an invalid pattern
  today. In particular, all of the syntactic forms listed here are invalid
  patterns in `regex`.
- Character class sets are another potentially useful feature that may be worth
  adding. Currently, various forms of set notation are treated as valid
  patterns, but this RFC proposes making them invalid patterns before `1.0`.
- Additional named Unicode classes or codepoints may be desirable to add.
  Today, any pattern of the form `\p{NAME}` where `NAME` is unrecognized is
  considered invalid, which leaves room for expansion.
- If all else fails, we can introduce new flags that enable new features that
  conflict with stable syntax. This is possible because using an unrecognized
  flag results in an invalid pattern.
## Core API

The core API of the `regex` crate is the `Regex` type:

pub struct Regex(_);

It has one primary constructor:

impl Regex {
    /// Creates a new regular expression. If the pattern is invalid or otherwise
    /// fails to compile, this returns an error.
    pub fn new(pattern: &str) -> Result<Regex, Error>;
}
And five core search methods. All searching completes in worst case linear time
with respect to the search text (the size of the regex is taken as a constant).

impl Regex {
    /// Returns true if and only if the text matches this regex.
    pub fn is_match(&self, text: &str) -> bool;

    /// Returns the leftmost-first match of this regex in the text given. If no
    /// match exists, then None is returned.
    ///
    /// The leftmost-first match is defined as the first match that is found
    /// by a backtracking search.
    pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>;

    /// Returns an iterator of successive non-overlapping matches of this regex
    /// in the text given.
    pub fn find_iter<'r, 't>(&'r self, text: &'t str) -> Matches<'r, 't>;

    /// Returns the leftmost-first match of this regex in the text given, with
    /// locations for all capturing groups that participated in the match.
    pub fn captures(&self, text: &str) -> Option<Captures>;

    /// Returns an iterator of successive non-overlapping matches with capturing
    /// group information in the text given.
    pub fn captures_iter<'r, 't>(&'r self, text: &'t str) -> CaptureMatches<'r, 't>;
}
(N.B. The `captures` method can technically replace all uses of `find` and
`is_match`, but it is potentially slower. Namely, the API reflects a
performance trade-off: the more you ask for, the harder the regex engine has to
work.)
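To make that trade-off concrete, here is a minimal sketch of how the three
levels of querying fit together under the proposed API. The date pattern and
haystack are illustrative only.

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let text = "released on 2016-05-11";

    // Cheapest question: is there a match at all?
    assert!(re.is_match(text));

    // A bit more work: where is the match?
    let m = re.find(text).unwrap();
    assert_eq!((m.start(), m.end()), (12, 22));
    assert_eq!(m.as_str(), "2016-05-11");

    // The most work: where did each capturing group match?
    let caps = re.captures(text).unwrap();
    assert_eq!(caps.get(1).unwrap().as_str(), "2016");
}
```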
There is one additional, but idiosyncratic, search method:

impl Regex {
    /// Returns the end location of a match if one exists in text.
    ///
    /// This may return a location preceding the end of a proper leftmost-first
    /// match. In particular, it may return the location at which a match is
    /// determined to exist. For example, matching `a+` against `aaaaa` will
    /// return `1` while the end of the leftmost-first match is actually `5`.
    ///
    /// This has the same performance characteristics as `is_match`.
    pub fn shortest_match(&self, text: &str) -> Option<usize>;
}
And two methods for splitting:

impl Regex {
    /// Returns an iterator of substrings of `text` delimited by a match of
    /// this regular expression. Each element yielded by the iterator
    /// corresponds to text that *isn't* matched by this regex.
    pub fn split<'r, 't>(&'r self, text: &'t str) -> Split<'r, 't>;

    /// Returns an iterator of at most `limit` substrings of `text` delimited
    /// by a match of this regular expression. Each element yielded by the
    /// iterator corresponds to text that *isn't* matched by this regex. The
    /// remainder of `text` that is not split will be the last element yielded
    /// by the iterator.
    pub fn splitn<'r, 't>(&'r self, text: &'t str, limit: usize) -> SplitN<'r, 't>;
}
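A small sketch of how the splitting methods behave; the whitespace pattern and
inputs are illustrative only.

```rust
use regex::Regex;

fn main() {
    // Split on runs of whitespace.
    let re = Regex::new(r"\s+").unwrap();

    let fields: Vec<&str> = re.split("a  b   c").collect();
    assert_eq!(fields, vec!["a", "b", "c"]);

    // With splitn, anything left over is yielded unsplit as the last element.
    let fields: Vec<&str> = re.splitn("a  b   c", 2).collect();
    assert_eq!(fields, vec!["a", "b   c"]);
}
```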
And three methods for replacement. Replacement is discussed in more detail in a
subsequent section.

impl Regex {
    /// Replaces matches of this regex in `text` with `rep`. If no matches were
    /// found, then the given string is returned unchanged, otherwise a new
    /// string is allocated.
    ///
    /// `replace` replaces the first match only. `replace_all` replaces all
    /// matches. `replacen` replaces at most `limit` matches.
    fn replace<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
    fn replace_all<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
    fn replacen<'t, R: Replacer>(&self, text: &'t str, limit: usize, rep: R) -> Cow<'t, str>;
}
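A sketch of string replacement, assuming that plain `&str` replacements support
the `$name` group expansion syntax discussed in the `Replacer` and `Captures`
sections below. The pattern and replacement strings are illustrative only.

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();

    // Replacement strings may refer to capture groups by name (or number).
    let out = re.replace_all("2016-05-11, 2016-07-14", "$m/$d/$y");
    assert_eq!(out, "05/11/2016, 07/14/2016");

    // If nothing matches, the original text is returned without copying.
    let out = re.replace_all("no dates here", "$m/$d/$y");
    assert_eq!(out, "no dates here");
}
```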
And lastly, three simple accessors:

impl Regex {
    /// Returns the original pattern string.
    pub fn as_str(&self) -> &str;

    /// Returns an iterator over all capturing groups in the pattern in the
    /// order they were defined (by position of the leftmost parenthesis). The
    /// name of the group is yielded if it has a name, otherwise None is
    /// yielded.
    pub fn capture_names(&self) -> CaptureNames;

    /// Returns the total number of capturing groups in the pattern. This
    /// includes the implicit capturing group corresponding to the entire
    /// pattern.
    pub fn captures_len(&self) -> usize;
}
Finally, `Regex` impls the `Send`, `Sync`, `Display`, `Debug`, `Clone` and
`FromStr` traits from the standard library.
### Error

The `Error` enum is an extensible enum, similar to `std::io::Error`,
corresponding to the different ways that regex compilation can fail. In
particular, this means that adding a new variant to this enum is not a breaking
change. (Removing or changing an existing variant is still a breaking change.)

pub enum Error {
    /// A syntax error.
    Syntax(SyntaxError),
    /// The compiled program exceeded the set size limit.
    /// The argument is the size limit imposed.
    CompiledTooBig(usize),
    /// Hints that destructuring should not be exhaustive.
    ///
    /// This enum may grow additional variants, so this makes sure clients
    /// don't count on exhaustive matching. (Otherwise, adding a new variant
    /// could break existing code.)
    #[doc(hidden)]
    __Nonexhaustive,
}
Note that the `Syntax` variant could contain the `Error` type from the
`regex-syntax` crate, but this would couple `regex-syntax` to the public API of
`regex`. We sidestep this hazard by defining a newtype in `regex` that
internally wraps `regex_syntax::Error`. This also enables us to selectively
expose more information in the future.
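Because the enum is non-exhaustive, callers should always include a wildcard
arm when matching on it. A minimal sketch; the unclosed-group pattern is
illustrative only.

```rust
use regex::{Error, Regex};

fn main() {
    // An unclosed group is a syntax error.
    match Regex::new(r"(unclosed") {
        Ok(_) => unreachable!(),
        Err(Error::Syntax(_)) => println!("invalid pattern syntax"),
        Err(Error::CompiledTooBig(limit)) => {
            println!("compiled program exceeded {} bytes", limit)
        }
        // The wildcard arm keeps this match compiling if new variants are
        // added in a future release.
        Err(_) => println!("some other compilation error"),
    }
}
```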
### RegexBuilder

In most cases, the construction of a regex is done with `Regex::new`. There
are, however, some options one might want to tweak. This can be done with a
`RegexBuilder`:

impl RegexBuilder {
    /// Creates a new builder from the given pattern.
    pub fn new(pattern: &str) -> RegexBuilder;

    /// Compiles the pattern with all of the set options. If successful, a
    /// Regex is returned. Otherwise, if compilation failed, an Error is
    /// returned.
    ///
    /// N.B. `RegexBuilder::new("...").build()` is equivalent to
    /// `Regex::new("...")`.
    pub fn build(&self) -> Result<Regex, Error>;

    /// Set the case insensitive flag (i).
    pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the multi line flag (m).
    pub fn multi_line(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the dot-matches-any-character flag (s).
    pub fn dot_matches_new_line(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the swap-greed flag (U).
    pub fn swap_greed(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the ignore-whitespace flag (x).
    pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the Unicode flag (u).
    pub fn unicode(&mut self, yes: bool) -> &mut RegexBuilder;

    /// Set the approximate size limit (in bytes) of the compiled regular
    /// expression.
    ///
    /// If compiling a pattern would approximately exceed this size, then
    /// compilation will fail.
    pub fn size_limit(&mut self, limit: usize) -> &mut RegexBuilder;

    /// Set the approximate size limit (in bytes) of the cache used by the DFA.
    ///
    /// This is a per-thread limit. Once the DFA fills the cache, it will be
    /// wiped and refilled. If the cache is wiped too frequently, the DFA will
    /// quit and fall back to another matching engine.
    pub fn dfa_size_limit(&mut self, limit: usize) -> &mut RegexBuilder;
}
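A minimal sketch of builder usage under the proposed API; the pattern and the
flags chosen are illustrative only.

```rust
use regex::RegexBuilder;

fn main() {
    // Build a case-insensitive, multi-line regex.
    let re = RegexBuilder::new(r"^hello")
        .case_insensitive(true)
        .multi_line(true)
        .build()
        .unwrap();

    // With (m), ^ also matches at the start of each line.
    assert!(re.is_match("first line\nHELLO there"));
}
```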
### Captures

A `Captures` value stores the locations of all matching capturing groups for
a single match. It provides convenient access to those locations indexed by
either number or, if available, name.

The first capturing group (index `0`) is always unnamed and always corresponds
to the entire match. Other capturing groups correspond to groups in the
pattern. Capturing groups are indexed by the position of their leftmost
parenthesis in the pattern.

Note that `Captures` is a type constructor with a single parameter: the
lifetime of the text searched by the corresponding regex. In particular, the
lifetime of `Captures` is not tied to the lifetime of a `Regex`.
impl<'t> Captures<'t> {
    /// Returns the match associated with the capture group at index `i`. If
    /// `i` does not correspond to a capture group, or if the capture group
    /// did not participate in the match, then `None` is returned.
    pub fn get(&self, i: usize) -> Option<Match<'t>>;

    /// Returns the match for the capture group named `name`. If `name` isn't a
    /// valid capture group or didn't match anything, then `None` is returned.
    pub fn name(&self, name: &str) -> Option<Match<'t>>;

    /// Returns the number of captured groups. This is always at least 1, since
    /// the first unnamed capturing group corresponding to the entire match
    /// always exists.
    pub fn len(&self) -> usize;

    /// Expands all instances of $name in the text given to the value of the
    /// corresponding named capture group. The expanded string is written to
    /// dst.
    ///
    /// The name in $name may be an integer corresponding to the index of a
    /// capture group or it can be the name of a capture group. If the name
    /// isn't a valid capture group, then it is replaced with an empty string.
    ///
    /// The longest possible name is used. e.g., $1a looks up the capture group
    /// named 1a and not the capture group at index 1. To exert more precise
    /// control over the name, use braces, e.g., ${1}a.
    ///
    /// To write a literal $, use $$.
    pub fn expand(&self, replacement: &str, dst: &mut String);
}
The `Captures` type impls `Debug`, `Index<usize>` (for numbered capture groups)
and `Index<str>` (for named capture groups). A downside of the `Index` impls is
that the return value is bounded to the lifetime of the `Captures` value
instead of the lifetime of the actual text searched, because of how the `Index`
trait is defined. Callers can work around that limitation if necessary by using
an explicit method such as `get` or `name`.
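A short sketch of accessing capture groups by index, by name, via the `Index`
impls and via `expand`; the key/value pattern is illustrative only.

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?P<key>[a-z]+)=(?P<value>\d+)").unwrap();
    let caps = re.captures("timeout=30").unwrap();

    // Group 0 is always the entire match; other groups can be fetched by
    // number or, when named, by name.
    assert_eq!(caps.get(0).unwrap().as_str(), "timeout=30");
    assert_eq!(caps.name("key").unwrap().as_str(), "timeout");

    // The Index impls are convenient, but tie the result's lifetime to `caps`.
    assert_eq!(&caps["value"], "30");

    // Expansion writes the interpolated replacement into a String.
    let mut dst = String::new();
    caps.expand("$key -> $value", &mut dst);
    assert_eq!(dst, "timeout -> 30");
}
```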
### Replacer

The `Replacer` trait is a helper trait that makes the various `replace` methods
on `Regex` more ergonomic. In particular, it makes it possible to use either a
standard string as a replacement, or a closure with more explicit access to a
`Captures` value.
pub trait Replacer {
    /// Appends text to dst to replace the current match.
    ///
    /// The current match is represented by caps, which is guaranteed to have a
    /// match at capture group 0.
    ///
    /// For example, a no-op replacement would be
    /// dst.push_str(caps.get(0).unwrap().as_str()).
    fn replace_append(&mut self, caps: &Captures, dst: &mut String);

    /// Return a fixed unchanging replacement string.
    ///
    /// When doing replacements, if access to Captures is not needed, then
    /// it can be beneficial from a performance perspective to avoid finding
    /// sub-captures. In general, this is called once for every call to
    /// replacen.
    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
        None
    }
}
Along with this trait, there is also a helper type, `NoExpand`, that implements
`Replacer` like so:

pub struct NoExpand<'t>(pub &'t str);

impl<'t> Replacer for NoExpand<'t> {
    fn replace_append(&mut self, _: &Captures, dst: &mut String) {
        dst.push_str(self.0);
    }

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
        Some(Cow::Borrowed(self.0))
    }
}
This permits callers to use `NoExpand` with the `replace` methods to guarantee
that the replacement string is never searched for `$group` replacement syntax.

We also provide two more implementations of the `Replacer` trait: `&str` and
`FnMut(&Captures) -> String`.
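A sketch of the two most common kinds of replacer in action: a closure that
computes each replacement from the `Captures`, and `NoExpand` for a literal
replacement string. The arithmetic pattern is illustrative only.

```rust
use regex::{Captures, NoExpand, Regex};

fn main() {
    let re = Regex::new(r"\d+").unwrap();

    // A closure gets full access to the Captures of each match.
    let doubled = re.replace_all("3 + 4", |caps: &Captures| {
        let n: u64 = caps.get(0).unwrap().as_str().parse().unwrap();
        (n * 2).to_string()
    });
    assert_eq!(doubled, "6 + 8");

    // NoExpand guarantees the replacement is used literally, so `$0` is not
    // treated as a group reference here.
    let literal = re.replace_all("3 + 4", NoExpand("$0"));
    assert_eq!(literal, "$0 + $0");
}
```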
### quote

There is one free function in `regex`:

/// Escapes all regular expression meta characters in `text`.
///
/// The string returned may be safely used as a literal in a regex.
pub fn quote(text: &str) -> String;
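A small sketch showing why escaping matters when user input is spliced into a
larger pattern; the needle shown is illustrative only.

```rust
use regex::{quote, Regex};

fn main() {
    // The needle contains `+` and `?`, which are regex metacharacters.
    let needle = "1+1=2?";

    // Quoting lets the needle be embedded in a pattern as a literal.
    let re = Regex::new(&format!("^{}$", quote(needle))).unwrap();
    assert!(re.is_match("1+1=2?"));

    // Without quoting, `1+1=2?` would instead match strings like "11=2".
    assert!(!re.is_match("11=2"));
}
```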
### RegexSet

A `RegexSet` represents the union of zero or more regular expressions. It is a
specialized machine that can match multiple regular expressions simultaneously.
Conceptually, it is similar to joining multiple regexes as alternates, e.g.,
`re1|re2|...|reN`, with one crucial difference: in a `RegexSet`, multiple
expressions can match. This means that each pattern can be reasoned about
independently. A `RegexSet` is ideal for building simple lexers or an HTTP
router.

Because of their specialized nature, regex sets can only report which regexes
match. They do not report match locations. In theory, this could be added in
the future, but doing so is difficult.
pub struct RegexSet(_);

impl RegexSet {
    /// Constructs a new RegexSet from the given sequence of patterns.
    ///
    /// The order of the patterns given is used to assign increasing integer
    /// ids starting from 0. Namely, matches are reported in terms of these ids.
    pub fn new<I, S>(patterns: I) -> Result<RegexSet, Error>
        where S: AsRef<str>, I: IntoIterator<Item=S>;

    /// Returns the total number of regexes in this set.
    pub fn len(&self) -> usize;

    /// Returns true if and only if one or more regexes in this set match
    /// somewhere in the given text.
    pub fn is_match(&self, text: &str) -> bool;

    /// Returns the set of regular expressions that match somewhere in the given
    /// text.
    pub fn matches(&self, text: &str) -> SetMatches;
}
`RegexSet` impls the `Debug` and `Clone` traits.

The `SetMatches` type is queryable and implements `IntoIterator`.
pub struct SetMatches(_);

impl SetMatches {
    /// Returns true if this set contains 1 or more matches.
    pub fn matched_any(&self) -> bool;

    /// Returns true if and only if the regex identified by the given id is in
    /// this set of matches.
    ///
    /// This panics if the id given is >= the number of regexes in the set that
    /// these matches came from.
    pub fn matched(&self, id: usize) -> bool;

    /// Returns the total number of regexes in the set that created these
    /// matches.
    pub fn len(&self) -> usize;

    /// Returns an iterator over the ids in the set that correspond to a match.
    pub fn iter(&self) -> SetMatchesIter;
}
`SetMatches` impls the `Debug` and `Clone` traits.

Note that a builder is not proposed for `RegexSet` in this RFC; however, it is
likely one will be added at some point in a backwards compatible way.
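A sketch of the routing-style use case; the patterns are illustrative only, and
ids are assigned in the order the patterns are given.

```rust
use regex::RegexSet;

fn main() {
    let set = RegexSet::new(&[
        r"^/users/",    // id 0
        r"^/posts/\d+", // id 1
        r"\.json$",     // id 2
    ]).unwrap();

    // Unlike alternation in a single Regex, every matching pattern is
    // reported, not just the first one.
    let matches = set.matches("/posts/42.json");
    assert!(matches.matched_any());
    assert!(!matches.matched(0));
    assert!(matches.matched(1));
    assert!(matches.matched(2));

    // The matching ids can also be collected from the iterator.
    let ids: Vec<usize> = matches.iter().collect();
    assert_eq!(ids, vec![1, 2]);
}
```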
### The `bytes` submodule

All of the APIs above are explicitly for searching text, where text has type
`&str`. While this author believes that suits most use cases, it should also be
possible to run a regex over arbitrary bytes, i.e., `&[u8]`. One particular use
case is quickly searching a file via a memory map. If regexes could only search
`&str`, then one would have to verify the file was valid UTF-8 first, which
could be costly. Moreover, if the file isn't valid UTF-8, then you either can't
search it, or you have to allocate a new string and lossily copy the contents.
Neither case is particularly ideal. It would instead be nice to just search the
`&[u8]` directly.
This RFC proposes including a `bytes` submodule in the crate. The API of this
submodule is a clone of the API described so far, except with `&str` replaced
by `&[u8]` for the search text (patterns are still `&str`). The clone includes
`Regex` itself, along with all supporting types and traits such as `Captures`,
`Replacer`, `FindIter`, `RegexSet`, `RegexBuilder` and so on. (This RFC
describes some alternative designs in a subsequent section.)
Since the API is a clone of what has been seen so far, it is not written out
again. Instead, we'll discuss the key differences.

Again, the first difference is that a `bytes::Regex` can search `&[u8]` while a
`Regex` can search `&str`.

The second difference is that a `bytes::Regex` can completely disable Unicode
support and explicitly match arbitrary bytes. The details:
- The `u` flag can be disabled even when disabling it might cause the regex to
  match invalid UTF-8. When the `u` flag is disabled, the regex is said to be
  in "ASCII compatible" mode.
- In ASCII compatible mode, neither Unicode codepoints nor Unicode character
  classes are allowed.
- In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`) revert
  to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps to
  `[[:digit:]]` and `\s` maps to `[[:space:]]`.
- In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
  determine whether a byte is a word byte or not.
- Hexadecimal notation can be used to specify arbitrary bytes instead of
  Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
  literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint
  that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation.
- In ASCII compatible mode, `.` matches any byte except for `\n`, instead of
  any Unicode codepoint. When the `s` flag is enabled, `.` matches any byte.
An interesting property of the above is that while the Unicode flag is enabled,
a `bytes::Regex` is guaranteed to match only valid UTF-8 in a `&[u8]`. Like
`Regex`, a `bytes::Regex` enables the Unicode flag by default.

N.B. The Unicode flag can also be selectively disabled in a `Regex`, but not in
a way that permits matching invalid UTF-8.
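A minimal sketch of matching raw bytes with the Unicode flag disabled; the
pattern and haystack are illustrative only.

```rust
use regex::bytes::Regex;

fn main() {
    // Disabling the `u` flag puts the pattern in ASCII compatible mode, so
    // `\xFF` below is the literal byte 0xFF rather than the codepoint U+00FF,
    // and `\w` reverts to its ASCII definition.
    let re = Regex::new(r"(?-u)\xFF\w+").unwrap();

    // The haystack is not valid UTF-8, which is exactly the point.
    let haystack: &[u8] = b"\xFFhello";
    assert!(re.is_match(haystack));

    let m = re.find(haystack).unwrap();
    assert_eq!((m.start(), m.end()), (0, 6));
}
```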
# Drawbacks

## Guaranteed linear time matching

A significant contract in the API of the `regex` crate is that all searching
has worst case `O(n)` complexity, where `n ~ length(text)`. (The size of the
regular expression is taken as a constant.) This contract imposes significant
restrictions on both the implementation and the set of features exposed in the
pattern language. A full analysis is beyond the scope of this RFC, but here are
the highlights:
- Unbounded backtracking can't be used to implement matching. Backtracking can be quite fast in practice (indeed, the current implementation uses bounded backtracking in some cases), but has worst case exponential time.
- Permitting backreferences in the pattern language can cause matching to become NP-complete, which (probably) can't be solved in linear time.
- Arbitrary look around is probably difficult to fit into a linear time guarantee in practice.
The benefit to the linear time guarantee is just that: no matter what, all searching completes in linear time with respect to the search text. This is a valuable guarantee to make, because it means that one can execute arbitrary regular expressions over arbitrary input and be absolutely sure that it will finish in some "reasonable" time.
Of course, in practice, constants that are omitted from complexity analysis
actually matter. For this reason, the `regex` crate takes a number of steps to
keep constants low, for example, by placing a limit on the size of the regular
expression and by choosing an appropriate matching engine when another might
result in higher constant factors.
This particular drawback sets Rust's regular expression library apart from most other regular expression libraries that programmers may be familiar with. Languages such as Java, Python, Perl, Ruby, PHP and C++ support more flavorful regexes by default. Go is the only language this author knows of whose standard regex implementation guarantees linear time matching. RE2 is also worth mentioning: it is a C++ regex library that guarantees linear time matching. There are other implementations of regexes that guarantee linear time matching (TRE, for example), but none of them are particularly popular.
It is also worth noting that since Rust's FFI is zero cost, one can bind to existing regex implementations that provide more features (bindings for both PCRE1 and Oniguruma exist today).
## Allocation

The `regex` API assumes that the implementation can dynamically allocate
memory. Indeed, the current implementation takes advantage of this. A regex
library with no requirement on dynamic memory allocation would look
significantly different than the one that exists today. Dynamic memory
allocation is used pervasively in the parser, the compiler and even during
search.

The benefit of permitting dynamic memory allocation is that it makes the
implementation and API simpler. The cost is that it makes the `regex` crate
impossible to use in environments without dynamic memory allocation.

This author isn't aware of any regex library that can work without dynamic
memory allocation.

With that said, `regex` may want to grow custom allocator support when the
corresponding traits stabilize.
## Synchronization is implicit

Every `Regex` value can be safely used from multiple threads simultaneously.
Since a `Regex` has interior mutable state, this implies that it must do some
kind of synchronization in order to be safe.
There are some reasons why we might want to do synchronization automatically:

- `Regex` exposes an immutable API. That is, from looking at its set of
  methods, none of them borrow the `Regex` mutably (or otherwise claim to
  mutate the `Regex`). This author claims that since there is no observable
  mutation of a `Regex`, it not being thread safe would violate the principle
  of least surprise.
- Often, a `Regex` should be compiled once and reused repeatedly in multiple
  searches. To facilitate this, `lazy_static!` can be used to guarantee that
  compilation happens exactly once, as shown in the sketch after this list.
  `lazy_static!` requires its types to be `Sync`. A user of `Regex` could work
  around this by wrapping a `Regex` in a `Mutex`, but this would make misuse
  too easy. For example, locking a `Regex` in one thread would prevent
  simultaneous searching in another thread.
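A minimal sketch of that `lazy_static!` pattern, assuming the third-party
`lazy_static` crate is available as a dependency; the date pattern is
illustrative only.

```rust
#[macro_use]
extern crate lazy_static;
extern crate regex;

use regex::Regex;

lazy_static! {
    // Compiled exactly once, on first use, and shared across threads.
    // This works precisely because Regex is Sync.
    static ref DATE: Regex = Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
}

fn has_date(text: &str) -> bool {
    DATE.is_match(text)
}

fn main() {
    assert!(has_date("released on 2016-05-11"));
}
```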
Synchronization has overhead, although it is extremely small (and dwarfed by
general matching overhead). The author has done ad hoc benchmarking of the
`regex` implementation against GNU grep, and per-match overhead is comparable
in single-threaded use. In this author's opinion, it is good enough. If
synchronization overhead across multiple threads is too much, callers may elect
to clone the `Regex` so that each thread gets its own copy. Cloning a `Regex`
is no more expensive than what would be done internally and automatically
anyway, but it does eliminate contention.
An alternative is to increase the API surface and have types that are synchronized by default and types that aren't. This was discussed at length in this thread. My conclusion from that thread is that we either expand the surface of the API, break the current API, or keep implicit synchronization as-is. In this author's opinion, neither expanding the API nor breaking it is worth avoiding negligible synchronization overhead.
## The implementation is complex

Regular expression engines have a lot of moving parts, and it often requires
quite a bit of context on how the whole library is organized in order to make
significant contributions. Therefore, moving `regex` into `rust-lang` is a
maintenance hazard. This author has tried to mitigate this hazard somewhat by
doing the following:

- Offering to mentor contributions. Significant contributions have thus far
  fizzled, but minor contributions---even to complex code like the DFA---have
  been successful.
- Documenting not just the API, but the internals. The DFA is, for example,
  heavily documented.
- Writing a `HACKING.md` guide that gives a sweeping overview of the design.
- Maintaining significant test and benchmark suites.
With that said, there is still a lot more that could be done to mitigate the maintenance hazard. In this author's opinion, the interaction between the three parts of the implementation (parsing, compilation, searching) is not documented clearly enough.
# Alternatives

## Big picture

The most important alternative is to decide not to bless a particular
implementation of regular expressions. We might want to go this route for any
number of reasons (see: Drawbacks). However, the `regex` crate is already
widely used, which provides at least some evidence that some set of programmers
find it good enough for general purpose regex searching.
The impact of not moving `regex` into `rust-lang` is, plainly, that Rust won't
have an "officially blessed" regex implementation. Many programmers may
appreciate the complexity of a regex implementation, and therefore might insist
that one be officially maintained. However, to be honest, it isn't quite clear
what would happen in practice. This author is speculating.
## bytes::Regex

This RFC proposes stabilizing the `bytes` submodule of the `regex` crate in
its entirety. The `bytes` submodule is a near clone of the API at the crate
level, with one important difference: it searches `&[u8]` instead of `&str`.
This design was motivated by a similar split in `std`, but there are
alternatives.
### A regex trait

One alternative is designing a trait that looks something like this:

trait Regex {
    type Text: ?Sized;

    fn is_match(&self, text: &Self::Text) -> bool;
    fn find(&self, text: &Self::Text) -> Option<Match>;
    fn find_iter<'r, 't>(&'r self, text: &'t Self::Text) -> Matches<'r, 't, Self::Text>;
    // and so on
}
However, there are a couple of problems with this approach. First and foremost,
the use cases of such a trait aren't exactly clear. It does make it possible to
write generic code that searches either a `&str` or a `&[u8]`, but the
semantics of searching `&str` (always valid UTF-8) and `&[u8]` are quite a bit
different with respect to the original `Regex`. Secondly, the trait isn't
obviously implementable by others. For example, some of the methods return
iterator types such as `Matches` that are typically implemented with a
lower level API that isn't exposed. This suggests that a straightforward
traitification of the current API probably isn't appropriate, and that,
perhaps, a better trait needs to be more fundamental to regex searching.
Perhaps the strongest reason not to adopt this design for `regex 1.0` is that
we don't have any experience with it and there hasn't been any demand for it.
In particular, it could be prototyped in another crate.
### Reuse some types

In the current proposal, the `bytes` submodule completely duplicates the
top-level API, including all iterator types, `Captures` and even the `Replacer`
trait. We could parameterize many of those types over the type of the text
searched. For example, the proposed `Replacer` trait looks like this:

trait Replacer {
    fn replace_append(&mut self, caps: &Captures, dst: &mut String);

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
        None
    }
}
We might add an associated type like so:

trait Replacer {
    type Text: ToOwned + ?Sized;

    fn replace_append(
        &mut self,
        caps: &Captures<Self::Text>,
        dst: &mut <Self::Text as ToOwned>::Owned,
    );

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, Self::Text>> {
        None
    }
}
But parameterizing the `Captures` type is a little bit tricky. Namely, methods
like `get` want to slice the text at match offsets, but this can't be done
safely in generic code without introducing another public trait.
The final death knell in this idea is that these two implementations cannot
coexist:

impl<F> Replacer for F where F: FnMut(&Captures) -> String {
    type Text = str;

    fn replace_append(&mut self, caps: &Captures, dst: &mut String) {
        dst.push_str(&(*self)(caps));
    }
}

impl<F> Replacer for F where F: FnMut(&Captures) -> Vec<u8> {
    type Text = [u8];

    fn replace_append(&mut self, caps: &Captures, dst: &mut Vec<u8>) {
        dst.extend(&(*self)(caps));
    }
}
Perhaps there is a path through this using yet more types or more traits, but without a really strong motivating reason to find it, I'm not convinced it's worth it. Duplicating all of the types is unfortunate, but it's simple.
# Unresolved questions

The `regex` repository contains more than just the `regex` crate.
## regex-syntax

This crate exposes a regular expression parser and abstract syntax that are
completely divorced from compilation and searching. It is not part of `regex`
proper since it may experience more frequent breaking changes and is far less
frequently used. It is not clear whether this crate will ever see `1.0`, and if
it does, what criteria would be used to judge it suitable for `1.0`.
Nevertheless, it is a useful public API, but it is not part of this RFC.
## regex-capi

Recently, `regex-capi` was built to provide a C API to this regex library. It
has been used to build cgo bindings to this library for Go. Given its young
age, it is not part of this proposal, but it will be maintained as a pre-1.0
crate in the same repository.
## regex_macros

The `regex!` compiler plugin is a macro that can compile regular expressions
when your Rust program compiles. Stated differently, `regex!("...")` is
transformed into Rust code that executes a search of the given pattern
directly. It was written two years ago and largely hasn't changed since. When
it was first written, it had two major benefits:

1. If there was a syntax error in your regex, your Rust program would not
   compile.
2. It was faster.

Today, (1) can be simulated in practice with the use of a Clippy lint, and (2)
is no longer true. In fact, `regex!` is at least one order of magnitude slower
than the standard `Regex` implementation.
The future of `regex_macros` is not clear. In one sense, since it is a compiler
plugin, there hasn't been much interest in developing it further because its
audience is necessarily limited. In another sense, it's not entirely clear what
its implementation path is. It would take considerable work for it to beat the
current `Regex` implementation (if that's even possible). More discussion on
this is out of scope.
## Dependencies

As of now, `regex` has several dependencies:

- `aho-corasick`
- `memchr`
- `thread_local`
- `regex-syntax`
- `utf8-ranges`

All of them except for `thread_local` were written by this author, and were
primarily motivated for use in the `regex` crate. They were split out because
they seem generally useful.
There may be other things in `regex` (today or in the future) that may also be
helpful to others outside the strict context of `regex`. Is it beneficial to
split such things out and create a longer list of dependencies? Or should we
keep `regex` as tight as possible?
## Exposing more internals

It is conceivable that others might be interested in the regex compiler or in
lower level access to the matching engines. We could do something similar to
`regex-syntax` and expose some internals in a separate crate. However, there
isn't a pressing desire to do this at the moment, and it would probably require
a good deal of work.
# Breaking changes

This section of the RFC lists all breaking changes between `regex 0.1` and the
API proposed in this RFC.
- `find` and `find_iter` now return values of type `Match` instead of
  `(usize, usize)`. The `Match` type has `start` and `end` methods which can be
  used to recover the original offsets, as well as an `as_str` method to get
  the matched text. (See the migration sketch after this list.)
- The `Captures` type no longer has any iterators defined. Instead, callers
  should use the `Regex::capture_names` method.
- `bytes::Regex` enables the Unicode flag by default. Previously, it disabled
  it by default. The flag can be disabled in the pattern with `(?-u)`.
- The definition of the `Replacer` trait was completely reworked. Namely, its
  API inverts control of allocation so that the caller must provide a `String`
  to write to. Previous implementors will need to examine the new API. Moving
  to the new API should be straightforward.
- The `is_empty` method on `Captures` was removed, since it always returns
  `false` (because every `Captures` has at least one capture group
  corresponding to the entire match).
- The `PartialEq` and `Eq` impls on `Regex` were removed. If you need this
  functionality, add a newtype around `Regex` and write the corresponding
  `PartialEq` and `Eq` impls.
- The lifetime parameters for the `iter` and `iter_named` methods on `Captures`
  were fixed. The corresponding iterator types, `SubCaptures` and
  `SubCapturesNamed`, grew an additional lifetime parameter.
- The constructor `Regex::with_size_limit` was removed. It can be replaced with
  use of `RegexBuilder`.
- The `is_match` free function was removed. Instead, compile a `Regex`
  explicitly and call the `is_match` method.
- Many iterator types were renamed. (e.g., `RegexSplits` to `SplitsIter`.)
- Replacements now return a `Cow<str>` instead of a `String`. Namely, the
  subject text doesn't need to be copied if there are no replacements. Callers
  may need to add `into_owned()` calls to convert the `Cow<str>` to a proper
  `String`.
- The `Error` type no longer has the `InvalidSet` variant, since the error is
  no longer possible. Its `Syntax` variant was also modified to wrap a `String`
  instead of a `regex_syntax::Error`. If you need access to specific parse
  error information, use the `regex-syntax` crate directly.
- To allow future growth, some character classes may no longer compile, to make
  room for possibly adding class set notation in the future.
- Various iterator types have been renamed.
- The `RegexBuilder` type now takes `&mut self` on most methods instead of
  `self`. Additionally, the final build step now uses `build()` instead of
  `compile()`.