rfcs/text/1620-regex-1.0.md

972 lines
40 KiB
Markdown
Raw Normal View History

2016-05-12 08:45:44 +08:00
- Feature Name: regex-1.0
- Start Date: 2016-05-11
2016-09-13 06:16:27 +08:00
- RFC PR: [rust-lang/rfcs#1620](https://github.com/rust-lang/rfcs/pull/1620)
- Rust Issue: N/A
2016-05-12 08:45:44 +08:00
# Table of contents
* [Summary][summary]
* [Motivation][motivation]
* [Detailed design][design]
* [Syntax][syntax]
* [Evolution][evolution]
* [Concrete syntax][concrete-syntax]
* [Expansion concerns][expansion-concerns]
* [Core API][core-api]
* [RegexBuilder][regexbuilder]
* [Replacer][replacer]
* [quote][quote]
* [RegexSet][regexset]
* [The `bytes` submodule][the-bytes-submodule]
* [Drawbacks][drawbacks]
* [Guaranteed linear time matching][guaranteed-linear-time-matching]
* [Allocation][allocation]
* [Synchronization is implicit][synchronization-is-implicit]
* [The implementation is complex][the-implementation-is-complex]
* [Alternatives][alternatives]
* [Big picture][big-picture]
* [`bytes::Regex`][bytesregex]
* [A regex trait][a-regex-trait]
* [Reuse some types][reuse-some-types]
* [Unresolved questions][unresolved]
* [`regex-syntax`][regex-syntax]
* [`regex-capi`][regex-capi]
* [`regex_macros`][regex_macros]
* [Dependencies][dependencies]
* [Exposing more internals][exposing-more-internals]
* [Breaking changes][breaking-changes]
# Summary
[summary]: #summary
This RFC proposes a 1.0 API for the `regex` crate and therefore a move out of
the `rust-lang-nursery` organization and into the `rust-lang` organization.
Since the API of `regex` has largely remained unchanged since its inception
[2 years ago](https://github.com/rust-lang/rfcs/blob/master/text/0042-regexps.md),
significant emphasis is placed on retaining the existing API. Some minor
breaking changes are proposed.
# Motivation
[motivation]: #motivation
Regular expressions are a widely used tool and most popular programming
languages either have an implementation of regexes in their standard library,
or there exists at least one widely used third party implementation. It
therefore seems reasonable for Rust to do something similar.
The `regex` crate specifically serves many use cases, most of which are somehow
related to searching strings for patterns. Describing regular expressions in
detail is beyond the scope of this RFC, but briefly, these core use cases are
supported in the main API:
1. Testing whether a pattern matches some text.
2. Finding the location of a match of a pattern in some text.
3. Finding the location of a match of a pattern---and locations of all its
capturing groups---in some text.
4. Iterating over successive non-overlapping matches of (2) and (3).
The expected outcome is that the `regex` crate should be the preferred default
choice for matching regular expressions when writing Rust code. This is already
true today; this RFC formalizes it.
# Detailed design
[design]: #detailed-design
## Syntax
[syntax]: #syntax
### Evolution
[evolution]: #evolution
The public API of a `regex` library *includes* the syntax of a regular
expression. A change in the semantics of the syntax can cause otherwise working
programs to break, yet, we'd still like the option to expand the syntax if
necessary. Thus, this RFC proposes:
1. Any change that causes a previously invalid regex pattern to become valid is
*not* a breaking change. For example, the escape sequence `\y` is not a
valid pattern, but could become one in a future release without a major
version bump.
2. Any change that causes a previously valid regex pattern to become invalid
*is* a breaking change.
3. Any change that causes a valid regex pattern to change its matching
semantics *is* a breaking change. (For example, changing `\b` from "word
boundary assertion" to "backspace character.")
2016-08-22 05:32:40 +08:00
Bug fixes and Unicode upgrades are exceptions to both (2) and (3).
2016-05-12 08:45:44 +08:00
Another interesting exception to (2) is that compiling a regex can fail if the
entire compiled object would exceed some pre-defined user configurable size.
In particular, future changes to the compiler could cause certain instructions
to use more memory, or indeed, the representation of the compiled regex could
change completely. This could cause a regex that fit under the size limit to
no longer fit, and therefore fail to compile. These cases are expected to be
extremely rare in practice. Notably, the default size limit is `10MB`.
### Concrete syntax
[concrete-syntax]: #concrete-syntax
The syntax is exhaustively documented in the current public API documentation:
http://doc.rust-lang.org/regex/regex/index.html#syntax
To my knowledge, the evolution as proposed in this RFC has been followed since
2016-05-19 18:35:27 +08:00
`regex` was created. The syntax has largely remained unchanged with few
2016-05-12 08:45:44 +08:00
additions.
### Expansion concerns
[expansion-concerns]: #expansion-concerns
There are a few possible avenues for expansion, and we take measures to make
sure they are possible with respect to API evolution.
* Escape sequences are often blessed with special semantics. For example, `\d`
is a Unicode character class that matches any digit and `\b` is a word
boundary assertion. We may one day like to add more escape sequences with
special semantics. For this reason, any unrecognized escape sequence makes a
pattern invalid.
* If we wanted to expand the syntax with various look-around operators, then it
would be possible since most common syntax is considered an invalid pattern
today. In particular, all of the [syntactic forms listed
here](http://www.regular-expressions.info/refadv.html) are invalid patterns
in `regex`.
* Character class sets are another potentially useful feature that may be worth
adding. Currently, [various forms of set
notation](http://www.regular-expressions.info/refcharclass.html) are treated
as valid patterns, but this RFC proposes making them invalid patterns before
`1.0`.
* Additional named Unicode classes or codepoints may be desirable to add.
Today, any pattern of the form `\p{NAME}` where `NAME` is unrecognized is
considered invalid, which leaves room for expansion.
* If all else fails, we can introduce new flags that enable new features that
conflict with stable syntax. This is possible because using an unrecognized
flag results in an invalid pattern.
## Core API
[core-api]: #core-api
The core API of the `regex` crate is the `Regex` type:
```rust
pub struct Regex(_);
```
It has one primary constructor:
```rust
impl Regex {
/// Creates a new regular expression. If the pattern is invalid or otherwise
/// fails to compile, this returns an error.
pub fn new(pattern: &str) -> Result<Regex, Error>;
}
```
And five core search methods. All searching completes in worst case linear time
with respect to the search text (the size of the regex is taken as a constant).
```rust
impl Regex {
/// Returns true if and only if the text matches this regex.
pub fn is_match(&self, text: &str) -> bool;
/// Returns the leftmost-first match of this regex in the text given. If no
/// match exists, then None is returned.
///
/// The leftmost-first match is defined as the first match that is found
/// by a backtracking search.
pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>;
2016-05-12 08:45:44 +08:00
/// Returns an iterator of successive non-overlapping matches of this regex
/// in the text given.
2016-08-22 05:20:29 +08:00
pub fn find_iter<'r, 't>(&'r self, text: &'t str) -> Matches<'r, 't>;
2016-05-12 08:45:44 +08:00
/// Returns the leftmost-first match of this regex in the text given with
/// locations for all capturing groups that participated in the match.
pub fn captures(&self, text: &str) -> Option<Captures>;
/// Returns an iterator of successive non-overlapping matches with capturing
/// group information in the text given.
2016-08-22 05:20:29 +08:00
pub fn captures_iter<'r, 't>(&'r self, text: &'t str) -> CaptureMatches<'r, 't>;
2016-05-12 08:45:44 +08:00
}
```
(N.B. The `captures` method can technically replace all uses of `find` and
`is_match`, but is potentially slower. Namely, the API reflects a performance
trade off: the more you ask for, the harder the regex engine has to work.)
There is one additional, but idiosyncratic, search method:
```rust
impl Regex {
/// Returns the end location of a match if one exists in text.
///
/// This may return a location preceding the end of a proper leftmost-first
/// match. In particular, it may return the location at which a match is
/// determined to exist. For example, matching `a+` against `aaaaa` will
/// return `1` while the end of the leftmost-first match is actually `5`.
///
/// This has the same performance characteristics as `is_match`.
pub fn shortest_match(&self, text: &str) -> Option<usize>;
}
```
And two methods for splitting:
```rust
impl Regex {
/// Returns an iterator of substrings of `text` delimited by a match of
/// this regular expression. Each element yielded by the iterator corresponds
/// to text that *isn't* matched by this regex.
2016-08-22 05:20:29 +08:00
pub fn split<'r, 't>(&'r self, text: &'t str) -> Split<'r, 't>;
2016-05-12 08:45:44 +08:00
/// Returns an iterator of at most `limit` substrings of `text` delimited by
/// a match of this regular expression. Each element yielded by the iterator
/// corresponds to text that *isn't* matched by this regex. The remainder of
/// `text` that is not split will be the last element yielded by the
/// iterator.
2016-08-22 05:20:29 +08:00
pub fn splitn<'r, 't>(&'r self, text: &'t str, limit: usize) -> SplitN<'r, 't>;
2016-05-12 08:45:44 +08:00
}
```
And three methods for replacement. Replacement is discussed in more detail in a
subsequent section.
```rust
impl Regex {
/// Replaces matches of this regex in `text` with `rep`. If no matches were
/// found, then the given string is returned unchanged, otherwise a new
/// string is allocated.
///
/// `replace` replaces the first match only. `replace_all` replaces all
/// matches. `replacen` replaces at most `limit` matches.
fn replace<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
fn replace_all<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
fn replacen<'t, R: Replacer>(&self, text: &'t str, limit: usize, rep: R) -> Cow<'t, str>;
}
```
And lastly, three simple accessors:
```rust
impl Regex {
/// Returns the original pattern string.
pub fn as_str(&self) -> &str;
/// Returns an iterator over all capturing group in the pattern in the order
/// they were defined (by position of the leftmost parenthesis). The name of
/// the group is yielded if it has a name, otherwise None is yielded.
2016-08-22 05:20:29 +08:00
pub fn capture_names(&self) -> CaptureNames;
2016-05-12 08:45:44 +08:00
/// Returns the total number of capturing groups in the pattern. This
/// includes the implicit capturing group corresponding to the entire
/// pattern.
pub fn captures_len(&self) -> usize;
}
```
Finally, `Regex` impls the `Send`, `Sync`, `Display`, `Debug`, `Clone` and
`FromStr` traits from the standard library.
## Error
The `Error` enum is an *extensible* enum, similar to `std::io::Error`,
corresponding to the different ways that regex compilation can fail. In
particular, this means that adding a new variant to this enum is not a breaking
change. (Removing or changing an existing variant is still a breaking change.)
```rust
pub enum Error {
/// A syntax error.
2016-05-19 07:59:23 +08:00
Syntax(SyntaxError),
2016-05-12 08:45:44 +08:00
/// The compiled program exceeded the set size limit.
/// The argument is the size limit imposed.
CompiledTooBig(usize),
/// Hints that destructuring should not be exhaustive.
///
/// This enum may grow additional variants, so this makes sure clients
/// don't count on exhaustive matching. (Otherwise, adding a new variant
/// could break existing code.)
#[doc(hidden)]
__Nonexhaustive,
}
```
Note that the `Syntax` variant could contain the `Error` type from the
2016-05-19 07:59:23 +08:00
`regex-syntax` crate, but this couples `regex-syntax` to the public API
of `regex`. We sidestep this hazard by defining a newtype in `regex` that
internally wraps `regex_syntax::Error`. This also enables us to selectively
expose more information in the future.
2016-05-12 08:45:44 +08:00
## RegexBuilder
[regexbuilder]: #regexbuilder
In most cases, the construction of a regex is done with `Regex::new`. There are
however some options one might want to tweak. This can be done with a
`RegexBuilder`:
```rust
impl RegexBuilder {
/// Creates a new builder from the given pattern.
pub fn new(pattern: &str) -> RegexBuilder;
/// Compiles the pattern and all set options. If successful, a Regex is
/// returned. Otherwise, if compilation failed, an Error is returned.
///
/// N.B. `RegexBuilder::new("...").compile()` is equivalent to
/// `Regex::new("...")`.
2016-08-22 05:20:29 +08:00
pub fn build(&self) -> Result<Regex, Error>;
2016-05-12 08:45:44 +08:00
/// Set the case insensitive flag (i).
2016-08-22 06:08:57 +08:00
pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the multi line flag (m).
2016-08-22 06:08:57 +08:00
pub fn multi_line(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the dot-matches-any-character flag (s).
2016-08-22 06:08:57 +08:00
pub fn dot_matches_new_line(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the swap-greedy flag (U).
2016-08-22 06:08:57 +08:00
pub fn swap_greed(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the ignore whitespace flag (x).
2016-08-22 06:08:57 +08:00
pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the Unicode flag (u).
2016-08-22 06:08:57 +08:00
pub fn unicode(&mut self, yes: bool) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the approximate size limit (in bytes) of the compiled regular
/// expression.
///
/// If compiling a pattern would approximately exceed this size, then
/// compilation will fail.
2016-08-22 06:08:57 +08:00
pub fn size_limit(&mut self, limit: usize) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
/// Set the approximate size limit (in bytes) of the cache used by the DFA.
///
/// This is a per thread limit. Once the DFA fills the cache, it will be
/// wiped and refilled again. If the cache is wiped too frequently, the
/// DFA will quit and fall back to another matching engine.
2016-08-22 06:08:57 +08:00
pub fn dfa_size_limit(&mut self, limit: usize) -> &mut RegexBuilder;
2016-05-12 08:45:44 +08:00
}
```
## Captures
A `Captures` value stores the locations of all matching capturing groups for
a single match. It provides convenient access to those locations indexed by
either number, or, if available, name.
The first capturing group (index `0`) is always unnamed and always corresponds
to the entire match. Other capturing groups correspond to groups in the
pattern. Capturing groups are indexed by the position of their leftmost
parenthesis in the pattern.
Note that `Captures` is a type constructor with a single parameter: the
lifetime of the text searched by the corresponding regex. In particular, the
lifetime of `Captures` is not tied to the lifetime of a `Regex`.
```rust
impl<'t> Captures<'t> {
/// Returns the match associated with the capture group at index `i`. If
/// `i` does not correspond to a capture group, or if the capture group
/// did not participate in the match, then `None` is returned.
pub fn get(&self, i: usize) -> Option<Match<'t>>;
2016-05-12 08:45:44 +08:00
/// Returns the match for the capture group named `name`. If `name` isn't a
/// valid capture group or didn't match anything, then `None` is returned.
pub fn name(&self, name: &str) -> Option<Match<'t>>;
2016-05-12 08:45:44 +08:00
/// Returns the number of captured groups. This is always at least 1, since
/// the first unnamed capturing group corresponding to the entire match
/// always exists.
pub fn len(&self) -> usize;
/// Expands all instances of $name in the text given to the value of the
/// corresponding named capture group. The expanded string is written to
/// dst.
///
/// The name in $name may be integer corresponding to the index of a capture
/// group or it can be the name of a capture group. If the name isn't a valid
/// capture group, then it is replaced with an empty string.
///
/// The longest possible name is used. e.g., $1a looks up the capture group
/// named 1a and not the capture group at index 1. To exert more precise
/// control over the name, use braces, e.g., ${1}a.
///
/// To write a literal $, use $$.
pub fn expand(&self, replacement: &str, dst: &mut String);
}
```
The `Captures` type impls `Debug`, `Index<usize>` (for numbered capture groups)
and `Index<str>` (for named capture groups). A downside of the `Index` impls is
that the return value is bounded to the lifetime of `Captures` instead of the
lifetime of the actual text searched because of how the `Index` trait is
defined. Callers can work around that limitation if necessary by using an
2016-08-22 05:20:29 +08:00
explicit method such as `get` or `name`.
2016-05-12 08:45:44 +08:00
## Replacer
[replacer]: #replacer
The `Replacer` trait is a helper trait to make the various `replace` methods on
`Regex` more ergonomic. In particular, it makes it possible to use either a
standard string as a replacement, or a closure with more explicit access to a
`Captures` value.
```rust
pub trait Replacer {
/// Appends text to dst to replace the current match.
///
/// The current match is represents by caps, which is guaranteed to have a
/// match at capture group 0.
///
/// For example, a no-op replacement would be
/// dst.extend(caps.at(0).unwrap()).
fn replace_append(&mut self, caps: &Captures, dst: &mut String);
/// Return a fixed unchanging replacement string.
///
/// When doing replacements, if access to Captures is not needed, then
/// it can be beneficial from a performance perspective to avoid finding
/// sub-captures. In general, this is called once for every call to replacen.
fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
None
}
}
```
Along with this trait, there is also a helper type, `NoExpand` that implements
`Replacer` like so:
```rust
pub struct NoExpand<'t>(pub &'t str);
impl<'t> Replacer for NoExpand<'t> {
fn replace_append(&mut self, _: &Captures, dst: &mut String) {
dst.push_str(self.0);
2016-05-12 08:45:44 +08:00
}
fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
Some(Cow::Borrowed(self.0))
2016-05-12 08:45:44 +08:00
}
}
```
This permits callers to use `NoExpand` with the `replace` methods to guarantee
that the replacement string is never searched for `$group` replacement syntax.
We also provide two more implementations of the `Replacer` trait: `&str` and
`FnMut(&Captures) -> String`.
## quote
[quote]: #quote
There is one free function in `regex`:
```rust
/// Escapes all regular expression meta characters in `text`.
///
/// The string returned may be safely used as a literal in a regex.
pub fn quote(text: &str) -> String;
```
## RegexSet
[regexset]: #regexset
A `RegexSet` represents the union of zero or more regular expressions. It is a
specialized machine that can match multiple regular expressions simultaneously.
Conceptually, it is similar to joining multiple regexes as alternates, e.g.,
`re1|re2|...|reN`, with one crucial difference: in a `RegexSet`, multiple
expressions can match. This means that each pattern can be reasoned about
independently. A `RegexSet` is ideal for building simpler lexers or an HTTP
router.
Because of their specialized nature, they can only report which regexes match.
They do not report match locations. In theory, this could be added in the
future, but is difficult.
```rust
pub struct RegexSet(_);
impl RegexSet {
/// Constructs a new RegexSet from the given sequence of patterns.
///
/// The order of the patterns given is used to assign increasing integer
/// ids starting from 0. Namely, matches are reported in terms of these ids.
pub fn new<I, S>(patterns: I) -> Result<RegexSet, Error>
where S: AsRef<str>, I: IntoIterator<Item=S>;
/// Returns the total number of regexes in this set.
pub fn len(&self) -> usize;
/// Returns true if and only if one or more regexes in this set match
/// somewhere in the given text.
pub fn is_match(&self, text: &str) -> bool;
/// Returns the set of regular expressions that match somewhere in the given
/// text.
pub fn matches(&self, text: &str) -> SetMatches;
}
```
`RegexSet` impls the `Debug` and `Clone` traits.
The `SetMatches` type is queryable and implements `IntoIterator`.
```rust
pub struct SetMatches(_);
impl SetMatches {
/// Returns true if this set contains 1 or more matches.
pub fn matched_any(&self) -> bool;
/// Returns true if and only if the regex identified by the given id is in
/// this set of matches.
///
/// This panics if the id given is >= the number of regexes in the set that
/// these matches came from.
pub fn matched(&self, id: usize) -> bool;
/// Returns the total number of regexes in the set that created these
/// matches.
pub fn len(&self) -> usize;
/// Returns an iterator over the ids in the set that correspond to a match.
pub fn iter(&self) -> SetMatchesIter;
}
```
`SetMatches` impls the `Debug` and `Clone` traits.
Note that a builder is not proposed for `RegexSet` in this RFC; however, it is
likely one will be added at some point in a backwards compatible way.
## The `bytes` submodule
[the-bytes-submodule]: #the-bytes-submodule
All of the above APIs have thus far been explicitly for searching `text` where
`text` has type `&str`. While this author believes that suits most use cases,
it should also be possible to search a regex on *arbitrary* bytes, i.e.,
`&[u8]`. One particular use case is quickly searching a file via a memory map.
If regexes could only search `&str`, then one would have to verify it was UTF-8
first, which could be costly. Moreover, if the file isn't valid UTF-8, then you
either can't search it, or you have to allocate a new string and lossily copy
the contents. Neither case is particularly ideal. It would instead be nice to
just search the `&[u8]` directly.
This RFC including a `bytes` submodule in the crate. The API of this submodule
is a clone of the API described so far, except with `&str` replaced by `&[u8]`
for the search text (patterns are still `&str`). The clone includes `Regex`
itself, along with all supporting types and traits such as `Captures`,
`Replacer`, `FindIter`, `RegexSet`, `RegexBuilder` and so on. (This RFC
describes some alternative designs in a subsequent section.)
Since the API is a clone of what has been seen so far, it is not written out
again. Instead, we'll discuss the key differences.
Again, the first difference is that a `bytes::Regex` can search `&[u8]`
while a `Regex` can search `&str`.
The second difference is that a `bytes::Regex` can completely disable Unicode
support and explicitly match arbitrary bytes. The details:
1. The `u` flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
"ASCII compatible" mode.
2. In ASCII compatible mode, neither Unicode codepoints nor Unicode character
classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that
matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation.
6. `.` matches any byte except for `\n` instead of any Unicode codepoint. When
the `s` flag is enabled, `.` matches any byte.
An interesting property of the above is that while the Unicode flag is enabled,
a `bytes::Regex` is *guaranteed* to match only valid UTF-8 in a `&[u8]`. Like
`Regex`, the Unicode flag is enabled by default.
N.B. The Unicode flag can also be selectively disabled in a `Regex`, but not in
a way that permits matching invalid UTF-8.
# Drawbacks
[drawbacks]: #drawbacks
## Guaranteed linear time matching
[guaranteed-linear-time-matching]: #guaranteed-linear-time-matching
A significant contract in the API of the `regex` crate is that all searching
has worst case `O(n)` complexity, where `n ~ length(text)`. (The size of the
regular expression is taken as a constant.) This contract imposes significant
restrictions on both the implementation and the set of features exposed in the
pattern language. A full analysis is beyond the scope of this RFC, but here are
the highlights:
1. Unbounded backtracking can't be used to implement matching. Backtracking can
be quite fast in practice (indeed, the current implementation uses bounded
backtracking in some cases), but has worst case exponential time.
2. Permitting backreferences in the pattern language can cause matching to
become NP-complete, which (probably) can't be solved in linear time.
3. Arbitrary look around is probably difficult to fit into a linear time
guarantee *in practice*.
The benefit to the linear time guarantee is just that: no matter what, all
searching completes in linear time with respect to the search text. This is a
valuable guarantee to make, because it means that one can execute arbitrary
regular expressions over arbitrary input and be absolutely sure that it will
finish in some "reasonable" time.
Of course, in practice, constants that are omitted from complexity analysis
*actually matter*. For this reason, the `regex` crate takes a number of steps
to keep constants low. For example, by placing a limit on the size of the
regular expression or choosing an appropriate matching engine when another
might result in higher constant factors.
This particular drawback segregates Rust's regular expression library from most
other regular expression libraries that programmers may be familiar with.
Languages such as Java, Python, Perl, Ruby, PHP and C++ support more flavorful
regexes by default. Go is the only language this author knows of whose standard
regex implementation guarantees linear time matching. Of course, RE2
is also worth mentioning, which is a C++ regex library that guarantees linear
time matching. There are other implementations of regexes that guarantee linear
time matching (TRE, for example), but none of them are particularly popular.
It is also worth noting that since Rust's FFI is zero cost, one can bind to
existing regex implementations that provide more features (bindings for both
PCRE1 and Oniguruma exist today).
## Allocation
[allocation]: #allocation
The `regex` API assumes that the implementation can dynamically allocate
memory. Indeed, the current implementation takes advantage of this. A `regex`
library that has no requirement on dynamic memory allocation would look
significantly different than the one that exists today. Dynamic memory
allocation is utilized pervasively in the parser, compiler and even during
search.
The benefit of permitting dynamic memory allocation is that it makes the
implementation *and* API simpler. This does make use of the `regex` crate in
environments that don't have dynamic memory allocation impossible.
This author isn't aware of any `regex` library that can work without dynamic
memory allocation.
With that said, `regex` may want to grow custom allocator support when the
corresponding traits stabilize.
## Synchronization is implicit
[synchronization-is-implicit]: #synchronization-is-implicit
Every `Regex` value can be safely used from multiple threads simultaneously.
Since a `Regex` has interior mutable state, this implies that it must do some
kind of synchronization in order to be safe.
There are some reasons why we might want to do synchronization
automatically:
1. `Regex` exposes an *immutable API*. That is, from looking at its set of
methods, none of them borrow the `Regex` mutably (or otherwise claim to
mutate the `Regex`). This author claims that since there is no *observable
mutation* of a `Regex`, it *not* being thread safe would violate the
principle of least surprise.
2. Often, a `Regex` should be compiled once and reused repeatedly in multiple
searches. To facilitate this, `lazy_static!` can be used to guarantee that
compilation happens exactly once. `lazy_static!` requires its types to be
`Sync`. A user of `Regex` could work around this by wrapping a `Regex` in a
`Mutex`, but this would make misuse too easy. For example, locking a `Regex`
in one thread would prevent simultaneous searching in another thread.
Synchronization has overhead, although it is extremely small (and dwarfed
by general matching overhead). The author has *ad hoc* benchmarked the
`regex` implementation with GNU Grep, and per match overhead is comparable in
single threaded use. It is this author's opinion, that it is good enough. If
synchronization overhead across multiple threads is too much, callers may elect
to clone the `Regex` so that each thread gets its own copy. Cloning a `Regex`
is no more expensive than what would be done internally automatically, but it
does eliminate contention.
An alternative is to increase the API surface and have types that are
synchronized by default and types that aren't synchronized. This was discussed
at length in
[this
thread](https://users.rust-lang.org/t/help-me-reduce-overhead-of-regex-matching/5220/1).
My conclusion from this thread is that we either expand the surface of the API,
or we break the current API or we keep implicit synchronization as-is. In this
author's opinion, neither expanding the API or breaking the API is worth
avoiding negligible synchronization overhead.
## The implementation is complex
[the-implementation-is-complex]: #the-implementation-is-complex
Regular expression engines have a lot of moving parts and it often requires
quite a bit of context on how the whole library is organized in order to make
significant contributions. Therefore, moving `regex` into `rust-lang` is a
*maintenance hazard*. This author has tried to mitigate this hazard somewhat by
doing the following:
1. Offering to mentor contributions. Significant contributions have thus far
fizzled, but minor contributions---even to complex code like the DFA---have
been successful.
2. Documenting not just the API, but the *internals*. The DFA is, for example,
heavily documented.
3. Wrote a `HACKING.md` guide that gives a sweeping overview of the design.
4. Significant test and benchmark suites.
With that said, there is still a lot more that could be done to mitigate the
maintenance hazard. In this author's opinion, the interaction between the three
parts of the implementation (parsing, compilation, searching) is not documented
clearly enough.
# Alternatives
[alternatives]: #alternatives
## Big picture
[big-picture]: #big-picture
The most important alternative is to decide *not* to bless a particular
implementation of regular expressions. We might want to go this route for any
number of reasons (see: Drawbacks). However, the `regex` crate is already
widely used, which provides at least some evidence that some set of programmers
find it good enough for general purpose regex searching.
The impact of not moving `regex` into `rust-lang` is, plainly, that Rust won't
have an "officially blessed" regex implementation. Many programmers may
appreciate the complexity of a regex implementation, and therefore might insist
that one be officially maintained. However, to be honest, it isn't quite clear
what would happen in practice. This author is speculating.
## `bytes::Regex`
[bytesregex]: #bytesregex
This RFC proposes stabilizing the `bytes` sub-module of the `regex` crate in
its entirety. The `bytes` sub-module is a near clone of the API at the crate
level with one important difference: it searches `&[u8]` instead of `&str`.
This design was motivated by a similar split in `std`, but there are
alternatives.
### A regex trait
[a-regex-trait]: #a-regex-trait
One alternative is designing a trait that looks something like this:
```rust
trait Regex {
type Text: ?Sized;
fn is_match(&self, text: &Self::Text) -> bool;
2016-08-22 05:20:29 +08:00
fn find(&self, text: &Self::Text) -> Option<Match>;
fn find_iter<'r, 't>(&'r self, text: &'t Self::Text) -> Matches<'r, 't, Self::Text>;
2016-05-12 08:45:44 +08:00
// and so on
}
```
However, there are a couple problems with this approach. First and foremost,
the use cases of such a trait aren't exactly clear. It does make writing
generic code that searches either a `&str` or a `&[u8]` possible, but the
semantics of searching `&str` (always valid UTF-8) or `&[u8]` are quite a bit
different with respect to the original `Regex`. Secondly, the trait isn't
obviously implementable by others. For example, some of the methods return
2016-08-22 05:20:29 +08:00
iterator types such as `Matches` that are typically implemented with a
2016-05-12 08:45:44 +08:00
lower level API that isn't exposed. This suggests that a straight-forward
traitification of the current API probably isn't appropriate, and perhaps,
a better trait needs to be more fundamental to regex searching.
Perhaps the strongest reason to not adopt this design for regex `1.0` is that
we don't have any experience with it and there hasn't been any demand for it.
In particular, it could be prototyped in another crate.
### Reuse some types
[reuse-some-types]: #reuse-some-types
In the current proposal, the `bytes` submodule completely duplicates the
top-level API, including all iterator types, `Captures` and even the `Replacer`
trait. We could parameterize many of those types over the type of the text
searched. For example, the proposed `Replacer` trait looks like this:
```rust
trait Replacer {
fn replace_append(&mut self, caps: &Captures, dst: &mut String);
fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
None
}
}
```
We might add an associated type like so:
```rust
trait Replacer {
type Text: ToOwned + ?Sized;
fn replace_append(
&mut self,
caps: &Captures<Self::Text>,
dst: &mut <Self::Text as ToOwned>::Owned,
);
fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, Self::Text>> {
None
}
}
```
But parameterizing the `Captures` type is a little bit tricky. Namely, methods
2016-08-22 05:20:29 +08:00
like `get` want to slice the text at match offsets, but this can't be done
2016-05-12 08:45:44 +08:00
safely in generic code without introducing another public trait.
The final death knell in this idea is that these two implementations cannot
co-exist:
```rust
impl<F> Replacer for F where F: FnMut(&Captures) -> String {
type Text = str;
fn replace_append(&mut self, caps: &Captures, dst: &mut String) {
dst.push_str(&(*self)(caps));
}
}
impl<F> Replacer for F where F: FnMut(&Captures) -> Vec<u8> {
type Text = [u8];
fn replace_append(&mut self, caps: &Captures, dst: &mut Vec<u8>) {
dst.extend(&(*self)(caps));
}
}
```
Perhaps there is a path through this using yet more types or more traits, but
without a really strong motivating reason to find it, I'm not convinced it's
worth it. Duplicating all of the types is unfortunate, but it's *simple*.
# Unresolved questions
[unresolved]: #unresolved-questions
The `regex` repository has more than just the `regex` crate.
## `regex-syntax`
[regex-syntax]: #regex-syntax
This crate exposes a regular expression parser and abstract syntax that is
completely divorced from compilation or searching. It is not part of `regex`
proper since it may experience more frequent breaking changes and is far less
frequently used. It is not clear whether this crate will ever see `1.0`, and if
it does, what criteria would be used to judge it suitable for `1.0`.
Nevertheless, it is a useful public API, but it is not part of this RFC.
## `regex-capi`
[regex-capi]: #regex-capi
Recently, `regex-capi` was built to provide a C API to this regex library. It
has been used to build [cgo bindings to this library for
Go](https://github.com/BurntSushi/rure-go). Given its young age, it is not part
of this proposal but will be maintained as a pre-1.0 crate in the same
repository.
## `regex_macros`
[regex_macros]: #regex_macros
The `regex!` compiler plugin is a macro that can compile regular expressions
when your Rust program compiles. Stated differently, `regex!("...")` is
transformed into Rust code that executes a search of the given pattern
directly. It was written two years ago and largely hasn't changed since. When
it was first written, it had two major benefits:
1. If there was a syntax error in your regex, your Rust program would not
compile.
2. It was faster.
Today, (1) can be simulated in practice with the use of a Clippy lint and (2)
2016-05-19 18:34:33 +08:00
is no longer true. In fact, `regex!` is at least one order of magnitude slower
2016-05-12 08:45:44 +08:00
than the standard `Regex` implementation.
The future of `regex_macros` is not clear. In one sense, since it is a
compiler plugin, there hasn't been much interest in developing it further since
its audience is necessarily limited. In another sense, it's not entirely clear
what its implementation path is. It would take considerable work for it to beat
the current `Regex` implementation (if it's even possible). More discussion on
this is out of scope.
## Dependencies
[dependencies]: #dependencies
As of now, `regex` has several dependencies:
* `aho-corasick`
* `memchr`
* `thread_local`
* `regex-syntax`
* `utf8-ranges`
All of them except for `thread_local` were written by this author, and were
primarily motivated for use in the `regex` crate. They were split out because
they seem generally useful.
There may be other things in `regex` (today or in the future) that may also be
helpful to others outside the strict context of `regex`. Is it beneficial to
split such things out and create a longer list of dependencies? Or should we
keep `regex` as tight as possible?
## Exposing more internals
[exposing-more-internals]: #exposing-more-internals
It is conceivable that others might find interest in the regex compiler or more
lower level access to the matching engines. We could do something similar to
`regex-syntax` and expose some internals in a separate crate. However, there
isn't a pressing desire to do this at the moment, and would probably require a
good deal of work.
# Breaking changes
[breaking-changes]: #breaking-changes
This section of the RFC lists all breaking changes between `regex 0.1` and the
API proposed in this RFC.
* `find` and `find_iter` now return values of type `Match` instead of
`(usize, usize)`. The `Match` type has `start` and `end` methods which can
be used to recover the original offsets, as well as an `as_str` method to
get the matched text.
* The `Captures` type no longer has any iterators defined. Instead, callers
should use the `Regex::capture_names` method.
2016-05-12 08:45:44 +08:00
* `bytes::Regex` enables the Unicode flag by default. Previously, it disabled
it by default. The flag can be disabled in the pattern with `(?-u)`.
* The definition of the `Replacer` trait was completely re-worked. Namely, its
API inverts control of allocation so that the caller must provide a `String`
to write to. Previous implementors will need to examine the new API. Moving
to the new API should be straight-forward.
* The `is_empty` method on `Captures` was removed since it always returns
`false` (because every `Captures` has at least one capture group
corresponding to the entire match).
* The `PartialEq` and `Eq` impls on `Regex` were removed. If you need this
functionality, add a newtype around `Regex` and write the corresponding
`PartialEq` and `Eq` impls.
* The lifetime parameters for the `iter` and `iter_named` methods on
`Captures` were fixed. The corresponding iterator types, `SubCaptures` and
`SubCapturesNamed`, grew an additional lifetime parameter.
* The constructor, `Regex::with_size_limit`, was removed. It can be replaced
with use of `RegexBuilder`.
* The `is_match` free function was removed. Instead, compile a `Regex`
explicitly and call the `is_match` method.
* Many iterator types were renamed. (e.g., `RegexSplits` to `SplitsIter`.)
* Replacements now return a `Cow<str>` instead of a `String`. Namely, the
subject text doesn't need to be copied if there are no replacements. Callers
may need to add `into_owned()` calls to convert the `Cow<str>` to a proper
`String`.
* The `Error` type no longer has the `InvalidSet` variant, since the error is
no longer possible. Its `Syntax` variant was also modified to wrap a `String`
instead of a `regex_syntax::Error`. If you need access to specific parse
error information, use the `regex-syntax` crate directly.
* To allow future growth, some character classes may no longer compile to make
room for possibly adding class set notation in the future.
2016-08-22 05:20:29 +08:00
* Various iterator types have been renamed.
* The `RegexBuilder` type now takes an `&mut self` on most methods instead of
`self`. Additionally, the final build step now uses `build()` instead of
`compile()`.