libs (needle)
pattern
Stabilize all string functions working with search patterns around a new generic API that provides a unified way to define and use those patterns.
Right now, string slices define a couple of methods for string
manipulation that work with user provided values that act as
search patterns. For example, split()
takes an type implementing CharEq
to split the slice at all codepoints that match that predicate.
Among these methods, the notion of what exactly is being used as a search
pattern varies inconsistently: Many work with the generic CharEq
,
which only looks at a single codepoint at a time; and some
work with char
or &str
directly, sometimes duplicating a method to
provide operations for both.
This presents a couple of issues:
contains
vs contains_char
)rsplit
for &str
patterns)At the moment, the full set of relevant string methods roughly looks like this:
pub trait StrExt for ?Sized {
fn contains(&self, needle: &str) -> bool;
fn contains_char(&self, needle: char) -> bool;
fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<Sep>;
fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<Sep>;
fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<Sep>;
fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<Sep>;
fn split_str<'a>(&'a self, &'a str) -> StrSplits<'a>;
fn match_indices<'a>(&'a self, sep: &'a str) -> MatchIndices<'a>;
fn starts_with(&self, needle: &str) -> bool;
fn ends_with(&self, needle: &str) -> bool;
fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn find<C: CharEq>(&self, search: C) -> Option<uint>;
fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
fn find_str(&self, &str) -> Option<uint>;
// ...
}
This RFC proposes to fix those issues by providing a unified Pattern
trait
that all "string pattern" types would implement, and that would be used by the string API
exclusively.
This fixes the duplication, consistency, and extensibility problems, and also allows to define newtype wrappers for the same pattern types that use different or specific search implementations.
As an additional design goal, the new abstractions should also not pose a problem for optimization - like for iterators, a concrete instance should produce similar machine code to a hardcoded optimized loop written in C.
First, new traits will be added to the str
module in the std library:
trait Pattern<'a> {
type Searcher: Searcher<'a>;
fn into_matcher(self, haystack: &'a str) -> Self::Searcher;
fn is_contained_in(self, haystack: &'a str) -> bool { /* default*/ }
fn match_starts_at(self, haystack: &'a str, idx: usize) -> bool { /* default*/ }
fn match_ends_at(self, haystack: &'a str, idx: usize) -> bool
where Self::Searcher: ReverseSearcher<'a> { /* default*/ }
}
A Pattern
represents a builder for an associated type implementing a
family of Searcher
traits (see below), and will be implemented by all types that
represent string patterns, which includes:
&str
char
, and everything else implementing CharEq
&Regex
or Ascii
struct BoyerMoore(&str)
impl<'a> Pattern<'a> for char { /* ... */ }
impl<'a, 'b> Pattern<'a> for &'b str { /* ... */ }
impl<'a, 'b> Pattern<'a> for &'b [char] { /* ... */ }
impl<'a, F> Pattern<'a> for F where F: FnMut(char) -> bool { /* ... */ }
impl<'a, 'b> Pattern<'a> for &'b Regex { /* ... */ }
The lifetime parameter on Pattern
exists in order to allow threading the lifetime
of the haystack (the string to be searched through) through the API, and is a workaround
for not having associated higher kinded types yet.
Consumers of this API can then call into_searcher()
on the pattern to convert it into
a type implementing a family of Searcher
traits:
pub enum SearchStep {
Match(usize, usize),
Reject(usize, usize),
Done
}
pub unsafe trait Searcher<'a> {
fn haystack(&self) -> &'a str;
fn next(&mut self) -> SearchStep;
fn next_match(&mut self) -> Option<(usize, usize)> { /* default*/ }
fn next_reject(&mut self) -> Option<(usize, usize)> { /* default*/ }
}
pub unsafe trait ReverseSearcher<'a>: Searcher<'a> {
fn next_back(&mut self) -> SearchStep;
fn next_match_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
fn next_reject_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
}
pub trait DoubleEndedSearcher<'a>: ReverseSearcher<'a> {}
The basic idea of a Searcher
is to expose a interface for
iterating through all connected string fragments of the haystack while classifying them as either a match, or a reject.
This happens in form of the returned enum value. A Match
needs to contain the start and end indices of a complete non-overlapping match, while a Rejects
may be emitted for arbitrary non-overlapping rejected parts of the string, as long as the start and end indices lie on valid utf8 boundaries.
Similar to iterators, depending on the concrete implementation a searcher can have additional capabilities that build on each other, which is why they will be defined in terms of a three-tier hierarchy:
Searcher<'a>
is the basic trait that all searchers need to implement.
It contains a next()
method that returns the start
and end
indices of
the next match or reject in the haystack, with the search beginning at the front
(left) of the string. It also contains a haystack()
getter for returning the
actual haystack, which is the source of the 'a
lifetime on the hierarchy.
The reason for this getter being made part of the trait is twofold:
ReverseSearcher<'a>
adds an next_back()
method, for also allowing to efficiently
search in reverse (starting from the right).
However, the results are not required to be equal to the results of
next()
in reverse, (as would be the case for the DoubleEndedIterator
trait)
because that can not be efficiently guaranteed for all searchers. (For an example, see further below)DoubleEndedSearcher<'a>
is provided as an marker trait for expressing
that guarantee - If a searcher implements this trait, all results found from the
left need to be equal to all results found from the right in reverse order.As an important last detail, both
Searcher
and ReverseSearcher
are marked as unsafe
traits, even though the actual methods
aren't. This is because every implementation of these traits need to ensure that all
indices returned by next()
and next_back()
lie on valid utf8 boundaries
in the haystack.
Without that guarantee, every single match returned by a matcher would need to be double-checked for validity, which would be unnecessary and most likely unoptimizable work.
This is in contrast to the current hardcoded implementations, which can make use of such guarantees because the concrete types are known and all unsafe code needed for such optimizations is contained inside a single safe impl.
Given that most implementations of these traits will likely
live in the std library anyway, and are thoroughly tested, marking these traits unsafe
doesn't seem like a huge burden to bear for good, optimizable performance.
Pattern
, Searcher
and ReverseSearcher
each offer a few additional
default methods that give better optimization opportunities.
Most consumers of the pattern API will use them to more narrowly constraint how they are looking for a pattern, which given an optimized implementantion, should lead to mostly optimal code being generated.
Let the haystack be the string "fooaaaaabar"
, and let the pattern be the string "aa"
.
Then a efficient, lazy implementation of the matcher searching from the left would find these matches:
"foo[aa][aa]abar"
However, the same algorithm searching from the right would find these matches:
"fooa[aa][aa]bar"
This discrepancy can not be avoided without additional overhead or even allocations for caching in the reverse matcher, and thus "matching from the front" needs to be considered a different operation than "matching from the back".
(uint, uint)
instead of &str
Note: This section is a bit outdated now
It would be possible to define next
and next_back
to return &str
s instead of (uint, uint)
tuples.
A concrete searcher impl could then make use of unsafe code to construct such an slice cheaply, and by its very nature it is guaranteed to lie on utf8 boundaries, which would also allow not marking the traits as unsafe.
However, this approach has a couple of issues. For one, not every consumer of this API cares about only the matched slice itself:
split()
family of operations cares about the slices between matches.match_indices()
and find()
need to actually return the offset
to the start of the string as part of their definition.trim()
and Xs_with()
family of operations need to compare individual match
offsets with each other and the start and end of the string.In order for these use cases to work with a &str
match, the concrete adapters
would need to unsafely calculate the offset of a match &str
to the start of the haystack &str
.
But that in turn would require matcher implementors to only return actual sub slices into
the haystack, and not random static
string slices, as the API defined with &str
would allow.
In order to resolve that issue, you'd have to do one of:
In both cases, the API does not really improve significantly, so uint
indices have been chosen
as the "simple" default design.
StrExt
With the Pattern
and Searcher
traits defined and implemented, the actual str
methods will be changed to make use of them:
pub trait StrExt for ?Sized {
fn contains<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;
fn split<'a, P>(&'a self, pat: P) -> Splits<P> where P: Pattern<'a>;
fn rsplit<'a, P>(&'a self, pat: P) -> RSplits<P> where P: Pattern<'a>;
fn split_terminator<'a, P>(&'a self, pat: P) -> TermSplits<P> where P: Pattern<'a>;
fn rsplit_terminator<'a, P>(&'a self, pat: P) -> RTermSplits<P> where P: Pattern<'a>;
fn splitn<'a, P>(&'a self, pat: P, n: uint) -> NSplits<P> where P: Pattern<'a>;
fn rsplitn<'a, P>(&'a self, pat: P, n: uint) -> RNSplits<P> where P: Pattern<'a>;
fn matches<'a, P>(&'a self, pat: P) -> Matches<P> where P: Pattern<'a>;
fn rmatches<'a, P>(&'a self, pat: P) -> RMatches<P> where P: Pattern<'a>;
fn match_indices<'a, P>(&'a self, pat: P) -> MatchIndices<P> where P: Pattern<'a>;
fn rmatch_indices<'a, P>(&'a self, pat: P) -> RMatchIndices<P> where P: Pattern<'a>;
fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;
fn ends_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>,
P::Searcher: ReverseSearcher<'a>;
fn trim_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
P::Searcher: DoubleEndedSearcher<'a>;
fn trim_left_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>;
fn trim_right_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
P::Searcher: ReverseSearcher<'a>;
fn find<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>;
fn rfind<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>,
P::Searcher: ReverseSearcher<'a>;
// ...
}
These are mainly the same pattern-using methods as currently existing, only changed to uniformly use the new pattern API. The main differences are:
contains(char)
and contains_str(&str)
got merged into single generic methods.CharEq
-centric naming got changed to Pattern
-centric naming by changing chars
to matches
in a few method names.Matches
iterator has been added, that just returns the pattern matches as &str
slices.
Its uninteresting for patterns that look for a single string fragment, like the char
and &str
matcher, but useful for advanced patterns like predicates over codepoints, or regular expressions.r
prefixed reverse versions. As explained above,
this is because both represent different operations, and thus need to be handled as such.
To be more precise, the two can not be abstracted over by providing a DoubleEndedIterator
implementations, as the different results would break the requirement for double ended iterators
to behave like a double ended queues where you just pop elements from both sides.However, all iterators will still implement DoubleEndedIterator
if the underlying
matcher implements DoubleEndedSearcher
, to keep the ability to do things like foo.split('a').rev()
.
Most changes in this RFC can be made in such a way that code using the old hardcoded or CharEq
-using
methods will still compile, or give deprecation warning.
It would even be possible to generically implement Pattern
for all CharEq
types,
making the transition more painless.
Long-term, post 1.0, it would be possible to define new sets of Pattern
and Searcher
without a lifetime parameter by making use of higher kinded types in order to simplify the
string APIs. Eg, instead of fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;
you'd have fn starts_with<P>(&self, pat: P) -> bool where P: Pattern;
.
In order to not break backwards-compability, these can use the same generic-impl trick to forward to the old traits, which would roughly look like this:
unsafe trait NewPattern {
type Searcher<'a> where Searcher: NewSearcher;
fn into_matcher<'a>(self, s: &'a str) -> Self::Searcher<'a>;
}
unsafe impl<'a, P> Pattern<'a> for P where P: NewPattern {
type Searcher = <Self as NewPattern>::Searcher<'a>;
fn into_matcher(self, haystack: &'a str) -> Self::Searcher {
<Self as NewPattern>::into_matcher(self, haystack)
}
}
unsafe trait NewSearcher for Self<'_> {
fn haystack<'a>(self: &Self<'a>) -> &'a str;
fn next_match<'a>(self: &mut Self<'a>) -> Option<(uint, uint)>;
}
unsafe impl<'a, M> Searcher<'a> for M<'a> where M: NewSearcher {
fn haystack(&self) -> &'a str {
<M as NewSearcher>::haystack(self)
}
fn next_match(&mut self) -> Option<(uint, uint)> {
<M as NewSearcher>::next_match(self)
}
}
Based on coherency experiments and assumptions about how future HKT will work, the author is assuming that the above implementation will work, but can not experimentally prove it.
Note: There might be still an issue with this upgrade path on the concrete iterator types. That is,
Split<P>
might turn intoSplit<'a, P>
... Maybe require the'a
from the beginning?
In order for these new traits to fully replace the old ones without getting in their way,
the old ones need to not be defined in a way that makes them "final".
That is, they should be defined in their own submodule, like str::pattern
that can grow
a sister module like str::newpattern
, and not be exported in a global place like str
or even
the prelude
(which would be unneeded anyway).
Note: This section is not updated to the new naming scheme
In general:
contains
and contains_str
.
Similar to status quo, but no CharEq
and thus no generics.Under the assumption that the lifetime parameter on the traits in this proposal is too big a wart to have in the release string API, there is an primary alternative that would avoid it:
CharEq
- This would mean hardcoded _str
methods,
generic CharEq
methods, and no extensibility to types like Regex
, but has a
upgrade path for later upgrading CharEq
to a full-fledged, HKT-using Pattern
API, by providing
back-comp generic impls.Next, there are alternatives that might make a positive difference in the authors opinion, but still have some negative trade-offs:
Matcher
traits having the unsafe constraint of returning results unique to the
current haystack already, they could just directly return a (*const u8, *const u8)
pointing into it.
This would allow a few more micro-optimizations, as now the matcher -> match -> final slice
pipeline would no longer need to keep adding and subtracting the start address of the haystack
for immediate results.Pattern
into Pattern
and ReversePattern
, starting the forward-reverse split at the level of
patterns directly. The two would still be in a inherits-from relationship like
Matcher
and ReverseSearcher
, and be interchangeable if the later also implement DoubleEndedSearcher
,
but on the str
API where clauses like where P: Pattern<'a>, P::Searcher: ReverseSearcher<'a>
would turn into where P: ReversePattern<'a>
.Lastly, there are alternatives that don't seem very favorable, but are listed for completeness sake:
unsafe
from the API by returning a special SubSlice<'a>
type instead of (uint, uint)
in each
match, that wraps the haystack and the
current match as a (*start, *match_start, *match_end, *end)
pointer quad. It is unclear whether
those two additional words per match end up being an issue after monomorphization, but two of them
will be constant for the duration of the iteration, so changes are they won't matter.
The haystack()
could also be removed that way, as each match already returns the haystack.
However, this still prevents removal of the lifetime parameters without HKT.Matcher
and Pattern
by requiring users of the API to store the haystack slice
themselves, duplicating it in the in-memory representation.
However, this still runs into HKT issues with the impl of Pattern
.Pattern
and Matcher
by making them fully unsafe API's,
and require implementations to unsafely transmuting back the lifetime of the haystack slice.unsafe
from the API by not marking the Matcher
traits as unsafe
, requiring users of the API
to explicitly check every match on validity in regard to utf8 boundaries.unsafe
traits by providing parallel safe and unsafe Matcher
traits or methods,
with the one per default implemented in terms of the other.Matches
and RMatches
could both implement DoubleEndedIterator
if a
DoubleEndedSearcher
exists, in the latter only Matches
would, with RMatches
only providing the
minimum to support reverse operation.
A ruling in favor of symmetry would also speak for the ReversePattern
alternative.A similar abstraction system could be implemented for String
APIs, so that for example string.push("foo")
,
string.push('f')
, string.push('f'.to_ascii())
all work by using something like a StringSource
trait.
This would allow operations like s.replace(®ex!(...), "foo")
,
which would be a method generic over both the pattern matched and the string fragment it gets replaced with:
fn replace<P, S>(&mut self, pat: P, with: S) where P: Pattern, S: StringSource { /* ... */ }