RFC 0528: string-patterns

libs (needle)

Summary

Stabilize all string functions working with search patterns around a new generic API that provides a unified way to define and use those patterns.

Motivation

Right now, string slices define a couple of methods for string manipulation that work with user provided values that act as search patterns. For example, split() takes an type implementing CharEq to split the slice at all codepoints that match that predicate.

Among these methods, the notion of what exactly is being used as a search pattern varies inconsistently: Many work with the generic CharEq, which only looks at a single codepoint at a time; and some work with char or &str directly, sometimes duplicating a method to provide operations for both.

This presents a couple of issues:

At the moment, the full set of relevant string methods roughly looks like this:

pub trait StrExt for ?Sized {
    fn contains(&self, needle: &str) -> bool;
    fn contains_char(&self, needle: char) -> bool;

    fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<Sep>;
    fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<Sep>;
    fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<Sep>;
    fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<Sep>;
    fn split_str<'a>(&'a self, &'a str) -> StrSplits<'a>;

    fn match_indices<'a>(&'a self, sep: &'a str) -> MatchIndices<'a>;

    fn starts_with(&self, needle: &str) -> bool;
    fn ends_with(&self, needle: &str) -> bool;

    fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;

    fn find<C: CharEq>(&self, search: C) -> Option<uint>;
    fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
    fn find_str(&self, &str) -> Option<uint>;

    // ...
}

This RFC proposes to fix those issues by providing a unified Pattern trait that all "string pattern" types would implement, and that would be used by the string API exclusively.

This fixes the duplication, consistency, and extensibility problems, and also allows to define newtype wrappers for the same pattern types that use different or specific search implementations.

As an additional design goal, the new abstractions should also not pose a problem for optimization - like for iterators, a concrete instance should produce similar machine code to a hardcoded optimized loop written in C.

Detailed design

New traits

First, new traits will be added to the str module in the std library:

trait Pattern<'a> {
    type Searcher: Searcher<'a>;
    fn into_matcher(self, haystack: &'a str) -> Self::Searcher;

    fn is_contained_in(self, haystack: &'a str) -> bool { /* default*/ }
    fn match_starts_at(self, haystack: &'a str, idx: usize) -> bool { /* default*/ }
    fn match_ends_at(self, haystack: &'a str, idx: usize) -> bool
        where Self::Searcher: ReverseSearcher<'a> { /* default*/ }
}

A Pattern represents a builder for an associated type implementing a family of Searcher traits (see below), and will be implemented by all types that represent string patterns, which includes:

impl<'a>     Pattern<'a> for char       { /* ... */ }
impl<'a, 'b> Pattern<'a> for &'b str    { /* ... */ }

impl<'a, 'b> Pattern<'a> for &'b [char] { /* ... */ }
impl<'a, F>  Pattern<'a> for F where F: FnMut(char) -> bool { /* ... */ }

impl<'a, 'b> Pattern<'a> for &'b Regex  { /* ... */ }

The lifetime parameter on Pattern exists in order to allow threading the lifetime of the haystack (the string to be searched through) through the API, and is a workaround for not having associated higher kinded types yet.

Consumers of this API can then call into_searcher() on the pattern to convert it into a type implementing a family of Searcher traits:

pub enum SearchStep {
    Match(usize, usize),
    Reject(usize, usize),
    Done
}
pub unsafe trait Searcher<'a> {
    fn haystack(&self) -> &'a str;
    fn next(&mut self) -> SearchStep;

    fn next_match(&mut self) -> Option<(usize, usize)> { /* default*/ }
    fn next_reject(&mut self) -> Option<(usize, usize)> { /* default*/ }
}
pub unsafe trait ReverseSearcher<'a>: Searcher<'a> {
    fn next_back(&mut self) -> SearchStep;

    fn next_match_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
    fn next_reject_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
}
pub trait DoubleEndedSearcher<'a>: ReverseSearcher<'a> {}

The basic idea of a Searcher is to expose a interface for iterating through all connected string fragments of the haystack while classifying them as either a match, or a reject.

This happens in form of the returned enum value. A Match needs to contain the start and end indices of a complete non-overlapping match, while a Rejects may be emitted for arbitrary non-overlapping rejected parts of the string, as long as the start and end indices lie on valid utf8 boundaries.

Similar to iterators, depending on the concrete implementation a searcher can have additional capabilities that build on each other, which is why they will be defined in terms of a three-tier hierarchy:

As an important last detail, both Searcher and ReverseSearcher are marked as unsafe traits, even though the actual methods aren't. This is because every implementation of these traits need to ensure that all indices returned by next() and next_back() lie on valid utf8 boundaries in the haystack.

Without that guarantee, every single match returned by a matcher would need to be double-checked for validity, which would be unnecessary and most likely unoptimizable work.

This is in contrast to the current hardcoded implementations, which can make use of such guarantees because the concrete types are known and all unsafe code needed for such optimizations is contained inside a single safe impl.

Given that most implementations of these traits will likely live in the std library anyway, and are thoroughly tested, marking these traits unsafe doesn't seem like a huge burden to bear for good, optimizable performance.

The role of the additional default methods

Pattern, Searcher and ReverseSearcher each offer a few additional default methods that give better optimization opportunities.

Most consumers of the pattern API will use them to more narrowly constraint how they are looking for a pattern, which given an optimized implementantion, should lead to mostly optimal code being generated.

Example for the issue with double-ended searching

Let the haystack be the string "fooaaaaabar", and let the pattern be the string "aa".

Then a efficient, lazy implementation of the matcher searching from the left would find these matches:

"foo[aa][aa]abar"

However, the same algorithm searching from the right would find these matches:

"fooa[aa][aa]bar"

This discrepancy can not be avoided without additional overhead or even allocations for caching in the reverse matcher, and thus "matching from the front" needs to be considered a different operation than "matching from the back".

Why (uint, uint) instead of &str

Note: This section is a bit outdated now

It would be possible to define next and next_back to return &strs instead of (uint, uint) tuples.

A concrete searcher impl could then make use of unsafe code to construct such an slice cheaply, and by its very nature it is guaranteed to lie on utf8 boundaries, which would also allow not marking the traits as unsafe.

However, this approach has a couple of issues. For one, not every consumer of this API cares about only the matched slice itself:

In order for these use cases to work with a &str match, the concrete adapters would need to unsafely calculate the offset of a match &str to the start of the haystack &str.

But that in turn would require matcher implementors to only return actual sub slices into the haystack, and not random static string slices, as the API defined with &str would allow.

In order to resolve that issue, you'd have to do one of:

In both cases, the API does not really improve significantly, so uint indices have been chosen as the "simple" default design.

New methods on StrExt

With the Pattern and Searcher traits defined and implemented, the actual str methods will be changed to make use of them:

pub trait StrExt for ?Sized {
    fn contains<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;

    fn split<'a, P>(&'a self, pat: P) -> Splits<P> where P: Pattern<'a>;
    fn rsplit<'a, P>(&'a self, pat: P) -> RSplits<P> where P: Pattern<'a>;
    fn split_terminator<'a, P>(&'a self, pat: P) -> TermSplits<P> where P: Pattern<'a>;
    fn rsplit_terminator<'a, P>(&'a self, pat: P) -> RTermSplits<P> where P: Pattern<'a>;
    fn splitn<'a, P>(&'a self, pat: P, n: uint) -> NSplits<P> where P: Pattern<'a>;
    fn rsplitn<'a, P>(&'a self, pat: P, n: uint) -> RNSplits<P> where P: Pattern<'a>;

    fn matches<'a, P>(&'a self, pat: P) -> Matches<P> where P: Pattern<'a>;
    fn rmatches<'a, P>(&'a self, pat: P) -> RMatches<P> where P: Pattern<'a>;
    fn match_indices<'a, P>(&'a self, pat: P) -> MatchIndices<P> where P: Pattern<'a>;
    fn rmatch_indices<'a, P>(&'a self, pat: P) -> RMatchIndices<P> where P: Pattern<'a>;

    fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;
    fn ends_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>,
                                                        P::Searcher: ReverseSearcher<'a>;

    fn trim_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
                                                              P::Searcher: DoubleEndedSearcher<'a>;
    fn trim_left_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>;
    fn trim_right_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
                                                                    P::Searcher: ReverseSearcher<'a>;

    fn find<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>;
    fn rfind<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>,
                                                            P::Searcher: ReverseSearcher<'a>;

    // ...
}

These are mainly the same pattern-using methods as currently existing, only changed to uniformly use the new pattern API. The main differences are:

However, all iterators will still implement DoubleEndedIterator if the underlying matcher implements DoubleEndedSearcher, to keep the ability to do things like foo.split('a').rev().

Transition and deprecation plans

Most changes in this RFC can be made in such a way that code using the old hardcoded or CharEq-using methods will still compile, or give deprecation warning.

It would even be possible to generically implement Pattern for all CharEq types, making the transition more painless.

Long-term, post 1.0, it would be possible to define new sets of Pattern and Searcher without a lifetime parameter by making use of higher kinded types in order to simplify the string APIs. Eg, instead of fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>; you'd have fn starts_with<P>(&self, pat: P) -> bool where P: Pattern;.

In order to not break backwards-compability, these can use the same generic-impl trick to forward to the old traits, which would roughly look like this:

unsafe trait NewPattern {
    type Searcher<'a> where Searcher: NewSearcher;

    fn into_matcher<'a>(self, s: &'a str) -> Self::Searcher<'a>;
}

unsafe impl<'a, P> Pattern<'a> for P where P: NewPattern {
    type Searcher = <Self as NewPattern>::Searcher<'a>;

    fn into_matcher(self, haystack: &'a str) -> Self::Searcher {
        <Self as NewPattern>::into_matcher(self, haystack)
    }
}

unsafe trait NewSearcher for Self<'_> {
    fn haystack<'a>(self: &Self<'a>) -> &'a str;
    fn next_match<'a>(self: &mut Self<'a>) -> Option<(uint, uint)>;
}

unsafe impl<'a, M> Searcher<'a> for M<'a> where M: NewSearcher {
    fn haystack(&self) -> &'a str {
        <M as NewSearcher>::haystack(self)
    }
    fn next_match(&mut self) -> Option<(uint, uint)> {
        <M as NewSearcher>::next_match(self)
    }
}

Based on coherency experiments and assumptions about how future HKT will work, the author is assuming that the above implementation will work, but can not experimentally prove it.

Note: There might be still an issue with this upgrade path on the concrete iterator types. That is, Split<P> might turn into Split<'a, P>... Maybe require the 'a from the beginning?

In order for these new traits to fully replace the old ones without getting in their way, the old ones need to not be defined in a way that makes them "final". That is, they should be defined in their own submodule, like str::pattern that can grow a sister module like str::newpattern, and not be exported in a global place like str or even the prelude (which would be unneeded anyway).

Drawbacks

Alternatives

Note: This section is not updated to the new naming scheme

In general:

Under the assumption that the lifetime parameter on the traits in this proposal is too big a wart to have in the release string API, there is an primary alternative that would avoid it:

Next, there are alternatives that might make a positive difference in the authors opinion, but still have some negative trade-offs:

Lastly, there are alternatives that don't seem very favorable, but are listed for completeness sake:

Unresolved questions

Additional extensions

A similar abstraction system could be implemented for String APIs, so that for example string.push("foo"), string.push('f'), string.push('f'.to_ascii()) all work by using something like a StringSource trait.

This would allow operations like s.replace(&regex!(...), "foo"), which would be a method generic over both the pattern matched and the string fragment it gets replaced with:

fn replace<P, S>(&mut self, pat: P, with: S) where P: Pattern, S: StringSource { /* ... */ }