| #!/bin/sh |
| |
| # vim: indentexpr= nosmartindent autoindent |
| # vim: tabstop=2 shiftwidth=2 softtabstop=2 |
| |
| # This is a regex that I reverse engineered from the sentence boundary chain |
| # rules in UAX #29. Unlike the grapheme regex, which is essentially provided |
| # for us in UAX #29, no such sentence regex exists. |
| # |
| # I looked into how ICU achieves this, since UAX #29 hints that producing |
| # finite state machines for grapheme/sentence/word/line breaking is possible, |
| # but only easy to do for graphemes. ICU does this by implementing their own |
| # DSL for describing the break algorithms in terms of the chaining rules |
| # directly. You can see an example for sentences in |
| # icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state |
| # machine from those rules in a mostly standard way, but implements the |
| # "chaining" aspect of the rules by connecting overlapping end and start |
| # states. For example, given SB7: |
| # |
| # (Upper | Lower) ATerm x Upper |
| # |
| # Then the naive way to convert this into a regex would be something like |
| # |
| # [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper} |
| # |
| # Unfortunately, this is incorrect. Why? Well, consider an example like so: |
| # |
| # U.S.A. |
| # |
| # A correct implementation of the sentence breaking algorithm should not insert |
| # any breaks here, exactly in accordance with repeatedly applying rule SB7 as |
| # given above. Our regex fails to do this because it will first match `U.S` |
| # without breaking them---which is correct---but will then start looking for |
| # its next rule beginning with a full stop (in ATerm) and followed by an |
| # uppercase letter (A). This will wind up triggering rule SB11 (without |
| # matching `A`), which inserts a break. |
| # |
| # The reason why this happens is because our initial application of rule SB7 |
| # "consumes" the next uppercase letter (S), which we want to reuse as a prefix |
| # in the next rule application. A natural way to express this would be with |
| # look-around, although it's not clear that works in every case since you |
| # ultimately might want to consume that ending uppercase letter. In any case, |
| # we can't use look-around in our truly regular regexes, so we must fix this. |
| # The approach we take is to explicitly repeat rules when a suffix of a rule |
| # is a prefix of another rule. In the case of SB7, the end of the rule, an |
| # uppercase letter, also happens to match the beginning of the rule. This can |
| # in turn be repeated indefinitely. Thus, our actual translation to a regex is: |
| # |
| # [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}* |
| # |
| # It turns out that this is exactly what ICU does, but in their case, they do |
| # it automatically. In our case, we connect the chaining rules manually. It's |
| # tedious. With that said, we do no implement Unicode line breaking with this |
| # approach, which is a far scarier beast. In that case, it would probably be |
| # worth writing the code to do what ICU does. |
| # |
| # In the case of sentence breaks, there aren't *too* many overlaps of this |
| # nature. We list them out exhaustively to make this clear, because it's |
| # essentially impossible to easily observe this in the regex. (It took me a |
| # full day to figure all of this out.) Rules marked with N/A mean that they |
| # specify a break, and this strategy only really applies to stringing together |
| # non-breaks. |
| # |
| # SB1 - N/A |
| # SB2 - N/A |
| # SB3 - None |
| # SB4 - N/A |
| # SB5 - None |
| # SB6 - None |
| # SB7 - End overlaps with beginning of SB7 |
| # SB8 - End overlaps with beginning of SB7 |
| # SB8a - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11 |
| # SB9 - None |
| # SB10 - None |
| # SB11 - None |
| # SB998 - N/A |
| # |
| # SB8a is in particular quite tricky to get right without look-ahead, since it |
| # allows ping-ponging between match rules SB8a and SB9-11, where SB9-11 |
| # otherwise indicate that a break has been found. In the regex below, we tackle |
| # this by only permitting part of SB8a to match inside our core non-breaking |
| # repetition. In particular, we only allow the parts of SB8a to match that |
| # permit the non-breaking components to continue. If a part of SB8a matches |
| # that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it |
| # happen. This still isn't correct because an SContinue might be seen which |
| # would allow moving back into SB998 and thus the non-breaking repetition, so |
| # we handle that case as well. |
| # |
| # Finally, the last complication here is the sprinkling of $Ex* everywhere. |
| # This essentially corresponds to the implementation of SB5 by following |
| # UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking |
| # in the middle of a grapheme cluster. |
| |
| CR="\p{sb=CR}" |
| LF="\p{sb=LF}" |
| Sep="\p{sb=Sep}" |
| Close="\p{sb=Close}" |
| Sp="\p{sb=Sp}" |
| STerm="\p{sb=STerm}" |
| ATerm="\p{sb=ATerm}" |
| SContinue="\p{sb=SContinue}" |
| Numeric="\p{sb=Numeric}" |
| Upper="\p{sb=Upper}" |
| Lower="\p{sb=Lower}" |
| OLetter="\p{sb=OLetter}" |
| |
| Ex="[\p{sb=Extend}\p{sb=Format}]" |
| ParaSep="[$Sep $CR $LF]" |
| SATerm="[$STerm $ATerm]" |
| |
| LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]" |
| |
| echo "(?x) |
| ( |
| # SB6 |
| $ATerm $Ex* |
| $Numeric |
| | |
| # SB7 |
| [$Upper $Lower] $Ex* $ATerm $Ex* |
| $Upper $Ex* |
| # overlap with SB7 |
| ($ATerm $Ex* $Upper $Ex*)* |
| | |
| # SB8 |
| $ATerm $Ex* $Close* $Ex* $Sp* $Ex* |
| ([^$LetterSepTerm] $Ex*)* $Lower $Ex* |
| # overlap with SB7 |
| ($ATerm $Ex* $Upper $Ex*)* |
| | |
| # SB8a |
| $SATerm $Ex* $Close* $Ex* $Sp* $Ex* |
| ( |
| $SContinue |
| | |
| $ATerm $Ex* |
| # Permit repetition of SB8a |
| (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* |
| # In order to continue non-breaking matching, we now must observe |
| # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered |
| # one of SB9-11 and know that a break must follow. |
| ( |
| # overlap with SB6 |
| $Numeric |
| | |
| # overlap with SB8 |
| ($Close $Ex*)* ($Sp $Ex*)* |
| ([^$LetterSepTerm] $Ex*)* $Lower $Ex* |
| # overlap with SB7 |
| ($ATerm $Ex* $Upper $Ex*)* |
| | |
| # overlap with SB8a |
| ($Close $Ex*)* ($Sp $Ex*)* $SContinue |
| ) |
| | |
| $STerm $Ex* |
| # Permit repetition of SB8a |
| (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* |
| # As with ATerm above, in order to continue non-breaking matching, we |
| # must now observe a match with a rule that keeps us out of SB9-11. |
| # For STerm, the only such possibility is to see an SContinue. Anything |
| # else will result in a break. |
| ($Close $Ex*)* ($Sp $Ex*)* $SContinue |
| ) |
| | |
| # SB998 |
| # The logic behind this catch-all is that if we get to this point and |
| # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of |
| # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since |
| # SB11 acts as a catch-all to induce a break following a SATerm that isn't |
| # handled by rules SB6-SB8a. |
| [^$ParaSep $SATerm] |
| )* |
| # The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11. |
| ($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)? |
| " |