RegEx in Zig.
At the moment, parsing of basic RegEx's (literal characters, concatenation, |, and *) is implemented; \ is used as an escaping character; parentheses can be used as you would expect. The empty string is not a valid regular expression - to match only the empty string, use the regex []. All other current features are merely syntactic sugar (as the previous features are enough to recognize any Type-3 language); here's a list of all the sweet stuff:
- Character ranges and groups:
[a-d],[abcd],[ab-d],(a|b|c|d)are all supported (and equivalent). Empty character ranges (i.e.[]) are the only case where ranges are not syntactic sugar, as there is no other way to regocnize only the empty string.- Character groups can be inverted:
[^a]matches any character buta.
- Character groups can be inverted:
- Using
.to mean "any character" is supported. To match a literal '.', use[.]or\.. - The
+operator can be used as a shorthand for a concatentation and a Kleene star:x+is equivalent toxx*, i.e.+means "at least one character". - The
?operator can be used as a shorthand for a union with the empty string:x?is equivalent tox|[].
Translation of regular expressions to
The RegEx ASTs, and the NFAs and DFAs include Graphviz DOT output as a visualization aid.
The project also includes a JIT compiler for x86-64 that can convert DFAs to machine code that is executable immediately. Profile guided optimization is also implemented: the JIT compiler can use information collected during profiling runs of the DFA (pre-compilation, i.e. interpreted runs), to emit more efficient machine code. Although the compile-time performance of using this feature is not great yet.
Example of a regex combining almost all current features: x[yz]|.w*[a-c]*[f-i]
AST:
NFA (still has the epsilon transitions, but they can all be ignored):
DFA:
Machine code can be found here
Building the DFA at compile-time should be possible quite easily if/when Zig gets working compile-time allocators. It should also be possible now, but not without significant rewrites.