Initial release.
This document introduces RegexKitLite for Mac OS X. RegexKitLite enables easy access to regular expressions by providing a number of additions to the standard Foundation NSString class. RegexKitLite acts as a bridge between the NSString class and the regular expression engine in the International Components for Unicode, or ICU, dynamic shared library that is shipped with Mac OS X.
For consistency, this document follows the style conventions of Apple's on-line HTML documentation. When printed, this document follows the style conventions of Apple's printable PDF documentation using a special CSS stylesheet tailored to printed media. While not required, the best results are obtained when the following fonts are available:
While RegexKitLite is not a descendant of the RegexKit.framework source code, it does provide a small subset of RegexKit's NSString methods for performing various regular expression tasks. These include determining the range that a regular expression matches within a string, easily creating a new string from the results of a match, splitting a string into an NSArray with a regular expression, and performing search and replace operations with regular expressions using common $n substitution syntax.
RegexKitLite uses the regular expression engine provided by the ICU library that ships with Mac OS X. The two files, RegexKitLite.h and RegexKitLite.m, and linking against the /usr/lib/libicucore.dylib ICU shared library are all that is required. Adding RegexKitLite to your project only adds a few kilobytes of overhead to your application's size and typically only requires a few kilobytes of memory at run-time. Since a regular expression must first be compiled by the ICU library before it can be used, RegexKitLite keeps a small pseudo Least Recently Used cache of the compiled regular expressions.
RegexKit.framework and RegexKitLite are two different projects. In retrospect, RegexKitLite should have been given a more distinctive name. Below is a table summarizing some of the key differences between the two:
| | RegexKit.framework | RegexKitLite |
---|---|---|
Regex Library | PCRE | ICU |
Library Included | Yes, built into framework object file. | No, provided by Mac OS X. |
Library Linked As | Statically linked into framework. | Dynamically linked to /usr/lib/libicucore.dylib. |
Compiled Size | Approximately 371KB† per architecture. | Very small, approximately 16KB to 20KB‡ per architecture. |
Style | External, linked to framework. | Compiled directly into final executable. |
Feature Set | Large, with additions to many classes. | Minimal, NSString only. |
The NSString that contains the regular expression must be compiled in to an ICU URegularExpression. This can be an expensive, time consuming step, and the compiled regular expression can be reused again in another search, even if the strings to be searched are different. Therefore RegexKitLite keeps a small cache of recently compiled regular expressions.
This cache is a simple hash table, the size of which can be tuned with the pre-processor define RKL_CACHE_SIZE. The default cache size, which should always be a prime number, is set to 23. The NSString regexString is mapped to a cache slot using modular arithmetic: Cache slot ≡ [regexString hash] mod RKL_CACHE_SIZE, i.e. cacheSlot = [regexString hash] % 23;. Since RegexKitLite uses Core Foundation, this is actually coded as cacheSlot = CFHash(regexString) % RKL_CACHE_SIZE;.
If the cache slot currently contains a compiled URegularExpression, checks are made to ensure that the current regexString is identical to the regular expression used to create the compiled URegularExpression. If they are a match, the cached compiled regular expression is used. If they are not a match, the current compiled regular expression for the selected cache slot is ejected and all of its resources are freed. Then the regexString that caused the ejection is compiled and fills the cache slot. Only one compiled regular expression can reside in a cache slot at a time.
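The slot selection and ejection behavior described above can be sketched in plain C, which also compiles as Objective-C. This is only an illustration under stated assumptions, not the library's actual code: `toy_hash` (the djb2 string hash) stands in for CFHash(), `compile_count` stands in for the call into ICU, and the names `cache_slot`, `cache_lookup` and `CACHE_SIZE` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE 23u  /* mirrors the default RKL_CACHE_SIZE; a prime */

/* One slot holds at most one "compiled" regex, keyed by a private copy of its text. */
typedef struct {
    char *regex_copy;   /* NULL while the slot is empty */
} cache_slot;

static cache_slot cache[CACHE_SIZE];
static unsigned compile_count;  /* counts simulated compilations */

/* Stand-in for CFHash(): the djb2 string hash (an assumption, not what Cocoa uses). */
static unsigned long toy_hash(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Returns the slot index used for regex, "compiling" only on a miss. */
static unsigned cache_lookup(const char *regex) {
    assert(regex != NULL);
    unsigned slot = (unsigned)(toy_hash(regex) % CACHE_SIZE);
    cache_slot *entry = &cache[slot];

    /* Hit: the slot's private copy is identical to the requested regex. */
    if (entry->regex_copy != NULL && strcmp(entry->regex_copy, regex) == 0)
        return slot;

    /* Miss: eject the previous occupant (if any), then fill the slot. */
    free(entry->regex_copy);
    entry->regex_copy = malloc(strlen(regex) + 1);
    assert(entry->regex_copy != NULL);
    strcpy(entry->regex_copy, regex);
    compile_count++;   /* the real library would compile with ICU here */
    return slot;
}
```

Because the comparison is made against the slot's own copy of the text, a caller that edits its buffer and looks it up again always takes the miss path, whether or not the new text happens to hash to the same slot.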
When a regular expression is compiled, an immutable copy of the string is kept. For immutable NSString objects, the copy is usually the same object with its reference count increased by one. Only NSMutableString objects will cause a new, immutable NSString to be created.
If the regular expression being used is stored in an NSMutableString, the cached regular expression will continue to be used as long as the NSMutableString remains unchanged. Once mutated, the changed NSMutableString will no longer be a match for the cached compiled regular expression that was being used by it previously. Even if the newly mutated string's hash is congruent to the previous unmutated string's hash modulo RKL_CACHE_SIZE, that is to say they share the same cache slot (i.e., ([mutatedString hash] % RKL_CACHE_SIZE) == ([unmutatedString hash] % RKL_CACHE_SIZE)), the immutable copy of the regular expression string used to create the compiled regular expression is used to ensure true equality. The newly mutated string will have to go through the whole cache slot entry creation process and be compiled into a URegularExpression.
This means that NSMutableString objects can be safely used as regular expressions, and any mutations to those objects will immediately be detected and reflected in the regular expression used for matching.
Unfortunately, the ICU regular expression API requires that the compiled regular expression be "set" to the string to be searched. To search a different string, the compiled regular expression must be "set" to the new string. Therefore, RegexKitLite tracks the last NSString that each compiled regular expression was set to, recording the pointer to the NSString object, its hash, and its length. If any of these parameters are different from the last parameters used for a compiled regular expression, the compiled regular expression is "set" to the new string. Since mutating a string will likely change its hash value, it's generally safe to search NSMutableString objects, and in most cases the mutation will reset the compiled regular expression to the updated contents of the NSMutableString.
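The three-part check described above (object pointer, hash, and length) can be sketched in plain C. The names `set_record` and `needs_set` are hypothetical and the struct layout is an assumption made for illustration; the library's real bookkeeping is not shown in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What is remembered about the string a compiled regex was last "set" to. */
typedef struct {
    const void   *object;  /* address of the string object */
    unsigned long hash;    /* its hash at the time it was set */
    size_t        length;  /* its length at the time it was set */
} set_record;

/* True when the compiled regex must be "set" again before searching. */
static bool needs_set(const set_record *last, const void *object,
                      unsigned long hash, size_t length) {
    assert(last != NULL);
    return last->object != object
        || last->hash   != hash
        || last->length != length;
}
```

A mutation that left both the hash and the length unchanged would pass this check unnoticed, which is why the text says the mutation is detected "in most cases" rather than always.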
When performing a match, the arguments used to perform the match are kept. If those same arguments are used again, the actual matching operation is skipped because the compiled regular expression already contains the results for the given arguments. This is mostly useful when a regular expression contains multiple capture groups, and the results for different capture groups for the same match are needed. As a result, there is only a small penalty for iterating over all the capture groups of a match: each additional lookup essentially becomes the equivalent of a direct call to the ICU regular expression API functions uregex_start() and uregex_end().
RegexKitLite is ideal when the string being matched is a non-ASCII, Unicode string. This is because the regular expression engine used, ICU, can only operate on UTF-16 encoded strings. Since Cocoa keeps essentially all non-ASCII strings encoded in UTF-16 form internally, this means that RegexKitLite can operate directly on the string's buffer without having to make a temporary copy and transcode the string into ICU's required format.
As with object-oriented programming in general, the internal representation of an object's information is private. However, the ICU regular expression engine requires that the text to be searched be encoded as a UTF-16 string. For pragmatic purposes, Core Foundation has several public functions that can provide direct access to the buffer used to hold the contents of the string, but such direct access is only available if the private buffer is already encoded in the requested direct access format. As a rough rule of thumb, 8-bit simple strings, such as ASCII, are kept in their 8-bit format. Non 8-bit simple strings are stored as UTF-16 strings. Of course, this is an implementation private detail, so this behavior should never be relied upon. It is mentioned because of the tremendous impact on matching performance and efficiency it can have if a string must be converted to UTF-16.
For strings in which direct access to the UTF-16 string is available, RegexKitLite uses that buffer. This is the ideal case as no extra work needs to be performed, such as converting the string in to a UTF-16 string, and allocating memory to hold the temporary conversion. Of course, direct access is not always available, and occasionally the string to be searched will need to be converted in to a UTF-16 string.
RegexKitLite has two conversion buffer caches. Each buffer can only hold the contents of a single NSString at a time. If the selected buffer does not contain the contents of the NSString that is currently being searched, the previous occupant is ejected from the buffer and the current NSString takes its place. The first conversion buffer is fixed in size and set by the C pre-processor define RKL_FIXED_LENGTH, which defaults to 2048. Any string whose length is less than RKL_FIXED_LENGTH will use the fixed size conversion buffer. The second conversion buffer, used for strings whose length is longer than RKL_FIXED_LENGTH, is dynamically sized. The memory allocation for the dynamically sized conversion buffer is resized for each conversion with realloc() to the size needed to hold the entire contents of the UTF-16 converted string.
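The two-buffer arrangement can be sketched in plain C as follows. This is a simplified illustration, not the library's code: the names `FIXED_LENGTH`, `fixed_buffer`, `dynamic_buffer` and `conversion_buffer` are hypothetical, and sending a string of exactly `FIXED_LENGTH` units to the dynamic buffer is an assumption, since the text does not say which buffer handles that boundary case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FIXED_LENGTH 2048u  /* mirrors the default RKL_FIXED_LENGTH */

static uint16_t  fixed_buffer[FIXED_LENGTH];  /* never allocated or freed */
static uint16_t *dynamic_buffer;              /* the only heap pointer to track */

/* Returns a buffer able to hold `length` UTF-16 code units, or NULL on failure. */
static uint16_t *conversion_buffer(size_t length) {
    if (length < FIXED_LENGTH)
        return fixed_buffer;

    /* realloc() either reuses or releases the old block, so there is never
       a second allocation that could be leaked. */
    uint16_t *resized = realloc(dynamic_buffer, length * sizeof(uint16_t));
    if (resized == NULL)
        return NULL;            /* dynamic_buffer is still valid here */
    dynamic_buffer = resized;
    return dynamic_buffer;
}
```

Short strings never touch the allocator at all, and long strings share one block that grows or shrinks to fit the most recent conversion.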
This strategy was chosen for its relative simplicity. Keeping track of dynamically created resources is required to prevent memory leaks. As designed, there is only a single pointer to dynamically allocated memory: the pointer to hold the conversion contents of strings whose length is larger than RKL_FIXED_LENGTH. However, since realloc() is used to manage that memory allocation, it becomes very difficult to accidentally leak the buffer. Having the fixed sized buffer means that the memory allocation system isn't bothered with many small requests, most of which are transient in nature to begin with. The current strategy tries to strike the best balance between performance and simplicity.
When converted into a UTF-16 string, the hash of the NSString is recorded, along with the pointer to the NSString object and the string's length. For RegexKitLite to use the cached conversion, all of these parameters must be equal to the corresponding values of the NSString to be searched. If there is any difference, the cached conversion is discarded and the current NSString, or NSMutableString as the case may be, is reconverted into a UTF-16 string.
RegexKitLite is also multithreading safe. Access to the compiled regular expression cache and the conversion cache is protected by a single OSSpinLock to ensure that only one thread has access at a time. The lock remains held while the regular expression match is performed since the compiled regular expression returned by the ICU library is not safe to use from multiple threads. Once the match has completed, the lock is released, and another thread is free to lock the cache and perform a match.
The goal of RegexKitLite is not to be a comprehensive Objective-C regular expression framework, but to provide a set of easy to use primitives from which additional functionality can be created. To this end, RegexKitLite provides the following two core primitives from which everything else is built:
RegexKitLite 2.0 adds the ability to split strings by dividing them with a regular expression, and the ability to perform search and replace operations using common $n substitution syntax. replaceOccurrencesOfRegex:withString: is used to modify the contents of NSMutableString objects directly and stringByReplacingOccurrencesOfRegex:withString: will create a new, immutable NSString from the receiver.
There are no additional classes that supply the regular expression matching functionality; everything is accomplished with the two methods above. These methods are added to the existing NSString class via an Objective-C category extension. See NSString RegexKitLite Additions Reference for a complete list of methods.
The real workhorse is the rangeOfRegex:options:inRange:capture:error: method. The receiver of the message is an ordinary NSString class member that you wish to perform a regular expression match on. The parameters of the method are an NSString containing the regular expression regex, any RKLRegexOptions match options, the NSRange range of the receiver that is to be searched, the capture number from the regular expression regex that you would like the result for, and an optional error parameter that, if a problem occurs, will contain an NSError object with the details of the error.
A simple example:
In the previous example, the NSRange that capture number 2 matched is {5, 2}, which corresponds to the word is in searchString. Once the NSRange is known, you can create a new string containing just the matching text:
You can perform search and replace operations on NSString objects and use common $n capture group substitution in the replacement string:
In this example, the regular expression \b(\w+)\b has a single capture group, which is created with the use of () parentheses. The text that was matched inside the parentheses is available for use in the replacement text by using $n, where n is the parenthesized capture group you would like to use. Additional capture groups are numbered sequentially in the order that they appear from left to right. Capture group 0 (zero) is also available and is equivalent to all the text that the regular expression matched.
Mutable strings can be manipulated directly:
Strings can be split with a regular expression using the componentsSeparatedByRegex: methods. This functionality is nearly identical to the preexisting NSString method componentsSeparatedByString:, except instead of only being able to use a fixed string as a separator, you can use a regular expression:
The output from NSLog() when run from a shell:
Unfortunately our example string @"This is neat." doesn't allow us to show off the power of regular expressions. As you can probably imagine, splitting the string with the regular expression \s+ allows for one or more white space characters to be matched. This can be much more flexible than just a fixed string of @" ", which will split on a single space only. If our example string contained extra spaces, say @"This   is   neat.", the result would have been the same.
As a practical example of how to use the simple primitives provided by RegexKitLite, consider the common need of having to enumerate all the matches of a regular expression in a target string. The following example creates a simple NSEnumerator based enumerator for all the matches of a regular expression in a target string, returning a NSString of the text matched by the regular expression (capture 0) for each call to nextObject until the end of the string is reached. Each match begins searching where the last match ended.
The match enumerator is divided into two parts. The public part is defined in the header RKLMatchEnumerator.h, below. The second part is a private subclass of NSEnumerator whose interface resides only in the file RKLMatchEnumerator.m. Match enumerators are instantiated by sending an NSString class member the message matchEnumeratorWithRegex:. An NSString with the regular expression is passed as the only argument, and an NSEnumerator is returned.
Next, in RKLMatchEnumerator.m, we define our private sub-class of NSEnumerator. In it we declare three instance variables, string, regex, and location. The string ivar holds the string to search, while regex holds the regular expression string. To guard against mutations to either, an immutable copy is made. The location ivar is used to keep track of the current location from which to begin matching. Finally, we declare our designated initializer which initializes the instantiated RKLMatchEnumerator object with the string to search and the regular expression to use.
The following begins the implementation section of RKLMatchEnumerator and a fairly standard initialization method, initWithString:regex:.
The following implements the heart of any NSEnumerator, the nextObject method. If all of the matches have been enumerated, location will be set to NSNotFound, and the body of the if statement won't be evaluated and NULL will be returned.
If there are still matches to be found, searchRange is created to begin at the value of the location ivar, with the NSRange length set to the remaining length of the string to be searched, or [string length] - location.
Then, the match is performed using the RegexKitLite method rangeOfRegex:inRange: and the result stored in the variable matchedRange.
Next, the location ivar is updated to point to the location at the end of the matchedRange. Since it is possible to have a match with a length of zero, it must handle that special case by adding one, otherwise it will loop endlessly, always matching the same location of zero length. If there was no match, matchedRange.location will be NSNotFound and matchedRange.length will be 0, and the location ivar will be set to NSNotFound.
If the matched range location is not NSNotFound, then a substring of the matched range will be returned. Otherwise, we will exit the if body and return NULL, indicating that the NSEnumerator has no more matches to enumerate.
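The location bookkeeping walked through above can be sketched in plain C, independent of RegexKitLite. The names `NOT_FOUND` (playing the role of NSNotFound), `remaining_length` and `next_location` are hypothetical, and the sketch covers only the arithmetic, not the matching itself.

```c
#include <assert.h>
#include <stddef.h>

#define NOT_FOUND ((size_t)-1)  /* plays the role of NSNotFound */

/* Length of the range still to be searched: from location to the end of the string. */
static size_t remaining_length(size_t location, size_t string_length) {
    assert(location <= string_length);
    return string_length - location;
}

/* Given the range just matched, return where the next search should begin. */
static size_t next_location(size_t match_location, size_t match_length) {
    if (match_location == NOT_FOUND)
        return NOT_FOUND;                 /* no match: enumeration is over */
    size_t end = match_location + match_length;
    /* A zero-length match would be found again at the same spot, so step past it. */
    return (match_length == 0) ? end + 1 : end;
}
```

For instance, a zero-length match at offset 7 moves the next search to offset 8, while a four-character match at offset 0 resumes at offset 4.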
A standard dealloc, releasing the string and regex ivar objects created during initialization.
And finally, the NSString category addition that returns our match enumerator. This simply creates an instance of our private NSEnumerator sub-class RKLMatchEnumerator, initializes it with the string to match, self, using the regular expression regex, then sends the instantiated object autorelease, which is finally returned. Since this is a NSString category addition, this message will be sent to an instance of an object that is a member of the NSString class, which includes any objects whose super class is ultimately NSString. Therefore, the string to match is the instance receiving the message, self.
The following piece of code is a simple demonstration of the match enumerator which will use a regular expression to enumerate all the lines in the string to be searched.
The variable searchString contains the string to search. The example string includes several embedded \n, or new-line characters. There are a total of four lines of text, with the third line containing no characters.
The variable regex contains the regular expression to be used for matching. This regular expression begins with the sequence (?m) which is used to enable the RKLMultiline regular expression option from the text of the regular expression itself. This enables the metacharacters ^ and $ to match the start of and end of a line, respectively. The remaining characters .* will match any character '.' zero or more times '*'. The prose translation would be:
Enable the RKLMultiline option and match all of the characters from the beginning of a line until the end of a line.
The match enumerator is then instantiated and the results are enumerated with a standard while loop, setting matchedString to the object returned by nextObject. For each line that is returned, the current line number, length of the matched string, and the matched string are printed.
The following shell transcript demonstrates compiling the example and executing it. Line number three clearly demonstrates that matches of zero length are possible. Without the additional logic in nextObject to handle this special case, the enumerator would never advance past the match.
For your convenience, the regular expression syntax from the ICU documentation is included below. When in doubt, you should refer to the official ICU User Guide - Regular Expressions documentation page.
Operator | Description |
---|---|
\| | Alternation. A\|B matches either A or B. |
* | Match zero or more times. Match as many times as possible. |
+ | Match one or more times. Match as many times as possible. |
? | Match zero or one times. Prefer one. |
{n} | Match exactly n times. |
{n,} | Match at least n times. Match as many times as possible. |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. |
*? | Match zero or more times. Match as few times as possible. |
+? | Match one or more times. Match as few times as possible. |
?? | Match zero or one times. Prefer zero. |
{n}? | Match exactly n times. |
{n,}? | Match at least n times, but no more than required for an overall pattern match. |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. |
*+ | Match zero or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails. Possessive match. |
++ | Match one or more times. Possessive match. |
?+ | Match zero or one times. Possessive match. |
{n}+ | Match exactly n times. Possessive match. |
{n,}+ | Match at least n times. Possessive match. |
{n,m}+ | Match between n and m times. Possessive match. |
(…) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
(?:…) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
(?>…) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the (?>. |
(?#…) | Free-format comment (?#comment). |
(?=…) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?!…) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<=…) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?<!…) | Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?ismwx-ismwx:…) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. See also: Regular Expression Options |
The following was originally from ICU User Guide - UnicodeSet, but has been adapted to fit the needs of this documentation. Specifically, the ICU UnicodeSet documentation describes an ICU C++ object, UnicodeSet. Here the term UnicodeSet has been replaced with Character Class, which is more appropriate in the context of regular expressions. As always, you should refer to the original, official documentation when in doubt.
A character class is a regular expression pattern that represents a set of Unicode characters or character strings. The following table contains some example character class patterns:
Pattern | Description |
---|---|
[a-z] | The lower case letters a through z |
[abc123] | The six characters a, b, c, 1, 2, and 3 |
[\p{Letter}] | All characters with the Unicode General Category of Letter. |
In addition to being a set of Unicode code point characters, a character class may also contain string values. Conceptually, a character class is always a set of strings, not a set of characters. Historically, regular expressions have treated […] character classes as being composed of single characters only, which is equivalent to a string that contains only a single character.
Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have ranges indicated by a - between two characters, as in a-z. The sequence specifies the range of all characters from the left to the right, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity as [a c d-f m] means the same as [acd-fm].
Unicode property sets are specified by a Unicode property, such as [:Letter:]. ICU version 2.0 supports General Category, Script, and Numeric Value properties (ICU will support additional properties in the future). For a list of the property names, see the end of this section. The syntax for specifying the property names is an extension of either POSIX or Perl syntax with the addition of =value. For example, you can match letters by using the POSIX syntax [:Letter:], or by using the Perl syntax \p{Letter}. The type can be omitted for the Category and Script properties, but is required for other properties.
The following table lists the standard and negated forms for specifying Unicode properties in both POSIX and Perl syntax. The negated form specifies a character class that includes everything but the specified property. For example, [:^Letter:] matches all characters that are not [:Letter:].
Syntax Style | Standard | Negated |
---|---|---|
POSIX | [:type=value:] | [:^type=value:] |
Perl | \p{type=value} | \P{type=value} |
Character classes can then be modified using standard set operations: Union, Inverse, Difference, and Intersection.
To union two sets, simply concatenate them. For example, [[:letter:] [:number:]]
To intersect two sets, use the & operator. For example, [[:letter:] & [a-z]]
To take the set-difference of two sets, use the - operator. For example, [[:letter:] - [a-z]]
To invert a set, place a ^ immediately after the opening [. For example, [^a-z]. In any other location, the ^ does not have a special meaning.
The binary operators & and - have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. As another example, the set [[ace][bdf] - [abc][def]] is not the empty set, but instead the set [def]. This only really matters for the difference operation, as the intersection operation is commutative.
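The [[ace][bdf] - [abc][def]] result can be checked by hand with a small plain C sketch that models sets of lowercase letters as bitmasks. This has nothing to do with how ICU implements character classes; the names `letter_set`, `set_of` and `example` are hypothetical, and the left-to-right grouping in `example` is one reading of the pattern that is consistent with the stated result [def].

```c
#include <assert.h>
#include <stdint.h>

/* A set of lowercase ASCII letters as a bitmask: bit 0 is 'a', bit 25 is 'z'. */
typedef uint32_t letter_set;

static letter_set set_of(const char *letters) {
    letter_set s = 0;
    for (; *letters; letters++) {
        assert(*letters >= 'a' && *letters <= 'z');
        s |= (letter_set)1 << (*letters - 'a');
    }
    return s;
}

static letter_set set_union(letter_set a, letter_set b)        { return a | b; }
static letter_set set_difference(letter_set a, letter_set b)   { return a & ~b; }
static letter_set set_intersection(letter_set a, letter_set b) { return a & b; }

/* [[ace][bdf] - [abc][def]] read left to right:
   (([ace] union [bdf]) minus [abc]) union [def] */
static letter_set example(void) {
    letter_set s = set_union(set_of("ace"), set_of("bdf"));
    s = set_difference(s, set_of("abc"));
    return set_union(s, set_of("def"));
}
```

Reading the pattern instead as ([ace][bdf]) minus ([abc][def]) would give the empty set, which is exactly the interpretation the text warns against.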
Another caveat with the & and - operators is that they operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for A, enclose the A in a set: [[:Lu:]-[A]].
Pattern | Description |
---|---|
[a] | The set containing a. |
[a-z] | The set containing a through z and all letters in between, in Unicode order. |
[^a-z] | The set containing all characters but a through z, that is, U+0000 through a-1 and z+1 through U+FFFF. |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2. |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2. |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2. |
[:Lu:] | The set of characters belonging to the given Unicode category. In this case, Unicode uppercase letters. The long form for this is [:UppercaseLetter:]. |
[:L:] | The set of characters belonging to all Unicode categories starting with L, that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:]. |
String values are enclosed in {curly brackets}. For example:
Pattern | Description |
---|---|
[abc{def}] | A set containing four members, the single characters a, b, and c and the string def |
[{abc}{def}] | A set containing two members, the string abc and the string def. |
[{a}{b}{c}], [abc] | These two sets are equivalent. Each contains three items, the three individual characters a, b, and c. A {string} containing a single character is equivalent to that same character specified in any other way. |
Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way, except for two adjacent single quotes. It is taken as literal text; special characters become non-special. These quoting conventions for ICU character classes differ from those of Perl or Java. In those environments, single quotes have no special meaning, and are treated like any other literal character.
Outside of single quotes, certain backslashed characters have special meaning:
Pattern | Description |
---|---|
\uhhhh | Exactly 4 hex digits; h in [0-9A-Fa-f] |
\Uhhhhhhhh | Exactly 8 hex digits |
\xhh | 1-2 hex digits |
\ooo | 1-3 octal digits; o in [0-7] |
\a | U+0007 BELL |
\b | U+0008 BACKSPACE |
\t | U+0009 HORIZONTAL TAB |
\n | U+000A LINE FEED |
\v | U+000B VERTICAL TAB |
\f | U+000C FORM FEED |
\r | U+000D CARRIAGE RETURN |
\\ | U+005C BACKSLASH |
Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{Lu} is the set of uppercase letters. Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters.
Whitespace, as defined by the ICU API, is ignored unless it is quoted or backslashed.
The following property value styles are recognized:
Style | Description |
---|---|
Short | Omits the type= argument. Used to prevent ambiguity and only allowed with the Category and Script properties. |
Medium | Uses an abbreviated type and value. |
Long | Uses a full type and value. |
If the type or value is omitted, then the = equals sign is also omitted. The short style is only used for Category and Script properties because these properties are very common and their omission is unambiguous.
In actual practice, you can mix type names and values that are omitted, abbreviated, or full. For example, if Category=Unassigned you could use what is in the table explicitly, \p{gc=Unassigned}, \p{Category=Cn}, or \p{Unassigned}.
When these are processed, case and whitespace are ignored so you may use them for clarity, if desired. For example, \p{Category = Uppercase Letter} or \p{Category = uppercase letter}.
For a list of properties supported by ICU, see ICU User Guide - Unicode Properties.
The following tables list some of the commonly used Unicode Properties, which can be matched in a regular expression with \p{Property}. The tables were created from the Unicode 5.0 Unicode Character Database, which is the version used by ICU that ships with Mac OS X 10.5.
Category | |
---|---|
L | Letter |
LC | CasedLetter |
Lu | UppercaseLetter |
Ll | LowercaseLetter |
Lt | TitlecaseLetter |
Lm | ModifierLetter |
Lo | OtherLetter |
P | Punctuation |
Pc | ConnectorPunctuation |
Pd | DashPunctuation |
Ps | OpenPunctuation |
Pe | ClosePunctuation |
Pi | InitialPunctuation |
Pf | FinalPunctuation |
Po | OtherPunctuation |
N | Number |
Nd | DecimalNumber |
Nl | LetterNumber |
No | OtherNumber |
M | Mark |
Mn | NonspacingMark |
Mc | SpacingMark |
Me | EnclosingMark |
S | Symbol |
Sm | MathSymbol |
Sc | CurrencySymbol |
Sk | ModifierSymbol |
So | OtherSymbol |
Z | Separator |
Zs | SpaceSeparator |
Zl | LineSeparator |
Zp | ParagraphSeparator |
C | Other |
Cc | Control |
Cf | Format |
Cs | Surrogate |
Co | PrivateUse |
Cn | Unassigned |
Script | ||
---|---|---|
Arabic | Armenian | Balinese |
Bengali | Bopomofo | Braille |
Buginese | Buhid | CanadianAboriginal |
Cherokee | Common | Coptic |
Cuneiform | Cypriot | Cyrillic |
Deseret | Devanagari | Ethiopic |
Georgian | Glagolitic | Gothic |
Greek | Gujarati | Gurmukhi |
Han | Hangul | Hanunoo |
Hebrew | Hiragana | Inherited |
Kannada | Katakana | Kharoshthi |
Khmer | Lao | Latin |
Limbu | LinearB | Malayalam |
Mongolian | Myanmar | NewTaiLue |
Nko | Ogham | OldItalic |
OldPersian | Oriya | Osmanya |
PhagsPa | Phoenician | Runic |
Shavian | Sinhala | SylotiNagri |
Syriac | Tagalog | Tagbanwa |
TaiLe | Tamil | Telugu |
Thaana | Thai | Tibetan |
Tifinagh | Ugaritic | Unknown |
Yi |
Extended Property Class | |
---|---|
ASCIIHexDigit | Alphabetic |
BidiControl | Dash |
DefaultIgnorableCodePoint | Deprecated |
Diacritic | Extender |
GraphemeBase | GraphemeExtend |
GraphemeLink | HexDigit |
Hyphen | IDSBinaryOperator |
IDSTrinaryOperator | IDContinue |
IDStart | Ideographic |
JoinControl | LogicalOrderException |
Lowercase | Math |
NoncharacterCodePoint | OtherAlphabetic |
OtherDefaultIgnorableCodePoint | OtherGraphemeExtend |
OtherIDContinue | OtherIDStart |
OtherLowercase | OtherMath |
OtherUppercase | PatternSyntax |
PatternWhiteSpace | QuotationMark |
Radical | STerm |
SoftDotted | TerminalPunctuation |
UnifiedIdeograph | Uppercase |
VariationSelector | WhiteSpace |
XIDContinue |
Unicode properties are defined in the Unicode Character Database, or UCD. From time to time the UCD is revised and updated. The properties available, and the definition of the characters they match, depend on the UCD that ICU was built with.
Character | Description |
---|---|
$n | The text of capture group n will be substituted for $n. n must be ≥ 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $.
Important: |
\ | Treat the character following the backslash as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for $ and \, but may precede any character. The backslash itself will not be copied to the substitution text. |
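As a rough illustration of the substitution rules in the table above, the following C sketch expands a template against an array of already captured strings. It is not RegexKitLite or ICU code: the function name expand_template and its parameters are hypothetical, and it assumes single digit group numbers for brevity.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expands tmpl in to out (capacity out_size).
   captures[0] is the whole match, captures[1..] are the capture groups.
   Returns 0 on success, -1 if $n names a missing group or out is too small. */
int expand_template(const char *tmpl, const char *const captures[],
                    size_t capture_count, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) return -1;
    for (const char *p = tmpl; *p != '\0'; p++) {
        const char *piece = p;   /* default: copy the current character */
        size_t piece_len = 1;
        if (*p == '\\' && p[1] != '\0') {
            piece = ++p;         /* copy the escaped character, drop the backslash */
        } else if (*p == '$' && isdigit((unsigned char)p[1])) {
            size_t n = (size_t)(p[1] - '0');
            if (n >= capture_count) return -1;
            piece = captures[n]; /* substitute the text of group n */
            piece_len = strlen(piece);
            p++;
        }                        /* a $ not followed by a digit is copied as itself */
        if (used + piece_len >= out_size) return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
    }
    out[used] = '\0';
    return 0;
}
```

Called with the captures of a date match such as { "2008-03-12", "2008", "03", "12" }, the template $3/$2/$1 reorders the fields, while \$ produces a literal $.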
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. |
Jamie Zawinski |
This section contains a collection of regular expressions and example code demonstrating how RegexKitLite makes some common programming chores easier. RegexKitLite makes it easy to match part of a string and extract just that part, or even create an entirely new string using just a few pieces of the original string. A great example of this is a string that contains a URL from which you need to extract just a part, perhaps the host or maybe just the port used. This example demonstrates how easy it is to extract the port used from a URL, which is then converted in to a NSInteger value:
Inside you'll find more examples like this that you can use as the starting point for your own regular expression pattern matching solution. Keep in mind that these are meant to be examples to help get you started and not necessarily the ideal solution for every need. Trade‑offs are usually made when creating a regular expression; matching an email address is a perfect example of this. A regular expression that precisely matches the formal definition of an email address is both complicated and usually unnecessary. Knowing which trade‑offs are acceptable requires that you understand what it is you're trying to match, the data that you're searching through, and the requirements and uses of the matched results. It won't take long until you gain an appreciation for Jamie Zawinski's infamous quote.
Copied Regex Escape Style: | ||
Escape Style Options: | Smart escape | |
Use C99 \u character escapes | ||
Escaped Unicode in NSString literals |
This browser supports Copy To Clipboard features that can be used by this document. Using the Copy Regex to Clipboard Preferences interface, you can configure the behavior so that when you select and copy a regular expression in this document, the clipboard will contain the selected regular expression, escaped according to your preference choices. The resulting text in the clipboard can be pasted directly in to your source code; no further modification of the regular expression is required.
Your preference choices are stored in the browser using a cookie. Provided that cookies are enabled in your browser, your preferences should persist across browser sessions. The preference panel also contains a preview of what would be copied to the clipboard for some example regular expressions using the current settings. When you change a setting in the preferences, the examples will transition between the previous settings and the new settings, while briefly highlighting the differences between the two settings, so you can determine the effect the change has caused.
Some of the problems of using regular expressions unmodified in C and Objective-C are:
There are several different ways to escape Unicode characters. When using Unicode characters in NSString literals, the best method depends on the version of Xcode that is used to compile the source and what C standard you are using (i.e., gcc -std=(c|gnu)99).
When the Copied Regex Escape Style is set to None, the selected regular expression is copied unmodified to the clipboard.
When the Copied Regex Escape Style is set to Escape Only, C String, or NSString, these details are automatically dealt with for you. This allows you to paste the selected regular expression directly in to your source code without having to escape every use of \ by hand, or having to convert the Unicode characters in to an acceptable format.
If the Smart escape option is disabled then all occurrences of \ and " are escaped with a \, resulting in \\ or \". No other processing is performed. Since this prevents the C compiler from interpreting any special meaning of \ sequences, it is the safest option. The ICU regular expression engine is then responsible for interpreting the meaning of any \ escaped character sequences.
If the Smart escape option is enabled then certain escape sequences are interpreted and rewritten using the current preference options.
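The simplest escaping described above, where every \ and " is prefixed with a \, is easy to sketch in C. The helper below is only an illustration of that rule, not the code used by this document's clipboard feature; the name escape_for_c_literal and its signature are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: copies regex in to out, prefixing every backslash
   and double quote with a backslash so the result can be pasted between
   the quotes of a C or NSString literal.
   Returns 0 on success, -1 if out (capacity out_size) is too small. */
int escape_for_c_literal(const char *regex, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) return -1;
    for (const char *p = regex; *p != '\0'; p++) {
        if (*p == '\\' || *p == '"') {
            if (used + 1 >= out_size) return -1;
            out[used++] = '\\';  /* the added escape */
        }
        if (used + 1 >= out_size) return -1;
        out[used++] = *p;        /* the original character */
    }
    out[used] = '\0';
    return 0;
}
```

With this rule the regular expression \bhttps?:// becomes \\bhttps?://, which the C compiler turns back in to the original text before the ICU engine ever sees it.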
Normally, Unicode characters are embedded in string literals as the character's UTF-8 byte sequence using \ddd octal escapes. When this option is enabled, the C99 \u and \U character escape sequences are used instead. gcc will issue a warning if \u character escape sequences are present and the compiler is not configured to use the C99 (or later) standard (i.e., gcc -std=(c|gnu)99).
Under the C99 standard, \u and \U are used to specify a universal character name, which is a character encoded in the ISO/IEC 10646 character set (essentially identical to Unicode in this context). Ultimately, a universal character name is translated in to a sequence of bytes needed to represent the designated character in the C environment's execution character set. Usually, although certainly not always, a string literal should be encoded as UTF-8, which happens to be the default execution character set for gcc. This is an important point to remember because the more convenient and easier to use \u escape sequences are not guaranteed to convert in to a specific sequence of bytes, unlike an octal \ddd or hex \xhh escape sequence. There is currently no way to specify that a particular string literal should always be translated using a specific character set encoding. This may result in undefined behavior if the \u universal character name is not translated in to the expected character set, which in this case must be UTF-8.
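The difference between the two escape styles can be made concrete with the € euro sign, U+20AC. The C sketch below computes the UTF-8 bytes for a code point so they can be compared with an explicit octal escaped literal, which always produces the same bytes regardless of the execution character set. The helper utf8_encode is hypothetical and is not part of RegexKitLite.

```c
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
   Writes up to 4 bytes plus a terminating NUL in to out.
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, char out[5])
{
    size_t n;
    if (cp < 0x80UL) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800UL) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFFUL) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    out[n] = '\0';
    return n;
}

/* The euro sign written with explicit octal escapes: always these 3 bytes. */
static const char euro_octal[] = "\342\202\254";
```

Encoding 0x20AC with utf8_encode yields the same three bytes as euro_octal. A "\u20AC" literal only matches them when the compiler's execution character set is UTF-8.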
Prior to Xcode 3.0, gcc only supported the use of ASCII characters (i.e., characters ≤ 127) in constant NSString literals. If one needed to include Unicode characters in an NSString, one would typically convert the string in to UTF-8, and then create a NSString at run time using the stringWithUTF8String: method, with the UTF-8 encoded C string passed as the argument. For example, "€1.99", which contains the € euro symbol, would be created using the following:
One of the obvious disadvantages of this approach is that it instantiates a new, autoreleased NSString each time it's used, unlike a constant NSString literal like @"$1.99". Beginning with Xcode 3.0 and gcc 4.0, constant NSString literals that contain Unicode characters can be specified directly in source code using the standard @"" syntax. For example:
The compiler converts these strings to UTF-16 using the endianness of the target architecture. Since the Mach-O object file format allows for multiple architectures, this allows each architecture to encode the string as native UTF-16 byte ordering for that architecture, so there are no issues with proper byte ordering. Within the object file itself, these strings are essentially identical to their ASCII-only counterparts: effectively they are pre-instantiated objects. The only real difference is that the compiler sets some internal CFString bits differently so that the CFString object knows that the string's data is encoded as UTF-16 and not simple 8-bit data.
This means that constant NSString objects created this way should work on Mac OS X versions prior to 10.5. The author has tried this on Mac OS X 10.4 and did not encounter any problems on either architecture (ppc, i386). Since the author is unaware of any publicly available documentation regarding this feature, it is difficult to say if there are any minimum requirements or other limitations when using constant NSString literals that contain Unicode characters.
This document will briefly add a red outline around the selected text as a visual aid in determining whether or not the selected text was modified before placing it in the clipboard. In addition to this, a HUD display will drop down briefly at the top of the document's window and will display the escaped text that was placed in to the clipboard. Here is an example of what would be displayed if the escape style was selected:
The selected text will only be escaped if it can be determined that it is a regular expression, and the selection only contains a regular expression. If the regular expression is part of an overall larger selection then the text that is copied to the clipboard is not modified.
Also available is the Regex Escape Tool which allows you to enter a regular expression and have it immediately escaped using the current preference settings. This can be useful if you have a complex regular expression that needs to be escaped before you can use it in a constant NSString literal. It can also be used to easily create a constant NSString literal that contains several Unicode characters that would otherwise have to be converted by hand.
Description | Regex | Examples |
---|---|---|
Integer | [+\-]?[0-9]+ | 123 -42 +23 |
Hex Number | 0[xX][0-9a-fA-F]+ | 0x0 0xdeadbeef 0xF3 |
Floating Point | [+\-]?(?:[0-9]*\.[0-9]+|[0-9]+\.) | 123. .123 +.42 |
Floating Point with Exponent | [+\-]?(?:[0-9]*\.[0-9]+|[0-9]+\.)(?:[eE][+\-]?[0-9]+)? | 123. .123 10.0E13 1.23e-7 |
Comma Separated Number | [0-9]{1,3}(?:,[0-9]{3})* | 42 1,234 1,234,567 |
Comma Separated Number | [0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)? | 42 1,234 1,234,567.89 |
NSString includes several methods for converting the contents of the string in to a numeric value in the various C primitive types. The following demonstrates the matching of an int and double in a NSString, and then converting the matched string in to its base type.
The variable matchedInt now contains the value of 5542.
The variable matchedDouble now contains the value of 4321.9876. doubleValue can even convert numbers that are in scientific notation, which represent numbers as a times ten to the power of b:
The variable matchedDouble now contains the value of 101048.9.
Converting a string that contains a hex number in to a more basic type, such as an int, takes a little more work. Unfortunately, Foundation does not provide an easy way to convert a hex value in a string in to a more basic type as it does with intValue or doubleValue. Thankfully the standard C library provides a set of functions for performing such a conversion. For this example we will use the strtol() (string to long) function to convert the hex value we've extracted from searchString. We can not pass the pointer to the NSString object that contains the matched hex value since strtol() is part of the standard C library which can only work on pointers to C strings. We use the UTF8String method to get a pointer to a compatible C string of the matched hex value.
The full set of string to… functions are: strtol(), strtoll(), strtoul(), and strtoull(). These convert a string value, from base 2 to base 36, in to a long, long long, unsigned long, and unsigned long long respectively.
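A minimal C sketch of the strtol() conversion described above follows. The wrapper name parse_hex is hypothetical; it simply adds the error checks that strtol(3) makes possible, and the C string passed in would come from something like the UTF8String method.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: converts hexadecimal text such as "0xF3" or "1f".
   Leading whitespace and an optional 0x / 0X prefix are accepted by strtol().
   Returns 1 and stores the result in *value on success, 0 on failure. */
int parse_hex(const char *text, long *value)
{
    char *end = NULL;
    errno = 0;
    long parsed = strtol(text, &end, 16);    /* base 16 */
    if (end == text) return 0;               /* no hex digits were converted */
    if (errno == ERANGE) return 0;           /* value does not fit in a long */
    *value = parsed;
    return 1;
}
```

Passing a different base as the third argument, or switching to strtoul() or strtoull(), follows the same pattern.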
Since it seems to be a frequently asked question, and a common search engine query for RegexKit web site visitors, here is a NSString category addition that converts the receivers text in to a NSInteger value. This is the same functionality as intValue or doubleValue, except that it converts hexadecimal text values instead of decimal text values.
The example conversion code is fairly quick since it uses Core Foundation directly along with the stack to hold any temporary string conversions. Any whitespace at the beginning of the string will be skipped and the hexadecimal text to be converted may be optionally prefixed with either 0x or 0X. Returns 0 if the receiver does not begin with a valid hexadecimal text representation. Refer to strtol(3) for additional conversion details.
Description | Regex |
---|---|
Empty Line | (?m:^$) |
Empty or Whitespace Only Line | (?m-s:^\s*$) |
Strip Leading Whitespace | (?m-s:^\s*(.*?)$) |
Strip Trailing Whitespace | (?m-s:^(.*?)\s*$) |
Strip Leading and Trailing Whitespace | (?m-s:^\s*(.*?)\s*$) |
Quoted String, Can Span Multiple Lines, May Contain \" | "(?:[^"\\]*+|\\.)*" |
Quoted String, Single Line Only, May Contain \" | "(?:[^"\\\r\n]*+|\\[^\r\n])*" |
HTML Comment | (?s:<!--.*?-->) |
Perl / Shell Comment | (?m-s:#.*$) |
C, C++, or ObjC Comment | (?m-s://.*$) |
C, C++, or ObjC Comment and Leading Whitespace | (?m-s:\s*//.*$) |
C, C++, or ObjC Comment | (?s:/\*.*?\*/) |
Unfortunately, when processing text files, there is no standard 'newline' character or character sequence. Today this most commonly surfaces when converting text between Microsoft Windows / MS-DOS and Unix / Mac OS X. The reason for the proliferation of newline standards is largely historical and goes back many decades. Below is a table of the dominant newline character sequence 'standards':
Description | Sequence | C String | Control | Common Uses |
---|---|---|---|---|
Line Feed | \u000A | \n | ^J | Unix, Amiga, Mac OS X |
Vertical Tab | \u000B | \v | ^K | |
Form Feed | \u000C | \f | ^L | |
Carriage Return | \u000D | \r | ^M | Apple ][, Mac OS ≤ 9 |
Next Line (NEL) | \u0085 | | | IBM / EBCDIC |
Line Separator | \u2028 | | | Unicode |
Paragraph Separator | \u2029 | | | Unicode |
Carriage Return + Line Feed | \u000D\u000A | \r\n | ^M^J | MS-DOS, Windows |
Ideally, one should be flexible enough to accept any of these character sequences if one has to process text files, especially if the origin of those text files is not known. Thankfully, regular expressions excel at just such a task. Below is a regular expression pattern that will match any of the above character sequences. This is also the character sequence that the metacharacter $ matches.
Description | Regex | Notes |
---|---|---|
Any newline | (?:\r\n|[\n\v\f\r\x85\p{Zl}\p{Zp}]) | UTS #18 recommended. Character sequence that $ matches. |
It is often necessary to work with the individual lines of a file. There are two regular expression metacharacters, ^ and $, that match the beginning and end of a line, respectively. However, exactly what is matched by ^ and $ depends on whether or not the multi-line option is enabled for the regular expression, which by default is disabled. It can be enabled for the entire regular expression by passing RKLMultiline via the options: method argument, or within the regular expression using the options syntax— (?m:…).
If multi-line is disabled, then ^ and $ match the beginning and end of the entire string. If there is a newline character sequence at the very end of the string, then $ will match the character just before the newline character sequence. Any newline character sequences in the middle of the string will not be matched.
If multi-line is enabled, then ^ and $ match the beginning and end of a line, where the end of a line is the newline character sequence. The metacharacter ^ matches either the first character in the string, or the first character following a newline character sequence. The metacharacter $ matches either the last character in the string, or the character just before a newline character sequence.
A common text processing pattern is to process a file one line at a time. Using the recommended regular expression for matching any newline and the componentsSeparatedByRegex: method, you can easily create a NSArray containing every line in a file and process it one line at a time:
The componentsSeparatedByRegex: method effectively 'chops off' the matched regular expression, or in this case any newline character. In the example above, within the for…in loop, lineString will not have a newline character at the end of the string.
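To make explicit which character sequences the "any newline" pattern accepts, here is a hand-written C sketch that recognizes the same sequences in UTF-8 encoded text without a regular expression engine. This is a swapped-in technique for illustration only, not how RegexKitLite or ICU work internally, and the function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length in bytes of the newline sequence
   that starts at s (NUL-terminated UTF-8), or 0 if there is none. */
size_t newline_length(const char *s)
{
    const unsigned char *u = (const unsigned char *)s;
    if (u[0] == '\r') return (u[1] == '\n') ? 2 : 1;  /* CR LF is tried before a lone CR */
    if (u[0] == '\n' || u[0] == '\v' || u[0] == '\f') return 1;
    if (u[0] == 0xC2 && u[1] == 0x85) return 2;       /* U+0085 Next Line */
    if (u[0] == 0xE2 && u[1] == 0x80 &&
        (u[2] == 0xA8 || u[2] == 0xA9)) return 3;     /* U+2028 and U+2029 */
    return 0;
}

/* Counts the pieces that splitting text on any newline would produce:
   one more than the number of newline sequences found. */
size_t count_components(const char *text)
{
    size_t components = 1;
    while (*text != '\0') {
        size_t n = newline_length(text);
        if (n > 0) {
            components++;
            text += n;
        } else {
            text++;
        }
    }
    return components;
}
```

Note that \r\n is checked first, mirroring the order of the alternatives in the regular expression, so a Windows style line ending counts as one separator rather than two.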
Description | Regex |
---|---|
Split CSV line | ,(?=(?:(?:[^"\\]*+|\\")*"(?:[^"\\]*+|\\")*")*(?!(?:[^"\\]*+|\\")*"(?:[^"\\]*+|\\")*$)) |
This regular expression essentially works by ensuring that there are an even number of unescaped " quotes following a , comma. This is done by using look-ahead assertions. The first look-ahead assertion, (?=, is a pattern that matches zero or more strings that contain two " characters. Then, a negative look-ahead assertion matches a single, unpaired " quote character remaining at the $ end of the line. It also uses possessive matches in the form of *+ for speed, which prevents the regular expression engine from backtracking excessively. It's certainly not a beginner's regular expression.
The following is used as a substitute for a CSV data file in the example below.
This example really highlights the power of regular expressions when it comes to processing text. It takes just 17 lines, which includes comments, to parse a CSV data file of any newline type and create a row by column of NSArray values of the results while correctly handling " quoted values, including escaped \" quotes.
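The idea behind the split pattern, that a comma only separates fields when it is not inside a " quoted value, can also be expressed as a small state machine. The C sketch below is a different technique from the look-ahead based regular expression and is offered only as an aid to understanding it. The function name csv_count_fields is hypothetical, and unlike the regular expression it skips any backslash escaped character, not just \".

```c
#include <stddef.h>

/* Hypothetical helper: counts the fields in one CSV line. Commas inside
   double quoted values do not separate fields, and a backslash escapes
   the character that follows it (for example \"). */
size_t csv_count_fields(const char *line)
{
    size_t fields = 1;
    int in_quotes = 0;
    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                     /* skip the escaped character */
        } else if (*p == '"') {
            in_quotes = !in_quotes;  /* an unescaped quote opens or closes a value */
        } else if (*p == ',' && !in_quotes) {
            fields++;                /* a separator outside of quotes */
        }
    }
    return fields;
}
```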
Description | Regex |
---|---|
HTTP | \bhttps?://[a-zA-Z0-9\-.]+(?:(?:/[a-zA-Z0-9\-._?,'+\&%$=~*!():@\\]*)+)? |
HTTP | \b(https?)://([a-zA-Z0-9\-.]+)((?:/[a-zA-Z0-9\-._?,'+\&%$=~*!():@\\]*)+)? |
HTTP | \b(https?)://(?:(\S+?)(?::(\S+?))?@)?([a-zA-Z0-9\-.]+)(?::(\d+))?((?:/[a-zA-Z0-9\-._?,'+\&%$=~*!():@\\]*)+)? |
E-Mail | \b([a-zA-Z0-9%_.+\-]+)@([a-zA-Z0-9.\-]+?\.[a-zA-Z]{2,6})\b |
Hostname | \b(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}?[a-zA-Z0-9]\.)+[a-zA-Z]{2,6}\b |
IP | \b(?:\d{1,3}\.){3}\d{1,3}\b |
IP with Optional Netmask | \b((?:\d{1,3}\.){3}\d{1,3})(?:/(\d{1,2}))?\b |
IP or Hostname | \b(?:(?:\d{1,3}\.){3}\d{1,3}|(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}?[a-zA-Z0-9]\.)+[a-zA-Z]{2,6})\b |
The following example demonstrates how to match several fields in a URL and create a NSDictionary with the extracted results. Only the capture groups that result in a successful match will create a corresponding key in the dictionary.
This example can form the basis of a function or method that takes a NSString as an argument and returns a NSDictionary as a result, maybe even as a category addition to NSString. The following is the output when the example above is compiled and run:
The following outlines the steps required to use RegexKitLite in your project.
Unfortunately, adding additional dynamic shared libraries that your application links to is not a straightforward process in Xcode, nor is there any recommended standard way. Two options are presented below. The first is the 'easy' way that alters your application's Xcode build settings to pass an additional command line argument directly to the linker. The second option attempts to add the ICU dynamic shared library to the list of resources for your project and then configures your executable to link against the added resource.
The 'easy' way is the recommended way to link against the ICU dynamic shared library.
First, determine the build settings layer of your project that the altered linking configuration should be applied to. The build settings in Xcode are divided in to layers, from the top, global layer down to the most specific layer, and each layer inherits the build settings from the layer above it. If your project is large enough to have multiple targets and executables, you probably have an idea which layer is appropriate.
Select the appropriate layer, then find the Other Linker Flags build setting among the many build settings available and edit it. Add -licucore [dash ell icucore as a single word, without spaces]. If there are already other flags present, it is recommended that you add -licucore to the end of the existing flags.
For the second option, first add the ICU dynamic shared library to your Xcode project. You may choose to add the library to any group in your project, and which groups are created by default is dependent on the template type you chose when you created your project. For a typical Cocoa application project, a good choice is the Frameworks group. To add the ICU dynamic shared library, control/right-click on the Frameworks group and choose the command for adding existing files to the project.
Next, you will need to choose the ICU dynamic shared library file to add. Exactly which file to choose depends on your project, but a fairly safe choice is to select /Developer/SDKs/MacOSX10.5.sdk/usr/lib/libicucore.dylib. You may have installed your developer tools in a different location than the default /Developer directory, and the Mac OS X SDK version should be the one your project is targeting, typically the latest one available.
Then, in the dialog that follows, make sure that Copy items into… is unselected. Select the targets you will be using RegexKitLite in and then click to add the ICU dynamic shared library to your project.
Once the ICU dynamic shared library is added to your project, you will need to add it to the libraries that your executable is linked with. To do so, expand the Targets group, and then expand the executable targets you will be using RegexKitLite in. You will then need to select the libicucore.dylib file that you added in the previous step and drag it in to the Link Binary With Libraries group for each executable target that you will be using RegexKitLite in. The order of the files within the Link Binary With Libraries group is not important, and for a typical Cocoa application the group will contain the Cocoa.framework file.
Next, add the RegexKitLite source files to your Xcode project. In the Groups & Files outline view on the left, control/right-click on the group that you would like to add the files to, then select the command for adding existing files.
Select the RegexKitLite.h and / or RegexKitLite.m file from the file chooser dialog.
The next dialog will present you with several options. If you have not already copied the RegexKitLite files in to your project's directory, you may want to click on the Copy items into… option. Select the targets that you would like to add the RegexKitLite functionality to.
Finally, you will need to include the RegexKitLite.h header file. The best way to do this is very dependent on your project. If your project consists of only half a dozen source files, you can add:
manually to each source file that makes use of RegexKitLite's features. If your project has grown beyond this, you've probably already organized a common "master" header that collects the headers required by nearly all source files, which is a natural place for the RegexKitLite.h import.
Using RegexKitLite from the shell is also easy. Again, you need to add the header #import to the appropriate source files. Then, to link to the ICU library, you typically only need to add -licucore, just as you would any other library. Consider the following example:
Compiled and run from the shell:
RegexKitLite is not meant to be a full featured regular expression framework. Because of this, it provides only the basic primitives needed to create additional functionality. It is ideal for developers who:
RegexKitLite consists of only two files, the header file RegexKitLite.h and RegexKitLite.m. The only other requirement is to link with the ICU library that comes with Mac OS X. No new classes are created; all functionality is provided as a category extension to the NSString and NSMutableString classes.
The settings listed below are implemented using the C Preprocessor. Some of the settings are simple boolean enabled or disabled settings, while others specify a value, such as the number of cache slot entries. There are several ways to alter these settings, but if you are not familiar with this style of compile time configuration settings and how to alter them using the C Preprocessor, it is recommended that you use the default values provided.
Setting | Default | Description |
---|---|---|
NS_BLOCK_ASSERTIONS | n/a | RegexKitLite contains a number of extra run-time assertion checks that can be disabled with this flag. The standard NSException.h assertion macros are not used because of the multithreading lock. This flag is typically set for Release style builds where the additional error checking is no longer necessary. |
RKL_CACHE_SIZE | 23 | Controls the number of compiled regular expressions that are cached. This should always be a prime number to maximize the use of the available cache slots. |
RKL_FAST_MUTABLE_CHECK | Disabled | Enables the use of the undocumented, private Core Foundation __CFStringIsMutable() function to determine if the string to be searched is immutable. This can significantly increase the number of matches per second that can be performed on immutable strings since a number of mutation checks can be safely skipped. |
RKL_FIXED_LENGTH | 2048 | Sets the size of the fixed length UTF-16 conversion cache buffer. Strings that need to be converted to UTF-16 that are smaller than this size will use this buffer. Using a single fixed buffer for all small strings means less malloc() overhead, heap fragmentation, and reduces the chances of a memory leak occurring. |
RKL_STACK_LIMIT | 131072 | The maximum amount of stack space that will be used before switching to heap based allocations. This can be useful for multithreading programs where the stack size of secondary threads is much smaller than the main thread. |
RKL_METHOD_PREPEND | None | When set, this preprocessor define causes the RegexKitLite methods defined in RegexKitLite.h to have the value of RKL_METHOD_PREPEND prepended to them. For example, if RKL_METHOD_PREPEND is set to xyz_ (i.e., -DRKL_METHOD_PREPEND=xyz_), it would cause clearStringCache to become xyz_clearStringCache. |
RKL_REGISTER_FOR_IPHONE_LOWMEM_NOTIFICATIONS | Automatic | This preprocessor define controls whether or not extra code is included that attempts to automatically register with the NSNotificationCenter for the UIApplicationDidReceiveMemoryWarningNotification notification. This feature is automatically enabled if it can be determined at compile time that the iPhone is being targeted. This feature may be explicitly disabled under all circumstances by setting its value to 0. |
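The C sketch below shows the general shape of such a compile time setting and hints at why a prime RKL_CACHE_SIZE is recommended. It assumes, purely for illustration, that a cache slot is chosen as a hash value modulo the cache size; that detail, the MAX_SLOTS limit, and the helper distinct_slots are assumptions and are not taken from the RegexKitLite source.

```c
#include <stddef.h>

/* A default that can be overridden on the command line,
   for example with -DRKL_CACHE_SIZE=37. */
#ifndef RKL_CACHE_SIZE
#define RKL_CACHE_SIZE 23
#endif

#define MAX_SLOTS 256  /* upper bound used only by this sketch */

/* Hypothetical helper: counts how many different slots are used when
   `count` hash values that are multiples of `stride` are reduced
   modulo cache_size. Returns 0 if cache_size is out of range. */
size_t distinct_slots(size_t cache_size, unsigned long stride, size_t count)
{
    unsigned char used[MAX_SLOTS] = {0};
    size_t distinct = 0;
    if (cache_size == 0 || cache_size > MAX_SLOTS) return 0;
    for (size_t i = 0; i < count; i++) {
        size_t slot = (size_t)((i * stride) % cache_size);
        if (!used[slot]) {
            used[slot] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under that assumption, hash values that are all multiples of 4 reach every one of 23 slots, but only 6 of 24 slots, because 4 and 24 share a common factor.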
Setting RKL_FAST_MUTABLE_CHECK allows RegexKitLite to quickly check if a string to search is immutable or not. Every call to RegexKitLite requires checking a string's hash and length values to guard against a string mutating and using invalid cached data. If the same string is searched repeatedly and it is immutable, these checks aren't necessary since the string can never change while in use. While these checks are fairly quick, they can add approximately 15 to 20 percent of extra overhead, and not performing the checks is always faster.
Since checking a string's mutability requires calling an undocumented, private Core Foundation function, RegexKitLite takes extra precautions and does not use the function directly. Instead, an internal, local stub function is created and called to determine if a string is mutable. The first time this function is called, RegexKitLite uses dlsym() to look up the address of the __CFStringIsMutable() function. If the function is found, RegexKitLite will use it from that point on to determine if a string is immutable. However, if the function is not found, RegexKitLite has no way to determine if a string is mutable or not, so it assumes the worst case that all strings are potentially mutable. This means that the private Core Foundation __CFStringIsMutable() function can go away at any time and RegexKitLite will continue to work, although with slightly less performance.
This feature is disabled by default, but should be fairly safe to enable due to the extra precautions that are taken. If this feature is enabled and the __CFStringIsMutable() function is not found for some reason, RegexKitLite falls back to its default behavior which is the same as if this feature was not enabled.
The RKL_REGISTER_FOR_IPHONE_LOWMEM_NOTIFICATIONS preprocessor define controls whether or not extra code is compiled in that automatically registers for the iPhone UIKit UIApplicationDidReceiveMemoryWarningNotification notification. When enabled, an initialization function tagged with __attribute__((constructor)) is executed by the linker at load time which causes RegexKitLite to check if the low memory notification symbol is available. If the symbol is present then RegexKitLite registers to receive the notification. When the notification is received, RegexKitLite will automatically call clearStringCache to flush the caches and return the memory used to hold any cached data.
This feature is normally automatically enabled if it can be determined at compile time that the iPhone is being targeted. This feature is safe to enable even if the target is Mac OS X for the desktop. It can also be explicitly disabled, even when targeting the iPhone, by setting RKL_REGISTER_FOR_IPHONE_LOWMEM_NOTIFICATIONS to 0.
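The load time initialization mentioned above relies on __attribute__((constructor)), a GCC and Clang extension that runs a function before main(). The sketch below is generic C rather than RegexKitLite's code: register_cache_observer is a hypothetical function that merely sets a flag where real code would look up the notification symbol and register for it.

```c
/* Set by the constructor below before main() begins. */
static int cache_observer_registered = 0;

/* Functions tagged with this GCC / Clang attribute are called
   automatically when the program or library is loaded. */
__attribute__((constructor))
static void register_cache_observer(void)
{
    /* Real code would check for the notification symbol here. */
    cache_observer_registered = 1;
}

/* Reports whether the load time initialization has run. */
int is_cache_observer_registered(void)
{
    return cache_observer_registered;
}
```

Because the function runs on its own, nothing in the application needs to call a setup routine, which is what allows the registration to be automatic.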
This documentation is available in the Xcode DocSet format. To add this documentation to Xcode, open the Xcode documentation window. In the lower left hand corner of the documentation window, there should be a gear icon with a drop down menu indicator. Select it, choose the option for adding a new documentation feed, and enter the following URL: feed://regexkit.sourceforge.net/RegexKitLiteDocSets.atom
Once you have added the URL, a new group should appear, inside which will be the RegexKitLite documentation with a Get button. Click on the Get button and follow the prompts. Xcode will ask you to enter an administrator's password to install the documentation for the first time.
While RegexKitLite takes steps to ensure that the information it has cached is valid for the strings it searches, there exists the possibility that out of date cached information may be used when searching mutable strings. For each compiled regular expression, RegexKitLite caches the following information about the last NSString that was searched:
An ICU compiled regular expression must be "set" to the text to be searched. Before a compiled regular expression is used, the pointer to the string object to search, its hash, its length, and the pointer to the UTF-16 buffer are compared with the values that the compiled regular expression was last "set" to. If any of these values are different, the compiled regular expression is reset and "set" to the new string.
If a NSMutableString is mutated between two uses of the same compiled regular expression and its hash, length, or UTF-16 buffer changes between uses, RegexKitLite will automatically reset the compiled regular expression with the new values of the mutated string. The results returned will correctly reflect the mutations that have taken place between searches.
It is possible that the mutations to a string can go undetected, however. If the mutation keeps the length the same, then the only way a change can be detected is if the string's hash value changes. For most mutations the hash value will change, but it is possible for two different strings to share the same hash. This is known as a hash collision. Should this happen, the results returned by RegexKitLite may not be correct.
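The comparison described above can be pictured with a small C sketch. The struct and function names are hypothetical and the real RegexKitLite data structures may differ; the point is only that a mutation is noticed when at least one of the four remembered values changes.

```c
#include <stddef.h>

/* Hypothetical record of what a compiled regular expression was last "set" to. */
typedef struct {
    const void   *string;        /* pointer to the string object */
    unsigned long hash;          /* the string's hash value */
    size_t        length;        /* the string's length */
    const void   *utf16_buffer;  /* pointer to the UTF-16 buffer */
} last_set_info;

/* Returns 1 if the compiled regular expression must be reset and "set"
   again, 0 if the cached state appears to still be valid. */
int needs_reset(const last_set_info *cached, const last_set_info *current)
{
    return cached->string       != current->string
        || cached->hash         != current->hash
        || cached->length       != current->length
        || cached->utf16_buffer != current->utf16_buffer;
}
```

A mutation that leaves the length, the buffer pointer, and, through a hash collision, the hash unchanged makes needs_reset return 0 even though the text differs, which is exactly the undetected case.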
Therefore, if you are using RegexKitLite to search NSMutableString objects, and those strings may have mutated in such a way that RegexKitLite is unable to detect that the string has changed, you must manually clear the internal cache to ensure that the results accurately reflect the mutations. You can clear the cache by calling the following class method:
Methods will raise an exception if their arguments are invalid, such as passing NULL for a required parameter. An invalid regular expression or RKLRegexOptions parameter will not raise an exception. Instead, a NSError object with information about the error will be created and returned via the address given with the optional error argument. If information about the problem is not required, error may be NULL. For convenience methods that do not have an error argument, the primary method is invoked with NULL passed as the argument for error.
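A minimal sketch of the error convention described above, assuming RegexKitLite.h has been added to the project and the program is linked against libicucore. The helper name `FirstMatchOrLog()` is hypothetical, and `RKLNoOptions` is assumed to be the "no options" value declared in RegexKitLite.h.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: returns the first match of regex in text.
// An invalid regex does not raise an exception; it is reported through the NSError.
static NSString *FirstMatchOrLog(NSString *text, NSString *regex) {
  NSError  *error = NULL;
  NSRange   all   = NSMakeRange(0, [text length]);
  NSString *match = [text stringByMatching:regex options:RKLNoOptions inRange:all capture:0 error:&error];
  if (error != NULL) { NSLog(@"Problem with regex '%@': %@", regex, error); }
  return match; // NULL both for "no match" and for an error; the log line tells them apart
}
```

Passing `NULL` instead of `&error` is equivalent to what the convenience methods do when they invoke the primary method.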
This method should be used when performing searches on NSMutableString objects and there is the possibility that the string has mutated in between calls to RegexKitLite.
An example of clearing the cache:
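A minimal sketch, assuming the class method in question is `clearStringCache` as declared in RegexKitLite.h. The helper `FirstWordAfterMutation()` is a hypothetical name used only for illustration.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: search a mutable string that may have been mutated
// since the last time it was searched with the same regular expression.
static NSString *FirstWordAfterMutation(NSMutableString *text) {
  [NSString clearStringCache]; // discard the cached string information first
  return [text stringByMatching:@"\\w+"];
}
```

Clearing the cache costs some performance, so a call like this is probably best reserved for code paths where an undetected mutation is a real possibility.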
Since the capture count of a regular expression does not depend on the string to be searched, this is a NSString class method. For example:
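A short sketch of such a call, assuming RegexKitLite.h is part of the project. The wrapper `CaptureCount()` is a hypothetical name that exists only to keep the example self-contained.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical wrapper: the message is sent to the NSString class, not to an instance.
static NSInteger CaptureCount(NSString *regex) {
  return [NSString captureCountForRegex:regex];
}

// CaptureCount(@"(\\d+)\\.(\\d+)") is 2: the pattern has two capture groups.
// CaptureCount(@"\\d+")            is 0: the pattern has no capture groups.
```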
Returns -1 if an error occurs. Otherwise the number of captures in regex is returned, or 0 if regex does not contain any captures.
The optional error parameter, if set and an error occurs, will contain a NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Since the capture count of a regular expression does not depend on the string to be searched, this is a NSString class method. For example:
Returns -1 if an error occurs. Otherwise the number of captures in regex is returned, or 0 if regex does not contain any captures.
The substrings in the array appear in the order they did in the receiver. For example, this code fragment:
produces an array { @"Norman", @"Stanley", @"Fletcher" }.
If the receiver begins or ends with regex, then the first or last substring is, respectively, empty. For example, the string ", Norman, Stanley, Fletcher" creates an array that has these contents: { @"", @"Norman", @"Stanley", @"Fletcher" }.
If the receiver has no separators that are matched by regex—for example, "Norman"—the array contains the string itself, in this case { @"Norman" }.
If regex contains capture groups—for example, @",(\\s*)"—the array will contain the text matched by each capture group as a separate element appended to the normal result. An additional element will be created for each capture group. If an individual capture group does not match any text the result in the array will be a zero length string—@"". As an example—the regular expression @",(\\s*)" would produce the array { @"Norman", @" ", @"Stanley", @" ", @"Fletcher" }.
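The splitting behavior described above can be sketched as follows, assuming the method is `componentsSeparatedByRegex:` and that the separator pattern is `@",\\s*"`. Both helper function names are hypothetical.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: split on a comma followed by optional white space.
static NSArray *SplitOnCommas(NSString *list) {
  return [list componentsSeparatedByRegex:@",\\s*"];
}

// Hypothetical helper: the same split, but the white space is captured,
// so the text matched by the capture group also appears in the result.
static NSArray *SplitKeepingWhiteSpace(NSString *list) {
  return [list componentsSeparatedByRegex:@",(\\s*)"];
}
```

Following the description above, `SplitOnCommas(@"Norman, Stanley, Fletcher")` has three elements, `SplitOnCommas(@", Norman, Stanley, Fletcher")` has a leading empty string, and `SplitKeepingWhiteSpace(@"Norman, Stanley, Fletcher")` has five elements.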
The optional error parameter, if set and an error occurs, will contain a NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
A NSRange structure giving the location and length of the first match of regex in the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regex or an error occurs.
A NSRange structure giving the location and length of capture number capture for the first match of regex in the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regex or an error occurs.
A NSRange structure giving the location and length of the first match of regex within range of the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regex within range or an error occurs.
A NSRange structure giving the location and length of capture number capture for the first match of regex within range of the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regex within range or an error occurs.
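A brief sketch of the range-returning methods, assuming the `rangeOfRegex:capture:` form declared in RegexKitLite.h. The helper `PriceRange()` and its pattern are hypothetical.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: range of a dollar amount such as "$42".
// Capture 0 is the entire match; capture 1 is just the digits.
static NSRange PriceRange(NSString *text, NSInteger capture) {
  return [text rangeOfRegex:@"\\$(\\d+)" capture:capture];
}
```

For the string `@"Price: $42"`, capture 0 is the range {7, 3} and capture 1 is the range {8, 2}. When nothing matches, the location of the returned range is NSNotFound, so callers should test for that before using the range.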
This method modifies the receiver's contents. An exception will be raised if it is sent to an immutable object.
This method modifies the receiver's contents. An exception will be raised if it is sent to an immutable object.
This method modifies the receiver's contents. An exception will be raised if it is sent to an immutable object.
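A minimal sketch of an in-place replacement, assuming the NSMutableString method `replaceOccurrencesOfRegex:withString:` and the `$n` substitution syntax. The helper `SwapNameOrder()` is hypothetical. It works on a mutable copy, so an immutable argument is never sent the mutating message.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: turn "Last, First" into "First Last".
static NSMutableString *SwapNameOrder(NSString *name) {
  NSMutableString *result = [NSMutableString stringWithString:name]; // mutable copy of the argument
  [result replaceOccurrencesOfRegex:@"(\\w+), (\\w+)" withString:@"$2 $1"]; // modifies result in place
  return result;
}
```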
A NSString containing the substring of the receiver matched by regex. Returns NULL if the receiver is not matched by regex or an error occurs.
A NSString containing the substring of the receiver matched by capture number capture of regex. Returns NULL if the receiver is not matched by regex or an error occurs.
A NSString containing the substring of the receiver matched by regex within range of the receiver. Returns NULL if the receiver is not matched by regex within range or an error occurs.
A NSString containing the substring of the receiver matched by capture number capture of regex within range of the receiver. Returns NULL if the receiver is not matched by regex within range or an error occurs.
A NSString containing the substring of the receiver matched by regex. Returns NULL if the receiver is not matched by regex or an error occurs.
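A short sketch of extracting a capture as a new string, assuming the `stringByMatching:capture:` form declared in RegexKitLite.h. The helper `HostFromURL()` and its pattern are hypothetical and are not a complete URL parser.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: capture 1 is the host part that follows "scheme://".
// Returns NULL if the string does not look like a URL.
static NSString *HostFromURL(NSString *url) {
  return [url stringByMatching:@"^\\w+://([^/:]+)" capture:1];
}
```

For example, `HostFromURL(@"feed://regexkit.sourceforge.net/RegexKitLiteDocSets.atom")` yields `@"regexkit.sourceforge.net"`.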
See Regular Expression Options for possible values.
The behavior of a regular expression pattern can be controlled in two ways. When the method supports it, options may be specified by combining RKLRegexOptions flags with the C bitwise OR operator. For example:
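A minimal sketch of combining flags, assuming the `isMatchedByRegex:options:inRange:error:` method declared in RegexKitLite.h. The helper `MatchesWholeLineCaselessly()` is a hypothetical name.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: case-insensitive match in which ^ and $ also match at line boundaries.
static BOOL MatchesWholeLineCaselessly(NSString *text, NSString *regex) {
  RKLRegexOptions options = (RKLCaseless | RKLMultiline); // flags combined with bitwise OR
  return [text isMatchedByRegex:regex options:options inRange:NSMakeRange(0, [text length]) error:NULL];
}
```

With these options, the pattern `@"^second$"` matches the second line of `@"first\nSECOND"`.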
The other way is to specify the options within the regular expression itself, which can be done in two ways. The first specifies the options for everything following it, and the other sets the options for a single parenthesized group. Options are either enabled or, following a -, disabled. The syntax for both is nearly identical:
Option | Example | Description |
---|---|---|
(?ixsmw-ixsmw)… | (?i)… | Enables the RKLCaseless option for everything that follows it. Useful at the beginning of a regular expression to set the desired options. |
(?ixsmw-ixsmw:…) | (?iw-m:…) | Enables the RKLCaseless and RKLUnicodeWordBoundaries options and disables RKLMultiline for the group enclosed by the parentheses. |
The following table lists the regular expression pattern option character and its corresponding RKLRegexOptions flag:
Character | Option |
---|---|
i | RKLCaseless |
x | RKLComments |
s | RKLDotAll |
m | RKLMultiline |
w | RKLUnicodeWordBoundaries |
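The in-pattern option syntax can be sketched as follows, assuming the `isMatchedByRegex:` convenience method. Both helper names and patterns are hypothetical.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: (?im) enables the i and m options for everything that follows it.
static BOOL HasSecondLine(NSString *text) {
  return [text isMatchedByRegex:@"(?im)^second$"];
}

// Hypothetical helper: (?i:...) enables the i option only inside the parentheses,
// so "World" must still match with exactly this case.
static BOOL IsGreeting(NSString *text) {
  return [text isMatchedByRegex:@"(?i:hello) World"];
}
```

`HasSecondLine(@"first\nSECOND")` is YES. `IsGreeting(@"HELLO World")` is YES, while `IsGreeting(@"HELLO WORLD")` is NO because the option does not extend past the closing parenthesis.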
Returns a user info dictionary populated with keys as defined in RegexKitLite NSError and NSException User Info Dictionary Keys.
The RKLICURegexLineErrorKey, RKLICURegexOffsetErrorKey, RKLICURegexPreContextErrorKey, and RKLICURegexPostContextErrorKey error keys may not be present for all errors. For example, errors returned by passing invalid RKLRegexOptions flags will not have the listed keys set.
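Because these keys are optional, code that reads them should tolerate a missing value. A minimal sketch, assuming `captureCountForRegex:options:error:` and the `RKLNoOptions` value from RegexKitLite.h. The helper `ErrorOffsetForRegex()` is hypothetical, and it returns `id` because the exact class of the stored value is not stated here.

```objc
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

// Hypothetical helper: returns the value stored under RKLICURegexOffsetErrorKey,
// or nil if regex is valid or the error does not carry that key.
static id ErrorOffsetForRegex(NSString *regex) {
  NSError *error = NULL;
  [NSString captureCountForRegex:regex options:RKLNoOptions error:&error];
  return [[error userInfo] objectForKey:RKLICURegexOffsetErrorKey]; // messaging nil yields nil
}
```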
Initial release.
Changes:
Bug fixes:
Changes:
New features:
Changes:
New NSString Methods:
New NSMutableString Methods:
Bug fixes:
This release contains several large documentation additions and a few bug fixes. No new major functionality was added.
Documentation Changes:
Changes:
Bug fixes:
Fixed a bug in stringByReplacingOccurrencesOfRegex:withString: and replaceOccurrencesOfRegex:withString: where if the receiver was an empty string (i.e., @""), then RegexKitLite would throw an exception because the ICU library reported an error. This turned out to be a bug in the ICU library itself in the uregex_reset() function (and the methods it calls to perform its work). A bug was opened with the ICU project: http://bugs.icu-project.org/trac/ticket/6545. A work-around for the buggy behavior was put in place so that if the ICU library reports a U_INDEX_OUTOFBOUNDS_ERROR error for a string with a length of zero, that error is ignored since it is spurious. Thanks go to Andy Kim for reporting this.
Fixed a bug with NSScannedOption when targeting the iPhone. NSScannedOption is not available on the iPhone, so C Pre-Processor statements were added to ensure that NSScannedOption is not referenced when Objective-C Garbage Collection is not enabled. Thanks go to Shaun Inman for reporting this first.
One noticeable break in style conventions is in line lengths. There was a time when 80 column limits made a lot of sense as it was the lowest common denominator. Today, a modern computer screen can display much more than just 80 columns. Even an iPhone, which has a screen size of 320x480, can display 96 columns by 24 rows of the usual Terminal.app Monaco 10pt font (5x13 pixels) in landscape mode. Because of this, my personal style is not to have an arbitrary limit on line lengths. This allows for much more code to fit on the screen at once, which I've heard referred to as "Man—Machine Interface Bandwidth". While you can always page up and down, the simple movement of your eye is almost always an order of magnitude faster. Paging through code also tends to break your concentration as you briefly try to mentally orientate yourself with the freshly displayed text and where the section of code is that you're looking for.
I try to group a line around relevancy so that based on the start of the line you can quickly determine if the rest of the line is applicable. Clearly the number of spaces used to indent a block plays a similar role: it allows you to quickly visually establish the logical boundaries of what lines of code are applicable. I also try to horizontally align related statements since your eye tends to be extremely sensitive to such visual patterns. For example, in variable declaration and initialization, I try to align the type declaration and the = (equal sign) across multiple lines. This tends to cause the declaration type, the variable name, and the value assigned to visually pop. Without the alignment, you typically have to scan back and forth along a line to separate and find a variable name and its initialization value. Sometimes line breaks and horizontal alignment are done purely on what's subjectively aesthetically pleasing and allows the eye to quickly flow over the code.
The source code of RegexKitLite isn't exactly what you'd call clean; there are more than a few crufty C barnacles in there. There is usually a choice between two polar opposites and in this case it's between elegant, easy to maintain and comprehend code, and speed. If you use regular expressions for very long, you will undoubtedly encounter a situation where you need to scan through tens of megabytes of text and the speed of your regular expression matching loop needs to be faster. A lot faster. RegexKitLite was written to go fast, and the source code style reflects this choice.
A significant amount of time was spent using Shark to optimize the critical sections of RegexKitLite. This included tweaking even the most insignificant details, such as the order of boolean expressions in if() statements to minimize the number of branches that would have to be evaluated to determine if the statement is true or false. Wherever possible, Core Foundation is used directly. This avoids the overhead of an Objective-C message dispatch, which would invariably end up calling the exact same Core Foundation function anyways.
Even the cache has what is essentially a cache: the last regular expression used. Each time a regular expression is used, the compiled ICU regular expression must be retrieved from the cache, or if it does not exist in the cache, instantiated. Checking the cache involves calculating the remainder of the regular expression string's hash modulo the cache size prime, which is a moderately expensive division and multiplication operation. However, checking if the regular expression being retrieved this time is exactly the same as the last regular expression retrieved is just a fast and simple comparison check. As it turns out, this is very often the case. Even the functions are arranged in such a way that the compiler will often inline everything into one large aggregate function, eliminating the overhead of a function call in many places.
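The "cache in front of the cache" idea can be sketched as follows. This is an illustration only, not RegexKitLite's source: the slot layout, the prime 23, and every name are assumptions, and the sketch compares hashes alone where a real implementation would also need to compare the regular expression strings.

```objc
#import <Foundation/Foundation.h>

#define kCacheSize 23U // hypothetical prime number of cache slots

typedef struct {
  NSUInteger regexHash;  // hash of the regular expression string
  int        compiledID; // stand-in for a compiled ICU regex; 0 means the slot is empty
} CacheSlot;

static CacheSlot  cache[kCacheSize];
static CacheSlot *lastUsed     = NULL;
static int        compileCount = 0; // counts how often a "compile" was needed

static CacheSlot *LookupRegex(NSUInteger regexHash) {
  // Fast path: the same regex as last time costs a single comparison.
  if ((lastUsed != NULL) && (lastUsed->regexHash == regexHash)) { return lastUsed; }
  // Slow path: the hash modulo the prime cache size selects the slot.
  CacheSlot *slot = &cache[regexHash % kCacheSize];
  if ((slot->compiledID == 0) || (slot->regexHash != regexHash)) {
    slot->regexHash  = regexHash;      // miss: replace whatever occupied the slot
    slot->compiledID = ++compileCount; // stands in for compiling the regex
  }
  lastUsed = slot;
  return slot;
}
```

Repeated lookups of the same hash never reach the modulo operation, which is the saving the paragraph above describes.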
Normally, this kind of micro-optimization is completely unjustified. A rough rule of thumb is that 80% of your program's execution time is spent in 20% of your program's code. The only code worth optimizing is the 20% that is heavily executed. If your program makes heavy use of regular expressions, such as a loop that scans megabytes of text using regular expressions, RegexKitLite is almost guaranteed to be a part of the 20% of code where most of the execution time is spent. The release notes for CotEditor would seem to indicate that all this effort has paid off for at least one user:
More than 10.4 when running on the color process to review the details of the definition of a regular expression search RegexKitLite adopted, 0.9.3, as compared to the speed of color from 1.5 to 4 times as much improved.
Clearly documentation has been a high priority for this project. Documentation is always hard to write and good documentation is exponentially harder still. The vast majority of 'development effort and time' is spent on the documentation. It's always hard to judge the quality and effectiveness of something you wrote, so hopefully the extra effort is worth it and appreciated.
As an aside and a small rant, I have no idea how anyone manages to build so-called 'web applications'. I waste a truly unbelievable amount of time trying to accomplish the simplest of things in HTML, which is then multiplied when I check for 'compatibility' with different browsers and their various popular versions. What a joke, and that's just for this 'simple' documentation. Though I do have to give kudos to the Safari / WebKit guys, it always seems to be a lot easier to get the result you're looking for with the WebKit engine. The little things add so much: shadows, round rects, gradients, CSS animations, the canvas element, etc.
It's a well known license. If you are part of a Large Corporate Organization, chances are the Corporate Lawyers have Decided whether or not the use of source code licensed under the BSD License is Acceptable or not. This can be a godsend for anyone who has to deal with such situations.
It also expresses a few things I think are perfectly reasonable:
The first point is prescribed by most professional ethics already. Plus, it's always nice to see where your stuff ends up and how it's being used. The other points make explicit what should already be obvious. After all, you get what you pay for.
RegexKitLite is distributed under the terms of the BSD License, as specified below.
Copyright © 2008, John Engelhart
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.