REGCMP(3x,L)               AIX Technical Reference               REGCMP(3x,L)
-------------------------------------------------------------------------------
regcmp, regex

PURPOSE
     Compiles and matches regular-expression patterns.

LIBRARY
     Programmers Workbench Library (libPW.a)

SYNTAX
     char *regcmp (str [, str,...], (char *) 0)
     char *str, *str,...;

     char *regex (pat, subject [, ret,...])
     char *pat, *subject, *ret,...;

     extern char *__loc1;

DESCRIPTION
     The regcmp subroutine compiles a regular expression (or pattern) and
     returns a pointer to the compiled form. The str parameters specify the
     pattern to be compiled. If more than one str parameter is given, then
     regcmp treats them as if they were concatenated together. It returns a
     NULL pointer if it encounters an incorrect parameter.

     You can use the regcmp command to compile regular expressions into your
     C program, frequently eliminating the need to call the regcmp
     subroutine at run time.

     The regex subroutine compares a compiled pattern to the subject string.
     Additional parameters are used to receive values. Upon successful
     completion, the regex subroutine returns a pointer to the next
     unmatched character. If the regex subroutine fails, a NULL pointer is
     returned. A global character pointer, __loc1, points to where the match
     began.

     The regcmp and regex subroutines are borrowed from the ed command;
     however, the syntax and semantics have been changed slightly. You can
     use the following symbols with the regcmp and regex subroutines:

     [ ] * . ^
          These symbols have the same meaning as they do in the ed command.

     -
          For regex, the minus within brackets means "through" according to
          the current collating sequence. For example, depending on the
          default collating sequence, "[a-z]" can be equivalent to
          "[abcd...xyz]" or "[aBbCc...xYyZz]". You can use the "-" by itself
          if the "-" is the last or first character.
          For example, the character class expression "[]-]" matches the "]"
          (right bracket) and "-" (minus) characters.

     $
          Matches the end of the string. Use "\n" to match a new-line
          character.

     +
          A regular expression followed by "+" means one or more times. For
          example, "[0-9]+" is equivalent to "[0-9][0-9]*".

     {m} {m,} {m,u}
          Integer values enclosed in "{" "}" indicate the number of times to
          apply the preceding regular expression. m is the minimum number
          and u is the maximum number. u must be less than 256. If you
          specify only m, it indicates the exact number of times to apply
          the regular expression. "{m,}" is equivalent to "{m,infinity}" and
          matches m or more occurrences of the expression. The "+" (plus)
          and "*" (asterisk) operations are equivalent to "{1,}" and "{0,}",
          respectively.

     (...)$n
          This stores the value matched by the enclosed regular expression
          in the (n+1)th ret parameter. Ten enclosed regular expressions are
          allowed. regex makes the assignments unconditionally.

     (...)
          Parentheses group subexpressions. An operator, such as "*", "+",
          or "{" "}", works on a single character or on a regular expression
          enclosed in parentheses. For example, "(a*(cb+)*)$0".

     All of the above defined symbols are special. You must precede them
     with a "\" (backslash) if you want to match the special symbol itself.
     For example, "\$" matches a dollar sign.

     The following special symbols are defined for internationalized regular
     expressions. Each is valid only within a range expression (that is,
     between brackets).

     [:alnum:]    Matches any alphanumeric character, as defined by the
                  NLctype.h macro iswalnum.
     [:alpha:]    Matches any alphabetic character, like iswalpha.
     [:digit:]    Matches any digit, like iswdigit.
     [:lower:]    Matches any lowercase letter, like iswlower.
     [:print:]    Matches any printable character, like iswprint.
     [:punct:]    Matches any punctuation character, like iswpunct.
     [:space:]    Matches any white-space character, like iswspace.
     [:upper:]    Matches any uppercase letter, like iswupper.
     [:xdigit:]   Matches any hexadecimal digit, like iswxdigit.
     [=X=]        Matches any character in the same equivalence class as X,
                  as defined by wceqvmap.
     [.XY.]       Matches the multiple-character collating sequence XY as a
                  single character (as defined by _wcxcol). For example,
                  some Latin languages collate the sequence "ch" as a single
                  character that falls between the letters c and d. The
                  regular expression "[c[.ch.]d]amp" would match the words
                  "camp", "champ", and "damp".

     The ctype sequences, such as "[:alpha:]", cannot be used as end points
     of a range.

     Note:  regcmp uses the malloc subroutine to allocate the space for the
            vector that holds the compiled pattern. Always free vectors
            that are no longer required; otherwise, you may run out of
            memory if regcmp is called repeatedly. Use the following as a
            replacement for malloc to reuse the same vector, thus saving
            time and space:

               /* ...Your Program... */

               char *
               malloc(n)
               int n;
               {
                    static int rebuf[256];

                    /* Hand out the same static buffer on every call;
                       fail if the request does not fit in it. */
                    return ((n <= sizeof(rebuf)) ? (char *)rebuf : NULL);
               }

EXAMPLES
     1.  To perform a simple match:

            char *cursor, *newcursor, *ptr;
            ...
            newcursor = regex((ptr = regcmp("^\n", (char *)0)), cursor);
            free(ptr);

         This matches a leading new-line character in the subject string
         pointed to by "cursor".

     2.  To extract a substring that matches a pattern:

            char ret0[9];
            char *newcursor, *name;
            ...
            name = regcmp("([A-Za-z][A-Za-z0-9]{0,7})$0", (char *)0);
            newcursor = regex(name, "123Testing321", ret0);

         This matches the eight-character identifier "Testing3" and returns
         the address of the character after the last matched character
         (which is stored in "newcursor"). The string "Testing3" is copied
         into the character array "ret0".

RELATED INFORMATION
     In this book: "malloc, free, realloc, calloc, valloc, alloca, mallopt,
     mallinfo," "NCcollate, NCcoluniq, NCeqvmap, _NCxcol, _NLxcol,"
     "wc_collate, wc_coluniq, wc_eqvmap, _wcxcol, _mbxcol, _wcxcolu,
     _mbxcolu," "setlocale," and "regexp: compile, step, advance."

     The ed and regcmp commands in AIX Operating System Commands Reference.

     "Introduction to International Character Support" in Managing the AIX
     Operating System.

     AIX Guide to Multibyte Character Set (MBCS) Support.

Processed July 12, 1991                                          REGCMP(3x,L)