REGEXP(3x,L)               AIX Technical Reference               REGEXP(3x,L)
-------------------------------------------------------------------------------

regexp: compile, step, advance

PURPOSE

Compiles and matches regular-expression patterns.

LIBRARY

None

SYNTAX

     #define INIT declarations
     #define GETC( ) getc_code
     #define PEEKC( ) peekc_code
     #define UNGETC(c) ungetc_code
     #define RETURN(pointer) return_code
     #define ERROR(val) error_code

     #include <regexp.h>

     char *compile (instring, expbuf, endbuf, eof)
     char *instring, *expbuf, *endbuf;
     int eof;

     int step (string, expbuf)
     char *string, *expbuf;

     int advance (string, expbuf)
     char *string, *expbuf;

DESCRIPTION

The regexp.h header file defines several general purpose subroutines
that perform regular-expression pattern matching. Programs that perform
regular-expression pattern matching, such as ed, sed, grep, bs, and
expr, use this source file. In this way, only this file needs to be
changed to maintain regular-expression compatibility between programs.

The NLregexp.h functions compile, step, and advance operate on file code
strings.

The following macros must be defined by the programmer prior to
including NLregexp.h.

INIT
     This macro is used for dependent declarations and initializations.
     It is placed right after the declaration and opening "{" (left
     brace) of the compile subroutine. The definition of INIT must end
     with a ";" (semicolon). INIT is frequently used to set a register
     variable to point to the beginning of the regular expression so
     that this register variable can be used in the declarations for
     GETC, PEEKC, and UNGETC. Otherwise, you can use INIT to declare
     external variables that GETC, PEEKC, and UNGETC need.

     #define INIT  register char *sp = instring; \
                   int sp_len; \
                   mbchar_t sp_peekc;

GETC( )
     This macro returns the value of the next character (as an
     mbchar_t) in the regular expression pattern. Successive calls to
     the GETC macro should return successive characters of the pattern.

     #define GETC()  (PEEKC(), sp += sp_len, sp_peekc)

PEEKC( )
     This macro returns the next character (as an mbchar_t) in the
     regular expression. Successive calls to the PEEKC macro should
     return the same character, which should also be the next character
     returned by the GETC macro. The special value ERR should be
     returned if there is an error in the character.

     #define PEEKC()  ((-1 == (sp_len = mbstomb (&sp_peekc, sp, MB_LEN_MAX))) \
                          ? (sp_peekc = ERR) \
                          : sp_peekc)

UNGETC(c)
     This macro causes the parameter c to be returned by the next call
     to the GETC and PEEKC macros. No more than one character of
     pushback is ever needed, and this character is guaranteed to be
     the last character read by the GETC macro. The return value of the
     UNGETC macro is always ignored.

     #define UNGETC(c)  (sp -= sp_len)

RETURN(pointer)
     This macro is used on normal exit of the compile subroutine. The
     pointer parameter points to the first character immediately
     following the compiled regular expression. This is useful to
     programs that have memory allocation to manage.

     #define RETURN(p)  return

ERROR(val)
     This macro is used on abnormal exit from the compile subroutine.
     It should never contain a return statement. The val parameter is
     an error number.

     #define ERROR(c)  regerr (c)

     The error values and their meanings are:

     Error Name    Value   Meaning

     BIG_RANGE     11      Range endpoint too large.
     BAD_NUM       16      Bad number.
     BAD_BACK      25      "\" digit out of range.
     BAD_DELIM     36      Illegal or missing delimiter.
     NO_SAVED      41      No remembered search string.
     BAD_LEFTP     42      "\(\)" imbalance.
     BAD_RIGHTP    43      Too many "\(".
     EX_COMMA      44      More than two numbers given in \{ \}.
     NO_CLOSE      45      "}" expected after "\".
     MAX_MIN       46      First number exceeds second in \{ \}.
     BAD_BRAK      49      "[ ]" imbalance.
     TOO_BIG       50      Regular expression overflow.
     STACK_EMPTY   51      Backtrack stack empty.
     STACK_FULL    52      Backtrack stack full.
     BAD_CHAR      60      Strange multibyte character.
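
The error table lends itself to a small lookup that a program's error
handler could consult. The following is only a sketch: the structure
regexp_error, the array regexp_errors, and the functions
find_regexp_error and report_regexp_error are hypothetical names chosen
for illustration and are not part of regexp.h. Only the error names,
values, and meanings are taken from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the error table. The struct itself is hypothetical;
   only the names, values, and meanings come from the table. */
struct regexp_error {
    int value;
    const char *name;
    const char *meaning;
};

static const struct regexp_error regexp_errors[] = {
    { 11, "BIG_RANGE",   "Range endpoint too large." },
    { 16, "BAD_NUM",     "Bad number." },
    { 25, "BAD_BACK",    "\"\\\" digit out of range." },
    { 36, "BAD_DELIM",   "Illegal or missing delimiter." },
    { 41, "NO_SAVED",    "No remembered search string." },
    { 42, "BAD_LEFTP",   "\"\\(\\)\" imbalance." },
    { 43, "BAD_RIGHTP",  "Too many \"\\(\"." },
    { 44, "EX_COMMA",    "More than two numbers given in \\{ \\}." },
    { 45, "NO_CLOSE",    "\"}\" expected after \"\\\"." },
    { 46, "MAX_MIN",     "First number exceeds second in \\{ \\}." },
    { 49, "BAD_BRAK",    "\"[ ]\" imbalance." },
    { 50, "TOO_BIG",     "Regular expression overflow." },
    { 51, "STACK_EMPTY", "Backtrack stack empty." },
    { 52, "STACK_FULL",  "Backtrack stack full." },
    { 60, "BAD_CHAR",    "Strange multibyte character." }
};

/* Hypothetical helper: returns the table row for val, or NULL if
   val is not one of the listed error values. */
static const struct regexp_error *find_regexp_error(int val)
{
    size_t i;
    size_t n = sizeof regexp_errors / sizeof regexp_errors[0];

    for (i = 0; i < n; i++)
        if (regexp_errors[i].value == val)
            return &regexp_errors[i];
    return NULL;
}

/* Hypothetical helper: writes a one-line diagnostic to stream. */
static void report_regexp_error(FILE *stream, int val)
{
    const struct regexp_error *e = find_regexp_error(val);

    assert(stream != NULL);
    if (e != NULL)
        fprintf(stream, "regexp: %s (%d): %s\n", e->name, e->value, e->meaning);
    else
        fprintf(stream, "regexp: unknown error %d\n", val);
}
```

A handler invoked through the ERROR macro might call
report_regexp_error (stderr, val) before abandoning the pattern; how it
leaves the compile subroutine afterward is up to the program.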

The compile subroutine compiles the regular expression for later use.
The instring parameter is never used explicitly by the compile
subroutine, but you can use it in your macros. For instance, you may
want to pass the string containing the pattern as the instring
parameter to compile and use the INIT macro to set a pointer to the
beginning of this string. (The example in the EXAMPLE section uses this
technique.) If your macros do not use instring, then call compile with
a value of ((char *) 0) for this parameter.

The expbuf parameter points to a character array where the compiled
regular expression is to be placed. The endbuf parameter points to the
location that immediately follows the character array where the
compiled regular expression is to be placed. If the compiled expression
cannot fit in (endbuf-expbuf) bytes, the call ERROR(50) is made.

The eof parameter is the character that marks the end of the regular
expression. For example, in ed this character is usually "/" (slash).

The regexp.h header file defines other subroutines that perform actual
regular-expression pattern matching. One of these is the step
subroutine. The string parameter of step is a pointer to a
null-terminated string of characters to be checked for a match. The
expbuf parameter points to the compiled regular expression, which was
obtained by a call to the compile subroutine.

The step subroutine returns the value 1 if the given string matches the
pattern, and 0 if it does not match. If it matches, then step also sets
two global character pointers: loc1, which points to the first
character that matches the pattern, and loc2, which points to the
character immediately following the last character that matches the
pattern. Thus, if the regular expression matches the entire string,
then loc1 points to the first character of string and loc2 points to
the null character at the end of string.
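
The way a caller can use loc1 and loc2 after a successful match can be
sketched as follows. Because regexp.h is not available on every system,
this sketch does not call the real step subroutine. Instead it assumes
hypothetical stand-ins, toy_step and toy_advance, that match only
literal text but set loc1 and loc2 in the manner described above. The
copy_match helper is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the global pointers that step is described as setting. */
static char *loc1;   /* first character of the match */
static char *loc2;   /* character immediately following the match */

/* Hypothetical stand-in for advance: succeeds only if the literal
   text pat occurs exactly at s, and then records the end in loc2. */
static int toy_advance(char *s, const char *pat)
{
    size_t n = strlen(pat);

    if (strncmp(s, pat, n) != 0)
        return 0;
    loc2 = s + n;
    return 1;
}

/* Hypothetical stand-in for step: tries each starting position in
   string until toy_advance succeeds or the end is reached. */
static int toy_step(char *string, const char *pat)
{
    char *p = string;

    for (;;) {
        if (toy_advance(p, pat)) {
            loc1 = p;
            return 1;
        }
        if (*p == '\0')
            return 0;
        p++;
    }
}

/* Copies the matched text into out (truncating if necessary) and
   returns the number of bytes copied, or -1 if there is no match. */
static long copy_match(char *line, const char *pat, char *out, size_t outsize)
{
    size_t len;

    assert(out != NULL && outsize > 0);
    if (!toy_step(line, pat))
        return -1;
    len = (size_t)(loc2 - loc1);   /* length of the matched text */
    if (len >= outsize)
        len = outsize - 1;
    memcpy(out, loc1, len);
    out[len] = '\0';
    return (long)len;
}
```

In this sketch, loc2 - loc1 is the length of the matched text, and a
match that covers the whole string leaves loc2 pointing at the
terminating null character.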

The step subroutine uses the global variable circf, which is set by
compile if the regular expression begins with a "^" (circumflex). If
this variable is set, then step only tries to match the regular
expression to the beginning of the string. If you compile more than one
regular expression before executing the first one, then save the value
of circf for each compiled expression and set circf to that saved value
before each call to step.

The step subroutine calls a subroutine named advance with the same
parameters that it was passed. The step subroutine moves through the
string parameter one position at a time and calls advance at each
position until advance returns a 1, indicating a match, or until the
end of string is reached. To constrain the match to the beginning of
the string in all cases, call the advance subroutine directly instead
of calling step.

When advance encounters an "*" (asterisk) or a "\{ \}" sequence in the
regular expression, it advances its pointer to the string to be matched
as far as possible and recursively calls itself, trying to match the
rest of the string to the rest of the regular expression. As long as
there is no match, advance backs up along the string until it finds a
match or reaches the point in the string that initially matched the "*"
or "\{ \}".

It is sometimes desirable to stop this backing-up before the initial
point in the string is reached. If, at any time during the backing-up
process, the global character pointer locs is equal to the current
point in the string, advance breaks out of the loop that backs up and
returns 0. This is used by ed and sed for global substitutions on the
whole line so that expressions like "s/y*//g" do not loop forever.

EXAMPLE

The following is an example of the regular expression macros and calls
from the grep command.

     #define INIT       register char *sp=instring;
     #define GETC()     (*sp++)
     #define PEEKC()    (*sp)
     #define UNGETC(c)  (--sp)
     #define RETURN(c)  return;
     #define ERROR(c)   regerr()

     #include <regexp.h>
      ...
          compile (patstr, expbuf, &expbuf[ESIZE], '\0');
      ...
          if (step (linebuf, expbuf))
               succeed ( );
      ...

RELATED INFORMATION

In this book: "NCcollate, NCcoluniq, NCeqvmap, _NCxcol, _NLxcol" and
"regcmp, regex."

The ed, grep, and sed commands in AIX Operating System Commands
Reference.

"Introduction to International Character Support" in Managing the AIX
Operating System.

AIX Guide to Multibyte Character Set (MBCS) Support.

Processed July 12, 1991