SAS® and R

Best of Both Worlds

More on Perl regular expressions

Beginners with Perl regular expressions (PRX), confused by the syntax, might not appreciate the simple, compact solutions that the PRX functions provide, and may prefer traditional string functions, which at the outset seem to meet their needs.

Here is an example that shows some basic tools in Perl regular expressions.

/* Example from R Cody book */
data cat;
  input original $30.;
datalines;
there is a cat in this line.
does not match dog
cat in the beginning
at the end, a cat
Cat
;
run;

/* Substrings */
data prxword;
  input num word $ 3-15 explain $ 17-70;
datalines;
1 / cat /       has a blank space in front and after 'cat'
2 /cat/         looking for appearance of 'cat'
3 /cat/i        looking for appearance of 'cat', regardless of case
4 /^cat/        beginning of the string, case sensitive
5 /^cat/i       beginning of the string, case insensitive
6 /^cat|cat$/   '$' means end of the string, '|' means 'or'
;
run;

/* Cartesian join */
proc sql;
  create table prx as
  select * from cat, prxword;
quit;

proc sort data=prx; by num; run;

/* Output */
data _null_;
  set prx;
  by  num;
  if  first.num then pattern=prxparse(word);
  retain pattern;

  pos = prxmatch(pattern, strip(original));

  if  first.num then put "Example " num+(-1) ': ' word +(-1) ' -- ' explain;
  put original @45 pos;
  if  last.num then put//;
run;

Here is the log. The numeric value is the position of the first match of the pattern (enclosed in / /) within the string; 0 means the pattern was not found.

Example 1: / cat / -- has a blank space in front and after 'cat'
there is a cat in this line.                11
does not match dog                          0
cat in the beginning                        0
at the end, a cat                           0
Cat                                         0

Example 2: /cat/ -- looking for appearance of 'cat'
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

Example 3: /cat/i -- looking for appearance of 'cat', regardless of case
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         1

Example 4: /^cat/ -- beginning of the string, case sensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         0

Example 5: /^cat/i -- beginning of the string, case insensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         1

Example 6: /^cat|cat$/ -- '$' means end of the string, '|' means 'or'
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0
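
A side note on the strip() calls in the step above: SAS pads character variables with trailing blanks, so a pattern anchored with '$' would otherwise never reach the end of the value. The step below is a minimal sketch of that effect, not part of Cody's example; the variable names padded, stripped and tolerant are made up for the demonstration. It also passes the pattern to prxmatch() as a string, which the function accepts in place of an id returned by prxparse().

```sas
/* Sketch: why the examples wrap the variable in strip() */
data _null_;
  length original $30;
  original = 'at the end, a cat';

  /* Trailing blanks are part of the value, so 'cat' is not at the end */
  padded   = prxmatch('/cat$/', original);

  /* Blanks removed before matching, as in the examples above */
  stripped = prxmatch('/cat$/', strip(original));

  /* Alternative: let the pattern itself allow trailing blanks */
  tolerant = prxmatch('/cat\s*$/', original);

  put padded= stripped= tolerant=;
run;
```

I would expect 0 for the padded call and 15 for the other two, which agrees with the position reported for this string in Example 6.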

So you might shrug at the result: what is so good about regular expressions when I can achieve pretty much the same result with find(), without the hassle? As the example below shows, the find() function performs just as well for all of these tasks, save for one (Example 6, which combines two anchored alternatives in a single pattern).

data _null_;
  set cat;
  pos1 = find(strip(original), ' cat ');
  pos2 = find(strip(original), 'cat');
  pos3 = find(strip(original), 'cat', 'i');
  pos4 = (find(strip(original), 'cat')=1);
  pos5 = (find(strip(original), 'cat', 'i')=1);

  if _n_=1 then put 'STRING' @35 'Ex 1' +6 'Ex 2' +6 'Ex 3' +6 'Ex 4' +6 'Ex 5';
  put original @35 pos1 @45 pos2 @55 pos3 @65 pos4 @75 pos5;
run;

STRING                            Ex 1      Ex 2      Ex 3      Ex 4      Ex 5
there is a cat in this line.      11        12        12        0         0
does not match dog                0         0         0         0         0
cat in the beginning              0         1         1         1         1
at the end, a cat                 0         15        15        0         0
Cat                               0         0         1         0         1
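
For comparison, here is a rough sketch of what Example 6 (/^cat|cat$/) might look like with traditional string functions only. This is my own illustration and just one of several possible ways to write it; the variables s, n and pos6 are hypothetical names, and the step assumes the cat data set created earlier.

```sas
/* Sketch: emulating /^cat|cat$/ without regular expressions */
data _null_;
  set cat;                      /* the cat data set created earlier */
  length s $30;
  s = strip(original);
  n = length(s);                /* length without trailing blanks */

  /* 'cat' at the very beginning of the string */
  if find(s, 'cat') = 1 then pos6 = 1;
  /* 'cat' as the last three non-blank characters */
  else if n >= 3 and substr(s, max(n - 2, 1)) = 'cat' then pos6 = n - 2;
  else pos6 = 0;

  put original @35 pos6;
run;
```

It works, but two separate tests and some length arithmetic are needed for what a single pattern expresses in one call.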

But regular expressions are much more flexible when you need to tame your character strings, because they can use wildcards and metacharacters. Ron Cody's book SAS Functions by Example has a table that documents the most frequently used metacharacters. You can check it out yourself.
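
To give a flavor of what metacharacters add, here is a small sketch using a few standard Perl ones: \b (word boundary), [ ] (character class) and \d (digit). These are not copied from Cody's table, and the test strings and variable names are made up for illustration.

```sas
/* Sketch: three patterns that find() cannot express directly */
data _null_;
  input text $30.;

  /* 'cat' as a whole word only, regardless of case */
  whole_word = prxmatch('/\bcat\b/i', strip(text));

  /* 'c', then one of a, o, u, then 't': cat, cot or cut */
  any_vowel  = prxmatch('/c[aou]t/', strip(text));

  /* position of the first digit, 0 if there is none */
  has_digit  = prxmatch('/\d/', strip(text));

  put text @32 whole_word= any_vowel= has_digit=;
datalines;
concatenate strings
a cat and 2 dogs
cut the rope
;
run;
```

Note that find(text, 'cat') would report a hit inside 'concatenate', whereas the \b pattern should return 0 for that line because 'cat' is not a separate word there.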

Below is another example, taken from one of the SUGI papers, which I think illustrates the power of regular expressions very well. In this case, “P.O. BOX” was not entered uniformly. We need to find every variant, such as ‘PO Box’, ‘P.O.Box’, ‘Box’, ‘P O Box’, etc., and replace each one with ‘P.O. BOX’ in the string. In the regular expression below, ‘*’ matches the preceding subexpression zero or more times, and ‘?’ matches the preceding subexpression zero or one time. ‘\s’ matches a whitespace character, such as a space or a tab. I think this example shows the essence of what these regular expression functions do: they make irregular expressions regular.

data _null_;
  retain pattern;
  /* Compile the substitution once; the i modifier ignores case */
  if _n_ = 1 then pattern = prxparse("s/P?\s*\.*\s*O?\s*\.*\s*BOX\s*\.*\s*/P.O. BOX /i");
  input before $40.;
  length after $40;
  /* Apply the substitution up to 5 times per line */
  after = prxchange(pattern, 5, before);
  if _n_ = 1 then put 'BEFORE' @38 'AFTER' /
                      33*'-'   @38 33*'-';
  put before @38 after;
  /* Data lines are not echoed in the log; reconstructed from its BEFORE column */
datalines;
1250 Health Plaza, P.O.BOX 495
P O BOX 2235, 35 Gene Pl
PO BOX 56, 1st DNA Avenue
123 Mitochondria Blvd, P BOX 223
11 Wellness Ave, p o box
1600 Pennsylvania Ave, pBOX 2228
P Box 121
p box144
pobox 169
Pbox 225
P. O. box. 1000-1111
;
run;

BEFORE                               AFTER
---------------------------------    ---------------------------------
1250 Health Plaza, P.O.BOX 495       1250 Health Plaza, P.O. BOX 495
P O BOX 2235, 35 Gene Pl             P.O. BOX 2235, 35 Gene Pl
PO BOX 56, 1st DNA Avenue            P.O. BOX 56, 1st DNA Avenue
123 Mitochondria Blvd, P BOX 223     123 Mitochondria Blvd, P.O. BOX 223
11 Wellness Ave, p o box             11 Wellness Ave, P.O. BOX
1600 Pennsylvania Ave, pBOX 2228     1600 Pennsylvania Ave, P.O. BOX 2228
P Box 121                            P.O. BOX 121
p box144                             P.O. BOX 144
pobox 169                            P.O. BOX 169
Pbox 225                             P.O. BOX 225
P. O. box. 1000-1111                 P.O. BOX 1000-1111
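
A short note on the second argument of prxchange(): it is the maximum number of replacements per string, and -1 means "replace every match". The function also accepts the substitution as a string instead of an id from prxparse(). The step below is a minimal sketch of both points, using a made-up address with irregular spacing rather than data from the SUGI paper.

```sas
/* Sketch: collapse runs of whitespace with a one-line prxchange call */
data _null_;
  length before after $40;
  before = 'PO   BOX   56,   1st DNA Avenue';

  /* -1 replaces every match; the pattern is passed directly as a string */
  after = prxchange('s/\s+/ /', -1, strip(before));

  put before= / after=;
run;
```

A clean-up like this could be run before or after the P.O. BOX substitution, depending on how messy the input is.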

/***********************/
/* End of Illustration */
/***********************/
Written by sasandr

July 26, 2012 at 3:36 pm

Posted in SAS
