SAS® and R

Best of Both Worlds

Posts Tagged ‘SAS function

More on perl regular expression

leave a comment »

Beginner of perl regular expression (PRX), confused by the syntax, might not appreciate the simple, compact solution PRX functions provide, and would prefer traditional string functions, which at the outset seems to suffice the needs.

Here is an example that shows some basic tools in perl regular expressions. Click “show source” to see the code.

/* Example from R Cody book */
data cat;
  input original $30.;
there is a cat in this line.
does not match dog
cat in the beginning
at the end, a cat

/* Substrings */
data prxword;
  input num word $ 3-15 explain $ 17-70;
1 / cat /       has a blank space in front and after 'cat'
2 /cat/         looking for appearance of 'cat'
3 /cat/i        looking for appearance of 'cat', regardless of case
4 /^cat/        beginning of the string, case sensitive
5 /^cat/i       beginning of the string, case insensitive
6 /^cat|cat$/   '$' means end of the string, '|' means 'or'

/* Cartesian join */
proc sql;
  create table prx as
  select * from cat, prxword;

proc sort data=prx; by num; run;

/* Output */
data _null_;
  set prx;
  by  num;
  if  first.num then pattern=prxparse(word);
  retain pattern;

  pos = prxmatch(pattern, strip(original));

  if  first.num then put "Example " num+(-1) ': ' word +(-1) ' -- ' explain;
  put original @45 pos;
  if  last.num then put//;

He is the log. The numeric value represents the position of the first occurrence of the string (encased by / /) you are looking for.

Example 1: / cat / -- has a blank space in front and after 'cat'
there is a cat in this line.                11
does not match dog                          0
cat in the beginning                        0
at the end, a cat                           0
Cat                                         0

Example 2: /cat/ -- looking for appearance of 'cat'
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

Example 3: /cat/i -- looking for appearance of 'cat', regardless of case
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         1

Example 4: /^cat/ -- beginning of the string, case sensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         0

Example 5: /^cat/i -- beginning of the string, case insensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         1

Example 6: /^cat|cat$/ -- '$' means end of the string, '|' means 'or'
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

So you might shrug at the result – what’s so good about regular expression when I can pretty much achieve the same result by using find() without the hassle. As shown in the below example, find() function would perform just as well for all the tasks, saved for one.

1    data _null_;
2      set cat;
3      pos1 = find(strip(original), ' cat ');
4      pos2 = find(strip(original), 'cat');
5      pos3 = find(strip(original), 'cat', 'i');
6      pos4 = (find(strip(original), 'cat')=1);
7      pos5 = (find(strip(original), 'cat', 'i')=1);
9      if _n_=1 then put 'STRING' @35 'Ex 1' +6 'Ex 2' +6 'Ex 3' +6 'Ex 4' +6 'Ex 5';
10     put original @35 pos1 @45 pos2 @55 pos3 @65 pos4 @75 pos5;
11   run;
STRING                            Ex 1      Ex 2      Ex 3      Ex 4      Ex 5
there is a cat in this line.      11        12        12        0         0
does not match dog                0         0         0         0         0
cat in the beginning              0         1         1         1         1
at the end, a cat                 0         15        15        0         0
Cat                               0         0         1         0         1

But regular expression is much more flexible when you need to tame your character strings because it can use wildcards and metacharacters. Ron Cody’s SAS Functions by Example book has a table that documents the most frequently used metacharacters. You can check it out yourself.

Below is another example I took from one of the SUGI papers, which I think illustrates the power of regular expression very well. In this case, the “P.O. BOX” was not entered uniformly. We need to select all the lines that have ‘PO Box’ , ‘P.O.Box’ , ‘Box’ , ‘P O Box’,etc, and replace each one of them with ‘P. O. BOX’ in the strings. In the regular expression below, ‘ *’ matches the preceding sub expression zero or more times, and ‘?’ matches the previous subexpression zero or one time. ‘\s’ matches a white space character, including a space or a tab. I think this example shows essentially what these regular expression functions do – they make irregular expression regular.

1    data _null_;
2    retain pattern;
3    if _n_ =1 then pattern = prxparse ("s/P?\s*\.*\s*O?\s*\.*\s*BOX\s*\.*\s*/P.O. BOX /i");
4    input before $40. ;
5    length after $40.;
6    after=PRXCHANGE(PATTERN, 5, before);
7    if _n_ = 1 then put 'BEFORE' @38 'AFTER'/
8                        33*'-'   @38 33*'-';
9    put before @38 after;
10   datalines;
BEFORE                               AFTER
---------------------------------    ---------------------------------
1250 Health Plaza, P.O.BOX 495       1250 Health Plaza, P.O. BOX 495
P O BOX 2235, 35 Gene Pl             P.O. BOX 2235, 35 Gene Pl
PO BOX 56, 1st DNA Avenue            P.O. BOX 56, 1st DNA Avenue
123 Mitochondria Blvd, P BOX 223     123 Mitochondria Blvd, P.O. BOX 223
11 Wellness Ave, p o box             11 Wellness Ave, P.O. BOX
1600 Pennsylvania Ave, pBOX 2228     1600 Pennsylvania Ave, P.O. BOX 2228
P Box 121                            P.O. BOX 121
p box144                             P.O. BOX 144
pobox 169                            P.O. BOX 169
Pbox 225                             P.O. BOX 225
P. O. box. 1000-1111                 P.O. BOX 1000-1111
22   ;
23   run;

/* End of Illustration */

Written by sasandr

July 26, 2012 at 3:36 pm

Posted in SAS

Tagged with ,

All those CATs

with one comment

The title is a bit misleading since we are not talking about the furry animals here. This post is about the conCATenation functions in SAS, which are useful, but CAN get hairy at times.

The word concatenate comes directly from Latin concatenare, which in turn is formed from “con-,” meaning “with” or “together,” and “catena,” meaning “chain.” And the CATx functions, in short, “chain” words or numbers together to make a new string. The different CATx functions dictate how you want to chain them together.

There are five concatenate functions in SAS (CAT, CATS, CATT, CATX, CATQ).  The main difference among these functions involves the handling of leading/trailing blanks as well as separator characters between the concatenated items. People from SAS old school might still prefer traditional concatenation operator (||), but the new functions just take far less work to accomplish the same task. The following output gives an overview of the difference between the CATx functions. (click the graph to enlarge)

As shown below, we have four strings (brackets on both sides to show the leading/trailing blanks), and you can see the resulting strings using the different CATx functions (period at the end to show where the string ends). One advantage of using the CATx functions is the items to be concatenated may be character or numeric. And, if you include numeric values, they are treated as if they were actually character values and no numeric-to-character conversion messages are printed to the SAS log.

  All Those CATx, MEOW!
String 1 has no leading/trailing blank            [ALL]
String 2 has three trailing blanks                [THOSE   ]
String 3 has two leading blanks                   [  CATS]
String 4 has three leading and two trailing blanks[   MEOW  ]
- CAT acts like '||' with minor difference.¹
cat(String1,String2,String3,String4) shows:       ALLTHOSE     CATS   MEOW  .
- CATS Removes trailing and leading blanks.
cats(String1,String2,String3,String4):            ALLTHOSECATSMEOW.
cats(12,34,56,78)                                 12345678.
- CATT only trims trailing blanks.
catt(String1,String2,String3,String4)             ALLTHOSE  CATS   MEOW.
- CATX trims both leading and trailing blanks, and inserts separator character.
catx(' ',String1,String2,String3,String4)         ALL THOSE CATS MEOW.
catx(',',String1,String2,String3,String4)         ALL,THOSE,CATS,MEOW.
catx('-',908,782,6562)                            908-782-6562.
- CATQ joins strings together as defined by the modifier.²
catq(' ',String1,String2,String3,String4)         ALL "THOSE   " "  CATS" "   MEOW  ".
catq('a',String1,String2,String3,String4)         "ALL" "THOSE   " "  CATS" "   MEOW  ".
catq('s',String1,String2,String3,String4)         ALL THOSE CATS MEOW.
catq('as',String1,String2,String3,String4)        "ALL" "THOSE" "CATS" "MEOW".
catq('asd','~~', String1,String2,String3,String4) "ALL"~~"THOSE"~~"CATS"~~"MEOW".
¹ The default length of the result when you use the || operator is the sum of
  the lengths of the strings being concatenated, the default length of the
  result when you use the CAT function is 200.
² Check SAS Language Reference for details on all the available modifiers.
Remember to define string length before calling CATx,
otherwise the default length of the resulting string is 200.

For a complete list of CATQ modifiers, go to CATQ function language reference.

Please note that it is always a good practice to initialize your character variable by specifying the length of the resulting variable directly under the data statement, and making sure to set the length of created variables long enough to accommodate the longest string created by concatenation. If not, the resulting string will be truncated, and you will see an error message in the log.

Written by sasandr

May 22, 2012 at 10:55 am

Posted in SAS

Tagged with

More about PRXCHANGE function

with one comment

In a way, CALL PRXCHANGE is very similar to TRANWRD, it searches for regular expression that matches and replaces them with another character string.  However, it has the advantage of performing searches with wildcards, thus giving it more flexibility.

Let’s take a look at the example from post The flexibility of PRX functions in SAS:

prxchange('s/(observation|treatment)//io', -1, visit)

Starting from the first parameter, “s/” in the beginning tells us we are doing substitution.  Inside the parentheses are the words were are looking to replace.  So in this case, we are looking for either “observation” or “treatment”.  What PRX function does here is it searches for these words, and put them in a buffer, and replace them with the words specified after the ‘/‘ sign.  In our case, we are trying to remove these words, so we leave substitution as empty.

“/i” is useful because it specifies that the search is case insensitive.  It’s basically an “ignore case option”.  The regular expression is recompiled on every loop of the data step.  In our situation, we do not need that so we can add the “o” option to the end of the regular expression to tell it to just compile it once.

The second parameter to the prxchange() function is -1 and just tells the function to keep searching the source, finding and replacing every occurrence till you get to the end of source.

The third parameter “visit” just tells the function which source variable to search.

So here you have it, a simple solution once you understand what’s behind that “expression.”

Written by sasandr

May 7, 2012 at 9:34 pm

Posted in SAS

Tagged with ,

The flexibility of PRX functions in SAS

with 3 comments

PRX here stands for Perl Regular Expression.  SAS regular expressions (the RX functions) and Perl Regular Expressions (the PRX functions) are a set of functions added to SAS since version 9.  However, many SAS users are unfamiliar with SAS/Perl Regular Expressions.  This tool is very useful and flexible because it enables you to locate patterns in text strings, and it sometimes provide a much more compact solution to a complicated string manipulation task.

Here is one example.  Function PRXCHANGE substitutes one string for another.  Say in this dataset, you want to compress the long form of VISIT variable to a short form, ridding the words ‘Observation’ and ‘Treatment’.

If you use PRXCHANGE function, there is only one line of code.  And you will get the desired result.

/* PRXCHANGE demo */

prxchange function

I will introduce more examples in my next blog post.

Written by sasandr

April 25, 2012 at 1:32 pm

Posted in SAS

Tagged with ,