SAS® and R

Best of Both Worlds

Archive for the ‘SAS’ Category

proc sort nodup

leave a comment »

proc sort noduprecs
by id
rid the pesky duplicates
Advertisements

Written by sasandr

April 1, 2014 at 9:46 pm

Posted in SAS

Tagged with ,

Check consistency in VISITNUM/VISIT in your SDTM domains

leave a comment »

One of the requirements during CDISC SDTM data validation is VISITNUM/VISIT pair in findings domains doesn’t match TV domain data. Here is an easy way to find out how visit is defined in all findings domain.

/* Select a list of SDTM domains that use VISITNUM/VISIT */
proc sql;
 select distinct(memname) into :gdmlist
 separated by '*'
 from   sashelp.vcolumn
 where  upcase(libname)="GDMDATA" and upcase(name)="VISIT";
quit;
 
%macro check;
  %let i = 1 ;
 
  %do %while(%scan(&gdmlist,&i,*) ^= %str( )) ;
    %let indsn = %scan(&gdmlist,&i,*) ;
    title "gdmdata.&indsn";
    proc freq data=gdmdata.&indsn noprint;
      tables visitnum*visit/list missing out=test(where=(visitnum>.z and ^missing(visit)));
    run;
   
    proc transpose data=test out=vtest prefix=v;
      var visit;
      id  visitnum;
    run;
    
    data out;
      set vtest;
      length source $ 2;
      
      source=%upcase("&indsn");
    run;
    
    proc append base=final data=out force;
    run;
    
    %let i = %eval(&i + 1 ) ;
  %end ;
%mend check;
 
%check;

And the print out of the data shows how VISITNUM/VISIT are defined, and it is easy to identify problems. Actually TV domain is inconsistent with all the rest of the findings domains.

Obs    SOURCE      V4         V5         V6         V7         V8         V9         V10          V11          V888

  1      DA      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
  2      EG      Visit 4    Visit 5    Visit 6               Visit 8                           Visit 11a    Unscheduled
  3      EX      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unschedule 
  4      HO                 Visit 5    Visit 6    Visit 7    Visit 8               Visit 10    Visit 11a    Unscheduled
  5      IE                                                                                                            
  6      LB      Visit 4               Visit 6               Visit 8                           Visit 11a    Unscheduled
  7      QS      Visit 4               Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
  8      TV      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11                
  9      VS      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
 10      ZU                 Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled

Written by sasandr

March 28, 2014 at 4:06 pm

Posted in Data, SAS

Tagged with , , ,

More on perl regular expression

leave a comment »

Beginner of perl regular expression (PRX), confused by the syntax, might not appreciate the simple, compact solution PRX functions provide, and would prefer traditional string functions, which at the outset seems to suffice the needs.

Here is an example that shows some basic tools in perl regular expressions. Click “show source” to see the code.

/* Example from R Cody book */
data cat;
  input original $30.;
datalines;
there is a cat in this line.
does not match dog
cat in the beginning
at the end, a cat
Cat
;
run;

/* Substrings */
data prxword;
  input num word $ 3-15 explain $ 17-70;
datalines;
1 / cat /       has a blank space in front and after 'cat'
2 /cat/         looking for appearance of 'cat'
3 /cat/i        looking for appearance of 'cat', regardless of case
4 /^cat/        beginning of the string, case sensitive
5 /^cat/i       beginning of the string, case insensitive
6 /^cat|cat$/   '$' means end of the string, '|' means 'or'
;
run;

/* Cartesian join */
proc sql;
  create table prx as
  select * from cat, prxword;
quit;

proc sort data=prx; by num; run;

/* Output */
data _null_;
  set prx;
  by  num;
  if  first.num then pattern=prxparse(word);
  retain pattern;

  pos = prxmatch(pattern, strip(original));

  if  first.num then put "Example " num+(-1) ': ' word +(-1) ' -- ' explain;
  put original @45 pos;
  if  last.num then put//;
run;

He is the log. The numeric value represents the position of the first occurrence of the string (encased by / /) you are looking for.

Example 1: / cat / -- has a blank space in front and after 'cat'
there is a cat in this line.                11
does not match dog                          0
cat in the beginning                        0
at the end, a cat                           0
Cat                                         0

Example 2: /cat/ -- looking for appearance of 'cat'
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

Example 3: /cat/i -- looking for appearance of 'cat', regardless of case
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         1

Example 4: /^cat/ -- beginning of the string, case sensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         0

Example 5: /^cat/i -- beginning of the string, case insensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         1

Example 6: /^cat|cat$/ -- '$' means end of the string, '|' means 'or'
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

So you might shrug at the result – what’s so good about regular expression when I can pretty much achieve the same result by using find() without the hassle. As shown in the below example, find() function would perform just as well for all the tasks, saved for one.

1    data _null_;
2      set cat;
3      pos1 = find(strip(original), ' cat ');
4      pos2 = find(strip(original), 'cat');
5      pos3 = find(strip(original), 'cat', 'i');
6      pos4 = (find(strip(original), 'cat')=1);
7      pos5 = (find(strip(original), 'cat', 'i')=1);
8
9      if _n_=1 then put 'STRING' @35 'Ex 1' +6 'Ex 2' +6 'Ex 3' +6 'Ex 4' +6 'Ex 5';
10     put original @35 pos1 @45 pos2 @55 pos3 @65 pos4 @75 pos5;
11   run;
STRING                            Ex 1      Ex 2      Ex 3      Ex 4      Ex 5
there is a cat in this line.      11        12        12        0         0
does not match dog                0         0         0         0         0
cat in the beginning              0         1         1         1         1
at the end, a cat                 0         15        15        0         0
Cat                               0         0         1         0         1

But regular expression is much more flexible when you need to tame your character strings because it can use wildcards and metacharacters. Ron Cody’s SAS Functions by Example book has a table that documents the most frequently used metacharacters. You can check it out yourself.

Below is another example I took from one of the SUGI papers, which I think illustrates the power of regular expression very well. In this case, the “P.O. BOX” was not entered uniformly. We need to select all the lines that have ‘PO Box’ , ‘P.O.Box’ , ‘Box’ , ‘P O Box’,etc, and replace each one of them with ‘P. O. BOX’ in the strings. In the regular expression below, ‘ *’ matches the preceding sub expression zero or more times, and ‘?’ matches the previous subexpression zero or one time. ‘\s’ matches a white space character, including a space or a tab. I think this example shows essentially what these regular expression functions do – they make irregular expression regular.

1    data _null_;
2    retain pattern;
3    if _n_ =1 then pattern = prxparse ("s/P?\s*\.*\s*O?\s*\.*\s*BOX\s*\.*\s*/P.O. BOX /i");
4    input before $40. ;
5    length after $40.;
6    after=PRXCHANGE(PATTERN, 5, before);
7    if _n_ = 1 then put 'BEFORE' @38 'AFTER'/
8                        33*'-'   @38 33*'-';
9    put before @38 after;
10   datalines;
BEFORE                               AFTER
---------------------------------    ---------------------------------
1250 Health Plaza, P.O.BOX 495       1250 Health Plaza, P.O. BOX 495
P O BOX 2235, 35 Gene Pl             P.O. BOX 2235, 35 Gene Pl
PO BOX 56, 1st DNA Avenue            P.O. BOX 56, 1st DNA Avenue
123 Mitochondria Blvd, P BOX 223     123 Mitochondria Blvd, P.O. BOX 223
11 Wellness Ave, p o box             11 Wellness Ave, P.O. BOX
1600 Pennsylvania Ave, pBOX 2228     1600 Pennsylvania Ave, P.O. BOX 2228
P Box 121                            P.O. BOX 121
p box144                             P.O. BOX 144
pobox 169                            P.O. BOX 169
Pbox 225                             P.O. BOX 225
P. O. box. 1000-1111                 P.O. BOX 1000-1111
22   ;
23   run;

/***********************/
/* End of Illustration */
/***********************/

Written by sasandr

July 26, 2012 at 3:36 pm

Posted in SAS

Tagged with ,

Floating point arithmetic

with 4 comments

One time I was trying different cut-off points for classification to define a dichotomous variable for logistic regression, and I kept getting erroneous result when I looked at the data print-out. Values which should have been set to “Yes” according to my algorithm fell into the “No” column, and I just couldn’t figure out what went wrong.

Here is a simple example to illustrate my problem. We start with a SOURCE data set with four variables X, Y, (X-Y) and a certain constant as the cut-off value.

Floating-Point Arithmetic

Obs    x     y     x - y    cutoff
 1     2    1.1      0.9      0.9
 2     2    1.2      0.8      0.8
 3     2    1.3      0.7      0.7
 4     2    1.4      0.6      0.6
 5     2    1.5      0.5      0.5
 6     2    1.6      0.4      0.4
 7     2    1.7      0.3      0.3
 8     2    1.8      0.2      0.2
 9     2    1.9      0.1      0.1

Now a flag variables is created to indicate if (x-y) is equal to the cut-off (1 for Yes, 0 for No).

data FLOAT;
  set source;
  /* No rounding */
  flag1 = (z = cutoff);
  /* round() to the rescue */
  flag2 = (round(z,0.1) = cutoff);
run;

The print-out of dataset FLOAT shows that with round() function, flag variable is set correctly; but we get erratic result sans rounding. This is because in SAS, numeric values are represented as 64-bit floating point numbers, and rules of algebra may not apply to floating point numbers. A paper from the SAS® Institute explains this phenomena in great details. You can check it out yourself.

                                       Without      With
Obs    x     y     x - y    cutoff    rounding    rounding

 1     2    1.1      0.9      0.9         0           1
 2     2    1.2      0.8      0.8         1           1
 3     2    1.3      0.7      0.7         1           1
 4     2    1.4      0.6      0.6         0           1
 5     2    1.5      0.5      0.5         1           1
 6     2    1.6      0.4      0.4         0           1
 7     2    1.7      0.3      0.3         0           1
 8     2    1.8      0.2      0.2         0           1
 9     2    1.9      0.1      0.1         0           1

And as you see from the above example, to circumvent this problem, the quick and dirty way is using round() function to set precision before comparison.

And upon further checking, this is also an issue in R. You can see that, without rounding function, values for some of the comparisons are not returning “TRUE” even though on paper (x-y) and z might look the same.

> x <- rep(2, 9)
> y <- seq(1.1, 1.9, by=0.1)
> z <- seq(0.9, 0.1, by=-0.1)
> x-y
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> z
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> # Without rounding
> x-y == z
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
>
> # With rounding
> round(x-y, 1) == round(z, 1)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

So the take home lesson here is always check your calculation when you are dealing with values with decimal points, use various rounding and truncation functions to ensure decimal precision. And do remember, even though your number/data is continuous, computer only recognizes 0 and 1.

Written by sasandr

July 6, 2012 at 2:26 pm

Posted in R, SAS

Tagged with , ,

SAS tip: Save and load system options

leave a comment »

It is good statistical programming practice to delete all temporary data sets at the end of a macro run, not just to save considerable work space and memory, but also reduce the chance for errors such as reusing the data set from previous run, or warning signs such as naming conflict.

But how about system options? In your macro call, you might need to change several system option settings, and it would be a hassle to reset them back to the original state one-by-one. Thankfully we have OPTSAVE and OPTLOAD procedures at our disposal, which can be used to save SAS option values, and restore them at a later time. In the following example the OPTSAVE procedure writes the values of all the SAS options that can be altered from within a SAS session to a SAS data set [your_saved_options]. The OPTLOAD procedure later restores the SAS session option values from the [your_saved_options] data set.

/* Save your SAS system options */
proc optsave out=[your_saved_options];
run;

/* Reload your SAS system options */
proc optload data=[your_saved_dateset];
run;

This way, SAS options can be reset back to the original values, before exiting the macro.

Written by sasandr

June 6, 2012 at 4:08 pm

Posted in SAS

Tagged with

All those CATs

with one comment

The title is a bit misleading since we are not talking about the furry animals here. This post is about the conCATenation functions in SAS, which are useful, but CAN get hairy at times.

The word concatenate comes directly from Latin concatenare, which in turn is formed from “con-,” meaning “with” or “together,” and “catena,” meaning “chain.” And the CATx functions, in short, “chain” words or numbers together to make a new string. The different CATx functions dictate how you want to chain them together.

There are five concatenate functions in SAS (CAT, CATS, CATT, CATX, CATQ).  The main difference among these functions involves the handling of leading/trailing blanks as well as separator characters between the concatenated items. People from SAS old school might still prefer traditional concatenation operator (||), but the new functions just take far less work to accomplish the same task. The following output gives an overview of the difference between the CATx functions. (click the graph to enlarge)

As shown below, we have four strings (brackets on both sides to show the leading/trailing blanks), and you can see the resulting strings using the different CATx functions (period at the end to show where the string ends). One advantage of using the CATx functions is the items to be concatenated may be character or numeric. And, if you include numeric values, they are treated as if they were actually character values and no numeric-to-character conversion messages are printed to the SAS log.

  All Those CATx, MEOW!
 
String 1 has no leading/trailing blank            [ALL]
String 2 has three trailing blanks                [THOSE   ]
String 3 has two leading blanks                   [  CATS]
String 4 has three leading and two trailing blanks[   MEOW  ]
 
 
- CAT acts like '||' with minor difference.¹
cat(String1,String2,String3,String4) shows:       ALLTHOSE     CATS   MEOW  .
 
- CATS Removes trailing and leading blanks.
cats(String1,String2,String3,String4):            ALLTHOSECATSMEOW.
cats(12,34,56,78)                                 12345678.
 
- CATT only trims trailing blanks.
catt(String1,String2,String3,String4)             ALLTHOSE  CATS   MEOW.
 
- CATX trims both leading and trailing blanks, and inserts separator character.
catx(' ',String1,String2,String3,String4)         ALL THOSE CATS MEOW.
catx(',',String1,String2,String3,String4)         ALL,THOSE,CATS,MEOW.
catx('-',908,782,6562)                            908-782-6562.
 
- CATQ joins strings together as defined by the modifier.²
catq(' ',String1,String2,String3,String4)         ALL "THOSE   " "  CATS" "   MEOW  ".
catq('a',String1,String2,String3,String4)         "ALL" "THOSE   " "  CATS" "   MEOW  ".
catq('s',String1,String2,String3,String4)         ALL THOSE CATS MEOW.
catq('as',String1,String2,String3,String4)        "ALL" "THOSE" "CATS" "MEOW".
catq('asd','~~', String1,String2,String3,String4) "ALL"~~"THOSE"~~"CATS"~~"MEOW".
 
¹ The default length of the result when you use the || operator is the sum of
  the lengths of the strings being concatenated, the default length of the
  result when you use the CAT function is 200.
² Check SAS Language Reference for details on all the available modifiers.
 
Remember to define string length before calling CATx,
otherwise the default length of the resulting string is 200.

For a complete list of CATQ modifiers, go to CATQ function language reference.

Please note that it is always a good practice to initialize your character variable by specifying the length of the resulting variable directly under the data statement, and making sure to set the length of created variables long enough to accommodate the longest string created by concatenation. If not, the resulting string will be truncated, and you will see an error message in the log.

Written by sasandr

May 22, 2012 at 10:55 am

Posted in SAS

Tagged with

Abbr. [Update]

leave a comment »

Have you been using the same SAS procedures or SAS functions over and over again, but for some reason never been able to remember the correct syntax, or do you wish you can cut down on repetitive typing, and focus more on thinking?  SAS Enhanced Editor actually has a little tool built in to help you achieve just that.

A SAS abbreviation is a character string defined by you so that when you type the string in the Enhanced Editor window, the string is automatically substituted with a longer text string. Abbreviations are actually keyboard macros that insert one or more lines of text.

1. Create Abbreviation
Just press “Ctrl + Shift + A”, or use the menu “Tools –> Add Abbreviation”In the Abbreviation field, type the name of the abbreviation. In the Text to insert for abbreviation field, type the text that the abbreviation will expand into. Then click OK. For example, here we type in the syntax for function IFN.  We name this abbreviation ifn, and that is the “code word” you type in the Enhanced Editor to invoke this keyboard macro.

Whenever you want to use an abbreviation, simply type in the name of the abbreviation while in the Enhanced Editor. As soon as the last letter of the abbreviation has been entered, a small pop-up ‘tip’ text box containing the first few words of the abbreviation is displayed. If at that point you press the TAB/Enter key the name of the abbreviation will be replaced by the text that you stored.

2. Export/import abbreviation
You can also export or import your abbreviations so you have access to them on multiple machines. Go to “Tools –> Keyboard Macros –> Macros”.

A window opens where you select the abbreviations to export. If you want to select more than one, hold down the CTRL key on your keyboard as you click on each abbreviation. Click “Export”, SAS will automatically select the necessary file type Keyboard Macro Files (*.kmf). Name the file anything you want. SAS will then create the export file for you. And if you want to import KMF file, click “Import.”Voila! After you set up the abbreviation once, you can recall it again and again. This might be especially useful in adding program header block, and it can also reduce effort in looking up syntax for functions, statements, or procedures you are prone to forget to avoid interruption to your programming flow. Hope this trick is useful to you.

[Update] Here is a video from SAS® software solution consulting firm Amadeus Software about Using Abbreviations.

Written by sasandr

May 16, 2012 at 2:37 pm

Posted in SAS

Tagged with