SAS® and R

Best of Both Worlds

Check controlled SDTM terminology

leave a comment »


%macro chk_ct (indsn=, var=, codelist=);

%if &codelist=%str() %then %let codelist=%upcase(&var);

proc freq data=&indsn noprint;
tables &var / missing out=&var;
run;

proc sql noprint;
create table check as
select &var, b.cdisc_submission_value, b.codelist
from   &var left join metadata.sdtm_ct_20150626(where=(codelist="%upcase(&codelist)")) as b
on     &var=b.cdisc_submission_value;
quit;

data final;
set check;
length flag $ 3;
if  &var^=cdisc_submission_value then flag='No';
else if cmiss(&var, cdisc_submission_value)=0 then flag='Yes';
run;

proc sql noprint;
select count(*) into: totcnt from final
quit;

%if &totcnt=0 %then %do;
data final;
xx='No Data'; output;
set final; output;
run;
%end;

title "Check if variable values match CDISC submission values";
proc report data=final nowd missing headline headskip spacing=2 split='@' formchar(2)='_';
column flag codelist &var cdisc_submission_value;
define flag     / order width=3 ' ';
define codelist / order "Codelist@(&codelist)" width=20;
define &var     /  width=40 flow "Variable@(%upcase(&var))";
define cdisc_submission_value / width=40 flow "CDISC@Submission@Value";

break after flag / skip;
%if &totcnt=0 %then %do;
compute after;
line @1 ' ';
line @1 ' ';
line @10 'Variable values matched CDISC submission values';
endcomp;
%end;
run;

%mend chk_ct;
Advertisements

Written by sasandr

March 10, 2016 at 11:54 am

Posted in Uncategorized

proc sort nodup

leave a comment »

proc sort noduprecs
by id
rid the pesky duplicates

Written by sasandr

April 1, 2014 at 9:46 pm

Posted in SAS

Tagged with ,

Check consistency in VISITNUM/VISIT in your SDTM domains

leave a comment »

One of the requirements during CDISC SDTM data validation is VISITNUM/VISIT pair in findings domains doesn’t match TV domain data. Here is an easy way to find out how visit is defined in all findings domain.

/* Select a list of SDTM domains that use VISITNUM/VISIT */
proc sql;
 select distinct(memname) into :gdmlist
 separated by '*'
 from   sashelp.vcolumn
 where  upcase(libname)="GDMDATA" and upcase(name)="VISIT";
quit;
 
%macro check;
  %let i = 1 ;
 
  %do %while(%scan(&gdmlist,&i,*) ^= %str( )) ;
    %let indsn = %scan(&gdmlist,&i,*) ;
    title "gdmdata.&indsn";
    proc freq data=gdmdata.&indsn noprint;
      tables visitnum*visit/list missing out=test(where=(visitnum>.z and ^missing(visit)));
    run;
   
    proc transpose data=test out=vtest prefix=v;
      var visit;
      id  visitnum;
    run;
    
    data out;
      set vtest;
      length source $ 2;
      
      source=%upcase("&indsn");
    run;
    
    proc append base=final data=out force;
    run;
    
    %let i = %eval(&i + 1 ) ;
  %end ;
%mend check;
 
%check;

And the print out of the data shows how VISITNUM/VISIT are defined, and it is easy to identify problems. Actually TV domain is inconsistent with all the rest of the findings domains.

Obs    SOURCE      V4         V5         V6         V7         V8         V9         V10          V11          V888

  1      DA      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
  2      EG      Visit 4    Visit 5    Visit 6               Visit 8                           Visit 11a    Unscheduled
  3      EX      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unschedule 
  4      HO                 Visit 5    Visit 6    Visit 7    Visit 8               Visit 10    Visit 11a    Unscheduled
  5      IE                                                                                                            
  6      LB      Visit 4               Visit 6               Visit 8                           Visit 11a    Unscheduled
  7      QS      Visit 4               Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
  8      TV      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11                
  9      VS      Visit 4    Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled
 10      ZU                 Visit 5    Visit 6    Visit 7    Visit 8    Visit 9    Visit 10    Visit 11a    Unscheduled

Written by sasandr

March 28, 2014 at 4:06 pm

Posted in Data, SAS

Tagged with , , ,

leave a comment »

Letter from a teacher to parents. I’d say respect goes both ways, and communication is always key to resolving any problems.

Teachable Moments

There are so many things I wish I could say, but alas, I am not allowed. Ok, anyone who actually knows me, knows that I would NEVER say these things, but this is how I really feel, and this is the truth; at least it’s my truth.

Dear Parent/Guardian,

I am looking forward to a great year with your student. I have spent my summer learning new strategies, and looking at ways to improve student learning. Not a day has gone by that I didn’t at least think about something I would like to do or change in order to enhance student learning. I have ignored statements made about what I do with my summers, ignored reports of testing on the news, and done what I know is right. I know, that if I do my best to help and teach all students, tests will take care of themselves.

Every…

View original post 281 more words

Written by sasandr

July 27, 2012 at 10:20 am

Posted in Misc

More on perl regular expression

leave a comment »

Beginner of perl regular expression (PRX), confused by the syntax, might not appreciate the simple, compact solution PRX functions provide, and would prefer traditional string functions, which at the outset seems to suffice the needs.

Here is an example that shows some basic tools in perl regular expressions. Click “show source” to see the code.

/* Example from R Cody book */
data cat;
  input original $30.;
datalines;
there is a cat in this line.
does not match dog
cat in the beginning
at the end, a cat
Cat
;
run;

/* Substrings */
data prxword;
  input num word $ 3-15 explain $ 17-70;
datalines;
1 / cat /       has a blank space in front and after 'cat'
2 /cat/         looking for appearance of 'cat'
3 /cat/i        looking for appearance of 'cat', regardless of case
4 /^cat/        beginning of the string, case sensitive
5 /^cat/i       beginning of the string, case insensitive
6 /^cat|cat$/   '$' means end of the string, '|' means 'or'
;
run;

/* Cartesian join */
proc sql;
  create table prx as
  select * from cat, prxword;
quit;

proc sort data=prx; by num; run;

/* Output */
data _null_;
  set prx;
  by  num;
  if  first.num then pattern=prxparse(word);
  retain pattern;

  pos = prxmatch(pattern, strip(original));

  if  first.num then put "Example " num+(-1) ': ' word +(-1) ' -- ' explain;
  put original @45 pos;
  if  last.num then put//;
run;

He is the log. The numeric value represents the position of the first occurrence of the string (encased by / /) you are looking for.

Example 1: / cat / -- has a blank space in front and after 'cat'
there is a cat in this line.                11
does not match dog                          0
cat in the beginning                        0
at the end, a cat                           0
Cat                                         0

Example 2: /cat/ -- looking for appearance of 'cat'
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

Example 3: /cat/i -- looking for appearance of 'cat', regardless of case
there is a cat in this line.                12
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         1

Example 4: /^cat/ -- beginning of the string, case sensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         0

Example 5: /^cat/i -- beginning of the string, case insensitive
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           0
Cat                                         1

Example 6: /^cat|cat$/ -- '$' means end of the string, '|' means 'or'
there is a cat in this line.                0
does not match dog                          0
cat in the beginning                        1
at the end, a cat                           15
Cat                                         0

So you might shrug at the result – what’s so good about regular expression when I can pretty much achieve the same result by using find() without the hassle. As shown in the below example, find() function would perform just as well for all the tasks, saved for one.

1    data _null_;
2      set cat;
3      pos1 = find(strip(original), ' cat ');
4      pos2 = find(strip(original), 'cat');
5      pos3 = find(strip(original), 'cat', 'i');
6      pos4 = (find(strip(original), 'cat')=1);
7      pos5 = (find(strip(original), 'cat', 'i')=1);
8
9      if _n_=1 then put 'STRING' @35 'Ex 1' +6 'Ex 2' +6 'Ex 3' +6 'Ex 4' +6 'Ex 5';
10     put original @35 pos1 @45 pos2 @55 pos3 @65 pos4 @75 pos5;
11   run;
STRING                            Ex 1      Ex 2      Ex 3      Ex 4      Ex 5
there is a cat in this line.      11        12        12        0         0
does not match dog                0         0         0         0         0
cat in the beginning              0         1         1         1         1
at the end, a cat                 0         15        15        0         0
Cat                               0         0         1         0         1

But regular expression is much more flexible when you need to tame your character strings because it can use wildcards and metacharacters. Ron Cody’s SAS Functions by Example book has a table that documents the most frequently used metacharacters. You can check it out yourself.

Below is another example I took from one of the SUGI papers, which I think illustrates the power of regular expression very well. In this case, the “P.O. BOX” was not entered uniformly. We need to select all the lines that have ‘PO Box’ , ‘P.O.Box’ , ‘Box’ , ‘P O Box’,etc, and replace each one of them with ‘P. O. BOX’ in the strings. In the regular expression below, ‘ *’ matches the preceding sub expression zero or more times, and ‘?’ matches the previous subexpression zero or one time. ‘\s’ matches a white space character, including a space or a tab. I think this example shows essentially what these regular expression functions do – they make irregular expression regular.

1    data _null_;
2    retain pattern;
3    if _n_ =1 then pattern = prxparse ("s/P?\s*\.*\s*O?\s*\.*\s*BOX\s*\.*\s*/P.O. BOX /i");
4    input before $40. ;
5    length after $40.;
6    after=PRXCHANGE(PATTERN, 5, before);
7    if _n_ = 1 then put 'BEFORE' @38 'AFTER'/
8                        33*'-'   @38 33*'-';
9    put before @38 after;
10   datalines;
BEFORE                               AFTER
---------------------------------    ---------------------------------
1250 Health Plaza, P.O.BOX 495       1250 Health Plaza, P.O. BOX 495
P O BOX 2235, 35 Gene Pl             P.O. BOX 2235, 35 Gene Pl
PO BOX 56, 1st DNA Avenue            P.O. BOX 56, 1st DNA Avenue
123 Mitochondria Blvd, P BOX 223     123 Mitochondria Blvd, P.O. BOX 223
11 Wellness Ave, p o box             11 Wellness Ave, P.O. BOX
1600 Pennsylvania Ave, pBOX 2228     1600 Pennsylvania Ave, P.O. BOX 2228
P Box 121                            P.O. BOX 121
p box144                             P.O. BOX 144
pobox 169                            P.O. BOX 169
Pbox 225                             P.O. BOX 225
P. O. box. 1000-1111                 P.O. BOX 1000-1111
22   ;
23   run;

/***********************/
/* End of Illustration */
/***********************/

Written by sasandr

July 26, 2012 at 3:36 pm

Posted in SAS

Tagged with ,

God particle, 5 sigma, and p-value

with 2 comments

About a week ago, CERN announced the discovery of a new sub-atomic particle that’s consistent with the properties of the elusive Higgs Boson, a.k.a. God Particle. CERN scientists say it is a 5 sigma result. It is interesting that almost all the news reports I read converted this 5 sigma to a percentage, and none seemed to be able to explain what exactly 5 sigma is. Some even mistakenly claimed that scientists “99.999% sure God Particle has been found.

Actually 5 sigma is just another way of stating the probability value, in another word, p-value. So what is p-value anyway? P-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis were true.

Standard normal distribution N(0,1) has μ=0 and σ2=1. As you can see from the above graph, a little more than 2/3 of values drawn from a normal distribution are within one standard deviation (one sigma) away from the mean (red area). Approximately 95% of the values are with two sigma (1.96) of the mean. Three sigma covers about 99.7% AUC (Area under the density curve).

Five sigma? That’s about 0.9999997, which means the significance level (alpha) is 0.0000003. In short, there is the null hypothesis (no God particle), and the alternative hypothesis (God particle exists). Five sigma means that there is a very slim chance (less than one in a million) the null hypothesis is true. Note that this is not the equivalent of scientists being 99.99997% sure the alternative hypothesis is correct.

> pnorm(5)
[1] 0.999999713348428
> 1-pnorm(5)
[1] 0.000000286651571923535

Here is the code for the graph. I wrote it in a hurry, any suggestion to make it better is welcome.

my.color   <- rainbow(10)
my.symbol2 <- expression(mu)
my.axis    <- c(-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6)
my.label   <- c('-6', '-5','-4','-3','-2','-1',my.symbol2,'1','2','3','4','5','6')

x = seq(-6, 6, length = 600)
y = dnorm(x)

plot(x, y, type = "n", xlab=my.symbol, ylab=' ', axes=FALSE)

plotsigma <- function(start, end, color){
  sigmax = seq(start, end, length=100)
  sigmay = c(0, dnorm(sigmax), 0)
  sigmax = c(start, sigmax, end)
  polygon(sigmax, sigmay, col = color, border = NA)
}

for (i in 5:1){
  plotsigma(-i, i, my.color[i])
}

axis(1,at=my.axis,labels=my.label)
lines(x, y)
segments(0,0.4,0,0, col='white')
segments(5,0.2,5,0, lty=3)
text(5,0.22, expression(paste(5, sigma, sep='')))

Written by sasandr

July 14, 2012 at 12:08 am

Posted in Misc, Stat

Tagged with ,

Floating point arithmetic

with 4 comments

One time I was trying different cut-off points for classification to define a dichotomous variable for logistic regression, and I kept getting erroneous result when I looked at the data print-out. Values which should have been set to “Yes” according to my algorithm fell into the “No” column, and I just couldn’t figure out what went wrong.

Here is a simple example to illustrate my problem. We start with a SOURCE data set with four variables X, Y, (X-Y) and a certain constant as the cut-off value.

Floating-Point Arithmetic

Obs    x     y     x - y    cutoff
 1     2    1.1      0.9      0.9
 2     2    1.2      0.8      0.8
 3     2    1.3      0.7      0.7
 4     2    1.4      0.6      0.6
 5     2    1.5      0.5      0.5
 6     2    1.6      0.4      0.4
 7     2    1.7      0.3      0.3
 8     2    1.8      0.2      0.2
 9     2    1.9      0.1      0.1

Now a flag variables is created to indicate if (x-y) is equal to the cut-off (1 for Yes, 0 for No).

data FLOAT;
  set source;
  /* No rounding */
  flag1 = (z = cutoff);
  /* round() to the rescue */
  flag2 = (round(z,0.1) = cutoff);
run;

The print-out of dataset FLOAT shows that with round() function, flag variable is set correctly; but we get erratic result sans rounding. This is because in SAS, numeric values are represented as 64-bit floating point numbers, and rules of algebra may not apply to floating point numbers. A paper from the SAS® Institute explains this phenomena in great details. You can check it out yourself.

                                       Without      With
Obs    x     y     x - y    cutoff    rounding    rounding

 1     2    1.1      0.9      0.9         0           1
 2     2    1.2      0.8      0.8         1           1
 3     2    1.3      0.7      0.7         1           1
 4     2    1.4      0.6      0.6         0           1
 5     2    1.5      0.5      0.5         1           1
 6     2    1.6      0.4      0.4         0           1
 7     2    1.7      0.3      0.3         0           1
 8     2    1.8      0.2      0.2         0           1
 9     2    1.9      0.1      0.1         0           1

And as you see from the above example, to circumvent this problem, the quick and dirty way is using round() function to set precision before comparison.

And upon further checking, this is also an issue in R. You can see that, without rounding function, values for some of the comparisons are not returning “TRUE” even though on paper (x-y) and z might look the same.

> x <- rep(2, 9)
> y <- seq(1.1, 1.9, by=0.1)
> z <- seq(0.9, 0.1, by=-0.1)
> x-y
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> z
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> # Without rounding
> x-y == z
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
>
> # With rounding
> round(x-y, 1) == round(z, 1)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

So the take home lesson here is always check your calculation when you are dealing with values with decimal points, use various rounding and truncation functions to ensure decimal precision. And do remember, even though your number/data is continuous, computer only recognizes 0 and 1.

Written by sasandr

July 6, 2012 at 2:26 pm

Posted in R, SAS

Tagged with , ,