## Check controlled SDTM terminology

%macro chk_ct (indsn=, var=, codelist=); %if &codelist=%str() %then %let codelist=%upcase(&var); proc freq data=&indsn noprint; tables &var / missing out=&var; run; proc sql noprint; create table check as select &var, b.cdisc_submission_value, b.codelist from &var left join metadata.sdtm_ct_20150626(where=(codelist="%upcase(&codelist)")) as b on &var=b.cdisc_submission_value; quit; data final; set check; length flag $ 3; if &var^=cdisc_submission_value then flag='No'; else if cmiss(&var, cdisc_submission_value)=0 then flag='Yes'; run; proc sql noprint; select count(*) into: totcnt from final quit; %if &totcnt=0 %then %do; data final; xx='No Data'; output; set final; output; run; %end; title "Check if variable values match CDISC submission values"; proc report data=final nowd missing headline headskip spacing=2 split='@' formchar(2)='_'; column flag codelist &var cdisc_submission_value; define flag / order width=3 ' '; define codelist / order "Codelist@(&codelist)" width=20; define &var / width=40 flow "Variable@(%upcase(&var))"; define cdisc_submission_value / width=40 flow "CDISC@Submission@Value"; break after flag / skip; %if &totcnt=0 %then %do; compute after; line @1 ' '; line @1 ' '; line @10 'Variable values matched CDISC submission values'; endcomp; %end; run; %mend chk_ct;

## Check consistency in VISITNUM/VISIT in your SDTM domains

One of the requirements during CDISC SDTM data validation is VISITNUM/VISIT pair in findings domains doesn’t match TV domain data. Here is an easy way to find out how visit is defined in all findings domain.

/* Select a list of SDTM domains that use VISITNUM/VISIT */ proc sql; select distinct(memname) into :gdmlist separated by '*' from sashelp.vcolumn where upcase(libname)="GDMDATA" and upcase(name)="VISIT"; quit; %macro check; %let i = 1 ; %do %while(%scan(&gdmlist,&i,*) ^= %str( )) ; %let indsn = %scan(&gdmlist,&i,*) ; title "gdmdata.&indsn"; proc freq data=gdmdata.&indsn noprint; tables visitnum*visit/list missing out=test(where=(visitnum>.z and ^missing(visit))); run; proc transpose data=test out=vtest prefix=v; var visit; id visitnum; run; data out; set vtest; length source $ 2; source=%upcase("&indsn"); run; proc append base=final data=out force; run; %let i = %eval(&i + 1 ) ; %end ; %mend check; %check;

And the print out of the data shows how VISITNUM/VISIT are defined, and it is easy to identify problems. Actually TV domain is inconsistent with all the rest of the findings domains.

Obs SOURCE V4 V5 V6 V7 V8 V9 V10 V11 V888 1 DA Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11a Unscheduled 2 EG Visit 4 Visit 5 Visit 6 Visit 8 Visit 11a Unscheduled 3 EX Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11a Unschedule 4 HO Visit 5 Visit 6 Visit 7 Visit 8 Visit 10 Visit 11a Unscheduled 5 IE 6 LB Visit 4 Visit 6 Visit 8 Visit 11a Unscheduled 7 QS Visit 4 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11a Unscheduled 8 TV Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11 9 VS Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11a Unscheduled 10 ZU Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11a Unscheduled

Letter from a teacher to parents. I’d say respect goes both ways, and communication is always key to resolving any problems.

There are so many things I wish I could say, but alas, I am not allowed. Ok, anyone who actually knows me, knows that I would NEVER say these things, but this is how I really feel, and this is the truth; at least it’s my truth.

Dear Parent/Guardian,

I am looking forward to a great year with your student. I have spent my summer learning new strategies, and looking at ways to improve student learning. Not a day has gone by that I didn’t at least think about something I would like to do or change in order to enhance student learning. I have ignored statements made about what I do with my summers, ignored reports of testing on the news, and done what I know is right. I know, that if I do my best to help and teach all students, tests will take care of themselves.

Every…

View original post 281 more words

## More on perl regular expression

Beginner of perl regular expression (PRX), confused by the syntax, might not appreciate the simple, compact solution PRX functions provide, and would prefer traditional string functions, which at the outset seems to suffice the needs.

Here is an example that shows some basic tools in perl regular expressions. Click “show source” to see the code.

/* Example from R Cody book */ data cat; input original $30.; datalines; there is a cat in this line. does not match dog cat in the beginning at the end, a cat Cat ; run; /* Substrings */ data prxword; input num word $ 3-15 explain $ 17-70; datalines; 1 / cat / has a blank space in front and after 'cat' 2 /cat/ looking for appearance of 'cat' 3 /cat/i looking for appearance of 'cat', regardless of case 4 /^cat/ beginning of the string, case sensitive 5 /^cat/i beginning of the string, case insensitive 6 /^cat|cat$/ '$' means end of the string, '|' means 'or' ; run; /* Cartesian join */ proc sql; create table prx as select * from cat, prxword; quit; proc sort data=prx; by num; run; /* Output */ data _null_; set prx; by num; if first.num then pattern=prxparse(word); retain pattern; pos = prxmatch(pattern, strip(original)); if first.num then put "Example " num+(-1) ': ' word +(-1) ' -- ' explain; put original @45 pos; if last.num then put//; run;

He is the log. The numeric value represents the position of the first occurrence of the string (encased by / /) you are looking for.

Example 1: / cat / -- has a blank space in front and after 'cat' there is a cat in this line. 11 does not match dog 0 cat in the beginning 0 at the end, a cat 0 Cat 0 Example 2: /cat/ -- looking for appearance of 'cat' there is a cat in this line. 12 does not match dog 0 cat in the beginning 1 at the end, a cat 15 Cat 0 Example 3: /cat/i -- looking for appearance of 'cat', regardless of case there is a cat in this line. 12 does not match dog 0 cat in the beginning 1 at the end, a cat 15 Cat 1 Example 4: /^cat/ -- beginning of the string, case sensitive there is a cat in this line. 0 does not match dog 0 cat in the beginning 1 at the end, a cat 0 Cat 0 Example 5: /^cat/i -- beginning of the string, case insensitive there is a cat in this line. 0 does not match dog 0 cat in the beginning 1 at the end, a cat 0 Cat 1 Example 6: /^cat|cat$/ -- '$' means end of the string, '|' means 'or' there is a cat in this line. 0 does not match dog 0 cat in the beginning 1 at the end, a cat 15 Cat 0

So you might shrug at the result – what’s so good about regular expression when I can pretty much achieve the same result by using find() without the hassle. As shown in the below example, find() function would perform just as well for all the tasks, saved for one.

1 data _null_; 2 set cat; 3 pos1 = find(strip(original), ' cat '); 4 pos2 = find(strip(original), 'cat'); 5 pos3 = find(strip(original), 'cat', 'i'); 6 pos4 = (find(strip(original), 'cat')=1); 7 pos5 = (find(strip(original), 'cat', 'i')=1); 8 9 if _n_=1 then put 'STRING' @35 'Ex 1' +6 'Ex 2' +6 'Ex 3' +6 'Ex 4' +6 'Ex 5'; 10 put original @35 pos1 @45 pos2 @55 pos3 @65 pos4 @75 pos5; 11 run; STRING Ex 1 Ex 2 Ex 3 Ex 4 Ex 5 there is a cat in this line. 11 12 12 0 0 does not match dog 0 0 0 0 0 cat in the beginning 0 1 1 1 1 at the end, a cat 0 15 15 0 0 Cat 0 0 1 0 1

But regular expression is much more flexible when you need to tame your character strings because it can use wildcards and metacharacters. Ron Cody’s *SAS Functions by Example* book has a table that documents the most frequently used metacharacters. You can check it out yourself.

Below is another example I took from one of the SUGI papers, which I think illustrates the power of regular expression very well. In this case, the “P.O. BOX” was not entered uniformly. We need to select all the lines that have ‘PO Box’ , ‘P.O.Box’ , ‘Box’ , ‘P O Box’,etc, and replace each one of them with ‘P. O. BOX’ in the strings. In the regular expression below, ‘ *’ matches the preceding sub expression zero or more times, and ‘?’ matches the previous subexpression zero or one time. ‘\s’ matches a white space character, including a space or a tab. I think this example shows essentially what these regular expression functions do – they make irregular expression regular.

1 data _null_; 2 retain pattern; 3 if _n_ =1 then pattern = prxparse ("s/P?\s*\.*\s*O?\s*\.*\s*BOX\s*\.*\s*/P.O. BOX /i"); 4 input before $40. ; 5 length after $40.; 6 after=PRXCHANGE(PATTERN, 5, before); 7 if _n_ = 1 then put 'BEFORE' @38 'AFTER'/ 8 33*'-' @38 33*'-'; 9 put before @38 after; 10 datalines; BEFORE AFTER --------------------------------- --------------------------------- 1250 Health Plaza, P.O.BOX 495 1250 Health Plaza, P.O. BOX 495 P O BOX 2235, 35 Gene Pl P.O. BOX 2235, 35 Gene Pl PO BOX 56, 1st DNA Avenue P.O. BOX 56, 1st DNA Avenue 123 Mitochondria Blvd, P BOX 223 123 Mitochondria Blvd, P.O. BOX 223 11 Wellness Ave, p o box 11 Wellness Ave, P.O. BOX 1600 Pennsylvania Ave, pBOX 2228 1600 Pennsylvania Ave, P.O. BOX 2228 P Box 121 P.O. BOX 121 p box144 P.O. BOX 144 pobox 169 P.O. BOX 169 Pbox 225 P.O. BOX 225 P. O. box. 1000-1111 P.O. BOX 1000-1111 22 ; 23 run; /***********************/ /* End of Illustration */ /***********************/

## God particle, 5 sigma, and p-value

About a week ago, CERN announced the discovery of a new sub-atomic particle that’s consistent with the properties of the elusive Higgs Boson, a.k.a. God Particle. CERN scientists say it is a 5 sigma result. It is interesting that almost all the news reports I read converted this 5 sigma to a percentage, and none seemed to be able to explain what exactly 5 sigma is. Some even mistakenly claimed that scientists “99.999% sure God Particle has been found.”

Actually 5 sigma is just another way of stating the probability value, in another word, p-value. So what is p-value anyway? P-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis were true.

Standard normal distribution N(0,1) has μ=0 and σ^{2}=1. As you can see from the above graph, a little more than 2/3 of values drawn from a normal distribution are within one standard deviation (one sigma) away from the mean (red area). Approximately 95% of the values are with two sigma (1.96) of the mean. Three sigma covers about 99.7% AUC (Area under the density curve).

Five sigma? That’s about 0.9999997, which means the significance level (alpha) is 0.0000003. In short, there is the null hypothesis (no God particle), and the alternative hypothesis (God particle exists). Five sigma means that there is a very slim chance (less than one in a million) the null hypothesis is true. Note that this is not the equivalent of scientists being 99.99997% sure the alternative hypothesis is correct.

> pnorm(5) [1] 0.999999713348428 > 1-pnorm(5) [1] 0.000000286651571923535

Here is the code for the graph. I wrote it in a hurry, any suggestion to make it better is welcome.

my.color <- rainbow(10) my.symbol2 <- expression(mu) my.axis <- c(-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6) my.label <- c('-6', '-5','-4','-3','-2','-1',my.symbol2,'1','2','3','4','5','6') x = seq(-6, 6, length = 600) y = dnorm(x) plot(x, y, type = "n", xlab=my.symbol, ylab=' ', axes=FALSE) plotsigma <- function(start, end, color){ sigmax = seq(start, end, length=100) sigmay = c(0, dnorm(sigmax), 0) sigmax = c(start, sigmax, end) polygon(sigmax, sigmay, col = color, border = NA) } for (i in 5:1){ plotsigma(-i, i, my.color[i]) } axis(1,at=my.axis,labels=my.label) lines(x, y) segments(0,0.4,0,0, col='white') segments(5,0.2,5,0, lty=3) text(5,0.22, expression(paste(5, sigma, sep='')))

## Floating point arithmetic

One time I was trying different cut-off points for classification to define a dichotomous variable for logistic regression, and I kept getting erroneous result when I looked at the data print-out. Values which should have been set to “Yes” according to my algorithm fell into the “No” column, and I just couldn’t figure out what went wrong.

Here is a simple example to illustrate my problem. We start with a SOURCE data set with four variables X, Y, (X-Y) and a certain constant as the cut-off value.

Floating-Point Arithmetic Obs x y x - y cutoff 1 2 1.1 0.9 0.9 2 2 1.2 0.8 0.8 3 2 1.3 0.7 0.7 4 2 1.4 0.6 0.6 5 2 1.5 0.5 0.5 6 2 1.6 0.4 0.4 7 2 1.7 0.3 0.3 8 2 1.8 0.2 0.2 9 2 1.9 0.1 0.1

Now a flag variables is created to indicate if (x-y) is equal to the cut-off (1 for Yes, 0 for No).

data FLOAT; set source; /* No rounding */ flag1 = (z = cutoff); /* round() to the rescue */ flag2 = (round(z,0.1) = cutoff); run;

The print-out of dataset FLOAT shows that with round() function, flag variable is set correctly; but we get erratic result sans rounding. This is because in SAS, numeric values are represented as 64-bit floating point numbers, and rules of algebra may not apply to floating point numbers. A paper from the SAS® Institute explains this phenomena in great details. You can check it out yourself.

Without With Obs x y x - y cutoff rounding rounding 1 2 1.1 0.9 0.9 0 1 2 2 1.2 0.8 0.8 1 1 3 2 1.3 0.7 0.7 1 1 4 2 1.4 0.6 0.6 0 1 5 2 1.5 0.5 0.5 1 1 6 2 1.6 0.4 0.4 0 1 7 2 1.7 0.3 0.3 0 1 8 2 1.8 0.2 0.2 0 1 9 2 1.9 0.1 0.1 0 1

And as you see from the above example, to circumvent this problem, the quick and dirty way is using round() function to set precision before comparison.

And upon further checking, this is also an issue in R. You can see that, without rounding function, values for some of the comparisons are not returning “TRUE” even though on paper (x-y) and z might look the same.

> x <- rep(2, 9) > y <- seq(1.1, 1.9, by=0.1) > z <- seq(0.9, 0.1, by=-0.1) > x-y [1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 > z [1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 > # Without rounding > x-y == z [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE > > # With rounding > round(x-y, 1) == round(z, 1) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

So the take home lesson here is always check your calculation when you are dealing with values with decimal points, use various rounding and truncation functions to ensure decimal precision. And do remember, even though your number/data is continuous, computer only recognizes 0 and 1.