## Posts Tagged ‘**R**’

## God particle, 5 sigma, and p-value

About a week ago, CERN announced the discovery of a new sub-atomic particle that’s consistent with the properties of the elusive Higgs Boson, a.k.a. God Particle. CERN scientists say it is a 5 sigma result. It is interesting that almost all the news reports I read converted this 5 sigma to a percentage, and none seemed to be able to explain what exactly 5 sigma is. Some even mistakenly claimed that scientists “99.999% sure God Particle has been found.”

Actually 5 sigma is just another way of stating the probability value, in another word, p-value. So what is p-value anyway? P-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis were true.

Standard normal distribution N(0,1) has μ=0 and σ^{2}=1. As you can see from the above graph, a little more than 2/3 of values drawn from a normal distribution are within one standard deviation (one sigma) away from the mean (red area). Approximately 95% of the values are with two sigma (1.96) of the mean. Three sigma covers about 99.7% AUC (Area under the density curve).

Five sigma? That’s about 0.9999997, which means the significance level (alpha) is 0.0000003. In short, there is the null hypothesis (no God particle), and the alternative hypothesis (God particle exists). Five sigma means that there is a very slim chance (less than one in a million) the null hypothesis is true. Note that this is not the equivalent of scientists being 99.99997% sure the alternative hypothesis is correct.

> pnorm(5) [1] 0.999999713348428 > 1-pnorm(5) [1] 0.000000286651571923535

Here is the code for the graph. I wrote it in a hurry, any suggestion to make it better is welcome.

my.color <- rainbow(10) my.symbol2 <- expression(mu) my.axis <- c(-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6) my.label <- c('-6', '-5','-4','-3','-2','-1',my.symbol2,'1','2','3','4','5','6') x = seq(-6, 6, length = 600) y = dnorm(x) plot(x, y, type = "n", xlab=my.symbol, ylab=' ', axes=FALSE) plotsigma <- function(start, end, color){ sigmax = seq(start, end, length=100) sigmay = c(0, dnorm(sigmax), 0) sigmax = c(start, sigmax, end) polygon(sigmax, sigmay, col = color, border = NA) } for (i in 5:1){ plotsigma(-i, i, my.color[i]) } axis(1,at=my.axis,labels=my.label) lines(x, y) segments(0,0.4,0,0, col='white') segments(5,0.2,5,0, lty=3) text(5,0.22, expression(paste(5, sigma, sep='')))

## Floating point arithmetic

One time I was trying different cut-off points for classification to define a dichotomous variable for logistic regression, and I kept getting erroneous result when I looked at the data print-out. Values which should have been set to “Yes” according to my algorithm fell into the “No” column, and I just couldn’t figure out what went wrong.

Here is a simple example to illustrate my problem. We start with a SOURCE data set with four variables X, Y, (X-Y) and a certain constant as the cut-off value.

Floating-Point Arithmetic Obs x y x - y cutoff 1 2 1.1 0.9 0.9 2 2 1.2 0.8 0.8 3 2 1.3 0.7 0.7 4 2 1.4 0.6 0.6 5 2 1.5 0.5 0.5 6 2 1.6 0.4 0.4 7 2 1.7 0.3 0.3 8 2 1.8 0.2 0.2 9 2 1.9 0.1 0.1

Now a flag variables is created to indicate if (x-y) is equal to the cut-off (1 for Yes, 0 for No).

data FLOAT; set source; /* No rounding */ flag1 = (z = cutoff); /* round() to the rescue */ flag2 = (round(z,0.1) = cutoff); run;

The print-out of dataset FLOAT shows that with round() function, flag variable is set correctly; but we get erratic result sans rounding. This is because in SAS, numeric values are represented as 64-bit floating point numbers, and rules of algebra may not apply to floating point numbers. A paper from the SAS® Institute explains this phenomena in great details. You can check it out yourself.

Without With Obs x y x - y cutoff rounding rounding 1 2 1.1 0.9 0.9 0 1 2 2 1.2 0.8 0.8 1 1 3 2 1.3 0.7 0.7 1 1 4 2 1.4 0.6 0.6 0 1 5 2 1.5 0.5 0.5 1 1 6 2 1.6 0.4 0.4 0 1 7 2 1.7 0.3 0.3 0 1 8 2 1.8 0.2 0.2 0 1 9 2 1.9 0.1 0.1 0 1

And as you see from the above example, to circumvent this problem, the quick and dirty way is using round() function to set precision before comparison.

And upon further checking, this is also an issue in R. You can see that, without rounding function, values for some of the comparisons are not returning “TRUE” even though on paper (x-y) and z might look the same.

> x <- rep(2, 9) > y <- seq(1.1, 1.9, by=0.1) > z <- seq(0.9, 0.1, by=-0.1) > x-y [1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 > z [1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 > # Without rounding > x-y == z [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE > > # With rounding > round(x-y, 1) == round(z, 1) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

So the take home lesson here is always check your calculation when you are dealing with values with decimal points, use various rounding and truncation functions to ensure decimal precision. And do remember, even though your number/data is continuous, computer only recognizes 0 and 1.

## Happy Mother’s Day

Here is the R code for this graph.

par(bty='n', ann=FALSE, col.axis='white') n=10000 r=0.7 r_e=(1-r*r)^.5 X=rnorm(n) Y=X*r+r_e*rnorm(n) Y=ifelse(X>0,Y,-Y) a<-sample(c(2,5,6,7),n,T) b<-sample(c(76,79,86,69),n,T) plot(X,Y,col='green', axes=FALSE) text(X,Y,"LOVE",col=a) text(0, 0.6, "Happy\nMother's\nDay", adj=c(0.5, 0.5), col='white', cex=4)

## Rolling the dice II

Continuing from the previous post, for a pair of dice, the sample space becomes a little “bigger.” That is 36 (6X6) possible outcomes. Again, for a third-grader, it is much more easier to display the list visually, like this:

Now, what’s the probability of dots added up to 7? By the assumption of fairness, we know that each one of those 36 outcomes is equally likely. My daughter finally found all the combinations, and figured out the the probability is 1 out of 6 (6/36).

I heard it somewhere that smart men use math formula to solve problems, but ‘lazy’ person (layperson :-)) like me would like to use simulations with R (or other computer languages for the matter), and let the machines do what they are good at – really fast, and intense computation. People issue commands, machines follow them. This time, we let the machine toss two dice 1000 times, and count the numbers of getting each possible outcomes. Notice the shape of the histogram? That’s very close to the probability distribution of the sum of the two dice. This pretty much illustrated the Bernoulli Theorem, which states that relative frequency of an event in a sequence of independent trials converges in probability to the probability of the event.

R has a package “TeachingDemos” that’s really useful if you want to demonstrate elementary statistical concepts. I did all the dice simulation using the dice() function. You can find out more here.

## Rolling the dice

Over the weekend, I tried to teach my 9-year-old daughter some basic concept of probability. And what better to start her with than examples of rolling the dice since for most of human history, probability, the formal study of the laws of chance, was used for only one thing: gambling.

First I tried to introduce the basic definition of “sample space”. With one die, my daughter was able to determine the sample space is six, and figure out the probability of getting each is 1 out of 6. In the beginning we actually rolled the dice together, and recorded the outcomes on a piece of paper. But soon she got bored, and I don’t blame her. It gets tedious and time-consuming to manually sample a process. Also entirely possible that the dice I got are not fair, thus introducing bias into the experiment.

Thankfully with computer, it is now possible to do simulations – rolling dice thousands of times in matter of seconds. Here are four examples of rolling the dice 1000 times, we can see at the end of the long sequence, the proportion of getting face of one is near 1/6, but it is still not exactly 1/6, which reminds us even this long sampling process is finite, and there is no guarantee that the relative frequency of an event will match the true underlying probability of it happening.

The idea of these graphs came from Dr. John K. Kruschke’s book – Doing Bayesian Data Analysis: A Tutorial with R and BUGS.