## Archive for the ‘**Stat**’ Category

## God particle, 5 sigma, and p-value

About a week ago, CERN announced the discovery of a new sub-atomic particle that’s consistent with the properties of the elusive Higgs Boson, a.k.a. the God Particle. CERN scientists say it is a 5 sigma result. It is interesting that almost all the news reports I read converted this 5 sigma to a percentage, and none seemed to be able to explain what exactly 5 sigma is. Some even mistakenly claimed that scientists are “99.999% sure the God Particle has been found.”

Actually, 5 sigma is just another way of stating a probability value, in other words, a p-value. So what is a p-value anyway? A p-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis were true.

The standard normal distribution N(0,1) has μ=0 and σ^{2}=1. As you can see from the graph above, a little more than 2/3 of the values drawn from a normal distribution fall within one standard deviation (one sigma) of the mean (red area). Approximately 95% of the values are within two sigma (more precisely, 1.96) of the mean. Three sigma covers about 99.7% of the AUC (area under the density curve).
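These coverage figures are easy to check in R with `pnorm()`; here is a minimal sketch (the helper name `coverage` is my own):

```r
# Fraction of a standard normal within k standard deviations of the mean
coverage <- function(k) pnorm(k) - pnorm(-k)

coverage(1)     # about 0.6827 - one sigma
coverage(1.96)  # about 0.95   - two sigma
coverage(3)     # about 0.9973 - three sigma
```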

Five sigma? That corresponds to about 0.9999997, which means the significance level (alpha) is about 0.0000003. In short, there is the null hypothesis (no God Particle) and the alternative hypothesis (the God Particle exists). Five sigma means there is a very slim chance (less than one in a million) of observing data this extreme if the null hypothesis were true. Note that this is not equivalent to scientists being 99.99997% sure the alternative hypothesis is correct.

```r
> pnorm(5)
[1] 0.999999713348428
> 1 - pnorm(5)
[1] 0.000000286651571923535
```

Here is the code for the graph. I wrote it in a hurry, any suggestion to make it better is welcome.

```r
my.color  <- rainbow(10)
my.symbol <- expression(mu)
my.axis   <- -6:6
my.label  <- c('-6','-5','-4','-3','-2','-1', my.symbol, '1','2','3','4','5','6')

x <- seq(-6, 6, length = 600)
y <- dnorm(x)
plot(x, y, type = "n", xlab = my.symbol, ylab = ' ', axes = FALSE)

# Shade the area under the density curve between 'start' and 'end'
plotsigma <- function(start, end, color) {
  sigmax <- seq(start, end, length = 100)
  sigmay <- c(0, dnorm(sigmax), 0)
  sigmax <- c(start, sigmax, end)
  polygon(sigmax, sigmay, col = color, border = NA)
}

# Draw the widest band first so the narrower ones overlay it
for (i in 5:1) {
  plotsigma(-i, i, my.color[i])
}

axis(1, at = my.axis, labels = my.label)
lines(x, y)
segments(0, 0.4, 0, 0, col = 'white')
segments(5, 0.2, 5, 0, lty = 3)
text(5, 0.22, expression(paste(5, sigma)))
```

## Freakonomics survey

Freakonomics just put up a survey: Which social science should die?

The four candidates for elimination are: Psychology, Political science, Economics, and Sociology. If you ask me what these four disciplines have in common, at the top of my list would be that they all tend to misuse statistical tests of significance, with sociology and political science leading the charge. I’d be interested to know the survey results. However, for the survey to be meaningful in any discussion, I think they should take the survey takers’ professions into consideration.

Please go take the survey; the results are out next week.

## Djokovic, Federer drawn to meet in Wimbledon semis

I thought I’d update my old post about the men’s (in Wimbledon’s case, gentlemen’s) draw since now we have one more data point. The Wimbledon draw is out – as the Associated Press puts it – “*Random* as Grand Slam tournament draws are meant to be, Novak Djokovic and Roger Federer keep bumping into each other in major semifinals, and it could happen again at Wimbledon.” Could it be anything but random?

So Federer and Djokovic somehow always end up in the same half – 19 times out of 27 draws in the past seven years. Statistically speaking, what is the probability of their landing in the same half 19 or more times out of 27? How about less than 3%?
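Under the simplifying assumption that each draw independently puts them in the same half with probability 1/2, the tail probability is a one-liner with `pbinom()`:

```r
# P(19 or more same-half draws out of 27), each draw a fair coin flip
pbinom(18, size = 27, prob = 0.5, lower.tail = FALSE)  # about 0.026
```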

## Roland Garros 2012

The French Open draw is out. What’s new? Not much. You get the same old, same old Nadal/Murray vs. Djokovic/Federer setup. So what are the odds of Djokovic and Federer always landing in the same half? That’s worth looking into.

For those who are not familiar with the draw process: in Grand Slam tennis, there are 128 players in the main draw. After the seeding is decided, the top two seeds are placed in opposite halves of the draw, ensuring that the best players can only meet in the final. Since Federer and Djokovic were never simultaneously seeded in the top two at a Grand Slam tournament, theoretically speaking, Djokovic had a 50% chance of being drawn into Federer’s half (and vice versa). This seems to have been the case in 2006 and most of 2007, before Djokovic’s ranking rose to #4. After Djokovic became a consistent presence in the top four, and therefore a credible contender for Grand Slam titles, curiously enough he appeared in Federer’s half most of the time. As you can see in the graph, during the four-year span from 2008 to 2011, in only two draws out of 16 were they in different halves. The proportion gets even higher for the so-called fast courts (12 out of 12). It seems that only on clay (the French Open) do the two have a more even chance of facing each other in the final.

More astonishingly, since the 2008 Wimbledon they have been drawn into the same half seven times in a row. So what’s the probability of that? It’s difficult to calculate exactly, since these events are not all independent. But I did a very crude simulation based on the binomial distribution. It seems the chance of being on the same side of the draw 18 times out of 26 is quite rare, as is the chance of seven times in a row (< 5%). Given this, combined with a study conducted by ESPN, conspiracy-minded tennis buffs would like to know whether the draw is indeed fixed to make the game more exciting, or whether this is really nothing more than a small-probability event occurring.
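For what it’s worth, both figures can also be computed directly under the same crude independent-coin-flip assumption (which, as noted, isn’t strictly true):

```r
# Probability of seven same-half draws in a row
0.5^7                                                  # 1/128, about 0.008

# Probability of 18 or more same-half draws out of 26
pbinom(17, size = 26, prob = 0.5, lower.tail = FALSE)  # about 0.038
```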

## Rolling the dice II

Continuing from the previous post: for a pair of dice, the sample space becomes a little “bigger” – 36 (6×6) possible outcomes. Again, for a third-grader, it is much easier to display the list visually, like this:

Now, what’s the probability of the dots adding up to 7? By the assumption of fairness, we know that each of those 36 outcomes is equally likely. My daughter eventually found all the combinations and figured out that the probability is 1 out of 6 (6/36).

I heard somewhere that smart people use math formulas to solve problems, but ‘lazy’ people (laypersons :-)) like me prefer simulations with R (or other computer languages, for that matter), letting the machines do what they are good at: really fast, intense computation. People issue commands; machines follow them. This time, we let the machine toss two dice 1000 times and count how many times each possible sum occurs. Notice the shape of the histogram? It is very close to the probability distribution of the sum of the two dice. This pretty much illustrates Bernoulli’s theorem, which states that the relative frequency of an event in a sequence of independent trials converges in probability to the probability of the event.
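In case you want to reproduce the simulation without any extra packages, here is a base-R sketch (the seed and variable names are my own choices):

```r
set.seed(2012)  # for reproducibility
rolls <- replicate(1000, sum(sample(1:6, 2, replace = TRUE)))

# Empirical distribution of the sum of two dice
table(rolls) / length(rolls)

# Histogram of the sums; its shape approximates the triangular
# probability distribution of the sum
hist(rolls, breaks = seq(1.5, 12.5, by = 1), main = 'Sum of two dice')
```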

R has a package, “TeachingDemos,” that’s really useful if you want to demonstrate elementary statistical concepts. I did all the dice simulations using its dice() function. You can find out more here.

## Rolling the dice

Over the weekend, I tried to teach my 9-year-old daughter some basic concepts of probability. And what better way to start her off than with examples of rolling dice, since for most of human history, probability, the formal study of the laws of chance, was used for only one thing: gambling.

First I tried to introduce the basic definition of a “sample space.” With one die, my daughter was able to determine that the sample space has six outcomes, and figure out that the probability of getting each is 1 out of 6. In the beginning we actually rolled the dice together and recorded the outcomes on a piece of paper. But soon she got bored, and I don’t blame her. It gets tedious and time-consuming to sample a process manually. It is also entirely possible that the dice I got are not fair, which would introduce bias into the experiment.

Thankfully, with a computer it is now possible to do simulations – rolling dice thousands of times in a matter of seconds. Here are four examples of rolling a die 1000 times. We can see that at the end of the long sequence, the proportion of rolls showing a one is near 1/6, but still not exactly 1/6. This reminds us that even this long sampling process is finite, and there is no guarantee that the relative frequency of an event will exactly match its true underlying probability.
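One such running-proportion graph can be sketched in a few lines of base R (the seed and names are my own; the book cited below does this in its own style):

```r
set.seed(1)
rolls <- sample(1:6, 1000, replace = TRUE)

# Running proportion of ones after each roll
running.prop <- cumsum(rolls == 1) / seq_along(rolls)

plot(running.prop, type = 'l', log = 'x',
     xlab = 'Roll number', ylab = 'Proportion of ones')
abline(h = 1/6, lty = 2)  # the true probability
```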

The idea of these graphs came from Dr. John K. Kruschke’s book – Doing Bayesian Data Analysis: A Tutorial with R and BUGS.