SAS® and R

Best of Both Worlds

Archive for the ‘R’ Category

Floating point arithmetic

with 4 comments

One time I was trying different cut-off points to define a dichotomous variable for logistic regression, and I kept getting erroneous results when I looked at the data print-out. Values that should have been flagged as “Yes” according to my algorithm fell into the “No” column, and I just couldn’t figure out what went wrong.

Here is a simple example to illustrate my problem. We start with a SOURCE data set containing four variables: x, y, z (computed as x - y), and a constant cut-off value.


Obs    x     y     x - y    cutoff
 1     2    1.1      0.9      0.9
 2     2    1.2      0.8      0.8
 3     2    1.3      0.7      0.7
 4     2    1.4      0.6      0.6
 5     2    1.5      0.5      0.5
 6     2    1.6      0.4      0.4
 7     2    1.7      0.3      0.3
 8     2    1.8      0.2      0.2
 9     2    1.9      0.1      0.1

Now flag variables are created to indicate whether z (that is, x - y) is equal to the cut-off (1 for Yes, 0 for No).

data FLOAT;
  set source;
  /* No rounding */
  flag1 = (z = cutoff);
  /* round() to the rescue */
  flag2 = (round(z, 0.1) = cutoff);
run;

The print-out of data set FLOAT shows that with the round() function the flag variable is set correctly, but we get erratic results without rounding. This is because SAS represents numeric values as 64-bit floating-point numbers, and the rules of algebra do not always hold for floating-point arithmetic. A paper from SAS® Institute explains this phenomenon in great detail. You can check it out yourself.

                                       Without      With
Obs    x     y     x - y    cutoff    rounding    rounding

 1     2    1.1      0.9      0.9         0           1
 2     2    1.2      0.8      0.8         1           1
 3     2    1.3      0.7      0.7         1           1
 4     2    1.4      0.6      0.6         0           1
 5     2    1.5      0.5      0.5         1           1
 6     2    1.6      0.4      0.4         0           1
 7     2    1.7      0.3      0.3         0           1
 8     2    1.8      0.2      0.2         0           1
 9     2    1.9      0.1      0.1         0           1

As you can see from the example above, a quick-and-dirty way to circumvent this problem is to use the round() function to set the precision before comparison.

Upon further checking, this is also an issue in R. Without a rounding function, some of the comparisons do not return “TRUE” even though on paper (x - y) and z look the same.

> x <- rep(2, 9)
> y <- seq(1.1, 1.9, by=0.1)
> z <- seq(0.9, 0.1, by=-0.1)
> x-y
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> z
[1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
> # Without rounding
> x-y == z
> # With rounding
> round(x-y, 1) == round(z, 1)
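
Rounding is one fix; another common approach (my addition, not from the original post) is to compare within a small tolerance, which is essentially what R’s all.equal() does. A minimal sketch:

```r
x <- rep(2, 9)
y <- seq(1.1, 1.9, by = 0.1)
z <- seq(0.9, 0.1, by = -0.1)

# Compare within a tolerance instead of demanding exact equality
eps <- 1e-8
flag <- abs((x - y) - z) < eps
all(flag)                       # TRUE: every pair matches within tolerance

# all.equal() packages the same idea (it returns TRUE or a difference message)
isTRUE(all.equal(x - y, z))     # TRUE
```

The tolerance just needs to be much larger than the representation error (around 1e-16 here) and much smaller than any real difference you care about.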

So the take-home lesson here is: always check your calculations when you are dealing with values that have decimal points, and use rounding and truncation functions to ensure decimal precision before comparing. And do remember: even though your numbers/data are continuous, the computer ultimately stores only finite binary approximations, zeros and ones.


Written by sasandr

July 6, 2012 at 2:26 pm

Posted in R, SAS


Happy Mother’s Day

leave a comment »


Here is the R code for this graph.

# X and Y (defined earlier in the original post) hold the coordinates
# of the plotted design
par(bty='n', ann=FALSE, col.axis='white')
plot(X, Y, col='green', axes=FALSE)
text(0, 0.6, "Happy\nMother's\nDay", adj=c(0.5, 0.5), col='white', cex=4)

Written by sasandr

May 12, 2012 at 8:57 pm

Posted in R


Rolling the dice II

leave a comment »

Continuing from the previous post: for a pair of dice, the sample space becomes a little “bigger,” with 36 (6 × 6) possible outcomes.  Again, for a third-grader, it is much easier to display the list visually, like this:

sample space for a pair of dice

Now, what’s the probability of the dots adding up to 7?  By the assumption of fairness, we know that each one of those 36 outcomes is equally likely.  My daughter finally found all the combinations, and figured out that the probability is 1 out of 6 (6/36).

I heard it somewhere that smart people use math formulas to solve problems, but a ‘lazy’ person (layperson :-)) like me would rather use simulations with R (or other computer languages, for that matter), and let the machines do what they are good at: really fast, intense computation.  People issue commands, machines follow them.  This time, we let the machine toss two dice 1000 times, and count the number of times each possible outcome appears.  Notice the shape of the histogram?  It is very close to the probability distribution of the sum of the two dice.  This pretty much illustrates the Bernoulli Theorem, which states that the relative frequency of an event in a sequence of independent trials converges in probability to the probability of that event.
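
The two-dice experiment can be sketched in a few lines of base R (my own sketch using sample(), not the TeachingDemos code from the original post):

```r
set.seed(2012)                          # any seed; just makes the rolls reproducible
n <- 1000
die1 <- sample(1:6, n, replace = TRUE)  # first die, 1000 rolls
die2 <- sample(1:6, n, replace = TRUE)  # second die, 1000 rolls
sums <- die1 + die2

table(sums)        # counts for each sum 2..12; barplot(table(sums)) draws the histogram
mean(sums == 7)    # relative frequency of a 7, which should land near 1/6
```

With 1000 tosses the observed frequency of 7 typically falls within a couple of percentage points of the true 1/6.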

R has a package “TeachingDemos” that’s really useful if you want to demonstrate elementary statistical concepts.  I did all the dice simulation using the dice() function.  You can find out more here.

Written by sasandr

May 4, 2012 at 8:02 pm

Posted in R, Stat


Rolling the dice

leave a comment »

Over the weekend, I tried to teach my 9-year-old daughter some basic concepts of probability.  And what better way to start her off than with examples of rolling dice, since for most of human history, probability, the formal study of the laws of chance, was used for only one thing: gambling.

First I tried to introduce the basic definition of “sample space.”  With one die, my daughter was able to determine that the sample space has six outcomes, and to figure out that the probability of getting each is 1 out of 6.  In the beginning we actually rolled the dice together, and recorded the outcomes on a piece of paper.  But soon she got bored, and I don’t blame her: it gets tedious and time-consuming to sample a process manually.  It is also entirely possible that the dice I got are not fair, which would introduce bias into the experiment.

Thankfully, with a computer it is now possible to do simulations, rolling dice thousands of times in a matter of seconds.  Here are four examples of rolling a die 1000 times.  We can see that at the end of the long sequence, the proportion of ones is near 1/6 but still not exactly 1/6, which reminds us that even this long sampling process is finite, and there is no guarantee that the relative frequency of an event will match the true underlying probability of it happening.
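
One such run can be sketched in base R (my own code, not the exact script behind the original graphs):

```r
set.seed(1)                             # arbitrary seed for reproducibility
n <- 1000
rolls <- sample(1:6, n, replace = TRUE) # 1000 rolls of a fair die

# Running proportion of ones after each roll
prop_one <- cumsum(rolls == 1) / seq_len(n)
prop_one[n]    # close to, but almost never exactly, 1/6
```

Plotting prop_one against 1:n shows the wiggly curve settling down toward 1/6 without ever being pinned there.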

The idea of these graphs came from Dr. John K. Kruschke’s book – Doing Bayesian Data Analysis: A Tutorial with R and BUGS.



Written by sasandr

May 3, 2012 at 9:41 pm

Posted in R, Stat
