PMean: Missing values in R talk

I’m talking a bit about missing values in R this afternoon for the Kansas City R Users Group. Here is what I’ll be talking about.

 

When you import data into R, the first thing you should do is

KNOW YOUR COUNT

This means that you need to know how many rows are in your data, and for all of the key variables, how many missing values you have. In a nested or hierarchical model, you need to also know the counts at each level of nesting or each level of the hierarchy.

Here is a data set, freely available on the web, that shows sleep patterns and other statistics on various mammals. It has a handful of missing values and can help illustrate how you might tackle a new dataset. You can find a description of this data set at http://www.statsci.org/data/general/sleep.html.

sl <- read.table(file="http://www.statsci.org/data/general/sleep.txt",header=TRUE)
dim(sl)
## [1] 62 11
names(sl)
##  [1] "Species"     "BodyWt"      "BrainWt"     "NonDreaming" "Dreaming"   
##  [6] "TotalSleep"  "LifeSpan"    "Gestation"   "Predation"   "Exposure"   
## [11] "Danger"

This data set has 62 rows and 11 variables. Since the data set doesn’t have too many variables, let’s print out the first few and last few rows.

head(sl)
##                  Species   BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 1        Africanelephant 6654.000  5712.0          NA       NA        3.3
## 2 Africangiantpouchedrat    1.000     6.6         6.3      2.0        8.3
## 3              ArcticFox    3.385    44.5          NA       NA       12.5
## 4   Arcticgroundsquirrel    0.920     5.7          NA       NA       16.5
## 5          Asianelephant 2547.000  4603.0         2.1      1.8        3.9
## 6                 Baboon   10.550   179.5         9.1      0.7        9.8
##   LifeSpan Gestation Predation Exposure Danger
## 1     38.6       645         3        5      3
## 2      4.5        42         3        1      3
## 3     14.0        60         1        1      1
## 4       NA        25         5        2      3
## 5     69.0       624         3        5      4
## 6     27.0       180         4        4      4
tail(sl)
##                 Species BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 57               Tenrec  0.900     2.6        11.0      2.3       13.3
## 58            Treehyrax  2.000    12.3         4.9      0.5        5.4
## 59            Treeshrew  0.104     2.5        13.2      2.6       15.8
## 60               Vervet  4.190    58.0         9.7      0.6       10.3
## 61         Wateropossum  3.500     3.9        12.8      6.6       19.4
## 62 Yellow-belliedmarmot  4.050    17.0          NA       NA         NA
##    LifeSpan Gestation Predation Exposure Danger
## 57      4.5        60         2        1      2
## 58      7.5       200         3        1      3
## 59      2.3        46         3        2      2
## 60     24.0       210         4        3      4
## 61      3.0        14         2        1      1
## 62     13.0        38         3        1      1

The summary function is useful, as it will display percentiles and the mean for numeric variables, but also (very importantly) display the number of missing values.

For factors, it will display a table of the first seven most frequently occuring values

summary(sl)
##                    Species       BodyWt        BrainWt      NonDreaming   
##  Africanelephant       : 1   Min.   :   0   Min.   :   0   Min.   : 2.10  
##  Africangiantpouchedrat: 1   1st Qu.:   1   1st Qu.:   4   1st Qu.: 6.25  
##  ArcticFox             : 1   Median :   3   Median :  17   Median : 8.35  
##  Arcticgroundsquirrel  : 1   Mean   : 199   Mean   : 283   Mean   : 8.67  
##  Asianelephant         : 1   3rd Qu.:  48   3rd Qu.: 166   3rd Qu.:11.00  
##  Baboon                : 1   Max.   :6654   Max.   :5712   Max.   :17.90  
##  (Other)               :56                                 NA's   :14     
##     Dreaming      TotalSleep       LifeSpan        Gestation    
##  Min.   :0.00   Min.   : 2.60   Min.   :  2.00   Min.   : 12.0  
##  1st Qu.:0.90   1st Qu.: 8.05   1st Qu.:  6.62   1st Qu.: 35.8  
##  Median :1.80   Median :10.45   Median : 15.10   Median : 79.0  
##  Mean   :1.97   Mean   :10.53   Mean   : 19.88   Mean   :142.4  
##  3rd Qu.:2.55   3rd Qu.:13.20   3rd Qu.: 27.75   3rd Qu.:207.5  
##  Max.   :6.60   Max.   :19.90   Max.   :100.00   Max.   :645.0  
##  NA's   :12     NA's   :4       NA's   :4        NA's   :4      
##    Predation       Exposure        Danger    
##  Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:2.00   1st Qu.:1.00   1st Qu.:1.00  
##  Median :3.00   Median :2.00   Median :2.00  
##  Mean   :2.87   Mean   :2.42   Mean   :2.61  
##  3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :5.00   Max.   :5.00   Max.   :5.00  
## 

The first variable, Species, has 62 unique values. You can tell this by the fact that the first seven most frequently occuring values all had frequency 1. There are several other ways to check for uniqueness.

table(table(sl$Species))
## 
##  1 
## 62
length(unique(sl$Species))
## [1] 62

The table function does not show NAs, at least with the default option

table(sl$Gestation)
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 
##    1    1    1    1

You would notice this if you added a total to the end of the table.

addmargins(table(sl$Gestation))
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645  Sum 
##    1    1    1    1   58

The total is only 58, which is less than the magic number of 62. This means that there are 4 missing values.

You can also force the table to show counts for missing values.

table(sl$Gestation,useNA="always")
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 <NA> 
##    1    1    1    1    4

The deault option for useNA is “never” and a third option is “ifany”. Notice how “ifany” differs from “always”

table(sl$Gestation,useNA="ifany")
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 <NA> 
##    1    1    1    1    4
table(sl$Danger,useNA="ifany")
## 
##  1  2  3  4  5 
## 19 14 10 10  9
table(sl$Danger,useNA="always")
## 
##    1    2    3    4    5 <NA> 
##   19   14   10   10    9    0

Many othe functions in R offer multiple ways to handle NAs.

mean(sl$Gestation,na.rm=FALSE)
## [1] NA
mean(sl$Gestation,na.rm=TRUE)
## [1] 142.4
mod1 <- lm(TotalSleep~Gestation,data=sl,na.action=na.omit)
summary(mod1)
## 
## Call:
## lm(formula = TotalSleep ~ Gestation, data = sl, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.593 -2.604  0.232  1.408  7.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.30579    0.68828   19.33  < 2e-16 ***
## Gestation   -0.02075    0.00353   -5.87  3.1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.62 on 52 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.399,  Adjusted R-squared:  0.387 
## F-statistic: 34.5 on 1 and 52 DF,  p-value: 3.08e-07
# summary(lm(TotalSleep~Gestation,data=sl,na.action=na.fail))

The alternate option for na.action in the lm function is na.fail, but I cannot run this directly in knitr, because the resulting error message causes knitr to stop.

Logic values involving NAs are a bit tricky. You cannot get at NAs directly using the == comparison operator.

which(sl$Gestation==NA)
## integer(0)

What’s happening here is that most logic comparisons invloving NA will evaluate not to TRUE or FALSE, but rather to NA.

If you want to select missing values, you must use the is.na function.

which(is.na(sl$Gestation))
## [1] 13 19 20 56

The is.finite function represents “opposite” function, though there are a few subtle differences.

which(is.finite(sl$Gestation))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 21 22 23 24 25 26
## [24] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [47] 50 51 52 53 54 55 57 58 59 60 61 62

Logic involving missing values can seem at times a bit illogical.

sl$Gestation==NA
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
sl$Gestation==645
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE    NA FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56]    NA FALSE FALSE FALSE FALSE FALSE FALSE
tst.na <- NA
tst.na==tst.na
## [1] NA

Why would NA==NA not evaluate to TRUE? You could argue that a missing value might be 3 or 7 or 923.427, so you would never know if it equaled something else, even another missing value.

This produces a three valued logic system.

tst.tvl <- c(TRUE,FALSE,NA)
not.table <- !tst.tvl
names(not.table) <- tst.tvl
not.table
##  TRUE FALSE  <NA> 
## FALSE  TRUE    NA
and.table <- outer(tst.tvl,tst.tvl,"&")
dimnames(and.table) <- list(tst.tvl,tst.tvl)
and.table
##        TRUE FALSE  <NA>
## TRUE   TRUE FALSE    NA
## FALSE FALSE FALSE FALSE
## <NA>     NA FALSE    NA
or.table <- outer(tst.tvl,tst.tvl,"|")
dimnames(or.table) <- list(tst.tvl,tst.tvl)
or.table
##       TRUE FALSE <NA>
## TRUE  TRUE  TRUE TRUE
## FALSE TRUE FALSE   NA
## <NA>  TRUE    NA   NA

Other programs, such as SAS, use a two valued logic system. What SAS does is to set missing values equal to negative infinity. This seems simpler because it allows you to keep a two-valued logic system. But it has its own problems.

Note how R handles the less than comparison operator with missing values.

sl$Gestation<50
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE    NA FALSE  TRUE  TRUE  TRUE FALSE    NA    NA FALSE FALSE
## [23]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [34] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [45] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56]    NA FALSE FALSE  TRUE FALSE  TRUE  TRUE

In SAS, this comparison would evaluate to TRUE for the four missing values.

The only way to select missing values is with the is.na function.

There are three more values in R that are closely related to missing.

1/0
## [1] Inf
-1/0
## [1] -Inf
sqrt(-1)
## Warning: NaNs produced
## [1] NaN

The difference between Inf and NaN is in how comparisons are made.

Inf>3
## [1] TRUE
NaN>3
## [1] NA
Inf==Inf
## [1] TRUE
NaN==NaN
## [1] NA

R allows you to make a few computations using Inf, but others are not allowed.

Inf+1
## [1] Inf
1/Inf
## [1] 0
0*Inf
## [1] NaN
Inf-Inf
## [1] NaN

It’s also worth noting that you can force R to use complex numbers for the square root of -1, although this is not the default.

sqrt(as.complex(-1))
## [1] 0+1i

The is.na and is.finite functions are not perfect opposites.

tst.cases <- c(3,NA,NaN,Inf,-Inf)
is.na.table <- is.na(tst.cases)
names(is.na.table) <- tst.cases
is.na.table
##     3  <NA>   NaN   Inf  -Inf 
## FALSE  TRUE  TRUE FALSE FALSE
is.finite.table <- is.finite(tst.cases)
names(is.finite.table) <- tst.cases
is.finite.table
##     3  <NA>   NaN   Inf  -Inf 
##  TRUE FALSE FALSE FALSE FALSE