PMean: Some simple examples of single imputation

I wanted to use R to show some simple approaches to imputing missing values. These approaches are difficult to support because they require that you make some questionable and unverifiable assumptions about your data. They may still prove useful as a sensitivity check or as a springboard into more complex approaches for imputing missing values. I have a link to the code that generated most of these results.

Impute.Rmd

This program was written in R Markdown by Steve Simon. It requires

  • R (no particular version) and the mice package, and
  • an Internet connection (to access the Titanic data set).

It shows some simple examples of single imputation on an artificial data set and on the Titanic data set.

# start without any extraneous variables
save.image("backup.RData")
rm(list=ls())

First, you need to generate some simple binary data values.

set.seed(14814)
zeros_and_ones <- rbinom(100,1,0.5)
print(zeros_and_ones)
##   [1] 1 1 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1
##  [36] 1 0 0 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0
##  [71] 1 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 1

Arrange the data in a matrix with 20 rows and then convert it to a data frame. Give names to each column.

da <- as.data.frame(matrix(zeros_and_ones, nrow=20))
names(da) <- paste("t", 1:5, sep="")
print(da)
##    t1 t2 t3 t4 t5
## 1   1  1  1  1  1
## 2   1  1  0  1  1
## 3   1  0  0  1  0
## 4   0  1  0  1  1
## 5   1  0  1  1  0
## 6   0  1  0  1  1
## 7   1  1  0  1  1
## 8   1  1  1  0  0
## 9   1  0  0  1  1
## 10  1  1  1  0  0
## 11  0  1  0  1  0
## 12  1  0  0  1  1
## 13  0  1  0  0  1
## 14  1  0  1  0  1
## 15  0  1  1  0  0
## 16  1  1  0  1  1
## 17  1  0  0  1  1
## 18  0  0  0  1  0
## 19  0  0  0  1  1
## 20  0  1  1  0  1

Find ten random rows and convert the fifth value to NA.

delete_fifth_value <- sample(1:20,10)
print(sort(delete_fifth_value))
##  [1]  1  2  3  4  5  8  9 11 12 16
da[delete_fifth_value,5] <- NA

Find five random rows among these rows and convert the fourth value to NA.

delete_fourth_value <- sample(delete_fifth_value,5)
print(sort(delete_fourth_value))
## [1]  2  3  9 12 16
da[delete_fourth_value,4] <- NA

Find two random rows among these rows and convert the third value to NA.

delete_third_value <- sample(delete_fourth_value,2)
print(sort(delete_third_value))
## [1] 2 3
da[delete_third_value,3] <- NA
print(da)
##    t1 t2 t3 t4 t5
## 1   1  1  1  1 NA
## 2   1  1 NA NA NA
## 3   1  0 NA NA NA
## 4   0  1  0  1 NA
## 5   1  0  1  1 NA
## 6   0  1  0  1  1
## 7   1  1  0  1  1
## 8   1  1  1  0 NA
## 9   1  0  0 NA NA
## 10  1  1  1  0  0
## 11  0  1  0  1 NA
## 12  1  0  0 NA NA
## 13  0  1  0  0  1
## 14  1  0  1  0  1
## 15  0  1  1  0  0
## 16  1  1  0 NA NA
## 17  1  0  0  1  1
## 18  0  0  0  1  0
## 19  0  0  0  1  1
## 20  0  1  1  0  1

Now, we’re ready to start.

For a data set this small, you can see the missing data pattern just by looking. This is an example of monotone missing data: if a value is missing in one column, it is missing in every subsequent column of that row.
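
To double-check, here is a quick sketch (not part of the original program) that tests the monotone property for every row: once a missing value appears, everything after it in that row should also be missing.

# Sketch: TRUE if every row is monotone, meaning that once an NA appears,
# everything after it in that row is also NA.
monotone_rows <- apply(is.na(da), 1, function(m) all(m == cummax(m)))
all(monotone_rows)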

library("mice")
## Loading required package: Rcpp
## Loading required package: lattice
## mice 2.22 2014-06-10
mp <- md.pattern(da)
print(mp)
##    t1 t2 t3 t4 t5   
## 10  1  1  1  1  1  0
##  5  1  1  1  1  0  1
##  3  1  1  1  0  0  2
##  2  1  1  0  0  0  3
##     0  0  2  5 10 17

What looks like an unlabelled first column is actually the set of row names for the matrix of missing value patterns. These row names give the number of times that a particular missing value pattern occurs. The first, and most common, missing value pattern appears at the top. It occurs 10 times. Each pattern is described by a sequence of 0s and 1s indicating which values are missing (0) and which are not (1).

The missing value pattern for the most common pattern is 1, 1, 1, 1, 1. This sequence of all 1s means that for ten of the rows of the data frame, nothing is missing.

The second most common missing value pattern is 1, 1, 1, 1, 0 which occurs 5 times. This pattern with all 1’s except for the last value means that there are five rows where only t5 is missing.

The next missing value pattern, 1, 1, 1, 0, 0, represents rows where t4, t5 are missing. This pattern occurs 3 times.

The final missing value pattern, 1, 1, 0, 0, 0, represents the 2 times that a row has t3, t4, t5 missing.

The last row of the missing pattern matrix tells you how many missing values total there are for each variable. There are 0 missing values for t1, 0 for t2, 2 for t3, and so forth.

The final column tells you how many variables are missing for each missing value pattern. The first missing value pattern, for example, has 0 variables with missing values, and the last missing value pattern has 3 variables with missing values. You could have looked at the sequence of 0s and 1s to figure this out, but this column is a nice convenience when you have lots of variables, because it is easy to miscount a long string of 0s and 1s.
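
If you have lots of variables, you do not have to read these totals off the md.pattern display. Here is a quick sketch (not part of the original program) that computes the same counts directly.

# Per-variable counts of missing values (the bottom row of the pattern matrix)
colSums(is.na(da))
# How many rows have 0, 1, 2, or 3 missing values (the row counts of the pattern matrix)
table(rowSums(is.na(da)))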

Simple imputation

Assume that the variable is 1 if an event occurred and 0 otherwise. Let’s assume that the event is something bad like a side effect for a drug. Let’s also assume that the five columns represent five time points when you checked each patient to see if they had the side effect.

What is the probability of observing an adverse event at each time point? Any time you are concerned about missing values, add the useNA="always" option to the table command.

for (v in names(da)) {
  cat("\n\nAdverse events for", v)
  print(table(da[,v], useNA="always"))
}
## 
## 
## Adverse events for t1
##    0    1 <NA> 
##    8   12    0 
## 
## 
## Adverse events for t2
##    0    1 <NA> 
##    8   12    0 
## 
## 
## Adverse events for t3
##    0    1 <NA> 
##   11    7    2 
## 
## 
## Adverse events for t4
##    0    1 <NA> 
##    6    9    5 
## 
## 
## Adverse events for t5
##    0    1 <NA> 
##    3    7   10

So what is your estimate of the probability of a side effect at each time point? For t1 and t2, the answer is obviously 12/20 or 60%. But what about the others? Should you take the 7 side effects measured at t3 and divide by the 20 patients to get a probability of 35%? Or should you divide by 18, the number of non-missing values, to get 39%?
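
Both calculations are one-liners in R. Here is a quick sketch for t3.

# The 7 observed side effects at t3 divided by all 20 patients
sum(da$t3 == 1, na.rm=TRUE) / nrow(da)
# The same 7 side effects divided by the 18 patients with a non-missing t3
mean(da$t3, na.rm=TRUE)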

There are several simple approaches that you can try, but they all make assumptions that might be difficult to support.

No news is good news.

One assumption that you can sometimes make is that no news is good news. When there is no mention of whether something bad occurred, it may be that the absence of an event is simply easy to forget to document. It’s a bit like Sherlock Holmes’s observation of a fact that everyone else overlooked: the dog that didn’t bark in the nighttime. In this example, “good news” corresponds to a zero value.

Here’s how you would do this in R. The is.na function identifies which entries in a matrix are missing, and you just replace them with zeros.

im1 <- da
im1[is.na(im1)] <- 0
for (v in names(im1)) {
  cat("\n\nAdverse events for", v)
  cat(" imputing 0 for missing values.")
  print(prop.table(table(im1[,v])))
}
## 
## 
## Adverse events for t1 imputing 0 for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t2 imputing 0 for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t3 imputing 0 for missing values.
##    0    1 
## 0.65 0.35 
## 
## 
## Adverse events for t4 imputing 0 for missing values.
##    0    1 
## 0.55 0.45 
## 
## 
## Adverse events for t5 imputing 0 for missing values.
##    0    1 
## 0.65 0.35

No news is bad news.

The opposite approach is to assume the worst. It may not make sense in this setting, but there are times when assuming the worst is not too unreasonable. Suppose you were conducting a smoking cessation study and your participants came back weekly for their pack of nicotine gum, which gives you an opportunity to ask whether they are still smoke free. You might even get a test of nicotine levels to verify what they tell you.

The patients who fail to show up for their appointment might be skipping out because they have quit cold turkey and they don’t even need your nicotine gum anymore. But it is far more probable that they are skipping out because they’ve given up on the study and are smoking like a chimney again. A diet study might also be a setting where someone who drops out does so because the diet isn’t working.

You do this with almost the exact same code.

im2 <- da
im2[is.na(im2)] <- 1
for (v in names(im2)) {
  cat("\n\nAdverse events for", v)
  cat(" imputing 1 for missing values.")
  print(prop.table(table(im2[,v])))
}
## 
## 
## Adverse events for t1 imputing 1 for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t2 imputing 1 for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t3 imputing 1 for missing values.
##    0    1 
## 0.55 0.45 
## 
## 
## Adverse events for t4 imputing 1 for missing values.
##   0   1 
## 0.3 0.7 
## 
## 
## Adverse events for t5 imputing 1 for missing values.
##    0    1 
## 0.15 0.85

No news is average news

Sometimes, you can substitute the mean value for the missing value. This is like saying that the missing values are, on average, no larger and no smaller than the values that are not missing. When you remove the patients with missing values, you are often implicitly imputing the missing values to be equal to the mean of the non-missing values.

Here’s how you would do it in R.

im3 <- da
for (v in names(im3)) {
  mn <- mean(da[ , v], na.rm=TRUE)
  im3[is.na(im3[, v]), v] <- mn
  cat("\n\nReplacing missing values in", v, "with",round(mn,2))
}
## 
## 
## Replacing missing values in t1 with 0.6
## 
## Replacing missing values in t2 with 0.6
## 
## Replacing missing values in t3 with 0.39
## 
## Replacing missing values in t4 with 0.6
## 
## Replacing missing values in t5 with 0.7
# table and prop.table don't work cleanly here because the imputed mean
# destroys the binary nature of the variable and shows up as a third
# category. You get the same estimated probabilities by simply omitting
# the missing values, as the second set of tables below shows.
for (v in names(im3)) {
  cat("\n\nAdverse events for", v)
  cat(" imputing the mean for missing values.")
  print(prop.table(table(im3[,v])))
}
## 
## 
## Adverse events for t1 imputing the mean for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t2 imputing the mean for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t3 imputing the mean for missing values.
##                 0 0.388888888888889                 1 
##              0.55              0.10              0.35 
## 
## 
## Adverse events for t4 imputing the mean for missing values.
##    0  0.6    1 
## 0.30 0.25 0.45 
## 
## 
## Adverse events for t5 imputing the mean for missing values.
##    0  0.7    1 
## 0.15 0.50 0.35
for (v in names(im3)) {
  cat("\n\nAdverse events for", v)
  cat(" imputing the mean for missing values.")
  print(prop.table(table(da[,v], useNA="no")))
}
## 
## 
## Adverse events for t1 imputing the mean for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t2 imputing the mean for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t3 imputing the mean for missing values.
##         0         1 
## 0.6111111 0.3888889 
## 
## 
## Adverse events for t4 imputing the mean for missing values.
##   0   1 
## 0.4 0.6 
## 
## 
## Adverse events for t5 imputing the mean for missing values.
##   0   1 
## 0.3 0.7

No news is old news

When you have a sequence of observations and a value is missing, you can substitute the previous value. This is called “Last Observation Carried Forward” or “LOCF”. This approach is used a lot in actual research studies, but it is very controversial.

im4 <- da
for (j in 3:5) {
  cat("\n\nAdverse events for", names(im4)[j])
  cat(" imputing using LOCF.")
  missing_locations <- is.na(im4[,j])
  im4[missing_locations, j] <- im4[missing_locations, j-1]
  print(prop.table(table(im4[,j])))
}
## 
## 
## Adverse events for t3 imputing using LOCF.
##   0   1 
## 0.6 0.4 
## 
## 
## Adverse events for t4 imputing using LOCF.
##   0   1 
## 0.5 0.5 
## 
## 
## Adverse events for t5 imputing using LOCF.
##   0   1 
## 0.4 0.6
print(im4)
##    t1 t2 t3 t4 t5
## 1   1  1  1  1  1
## 2   1  1  1  1  1
## 3   1  0  0  0  0
## 4   0  1  0  1  1
## 5   1  0  1  1  1
## 6   0  1  0  1  1
## 7   1  1  0  1  1
## 8   1  1  1  0  0
## 9   1  0  0  0  0
## 10  1  1  1  0  0
## 11  0  1  0  1  1
## 12  1  0  0  0  0
## 13  0  1  0  0  1
## 14  1  0  1  0  1
## 15  0  1  1  0  0
## 16  1  1  0  0  0
## 17  1  0  0  1  1
## 18  0  0  0  1  0
## 19  0  0  0  1  1
## 20  0  1  1  0  1
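
You do not have to write the loop yourself. The na.locf function in the zoo package carries the last observation forward. Here is a sketch, assuming the zoo package is installed (it is not used anywhere else in this program); it should reproduce im4 for this data set.

# Sketch: row-wise LOCF using zoo::na.locf (assumes the zoo package is installed).
# na.rm=FALSE keeps any leading NA values as NA instead of dropping them.
library("zoo")
im5 <- as.data.frame(t(apply(da, 1, na.locf, na.rm=FALSE)))
names(im5) <- names(da)
print(im5)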

All of these approaches fall under the category of single imputation, because you impute a single value. While they might be acceptable in a limited number of settings, for most data analyses single imputation relies on assumptions that are untestable and often at odds with your intuition. Let’s look at an example of why single imputation has limited utility.

Titanic data set

There is a famous data set on mortality trends and patterns aboard the Titanic. The Titanic sank during an era when people really did believe in the concept of women and children first. Let’s read in the data set and look at how imputation might be done.

fn <- "http://www.statsci.org/data/general/titanic.txt"
ti <- read.delim(fn)
head(ti)
##                                            Name PClass   Age    Sex
## 1                  Allen, Miss Elisabeth Walton    1st 29.00 female
## 2                   Allison, Miss Helen Loraine    1st  2.00 female
## 3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female
## 5                 Allison, Master Hudson Trevor    1st  0.92   male
## 6                            Anderson, Mr Harry    1st 47.00   male
##   Survived
## 1        1
## 2        0
## 3        0
## 4        0
## 5        1
## 6        1
dim(ti)
## [1] 1313    5
summary(ti)
##                            Name      PClass         Age       
##  Carlsson, Mr Frans Olof     :   2   1st:322   Min.   : 0.17  
##  Connolly, Miss Kate         :   2   2nd:280   1st Qu.:21.00  
##  Kelly, Mr James             :   2   3rd:711   Median :28.00  
##  Abbing, Mr Anthony          :   1             Mean   :30.40  
##  Abbott, Master Eugene Joseph:   1             3rd Qu.:39.00  
##  Abbott, Mr Rossmore Edward  :   1             Max.   :71.00  
##  (Other)                     :1304             NA's   :557    
##      Sex         Survived     
##  female:462   Min.   :0.0000  
##  male  :851   1st Qu.:0.0000  
##               Median :0.0000  
##               Mean   :0.3427  
##               3rd Qu.:1.0000  
##               Max.   :1.0000  
## 

Notice that Age has 557 missing values. Let’s see what happens if you use mean imputation.

mn <- mean(ti$Age, na.rm=TRUE)
imputed_age <- ti$Age
imputed_age[is.na(ti$Age)] <- mn
par(mfrow=c(2, 1))
hist(ti$Age, main=paste("Unimputed age has a standard deviation of",round(sd(ti$Age, na.rm=TRUE),1)))
hist(imputed_age, main=paste("Imputed age has a standard deviation of",round(sd(imputed_age, na.rm=TRUE),1)))

This is an extreme example, but it illustrates an important weakness of single imputation: it distorts your data by understating the true variation. A few single imputation approaches avoid the worst of this distortion, but all of the approaches mentioned above will grossly understate the true variation in your data unless the proportion of missing values is trivially small.
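
The size of the shrinkage is predictable. Every imputed value sits exactly at the mean, so it contributes nothing to the sum of squared deviations; with a fraction p of the values imputed, the variance of the imputed variable is roughly (1 - p) times the variance of the observed values. Here is a quick sketch of that arithmetic for the Titanic ages.

# With a fraction p of the ages imputed at the mean, the standard deviation
# after imputation is roughly sqrt(1 - p) times the observed standard deviation.
p <- mean(is.na(ti$Age))
round(p, 2)
round(sqrt(1 - p) * sd(ti$Age, na.rm=TRUE), 1)  # predicted sd after mean imputation
round(sd(imputed_age), 1)                       # actual sd after mean imputation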

The correct approach is to impute your missing data multiple times, make sure that the imputations reflect the correct amount of variation, and then pool the results of the analyses conducted across these multiple imputations. This is a topic for a future blog entry.
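
As a preview of that future entry, here is a minimal, illustrative sketch of what multiple imputation might look like with the mice package on the Titanic data. The choices here (five imputations, the default imputation method, a logistic regression for survival) are placeholders for illustration, not a recommendation.

# Illustrative sketch only: create five imputed data sets, fit a logistic
# regression to each completed data set, and pool the results.
imp <- mice(ti[, c("PClass", "Age", "Sex", "Survived")], m=5, seed=14814, printFlag=FALSE)
fit <- with(imp, glm(Survived ~ Age + Sex + PClass, family=binomial))
summary(pool(fit))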

Before finishing, you should save everything in case you need it later.

# save results for later use.
save.image("impute.RData")