Facing Missing Data

December 8, 2017

Lin Dong, PhD Candidate

This is a detour from my last post about education. It turns out that I have been working on a project about sequential decision making in the face of missing data for several months, so why not talk about that? Missing values arise in all sorts of data. For data with a sequential structure, such as data from sequential multiple assignment randomized trials (SMART), the problem is that patients often drop out. Q-learning and other techniques for estimating the optimal strategy cannot be applied directly to data sets containing missing values, so we need a way to get around them.

The first question we may ask is: why is missing data a problem? Further, can we just throw out the missing entries? Why do people care so much about it and develop sophisticated methods to deal with it? Things are not that simple.

Missing data is not a big issue if the data are missing completely at random (MCAR). Yes, that’s jargon. MCAR means that missingness is completely random and is independent of the data. Suppose we have a typical n by p data matrix, with n rows corresponding to subjects and p columns corresponding to variables. A quick and dirty way to handle the missing data is to throw away all the rows containing missing values. This is not a great idea if a large proportion of your rows contain missing values. Suppose you have an unbiased estimator for the full data. Under MCAR, the estimator computed on the remaining rows is still unbiased, but you may lose a lot of efficiency (you are less certain about your estimate).
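A small simulation can make the MCAR claim concrete. The sketch below is illustrative only: the normal outcome, the 50% missingness rate, and the function name are all assumptions made for the example, not part of any real analysis. It estimates a mean repeatedly, once on the full data and once after deleting rows completely at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def complete_case_means(n=200, p_missing=0.5, n_rep=2000):
    """Estimate a mean on full data and on MCAR-deleted data, many times over."""
    full_estimates, cc_estimates = [], []
    for _ in range(n_rep):
        y = rng.normal(loc=10.0, scale=2.0, size=n)   # true mean is 10
        # MCAR: whether a row is observed does not depend on y at all.
        observed = rng.random(n) > p_missing
        full_estimates.append(y.mean())
        cc_estimates.append(y[observed].mean())
    return np.array(full_estimates), np.array(cc_estimates)

full, cc = complete_case_means()
print("average full-data estimate:     ", full.mean())
print("average complete-case estimate: ", cc.mean())
print("spread of full-data estimates:     ", full.std())
print("spread of complete-case estimates: ", cc.std())
```

Both averages should sit close to the true mean, while the complete-case estimates should be visibly more spread out, which is the efficiency loss.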

Another type of missingness is called missing at random (MAR), which means missingness is not completely random but depends only on the observed data. If you throw away the rows with missing values under this scenario, you will generally obtain a biased estimator. For example, cautious and wealthy people tend to avoid giving responses to questions about their income. For this reason, an income estimate would be lower than the truth because your sample over-represents less wealthy respondents. (Strictly speaking, this example is MAR only if the reluctance to respond is explained by characteristics you did observe, rather than by the unreported income itself.) Nonetheless, MAR is actually a very handy assumption because the missingness mechanism is tractable and thus can be modeled. The methods I will introduce later are all based on MAR.

If one refuses to assume MCAR or MAR for their data, we have a third and final missingness assumption called missing not at random (MNAR), which says that missingness depends on things you did not observe, including the missing values themselves. A very important paper that introduced these assumptions is Rubin 1976 [1].

So, we cannot simply throw away missing entries. What then are the alternatives? One can use the general class of imputation methods. Imputation methods are intuitive and work by filling in the missing entries based on the researcher’s best knowledge. The simplest imputation is to fill in with the mean or median of the covariate. If we are willing to assume MAR, a more advanced approach is to build a model for the variable with missing values given the observed variables, and use the model-based prediction as the fill-in value. Instead of filling in a single predicted value, one can estimate the conditional distribution of each variable given all other observed variables and then draw samples from the estimated conditional distribution to fill in the missing values. To account for the uncertainty in drawing samples, we can repeat the sampling procedure several times so that we have multiple imputed data sets. The inference is then performed on each of the imputed data sets, and the multiple results are combined into one final estimate, for example, by averaging them. This is called multiple imputation and is a very popular approach for dealing with missing data.
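The multiple imputation recipe above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the simulated variables x and y, the linear imputation model, the bootstrap step used to reflect uncertainty in the fitted model, and the helper name impute_once. The variance formula at the end is the standard combining rule usually attributed to Rubin; it goes one step beyond simple averaging.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: x is always observed; y is missing more often when x is large (MAR).
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)            # true mean of y is 1.0
p_obs = 1.0 / (1.0 + np.exp(-(0.5 - 1.5 * x)))    # probability that y is observed
observed = rng.random(n) < p_obs
y_obs = np.where(observed, y, np.nan)             # what the analyst actually sees

def impute_once(x, y_obs, observed, rng):
    """Fill in missing y with one draw from an estimated conditional distribution."""
    idx = np.flatnonzero(observed)
    # Bootstrap the complete cases so each imputation uses a slightly different fit.
    boot = rng.choice(idx, size=idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (intercept + slope * x[boot]), ddof=2)
    miss = ~observed
    y_filled = y_obs.copy()
    y_filled[miss] = (intercept + slope * x[miss]
                      + rng.normal(scale=resid_sd, size=miss.sum()))
    return y_filled

M = 20  # number of imputed data sets
estimates, variances = [], []
for _ in range(M):
    y_imp = impute_once(x, y_obs, observed, rng)
    estimates.append(y_imp.mean())                # analysis on one imputed data set
    variances.append(y_imp.var(ddof=1) / n)       # its estimated variance

pooled = np.mean(estimates)                       # combine by averaging
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / M) * between        # combined variance

cc_mean = y_obs[observed].mean()
print("complete-case mean:", cc_mean)
print("pooled MI mean:", pooled, "std. error:", np.sqrt(total_var))
```

In this setup the complete-case mean should fall well below the true value of 1.0, while the pooled estimate should land near it, provided the imputation model is reasonable.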

Another method, which is less well known, models the missingness mechanism directly. It is called the inverse probability weighted estimator, where the probability in question is the probability that a row is fully observed (one minus the probability of missing). When missingness is not MCAR, bias is introduced because the remaining complete cases are no longer a representative sample of the population. One way to fix that is to give each complete row a weight equal to 1 over the probability that it is observed, so that rows that were unlikely to be observed count for more. Then we get a re-weighted sample that mimics the full and representative sample. The estimation of interest can be performed on the re-weighted sample, which only uses the complete rows. The key to this method is to estimate the probability of missing, that is, the missingness mechanism. Luckily, one can model this probability under the MAR assumption.
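Here is a minimal sketch of inverse probability weighting for a mean, in which the weight for each complete row is one over its estimated probability of being observed. The simulated data, the logistic model for the observation probability, and the helper fit_logistic (a plain Newton-Raphson fit written out so that the example needs only NumPy) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: x is always observed; y is missing more often when x is large (MAR).
n = 5000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)              # true mean of y is 1.0
true_p_obs = 1.0 / (1.0 + np.exp(-(0.5 - 1.0 * x)))
observed = rng.random(n) < true_p_obs
y_obs = np.where(observed, y, np.nan)

def fit_logistic(X, r, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Model the missingness mechanism: P(row is observed | x).
X = np.column_stack([np.ones(n), x])
beta_hat = fit_logistic(X, observed.astype(float))
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))

# Weight each complete row by 1 / P(observed), then take a weighted average.
w = 1.0 / p_hat[observed]
ipw_mean = np.sum(w * y_obs[observed]) / np.sum(w)
cc_mean = y_obs[observed].mean()

print("complete-case mean:", cc_mean)
print("IPW mean:          ", ipw_mean)
```

The weights are normalized by their sum, which tends to be more stable than dividing by n when a few rows have very small estimated probabilities of being observed.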

Not until recently did I realize that I also encountered and studied the missing data issue in my undergraduate years. We were dealing with sensitive questionnaires, where people were asked very sensitive questions that they might be reluctant to answer, so we believed that they were likely to be deceitful. The mechanism we used to address this was the following: I wanted to ask about a binary and sensitive status, and I coded it as {No = 0, Yes = 1}. Instead of asking directly, I listed a non-sensitive and independent question, e.g. how many times did you catch a flight in the last 3 months (an integer). Then, I asked the respondent to report only the sum of the number of flights and the answer to the sensitive question. For example, if a respondent traveled by air 3 times in the last three months and their sensitive status is “Yes”, he/she should write down 3+1 = 4. As researchers, we only observe 4, which could equally mean that the respondent flew 4 times and does not have the sensitive status. In this way, it is believed that the compliance of respondents will increase. Using this method, we translated the sensitive status into missing data, as it was not directly observed. Typically, researchers are only interested in the population-level prevalence of the sensitive status. So we applied maximum likelihood to estimate the expected value of the missing variable (in this case, a proportion). More details of this idea can be found in [2].
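To give a flavor of how a proportion can be recovered from the reported sums alone, here is a rough sketch. It assumes that the number of flights follows a Poisson distribution and is independent of the sensitive status, and it maximizes the likelihood with a simple EM iteration; the sample size, the parameter values, and the function name are hypothetical, and the estimation procedure in [2] may well differ from this one.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(3)

# Hypothetical survey: each respondent reports only y = flights + status.
n = 200_000
true_pi, true_lam = 0.3, 2.0
flights = rng.poisson(true_lam, size=n)
status = rng.random(n) < true_pi
y = flights + status                          # the only thing the researcher sees

counts = np.bincount(y)                       # counts[k] = respondents who reported k
values = np.arange(counts.size)
log_fact = np.array([lgamma(k + 1.0) for k in values])

def poisson_pmf(lam):
    """Poisson probabilities for 0, 1, ..., max reported value."""
    return np.exp(values * np.log(lam) - lam - log_fact)

# EM for the mixture P(y = k) = (1 - pi) * Pois(k; lam) + pi * Pois(k - 1; lam).
pi_hat, lam_hat = 0.5, max(y.mean() - 0.5, 0.1)
for _ in range(5000):
    pmf = poisson_pmf(lam_hat)
    pmf_minus_one = np.concatenate(([0.0], pmf[:-1]))   # P(flights = k - 1)
    # E-step: probability that status is "Yes" given a reported sum of k.
    num = pi_hat * pmf_minus_one
    w = num / (num + (1.0 - pi_hat) * pmf)
    # M-step: update the proportion and the Poisson mean.
    pi_hat = np.sum(counts * w) / n
    lam_hat = np.sum(counts * (values - w)) / n

print("estimated proportion with sensitive status:", pi_hat)
print("estimated mean number of flights:          ", lam_hat)
```

No individual answer is ever recovered; only the population-level proportion is estimated. With a single reported sum per respondent the proportion is only weakly identified, which is why the sketch uses a very large sample.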

A perfect world would have no missing values. The real world, however, is so flawed that missing data arises wherever data are generated. Working on this issue gives me the illusion that I am helping to fix the world! A great reference on the general missing data problem is Prof. Marie Davidian’s course [3].

[1] Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.
[2] Tian, G. L., Tang, M. L., Wu, Q., and Liu, Y. (2017). Poisson and negative binomial item count techniques for surveys with sensitive question. Statistical Methods in Medical Research 26(2), 931–947.
[3] http://www4.stat.ncsu.edu/~davidian/st790/

Lin is a PhD Candidate whose research interests include dynamic treatment regimes, reinforcement learning, and survival analysis. Her current research focuses on shared decision making in resource allocation problems. We asked a fellow Laber Labs colleague to ask Lin a probing question.

  • Explain your favorite statistical method, but from the perspective of a crooked politician running a smear campaign against it.

    Linear regression. This is definitely my favorite model. It is so simple, pure yet powerful. You can generalize it, penalize it and even interpret it.
    The human brain should be linear, not some complicated, intricate, twisted, impenetrable, nonlinear, *deep* network. Believe me, the whole world should be linear.

This is Lin’s second post! To learn more about her research, check out her first article here!