Variable Selection Using LASSO

July 25, 2017

Wenhao Hu, PhD Candidate

How can we identify a gene related to cancer? What factors are correlated with graduation rates across NCAA universities? To answer such questions, statisticians often use variable selection, a technique for identifying the factors that are significantly related to a response, e.g., graduation rates. One of the most widely used variable selection methods is LASSO (least absolute shrinkage and selection operator). LASSO is a standard tool among quantitative researchers working across nearly all areas of science.

LASSO can handle data with a large number of factors, e.g., thousands of genes. In the era of big data, this is extremely useful. For example, suppose that there are 50 patients with cancer and another 50 healthy people, and that scientists sequence each subject's genes at ~100k positions. To identify the gene related to cancer, one needs to check all ~100k positions. Traditional regression methods fail in this case because they usually require the number of subjects to be larger than the number of genes. LASSO avoids this problem by introducing regularization, a technique that is also used by many other machine learning and deep learning algorithms. LASSO has been implemented in most statistical software environments. For example, R has a package called glmnet, and SAS has a procedure called PROC GLMSELECT.
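
For readers who want the formula behind "regularization": the post does not write it out, but for a linear model the LASSO estimate is commonly stated as below. The notation (n subjects, p factors, response y_i, factor values x_ij, tuning parameter λ) and the 1/(2n) scaling are conventions assumed here for illustration, not taken from the post.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad \lambda \ge 0 .
```

The absolute-value penalty can set some coefficients exactly to zero, which is why LASSO acts as a variable selection method: factors with a zero coefficient are dropped from the model. The estimate remains usable when p is much larger than n, as in the 100-subject, ~100k-position example above.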

To achieve good performance with LASSO, it is vital to choose an appropriate tuning parameter, which balances model complexity against model fit. Classical methods usually focus on selecting a single optimal tuning parameter that minimizes some criterion, e.g., AIC or BIC. However, researchers usually ignore the uncertainty in tuning parameter selection. Our research studies the distribution of the tuning parameter, and thus provides scientists with information about the variability of model selection. Furthermore, we are developing an interactive R package for LASSO. With the package, scientists can dynamically see the selected model and the corresponding false selection rates. This allows them to explore the dataset and to incorporate their own subject knowledge into model selection.
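
To make the role of the tuning parameter concrete, here is a minimal sketch in plain Python. It is not the research method or the R package described above, and it is not how glmnet or PROC GLMSELECT are implemented. It assumes the common objective (1/(2n))·||y − Xb||² + λ·||b||₁, fits it with a simple cyclic coordinate descent, and uses simulated data with more factors than subjects. All names (`soft_threshold`, `lasso_coordinate_descent`, `lam_max`, `path`) are hypothetical, and the sketch skips the intercept and column standardization for brevity.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Approximately minimise (1/(2n))*||y - X b||^2 + lam*||b||_1.

    Illustrative only: no intercept, no standardisation, fixed number of sweeps.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes factor j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Simulated data (an assumption for illustration): 20 subjects, 50 factors,
# and only the first two factors truly affect the response.
random.seed(0)
n, p = 20, 50
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0.0, 0.5) for row in X]

# At or above this value every coefficient is zero under the objective above.
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(p))

path = {}  # fraction of lam_max -> number of selected factors
for frac in (1.0, 0.5, 0.1, 0.01):
    beta = lasso_coordinate_descent(X, y, frac * lam_max)
    path[frac] = sum(1 for b in beta if b != 0.0)
    print(f"lambda = {frac:.2f} * lam_max: {path[frac]} factors selected")
```

Running the loop shows how the selected model changes as the tuning parameter moves: at `lam_max` nothing is selected, and smaller values admit factors. Choosing one value by AIC or BIC picks a single point on this path, whereas the research described above asks how variable that choice is.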

Illustration of the interactive R package under development for variable selection.

Wenhao is a PhD Candidate whose research interests include variable selection and statistical learning. We thought this posting was a great excuse to get to know a little more about him, so we asked him a few questions!

  • What do you find most interesting/compelling about your research?

    My research gives me a better understanding of the theory of linear models, which are among the most widely used statistical methods.

  • What do you see as the biggest or most pressing challenges in your research area?

    One of the biggest challenges is model interpretability and inference after model selection. In addition, users usually have little freedom to incorporate their domain knowledge into the process of model selection.

  • Finish this parable:

    A Tiger is walking through the jungle whereupon he sees a Python strangling a Lemur. The Tiger asks the Python, “Why must you kill in this way? It is slow and painful. We all must eat, but have you no compassion for your fellow animals?” To which the Python replied, “Why must you kill with teeth and fangs? The gore and violence of it is scarring to all who are unfortunate enough to see it.” The Tiger considered this for a moment and finally said, “Let us ask the Lemur. Lemur, which is your preferred way to go?”

    The Python relaxed his grip slightly so that the Lemur could speak: “I don’t know which way is better. But if I can choose, I prefer to be killed by the strongest animal. Is Python or Tiger stronger?” The Tiger answered confidently, “I am the strongest animal in the jungle. Python, you should leave the Lemur to me.” The Python felt very unhappy and started to debate with the Tiger and the Lemur. After several minutes, the Python and the Tiger started fighting with each other. The Lemur escaped…