SAE

SAE - Shrinkage Averaging Estimation: some useful functions and an R package

$\rightarrow$ Background

$\rightarrow$ Software

$\rightarrow$ Further considerations and literature

Background

It is now well-known that estimators obtained after model selection are typically both biased and overoptimistic, in the sense that the uncertainty associated with the model selection process is not reflected in confidence intervals. This is because inference is conditional on the selection process; a different dataset, or a different selection method, may yield different conclusions.

For many shrinkage estimators, one or more tuning parameters must be chosen. Again, inference is conditional on the chosen tuning parameter. With model averaging, different “good” models are combined; the idea of shrinkage averaging estimation is to combine estimates resulting from different tuning parameter choices. Based on ideas from the optimal model averaging literature, one can construct clever combinations such that a cross-validation error is minimized. My 2012 paper (see here) explains the basic idea, and our 2019 paper (see here) contains some more mature thoughts. These ideas are implemented both in my SAE package and in the lae() function of my MAMI package (see here). More details are given below.

LASSO averaging estimation

Consider the LASSO estimator for a simple linear model: \begin{eqnarray} \hat{\beta}_{\text{LE}}(\lambda) &=& \arg\min_{\beta} \left\{ \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}. \end{eqnarray} The complexity parameter $\lambda \geq 0$ tunes the amount of shrinkage and is typically estimated via the generalized cross-validation criterion (GCV) or another cross-validation criterion (CV). The larger the value of $\lambda$, the greater the amount of shrinkage, since the estimated coefficients are shrunk towards zero.
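To make the role of $\lambda$ concrete, here is a minimal numerical sketch in Python/NumPy (the packages discussed here are written in R; the coordinate-descent solver and simulated data below are illustrative assumptions, not the package code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate-descent LASSO without intercept, minimizing
    # sum_i (y_i - x_i' beta)^2 + lam * sum_j |beta_j|.
    n, p = X.shape
    beta = np.zeros(p)
    ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            xr = X[:, j] @ r
            beta[j] = np.sign(xr) * max(abs(xr) - lam / 2.0, 0.0) / ss[j]
    return beta

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.5]) + rng.standard_normal(n)

# Larger lambda -> more shrinkage: the L1 norm of the estimate decreases
for lam in (0.0, 10.0, 200.0):
    print(lam, np.round(np.abs(lasso_cd(X, y, lam)).sum(), 3))
```

With $\lambda=0$ the solver returns the least-squares fit, while a large $\lambda$ typically sets some coefficients exactly to zero, which is the variable-selection aspect of the LASSO.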

Consider a sequence of candidate tuning parameters $\boldsymbol{\lambda}=\{\lambda_1,\ldots,\lambda_L\}$. If each estimator $\hat{\beta}_{\text{LE}}(\lambda_i)$ receives a specific weight $w_{\lambda_i}$, then a LASSO averaging estimator takes the form \begin{eqnarray} \hat{\bar{\beta}}_{\text{LAE}} &=& \sum_{i=1}^L w_{\lambda_i} \hat{\beta}_{\text{LE}}(\lambda_i) = \mathbf{w}_\lambda \hat{\boldsymbol{B}}_{\text{LE}} , \end{eqnarray} where $\lambda_i \in [0,c]$, $c>0$ is a suitable constant, $\hat{\boldsymbol{B}}_{\text{LE}}=(\hat{\beta}_{\text{LE}}(\lambda_1),\ldots,\hat{\beta}_{\text{LE}}(\lambda_L))'$ is the $L \times p$ matrix of the LASSO estimators, $\mathbf{w}_\lambda=(w_{\lambda_1},\ldots,w_{\lambda_L})$ is a $1 \times L$ weight vector, $\mathbf{w}_\lambda \in \mathcal{W}$, and $\mathcal{W}=\{\mathbf{w}_\lambda \in [0,1]^L: \mathbf{1}'\mathbf{w}_\lambda=1\}$.

One could choose the weights \begin{eqnarray} \hat{\mathbf{w}}_\lambda^{\text{OCV}} &=& \arg\min_{\mathbf{w}_\lambda\in\mathcal{W}} OCV_k \end{eqnarray} with \begin{eqnarray} OCV_k &=& \frac{1}{n}\,\tilde{\boldsymbol\epsilon}_k(\mathbf{w}_\lambda)' \tilde{\boldsymbol\epsilon}_k(\mathbf{w}_\lambda) \nonumber \\ &\propto& \mathbf{w}_{\lambda} \mathbf{E}'_k \mathbf{E}_k \mathbf{w}_{\lambda}' , \end{eqnarray} referring to an optimal cross-validation (OCV) based criterion, where $\mathbf{E}_k = (\tilde{\boldsymbol\epsilon}_k(\lambda_1),\ldots,\tilde{\boldsymbol\epsilon}_k(\lambda_L))$ is the $n \times L$ matrix of the ($k$-fold) cross-validation residuals for the $L$ competing tuning parameters (given a specific loss function). The optimal weight vector for this criterion, \begin{eqnarray} \hat{\mathbf{w}}_\lambda^{\text{LAE}} &=& \arg\min_{\mathbf{w}_\lambda\in\mathcal{W}} OCV_k , \end{eqnarray} can be calculated with quadratic programming.
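The weight choice can be illustrated numerically. The following Python/NumPy sketch builds the matrix $\mathbf{E}_k$ of $k$-fold cross-validation residuals over a small grid of tuning parameters and minimizes $\mathbf{w}_\lambda \mathbf{E}'_k \mathbf{E}_k \mathbf{w}_\lambda'$ over the simplex; a coarse grid search stands in for the quadratic programming step, and the solver and simulated data are illustrative assumptions rather than the SAE/MAMI implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    # Compact coordinate-descent LASSO (no intercept), used as the base estimator
    n, p = X.shape
    beta = np.zeros(p)
    ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]
            xr = X[:, j] @ r
            beta[j] = np.sign(xr) * max(abs(xr) - lam / 2.0, 0.0) / ss[j]
    return beta

rng = np.random.default_rng(2)
n, p, k = 90, 4, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -1.0, 0.0]) + rng.standard_normal(n)

lambdas = [0.0, 5.0, 20.0, 80.0]       # candidate tuning parameters lambda_1..lambda_L
folds = np.arange(n) % k               # k-fold assignment

# E_k: n x L matrix of cross-validation residuals, one column per lambda
E = np.empty((n, len(lambdas)))
for l, lam in enumerate(lambdas):
    for f in range(k):
        tr, te = folds != f, folds == f
        E[te, l] = y[te] - X[te] @ lasso_cd(X[tr], y[tr], lam)

G = E.T @ E / n                        # OCV_k(w) = w' G w

# Minimize w'Gw over the simplex; a coarse grid stands in for quadratic programming
grid = np.linspace(0.0, 1.0, 21)
best_w, best_val = None, np.inf
for w1 in grid:
    for w2 in grid:
        for w3 in grid:
            w4 = 1.0 - w1 - w2 - w3
            if w4 < -1e-9:
                continue
            w = np.array([w1, w2, w3, max(w4, 0.0)])
            val = w @ G @ w
            if val < best_val:
                best_w, best_val = w, val

# LASSO averaging estimator: weighted combination of the full-data fits
B = np.vstack([lasso_cd(X, y, lam) for lam in lambdas])   # L x p matrix
beta_lae = best_w @ B
print(np.round(best_w, 2), np.round(beta_lae, 2))
```

Since the grid contains the vertices of the simplex, the averaged estimator's cross-validation criterion can never exceed that of the best single tuning parameter; real implementations solve the same problem exactly with a quadratic programming routine.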

Alternatively, the weights may be chosen not for predictive performance but for confidence interval coverage; resampling may be an option in this case.

Software

Two implementations are available:

  • the original files from 2012, summarized in the SAE package, available here.
  • a broader and more efficient implementation in the function lae() in the MAMI package.

Briefly, the new implementation is better and faster: it can be used for the linear, logistic and Poisson models – but not for longitudinal or survival data – and it covers Ridge, the Elastic Net and the LASSO. The original SAE package can only be used for the linear model, but implements both the bootstrap weight choice and the Random LASSO option (though not the Elastic Net). Use of the new files is encouraged, but the old files may be useful for reproducing results. There are many more options and possibilities; see Section 3.1.3 in the MAMI manual, or ?sae in the SAE package.

Further considerations

LASSO averaging is computationally very efficient. It is thus a fast prediction algorithm and suitable for inclusion in super learning. Super learning means considering a set of prediction algorithms, for example regression models, shrinkage estimators or regression trees. Instead of choosing the algorithm with the smallest cross-validation error, super learning chooses the weighted combination of algorithms which minimizes the cross-validation error. It can be shown that this weighted combination performs at least as well as the best single algorithm, if not better.
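The weighted-combination idea behind super learning can be sketched as follows (a minimal Python/NumPy illustration; the three candidate learners, the simulated data and the grid search are assumptions for illustration, not the actual super learning machinery):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
x = rng.uniform(-2.0, 2.0, n)
y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(n)

# Three candidate algorithms, each returning a prediction function
def fit_mean(xtr, ytr):
    m = ytr.mean()
    return lambda xx: np.full_like(xx, m)          # intercept-only model

def fit_linear(xtr, ytr):
    a, b = np.polyfit(xtr, ytr, 1)
    return lambda xx: a * xx + b                   # simple linear regression

def fit_cubic(xtr, ytr):
    c = np.polyfit(xtr, ytr, 3)
    return lambda xx: np.polyval(c, xx)            # cubic polynomial

learners = [fit_mean, fit_linear, fit_cubic]
k = 5
folds = np.arange(n) % k

# Out-of-fold predictions for each learner
Z = np.empty((n, len(learners)))
for f in range(k):
    tr, te = folds != f, folds == f
    for l, fit in enumerate(learners):
        Z[te, l] = fit(x[tr], y[tr])(x[te])

# Convex weights minimizing the cross-validated squared error
grid = np.linspace(0.0, 1.0, 41)
best_w, best_err = None, np.inf
for w1 in grid:
    for w2 in grid:
        if w1 + w2 > 1.0 + 1e-12:
            continue
        w = np.array([w1, w2, max(1.0 - w1 - w2, 0.0)])
        err = np.mean((y - Z @ w) ** 2)
        if err < best_err:
            best_w, best_err = w, err

single_best = ((y[:, None] - Z) ** 2).mean(axis=0).min()
print(np.round(best_w, 2), round(best_err, 3), round(single_best, 3))
```

Because the grid contains the pure weights $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$, the cross-validated error of the combination is never worse than that of the best single algorithm, mirroring the guarantee stated above.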

Briefly, MAMI contains several shrinkage averaging estimation wrappers that can be used for super learning. They are listed and explained when typing

listSLWrappers()

This is useful both for pure prediction problems and for causal inference with targeted maximum likelihood estimation. The 2019 reference below (“When and when not to use optimal model averaging”) contains more details.