The ridge() function was included in the survival package as a test of the code for penalized models. As the author of the package I never actually use the function myself, simply because of the types of data that I analyse, and as a consequence it has never received any polish. But it does work.
This vignette was created to address a common issue with using the function, one that arises with some frequency on internet message boards. Hopefully this note will provide context and a reference for a solution. I will create a dummy dataset with 20 random 0/1 variables as a running example.
set.seed(1954) # force reproducibility
library(survival)
n <- nrow(lung)
snp <- matrix(rbinom(20*n, 1, prob=.1), nrow=n)
snpdata <- cbind(lung, data.frame(snp))
dim(snpdata)
#> [1] 228 30
The ridge function makes the most sense when there are a large number of variables, for instance if one had 100 SNPs. Say that you write the call in the obvious way:
cfit1 <- coxph(Surv(time, status) ~ age + sex + ridge(X1, X2, X3, X4, X5, X6,
X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18,
X19, X20, theta=.1), data=snpdata)
The problem with this, of course, is that typing this statement is cumbersome for 20 SNPs and practically impossible if there were 500. One can use . in R formulas to stand for “all the other variables”, but this does not work inside a function such as ridge(). This leads to the first work-around, which is to create the formula programmatically.
xlist <- paste0("X", 1:20)
myform <- paste("Surv(time, status) ~ age + sex + ridge(",
paste(xlist, collapse= ", "), ", theta=.1)")
cfit2 <- coxph(formula(myform), data=snpdata)
all.equal(cfit1$loglik, cfit2$loglik)
#> [1] TRUE
This approach works well up to a point and then fails, once the model statement grows too long for the internal buffer of the R parser. The solution is to call the ridge function with a single matrix argument.
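With the SNP values already available as the matrix snp created above, the matrix goes directly into the formula. A minimal sketch of the fit referred to as cfit3 below (the formula matches the one echoed in the later error message; the data argument is an assumption):

cfit3 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
               data=snpdata)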
This fit has completely ignored the variables X1, X2, …, X20 found in the dataframe; the ridge term uses the columns of the free-standing snp matrix instead. If the SNP data had come to us only as a dataframe, then we would create a temporary matrix by applying the as.matrix command to the appropriate columns of that data.
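For instance, a sketch of that step, rebuilding the matrix from the X1 to X20 columns (snp2 is just an illustrative name):

# hypothetical: pull the SNP columns out of the dataframe as one matrix
snp2 <- as.matrix(snpdata[, paste0("X", 1:20)])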
The problem with this approach comes if we want to do predictions using a new set of data.
newsnp <- matrix(rbinom(20*4, 1, p=.12), nrow=4)
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
newsnp)
predict(cfit3, newdata=prdata)
#> Error in model.frame.default(data = prdata, formula = ~age + sex + ridge(snp, : variable lengths differ (found for 'ridge(snp, theta = 0.1)')
We get a failure message that “variable lengths differ”. The reason is that R fitting functions look for their variables first in the data= argument, and then, for any that were not found, in the environment where the model formula was created (here the working global environment). The formula for cfit3 contains three variables: age, sex, and snp; while the new dataframe prdata has the variables age, sex, X1, X2, …, X20. The predict function pulls 4 rows from prdata (for age and sex), and 228 rows from the global variable snp. This clearly does not work.
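A quick check of the two sources makes the mismatch concrete (assuming the objects created above are still in the workspace):

nrow(prdata)   # 4 rows of new data
nrow(snp)      # 228 rows in the free-standing matrix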
What if we put those variables directly into the working environment, instead of a dataframe?
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
newsnp)
snpsave <- snp
snp <- newsnp
age <- prdata$age
sex <- prdata$sex
new <- predict(cfit3)
length(new)
#> [1] 228
snp <- snpsave # restore the status quo
rm(age, sex)
This time we have been tripped up by the predict.coxph function. If there is no newdata argument, the function assumes the user is asking for predictions using the original data, and that is what is produced. There is also the need to restore the datasets that we overwrote.
The most direct way around this is to do the prediction “by hand”. Since this dataset has no factor variables, splines, or other terms that lead to special coding, it is fairly easy to do the computation.
prmat <- cbind(age= c(50, 65, 48, 70), sex=c(1,1,2,1), newsnp)
drop(prmat %*% coef(cfit3)) # simplify results to vector
#> [1] 0.1428461 0.9497542 0.1203784 1.0034065
# alternate: centered predictions, to match predict.coxph
drop(prmat %*% coef(cfit3)) - sum(coef(cfit3) * cfit3$means)
#> [1] -0.2148958 0.5920124 -0.2373634 0.6456646
The predict.coxph function, by default, gives predictions for centered covariates. This has its roots in the internals of the coxph code, where centering is used to avoid overflow of the exp function. It is entirely optional when doing the prediction ourselves, unless one wants to match predict.coxph. (If I could go back in time, centering the predictions would be undone; its effect has been mostly to cause confusion. Que sera sera.)
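The constant separating the two sets of numbers above is simply the linear predictor evaluated at the covariate means, which can be computed directly:

sum(coef(cfit3) * cfit3$means)   # the centering constant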
Suppose that one still wanted to use the standard predict function, for instance if the model did include a factor variable, or simply to fit into the usual R pattern? What is needed is a way to store a matrix directly into a dataframe. This is done regularly by the model.frame function; the code below causes data.frame to use that behavior for our matrices of SNP values.
# new data type: a matrix that keeps its identity inside a data frame
ridgemat <- function(x) {
    class(x) <- c("ridgemat", class(x))
    x
}
# tell data.frame to treat a ridgemat like a model matrix, i.e., store it
# as a single (matrix) column rather than splitting it apart
as.data.frame.ridgemat <- function(x, ...)
    as.data.frame.model.matrix(as.matrix(x), ...)
snpdata2 <- cbind(lung, snp= ridgemat(snp))
names(snpdata2)
#> [1] "inst" "time" "status" "age" "sex"
#> [6] "ph.ecog" "ph.karno" "pat.karno" "meal.cal" "wt.loss"
#> [11] "snp"
cfit4 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
data= snpdata2)
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
snp= ridgemat(newsnp))
predict(cfit4, newdata=prdata)
#> 1 2 3 4
#> -0.2148958 0.5920124 -0.2373634 0.6456646
Success!
The downside to this is that the modified dataframe object is known to work with modeling functions, but is not guaranteed to be compatible with standard manipulations of dataframes, such as merge, subset, etc., or with tidyverse constructs such as the tibble.
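A quick look at the structure shows why other tools can stumble: the snp column is itself a matrix nested inside the data frame, rather than a plain vector.

class(snpdata2$snp)   # "ridgemat" plus the underlying matrix class(es)
dim(snpdata2$snp)     # one data frame column that is really a 228 x 20 matrix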