beta <- Variable(2)
X0 <- matrix(rnorm(20), 10, 2)
is_dcp(Problem(Minimize(sum(log(1 + exp(X0 %*% beta)))))) # FALSE[1] FALSE
Fit a binary classifier by maximum likelihood in CVXR, recognize why the natural expression is not DCP and which atom fixes it, match stats::glm, and read off predicted probabilities — then add a penalty for regularized logistic regression.
So far our responses were continuous. Now \(y_i \in \{0, 1\}\) is a class label. Logistic regression models the probability of class 1 as \(g_\beta(x) = 1/(1 + e^{-x^\top\beta})\) and fits \(\beta\) by maximizing the log-likelihood
\[ \underset{\beta}{\mbox{maximize}}\quad \sum_{i=1}^m \Big[\, y_i\, x_i^\top\beta - \log\big(1 + e^{x_i^\top\beta}\big) \Big]. \]
The first term is linear; the second is a sum of \(\log(1 + e^{z})\) terms.
You might write that second term directly as log(1 + exp(X %*% beta)). CVXR will reject it — recall Chapter 3: exp is convex, so 1 + exp(...) is convex, but log is concave-increasing and needs a concave argument. The composition is not verifiable:
beta <- Variable(2)
X0 <- matrix(rnorm(20), 10, 2)
is_dcp(Problem(Minimize(sum(log(1 + exp(X0 %*% beta)))))) # FALSE[1] FALSE
The function is convex; DCP just can’t see it through that expression. The fix is the dedicated logistic atom, \(\texttt{logistic}(z) = \log(1 + e^{z})\), which DCP knows is convex. This is the reformulation habit again — reach for the atom.
We generate data from a known logistic model and maximize the log-likelihood, writing \(\sum_i \log(1+e^{x_i^\top\beta})\) as sum(logistic(X %*% beta)):
set.seed(42)
m <- 500; n <- 8
beta_true <- rnorm(n)
X <- matrix(rnorm(m * n), m, n)
y <- as.numeric(1 / (1 + exp(-X %*% beta_true)) > runif(m)) # 0/1 labels
beta <- Variable(n)
loglik <- sum(y * (X %*% beta)) - sum(logistic(X %*% beta))
prob <- Problem(Maximize(loglik))
psolve(prob)[1] -222.1508
check_solver_status(prob)It matches stats::glm to solver tolerance:
g <- glm(y ~ 0 + X, family = "binomial")
cbind(CVXR = round(value(beta), 4), glm = round(coef(g), 4)) glm
X1 1.3253 1.3253
X2 -0.5073 -0.5073
X3 0.3291 0.3291
X4 0.5693 0.5693
X5 0.1983 0.1983
X6 -0.1071 -0.1071
X7 1.4180 1.4180
X8 -0.2124 -0.2124
Just as we read off fitted values for OLS in Chapter 2, every quantity here is value() of an expression: the log-odds are value(X %*% beta), and the fitted probabilities follow from them. You’ll compute these in the exercises.
Because the objective is just an expression, the regularization idea from Chapter 5 drops straight in. Add an \(\ell_1\) penalty for a sparse logistic fit — one line:
lambda <- 25
pen_obj <- loglik - lambda * p_norm(beta, 1) # penalized log-likelihood
psolve(Problem(Maximize(pen_obj)))[1] -294.2886
round(value(beta), 3) [,1]
[1,] 0.714
[2,] -0.151
[3,] 0.000
[4,] 0.149
[5,] 0.000
[6,] 0.000
[7,] 0.782
[8,] 0.000
Larger lambda drives more coefficients to zero, just as the lasso did for linear regression — logistic regression and penalized regression are the same modeling vocabulary applied to a different loss.
Exercise 1. In Chapter 2 you read off OLS fitted values as value(X %*% beta). Do the classification analogue here: from the (unpenalized) fitted model, compute the in-sample predicted probabilities, turn them into class predictions at a 0.5 threshold, and report the classification accuracy and a confusion matrix.
Re-solve the maximum-likelihood fit (the penalized sections above changed beta), then everything is just value() of an expression:
psolve(prob) # the unpenalized MLE fit[1] -222.1508
p_hat <- 1 / (1 + exp(-value(X %*% beta))) # fitted probabilities
y_hat <- as.numeric(p_hat >= 0.5) # predicted classes
c(accuracy = mean(y_hat == y))accuracy
0.792
table(predicted = y_hat, actual = y) actual
predicted 0 1
0 191 56
1 48 205
The log-odds X %*% beta is an expression; value() evaluates it at the solution, and the rest is ordinary R. The very same pattern would score new data — just substitute a new design matrix.
Exercise 2. Fit a ridge-penalized (\(\ell_2\)) logistic regression at the same lambda = 25 and compare its coefficients with the \(\ell_1\) fit above. Which zeroes coefficients?
l1_fit <- { psolve(Problem(Maximize(loglik - 25 * p_norm(beta, 1)))); value(beta) }
l2_fit <- { psolve(Problem(Maximize(loglik - 25 * sum_squares(beta)))); value(beta) }
data.frame(l1 = round(l1_fit, 3), l2 = round(l2_fit, 3)) l1 l2
1 0.714 0.626
2 -0.151 -0.249
3 0.000 0.154
4 0.149 0.273
5 0.000 0.091
6 0.000 -0.067
7 0.782 0.683
8 0.000 -0.107
The \(\ell_2\) penalty shrinks every coefficient smoothly but zeroes none; the \(\ell_1\) penalty sets several exactly to zero. Same trade-off as ridge vs. lasso in Chapter 5 — the loss changed, the regularization logic did not.
stats::glm.log(1 + exp(...)) is non-DCP; the logistic atom is the DCP-compliant tool — reformulation, again.value() of an expression, and penalties from Chapter 5 add in one line.See Logistic Regression.