Skip to contents

Introduction

Suppose you have a convex optimization problem whose data — the constraint matrix, the offset, the objective coefficients — depend on some upstream quantity (a parameter, a model weight, a hyperparameter, the output of a previous computation). You solve the problem and get an optimum. How does that optimum change when the data changes?

diffcp answers that question. It treats the optimal solution as an implicit function of the problem data, and returns the derivative of that function — both in the forward direction (perturb the data, get the predicted change in the optimum) and as its adjoint (start from a downstream loss on the optimum and backpropagate to a gradient on the data).

Three concrete things you can do with it:

  • Sensitivity analysis. Quantify how much each constraint or cost coefficient matters at the optimum.
  • Hyperparameter learning. Solve a regularized problem inside a training loop and let gradient descent on the upstream loss learn the regularizer.
  • End-to-end optimization layers. Embed a CVXR problem as a differentiable component of a larger model — see CVXR’s Problem$backward() and Problem$derivative() API, which internally calls diffcp.

The implementation is a port of the Python diffcp package (Agrawal, Barratt, Boyd, Busseti, Moursi, 2019); the numerical core (cone projections, Jacobians, the M operator, LSQR) is direct C++ from the upstream repository, called from R via RcppEigen.

The mathematical setup

diffcp differentiates the primal–dual conic problem

minimizecxsubject toAx+s=b,sK, \begin{aligned} \text{minimize} &\quad c^\top x \\ \text{subject to} &\quad A x + s = b, \quad s \in K, \end{aligned}

where KK is a Cartesian product of the standard cones (zero, the non-negative orthant, second-order, positive semidefinite, exponential, and the exponential dual). The dual variable attached to the cone constraint is yy; together (x,y,s)(x, y, s) is the primal–dual triple.

Differentiating “the optimum” turns out to mean something precise:

  1. The KKT conditions can be lifted into a homogeneous self-dual embedding (Ye–Todd–Mizuno) in which optimality, primal infeasibility, and dual infeasibility all become statements about a single residual map N(z)=((QI)Π(z)+I)zN(z) = ((Q - I) \Pi(z) + I) z, where QQ is a skew-symmetric block of the data and Π\Pi is the projection onto n×K*×+\mathbb{R}^n \times K^\ast \times \mathbb{R}_+.
  2. At a regular point — informally, an optimum away from active-set transitions and degeneracies — the implicit-function theorem applies, and the optimum is a smooth function of the problem data.
  3. The Jacobian of the optimum with respect to the data is then a bounded linear operator, computed by solving a linear system in M=(QI)DΠ(z)+IM = (Q - I)\,D\Pi(z) + I using either a dense LDLT factorization or a matrix-free LSQR iteration.

This is the construction in Busseti, Moursi, and Boyd (2019); diffcp packages it into two callables.

What the package computes

Calling solve_and_derivative(A, b, c, cone_dict) returns a list with five entries:

  • x, y, s — the primal–dual optimum.
  • D(dA, db, dc) — the forward derivative. Given a perturbation of the data, returns (dx, dy, ds): the predicted change in the optimum, to first order.
  • DT(dx, dy, ds) — the adjoint of D. Given a perturbation of the optimum (or, more usefully, a gradient of a downstream loss with respect to the optimum), returns (dA, db, dc): the gradient of that loss with respect to the data.

Both are functions, not matrices. The dense mode materializes the Jacobian as an Eigen matrix and factors it once; the LSQR mode applies the Jacobian and its transpose matrix-free, never forming the dense object. The trade-off is per-call cost (dense is faster once M has been factored) versus memory and asymptotic scaling (lsqr is the only viable choice for very large problems).

Example 1: a linear program

The smallest illustrative case: pick the cheapest non-negative mixture of three goods that sums to one,

minx3cxsubject to𝟏x=1,x0. \min_{x \in \mathbb{R}^3} \; c^\top x \quad \text{subject to} \quad \mathbf{1}^\top x = 1,\; x \ge 0.

In SCS standard form Ax+s=bAx + s = b, sK={0}×+3s \in K = \{0\} \times \mathbb{R}^3_+ — one row for the equality (zero cone, s=0s = 0) followed by three rows for the non-negativity constraints (the slack absorbs the inequality):

A <- sparseMatrix(i = c(1, 2, 1, 3, 1, 4),
                  j = c(1, 1, 2, 2, 3, 3),
                  x = c(1, -1, 1, -1, 1, -1),
                  dims = c(4, 3))
b <- c(1, 0, 0, 0)
c <- c(1, 2, 3)
cone_dict <- list(z = 1L, l = 3L)

res <- solve_and_derivative(A, b, c, cone_dict, mode = "lsqr",
                            tol_gap_abs = 1e-10,
                            tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x
#> [1] 1.000000e+00 1.562233e-11 5.329854e-12

The optimum is x(1,0,0)x \approx (1, 0, 0) — all weight on the cheapest coordinate. Now the derivative. Suppose we slightly increase the cost of the first coordinate, holding everything else fixed:

dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                   dims = c(4, 3))
db <- numeric(4)
dc <- c(0.001, 0, 0)
d <- res$D(dA, db, dc)
d$dx
#> [1] -2.311790e-16 -6.104703e-16  4.390389e-15

D reports that dx is essentially zero — the optimum is robust to a small bump in c1c_1, because the constraint 𝟏x=1\mathbf{1}^\top x = 1 pins coordinate 1 against x0x \ge 0. (You’d see a non-trivial dy if you printed the dual perturbation.) The forward derivative agrees with finite-differencing the perturbed solve to within solver tolerance:

res_pert <- solve_only(A, b, c + dc, cone_dict,
                       tol_gap_abs = 1e-10,
                       tol_gap_rel = 1e-10,
                       tol_feas    = 1e-10)
max(abs(d$dx - (res_pert$x - res$x)))
#> [1] 1.453479e-14

The agreement is at the level of the Clarabel solve tolerance, not the LSQR tolerance — the per-call derivative is the exact implicit-function-theorem answer at this point.

Example 2: a second-order-cone program

A geometric example: project the simplex centroid onto a Lorentz cone,

mint,xtsubject tox2t,𝟏x=1. \min_{t,\, x} \; t \quad \text{subject to} \quad \|x\|_2 \le t,\; \mathbf{1}^\top x = 1.

The variables are (t,x1,x2,x3)(t, x_1, x_2, x_3). The block (t,x)(t, x) lives in a single 4-dimensional second-order cone; the equality goes in a zero cone of size 1. Since tt is being minimized against a norm constraint, the optimum lies on the cone boundary x2=t\|x\|_2 = t, which is exactly the kind of active constraint that makes derivatives non-trivial.

A <- sparseMatrix(
  i = c(2, 1, 3, 1, 4, 1, 5),
  j = c(1, 2, 2, 3, 3, 4, 4),
  x = c(-1, 1, -1, 1, -1, 1, -1),
  dims = c(5, 4))
b <- c(1, 0, 0, 0, 0)
c <- c(1, 0, 0, 0)
cone_dict <- list(z = 1L, q = 4L)

res <- solve_and_derivative(A, b, c, cone_dict, mode = "dense",
                            tol_gap_abs = 1e-10,
                            tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x
#> [1] 0.5773503 0.3333333 0.3333333 0.3333333

The optimum is x=(1/3,1/3,1/3)x = (1/3, 1/3, 1/3), t=3/3t = \sqrt{3}/3, and the SOC constraint x2=t\|x\|_2 = t holds with equality. The cone projection’s Jacobian at a boundary point is the technical heart of diffcp — it’s a smoothed projector that interpolates between the identity (interior of the cone) and zero (interior of the polar cone). The package handles all the boundary geometry automatically; from the R side you just call D and DT.

Example 3: a small semidefinite program

The minimum-eigenvalue SDP: find a unit-trace PSD matrix that minimizes C,X\langle C, X \rangle for given CC. With C=diag(1,0,0,0,1)C = \mathrm{diag}(1, 0, 0, 0, -1), the optimum is rank-one and puts all weight on the eigenvector of CC with most-negative eigenvalue.

PSD support requires a vectorization convention. diffcp uses the SCS convention: lower-triangular column-major, with off-diagonals scaled by 2\sqrt{2}. The package exports vec_symm() and unvec_symm() for round-trips between symmetric matrices and their packed vector form. The trace-coefficient row in the equality constraint is sparse — only the diagonal entries contribute, and we build it explicitly:

DIM <- 5L
n   <- DIM * (DIM + 1L) %/% 2L          # 15

trace_row <- numeric(n)
pos <- 1L
for (col in seq_len(DIM)) {
  trace_row[pos] <- 1.0
  pos <- pos + (DIM - col + 1L)         # next diagonal in column-major lower-tri
}

A <- rbind(
  Matrix(matrix(trace_row, nrow = 1L), sparse = TRUE),
  -Diagonal(n))
A <- as(A, "CsparseMatrix")
b <- c(1.0, numeric(n))

C <- matrix(0, DIM, DIM)
C[1, 1] <- 1; C[DIM, DIM] <- -1
c_vec <- vec_symm(C)

cone_dict <- list(z = 1L, s = DIM)

res <- solve_only(A, b, c_vec, cone_dict,
                  tol_gap_abs = 1e-10,
                  tol_gap_rel = 1e-10,
                  tol_feas    = 1e-10)
X <- unvec_symm(res$x, DIM)
sum(diag(X))
#> [1] 1
range(eigen(X, symmetric = TRUE, only.values = TRUE)$values)
#> [1] -2.080584e-12  1.000000e+00

X has trace 1 (to solver tolerance) and is PSD (smallest eigenvalue 109\ge -10^{-9}). Behind the scenes, when the solve goes through Clarabel — which packs PSD entries in upper-triangular column-major order, opposite to SCS — diffcp permutes the rows of AA and the entries of bb on the way in, and inverse-permutes the dual yy and slack ss on the way out, so the user-facing convention stays SCS throughout.

A practical sensitivity-analysis example

Here is the canonical use case: shadow prices. Take a small LP — a profit-maximizing production plan with a labor and a materials budget — and ask, given the optimum, by how much would the profit change per extra unit of labor? That number is the dual variable on the labor constraint. diffcp lets you compute the same sensitivity for any perturbation of the data, including ones that have no closed-form dual interpretation.

We solve

maxx1,x203x1+5x2s.t.2x1+x2100,x1+3x290, \max_{x_1, x_2 \ge 0}\; 3 x_1 + 5 x_2 \quad \text{s.t.}\quad 2 x_1 + x_2 \le 100,\; x_1 + 3 x_2 \le 90,

and ask: how does the optimum respond to a small change in the labor-budget right-hand side, b1=100b_1 = 100?

We rewrite the maximization as a minimization of cx-c^\top x, and encode the two budget inequalities and the two non-negativity inequalities as four rows of a non-negative orthant slack constraint:

A_lp <- sparseMatrix(i = c(1, 2, 3,  1, 2, 4),
                     j = c(1, 1, 1,  2, 2, 2),
                     x = c(2, 1, -1, 1, 3, -1),
                     dims = c(4, 2))
b_lp <- c(100, 90, 0, 0)
c_lp <- c(-3, -5)
cones_lp <- list(l = 4L)

res <- solve_and_derivative(A_lp, b_lp, c_lp, cones_lp, mode = "dense",
                            tol_gap_abs = 1e-10, tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x      # optimal production plan
#> [1] 42 16
-sum(c_lp * res$x)   # max profit
#> [1] 206

The optimum makes x1=42x_1 = 42, x2=16x_2 = 16, profit =206= 206. Now use D to ask how the optimal xx would change if we got one extra unit of labor (Δb1=1\Delta b_1 = 1):

dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                   dims = c(4, 2))
db <- c(1, 0, 0, 0)
dc <- numeric(2)
d  <- res$D(dA, db, dc)
d$dx                       # change in optimal production
#> [1]  0.6 -0.2
-sum(c_lp * d$dx)          # change in profit
#> [1] 0.8

d$dx says that an extra unit of labor increases x1x_1 by 0.6 and decreases x2x_2 by 0.2, and the corresponding profit change is 0.80.8 per unit of labor: the labor-budget shadow price, recovered from D rather than from the dual variable yy.

The advantage of going through D instead of the dual: it generalizes. We can ask the same question about non-trivial perturbations of AA (a productivity gain on a particular input), c (a price change on a particular product), or any combination. The dual variable answers only the canonical RHS-perturbation question; D answers all of them through the same single call.

DT answers the opposite question. Suppose we have a downstream loss L(x*)L(x^\ast) — say, a smooth function of the optimum with known gradient L/x*\partial L / \partial x^\ast. The adjoint computes the gradient of LL with respect to every data entry (A,b,c)(A, b, c) in one shot:

## Suppose dL/dx* is given (e.g. from autodiff in a larger model).
dL_dx <- c(1.0, -0.5)
dL_dy <- numeric(length(res$y))
dL_ds <- numeric(length(res$s))
da    <- res$DT(dL_dx, dL_dy, dL_ds)
da$db                 # gradient of L w.r.t. each entry of b
#> [1]  7.000000e-01 -4.000000e-01 -6.162041e-12 -2.615861e-12
da$dc                 # gradient of L w.r.t. each entry of c
#> [1] 3.130900e-10 1.741567e-10

In a training loop you’d feed da$db and da$dc straight into a parameter update.

Forward versus adjoint: when to use each

Both D and DT access the same underlying Jacobian, but their cost profile differs:

  • D(dA, db, dc) is right-multiplication by the Jacobian. Cost is one solve of the same linear system per call. Use D when you have a small number of perturbation directions (e.g., parameter sweeps) and want to forecast the effect on the optimum.
  • DT(dx, dy, ds) is right-multiplication by the Jacobian’s transpose. Same per-call cost, but the answer is a gradient with respect to all data entries simultaneously. Use DT when you have an upstream gradient on the optimum (typically from an autodiff system) and need to propagate it through the conic optimization.

mode = "dense" factors M once on the first call and amortizes the cost across subsequent D / DT invocations on the same problem, which is the right choice when you’ll evaluate either operator many times.

Choosing dense versus lsqr

solve_and_derivative accepts two modes for the inner linear-algebra step:

  • mode = "lsqr" (default): matrix-free conjugate-gradient- style iterative solve against the M operator. Memory is O(nnz(A))O(\text{nnz}(A)); iteration count grows with the conditioning of MM. Appropriate for large sparse problems.
  • mode = "dense": build MM as a dense matrix, factor with Eigen’s LDLT, and reuse the factorization across D / DT calls. Cost is O(N3)O(N^3) once, O(N2)O(N^2) per call, where N=m+n+1N = m + n + 1. Appropriate for small problems and for workflows that re-apply D / DT many times.

Both modes produce the same answer to within solver tolerance. The dense path matches Python’s (MM).ldlt().solve(Mrhs)(M^\top M).\mathrm{ldlt}().\mathrm{solve}(M^\top \cdot \mathrm{rhs}) bit-for-bit; the LSQR path agrees with the dense path to roughly 10610^{-6} at the default atol = btol = 1e-8.

Conventions

  • PSD vectorization. vec_symm() packs the lower triangle in column-major order with off-diagonals scaled by 2\sqrt{2} (the SCS convention). unvec_symm() is the inverse. This scaling makes the inner product vec(A),vec(B)\langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle equal tr(AB)\mathrm{tr}(AB) for symmetric A,BA, B, which is what the cone-projection Jacobian expects.
  • PSD with Clarabel. Clarabel uses upper-triangular column- major; diffcp permutes rows of AA and entries of bb to Clarabel order on the way in and inverse-permutes the dual yy and slack ss on the way out. The user-facing convention is SCS throughout.
  • Exponential cones. Each ep cone consumes three rows of AA (one cone per count); the dual ed cone is handled separately or recovered via Moreau’s decomposition, ΠK*(x)=x+ΠK(x)\Pi_{K^\ast}(x) = x + \Pi_K(-x).
  • Quadratic objectives. solve_only(P = P) accepts a positive-semidefinite P for the forward solve. The derivative path through solve_and_derivative does not support P directly; CVXR’s reduction chain canonicalizes QPs into auxiliary-variable conic problems before reaching the solver, so the loss is invisible at the CVXR layer.

Connection to CVXR

R users typically don’t call diffcp directly. CVXR (≥ 1.8.2) exposes a higher-level API:

library(CVXR)
p <- Parameter(); x <- Variable(); value(p) <- 3
prob <- Problem(Minimize((x - 2 * p)^2), list(x >= 0))
psolve(prob, requires_grad = TRUE)
backward(prob)
gradient(p)             # 2  (since x* = 2p)
delta(p) <- 1e-3
derivative(prob)
delta(x)                # 0.002

The requires_grad = TRUE flag routes the solve through the DIFFCP solver wrapper, which calls diffcp::solve_and_derivative under the hood. CVXR then handles the chain rule through any problem reductions (DGP log-transform, Complex2Real splitting, parameter canonicalization) and reports gradients keyed by Parameter rather than by raw data entries.

When you want the lower-level data-perturbation interface — for research, for fixtures that bypass CVXR canonicalization, or for problems already in conic standard form — diffcp is the right entry point.

Citation

If you use diffcp in academic work, please cite both the R package and the original paper. See citation("diffcp") for the canonical BibTeX entries.

  • R package

    Narasimhan, B.; Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; Moursi, W. diffcp: Differentiating Through Cone Programs. R package, 2026. https://github.com/bnaras/diffcp

  • Original paper

    Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; Moursi, W. “Differentiating through a cone program.” Journal of Applied and Numerical Optimization, 1(2), 107–115, 2019.