# torch for optimization

Torch is not just for deep learning. Its L-BFGS optimizer, complete with Strong-Wolfe line search, is a powerful tool in unconstrained as well as constrained optimization.

Sigrid Keydana (RStudio)https://www.rstudio.com/
04-27-2021

So far, all torch use cases we’ve discussed here have been in deep learning. However, its automatic differentiation feature is useful in other areas. One prominent example is numerical optimization: We can use torch to find the minimum of a function.

In fact, function minimization is exactly what happens in training a neural network. But there, the function in question normally is far too complex to even imagine finding its minima analytically. Numerical optimization aims at building up the tools to handle just this complexity. To that end, however, it starts from functions that are far less deeply composed. Instead, they are hand-crafted to pose specific challenges.

This post is a first introduction to numerical optimization with torch. Central takeaways are the existence and usefulness of its L-BFGS optimizer, as well as the impact of running L-BFGS with line search. As a fun add-on, we show an example of constrained optimization, where a constraint is enforced via a quadratic penalty function.

To warm up, we take a detour, minimizing a function “ourselves” using nothing but tensors. This will turn out to be relevant later, though, as the overall process will still be the same. All changes will be related to integration of optimizers and their capabilities.

## Function minimization, DYI approach

To see how we can minimize a function “by hand”, let’s try the iconic Rosenbrock function. This is a function with two variables:

$f(x_1, x_2) = (a - x_1)^2 + b * (x_2 - x_1^2)^2$

, with $$a$$ and $$b$$ configurable parameters often set to 1 and 5, respectively.

In R:

library(torch)

a <- 1
b <- 5

rosenbrock <- function(x) {
x1 <- x[1]
x2 <- x[2]
(a - x1)^2 + b * (x2 - x1^2)^2
}

Its minimum is located at (1,1), inside a narrow valley surrounded by breakneck-steep cliffs:1

Our goal and strategy are as follows.

We want to find the values $$x_1$$ and $$x_2$$ for which the function attains its minimum. We have to start somewhere; and from wherever that gets us on the graph we follow the negative of the gradient “downwards”, descending into regions of consecutively smaller function value.

Concretely, in every iteration, we take the current $$(x1,x2)$$ point, compute the function value as well as the gradient, and subtract some fraction of the latter to arrive at a new $$(x1,x2)$$ candidate. This process goes on until we either reach the minimum – the gradient is zero – or improvement is below a chosen threshold.

Here is the corresponding code. For no special reasons, we start at (-1,1) . The learning rate (the fraction of the gradient to subtract) needs some experimentation. (Try 0.1 and 0.001 to see its impact.)

num_iterations <- 1000

# fraction of the gradient to subtract
lr <- 0.01

# function input (x1,x2)
# this is the tensor w.r.t. which we'll have torch compute the gradient
x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

for (i in 1:num_iterations) {

if (i %% 100 == 0) cat("Iteration: ", i, "\n")

# call function
value <- rosenbrock(x_star)
if (i %% 100 == 0) cat("Value is: ", as.numeric(value), "\n")

# compute gradient of value w.r.t. params
value$backward() if (i %% 100 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "\n\n")

# manual update
x_star$sub_(lr * x_star$grad)
x_star$grad$zero_()
})
}
Iteration:  100
Value is:  0.3502924

Iteration:  200
Value is:  0.07398106

...
...

Iteration:  900
Value is:  0.0001532408

Iteration:  1000
Value is:  6.962555e-05
Gradient is:  -0.003222887 -0.006653666 

While this works, it really serves to illustrate the principle. With torch providing a bunch of proven optimization algorithms, there is no need for us to manually compute the candidate $$\mathbf{x}$$ values.

## Function minimization with torch optimizers

Instead, we let a torch optimizer update the candidate $$\mathbf{x}$$ for us. Habitually, our first try is Adam.

With Adam, optimization proceeds a lot faster. Truth be told, though, choosing a good learning rate still takes non-negligeable experimentation. (Try the default learning rate, 0.001, for comparison.)

num_iterations <- 100

x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

lr <- 1

for (i in 1:num_iterations) {

if (i %% 10 == 0) cat("Iteration: ", i, "\n")

optimizer$zero_grad() value <- rosenbrock(x_star) if (i %% 10 == 0) cat("Value is: ", as.numeric(value), "\n") value$backward()
optimizer$step() if (i %% 10 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "\n\n")

}
Iteration:  10
Value is:  0.8559565

Iteration:  20
Value is:  0.1282992

...
...

Iteration:  90
Value is:  4.003079e-05

Iteration:  100
Value is:  6.937736e-05
Gradient is:  -0.003240437 -0.006630421 

It took us about a hundred iterations to arrive at a decent value. This is a lot faster than the manual approach above, but still quite a lot. Luckily, further improvements are possible.

### L-BFGS

Among the many torch optimizers commonly used in deep learning (Adam, AdamW, RMSprop …), there is one “outsider”, much better known in classic numerical optimization than in neural-networks space: L-BFGS, a.k.a. Limited-memory BFGS, a memory-optimized implementation of the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm (BFGS).

BFGS is perhaps the most widely used among the so-called Quasi-Newton, second-order optimization algorithms. As opposed to the family of first-order algorithms that, in deciding on a descent direction, make use of gradient information only, second-order algorithms additionally take curvature information into account. To that end, exact Newton methods actually compute the Hessian (a costly operation), while Quasi-Newton methods avoid that cost and, instead, resort to iterative approximation.

Looking at the contours of the Rosenbrock function, with its prolonged, narrow valley, it is not difficult to imagine that curvature information might make a difference. And, as you’ll see in a second, it really does. Before though, one note on the code. When using L-BFGS, it is necessary to wrap both function call and gradient evaluation in a closure (calc_loss(), in the below snippet), for them to be callable several times per iteration. You can convince yourself that the closure is, in fact, entered repeatedly, by inspecting this code snippet’s chatty output:

num_iterations <- 3

x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star)

calc_loss <- function() {

optimizer$zero_grad() value <- rosenbrock(x_star) cat("Value is: ", as.numeric(value), "\n") value$backward()
cat("Gradient is: ", as.matrix(x_star$grad), "\n\n") value } for (i in 1:num_iterations) { cat("Iteration: ", i, "\n") optimizer$step(calc_loss)
}
Iteration:  1
Value is:  4

Value is:  6

...
...

Value is:  0.04880721

Value is:  0.0302862

Iteration:  2
Value is:  0.01697086

Value is:  0.01124081

...
...

Value is:  1.111701e-09

Value is:  4.547474e-12

Iteration:  3
Value is:  4.547474e-12
Gradient is:  -1.907349e-05 9.536743e-06 

Even though we ran the algorithm for three iterations, the optimal value really is reached after two. Seeing how well this worked, we try L-BFGS on a more difficult function, named flower, for pretty self-evident reasons.

## (Yet) more fun with L-BFGS

Here is the flower function.2 Mathematically, its minimum is near (0,0), but technically the function itself is undefined at (0,0), since the atan2 used in the function is not defined there.

a <- 1
b <- 1
c <- 4

flower <- function(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}

We run the same code as above, starting from (20,20) this time.

num_iterations <- 3

x_star <- torch_tensor(c(20, 0), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star)

calc_loss <- function() {

optimizer$zero_grad() value <- flower(x_star) cat("Value is: ", as.numeric(value), "\n") value$backward()
cat("Gradient is: ", as.matrix(x_star$grad), "\n") cat("X is: ", as.matrix(x_star), "\n\n") value } for (i in 1:num_iterations) { cat("Iteration: ", i, "\n") optimizer$step(calc_loss)
}
Iteration:  1
Value is:  28.28427
X is:  20 20

...
...

Value is:  19.33546
X is:  12.957 14.68274

...
...

Value is:  18.29546
X is:  12.14691 14.06392

...
...

Value is:  9.853705
X is:  5.763702 8.895616

Value is:  2635.866
X is:  -1949.697 -1773.551

Iteration:  2
Value is:  1333.113
X is:  -985.4553 -897.5367

Value is:  30.16862
X is:  -21.02814 -21.72296

Value is:  1281.39
X is:  964.0121 843.7817

Value is:  628.1306
X is:  475.7051 409.7372

Value is:  4965690
X is:  -3721262 -3287901

Value is:  2482306
X is:  -1862675 -1640817

Value is:  8.61863e+11
X is:  645200412672 571423064064

Value is:  430929412096
X is:  322643460096 285659529216

Value is:  Inf
X is:  -2.826342e+19 -2.503904e+19

Iteration:  3
Value is:  Inf
X is:  -2.826342e+19 -2.503904e+19 

This has been less of a success. At first, loss decreases nicely, but suddenly, the estimate dramatically overshoots, and keeps bouncing between negative and positive outer space ever after.

Luckily, there is something we can do.

Taken in isolation, what a Quasi-Newton method like L-BFGS does is determine the best descent direction. However, as we just saw, a good direction is not enough. With the flower function, wherever we are, the optimal path leads to disaster if we stay on it long enough. Thus, we need an algorithm that carefully evaluates not only where to go, but also, how far.

For this reason, L-BFGS implementations commonly incorporate line search, that is, a set of rules indicating whether a proposed step length is a good one, or should be improved upon.

Specifically, torch’s L-BFGS optimizer implements the Strong Wolfe conditions. We re-run the above code, changing just two lines. Most importantly, the one where the optimizer is instantiated:

optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe")

And secondly, this time I found that after the third iteration, loss continued to decrease for a while, so I let it run for five iterations. Here is the output:

Iteration:  1
...
...

Value is:  -0.8838741
X is:  0.09035123 -0.03220009

Value is:  -0.928809
X is:  0.06564617 -0.026706

Iteration:  2
...
...

Value is:  -0.9991404
X is:  0.0006493925 -0.0002656128

Value is:  -0.9992246
X is:  0.0007130796 -0.0002947929

Iteration:  3
...
...

Value is:  -0.9997789
X is:  0.0002042478 -8.457939e-05

Value is:  -0.9998025
X is:  0.0001822711 -7.553725e-05

Iteration:  4
...
...

Value is:  -0.9999917
X is:  -6.320081e-06 2.614706e-06

Value is:  -0.9999923
X is:  -6.921942e-06 2.865841e-06

Iteration:  5
...
...

Value is:  -0.9999999
X is:  -7.267168e-08 3.009783e-08

Value is:  -0.9999999
X is:  -7.404627e-08 3.066708e-08 

It’s still not perfect, but a lot better.

Finally, let’s go one step further. Can we use torch for constrained optimization?

### Quadratic penalty for constrained optimization

In constrained optimization, we still search for a minimum, but that minimum can’t reside just anywhere: Its location has to fulfill some number of additional conditions. In optimization lingo, it has to be feasible.

To illustrate, we stay with the flower function, but add on a constraint: $$\mathbf{x}$$ has to lie outside a circle of radius $$sqrt(2)$$, centered at the origin. Formally, this yields the inequality constraint

$2 - {x_1}^2 - {x_2}^2 <= 0$

A way to minimize flower and yet, at the same time, honor the constraint is to use a penalty function. With penalty methods, the value to be minimized is a sum of two things: the target function’s output and a penalty reflecting potential constraint violation. Use of a quadratic penalty, for example, results in adding a multiple of the square of the constraint function’s output:

# x^2 + y^2 >= 2
# 2 - x^2 - y^2 <= 0
constraint <- function(x) 2 - torch_square(torch_norm(x))

penalty <- function(x) torch_square(torch_max(constraint(x), other = 0))

A priori, we can’t know how big that multiple has to be to enforce the constraint. Therefore, optimization proceeds iteratively. We start with a small multiplier, $$1$$, say, and increase it for as long as the constraint is still violated:

penalty_method <- function(f, p, x, k_max, rho = 1, gamma = 2, num_iterations = 1) {

for (k in 1:k_max) {
cat("Starting step: ", k, ", rho = ", rho, "\n")

minimize(f, p, x, rho, num_iterations)

cat("Value: ",  as.numeric(f(x)), "\n")
cat("X: ",  as.matrix(x), "\n")

current_penalty <- as.numeric(p(x))
cat("Penalty: ", current_penalty, "\n")
if (current_penalty == 0) break

rho <- rho * gamma
}

}

minimize(), called from penalty_method(), follows the usual proceedings, but now it minimizes the sum of the target and up-weighted penalty function outputs:

minimize <- function(f, p, x, rho, num_iterations) {

calc_loss <- function() {
optimizer$zero_grad() value <- f(x) + rho * p(x) value$backward()
value
}

for (i in 1:num_iterations) {
cat("Iteration: ", i, "\n")
optimizer\$step(calc_loss)
}

}

This time, we start from a low-target-loss, but unfeasible value. With yet another change to default L-BFGS (namely, a decrease in tolerance), we see the algorithm exiting successfully after twenty-two iterations, at the point (0.5411692,1.306563).

x_star <- torch_tensor(c(0.5, 0.5), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe", tolerance_change = 1e-20)

penalty_method(flower, penalty, x_star, k_max = 30)
Starting step:  1 , rho =  1
Iteration:  1
Value:  0.3469974
X:  0.5154735 1.244463
Penalty:  0.03444662

Starting step:  2 , rho =  2
Iteration:  1
Value:  0.3818618
X:  0.5288152 1.276674
Penalty:  0.008182613

Starting step:  3 , rho =  4
Iteration:  1
Value:  0.3983252
X:  0.5351116 1.291886
Penalty:  0.001996888

...
...

Starting step:  20 , rho =  524288
Iteration:  1
Value:  0.4142133
X:  0.5411959 1.306563
Penalty:  3.552714e-13

Starting step:  21 , rho =  1048576
Iteration:  1
Value:  0.4142134
X:  0.5411956 1.306563
Penalty:  1.278977e-13

Starting step:  22 , rho =  2097152
Iteration:  1
Value:  0.4142135
X:  0.5411962 1.306563
Penalty:  0 

## Conclusion

Summing up, we’ve gotten a first impression of the effectiveness of torch’s L-BFGS optimizer, especially when used with Strong-Wolfe line search. In fact, in numerical optimization – as opposed to deep learning, where computational speed is much more of an issue – there is hardly ever a reason to not use L-BFGS with line search.

We’ve then caught a glimpse of how to do constrained optimization, a task that arises in many real-world applications. In that regard, this post feels a lot more like a beginning than a stock-taking. There is a lot to explore, from general method fit – when is L-BFGS well suited to a problem? – via computational efficacy to applicability to different species of neural networks. Needless to say, if this inspires you to run your own experiments, and/or if you use L-BFGS in your own projects, we’d love to hear your feedback!

## Appendix

### Rosenbrock function plotting code

library(tidyverse)

a <- 1
b <- 5

rosenbrock <- function(x) {
x1 <- x[1]
x2 <- x[2]
(a - x1)^2 + b * (x2 - x1^2)^2
}

df <- expand_grid(x1 = seq(-2, 2, by = 0.01), x2 = seq(-2, 2, by = 0.01)) %>%
rowwise() %>%
mutate(x3 = rosenbrock(c(x1, x2))) %>%
ungroup()

ggplot(data = df,
aes(x = x1,
y = x2,
z = x3)) +
geom_contour_filled(breaks = as.numeric(torch_logspace(-3, 3, steps = 50)),
show.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(direction = -1) +
theme(aspect.ratio = 1)

### Flower function plotting code

a <- 1
b <- 1
c <- 4

flower <- function(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}

df <- expand_grid(x = seq(-3, 3, by = 0.05), y = seq(-3, 3, by = 0.05)) %>%
rowwise() %>%
mutate(z = flower(torch_tensor(c(x, y))) %>% as.numeric()) %>%
ungroup()

ggplot(data = df,
aes(x = x,
y = y,
z = z)) +
geom_contour_filled(show.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(direction = -1) +
theme(aspect.ratio = 1)

Photo by Michael Trimble on Unsplash

1. The code used to generate this plot is found in the appendix.↩︎

2. The code to plot it is, again, found in the appendix.↩︎

### Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

### Citation

Keydana (2021, April 27). Posit AI Blog: torch for optimization. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-04-22-torch-for-optimization/
@misc{keydanatorchoptimization,
}