So far, all our torch use cases have been in deep learning. But torch's automatic differentiation capability is useful in other areas as well. One prominent example is numerical optimization: we can use torch to find the minimum of a function.
In fact, function minimization is exactly what happens when training a neural network. There, however, the function in question is normally far too complex to even think about finding its minimum analytically. Numerical optimization builds up the tools to handle just this complexity. To that end, though, it starts from functions that are far less deeply composed; instead, they are hand-crafted to pose specific challenges.
This post is a first introduction to numerical optimization with torch. Its central takeaways are the existence, and usefulness, of the library's L-BFGS optimizer, and the impact of combining it with a line search algorithm. As a fun add-on, we show a concrete example of constrained optimization, where a constraint is enforced via a quadratic penalty function.
To warm up, we take a detour and minimize a function "ourselves", using nothing but tensors. This will turn out to be relevant later, though, as the overall process stays the same; all changes will relate to the integration of optimizers and their capabilities.
Function minimization, DIY approach
To see how we can minimize a function "ourselves", let's try the iconic Rosenbrock function. This is a function of two variables:

f(x1, x2) = (a - x1)^2 + b * (x2 - x1^2)^2

with parameters a and b usually set to 1 and 5, respectively.
In R:
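The sketch below assumes that x arrives as a single two-element torch tensor; any equivalent formulation works just as well:

```r
library(torch)

a <- 1
b <- 5

rosenbrock <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  (a - x1)^2 + b * (x2 - x1^2)^2
}
```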
Its minimum, located at (1,1), lies at the bottom of a narrow valley surrounded by steep cliffs:

Figure 1: Rosenbrock function.
Our goal and strategy are as follows. We want to find the values x1 and x2 for which the function attains its minimum. We have to start somewhere; and from wherever that point puts us on the graph, we follow the negative of the gradient "downwards", descending into regions of successively smaller function values.

Concretely, in every iteration we take the current (x1, x2) point, compute the function value and the gradient, and subtract some fraction of the gradient to arrive at a new candidate point. This process goes on until we either reach the minimum, indicated by a zero gradient, or improvement falls below a chosen threshold.
Here is the corresponding code. For no particular reason, we start at (-1,1). The learning rate, that is, the fraction of the gradient to subtract, needs some experimentation; try, for instance, 0.1 and 0.001 to see their impact.
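A sketch of such a loop follows; the learning rate of 0.01 and the iteration count are assumptions, chosen to be consistent with the output below:

```r
num_iterations <- 1000

# fraction of the gradient to subtract (an assumption; try 0.1 and 0.001, too)
lr <- 0.01

# starting point, tracked by autograd
x <- torch_tensor(c(-1, 1), requires_grad = TRUE)

for (i in 1:num_iterations) {

  value <- rosenbrock(x)
  # compute the gradient of value with respect to x
  value$backward()

  if (i %% 100 == 0) {
    cat("Iteration:", i, "\n")
    cat("Value is:", as.numeric(value), "\n")
    cat("Gradient is:", as.matrix(x$grad), "\n\n")
  }

  # update x in place, without tracking the update itself
  with_no_grad({
    x$sub_(lr * x$grad)
    x$grad$zero_()
  })
}
```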
```
Iteration: 100 
Value is: 0.3502924 
Gradient is: -0.667685 -0.5771312 

Iteration: 200 
Value is: 0.07398106 
Gradient is: -0.1603189 -0.2532476 

...

Iteration: 900 
Value is: 0.0001532408 
Gradient is: -0.004811743 -0.009894371 

Iteration: 1000 
Value is: 6.962555e-05 
Gradient is: -0.003222887 -0.006653666
```
While this works, it really only serves to illustrate the principle. With torch providing a range of proven optimization algorithms, there is no need for us to manually compute candidate x values ourselves.
Function minimization with torch optimizers
Instead, we let a torch optimizer update the candidate x for us. Habitually, our first try is Adam.
Adam
With Adam, optimization proceeds a lot faster. Truth be told, though, choosing a good learning rate still takes non-negligible experimentation. (For comparison, try the default learning rate of 0.001.)
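Here is a sketch of the optimizer-driven version; the learning rate of 1 is an assumption, made to match the fast convergence seen below:

```r
num_iterations <- 100

x <- torch_tensor(c(-1, 1), requires_grad = TRUE)

# lr = 1 is an assumption; the default would be 0.001
optimizer <- optim_adam(x, lr = 1)

for (i in 1:num_iterations) {

  optimizer$zero_grad()
  value <- rosenbrock(x)
  value$backward()
  # let the optimizer update x
  optimizer$step()

  if (i %% 10 == 0) {
    cat("Iteration:", i, "\n")
    cat("Value is:", as.numeric(value), "\n")
    cat("Gradient is:", as.matrix(x$grad), "\n\n")
  }
}
```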
```
Iteration: 10 
Value is: 0.8559565 
Gradient is: -1.732036 -0.5898831 

Iteration: 20 
Value is: 0.1282992 
Gradient is: -3.22681 1.577383 

Iteration: 30 
Value is: 4.003079e-05 
Gradient is: -0.05383469 0.02346456 

Iteration: 40 
Value is: 6.937736e-05 
Gradient is: -0.003240437 -0.006630421
```
After just a few dozen iterations, we have come very close to the minimum, a lot faster than with hand-rolled gradient descent. Still, further improvement is attainable.
L-BFGS
Among the many torch optimizers commonly used in deep learning (Adam, AdamW, RMSprop, ...), there is an "outsider", much better known in classical numerical optimization than in the neural-networks space: L-BFGS, a.k.a. Limited-memory BFGS, a memory-optimized implementation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
BFGS is perhaps the most widely used of the so-called quasi-Newton, second-order optimization algorithms. As opposed to the family of first-order algorithms that, in deciding on a descent direction, make use of gradient information only, second-order algorithms additionally take curvature information into account. To that end, exact Newton methods actually compute the Hessian, a costly operation, while quasi-Newton methods avoid that cost and instead resort to iterative approximation of its inverse.
Looking at the Rosenbrock function, with its elongated, narrow valley, it does not seem unlikely that curvature information could make a difference. And, as we will see, it really does. Before we try it, though, one note on the code: when using L-BFGS, it is necessary to wrap both the objective function evaluation and the gradient computation in a closure (calc_loss() in the snippet below), so that they can be called several times per iteration.
You can convince yourself that the closure is indeed called repeatedly by inspecting this code snippet's verbose output:
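A sketch, reusing rosenbrock() from above (calc_loss() is the closure just mentioned):

```r
num_iterations <- 3

x <- torch_tensor(c(-1, 1), requires_grad = TRUE)

optimizer <- optim_lbfgs(x)

# closure that L-BFGS may call several times per iteration
calc_loss <- function() {

  optimizer$zero_grad()

  value <- rosenbrock(x)
  cat("Value is:", as.numeric(value), "\n")

  value$backward()
  cat("Gradient is:", as.matrix(x$grad), "\n\n")

  value
}

for (i in 1:num_iterations) {
  cat("Iteration:", i, "\n")
  optimizer$step(calc_loss)
}
```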
```
Iteration: 1 
Value is: 0.04880721 
Gradient is: -0.262119 -0.1132655 

Value is: 0.0302862 
Gradient is: 1.293824 -0.7403332 

Iteration: 2 
Value is: 0.01697086 
Gradient is: 0.3468466 -0.3173429 

Value is: 0.01124081 
Gradient is: 0.2420997 -0.2347881 

Iteration: 3 
Value is: 4.547474e-12 
Gradient is: -1.907349e-05 9.536743e-06
```
Even though we ran the algorithm for three iterations, the optimum is found after just two. Seeing how well this worked, we try L-BFGS on a more difficult function, aptly named the flower function.
(Not quite as much) fun with L-BFGS
Here is the flower function. Mathematically, its minimum is near (0,0); technically, though, the function itself is not defined at (0,0), since the atan2 it makes use of is not defined there.
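A sketch of its definition; the coefficients a, b, and c are assumptions, chosen to be consistent with the values reported in the output below:

```r
a <- 1
b <- 1
c <- 4

flower <- function(x) {
  a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}
```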

Figure 2: Flower function.
We run the same code as above, this time starting from (20,20).
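Only the starting point changes:

```r
x <- torch_tensor(c(20, 20), requires_grad = TRUE)
```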
```
Iteration: 1 
Value is: 28.28427 
Gradient is: 0.8071069 0.6071068 
X is: 20 20 

... 

Value is: 19.33546 
Gradient is: 0.8100872 0.6188223 
X is: 12.957 14.68274 

... 

Value is: 18.29546 
Gradient is: 0.8096464 0.622064 
X is: 12.14691 14.06392 

... 

Value is: 9.853705 
Gradient is: 0.7546976 0.7025688 
X is: 5.763702 8.895616 

Value is: 2635.866 
Gradient is: -0.7407354 -0.6717985 
X is: -1949.697 -1773.551 

Iteration: 2 
Value is: 1333.113 
Gradient is: -0.7413024 -0.6711776 
X is: -985.4553 -897.5367 

Value is: 30.16862 
Gradient is: -0.7903821 -0.6266789 
X is: -21.02814 -21.72296 

Value is: 1281.39 
Gradient is: 0.7544561 0.6563575 
X is: 964.0121 843.7817 

Value is: 628.1306 
Gradient is: 0.7616636 0.6480014 
X is: 475.7051 409.7372 

Value is: 4965690 
Gradient is: -0.7493951 -0.662123 
X is: -3721262 -3287901 

Value is: 2482306 
Gradient is: -0.7503822 -0.6610042 
X is: -1862675 -1640817 

Value is: 8.61863e+11 
Gradient is: 0.7486113 0.6630091 
X is: 645200412672 571423064064 

Value is: 430929412096 
Gradient is: 0.7487153 0.6628917 
X is: 322643460096 285659529216 

Value is: Inf 
Gradient is: 0 0 
X is: -2.826342e+19 -2.503904e+19
```
This has been a lot less successful than before. At first, the loss decreases nicely; but then, the estimate dramatically overshoots, and keeps bouncing between negative and positive outer space ever after.
Luckily, there is something we can do.
L-BFGS with line search
Run in isolation, what a quasi-Newton method like L-BFGS does is determine the best descent direction. However, as we just saw, a good direction is not enough: with the flower function, wherever we are, the optimal path leads to disaster if we stay on it long enough. What we need is an algorithm that carefully evaluates not only where to go, but also how far.

For this reason, L-BFGS implementations commonly incorporate line search: a set of rules indicating whether a proposed step length is a good one, or should be improved.
Specifically, torch's L-BFGS optimizer implements the Strong Wolfe conditions. We re-run the above code, changing just two lines. Most importantly, the one where the optimizer is instantiated:
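In torch, this means passing line_search_fn when constructing the optimizer:

```r
# instantiate L-BFGS with Strong Wolfe line search
optimizer <- optim_lbfgs(x, line_search_fn = "strong_wolfe")
```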
And secondly: this time, after three iterations the loss was still decreasing, so I let the algorithm run for five. Here is the output:
```
Iteration: 1 
... 
Value is: -0.8838741 
Gradient is: 3.742207 7.521572 
X is: 0.09035123 -0.03220009 

Value is: -0.928809 
Gradient is: 1.464702 0.9466625 
X is: 0.06564617 -0.026706 

Iteration: 2 
... 
Value is: -0.9991404 
Gradient is: 39.28394 93.40318 
X is: 0.0006493925 -0.0002656128 

Value is: -0.9992246 
Gradient is: 6.372203 12.79636 
X is: 0.0007130796 -0.0002947929 

Iteration: 3 
... 
Value is: -0.9997789 
Gradient is: 3.565234 5.995832 
X is: 0.0002042478 -8.457939e-05 

Value is: -0.9998025 
Gradient is: -4.614189 -13.74602 
X is: 0.0001822711 -7.553725e-05 

Iteration: 4 
... 
Value is: -0.9999917 
Gradient is: -382.3041 -921.4625 
X is: -6.320081e-06 2.614706e-06 

Value is: -0.9999923 
Gradient is: -134.0946 -321.2681 
X is: -6.921942e-06 2.865841e-06 

Iteration: 5 
... 
Value is: -0.9999999 
Gradient is: -3446.911 -8320.007 
X is: -7.267168e-08 3.009783e-08 

Value is: -0.9999999 
Gradient is: -3419.361 -8253.501 
X is: -7.404627e-08 3.066708e-08
```
This is still not perfect, but a lot better.
Finally, let's go one step further: can we use torch for constrained optimization?
Quadratic penalty for constrained optimization
In constrained optimization, we still search for a minimum, but that minimum cannot be located just anywhere: its location has to satisfy some number of additional conditions. In optimization parlance, the solution is said to be "subject to" constraints.
To illustrate, we stay with the flower function, but add a constraint: the minimum has to lie outside a circle of radius sqrt(2), centered at the origin. Formally, this yields the inequality constraint

x1^2 + x2^2 - 2 >= 0
One way to minimize the flower function while honoring the constraint is to use a penalty function. With penalty methods, the value to be minimized is a sum of two things: the target function's output and a penalty term reflecting potential constraint violation. Using a quadratic penalty, for example, results in adding a multiple of the square of the constraint function's output:
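A sketch in code; the exact formulation is an assumption (torch_clamp() keeps only violations, i.e., negative values of the constraint function):

```r
# inequality constraint: x1^2 + x2^2 - 2 >= 0
constraint <- function(x) torch_square(x[1]) + torch_square(x[2]) - 2

# quadratic penalty: positive only where the constraint is violated
penalty <- function(x) torch_square(torch_clamp(constraint(x), max = 0))
```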
A priori, we cannot know how big that multiplier has to be to enforce the constraint. Therefore, optimization proceeds iteratively: we start with a small multiplier, 1 say, and increase it for as long as the constraint is still violated:
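A sketch of this outer loop; the function name penalty_method() follows the text below, while the doubling factor gamma = 2 is inferred from the rho values in the output:

```r
penalty_method <- function(f, p, x, k_max, rho = 1, gamma = 2, num_iterations = 1) {

  for (k in 1:k_max) {
    cat("Starting step:", k, ", rho =", rho, "\n")

    minimize(f, p, x, rho, num_iterations)

    cat("Value:", as.numeric(f(x)), "\n")
    cat("X:", as.matrix(x), "\n")

    current_penalty <- as.numeric(p(x))
    cat("Penalty:", current_penalty, "\n\n")
    # stop as soon as the constraint is satisfied
    if (current_penalty == 0) break

    rho <- rho * gamma
  }
}
```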
minimize(), called from penalty_method(), follows the usual procedure, but now it minimizes the sum of the target function's output and the up-weighted penalty:
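A sketch; creating the optimizer inside minimize(), as well as the lowered tolerance_change (one of the changes mentioned just below), are assumptions:

```r
minimize <- function(f, p, x, rho, num_iterations) {

  # assumption: lowering tolerance_change keeps L-BFGS from exiting too early
  optimizer <- optim_lbfgs(x,
                           line_search_fn = "strong_wolfe",
                           tolerance_change = 1e-20)

  calc_loss <- function() {
    optimizer$zero_grad()
    # combined objective: target function plus up-weighted penalty
    value <- f(x) + rho * p(x)
    value$backward()
    value
  }

  for (i in 1:num_iterations) {
    cat("Iteration:", i, "\n")
    optimizer$step(calc_loss)
  }
}
```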
This time, we start from a point with low target loss, but one that is infeasible. With one more change to L-BFGS defaults, namely the decrease in change tolerance shown above, we see the algorithm exiting successfully after twenty-two steps, at the point (0.5411692, 1.306563).
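Putting it together might look like this; the starting point is a placeholder, since the text does not give the exact value used:

```r
# a hypothetical starting point inside the forbidden circle
x <- torch_tensor(c(0.5, 0.5), requires_grad = TRUE)

penalty_method(flower, penalty, x, k_max = 30)
```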
```
Starting step: 1, rho = 1 
Iteration: 1 
Value: 0.3469974 
X: 0.5154735 1.244463 
Penalty: 0.03444662 

Starting step: 2, rho = 2 
Iteration: 1 
Value: 0.3818618 
X: 0.5288152 1.276674 
Penalty: 0.008182613 

Starting step: 3, rho = 4 
Iteration: 1 
Value: 0.3983252 
X: 0.5351116 1.291886 
Penalty: 0.001996888 

...

Starting step: 20, rho = 524288 
Iteration: 1 
Value: 0.4142133 
X: 0.5411959 1.306563 
Penalty: 3.552714e-13 

Starting step: 21, rho = 1048576 
Iteration: 1 
Value: 0.4142134 
X: 0.5411956 1.306563 
Penalty: 1.278977e-13 

Starting step: 22, rho = 2097152 
Iteration: 1 
Value: 0.4142135 
X: 0.5411962 1.306563 
Penalty: 0
```
Conclusion
This post has been a first introduction to torch's L-BFGS optimizer and, in particular, to the usefulness of pairing it with Strong Wolfe line search. In fact, in numerical optimization, as opposed to deep learning, where computational speed is much more of a bottleneck, there is hardly ever a reason not to use L-BFGS with line search.
We have also had a glimpse of constrained optimization, a task that arises in many real-world applications. In that regard, this post feels much more like a beginning than a wrap-up. There is a lot left to explore, from general method match-making, such as when plain L-BFGS is well suited to a problem and when line search is needed, to computational efficiency, to applicability to different species of neural networks. If this inspires you to run your own experiments, or if you use L-BFGS in your own projects, we would love to hear your feedback!
Thanks for reading!
Appendix
Rosenbrock function plotting code
Flower function plotting code