An Intro to the BFGS Optimisation Algorithm
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a local search optimisation algorithm.
It is a type of second-order optimisation algorithm, meaning that it makes use of the second-order derivative of an objective function. It belongs to a class of algorithms referred to as Quasi-Newton methods, which approximate the second derivative (referred to as the Hessian) for optimisation problems where the second derivative cannot be calculated directly.
The BFGS algorithm is probably one of the most widely used second-order algorithms for numerical optimisation and is commonly used to fit machine learning models such as logistic regression.
In this guide, you will find out about the BFGS second-order optimisation algorithm.
After going through this guide, you will be aware of:
- That second-order optimisation algorithms are algorithms that harness the second-order derivative, referred to as the Hessian matrix for multivariate objective functions.
- The BFGS algorithm is probably the most popular second-order algorithm for numerical optimisation and comes from a group referred to as Quasi-Newton methods.
- How to minimize objective functions leveraging the BFGS and L-BFGS-B algorithms within Python.
Tutorial Overview
This tutorial is divided into three parts, which are:
1] Second-order Optimisation Algorithms
2] BFGS Optimisation Algorithm
3] Worked Example of BFGS
Second-order Optimisation Algorithms
Optimisation consists of identifying values for input parameters that maximize or minimize an objective function.
Newton-method optimisation algorithms are those algorithms that leverage the second derivative of the objective function.
You may remember from calculus that the first derivative of a function is the rate of change, or slope, of the function at a specific point. An optimisation algorithm can follow the derivative downhill (or uphill) toward the minima of the function (the input values that result in the smallest output of the objective function).
Algorithms that use the first derivative are referred to as first-order optimisation algorithms; an example of a first-order algorithm is the gradient descent optimisation algorithm.
First-order methods: Optimisation algorithms that leverage the first-order derivative to identify the optima of an objective function.
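To make the idea concrete, below is a minimal sketch of gradient descent on the simple one-dimensional function f(x) = x^2. The function, starting point, step size, and number of iterations are illustrative choices for this sketch rather than part of the worked example later in the tutorial.

# minimal gradient descent sketch on f(x) = x^2 (illustrative values)
def f_prime(x):
    # first derivative of f(x) = x^2
    return 2.0 * x

x = 3.0           # arbitrary starting point
step_size = 0.1   # fixed step size chosen by hand
for _ in range(50):
    x = x - step_size * f_prime(x)
print(x)  # approaches the minimum at x = 0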
The second-order derivative is the derivative of the derivative, or the rate of change of the rate of change.
We can follow the second-order derivative to more efficiently locate the optima of the objective function. This makes sense more generally, as the more information we have about the objective function, the easier it may be to optimise it.
The second-order derivative tells us both which direction to move (like the first-order derivative does) and also gives an estimate of how far to move in that direction, referred to as the step size.
Second-order information, on the other hand, enables us to make a quadratic approximation of the objective function and approximate the correct step size to attain a local minimum.
Algorithms that use the second-order derivative are referred to as second-order optimisation algorithms.
Second-order methods: Optimisation algorithms that leverage the second-order derivative to identify the optima of an objective function.
An example of a second-order optimisation algorithm is Newton’s method.
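For intuition, a single Newton step on the same one-dimensional quadratic f(x) = x^2 uses the second derivative to determine both the direction and the step size, landing on the minimum in one update. This is a simplified sketch with arbitrary values, not part of the worked example later.

# one Newton step on f(x) = x^2 (illustrative sketch)
def f_prime(x):
    # first derivative
    return 2.0 * x

def f_second(x):
    # second derivative (constant for a quadratic)
    return 2.0

x0 = 3.0                                # arbitrary starting point
x1 = x0 - f_prime(x0) / f_second(x0)    # Newton update: step size comes from the curvature
print(x1)  # 0.0, the exact minimum of this quadratic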
When an objective function has more than one input variable, the input variables together can be thought of as a vector, which may be familiar from linear algebra.
The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, enabling us to forecast the influence of taking a small step from a point in any direction.
Likewise, the first derivative of a function of several input variables is also a vector, where each element is referred to as a partial derivative. This vector of partial derivatives is referred to as the gradient.
Gradient: Vector of partial first derivatives for several input variables of an objective function.
This idea generalizes to the second-order derivatives of the multivariate inputs, which form a matrix containing the second derivatives, referred to as the Hessian matrix.
Hessian: Matrix of partial second-order derivatives for several input variables of an objective function.
The Hessian matrix is square and symmetric if the second derivatives are all continuous at the point where we are calculating the derivatives. This is often the case when solving real-valued optimisation problems and an expectation when using many second-order methods.
The Hessian of a multivariate function is a matrix containing all of the second derivatives with respect to the input. The second derivatives capture information about the local curvature of the function.
As such, it is common to describe second-order optimisation algorithms as using the Hessian to navigate to the optima of the objective function.
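As a concrete illustration, consider the bowl function f(x, y) = x^2 + y^2 that is used in the worked example later in this tutorial. Its gradient and Hessian can be written down directly; the helper functions below are an illustrative sketch, not part of SciPy.

# gradient and Hessian of the bowl function f(x, y) = x^2 + y^2 (illustrative)
from numpy import asarray

def gradient(x):
    # vector of partial first derivatives: [2*x, 2*y]
    return asarray([2.0 * x[0], 2.0 * x[1]])

def hessian(x):
    # matrix of partial second derivatives; constant for this quadratic
    return asarray([[2.0, 0.0],
                    [0.0, 2.0]])

print(gradient([1.0, 2.0]))  # [2. 4.]
print(hessian([1.0, 2.0]))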
Now that we have a high-level understanding of second-order optimisation algorithms, let’s take a closer look at the BFGS algorithm.
BFGS Optimisation Algorithm
BFGS is a second-order optimisation algorithm.
The acronym is named after the four co-discoverers of the algorithm: Broyden, Fletcher, Goldfarb, and Shanno.
It is a local search algorithm, intended for convex optimisation problems with a single optimum.
The BFGS algorithm is perhaps best understood as belonging to a group of algorithms that are an extension of Newton’s method, referred to as Quasi-Newton methods.
Newton’s method is a second-order optimisation algorithm that leverages the Hessian matrix.
A limitation of Newton’s method is that it requires the calculation of the inverse of the Hessian matrix. This is a computationally expensive operation and may not be stable, depending on the properties of the objective function.
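To see where the cost comes from, a full Newton step solves a linear system involving the Hessian (or forms its inverse explicitly). The sketch below repeats the gradient and Hessian of the bowl function from above so that it is self-contained and uses numpy.linalg.solve; the names and starting point are illustrative assumptions.

# a single multivariate Newton step (illustrative sketch)
from numpy import asarray
from numpy.linalg import solve

def gradient(x):
    return asarray([2.0 * x[0], 2.0 * x[1]])

def hessian(x):
    return asarray([[2.0, 0.0], [0.0, 2.0]])

x = asarray([3.0, -2.0])               # arbitrary starting point
step = solve(hessian(x), gradient(x))  # solve H * step = g rather than inverting H
x_new = x - step
print(x_new)  # [0. 0.] in a single step for this quadratic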
Quasi-Newton methods are among the most widely used methods for nonlinear optimisation. They are incorporated in many software libraries, and they are effective in solving a wide variety of small to midsize problems, in particular when the Hessian is hard to compute.
The main difference between Quasi-Newton optimisation algorithms is the specific way in which the approximation of the inverse Hessian is updated.
The BFGS algorithm is one specific way of updating the calculation of the inverse Hessian instead of recalculating it every iteration. It, or its extensions, may be one of the most popular Quasi-Newton, or even second-order, optimisation algorithms used for numerical optimisation.
An advantage of using the Hessian, when available, is that it can be used to determine both the direction and the step size to move in order to change the input parameters and minimize (or maximize) the objective function.
Quasi-Newton methods like BFGS approximate the inverse Hessian, which can then be used to determine the direction to move, but we no longer have the step size.
The BFGS algorithm addresses this by using a line search in the chosen direction to determine how far to move in that direction.
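For readers who want to see the shape of the update without the full derivation, below is a minimal sketch of the standard BFGS rank-two update of the inverse Hessian approximation. The names s (change in position) and y (change in gradient) follow common textbook notation; this is a simplified illustration, not a production implementation with safeguards.

# one BFGS update of the inverse Hessian approximation H (illustrative sketch)
from numpy import eye, outer

def bfgs_update(H, s, y):
    # s: change in position (x_new - x_old), y: change in gradient (g_new - g_old)
    rho = 1.0 / y.dot(s)
    I = eye(len(s))
    left = I - rho * outer(s, y)
    right = I - rho * outer(y, s)
    # H_new = (I - rho*s*y') * H * (I - rho*y*s') + rho*s*s'
    return left.dot(H).dot(right) + rho * outer(s, s)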
For the derivation and the calculations used by the BFGS algorithm, the resources listed in the further reading section of this article are highly recommended.
The Hessian and its inverse are n x n matrices, where n is the number of input parameters to the objective function. As such, the matrix can become very large for hundreds, thousands, or millions of parameters.
The BFGS algorithm must store the inverse Hessian matrix, M, which requires O(n^2) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.
Limited Memory BFGS (or L-BFGS) is an extension of the BFGS algorithm that addresses the cost of having a large number of parameters. It does this by not requiring that the entire approximation of the inverse Hessian be stored; instead, it keeps only a short history of recent changes in position and gradient, from which the effect of the inverse Hessian approximation can be reconstructed at each iteration.
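As a rough sketch of how that limited history is used (a simplified version of the well-known two-loop recursion, with illustrative variable names and no safeguards for an empty history or degenerate curvature), the search direction can be computed from a short list of recent (s, y) pairs without ever forming the full n x n matrix. The inputs are assumed to be NumPy arrays.

# simplified L-BFGS two-loop recursion (illustrative sketch)
def lbfgs_direction(grad, history):
    # grad: current gradient; history: list of (s, y) pairs from recent iterations, oldest first
    q = grad.copy()
    alphas = []
    for s, y in reversed(history):
        rho = 1.0 / y.dot(s)
        alpha = rho * s.dot(q)
        q = q - alpha * y
        alphas.append((alpha, rho))
    alphas.reverse()
    # initial scaling of the implicit inverse Hessian approximation
    s, y = history[-1]
    r = (s.dot(y) / y.dot(y)) * q
    for (alpha, rho), (s, y) in zip(alphas, history):
        beta = rho * y.dot(r)
        r = r + s * (alpha - beta)
    return -r  # approximates -inverse(Hessian) * grad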
Now that we are acquainted with the BFGS algorithm from a high level, let’s delve into how we can leverage it.
Worked Example of BFGS
In this section, we will look at examples of using the BFGS optimisation algorithm.
We can apply the BFGS algorithm to optimise arbitrary functions in Python using the minimize() SciPy function.
The function takes several arguments, but most importantly, we can specify the objective function as the first argument, the starting point for the search as the second argument, and set the “method” argument to “BFGS”. The name of the function used to calculate the derivative of the objective function can be specified via the “jac” argument.
…
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
Let’s take a look at an instance.
To start with, we can define a simple two-dimensional objective function, a bowl function, e.g. f(x, y) = x^2 + y^2. It is simply the sum of the squared input variables, with an optimum at f(0, 0) = 0.0.
# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0
Next, let’s define a function for the derivative of the objective function, which is [x*2, y*2].
# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]
We will then define the bounds of the function as a box with the range -5 to 5 in each dimension.
…
# define range for input
r_min, r_max = -5.0, 5.0
The starting point of the search will be a randomly generated position in the search domain.
…
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
We can then apply the BFGS algorithm to find the minima of the objective function by specifying the name of the objective function, the initial point, the method we want to use (BFGS), and the name of the derivative function.
…
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
We can then review the result, reporting a message on whether the algorithm finished successfully or not and the total number of evaluations of the objective function that were performed.
…
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
Lastly, we can report the input variables that were found and their evaluation against the objective function.
…
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
Tying all of this together, the complete example is listed below.
# bfgs algorithm local optimization of a convex function
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
Running the example applies the BFGS algorithm to the objective function and reports the results.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the algorithm performed four evaluations of the objective function and found a solution very close to the optimum f(0.0, 0.0) = 0.0, at least to a useful degree of precision.
Status : Optimization terminated successfully.
Total Evaluations: 4
Solution: f([0.00000000e+00 1.11022302e-16]) = 0.00000
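The result returned by minimize() exposes a few other fields that can be useful to inspect. For example (the exact fields available can vary with the chosen method), we can report the number of iterations, the final objective value, and the approximation of the inverse Hessian that BFGS builds up.

…
# inspect additional fields of the result (availability can vary by method)
print('Iterations: %d' % result['nit'])
print('Objective value: %.5f' % result['fun'])
print('Approximate inverse Hessian:')
print(result['hess_inv'])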
The minimize() function also supports the L-BFGS algorithm, which has lower memory requirements than BFGS.
Specifically, it supports the L-BFGS-B variant of the algorithm, where the -B suffix indicates a “boxed” version of the algorithm in which bounds on the domain can be specified.
This can be achieved by setting the “method” argument to “L-BFGS-B”.
…
# perform the l-bfgs-b algorithm search
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
The complete example with this update is listed below.
# l-bfgs-b algorithm local optimization of a convex function
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the l-bfgs-b algorithm search
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
Running the example applies the L-BFGS-B algorithm to the objective function and reports the results.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the minimum of the function is found in very few evaluations.
Status : b’CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL’
Total Evaluations: 3
Solution: f([-1.33226763e-15 1.33226763e-15]) = 0.00000
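Since the -B variant accepts bounds on the input variables, it is worth noting, as a small extension of the example above rather than part of the original listing, that the box defined by r_min and r_max can be passed directly through the “bounds” argument of minimize(), with one (min, max) pair per input variable.

…
# perform the l-bfgs-b algorithm search with explicit bounds on each input variable
bounds = [(r_min, r_max), (r_min, r_max)]
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative, bounds=bounds)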
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Algorithms for Optimization, 2019
Deep Learning, 2016
Numerical Optimization, 2006
Linear and Nonlinear Optimisation, 2009
APIs
scipy.optimize.minimize API
Articles
Broyden-Fletcher-Goldfarb-Shanno algorithm, Wikipedia
Limited memory BFGS, Wikipedia
Conclusion
In this guide, you found out about the BFGS second-order optimization algorithm.
Specifically, you learned:
- Second-order optimization algorithms are algorithms that make use of the second-order derivative, referred to as the Hessian matrix for multivariate objective functions.
- The BFGS algorithm is probably the most popular second-order algorithm for numerical optimisation and comes from a grouping referred to as Quasi-Newton methods.
- How to minimize objective functions leveraging the BFGS and L-BFGS-B algorithms within Python.