### Convex Optimization in R

Optimization is a major aspect of machine learning. It is at the base of most widespread strategies, from least squares regression to artificial neural networks.

In this blog article, you will find out recipes for five optimization algorithms in R.

These strategies might be useful in the base of your own implementation of a machine learning algorithm. You might want to implement your own algorithm tuning scheme to optimize the parameters of a model for some cost function.

A good instance may be the scenario where you wish to optimize the hyper-parameters of a blend of predictions from an ensemble of several child models.

Golden Section Search

Golden Section Search is a Line Search method for Global Optimization in one-dimension. It is a Direct Search (Pattern Search) strategy as it samples the function to approximate a derivative instead of computing it directly.

The Golden Section Search is connected to pattern searches of discrete ordered listings like the binary search and the Fibonnaci Search. It is connected to other Line Search algorithms like Brent’s method and more typically to other direct search optimization strategies like the Nelder-Mead Method.

The information processing goal of the method is to situate the extremum of a function. It does this by directly sampling the function leveraging a pattern of a trio of points. The points form the brackets on the search: the first and the last points are the present bounds of the search, and the third point partitions the intervening space. The partitioning point is chosen so that the ratio between the bigger partition and the entire interval is the same as the ratio of the large partition to the small partition, referred to as the golden ratio (phi). The partitions are contrasted on the basis of their function evaluation and the better performing section is chosen as the new bounds on the search. The procedure recurses until the desired level of precision (bracketing of the optimum) is obtained or the search stalls.

The following instance furnishes a code listing Golden Section Search method in R solving a 1D nonlinear unconstrained optimization function.

den Section Search in R

R

[Control]

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 # define a 1D basin function, optima at f(0)=0 basin <- function(x) { x^2 }   # locate the minimum of the function using a Golden Section Line Search result <- optimize(  basin, # the function to be minimized c(-5, 5), # the bounds on the function parameter maximum=FALSE, # we are concerned with the function minima tol=1e-8) # the size of the final bracketing   # display the results print(result\$minimum) print(result\$objective)   # plot the function  x <- seq(-5, 5, length.out=100) y <- basin(expand.grid(x)) plot(x, y, xlab=”x”,ylab=”f(x)”, type=”l”) # plot the solution as a point points(result\$minimum, result\$objective, col=”red”, pch=19) # draw a square around the optima to highlight it rect(result\$minimum-0.3, result\$objective-0.7, result\$minimum+0.3, result\$objectiv Usage Heuristics

• Assumes that the function is convex and unimodal specifically, that the function has a singular optimum and that it lies between the bracketing points.
• Intended to identify the extrema one-dimensional continuous functions.
• It was demonstrated to be more effective than an equal-sized partition line search.
• The termination criteria is a specification on the minimum distance between the brackets on the optimum.
• It can swiftly locate the bracketed area of the optimum but features lesser efficiency at locating the particular optimum.
• After a solution of desired accuracy is situated, it can be furnished as the basis to a second search algorithm that has a quicker rate of convergence.

The Nelder-Mead method is an optimization algorithm for multidimensional nonlinear unconstrained functions.

It is a direct search strategy in that it does not leverage a function gradient during the procedure. It is a pattern search in that it leverages a geometric pattern to explore the problem space.

It is connected to other Direct Search optimization strategies like Hooke and Jeeves Pattern Search that additionally leverages a geometric pattern to optimize an objective function.

The data processing objective of the Nelder-Mead method is to situate the extremum of a function. This is accomplished by overlaying a simplex (geometrical pattern) in the domain and iteratively increasing and/or reducing its size till an optimal value is identified. The simplex is defined always with n+1 vertices, where n is the number of dimensions of the search space (i.e. a triangle for a 2D problem)

The procedure consists of identifying the worst performing point of the complex and substituting it with a point reflected through the centroid (centre point) of the remaining points. The simplex can be deformed (adapt itself to the topology of the search space) by expanding away from the worst point, contract along a single dimension away from the worst point, or contract on all dimensions towards the best point.

The following instance furnishes a code listing of the Nelder-Mead method in R solving a 2D nonlinear optimization function.

R

[Control]

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 # definition of the 2D Rosenbrock function, optima is at (1,1) rosenbrock <- function(v) {  (1 – v)^2 + 100 * (v – v*v)^2 }   # locate the minimum of the function using the Nelder-Mead method result <- optim( c(runif(1,-3,3), runif(1,-3,3)), # start at a random position rosenbrock, # the function to minimize NULL, # no function gradient  method=”Nelder-Mead”, # use the Nelder-Mead method control=c( # configure Nelder-Mead maxit=100, # maximum iterations of 100 reltol=1e-8, # response tolerance over-one step alpha=1.0, # reflection factor beta=0.5, # contraction factor gamma=2.0)) # expansion factor   # summarize results # the coordinate of the minimum print(result\$par)  # the function response of the minimum print(result\$value)  # the number of function calls performed print(result\$counts)   # display the function as a contour plot x <- seq(-3, 3, length.out=100) y <- seq(-3, 3, length.out=100) z <- rosenbrock(expand.grid(x, y)) contour(x, y, matrix(log10(z), length(x)), xlab=”x”,ylab=”y”) # draw the optima as a point points(result\$par, result\$par, col=”red”, pch=19) # draw a square around the optima to highlight it rect(result\$par-0.2, result\$par-0.2, result\$par+0.2, result\$par+0.2, l Usage Heuristics

• It can be leveraged with multi-dimensional functions (one or more parameters) and with non-linear response surfaces.
• It does not leverage a function derivative implying that it can be leveraged on non-differentiable functions, discontinuous functions, non-smooth functions, and noisy functions.
• As a direct search method, it is perceived as inefficient and slow compared to modern derivative-based methods.
• It is dependent on the standing position and can be caught by local optimum in multimodal functions.
• The stopping criteria can be a minimum change in the ideal position.
• The nature of the simplex structure can imply it can get stuck in non-optimal areas of the search space, this is more probable if the size of the initial simplex is too big.
• When the strategy does function (is relevant for the function being optimized) it has been demonstrate to be quick and robust.

Gradient descent is a first-order derivative optimization strategy for unconstrained nonlinear function optimization. It is referred to as Gradient Descent as it was envisioned for function minimization. When applied to function maximization it might be referenced to as Gradient Ascent.

Steepest Descent Search is an extension that carries out a line search on the line of the gradient to the locate the optimum neighbouring point (optimum step or steepest step). Batch gradient descent is an extension where the cost function and its derivative are computed as the summed error on a collection of training instances. Stochastic Gradient Descent (or Online Gradient Descent) is like Batch Gradient Descent except that the cost function and derivative are computed for every training instance.

The information processing goal of the method is to situate the extremum of a function. This is accomplished by initially choosing a beginning point in the search space. For a provided point in the search space, the derivative of the cost function is calculated and a new point is chosen down the gradient of the functions derivative at a distance of alpha (the step size parameter) from the present point.

The instance furnishes a code listing Gradient Descent algorithm in R finding a solution to two-dimensional nonlinear optimization function.

dient Descent in R

R

[Control]

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 # define a 2D basin function, optima is at (0,0) basin <- function(x) { x^2 + x^2 }   # define the derivative for a 2D basin function derivative <- function(x) { c(2*x, 2*x) }   # definition of the gradient descent method in 2D gradient_descent <- function(func, derv, start, step=0.05, tol=1e-8) { pt1 <- start grdnt <- derv(pt1) pt2 <- c(pt1 – step*grdnt, pt1 – step*grdnt) while (abs(func(pt1)-func(pt2)) > tol) { pt1 <- pt2 grdnt <- derv(pt1) pt2 <- c(pt1 – step*grdnt, pt1 – step*grdnt) print(func(pt2)) # print progress } pt2 # return the last point }   # locate the minimum of the function using the Gradient Descent method result <- gradient_descent( basin, # the function to optimize derivative, # the gradient of the function c(runif(1,-3,3), runif(1,-3,3)), # start point of the search  0.05, # step size (alpha) 1e-8) # relative tolerance for one step   # display a summary of the results print(result) # coordinate of fucntion minimum print(basin(result)) # response of fucntion minimum   # display the function as a contour plot x <- seq(-3, 3, length.out=100) y <- seq(-3, 3, length.out=100) z <- basin(expand.grid(x, y)) contour(x, y, matrix(z, length(x)), xlab=”x”,ylab=”y”) # draw the optima as a point points(result, result, col=”red”, pch=19) # draw a square around the optima to highlight it rect(result-0.2, result-0.2, result+0.2, result+0.2, lwd=2) Usage Heuristics

• The strategy is restricted to identifying the local optimum, which if the function is convex, is also the global optimum.
• It is viewed as ineffective and slow (linear) to converge relative to sophisticated methods. Convergence can be slow if the gradient at the optimum flattens out (gradient goes to 0 slowly). Convergence can also be slow if the Hessian is badly conditioned (gradient changes swiftly in some directions and slower in others)
• The step size (alpha) might be constant, might adapt with the search, and might be maintained holistically or for every dimension.
• The strategy is sensitive to initial conditions, and as such, it can be typical to repeat the search procedure a number of times with arbitrarily chosen initial positions.
• If the step size parameter (alpha) is too minimal, the search will typically take a large number of iterations to converge, if the parameter is too big and can overshoot the function’s optimum.
• Contrasted to non-iterative function optimization strategies, gradient descent has some comparative efficiencies when it comes to scaling with the number of features (dimensions)

Conjugate gradient method is a first-order derivative optimization strategy for multidimensional nonlinear unconstrained functions. It is connected to other first-order derivative optimization algorithms like Gradient Descent and Steepest Descent.

The information processing goal of the strategy is to situate the extremum of a function. From a beginning position, the strategy first computes the gradient to locate the direction of steepest descent, then carries out a line search to situate the optimum step size (alpha). The strategy then repeats the procedure of computing the steepest direction, computes direction of the search, and performing a line search to locate the optimum step size. A parameter beta defines the direction update rule based on the gradient and can be computed leveraging one of a number of strategies.

The difference between Conjugate Gradient and Steepest Descent is that it leverages conjugate directions instead than local gradients to move downhill towards the function minimum, which can be very effective.

The instance furnishes a code listing of the Conjugate Gradient method in R solving a two-dimensional nonlinear optimization function.

R

[Control]

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 # definition of the 2D Rosenbrock function, optima is at (1,1) rosenbrock <- function(v) {  (1 – v)^2 + 100 * (v – v*v)^2 }   # definition of the gradient of the 2D Rosenbrock function derivative <- function(v) { c(-400 * v * (v – v*v) – 2 * (1 – v),  200 * (v – v*v)) }   # locate the minimum of the function using the Conjugate Gradient method result <- optim( c(runif(1,-3,3), runif(1,-3,3)), # start at a random position rosenbrock, # the function to minimize derivative, # no function gradient  method=”CG”, # use the Conjugate Gradient method control=c( # configure Conjugate Gradient maxit=100, # maximum iterations of 100 reltol=1e-8, # response tolerance over-one step type=2)) # use the Polak-Ribiere update method   # summarize results print(result\$par) # the coordinate of the minimum print(result\$value) # the function response of the minimum print(result\$counts) # the number of function calls performed   # display the function as a contour plot x <- seq(-3, 3, length.out=100) y <- seq(-3, 3, length.out=100) z <- rosenbrock(expand.grid(x, y)) contour(x, y, matrix(log10(z), length(x)), xlab=”x”, ylab=”y”) # draw the optima as a point points(result\$par, result\$par, col=”red”, pch=19) # draw a square around the optima to highlight it rect(result\$par-0.2, result\$par-0.2, result\$par+0.2, result\$par+0.2, l Usage Heuristics

• It is more effective than Steepest Descent, for instance it might take a straight line path down narrow valleys, while Steepest Descent would have to zig-zag (pinball) down the valley.
• Devolves into steepest descent if the conjugate direction is reset every iteration.
• It is nearly as fast as second-order gradient methods, needing just n iterations to locate an optimum for appropriate functions (where n is the number of parameters).
• It does not maintain a Hessian Matrix (like BFGS) and thus might be appropriate for bigger problems with several variables.
• The strategy is most effective (functions the best) with quadratic or quadratic like-functions, or where the function is approximately quadratic near the optimum.
• The strategy is sensitive to its beginning position on non-convex problems.
• The learning rate (step size alpha) does not have to be mentioned as a line search is leveraged to locate the optimum value as required.
• Common strategies for computation of the direction update rule (beta) are the Hestenes-Stiefel, Fletcher-Reeves, Polak-Ribiere, and Beale-Sorenson Methods. The Polak-Ribiere method typically functions better in practice for non-quadratic functions.
• It is dependent on the precision of the line search, errors put forth from this and the less-than quadratic nature of the response surface will have the outcome of more frequent updates to the direction and a less effective search.
• To prevent the search degenerating, consider restarting the search process after n iterations, where n is the number of function parameters.

BFGS

BFGS is an optimization strategy for multidimensional nonlinear unconstrained functions.

BFGS comes from the family of quasi-Newton (Variable Metric) optimization strategies that leverage both first-derivative (gradient) and second-derivative (Hessian matrix) on the basis of data of the function being optimized. More particularly, it is a quasi-Newton method which implies that it approximates the second-order derivative instead of compute it directly.

It is connected to other quasi-Newton strategies such as the DFP Method, Broyden’s method and the SR1 method.

Two famous extensions of BFGS is L-BFGS (Limited Memory BFGS) which has lower memory resource requirements and L-BFGS-B (Limited Memory Boxed BFGS) which extends L-BFGS and imposes box constraints on the strategy.

The data processing objective of the BFGS Method is to locate the extremum of a function. This is accomplished by iteratively developing a good approximation of the inverse Hessian matrix. Provided an initial beginning position, it preps an approximation of the Hessian matrix (square matrix of second-order partial derivatives). It then repeats the procedure of computing the search direction leveraging the approximated Hessian, computes an optimum step size leveraging a Line Search then, updates the position, and updates the approximation of the Hessian. The strategy for updating the Hessian every iteration is referred to as the BFGS rule which insures the updated matrix is positive definite.

The instance furnishes a code listing of the BFGS method in R solving a 2D nonlinear optimization function.

in R

R

[Control]

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 # definition of the 2D Rosenbrock function, optima is at (1,1) rosenbrock <- function(v) {  (1 – v)^2 + 100 * (v – v*v)^2 }   # definition of the gradient of the 2D Rosenbrock function derivative <- function(v) { c(-400 * v * (v – v*v) – 2 * (1 – v), 200 * (v – v*v)) }   # locate the minimum of the function using the BFGS method result <- optim( c(runif(1,-3,3), runif(1,-3,3)), # start at a random position rosenbrock, # the function to minimize derivative, # no function gradient  method=”BFGS”, # use the BFGS method control=c( # configure BFGS maxit=100, # maximum iterations of 100 reltol=1e-8)) # response tolerance over-one step   # summarize results print(result\$par) # the coordinate of the minimum print(result\$value) # the function response of the minimum print(result\$counts) # the number of function calls performed   # display the function as a contour plot x <- seq(-3, 3, length.out=100) y <- seq(-3, 3, length.out=100) z <- rosenbrock(expand.grid(x, y)) contour(x, y, matrix(log10(z), length(x)), xlab=”x”, ylab=”y”) # draw the optima as a point points(result\$par, result\$par, col=”red”, pch=19) # draw a square around the optima to highlight it rect(result\$par-0.2, result\$par-0.2, result\$par+0.2, result\$par+0.2, l Usage Heuristics

• It needs a function where the function gradient (first partial derivatives) can be computed at random points.
• It does not need second-order derivatives as it approximates the Hessian matrix, making it less computationally expensive contrasted to Newton’s method.
• It needs a comparatively large memory footprint, as it maintains an n*n Hessian matrix, where n is the number of variables. This is a restriction on the methods scalability.
• The rate of convergence is super-linear and the computational expense of every iteration is O(n^2)
• The L-BFGS extension to BFGS is intended for functions with very massive numbers of parameters (>1000)
• The stopping condition is usually a minimum change is response or a minimum gradient.
• BFGS is somewhat robust and self correcting in terms of the search direction and as such it does not need the leveraging of exact line searches when deciding the step size.

Conclusion

Optimization is a critical concept to comprehend and apply carefully within applied machine learning.

In this blog post you found out five convex optimization algorithms with recipes in R that are ready to copy and paste into your own problem.

You also learned some background for every method and general heuristics for operating every algorithm.