
Gradient Descent Essentials


Gradient descent is arguably the most widely used optimization technique. At least in my personal research, it accounts for the majority of my solutions.

Gradient descent looks pretty simple at first glance, but it is worthwhile to dig into it a little bit.

What is Gradient Descent?

Think about optimizing an unconstrained, smooth convex optimization problem

$$\min_{x} f(x),$$

where $f$ is convex and differentiable.

Gradient descent solves the above problem as follows: choose an initial point $x^{(0)}$, then repeat the update

$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \qquad k = 1, 2, 3, \ldots,$$

and stop at some point.

Intuitively, this is easy to understand: at every iteration we move along the negative gradient, the direction that decreases the function value, and how far we move is controlled by the step size $t_k$.
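To make the update concrete, here is a minimal sketch in Python; the gradient-norm stopping rule, the default step size, and the toy quadratic in the example are my own illustrative choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, max_iters=1000, tol=1e-8):
    """Repeat x <- x - t * grad_f(x) and stop when the gradient is (almost) zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # "stop at some point": here, when the gradient vanishes
            break
        x = x - step_size * g         # move along the negative gradient direction
    return x

# Example: f(x) = (x1 - 3)^2 + (x2 + 1)^2, whose gradient is 2 * (x - [3, -1]).
x_min = gradient_descent(lambda x: 2.0 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0])
print(x_min)   # approximately [3, -1]
```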

Mathematical Interpretation

The intuition behind gradient descent is pretty clear. However, as a mathematical tool, it should be rooted in a solid mathematical foundation.

Remember the Taylor series,

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x) + \cdots$$

In optimization, a widely used trick is the quadratic approximation based on the Taylor expansion,

$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x).$$

Here, we play the same trick with a slight change: we replace the Hessian $\nabla^2 f(x)$ by $\frac{1}{t} I$, which leads to the following approximation:

$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \| y - x \|_2^2.$$

  • $f(x) + \nabla f(x)^T (y - x)$ represents the linear approximation to $f$.
  • $\frac{1}{2t} \| y - x \|_2^2$ represents the proximity term to $x$, with a weight $\frac{1}{2t}$.

Going back to our minimization problem, we want to find the minimizer iteratively. Assume we have just finished the $k$th iteration and are about to start the $(k+1)$th iteration. We have to solve the following problem:

$$x^{(k+1)} = \arg\min_{y} f(y).$$

By applying the slightly changed quadratic approximation around $x^{(k)}$, we will instead solve the following new problem:

$$x^{(k+1)} = \arg\min_{y} \; f(x^{(k)}) + \nabla f(x^{(k)})^T (y - x^{(k)}) + \frac{1}{2t} \| y - x^{(k)} \|_2^2.$$

Of course, this is much easier to solve. Let's do the high school algebra: setting the gradient of the objective with respect to $y$ to zero gives

$$\nabla f(x^{(k)}) + \frac{1}{t} \left( y - x^{(k)} \right) = 0 \quad \Longrightarrow \quad x^{(k+1)} = x^{(k)} - t \, \nabla f(x^{(k)}),$$

which is exactly the gradient descent update.

Note that we apply this approximation at every iteration, and we can set $t$ to a different value at each iteration, such as setting $t$ to $t_k$ at the $k$th iteration.
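As a quick numerical sanity check of this derivation, the sketch below minimizes the quadratic surrogate around a point $x^{(k)}$ with a generic solver and confirms that the minimizer coincides with $x^{(k)} - t \nabla f(x^{(k)})$; the particular convex function, point, and step size are arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# An arbitrary smooth convex function and its gradient (illustration only).
f = lambda x: np.log(1.0 + np.exp(x[0])) + x[1] ** 2
grad_f = lambda x: np.array([1.0 / (1.0 + np.exp(-x[0])), 2.0 * x[1]])

x_k = np.array([1.0, -2.0])   # current iterate
t = 0.3                       # step size

def surrogate(y):
    """Quadratic approximation of f around x_k with the Hessian replaced by I / t."""
    d = y - x_k
    return f(x_k) + grad_f(x_k) @ d + d @ d / (2.0 * t)

y_star = minimize(surrogate, x_k).x   # numerical minimizer of the surrogate
gd_step = x_k - t * grad_f(x_k)       # closed-form gradient descent update
print(np.allclose(y_star, gd_step, atol=1e-4))   # True
```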

Here is a plot illustration from my instructor Ryan’s slides.

[Figure: gd_approx_illustration]

How to choose Step Size

Setting the step size is an art in gradient descent, and it can directly determine whether your optimization procedure succeeds or fails.

I have seen lots of presentations and papers talk about this. Some did it right, some did it wrong, and some did it partially right (which, of course, means partially wrong).

There is no need to argue the importance of the step size: if it is too small, it takes forever to converge; if it is too large, the iterates jump back and forth and never reach optimality.
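A toy run on $f(x) = x^2$ (an arbitrary example of mine) makes both failure modes visible:

```python
def run(step, iters=100):
    """Gradient descent on f(x) = x^2, whose gradient is 2x, starting from x = 1."""
    x = 1.0
    for _ in range(iters):
        x = x - step * 2.0 * x
    return x

print(run(0.001))   # about 0.82: too small a step, converges painfully slowly
print(run(0.5))     # 0.0: a well-chosen step lands on the minimizer
print(run(1.1))     # huge magnitude: too large a step, the iterates diverge
```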

A general way to choose the step size adaptively is backtracking line search. It is a simple and general solution, and lots of people have demonstrated its practical usefulness.

The backtracking line search works as follows:

  1. First fix parameters $0 < \beta < 1$ and $0 < \alpha \le 1/2$.
  2. At each iteration, start with $t = 1$, and while $f(x - t \nabla f(x)) > f(x) - \alpha t \| \nabla f(x) \|_2^2$, shrink $t = \beta t$. Else perform the gradient descent update $x^{+} = x - t \nabla f(x)$.

Usually, $\alpha$ can simply be set to 1/2.
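Here is a minimal sketch of the procedure above in Python; the defaults $\alpha = 1/2$ and $\beta = 1/2$ follow the text, while the function names and the example objective are assumptions of mine.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.5, beta=0.5):
    """One gradient descent step with the backtracking line search described above."""
    g = grad_f(x)
    t = 1.0
    # While f(x - t * grad) > f(x) - alpha * t * ||grad||^2, shrink t = beta * t.
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t *= beta
    return x - t * g   # perform the gradient descent update with the accepted t

# Example usage on the poorly scaled quadratic f(x) = x1^2 + 10 * x2^2.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([5.0, 5.0])
for _ in range(50):
    x = backtracking_step(f, grad_f, x)
print(x)   # close to the minimizer [0, 0]
```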

Interpretation

The exit condition $f(x - t \nabla f(x)) \le f(x) - \alpha t \| \nabla f(x) \|_2^2$ asks that the step achieve at least a fraction $\alpha$ of the decrease promised by the linear approximation at $x$; backtracking keeps shrinking $t$ until the step is small enough for this sufficient-decrease condition to hold.

Constant Step Size

I have to admit the power of backtracking line search. However, in some (actually, lots of) cases, we don't even need backtracking.

When the function $f$ is

  1. convex,
  2. differentiable,
  3. $\mathrm{dom}(f) = \mathbb{R}^n$, and
  4. $\nabla f$ is Lipschitz continuous with constant $L > 0$,

then, if our objective function satisfies the above conditions, we can simply set the step size to a constant!

Lipschitz continuous

A function $g$ is Lipschitz continuous if, for any $x$ and $y$, we can always find a positive constant $L$ such that

$$|g(x) - g(y)| \le L \, |x - y|.$$

Here we use the absolute value $|\cdot|$ since our input is a scalar; in general, the distance function depends on the metric space.

In our case, we require $\nabla f$ to be Lipschitz continuous, which means

$$\| \nabla f(x) - \nabla f(y) \|_2 \le L \, \| x - y \|_2 \quad \text{for any } x, y.$$
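For instance, for the least-squares objective $f(x) = \frac{1}{2} \| Ax - b \|_2^2$, the gradient $\nabla f(x) = A^T (Ax - b)$ is Lipschitz continuous with constant $L = \lambda_{\max}(A^T A)$. The sketch below checks this numerically on a random instance; the data and problem sizes are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad_f

# ||grad_f(x) - grad_f(y)|| <= L * ||x - y|| should hold for any x, y.
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad_f(x) - grad_f(y)) <= L * np.linalg.norm(x - y) + 1e-12
print("Lipschitz condition holds on all sampled pairs")
```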

Choice of step size

Theorem

Gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^{\star} \le \frac{\| x^{(0)} - x^{\star} \|_2^2}{2 t k}.$$

Proof

The above Theorem leads to the following two equivalent lemmas.

Lemma 1

Gradient descent with fixed step size $t \le 1/L$ has convergence rate $O(1/k)$.

Lemma 2

Gradient descent with fixed step size $t \le 1/L$ needs $O(1/\epsilon)$ iterations to get $f(x^{(k)}) - f^{\star} \le \epsilon$.
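To see the theorem and the lemmas in action, here is a small sketch that runs gradient descent with the fixed step size $t = 1/L$ on a least-squares problem and checks that the suboptimality gap stays below the $\| x^{(0)} - x^{\star} \|_2^2 / (2 t k)$ bound at every iteration; the random instance is again an arbitrary illustration of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()      # Lipschitz constant of the gradient
t = 1.0 / L                                # constant step size allowed by the theorem
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = f(x_star)

x0 = np.zeros(5)
x = x0.copy()
for k in range(1, 201):
    x = x - t * grad_f(x)
    bound = np.dot(x0 - x_star, x0 - x_star) / (2.0 * t * k)
    assert f(x) - f_star <= bound + 1e-12  # the O(1/k) guarantee from the theorem
print("bound holds for all k; final gap:", f(x) - f_star)
```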

