▸Optimization algorithms :

These solutions are for reference only.

It is recommended that you should solve the assignment and quiz by yourself honestly then only it makes sense to complete the course.

but if you cant figure out some part of it than you can refer these solutions

make sure you understand the solution

dont just copy paste it

answers in green colour

-------------------------------------------------------------------------------------------

Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

Which of these statements about mini-batch gradient descent do you agree with?
- Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
- One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

Why is the best mini-batch size usually not 1 and not m, but instead something in between?
- If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
- If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
- If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
- If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.

Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

Which of the following do you agree with?

- If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
- Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
- Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
- If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.

Which of the following statements about Adam is False?
- Adam combines the advantages of RMSProp and momentum
- We usually use “default” values for the hyperparameters β1, β2 and ε in Adam (β1 = 0.9, β2 = 0.999, $\large \varepsilon = 10^{-8}$ )
- The learning rate hyperparameter α in Adam usually needs to be tuned.
- Adam should be used with batch gradient computations, not with mini-batches.

COURSERA:Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (Week 2) Quiz