Coursera: Machine Learning - Andrew NG(Week 5) Quiz - Neural Networks: Learning - codemummy |online technical computer science platform.

These solutions are for reference only.

try to solve on your own

but if you get stuck in between than you can refer these solutions

there are different set of questions ,

read questions carefully before marking

-----------------------------------------------------------------------------------------

Question 1

You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function.  In the backpropagation algorithm, one of the steps is to update
        \Delta^{(2)}_{ij} := \Delta^{(2)}_{ij} +  \delta^{(3)}_i * (a^{(2)})_jΔij(2)​:=Δij(2)​+δi(3)​∗(a(2))j​
        for every i, ji,j.  Which of the following is a correct vectorization of this step?

1 point

\Delta^{(2)} := \Delta^{(2)} +  \delta^{(2)} * (a^{(2)})^TΔ(2):=Δ(2)+δ(2)∗(a(2))T

\Delta^{(2)} := \Delta^{(2)} +   (a^{(3)})^T * \delta^{(3)}Δ(2):=Δ(2)+(a(3))T∗δ(3)

\Delta^{(2)} := \Delta^{(2)} +  \delta^{(3)} * (a^{(3)})^TΔ(2):=Δ(2)+δ(3)∗(a(3))T

\Delta^{(2)} := \Delta^{(2)} +  \delta^{(3)} * (a^{(2)})^TΔ(2):=Δ(2)+δ(3)∗(a(2))T

EXPLANATION:

This version is correct, as it takes the “outer product” of the two vectors $\delta^{(3)}$ and $a^{(2)}$ which is a matrix such that the (i,j)-th entry is as $\delta_i^{(3)} * (a^{(2)})_j$ desired.

\tt{reshape(thetaVec(16:39), 4, 6)}reshape(thetaVec(16:39),4,6)

\tt{reshape(thetaVec(15:38), 4, 6)}reshape(thetaVec(15:38),4,6)

\tt{reshape(thetaVec(16:24), 4, 6)}reshape(thetaVec(16:24),4,6)

\tt{reshape(thetaVec(15:39), 4, 6)}reshape(thetaVec(15:39),4,6)

\tt{reshape(thetaVec(16:39), 6, 4)}reshape(thetaVec(16:39),6,4)

EXPLANATION:
Theta1 has 15 elements, so Theta2 begins at index 16 and ends at index 16 + 24 - 1 = 39.

Question 3

Let J(\theta) = 2 \theta^3 + 2J(θ)=3θ3+4.  Let \theta=1θ=1, and \epsilon = 0.01ϵ=0.01.  Use the formula \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}2ϵJ(θ+ϵ)−J(θ−ϵ)​ to numerically compute an approximation to the derivative at \theta=1θ=1.  What value do you get? (When \theta = 1θ=1, the true/exact derivati ve is \frac{dJ(\theta)}{d\theta} = 6dθdJ(θ)​=12.)

1 point

11.9988

6

12

12.0012

EXPLANATION:

Using gradient checking can help verify if one's implementation of backpropagation is bug-free.

Using a large value of \lambdaλ cannot hurt the performance of your neural network; the only reason we do not set \lambdaλ to be too large is to avoid numerical problems.

Gradient checking is useful if we are using gradient descent as our optimization algorithm.  However, it serves little purpose if we are using one of the advanced optimization methods (such as in fminunc).

If our neural network overfits the training set, one reasonable step to take is to increase the regularization parameter \lambdaλ.

EXPLANATION:

Using gradient checking can help verify if one's implementation of backpropagation is bug-free. (TRUE)

If the gradient computed by backpropagation is the same as one computed numerically with gradient checking, this is very strong evidence that you have a correct implementation of backpropagation.

If our neural network overfits the training set, one reasonable step to take is to increase the regularization parameter λ. (TRUE)

Just as with logistic regression, a large value of λ will penalize large parameter values, thereby reducing the changes of overfitting the training set.

OTHER STATEMENTS WHICH CAN OCCUR IN THE ABOVE 4 TH QUESTION:

For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network. (TRUE)

Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking.(FALSE)

Gradient checking is useful if we are using one of the advanced optimization methods (such as in fminunc) as our optimization algorithm. However, it serves little purpose if we are using gradient descent.(FALSE)

If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of "symmetry breaking" because the parameters are no longer symmetrically equal to zero.

Suppose you are training a neural network using gradient descent.  Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).

Suppose you have a three layer network with parameters \Theta^{(1)}Θ(1) (controlling the function mapping from the inputs to the hidden units) and \Theta^{(2)}Θ(2) (controlling the mapping from the hidden units to the outputs).  If we set all the elements of \Theta^{(1)}Θ(1) to be 0, and all the elements of \Theta^{(2)}Θ(2) to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input.

If we are training a neural network using gradient descent, one reasonable "debugging" step to make sure it is working is to plot J(\Theta)J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.

EXPLANATION:

Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions). (TRUE)

=>The cost function for a neural network is non-convex, so it may have multiple minima. Which minimum you find with gradient descent depends on the initialization.

If we are training a neural network using gradient descent, one reasonable "debugging" step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration. (TRUE)

=>Since gradient descent uses the gradient to take a step toward parameters with lower cost (ie, lower J(Θ)), the value of J(Θ) should be equal or less at each iteration if the gradient computation is correct and the learning rate is set properly.

---------------------------------------------------------------------------------

reference : coursera

Coursera: Machine Learning - Andrew NG(Week 5) Quiz - Neural Networks: Learning

Neural Networks: Learning

Top Coding Questions

Quantitative Aptitude

Popular Posts

Tags