{% verbatim %}
$$\frac{dLoss}{dw_{1}}=\frac{dLoss}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{dw_{1}}$$ $$\frac{dLoss}{dw_{2}}=\frac{dLoss}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{dw_{2}}$$ $$\frac{dLoss}{db}=\frac{dLoss}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{db}$$
$z = w_{1}x_{1} + w_{2}x_{2} + b \\ \hat{y} = \sigma(z) = \frac{1}{1+e^{-z}} \\ Loss = -y\cdot \ln(\hat{y})-(1-y)\cdot \ln(1-\hat{y})$
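The forward pass and loss above can be sketched directly in Python (a minimal sketch; the function names `forward` and `loss` are my own, not from the text):

```python
import math

def forward(x1, x2, w1, w2, b):
    """Compute the weighted sum z and the sigmoid activation y_hat."""
    z = w1 * x1 + w2 * x2 + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return z, y_hat

def loss(y_hat, y):
    """Binary cross-entropy loss for a single training example."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
```

For example, with zero weights and bias the neuron outputs $\hat{y}=0.5$ regardless of input, and the loss for a positive example ($y=1$) is $\ln 2$.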
$\hat{y}$ is the prediction and $y$ is the target/label of each training example.
The loss (error function) measures the network's efficacy: the degree of similarity between the prediction $\hat{y}$ and the target/label $y$.
The entire cost function is the average of the losses over all training examples: $\frac{1}{n} \sum_{i=1}^n Loss_{i} = \overline{Loss}$, where $n$ = number of training samples.
Necessary math preliminaries for evaluating the gradients:
$$\frac{dLoss}{dw_{1}}, \quad \frac{dLoss}{dw_{2}}, \quad \frac{dLoss}{db}$$
1) Derivative of the sigmoid function:
$\sigma '(x)=\frac { d }{ dx } \frac { 1 }{ 1+{ e }^{ -x } } =\frac { d }{ dx } { (1+{ e }^{ -x }) }^{ -1 }=-{ (1+{ e }^{ -x }) }^{ -2 }\cdot ({ -e }^{ -x })=\frac { { e }^{ -x } }{ { (1+{ e }^{ -x }) }^{ 2 } } =\frac { 1 }{ 1+{ e }^{ -x } } \cdot \frac { { e }^{ -x } }{ 1+{ e }^{ -x } } \Rightarrow \\ \sigma '(x)=\sigma (x)(1-\sigma (x))\\ $
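The closed form $\sigma'(x)=\sigma(x)(1-\sigma(x))$ can be checked numerically against a central finite difference (a quick sanity-check sketch, not part of the original derivation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

# Central finite-difference approximation at an arbitrary sample point
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(numeric - sigmoid_prime(x)) < 1e-8
```

Note that $\sigma'(0) = 0.5 \cdot 0.5 = 0.25$ is the maximum of the derivative, which is why the sigmoid saturates for large $|x|$.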
2) Chain rule:
$(f(g(x)))'\quad =\quad f'(g(x))\cdot g'(x)$
$\frac { d }{ dx } f(g(x))=\frac { d\, f(g(x)) }{ d\, g(x) } \frac { d\, g(x) }{ dx }$
which can be simplified to the following conventional notation:
$\frac{d}{dx} f(g(x))=\frac{df}{dg} \frac{dg}{dx}$
or
$\frac{d}{dx} (f\circ g)(x)=\frac{df}{dg} \frac{dg}{dx}$
and generalized:
$\frac{d}{dx} (f\circ g\circ h\circ i\circ \cdots)(x)=\frac{df}{dg} \frac{dg}{dh} \frac{dh}{di} \frac{di}{\cdots} \cdots \frac{\cdots}{dx}$
Gradient calculations
$\frac{dLoss}{d\hat{y}}=-\frac{d\,(y\ln(\hat{y}))}{d\hat{y}} -\frac{d\,((1-y)\ln(1-\hat{y}) )}{d\hat{y}}=$
$=-y\frac{d}{d\hat{y}}\ln{\hat{y}}-(1-y)\frac{d}{d\hat{y}}\ln{(1-\hat{y})}=$
$=-y\frac{1}{\hat{y}}-(1-y)\frac{1}{1-\hat{y}}\cdot\frac{d}{d\hat{y}}(1-\hat{y})=$
$=-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$
$\frac{d\,\hat{y}}{dz}=\frac{d\,\sigma(z)}{dz}=\sigma(z)(1-\sigma(z))=\hat{y}(1-\hat{y})$
$\frac{dz}{dw_{1}}=\frac{d}{dw_{1}}(w_{1}x_{1}+w_{2}x_{2}+b)=x_{1}\frac{d}{dw_{1}}w_{1}+0+0=x_{1}$
So:
$\frac{dLoss}{dw_{1}}=\frac{dLoss}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{dw_{1}}=(-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}) \cdot \hat{y}(1-\hat{y}) \cdot x_{1}$
$=(-y(1-\hat{y})+(1-y)\hat{y})\cdot x_{1}=(\hat{y}-y)\cdot x_{1}$
Similarly:
$\frac{dLoss}{{dw}_{2}}=(\hat{y}-y)\cdot x_{2}$
$\frac{dLoss}{{db}}=(\hat{y}-y)$
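Since all three gradients share the common factor $(\hat{y}-y)$, they are cheap to compute together once the forward pass is done. A minimal sketch (the function name `gradients` is my own):

```python
import math

def gradients(x1, x2, y, w1, w2, b):
    """Per-example gradients of the loss w.r.t. w1, w2 and b."""
    z = w1 * x1 + w2 * x2 + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    dz = y_hat - y              # common factor (y_hat - y)
    return dz * x1, dz * x2, dz  # dLoss/dw1, dLoss/dw2, dLoss/db
```

With zero weights and bias, $\hat{y}=0.5$, so for a positive example $(x_1, x_2, y)=(1, 2, 1)$ the common factor is $-0.5$ and the gradients are $(-0.5, -1.0, -0.5)$.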
Weight adjustment (optimization algorithm, gradient descent):
$dw_{1}=\frac{dLoss}{dw_{1}}=(\hat{y}-y)x_{1}$
$dw_{2}=\frac{dLoss}{dw_{2}}=(\hat{y}-y)x_{2}$
$db=\frac{dLoss}{db}=(\hat{y}-y)$
We use the mean (average $\overline{dw}$) of the gradients over all training samples:
$w_{1}=w_{1}-\alpha\cdot \overline {dw_{1}} $
$w_{2}=w_{2}-\alpha\cdot \overline {dw_{2}} $
$b=b-\alpha\cdot \overline {db} $
where $\alpha$ = learning rate
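Putting the pieces together, one full-batch gradient-descent step averages the per-example gradients and then applies the update rule above. A minimal sketch, trained on a hypothetical AND-like toy dataset (the function name `train_step` and the data are my own illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(samples, w1, w2, b, alpha):
    """One gradient-descent step: average the per-example gradients, then update."""
    n = len(samples)
    dw1 = dw2 = db = 0.0
    for x1, x2, y in samples:
        y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
        dz = y_hat - y          # common factor (y_hat - y)
        dw1 += dz * x1
        dw2 += dz * x2
        db += dz
    # Update with the mean gradients, scaled by the learning rate alpha
    return (w1 - alpha * dw1 / n,
            w2 - alpha * dw2 / n,
            b - alpha * db / n)

# Toy data: logical AND of two binary inputs
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
w1 = w2 = b = 0.0
for _ in range(1000):
    w1, w2, b = train_step(data, w1, w2, b, alpha=0.5)
```

After training, the neuron should predict above 0.5 for the input $(1,1)$ and below 0.5 for $(0,0)$, since AND is linearly separable.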