The same plot for the modified function (with the same parameters) is almost a straight line. The starting point is $$x=0$$, and the convergence is fast.
We see that the result will be $$ \frac{1+\tanh(-1.1)}{2} \sim 0.10$$ as expected.
On the Criteo dataset, the usual Newton method goes out of range for a small (but non-negligible) fraction of the examples. The returned dual in these cases will be $$0$$ or $$\pm 1$$. The modified Newton algorithm always finds the true zero and achieves a better log loss.
The blue lines represent the modified Newton (evaluation and training) and the orange line is the normal Newton algorithm (training).
TODO(vgodet): Update the plot with eval_continuous for Newton
Note: Newton seems to converge faster at the beginning because it takes more aggressive steps when it goes out of range.
The second derivative of $$H$$
$$ H''(x) = \frac{A}{y} \tanh x (1-\tanh^2 x) $$
is bounded, so quadratic convergence should be guaranteed once we are close enough to the solution (see https://en.wikipedia.org/wiki/Newton%27s_method#Proof_of_quadratic_convergence_for_Newton.27s_iterative_method for the proof).
However, we can't know in advance whether we are close enough to the zero. To prove convergence in all cases, we can use the Kantorovich theorem (reviewed in [5]). A sufficient condition for convergence is that we start at a point $$ x_0 $$ such that
$$ \left|\frac{4A H(x_0)}{H'(x_0)^2} \right|\leq 1 $$
If $$A$$ is not small, the starting point $$x_0 = 0$$ doesn't satisfy this condition, and we can solve the above inequality to find a starting point that does.
In practice, however, convergence from $$x_0 = 0$$ always occurred (tested on a sample of generic parameter values).
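As a quick sanity check, this condition is cheap to evaluate numerically for a candidate starting point. A minimal sketch, assuming callables `H` and `H_prime` for $$H$$ and $$H'$$ (these names are illustrative, not taken from the solver code):

```python
def kantorovich_condition_holds(H, H_prime, x0, A):
    """Returns True if |4*A*H(x0) / H'(x0)^2| <= 1 (the sufficient condition above)."""
    return abs(4.0 * A * H(x0) / H_prime(x0) ** 2) <= 1.0
```

When this check fails at $$x_0 = 0$$, the inequality can be solved for a starting point that does satisfy it.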
Poisson log loss is defined as $$ \l(u) = e^u - uy $$ for label $$y \geq 0.$$ Its dual is
$$ \l^\star(v) = (y+v) (\log(y+v) - 1) $$
and is only defined for $$ y+v > 0 $$. We then have the constraint
$$ y > \a+\d. $$
The dual is
$$ D(\d) = -(y-\a-\d) (\log(y-\a-\d) - 1) - \bar{y} \d - \frac{A}{2} \d^2 $$
and its derivative is
$$ D'(\d) = \log(y-\a-\d) - \bar{y} - A\d $$
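For completeness, the chain-rule step behind the first term: writing $$w = y-\a-\d$$ (so that $$dw/d\d = -1$$),
$$ \frac{d}{d\d}\Big[-w(\log w - 1)\Big] = -\frac{dw}{d\d}\Big[(\log w - 1) + 1\Big] = \log w = \log(y-\a-\d). $$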
As with the logistic loss, we perform a change of variable to handle the constraint on $$ \d $$:
$$ y - (\a+\d) = e^x $$
After this change of variable, the goal is to find the zero of this function
$$ H(x) = x - \bar{y} -A(y-\a-e^x) $$
whose first derivative is
$$ H'(x) = 1+Ae^x $$
Since this derivative is always positive, $$H$$ is strictly increasing and has a unique zero.
We can start the Newton algorithm at $$\d=0$$, which corresponds to $$ x = \log(y-\a)$$. As before, the Newton step is given by
$$x_{k+1} = x_k - \frac{H(x_k)}{H'(x_k)}. $$
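A minimal sketch of this iteration (the function and parameter names are illustrative, not taken from the actual solver; the tolerance and step cap are arbitrary choices):

```python
import math


def poisson_newton_delta(y, y_bar, alpha, A, max_steps=20, tol=1e-12):
    """Newton iteration for H(x) = x - y_bar - A*(y - alpha - exp(x)) = 0.

    Starts at x = log(y - alpha), i.e. delta = 0 (requires y > alpha), and
    returns delta = y - alpha - exp(x) at the approximate zero of H.
    """
    x = math.log(y - alpha)  # starting point corresponding to delta = 0
    for _ in range(max_steps):
        h = x - y_bar - A * (y - alpha - math.exp(x))
        if abs(h) < tol:
            break
        h_prime = 1.0 + A * math.exp(x)  # always positive, so the step is well defined
        x -= h / h_prime
    return y - alpha - math.exp(x)  # delta, since y - (alpha + delta) = exp(x)
```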
[1] C. Ma et al., Adding vs. Averaging in Distributed Primal-Dual Optimization, arXiv:1502.03508, 2015
[2] C. Ma et al., Distributed Optimization with Arbitrary Local Solvers, arXiv:1512.04039, 2015
[3] S. Shalev-Shwartz, T. Zhang, Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, 2013
[4] S. Shalev-Shwartz, T. Zhang, Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization, 2013
[5] A. Galantai, The theory of Newton’s method, 2000
We want to compute $$\l^\star(v) = \max_u [ uv-\l(u) ]$$, where $$\l$$ is the smooth hinge loss. We thus have to solve $$v=\l'(u)$$. The derivative of the smooth hinge loss is given by
$$ \l'(u) = \begin{cases} 0 & yu \geq 1 \\ -y & yu \leq 1-\gamma \\ \frac{u-y}{\gamma} & \text{otherwise} \end{cases} $$
Solving this equation for $$u$$ and substituting back, we find the dual of the smooth hinge loss as
$$ \l^\star(v) = yv + \frac{\gamma}{2}v^2 $$
with the restriction $$ yv \in (-1,0) $$.
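For reference, the intermediate steps: integrating the middle branch of $$\l'$$ (with $$\l(u)=0$$ at $$yu=1$$) gives $$\l(u) = \frac{(u-y)^2}{2\gamma}$$ in that regime, so with $$u = y+\gamma v$$ and $$y^2=1$$,
$$ \l(u) = \frac{\gamma}{2}v^2, \qquad \l^\star(v) = uv - \l(u) = (y+\gamma v)v - \frac{\gamma}{2}v^2 = yv + \frac{\gamma}{2}v^2. $$
The regime $$1-\gamma < yu < 1$$ maps to $$yv \in (-1,0)$$ since $$yu = 1+\gamma yv$$.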
We can now maximize the dual objective with respect to $$\d$$,
$$ D(\a+\d) = -\l^\star(-\a-\d)-\bar{y}\d-\frac{A}{2} \d^2 $$
Setting its derivative with respect to $$\d$$ to zero gives the expected result
$$\d = \frac{y-\bar{y}-\gamma\a}{A+\gamma} $$
with the constraint $$ y(\a+\d) \in (0,1)$$.
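A minimal sketch of this closed-form update (names are illustrative; the final clipping step is an assumption about how the constraint is enforced, which this note does not spell out):

```python
def smooth_hinge_dual_delta(y, y_bar, alpha, A, gamma):
    """Closed-form dual step delta = (y - y_bar - gamma*alpha) / (A + gamma) for a +/-1 label y."""
    delta = (y - y_bar - gamma * alpha) / (A + gamma)
    # Assumed constraint handling: clip alpha + delta so that y * (alpha + delta)
    # stays in [0, 1], i.e. alpha + delta in [0, 1] for y = +1 and in [-1, 0] for y = -1.
    lo, hi = (0.0, 1.0) if y > 0 else (-1.0, 0.0)
    new_alpha = min(max(alpha + delta, lo), hi)
    return new_alpha - alpha
```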