Derivation of Algorithm

As we all know, for any predictive learning problem, solution is trying to minimize the expected value of loss function \(L(y, F(x))\) over the joint distribution of all \((y, \mathbf{x})\)

$$F^* = arg\min_FE_{y,\mathbf{x}}L(y, F(\mathbf{x})) = arg\min_FE_{\mathbf{x}}[E_y(L(y,F(\mathbf{x})))|\mathbf{x}]$$

Such function approximation \(F(\mathbf{x})\) can be expressed as an expansions of the form

$$F(\mathbf{x};\{\beta_m,\mathbf{\alpha_m}\}_1^M)=\sum_{m=1}^M\beta_mh(\mathbf{x};\mathbf{\alpha_m})$$

In general, choosing a parameterized model or approximation function \(F(\mathbf{x};\mathbf{P})\) changes the function optimization problem to one of parameter optimization which is Gradient Descent (or Steepest Descent)

$$\mathbf{P}^*=arg\min_{\mathbf{P}}E_{y,\mathbf{x}}L(y, F(\mathbf{x};\mathbf{P}))$$ $$F(\mathbf{x}^*)=F(\mathbf{x};\mathbf{P}^*)$$ $$\mathbf{P}^*=\sum_{m=0}^M\mathbf{p}_m$$ $$\mathbf{g}_m=\{g_{jm}\}=\{[\frac{\delta E_{y,\mathbf{x}}L(y, F(\mathbf{x};\mathbf{P_{m-1}}))}{\delta P_j}]\}$$ $$\mathbf{P}_{m-1}=\sum_{i=0}^{m-1}\mathbf{p}_i$$ $$\downarrow$$ $$\mathbf{p}_{m}=-\rho_m\mathbf{g}_m$$ $$\rho_m=arg\min_\rho E_{y,\mathbf{x}}L(y, \mathbf{P}_{m-1}-\rho\mathbf{g}_m)$$

Another way to find the function is using numeric optimization in function space.

$$F^*(\mathbf{x})=arg\min_{F}E_{y,\mathbf{x}}L(y, F(\mathbf{x}))$$ $$=arg\min_{F}E_y[L(y, F(\mathbf{x})|\mathbf{x}]$$ $$F^*(\mathbf{x})=\sum_{m=0}^Mf_m(\mathbf{x})$$ $$g_m(\mathbf{x})=E_y[\frac{\delta L(y, F(\mathbf{x}))}{\delta F(\mathbf{x})}]_{F(\mathbf{x})=F_{m-1}(\mathbf{x})}$$ $$F_{m-1}(\mathbf{x})=\sum_{i=0}^{m-1}f_i(\mathbf{x})$$ $$f_m(\mathbf{x})=-\rho_mg_m(\mathbf{x})$$ $$\rho_m=arg\min_\rho E_{y,\mathbf{x}}L(y, F_{m-1}(\mathbf{x})-\rho g_m(\mathbf{x}))$$

However, for the finite data tuples, joint distribution of \((y, \mathbf{x})\) is hard to estimate, thus adopting the ‘greedy-stagewise’ approach.
For m = 1,2,3,…,M

$$(\beta_m, \mathbf{a}_m)=arg\min_{\beta,\mathbf{a}}\sum_{i=1}^{N}L(y_i,F_{m-1}(\mathbf{x}_i)+\beta h(\mathbf{x}_i;\mathbf{a}))$$ $$F_m(\mathbf{x})=F_{m-1}(\mathbf{x})+\beta_m h(\mathbf{x};\mathbf{a}_m)$$

Still we can formalize the data based negative gradient.

$$-g_m(\mathbf{x}_i)=-[\frac{\delta L(y_i,F_{m-1}(\mathbf{x}_i))}{\delta F_{m-1}(\mathbf{x}_i)}]$$

In order to generalize the direction to other \(\mathbf{x}\)-values, we should find a parameterized function \(\beta h(\mathbf{x};\mathbf{a})\) that close to \(-g_m(\mathbf{x})\)

$$\mathbf{a}_m=arg\min_{\beta,\mathbf{a}}\sum_{i=1}^N[-g_m(\mathbf{x}_i)-\beta h(\mathbf{x}_i;\mathbf{a})]^2$$ $$\rho_m=arg\min_{\rho}\sum_{i=1}^NL(y_i,F_{m-1}(\mathbf{x}_i)+\rho h(\mathbf{x}_i;\mathbf{a}_m))$$ $$F_m(\mathbf{x})=F_{m-1}(\mathbf{x})+\rho_mh(\mathbf{x};\mathbf{a}_m)$$

References

  1. Greedy Function Approximation: A Gradient Boosting Machine
  2. A Gentle Introduction to Gradient Boosting
  3. Boosting: Foundations and Algorithms