# AdaBoost

The AdaBoost algorithm of Freund and Schapire was the first practical boosting algorithm, and remains one of the most widely used and studied, with applications in numerous fields.

### Algorithm

\((x_i, y_i)\) is a tuple where \(x_i\) is features vector and \(y_i\) is class label. In the begining, all tuples have same weigh represented by distribution \(D_t^i\). In each iteration, we first train a weak learner using \(D_t\), as long as the learner is better than random guess, the boosting can get a good result. If the learner is a decision tree, we should consider the weigh when calculating loss at every tree node for discriminator selection. By using the learner, we can get a weak hypothesis \(h_t\) which maps features space to {-1, 1}. \(\epsilon_t\) is the sum of weigh of misclassified tuples. By derivation, \(D_t\) can be updated as above where \(\alpha_t\) equals to \(\frac{1}{2}ln(\frac{1-\epsilon_t}{\epsilon_t})\). After several rounds, we can get a strong classfier by combining trained weak learner with learner weigh \(\alpha_t\).

### Derivation by Loss Function

For simplicity we have \(H(x)=\sum_{t=1}^{T}\alpha_th_t(x)\)
Forward Stagewise Additive Modeling

Input: \(T={(x_1,y_1),(x_2,y_2),…,(x_N,y_N)}\)

Loss Func: \(L(y,f(x))\)

Base Func Set: \(b(x;\gamma)\)

Output: \(f(x)=\sum_{m=1}^{M}\beta_mb(x;\gamma_m)\)

Algorithm:

- Initial\(f_0(x)=0\)
- For m \(\in\) {1,2,3,…,M}
- \((\beta_m,\gamma_m)=arg\min_{\beta,\gamma}\sum_{i=1}^{N}L(y_i,f_{m-1}(x_i)+\beta b(x_i;\gamma))\)
- \(f_m(x)=f_{m-1}(x)+\beta_mb(x;\gamma_m)\)
- \(f(x)=\sum_{m=1}^{M}\beta_mb(x;\gamma_m)\)

We suppose that loss function of Adaboost is exponential loss.

We aim to get

\((\alpha_t,h_t(x))=arg\min_{\alpha,h}\sum_{i=1}^{m}\exp(-y_i(H_{t-1}(x_i)+\alpha h(x_i)))\)

The solution to problem with multiple parameters, we can fix \(H_{t-1}(x_i)\) so that \(\exp(-y_iH_{t-1}(x_i))\) is dependent to
\(\alpha\ and\ h(x_i)\). Set

For solving this equation, \(h_t(x)\) has another equation.

We can solve this by training a weak model.

Consider \(y_i \in \{-1,+1\}\)

From equation above, we can say that \(h_t^{*}(x)\) can replace \(h_t(x_i)\)

Next, we need to get \(\alpha_t\)

\(\alpha_t=arg\min_{\alpha}e^{-\alpha}\sum_{i=1}^{m}D_t^i + (e^{\alpha}-e^{-\alpha})\sum_{i=1}^{m}D_t^iI(y_i \ne h_t^*(x_i))\)

Calculate derivative of the right part and make it 0 to get \(\alpha_t^*\)

We can get

This equation is similar to the update equation in Adaboost except for the normalization coefficient \(Z_t\).

### Derivation by Upper Bound

#### Theorem:

Error is minimized by minimizing \(Z_t\)

#### Proof:

From figure below, we can find that \(I(H(x_i) \ne y_i) \le e^{-y_if(x_i)}\) Thus we have

According to this theorem, we should choose \(\alpha_t\) to minimize \(Z_t\)

Remember that \(Z_t\) is the normalization factor, so \(Z_t=\min_{\alpha_t}\sum_{i=1}^{m}D_t^i\exp(-y_i\alpha_th_t(x_i))\)

Making the first derivative of \(Z_t\) with respect to \(\alpha_t\) equals to zero

The result is the same as deviations above.

### References

- Improved boosting algorithms using confidence
- AdaBoost Czech Technical University, Prague
- Additive Logistic Regression: a Statistic View of Boosting
- On the Margin Explanation of Boosting Algorithms
- Explaining AdaBoost
- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting*