SAMME: A Multiclass Boosting Algorithm


Boosting is one of the coolest Machine Learning techniques I’ve ever used. Essentially, boosting allows you to build a strong learner by combining the outputs of a set of weak learners. How cool is that?

The question that motivated boosting algorithms has been around for quite a while (since the mid 1900s), and one of the early successful algorithms is Adaboost [ref:Adaboost]. Adaboost is great for binary classification (you know… +1 or -1 stuff), but it does not do well on multiclass problems. The reason is that Adaboost requires the accuracy of each weak learner to exceed 50% (i.e. an error below 50%). For two classes that just means “better than random guessing”, but random guessing over K classes is only right 1/K of the time, so the 50% requirement becomes much harder to satisfy as the number of classes increases. From this point onward, I assume you know how Adaboost works (if not, search for Adaboost tutorials, or read the original paper).

But to give you a rough idea, the procedure goes like this. We give each training instance a weight (basically, how important that instance is to the classifier). Initially this is equal for all instances (1/N, where N is the number of training examples). Then we fit a classifier (SVM, tree, forest, …) to the weighted examples and calculate the weighted error. Based on whether each example was classified correctly or not, we update its weight (less weight if correct, more weight if wrong). This process is repeated for T iterations.
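
If a sketch helps, here is a minimal Python version of that loop (my own illustration, not code from the paper), assuming decision stumps from scikit-learn as the weak learners and labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Plain binary Adaboost sketch; y is expected to hold labels in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # equal weights to start with
    learners, alphas = [], []
    for _ in range(T):
        clf = DecisionTreeClassifier(max_depth=1)  # a decision stump as the weak learner
        clf.fit(X, y, sample_weight=w)             # fit to the weighted examples
        pred = clf.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)      # how much this learner's vote counts
        w *= np.exp(-alpha * y * pred)             # correct -> less weight, wrong -> more weight
        w /= w.sum()                               # re-normalize
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    return np.sign(sum(a * clf.predict(X) for a, clf in zip(alphas, learners)))
```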

Now, to tackle this, SAMME [ref:SAMME] was introduced. SAMME looks pretty much like Adaboost except for a few differences. First, let’s look at the pseudo-code.

Notation:
x_i – the i-th training sample
w(i) – weight of the i-th training sample
N – number of training samples
CLF(t) – classifier at iteration t
Ind(c) – indicator function (1 if condition c holds, else 0)
K – number of classes


Initialize w(i) = 1/N, i=1 to N
.
For t = 1 to T:
....Fit classifier CLF(t) to data weighted by w
....Calculate Err(t) := Sum(w(i)*Ind(c(x_i) != CLF(t,x_i)))/Sum(w(i)), i=1 to N
....Calculate Alp(t) := Log((1-Err(t))/Err(t)) + Log(K-1)
....Update w(i) := w(i)*Exp(Alp(t)*Ind(c(x_i) != CLF(t,x_i))), i=1 to N
....Re-Normalize(w(i)), i=1 to N
.
Output C(x) := ArgMax over classes k of Sum(Alp(t)*Ind(CLF(t,x)==k)), t=1 to T
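
To make the pseudo-code concrete, here is a rough Python translation (again my own sketch, assuming scikit-learn decision stumps as the weak learners; the function names are just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, T=50):
    classes = np.unique(y)                               # the K class labels
    N, K = len(y), len(classes)
    w = np.full(N, 1.0 / N)                              # Initialize w(i) = 1/N
    learners, alphas = [], []
    for _ in range(T):
        clf = DecisionTreeClassifier(max_depth=1)        # weak learner CLF(t)
        clf.fit(X, y, sample_weight=w)                   # fit to data weighted by w
        miss = clf.predict(X) != y                       # Ind(c(x_i) != CLF(t, x_i))
        err = np.sum(w * miss) / np.sum(w)               # Err(t)
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # Alp(t), with the extra Log(K-1)
        w *= np.exp(alpha * miss)                        # misclassified points get more weight
        w /= w.sum()                                     # re-normalize
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas, classes

def samme_predict(learners, alphas, classes, X):
    """C(x) := the class with the largest alpha-weighted vote."""
    votes = np.zeros((len(X), len(classes)))
    for alpha, clf in zip(alphas, learners):
        pred = clf.predict(X)
        for j, k in enumerate(classes):
            votes[:, j] += alpha * (pred == k)
    return classes[np.argmax(votes, axis=1)]
```

In practice you would probably just use scikit-learn’s AdaBoostClassifier, which implements SAMME out of the box, but the hand-rolled version shows how little actually changes compared to Adaboost.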

The pseudo-code doesn’t look much different from Adaboost except for the extra Log(K-1) in the Alp(t) calculation. In fact, for K=2, SAMME reduces to Adaboost. The true power of SAMME comes from this extra Log(K-1) term. The paper goes on to show that the algorithm is equivalent to fitting a forward stagewise additive model using a multi-class exponential loss function (whatever that is!). The paper then delves into the mathematical proofs behind SAMME, which you don’t really need to understand to use the algorithm.
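
To see why that term matters, take a hypothetical 5-class problem and a weak learner with a weighted error of 0.7. That is better than random guessing (which would be wrong 80% of the time), yet the plain Adaboost formula would assign it a negative weight and effectively reject it:

```python
import numpy as np

K, err = 5, 0.7                                          # hypothetical 5-class problem, weak learner error 0.7
adaboost_style = np.log((1 - err) / err)                 # ≈ -0.85 -> learner would be thrown away
samme_alpha = np.log((1 - err) / err) + np.log(K - 1)    # ≈ +0.54 -> learner still gets a positive vote
```

With the extra term, a weak learner only needs to beat random guessing (error below (K-1)/K) to receive a positive Alp(t), instead of needing better than 50% accuracy.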