
TABLE I
CLASSES CONSIDERED FOR EVALUATION TOGETHER WITH THE NUMBER
OF POSITIVE AND NEGATIVE EXAMPLES

ID     Name                            Positives  Negatives
SD     Software Development                2,077      8,223
TM     Technical Management                1,727      8,573
Sales  Sales                               1,587      8,713
P-QA   Production & Quality Assurance      1,501      8,799
TDD    Technical Development & Design      1,069      9,231

field(s) of work. A set of 103 different labels was used for
annotation. On average, each instance was annotated with 4.25
labels with a standard deviation of 1.86.
For the purpose of evaluation we considered only the five
classes with the largest number of positive examples, i.e.,
instances annotated with the respective class label. This
ensures that a large evaluation corpus is available.
As mentioned above, the multi-label classification
problem is solved by building a binary classifier for each
label. The numbers of positive and negative examples for each
classifier are shown in Table I. All instances not annotated with
the respective class label are considered negative training
examples for this class. Because the evaluation corpus is highly
unbalanced, we resampled the data to achieve a
more balanced distribution. By doing so, we aim to obtain
a more robust classifier.
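The per-class setup described above can be sketched as follows. This is only an illustration: the text does not state which resampling method was applied, so random undersampling of the negative examples is assumed here, and the function names, the `neg_per_pos` parameter, and the toy annotations are hypothetical.

```python
import random


def to_binary(annotations, class_id):
    """One binary target per instance: 1 if class_id is among its labels, else 0."""
    return [1 if class_id in labels else 0 for labels in annotations]


def undersample_negatives(targets, neg_per_pos=1.0, seed=0):
    """Return sorted indices keeping all positives and a random subset of negatives.

    Assumption: random undersampling stands in for the unspecified resampling.
    neg_per_pos (hypothetical) is the number of negatives kept per positive.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(targets) if t == 1]
    neg = [i for i, t in enumerate(targets) if t == 0]
    keep = min(len(neg), round(len(pos) * neg_per_pos))
    return sorted(pos + rng.sample(neg, keep))


# Toy multi-label annotations, invented for illustration (not corpus data).
annotations = [{"SD", "TDD"}, {"Sales"}, {"TM"}, {"SD"}, {"P-QA", "TM"}, {"Sales"}]
sd_targets = to_binary(annotations, "SD")
balanced = undersample_negatives(sd_targets)
```

Repeating `to_binary` for each of the five class IDs yields one binary problem per label, each of which can be resampled independently.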
The whole corpus used for the evaluation was preprocessed
consistently so that the different classifiers were able to
operate on the same feature set. To obtain the
numeric vectors required for SVM classification, TF-IDF
statistics were gathered for the 10,000 most frequently used words by
applying a German tokenizer without using a stop word list
or a stemming algorithm.
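A minimal sketch of this preprocessing step is given below. Several details are assumptions, since the text does not specify them: a simple regular-expression tokenizer stands in for the German tokenizer, the weight is the plain product of term frequency and log(N/df), and all names as well as the two sample sentences are hypothetical. In line with the description, no stop word list and no stemming are applied.

```python
import math
import re
from collections import Counter

# Stand-in tokenizer (assumption): \w+ also matches umlauts and ß in Python 3.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    return TOKEN.findall(text)


def build_vocabulary(docs, max_words=10_000):
    """Map the max_words most frequent tokens of the corpus to vector positions."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}


def tfidf_vectors(docs, vocab):
    """Return one dense TF-IDF vector per document, restricted to vocab."""
    n_docs = len(docs)
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: number of documents containing each vocabulary word.
    df = Counter(tok for toks in tokenized for tok in set(toks) if tok in vocab)
    vectors = []
    for toks in tokenized:
        tf = Counter(tok for tok in toks if tok in vocab)
        vec = [0.0] * len(vocab)
        for word, freq in tf.items():
            vec[vocab[word]] = freq * math.log(n_docs / df[word])
        vectors.append(vec)
    return vectors


# Invented sample texts, for illustration only.
docs = ["Wir suchen einen Entwickler für Java", "Wir suchen einen Verkäufer"]
vocab = build_vocabulary(docs)
vectors = tfidf_vectors(docs, vocab)
```

Because the vocabulary is built once over the whole corpus, every classifier receives vectors with identical dimensions, which is the point of the consistent preprocessing.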
To simulate the interactive feedback given by the human
annotators, we also used parts of this evaluation corpus. For
each instance for which the classifier decides to request feedback
from the human annotator, we provide the label from the evaluation corpus.
Based on this, we divide the training data into
three subsets:
1) Subset A is used to train the base classifier.
2) The elements of subset B are classified by the base
classifier and if an element is identified as ambiguous it
is passed to the specialized classifier as training data
together with its annotation. All elements that were
identified as ambiguous form subset S.
3) Subset C is used for the evaluation (testing) of the
overall CENFA classifier.
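The three-way division and the simulated feedback can be sketched as follows. The sketch relies on the dividing factor and confidence value discussed later in this section, and it makes several assumptions: the corpus is shuffled before splitting, an instance counts as ambiguous when the base classifier's confidence falls below the threshold, and `base_confidence` and `gold_label` are hypothetical callables standing in for the trained base classifier and the corpus annotation. The corpus size of 10,300 is the sum of positives and negatives in each row of Table I.

```python
import random


def split_corpus(instances, dividing_factor, seed=0):
    """Split into A (dividing_factor of the data) and equally sized B and C.

    Assumption: instances are shuffled first. If the remainder is odd,
    C receives the extra instance.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_a = round(len(shuffled) * dividing_factor)
    n_b = (len(shuffled) - n_a) // 2
    return shuffled[:n_a], shuffled[n_a:n_a + n_b], shuffled[n_a + n_b:]


def collect_ambiguous(subset_b, base_confidence, gold_label, threshold):
    """Build S: ambiguous instances from B, each paired with its corpus label.

    Looking up gold_label simulates the feedback of a human annotator.
    """
    return [(x, gold_label(x)) for x in subset_b if base_confidence(x) < threshold]


corpus = list(range(10_300))  # integer IDs stand in for the instances
subset_a, subset_b, subset_c = split_corpus(corpus, dividing_factor=0.7)

# Hypothetical stand-ins for the base classifier's confidence and the annotation.
subset_s = collect_ambiguous(
    subset_b,
    base_confidence=lambda x: (x % 100) / 100,
    gold_label=lambda x: x % 2,
    threshold=0.2,
)
```

With a dividing factor of 0.7, the split yields 7,210 instances for A and 1,545 each for B and C.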
Since the goal of our work is a classification approach that
can classify instances with the same accuracy as traditional
ensemble learning approaches but with reduced manual
effort and better run-time behavior, we need to compare
our approach to other approaches, which are
explained in the following. Subset S is the set of ambiguous
instances that is used to train the specialized classifier of
the CENFA classifier. The Random classifier approach uses
set R, a random selection of elements from subset B, to
train the specialized classifier. The number of elements in this
selection equals the number of ambiguous instances the
CENFA approach uses to train the specialized classifier. In
other words, R is chosen to have the same cardinality
as S, and both are subsets of B. The aim of
this approach is to verify the suitability of using ambiguous
instances for incorporating user feedback instead of training
the specialized classifier with random instances. To
examine the benefit of training only the specialized classifier with
the ambiguous instances, rather than also retraining the base classifier,
we introduce the Extended classifier approach. Once
a number of instances have been recognized as ambiguous, the whole
ensemble is retrained, and the accuracy and run time of this
approach are compared to those of training CENFA’s specialized
classifier only. Finally, our approach is evaluated
against the Random Single SVM (RSSVM) approach, which uses
a single SVM trained with subset A and a random selection
from subset B. Note that CENFA, Random, Extended,
and RSSVM are all trained with the same amount of training
data, but the instances used and the overall system architecture
vary across these approaches.
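The training data of the four approaches can be summarized in a short sketch. It is a reading of the description above rather than the authors' implementation: it assumes that the Extended approach retrains the ensemble on A together with S, and that RSSVM draws a random selection from B of the same size as S (the same draw R is reused here for simplicity). All function and key names are hypothetical.

```python
import random


def draw_random_subset(subset_b, subset_s, seed=0):
    """Draw R: a random selection from B with the same cardinality as S."""
    return random.Random(seed).sample(list(subset_b), len(subset_s))


def training_plan(subset_a, subset_b, subset_s, seed=0):
    """Map each approach to the data its components are trained on."""
    subset_r = draw_random_subset(subset_b, subset_s, seed)
    return {
        # Base classifier on A, specialized classifier on the ambiguous set S.
        "CENFA": {"base": list(subset_a), "specialized": list(subset_s)},
        # Same architecture, but the specialized classifier sees random instances.
        "Random": {"base": list(subset_a), "specialized": subset_r},
        # Assumption: the whole ensemble is retrained on A plus S.
        "Extended": {"ensemble": list(subset_a) + list(subset_s)},
        # Assumption: one SVM on A plus a random selection of size |S|.
        "RSSVM": {"single_svm": list(subset_a) + subset_r},
    }


# Toy IDs, invented for illustration.
subset_a = list(range(70))
subset_b = list(range(70, 85))
subset_s = [71, 74, 80]
plan = training_plan(subset_a, subset_b, subset_s)
sizes = {name: sum(len(v) for v in parts.values()) for name, parts in plan.items()}
```

In this sketch every approach ends up with |A| + |S| training instances, which mirrors the statement that all four are trained with the same amount of data.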
The three classifiers used for comparison each serve
a separate purpose. The Random classifier provides insight into the
accuracy of CENFA compared to a classifier
that does not use the active learning methodology for
selecting ambiguous instances. The RSSVM classifier provides insight
into CENFA’s accuracy compared to a classifier
that does not use the ensemble learning methodology and,
additionally, into the difference in training time to a single SVM setup.
The Extended classifier provides insight into the accuracy
compared to a classifier that does not make CENFA’s
compromise of training only the specialized classifier. Here, CENFA
was expected to be outperformed in accuracy
while being much faster. Table II provides an overview of
the different classifiers with their training sets and their
evaluation purpose.
The CENFA architecture and the evaluation concept allow
us to tune different parameters and examine their influence on
the overall accuracy in order to determine the best setting.
The dividing factor denotes the division into the subsets A, B
and C; in particular, the given number represents the fraction
of data that is assigned to subset A. Subsets B and C always
hold the same number of instances. For example, a dividing
factor of 0.7 means that A consists of 70% of the instances
from the evaluation corpus, B of 15% and C of 15%. A
higher dividing factor results in a larger training set A but a
smaller number of instances for the training of the specialized
classifier. The second parameter that can be varied is the
confidence value, which denotes the decision threshold of the
ensemble learner that determines whether an instance is considered
ambiguous. If this is chosen to be very low, then only a very
small number of instances from B are considered ambiguous
and used for the training of the specialized classifier. Further,
the specialized classifier gets only a small number of instances
from C assigned for training since the base classifier decides
42 Polibits (49) 2014 ISSN 1870-9044
Steffen Schnitzer, Sebastian Schmidt, Christoph Rensing, and Bettina Harriehausen-Mühlbauer