University of Vermont
Computer Science Technical Report CS-06-08
Ontology-based Feature Weighting for Biomedical Literature Classification
by
Dan He
Abstract
The "curse of dimensionality" is a well-known
problem for many classification algorithms, such as the
k-nearest neighbor algorithm, whose efficiency depends largely on
the dimensionality of the feature space. In text classification, the
dimensionality of the feature space (the number of unique words in
the texts) is usually prohibitively high for these algorithms.
Another drawback of these standard classification algorithms in
text classification tasks is that the semantic relationships
among the features are lost. Ontology-based
feature transformation has been used to reduce the dimensionality of
the feature space. Semantic information can also be incorporated
into the classification process by mapping different features
to the same feature according to the structure of the domain
ontology. It has been shown that the transformation reduces the
dimensionality effectively and that these algorithms perform
better on the transformed feature space than on the original,
untransformed feature space. However, the transformation may yield
only slight improvements in classification accuracy. Given that
training the ontology-based feature transformation method is
complicated and time consuming, the method may not work as well as
expected. In
this paper we give possible reasons why the ontology-based feature
transformation method is not satisfactory, and we propose a
novel ontology-based feature weighting strategy that achieves
classification accuracy similar to, or even better than, that of the
ontology-based feature transformation method. A similar
dimensionality reduction effect can also be achieved. Moreover, our
method does not require any training on the structure of the
ontology, and it is therefore nearly as fast as the original
algorithms without this strategy.
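
As a rough illustration of the ontology-based feature transformation
summarized above (not the weighting strategy this report proposes), the
following minimal sketch maps terms to a shared ancestor concept and
counts the resulting features. The toy ontology and the names
TOY_PARENT, generalize, and transform are hypothetical and chosen only
for this example; a real domain ontology would be far larger and would
not necessarily be a simple tree.

```python
from collections import Counter

# Hypothetical toy ontology (an assumption for illustration only):
# each term maps to its parent concept.
TOY_PARENT = {
    "insulin": "hormone",
    "glucagon": "hormone",
    "hormone": "protein",
    "kinase": "enzyme",
    "protease": "enzyme",
    "enzyme": "protein",
}


def generalize(term, levels=1):
    """Replace a term by its ancestor `levels` steps up the toy ontology.

    Terms that are not in the ontology are returned unchanged.
    """
    for _ in range(levels):
        if term not in TOY_PARENT:
            break
        term = TOY_PARENT[term]
    return term


def transform(tokens, levels=1):
    """Build a bag-of-concepts vector by merging terms that share an ancestor."""
    return Counter(generalize(t, levels) for t in tokens)


doc = ["insulin", "glucagon", "kinase", "protease", "patient"]
original = Counter(doc)                 # one feature per unique word
transformed = transform(doc, levels=1)  # sibling terms merged into one feature

print(len(original), "->", len(transformed))  # prints: 5 -> 3
```

Raising levels merges more terms into fewer features, at the cost of
discarding finer distinctions between them.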
Last updated: Apr 17, 2006.