University of Vermont
Computer Science Technical Report CS-06-08

Ontology-based Feature Weighting for Biomedical Literature Classification
by Dan He

Abstract

The "curse of dimensionality" is a well-known problem for many classification algorithms, such as the k-nearest neighbor algorithm, whose efficiency depends largely on the dimensionality of the feature space. In text classification, the dimensionality of the feature space (the number of unique words in the texts) is typically prohibitively high for these algorithms. Another drawback of applying these standard classification algorithms to text is that semantic relationships among the features are lost. Ontology-based feature transformation has been used to reduce the dimensionality of the feature space. It also incorporates semantic information into the classification process by mapping different features to the same feature according to the structure of the domain ontology. It has been shown that the transformation reduces dimensionality effectively and that these algorithms perform better on the transformed feature space than on the original, untransformed one. However, the gain in classification accuracy may be only slight. Given that training the ontology-based feature transformation method is complicated and time-consuming, the method may not work as well as expected. In this paper we give possible reasons why the ontology-based feature transformation method is not satisfactory, and we propose a novel ontology-based feature weighting strategy that achieves classification accuracy similar to, or even better than, that of the transformation method. It also achieves a similar reduction in dimensionality. Our method does not require any training on the structure of the ontology and is therefore nearly as fast as the original algorithms without the strategy.
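
The idea of mapping different features to the same feature can be illustrated with a minimal sketch. It is not the report's implementation: the toy ontology `PARENT`, the helper names `generalize` and `transform`, and the fixed number of generalization levels are all assumptions made for illustration. The report's weighting strategy is not shown, since the abstract does not specify it.

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to its parent concept (None = root).
PARENT = {
    "aspirin": "analgesic",
    "ibuprofen": "analgesic",
    "analgesic": "drug",
    "penicillin": "antibiotic",
    "antibiotic": "drug",
    "drug": None,
}

def generalize(term, levels=1):
    """Walk up to `levels` steps toward the root.

    Terms that are not in the ontology, or that are already the root,
    are returned unchanged.
    """
    for _ in range(levels):
        parent = PARENT.get(term)
        if parent is None:
            break
        term = parent
    return term

def transform(tokens, levels=1):
    """Count features after replacing each token with its generalized concept."""
    return Counter(generalize(t, levels) for t in tokens)

doc = ["aspirin", "ibuprofen", "penicillin", "patient", "aspirin"]
original = Counter(doc)
transformed = transform(doc, levels=1)

print(len(original), "features before:", dict(original))
print(len(transformed), "features after:", dict(transformed))
```

In this toy document, "aspirin" and "ibuprofen" collapse into the single feature "analgesic", so the feature count drops from four to three while the shared meaning of the two terms is preserved in the counts.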


Last updated: Apr 17, 2006.