Abstract: Recent research in machine learning, data mining and related areas has produced a wide variety of algorithms for cost-sensitive (CS) classification, where instead of maximizing the classification accuracy, minimizing the misclassification cost becomes the objective. However, these methods often assume that their input is quality data without inconsistent or incorrect values, or the noise impact is trivial, which is seldom the case in real-world environments. In this paper, we systematically study the impacts of class noise on CS learning, and propose a cost-guided class noise handling algorithm to identify noise for effective CS learning. We call it "Cost-guided Iterative Classification Filter" (CICF), because it seamlessly incorporates costs and an existing Classification Filter (CF) (Brodley & Friedl 99) for noise identification. Instead of putting equal weights on handling noise in all classes in existing efforts, CICF puts more emphasis on expensive classes, which makes it attractive in dealing with datasets with a large cost-ratio. Experimental results and comparative studies indicate that the existence of noise may seriously corrupt the performance of the underlying CS learners, and by adopting the proposed CICF algorithm, we can significantly reduce the misclassification cost of a CS classifier in noisy environments.
Abstract: To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major set oriented scheme: the training dataset is separated into two parts (a major set and a minor set). The classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it would be either physically impossible or time consuming to load the major set into the memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets. In this paper, we propose a solution to bridge the local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance Ik, two error count variables are used to count the number of times it has been identified as noise by all data subsets. The instance with higher error values will have a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
Abstract: Recently, mining from data streams has become an important and challenging task for many real-world applications such as credit card fraud protection and sensor networking. One popular solution is to separate stream data into chunks, learn a base classifier from each chunk, and then integrate all base classifiers for effective classification. In this paper, we propose a new dynamic classifier selection (DCS) mechanism to integrate base classifiers for effective mining from data streams. The proposed algorithm dynamically selects a single "best" classifier to classify each test instance at run time. Our scheme uses statistical information from attribute values, and uses each attribute to partition the evaluation set into disjoint subsets, followed by a procedure that evaluates the classification accuracy of each base classifier on these subsets. Given a test instance, its attribute values determine the subsets that the similar instances in the evaluation set have constructed, and the classifier with the highest classification accuracy on those subsets is selected to classify the test instance. Experimental results and comparative studies demonstrate the efficiency and efficacy of our method. Such a DCS scheme appears to be promising in mining data streams with dramatic concept drifting or with a significant amount of noise, where the base classifiers are likely conflictive or have low confidence.
Abstract: Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values in a dataset. However, due to the significant cost of doing so and the inherent correlations in the dataset, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select what kinds of instances to complete so the model built from the processed data can receive the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes are different, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes that given a fixed budget, what kinds of instances should be selected for preparation, so that the learner built from the processed dataset can maximize its performance In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition can pay more attention to those attributes that are cheap in price but informative for classification. To this end, we will first introduce a unique Economical Factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then we will propose a cost-constrained data acquisition model, where active learning, missing value prediction and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies from real-world datasets demonstrate the effectiveness of our method.
Abstract: Hierarchical video browsing and feature-based video retrieval are two standard methods for accessing video content. Very little research, however, has addressed the benefits of integrating these two methods for more effective and efficient video content access. In this paper we introduce InsightVideo, a video analysis and retrieval system, which joins video content hierarchy, hierarchical browsing and retrieval for efficient video access. We propose several video processing techniques to organize the content hierarchy of the video. We first apply a camera motion classification and key-frame extraction strategy that operates in the compressed domain to extract video features. Then, shot grouping, scene detection and pairwise scene clustering strategies are applied to construct the video content hierarchy. We introduce a video similarity evaluation scheme at different levels (key-frame, shot, group, scene, and video.) By integrating the video content hierarchy and the video similarity evaluation scheme, hierarchical video browsing and retrieval are seamlessly integrated for efficient video content access. We construct a progressive video retrieval scheme to refine user queries through the interaction of browsing and retrieval. Experimental results and comparisons of camera motion classification, key-frame extraction, scene detection, and video retrieval are presented to validate the effectiveness and efficiency of the proposed algorithms and the performance of the system.
Abstract: Advances in the media and entertainment industries, including streaming audio and digital TV, present new challenges for managing and accessing large audio-visual collections. Current content management systems support retrieval using low-level features, such as motion, color, and texture. However, low-level features often have little meaning for naïve users, who much prefer to identify content using high-level semantics or concepts. This creates a gap between systems and their users that must be bridged for these systems to be used effectively. To this end, in this paper we first present a knowledge-based video indexing and content management framework for domain specific videos (using basketball video as an example). We will provide a solution to explore video knowledge by mining associations from video data. The explicit definitions and evaluation measures (e.g., temporal support and confidence) for video associations are proposed, by integrating the inherent feature of video data. Our approach uses video processing techniques to find visual and audio cues (e.g. court field, camera motion activities, and applause), introduces multilevel sequential association mining to explore associations among the audio and visual cues, classifies the associations by assigning each of them with a class label, and uses their appearances in the video to construct video indices. Our experimental results demonstrate the performance of the proposed approach.
Abstract: Real-world data is never perfect and can often suffer from corruptions or missing values that may impact models created from the data. To build accurate predictive models, data ac-quisition is usually adopted to complete missing values in the incomplete instances. Due to the significant cost of doing so and the inherent correlations in the dataset, acquiring com-plete information for all instances is likely prohibitive and unnecessary. An interesting and important problem raises here is to select what kind of instances to complete so the model built from the data can receive significant improvement. In this paper, we propose two solutions to resolve this problem, and the essential idea is to complete the attributes with higher impacts to the system performance. The first solution is based on an impact-sensitive instance ranking mechanism [1]. We explore the correlation between attributes and the class and use the correlation as weights of the attributes; the larger the weight, the higher the impacts of the attribute. For each in-complete instance, we sum all weights of the attributes with missing values, and the instance with larger sum appears to be more important for users to complete their missing informa-tion. In the second solution, active learning, impact-sensitive instance ranking and missing value prediction are combined for data acquisition. Experimental results from real-world datasets demonstrate the effectiveness of our strategies.
Abstract: Recently, mining from data streams has become an important and challenging task for many real-world applications such as credit card fraud protection and sensor networking. One popular solution is to separate stream data into chunks, learn a base classifier from each chunk, and then integrate all base classifiers for effective classification. In this paper, we propose a new dynamic classifier selection (DCS) mechanism to integrate base classifiers for effective mining from data streams. The proposed algorithm dynamically selects a single "best" classifier to classify each test instance at run time. Our scheme uses statistical information from attribute values, and uses each attribute to partition the evaluation set into disjoint subsets, followed by a procedure that evaluates the classification accuracy of each base classifier on these subsets. Given a test instance, its attribute values determine the subsets that the similar instances in the evaluation set have constructed, and the classifier with the highest classification accuracy on those subsets is selected to classify the test instance. Experimental results and comparative studies demonstrate the efficiency and efficacy of our method. Such a DCS scheme appears to be promising in mining data streams with dramatic concept drifting or with a significant amount of noise, where the base classifiers are likely conflictive or have low confidence.
Abstract: Recent research in machine learning, data mining and related areas has produced a wide variety of algorithms for cost-sensitive (CS) classification, where instead of maximizing the classification accuracy, minimizing the misclassification cost becomes the objective. However, these methods assume that training sets do not contain significant noise, which is rarely the case in real-world environments. In this paper, we systematically study the impacts of class noise on CS learning, and propose a cost-guided class noise handling algorithm to identify noise for effective CS learning. We call it Cost-guided Iterative Classification Filter (CICF), because it seamlessly integrates costs and an existing Classification Filter for noise identification. Instead of putting equal weights to handle noise in all classes in existing efforts, CICF puts more emphasis on expensive classes, which makes it especially successful in dealing with datasets with a large cost-ratio. Experimental results and comparative studies from real-world datasets indicate that the existence of noise may seriously corrupt the performance of CS classifiers, and by adopting the proposed CICF algorithm, we can significantly reduce the misclassification cost of a CS classifier in noisy environments.
Abstract: Attribute noise can affect classification learning. Previous work in handling attribute noise has focused on those predictable attributes that can be predicted by the class and other attributes. However, attributes can often be predictive but unpredictable. Being predictive, they are essential to classification learning and it is important to handle their noise. Being unpredictable, they require strategies different from those of predictable attributes. This paper presents a study on identifying, cleansing and measuring noise for predictive-but-unpredictable attributes. New strategies are accordingly proposed. Both theoretical analysis and empirical evidence suggest that these strategies are more effective and more efficient than previous alternatives.
Abstract: Real-world data is never perfect and can often suffer from corruptions (noise) that may impact interpretations of the data, models created from the data and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy, time in building a classifier and the size of the classifier. Accordingly, most existing learning algorithms have integrated various approaches to enhance their learning abilities from noisy environments, but the existence of noise can still introduce serious negative impacts. A more reasonable solution might be to employ some preprocessing mechanisms to handle noisy instances before a learner is formed. Unfortunately, rare research has been conducted to systematically explore the impact of noise, especially from the noise handling point of view. This has made various noise processing techniques less significant, specifically when dealing with noise that is introduced in attributes. In this paper, we present a systematic evaluation on the effect of noise in machine learning. Instead of taking any unified theory of noise to evaluate the noise impacts, we differentiate noise into two categories: class noise and attribute noise, and analyze their impacts on the system performance separately. Because class noise has been widely addressed in existing research efforts, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions in handling attribute noise. Our conclusions can be used to guide interested readers to enhance data quality by designing various noise handling mechanisms.
Abstract: Given a noisy dataset, how to locate erroneous instances and attributes and rank suspicious instances based on their impacts with the system performance is an interesting and important research issue. We provide in this paper an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address this problem. Given a noisy dataset D, we first train a benchmark classifier T from D. The instances, which cannot be effectively classified by T, are treated as suspicious instances and forwarded to a subset S. For each attribute Ai, we switch Ai and class label C to train a classifier APi for Ai. Given instance Ik in S, we use APi and the benchmark classifier T to locate the erroneous value of each attribute Ai. To quantitatively rank instances in S, we define an impact measure based on the Information-gain Ratio (IR). We calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of impact-sensitive weights from all located erroneous attributes of Ik indicates its total impact value. The experimental results demonstrate the effectiveness of our strategies.
Abstract: In this paper, we propose a hierarchical video summarization strategy that explores video content structure to provides users with a scalable, multilevel video summary. First, video shot segmentation and key-frame extraction algorithms are applied to parse video sequences into physical shots and discrete key-frames. Next, an affinity (self-correlation) matrix is constructed to merge visually similar shots into clusters (super-groups). Since video shots with higher similarities do not necessarily imply that they belong to the same story unit, temporal information is addressed by merging temporally adjacent shots (within a specified distance) from the super-group into each video group. A video scene detection algorithm is thus adopted to merge temporally or spatially correlated video groups into scenario units. This is followed by a scene clustering algorithm which eliminates visual redundancy among the units. Finally, a hierarchical video content structure with increasing granularity is constructed from clustered scenes, scenes, groups and key-frames. Then, we introduce a hierarchical video summarization scheme by executing various approaches among the different levels of video content hierarchy to statically or dynamically determine the video summary within the multiple layers. Extensive experiments based on real-world videos have been executed to validate the effectiveness of the proposed approach.
Abstract: This paper presents a new approach for identifying and eliminating mislabeled data items in large or distributed datasets. We first partition a dataset into subsets, each of which is small enough to be processed by an induction algorithm at one time. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance Ik, two error count variables are used to count how many times it has been identified as noise by all subsets. The instance with higher error values will have a higher probability of being a mislabeled example. Various threshold schemes are used to identify the noisy examples. Experimental results and comparisons from real-world datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
Abstract: To support more efficient video database manage-ment, this paper explores the concept of video asso-ciation mining, with which the association patterns are characterized by sequentially associated video shots and their cluster information. Given a continu-ous video sequence V, the video shot segmentation mechanism is first introduced to parse it into discrete shots. We then cluster shots into visually distinct groups and construct a shot cluster sequence by using the class label of each shot. An association mining scheme is designed to mine sequentially associated clusters from the sequence. Those detected associa-tions will convey valuable knowledge for video data-base management. The experimental results demon-strate the effectiveness of our design.
Abstract: Multi-database mining is an important research area because (1) there is an urgent need for analyzing data in different sources, (2) there are essential differences between mono- and multi-database mining, and (3) there are limitations in existing multi-database mining efforts. This paper designs a new multi-database mining process. Some research issues involving mining multi-databases, including database clustering and local pattern analysis, are discussed.
Abstract: In this paper, we propose an association-based video summarization scheme that mines sequential associations from video data for summary creation. Given detected shots of video V, we first cluster them into visually distinct groups, and then construct a sequential sequence by integrating the temporal order and cluster type of each shot. An association mining scheme is designed to mine sequentially associated clusters from the sequence, and these clusters are selected as summary candidates. With a user specified summary length, our system generates the corresponding summary by selecting representative frames from candidate clusters and assembling them by their original temporal order. The experimental evaluation demonstrates the effectiveness of our summarization method.
Abstract: Video is increasingly the medium of choice for a variety of communication channels, resulting primarily from increased levels of networked multimedia systems. One way to keep our heads above the video sea is to provide summaries in a more tractable format. Many existing approaches are limited to exploring important low-level feature related units for summarization. Unfortunately, the semantics, content and structure of the video do not correspond to low-level features directly, even with closed-captions, scene detection, and audio signal processing. The drawbacks of existing methods are the following: (1) instead of unfolding semantics and structures within the video, low-level units usually address only the details, and (2) any important unit selection strategy based on low-level features cannot be applied to general videos.
Providing users with an overview of the video content at various levels of summarization is essential for more efficient database retrieval and browsing. In this paper, we present a hierarchical video content description and summarization strategy supported by a novel joint semantic and visual similarity strategy. In order to describe the video content efficiently and accurately, a video content description ontology is adopted. Various video processing techniques are then utilized to construct a semi-automatic video annotation framework. By integrating acquired content description data, a hierarchical video content structure is constructed with group merging and clustering. Finally, a four layer video summary with different granularities is assembled to assist users in unfolding the video content in a progressive way. Experiments on real-word videos have validated the effectiveness of the proposed approach.