Knowledge and Information Systems
An International Journal
ISSN: 0219-1377 (printed version)
ISSN: 0219-3116 (electronic version)
by Springer

Volume 1, Number 1, February 1999

Regular Papers

Data Preparation for Mining World Wide Web Browsing Patterns

Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava

Abstract. The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of traffic and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site has increased along with this growth. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of data mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, there are several preprocessing tasks that must be performed prior to applying data mining algorithms to the data collected from server logs. This paper presents several data preparation techniques in order to identify unique users and user sessions. Also, a method to divide user sessions into semantically meaningful transactions is defined and successfully tested against two other methods. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [Mobasher, Jain, Han and Srivastava 1996].

Data Mining via Discretization, Generalization and Rough Set Feature Selection

Xiaohua Hu and Nick Cercone

Abstract. We present a data mining method which integrates discretization, generalization and rough set feature selection. Our method reduces the data horizontally and vertically. In the first phase, discretization and generalization are integrated. Numeric attributes are discretized into a few intervals. The primitive values of symbolic attributes are replaced by high level concepts and some obviously superfluous or irrelevant symbolic attributes are also eliminated. The horizontal reduction is done by merging identical tuples after substituting an attribute value by its higher level value in a pre-defined concept hierarchy for symbolic attributes, or the discretization of continuous (or numeric) attributes. This phase greatly decreases the number of tuples we consider further in the database(s). In the second phase, a novel context-sensitive feature merit measure is used to rank features, and a subset of relevant attributes is chosen based on rough set theory and the merit values of the features. A reduced table is obtained by removing those attributes which are not in the relevant attributes subset and the data set is further reduced vertically without changing the interdependence relationships between the classes and the attributes. Finally, the tuples in the reduced relation are transformed into different knowledge rules based on different knowledge discovery algorithms. Based on these principles, a prototype knowledge discovery system DBROUGH-II has been constructed by integrating discretization, generalization, rough set feature selection and a variety of data mining algorithms. Tests on a telecommunication customer data warehouse demonstrate that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.
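
The horizontal reduction described in this abstract can be illustrated with a minimal sketch. It is not the DBROUGH-II implementation: the function names, the city-to-province concept hierarchy, the age cut points, and the sample rows are all hypothetical and chosen only to show numeric values being mapped to intervals, symbolic values being replaced by higher-level concepts, and identical tuples then being merged.

```python
from collections import Counter


def discretize(value, cut_points, labels):
    """Map a numeric value to an interval label.

    Assumes cut_points is sorted ascending and that labels has exactly
    one more entry than cut_points (hypothetical convention).
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def reduce_horizontally(rows, hierarchy, cut_points, labels):
    """Generalize (city, age) tuples and merge identical results.

    Returns a Counter mapping each generalized tuple to the number of
    original tuples it now represents.
    """
    generalized = [
        (hierarchy.get(city, city), discretize(age, cut_points, labels))
        for city, age in rows
    ]
    return Counter(generalized)


# Hypothetical data: four original tuples.
rows = [("Regina", 23), ("Saskatoon", 27), ("Toronto", 45), ("Ottawa", 52)]
hierarchy = {
    "Regina": "Saskatchewan",
    "Saskatoon": "Saskatchewan",
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
}
reduced = reduce_horizontally(rows, hierarchy, [30, 50], ["young", "middle", "senior"])
```

In this toy input the two Saskatchewan tuples collapse into one generalized tuple with a count of 2, so four rows become three. The rough set feature selection of the second phase is not sketched here.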

Towards Automated Case Knowledge Discovery in the M2 Case-Based Reasoning System

D. Patterson, S. S. Anand, W. Dubitzky and J. G. Hughes

Abstract. In this paper we present the M2 Case-Based Reasoning (CBR) system. The M2 system addresses a number of issues that present methodologies for CBR systems have shied away from. We discuss techniques for removing the knowledge acquisition bottleneck when acquiring case knowledge. Here, case knowledge refers to the complementary knowledge structures, cases (more specific in nature) and adaptation rules (more general). We address the use of negative cases for updating the case knowledge as well as for refining the similarity measures. In particular we discuss in detail, showing experimental results, the use of Data Mining within the M2 system to build the case base from a database containing operational data, and to discover adaptation rules. A methodology to monitor the competence of the CBR system and to utilize negative cases for updating the CBR system to enhance its competence is also discussed. The M2 CBR system also employs Rough Set and Fuzzy Set theories to further enhance its capabilities within real-world applications and to provide a richer and truer model of human reasoning.

Learning from Batched Data: Model Combination Versus Data Combination

Kai Ming Ting, Boon Toh Low, and Ian H. Witten

Abstract. The approach of combining models learned from multiple batches of data provides an alternative to the common practice of learning one model from all the available data (i.e. the data combination approach). This paper empirically examines the base-line behavior of the model combination approach in this multiple-data-batches scenario. We find that model combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used. At the beginning of the curve, model combination has higher bias and variance than data combination and thus a higher error rate. As training data increases, model combination has either a lower error rate than or a comparable performance to data combination because the former achieves larger variance reduction. We also show that this result is not sensitive to the methods of model combination employed. Another interesting result is that we empirically show that the near-asymptotic performance of a single model in some classification tasks can be significantly improved by combining multiple models (derived from the same algorithm) in the multiple-data-batches scenario.

Short Papers

Comparative Evaluation of Two Neural Network Based Techniques for the Classification of Microcalcifications in Digital Mammograms

Brijesh K. Verma

Abstract. This paper investigates two neural network based techniques for the classification of microcalcifications in digital mammograms. Both techniques extract suspicious areas containing microcalcifications from digital mammograms and classify them into two categories: whether they contain benign or malignant clusters. The centroids and radii provided by expert radiologists are used to locate and extract suspicious areas. Two neural networks, based on iterative and non-iterative training methods, are used to classify them as benign or malignant. The proposed techniques have been implemented in C++ on the SP2 supercomputer. The database from the Department of Radiology at the University of Nijmegen has been used for the experiments. The comparative results are very interesting and promising. Some of them are included in this paper.

Managing Null Entries in Pairwise Comparisons

W.W. Koczkodaj, Michael W. Herman, and Marian Orlowski

Abstract. This paper shows how to manage null entries in pairwise comparisons matrices. Although assessments can be imprecise, since subjective criteria are involved, the classical pairwise comparisons theory expects all of them to be available. In practice, some experts may not be able (or available) to provide all assessments. Therefore managing null entries is a necessary extension of the pairwise comparisons method. It is shown that certain null entries can be recovered on the basis of the transitivity property which each pairwise comparisons matrix is expected to satisfy.
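
The transitivity-based recovery mentioned in this abstract can be sketched as follows. This is an illustrative reading, not the authors' procedure: it assumes the usual multiplicative transitivity condition a[i][k] = a[i][j] * a[j][k] and reciprocity a[k][i] = 1 / a[i][k], marks null entries with `None`, and uses the hypothetical function name `fill_nulls`.

```python
from fractions import Fraction


def fill_nulls(matrix):
    """Recover None entries of a pairwise comparisons matrix where possible.

    An entry a[i][k] is recovered when some intermediate j has both
    a[i][j] and a[j][k] known, using a[i][k] = a[i][j] * a[j][k].
    The reciprocal entry a[k][i] is set at the same time. Entries that
    cannot be reached through known values stay None. The input matrix
    is not modified.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    changed = True
    while changed:  # repeat until no further entry can be recovered
        changed = False
        for i in range(n):
            for k in range(n):
                if a[i][k] is not None:
                    continue
                for j in range(n):
                    if a[i][j] is not None and a[j][k] is not None:
                        a[i][k] = a[i][j] * a[j][k]
                        a[k][i] = 1 / a[i][k]
                        changed = True
                        break
    return a


# Hypothetical 3x3 matrix: the comparison between items 0 and 2 is missing.
F = Fraction
M = [
    [F(1), F(2), None],
    [F(1, 2), F(1), F(3)],
    [None, F(1, 3), F(1)],
]
filled = fill_nulls(M)
```

Here the missing entry is recovered as 2 * 3 = 6 and its reciprocal as 1/6. The sketch takes the first usable intermediate j; when the known assessments are not perfectly consistent, different intermediates can give different values, and how to reconcile them is outside what the abstract states.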

