Knowledge and Information Systems
Abstract. The World Wide Web (WWW) continues to grow at an
astounding rate in both the sheer volume of traffic and the size and
complexity of Web sites. The complexity of tasks such as Web site
design, Web server design, and simply navigating through a Web site
has increased along with this growth. An important input to these
design tasks is the analysis of how a Web site is being used. Usage
analysis includes straightforward statistics, such as page access
frequency, as well as more sophisticated forms of analysis, such as
finding the common traversal paths through a Web site. Web Usage
Mining is the application of data mining techniques to usage logs of
large Web data repositories in order to produce results that can be
used in the design tasks mentioned above. However, there are several
preprocessing tasks that must be performed prior to applying data
mining algorithms to the data collected from server logs. This paper
presents several data preparation techniques in order to identify
unique users and user sessions. Also, a method to divide user sessions
into semantically meaningful transactions is defined and successfully
tested against two other methods. Transactions identified by the
proposed methods are used to discover association rules from real
world data using the WEBMINER system [Mobasher, Jain, Han and
Srivastava 1996].
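The user and session identification step mentioned in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a user is approximated by the (IP address, user agent) pair, a session ends after 30 minutes of inactivity, and the function name `sessionize` and the entry layout are hypothetical.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds; not a value from the paper.
SESSION_TIMEOUT = 30 * 60


def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group (ip, agent, timestamp, url) log entries into user sessions.

    A "user" is approximated by the (ip, agent) pair, which is only a
    heuristic. A new session starts whenever the gap between two
    consecutive requests of the same user exceeds `timeout`.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in entries:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions


log = [
    ("1.1.1.1", "A", 0, "/a"),
    ("1.1.1.1", "A", 100, "/b"),
    ("1.1.1.1", "A", 5000, "/c"),
    ("2.2.2.2", "B", 50, "/x"),
]
result = sessionize(log)
```

With the sample log, the first user's requests split into two sessions because the gap before "/c" exceeds the assumed timeout.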
Abstract. We present a data mining method which integrates
discretization, generalization and rough set feature selection. Our
method reduces the data horizontally and vertically. In the first
phase, discretization and generalization are integrated. Numeric
attributes are discretized into a few intervals. The primitive values
of symbolic attributes are replaced by high level concepts and some
obvious superfluous or irrelevant symbolic attributes are also
eliminated. The horizontal reduction is done by merging identical
tuples after substituting an attribute value by its higher level value
in a pre-defined concept hierarchy for symbolic attributes, or the
discretization of continuous (or numeric) attributes. This phase
greatly decreases the number of tuples we consider further in the
database(s). In the second phase, a novel context-sensitive feature
merit measure is used to rank features, and a subset of relevant
attributes is chosen based on rough set theory and the merit values
of the features. A reduced table is obtained by removing those
attributes which are not in the relevant attributes subset and the
data set is further reduced vertically without changing the
interdependence relationships between the classes and the
attributes. Finally, the tuples in the reduced relation are
transformed into different knowledge rules based on different
knowledge discovery algorithms. Based on these principles, a
prototype knowledge discovery system DBROUGH-II has been constructed
by integrating discretization, generalization, rough set feature
selection and a variety of data mining algorithms. Tests on a
telecommunication customer data warehouse demonstrate that different
kinds of knowledge rules, such as characteristic rules, discriminant
rules, maximal generalized classification rules, and data evolution
regularities, can be discovered efficiently and effectively.
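The first phase described in the abstract above (discretize numeric attributes, lift symbolic values through a concept hierarchy, then merge identical tuples) can be sketched as follows. This is an illustration under stated assumptions, not the DBROUGH-II implementation: the cut points, the hierarchy, the sample rows, and the function names are all hypothetical.

```python
from collections import Counter


def discretize(value, cut_points, labels):
    """Map a numeric value to an interval label.

    `labels` must have one more element than `cut_points`; the value gets
    the label of the first cut point it falls below, else the last label.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def generalize_and_merge(rows, hierarchy, cut_points, labels):
    """Generalize (symbolic, numeric) rows and merge identical tuples.

    Returns a Counter mapping each generalized tuple to the number of
    original rows it absorbed (the horizontal reduction).
    """
    merged = Counter()
    for symbol, number in rows:
        high_level = hierarchy.get(symbol, symbol)  # keep value if no parent
        merged[(high_level, discretize(number, cut_points, labels))] += 1
    return merged


# Hypothetical sample data: (city, monthly usage).
rows = [("Regina", 12.0), ("Saskatoon", 15.0), ("Toronto", 80.0)]
hierarchy = {"Regina": "Saskatchewan", "Saskatoon": "Saskatchewan",
             "Toronto": "Ontario"}
reduced = generalize_and_merge(rows, hierarchy, [50.0], ["low", "high"])
```

Here three original tuples collapse to two generalized ones, with the count recording how many rows each one represents.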
Abstract. In this paper we present the M2 Case-Based
Reasoning (CBR) system. The M2 system addresses a number
of issues that present methodologies for CBR systems have shied away
from. We discuss techniques for removing the knowledge acquisition
bottleneck when acquiring case knowledge. Here, case knowledge refers
to the complementary knowledge structures, cases (more specific in
nature) and adaptation rules (more general). We address the use of
negative cases for updating the case knowledge as well as for refining
the similarity measures. In particular, we discuss in detail, with
experimental results, the use of Data Mining within the M2
system to build the case base from a database containing operational
data and to discover adaptation rules. A methodology to monitor the
competence of the CBR system and to utilize negative cases for
updating the CBR system to enhance its competence is also
discussed. The M2 CBR system also employs Rough Set and
Fuzzy Set theories to further enhance its capabilities within
real-world applications and to provide a richer and truer model
of human reasoning.
Abstract. The approach of combining models learned from
multiple batches of data provides an alternative to the common practice
of learning one model from all the available data (i.e. the data
combination approach). This paper empirically examines the base-line
behavior of the model combination approach in this
multiple-data-batches scenario. We find that model combination can
lead to better performance even if the disjoint batches of data are
drawn randomly from a larger sample, and relate the relative
performance of the two approaches to the learning curve of the
classifier used. In the beginning of the curve, model combination has
higher bias and variance than data combination and thus a higher error
rate. As training data increases, model combination has either a lower
error rate than or a comparable performance to data combination
because the former achieves larger variance reduction. We also show
that this result is not sensitive to the methods of model combination
employed. Another interesting result is that we empirically show that
the near-asymptotic performance of a single model in some
classification tasks can be significantly improved by combining
multiple models (derived from the same algorithm) in the
multiple-data-batches scenario.
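The two approaches compared in the abstract above can be contrasted in a small sketch. The learner (a one-dimensional nearest-centroid classifier) and the combination rule (unweighted majority vote) are assumptions chosen only to keep the example short; the paper's own classifiers and combination methods are not reproduced here, and all function names are hypothetical.

```python
from collections import Counter


def train_centroid(batch):
    """Toy learner: store the mean of a 1-D feature for each class."""
    sums, counts = {}, {}
    for x, y in batch:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def model_combination(batches, x):
    """Train one model per batch and combine predictions by majority vote."""
    votes = Counter(predict(train_centroid(b), x) for b in batches)
    return votes.most_common(1)[0][0]


def data_combination(batches, x):
    """Pool all batches and train a single model on the pooled data."""
    pooled = [example for b in batches for example in b]
    return predict(train_centroid(pooled), x)


# Hypothetical disjoint batches of (feature, label) pairs.
batches = [
    [(0.0, "neg"), (1.0, "pos")],
    [(0.2, "neg"), (1.2, "pos")],
    [(0.1, "neg"), (0.9, "pos")],
]
```

On data this easy both approaches agree; the abstract's claims concern how their error rates diverge along the learning curve, which a toy example cannot show.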
Abstract. This paper investigates two neural network based
techniques for the classification of microcalcifications in digital
mammograms. Both techniques extract suspicious areas containing
microcalcifications from digital mammograms and classify them
according to whether they contain benign or malignant clusters. The
centroids and radii provided by expert radiologists are used to
locate and extract suspicious areas. Two neural networks, based on
iterative and non-iterative training methods, are used to classify
the areas as benign or malignant. The proposed techniques have been
implemented in C++ on the SP2 supercomputer. The database from the
Department of Radiology at the University of Nijmegen has been used
for the experiments. The comparative results are very interesting and
promising. Some of them are included in this paper.
Abstract. This paper shows how to manage null entries in
pairwise comparisons matrices. Although assessments can be imprecise,
since subjective criteria are involved, the classical pairwise
comparisons theory expects all of them to be available. In practice,
some experts may not be able (or available) to provide all
assessments. Therefore managing null entries is a necessary extension
of the pairwise comparisons method. It is shown that certain null
entries can be recovered on the basis of the transitivity property
which each pairwise comparisons matrix is expected to satisfy.
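Recovering a null entry from transitivity, as described in the abstract above, can be sketched briefly. The sketch assumes the property takes the usual multiplicative form a[i][k] = a[i][j] * a[j][k]; the abstract does not state the exact rule, so this form, the use of `None` for null entries, and the function name `recover_nulls` are assumptions rather than the authors' procedure.

```python
def recover_nulls(matrix):
    """Fill None entries using a[i][k] = a[i][j] * a[j][k] where possible.

    Passes are repeated until no further entry can be derived. Entries
    with no connecting pair of known values remain None.
    """
    n = len(matrix)
    filled = [row[:] for row in matrix]  # work on a copy
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for k in range(n):
                if filled[i][k] is not None:
                    continue
                for j in range(n):
                    if filled[i][j] is not None and filled[j][k] is not None:
                        filled[i][k] = filled[i][j] * filled[j][k]
                        changed = True
                        break
    return filled


# Hypothetical 3x3 matrix with the (0, 2) and (2, 0) assessments missing.
m = [
    [1.0, 2.0, None],
    [0.5, 1.0, 3.0],
    [None, 1.0 / 3.0, 1.0],
]
recovered = recover_nulls(m)
```

Only entries linked to known assessments through some intermediate criterion are filled, which matches the abstract's wording that certain, not all, null entries can be recovered.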
Data Mining via Discretization, Generalization and Rough Set
Feature Selection
Xiaohua Hu and Nick Cercone
Towards Automated Case Knowledge Discovery in the M2
Case-Based Reasoning System
D. Patterson, S. S. Anand, W. Dubitzky and J. G. Hughes
Learning from Batched Data: Model Combination Versus Data
Combination
Kai Ming Ting, Boon Toh Low, and Ian H. Witten
Short Papers
Comparative Evaluation of Two Neural Network Based Techniques for
the Classification of Microcalcifications in Digital Mammograms
Brijesh K. Verma
Managing Null Entries in Pairwise Comparisons
W.W. Koczkodaj, Michael W. Herman, and Marian Orlowski