Data Mining Paper Presentations
Spring 2009
Please refer to the Research
Guide for information about what to say in a good talk.
Database Specialty
-
J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate
Generation, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of
Data (SIGMOD'00), Dallas, TX, May 2000.
Spring 2009 Presenter: Song Wang
Presentation Date: March 18, 60 min
2009 presentation slides in PowerPoint: Here.
Abstract.
Mining frequent patterns in transaction databases, time-series
databases, and many other kinds of databases has been studied
popularly in data mining research. Most of the previous studies adopt
an Apriori-like candidate set generation-and-test approach. However,
candidate set generation is still costly, especially when there exist
prolific patterns and/or long patterns. In this study,we propose a
novel frequent pattern tree (FP-tree) structure, which is an extended
prefix- tree structure for storing compressed, crucial information
about frequent patterns, and develop an e#cient FP-tree-based mining
method, FP-growth, for mining the complete set of frequent patterns by
pattern fragment growth. Efficiency of mining is achieved with three
techniques: (1) a large database is compressed into a highly
condensed, much smaller data structure, whichavoids costly, repeated
database scans, (2) our FP-treebased mining adopts a pattern
fragment growth method to avoid the costly generation of a large
number of candidate sets, and (3) a partitioning-based,
divide-and-conquer method is used to decompose the mining task into a
set of smaller tasks for mining confined patterns in conditional
databases, which dramatically reduces the search space. Our
performance study shows that the FP-growth method is efficient and
scalable for mining both long and short frequent patterns, and is
about an order of magnitude faster than the Apriori algorithm and also
faster than some recently reported new frequent pattern mining
methods.
- R. Srikant and R. Agrawal, Quantitative Association Rules,
Proc. of the ACM SIGMOD Int'l Conference on Management of Data,
1996.
Spring 2009 Presenter: Sasi Kunta
Presentation Date: March 23, 60 min
2009 Presentation slides in PowerPoint: Here.
- Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative
Association Rules
, ACM Transactions on
Information Systems, 22(2004), 3: 381-405.
Spring 2009 Presenter: Tianyu Cao
Presentation Date: March 25, 60 min
2009 Presentation slides in PowerPoint: Here.
Abstract.
This paper presents an
efficient method for mining both positive and negative association
rules in databases. The method extends traditional associations to
include association rules of forms A ⇒ ¬ B,
¬ A ⇒ B, and ¬ A ⇒ ¬
B, which indicate negative associations between itemsets. With
a pruning strategy and an interestingness measure, our method scales
to large databases. The method has been evaluated using both synthetic
and real-world databases, and our experimental results demonstrate its
effectiveness and efficiency.
- J. Han and Y. Fu, Multiple-Level
Association Rules, IEEE Transactions on Knowledge and Data
Engineering, Volume 11, Number 5, October 1999.
Spring 2009 Presenter: Ramesh Kumar
Presentation Date: March 30, 60 min
2009 presentation slides in PowerPoint: Here.
Abstract. A top-down progressive deepening method is developed
for efficient mining of multiple-level association rules from large
transaction databases based on the Apriori principle. A group of
variant algorithms is proposed based on the ways of sharing
intermediate results, with the relative performance tested and
analyzed. The enforcement of different interestingness measurements to
find more interesting rules, and the relaxation of rule conditions for
finding level- crossing efficient algorithms can be developed from
large databases for the discovery of interesting and strong
multiple-level association rules.
- M. Hernandez and S. Stolfo, Real-World
Data is Dirty: Data Cleansing and The Merge/Purge Problem,
Data Mining and Knowledge Discovery, Volume 2, Issue 1,
1998, 9-37.
Spring 2007 Presenter: Tristan Schneiter
Presentation Date: March 29, 60 min
2007 Presentation slides in PowerPoint: Here.
- Chun-Nan Hsu and Graig A. Knoblock, Discovering Robust Knowledge from Databases that
Change, Data Mining and Knowledge Discovery, Volume 2,
Issue 1, 1998, 69-95.
Spring 2009 Presenter: Mark Stebbins
Presentation Date: April 1, 60 min
2009 Presentation slides in PowerPoint: Here.
Abstract. Many applications of knowledge discovery and data
mining such as rule discovery for semantic query optimization,
database integration and decision support, require the knowledge to be
consistent with the data. However, databases usually change over time
and make machine-discovered knowledge inconsistent. Useful knowledge
should be robust against database changes so that it is unlikely to
become inconsistent after database updates. This paper defines this
notion of robustness in the context of relational databases and
describes how robustness of first-order Horn-clause rules can be
estimated. Experimental results show that our estimation approach can
accurately identify robust rules. We also present a rule antecedent
pruning algorithm that improves the robustness and applicability of
machine discovered rules to demonstrate the usefulness of robustness
estimation.
- S.D. Lee, David Cheung and Ben Kao, Is
Sampling Useful in Data Mining? A Case in the Maintenance of
Discovered Association Rules, Data Mining and Knowledge
Discovery, Volume 2, Issue 3, 1998, 233-262.
Spring 2009 Presenter: Matthew Tretin
Presentation Date: April 6, 60 min
2009 Presentation slides in PowerPoint: Here.
AI Specialty
- Hastie, T. and Tibshirani, R. 1996.
Discriminant Adaptive Nearest
Neighbor Classification.
IEEE Trans. Pattern
Anal. Mach. Intell. (TPAMI). 18, 6 (Jun. 1996), 607-616.
Spring 2009 Presenter: Jesse Fleming
Presentation Date: April 8, 60 min
2009 Presentation slides in PowerPoint: Here.
- R. Agrawal and R. Srikant, Mining
Sequential Patterns, Proc. of the Int'l Conference on Data
Engineering (ICDE), Taipei, Taiwan, March 1995.
2006 Presentation slides in PowerPoint: Here.
- Freund, Y. and Schapire, R. E. 1997.
A decision-theoretic generalization of
on-line learning and an application to boosting.
J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
Spring 2009 Presenter: Glenn Rachlin
Presentation Date: April 13, 60 min
2009 Presentation slides in PowerPoint: Here.
- Brin, S. and Page, L. 1998.
The anatomy of a large-scale
hypertextual Web search engine. In Proceedings of the Seventh
international Conference on World Wide Web (WWW-7) (Brisbane,
Australia). P. H. Enslow and A. Ellis, Eds. Elsevier Science
Publishers B. V., Amsterdam, The Netherlands, 107-117.
Spring 2009 Presenter: Michael Karpeles
Presentation Date: April 15, 60 min
2009 presentation slides in PDF: Here with an example.
- Xindong Wu, Fuzzy Interpretation of Discretized Intervals,
IEEE Transactions on Fuzzy Systems, Volume 7, Number 6, 1999,
753-759.
2006 Presentation slides in PowerPoint: Here.
Abstract.
When there are both numerical and nominal attributes in a database,
existing data mining systems (such as rule induction and decision tree
construction) discretize numerical domains into intervals, and the
discretized intervals are treated in a similar way to nominal values
during deduction. This paper describes a type of fuzzy intervals
implemented in the HCV (Version 2.0) rule induction software for the
interpretation of rule induction results when rules with sharp
intervals do not clearly apply to a test example at hand. A battery
of experiment results with HCV show that these fuzzy intervals are
useful.
- Tom Fawcett and Foster Provost, Data
Mining for Adaptive Fraud Detection, Data Mining and Knowledge
Discovery, Volume 1, Issue 3, 1997, 291-316.
Spring 2009 Presenter: Adam Boye
Presentation Date: April 20, 60 min
2009 Presentation slides in PowerPoint: Here.
Abstract. One method for detecting fraud is to check for
suspicious changes in user behavior. This paper describes the
automatic design of user profiling methods for the purpose of fraud
detection, using a series of data mining techniques. Specifically, we
use a rule-learning program to uncover indicators of fraudulent
behavior from a large database of customer transactions. Then the
indicators are used to create a set of monitors, which profile
legitimate customer behavior and indicate anomalies. Finally, the
outputs of the monitors are used as features in a system that learns
to combine evidence to generate high-confidence alarms. The system
has been applied to the problem of detecting cellular cloning fraud
based on a database of call records. Experiments indicate that this
automatic approach performs better than hand-crafted methods for
detecting fraud. Furthermore, this approach can adapt to the changing
conditions typical of fraud detection environments.
- J. Dorre, P. Gerstl, and R. Seiffert, Text Mining: Finding
Nuggets in Mountains of Textual Data, Proceedings of the 5th ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, San Diego, California, August 15-18, 1999, 398-401.
Spring 2009 Presenter: Huimin Ye
Presentation Date: April 22, 35 min
2009 Presentation slides in PowerPoint: Here.
Abstract. Text mining applies the same analytical functions of
data mining to the domain of textual information, relying on
sophisticated text analysis techniques that distill information from
free-text documents. IBM's Intelligent Miner for Text provides the
necessary tools to unlock the business information that is "trapped"
in email, insurance claims, news feeds, or other document
repositories. It has been successfully applied in analyzing patent
portfolios, customer complaint letters, and even competitors' Web
pages. After defining our notion of "text mining", we focus on the
differences between text and data mining and describe in some more
detail the unique technologies that are key to successful text mining.
-
R. Kosala and H. Blockeel, Web Mining
Research: A Survey, SIGKDD Explorations, June 2000. Volume
2, Issue 1.
Spring 2009 Presenter: Fan Min
Presentation Date: April 22, 35 min
2009 Presentation slides in PowerPoint: Here.
Abstract.
With the huge amount of information available online, the World Wide
Web is a fertile area for data mining research. Web mining research
is at the cross roads of research from several research communities,
such as database, information retrieval, and within AI, especially the
sub-areas of machine learning and natural language processing.
However, there is confusion when comparing efforts from different
points of view. In this paper, we survey the research in the area of
Web mining, point out some confusion regarding usage of the term Web
mining and suggest three Web mining categories. We then situate some
of the research with respect to these three categories. We also
explore the connection between the Web mining categories and the
related agent paradigm. For the survey, we focus on representation
issues, on the process, on the learning algorithm, and on the
application of the recent works as the criteria. We conclude the
paper with some research issues.
Statistics Specialty
- Vapnik, V. N. 1995. The Nature of Statistical Learning
Theory. Springer-Verlag New York, Inc.
Spring 2009 Presenter: John DiMona
Presentation Date: April 27, 60 min
2009 Presentation slides in PowerPoint: Here.
-
McLachlan, G. and Peel, D. (2000). Finite Mixture Models.
J. Wiley, New York.
- Douglas Fisher, Iterative
Optimization and Simplification of Hierarchical Clusterings,
Journal of Artificial Intelligence Research, 4(1996), 147-180.
Spring 2007 Presenter: Paul Haake
Presentation Date: April 19, 60 min
2007 Presentation slides in PowerPoint: Here.
Abstract. Clustering is often used for discovering structure
in data. Clustering systems differ in the objective function used to
evaluate clustering quality and the control strategy used to search
the space of clusterings. Ideally, the search strategy should
consistently construct clusterings of high quality, but be
computationally inexpensive as well. In general, we cannot have it
both ways, but we can partition the search so that a system
inexpensively constructs a `tentative' clustering for initial
examination, followed by iterative optimization, which continues to
search in background for improved clusterings. Given this motivation,
we evaluate an inexpensive strategy for creating initial clusterings,
coupled with several control strategies for iterative optimization,
each of which repeatedly modifies an initial clustering in search of a
better one. One of these methods appears novel as an iterative
optimization strategy in clustering contexts. Once a clustering has
been constructed it is judged by analysts -- often according to
task-specific criteria. Several authors have abstracted these criteria
and posited a generic performance task akin to pattern completion,
where the error rate over completed patterns is used to `externally'
judge clustering utility. Given this performance task, we adapt
resampling-based pruning strategies used by supervised learning
systems to the task of simplifying hierarchical clusterings, thus
promising to ease post-clustering analysis. Finally, we propose a
number of objective functions, based on attribute-selection measures
for decision-tree induction, that might perform well on the error rate
and simplicity dimensions.
- David Heckerman, Bayesian
Networks for Data Mining, Data Mining and Knowledge
Discovery, Volume 1, Issue 1, 1997.
2006 Presentation slides in PowerPoint: Here.
Abstract. Bayesian network is a graphical model that encodes
probabilistic relationships among variables of interest. When used in
conjunction with statistical techniques, the graphical model has
several advantages for data mining. One, because the model encodes
dependencies among all variables, it readily handles situations where
some data entries are missing. Two, a Bayesian network can be used to
learn causal relationships, and hence can be used to gain
understanding about a problem domain and to predict the consequences
of intervention. Three, because the model has both a causal and
probabilistic semantics, it is an ideal representation for combining
domain knowledge( which often comes in causal form) and data. Four,
Bayesian statistical methods in conjunction with Bayesian networks
offer an efficient and principled approach for avoiding the
overfitting of data. In this paper, we discuss methods for
constructing Bayesian networks from domain knowledge and summarize
Bayesian statistical methods for using data to improve these
models. With regard to the latter task, we describe methods for
learning both the parameters and structure of a Bayesian network,
including techniques for learning with incomplete data. In addition,
we relate Bayesian-network methods for learning to techniques for
supervised and unsupervised learning. We illustrate the
graphical-modeling approach using a real-world case study.
- Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: A New Data Clustering Algorithm
and Its Applications, Data Mining and Knowledge Discovery,
Volume 1, Issue 2, 1997, 141-182.
Spring 2009 Presenter: Zhao Li
Presentation Date: April 29, 60 min
2009 Presentation slides in PowerPoint: Here.
Abstract. Data clustering is an important technique for
exploratory data analysis, and has been studied for several years. It
has been shown to be useful in many practical domains such as data
classification and image processing. Recently, there has been a
growing emphasis on exploratory analysis of very large datasets to
discover useful patterns and/or correlations among attributes. This is
called *data mining,* and data clustering is regarded as a particular
branch. However existing data clustering methods do not adequately
address the problem of processing large datasets with a limited amount
of resources (e.g., memory and cpu cycles). So as the dataset size
increases, they do not scale up well in terms of memory requirement,
running time, and result quality.
In this paper, an efficient and scalable data clustering method
is proposed, based on a new in-memory data structure called
*CF-tree.* which serves as an in-memory summary of the data
distribution. We have it in a system called BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies), and
studied its performance extensively in terms of memory
requirements, running time, clustering quality, stability and
scalability; we also compare it with other available
methods. Finally, BIRCH is applied to solve two real-life
problems: one is building an iterative and interactive pixel
classification tool, and the other is generating the initial
codebook for image compression.
Please e-mail queries and comments to xwu@cs.uvm.edu.