Data Mining Paper Presentations

Spring 2009

Please refer to the Research Guide for information about what to say in a good talk.

    Database Specialty

  1. J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
    Spring 2009 Presenter: Song Wang
    Presentation Date: March 18, 60 min
    2009 presentation slides in PowerPoint: Here.
    Abstract. Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study,we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix- tree structure for storing compressed, crucial information about frequent patterns, and develop an e#cient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, whichavoids costly, repeated database scans, (2) our FP-treebased mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
  2. R. Srikant and R. Agrawal, Quantitative Association Rules, Proc. of the ACM SIGMOD Int'l Conference on Management of Data, 1996.
    Spring 2009 Presenter: Sasi Kunta
    Presentation Date: March 23, 60 min
    2009 Presentation slides in PowerPoint: Here.
  3. Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22(2004), 3: 381-405.
    Spring 2009 Presenter: Tianyu Cao
    Presentation Date: March 25, 60 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. This paper presents an efficient method for mining both positive and negative association rules in databases. The method extends traditional associations to include association rules of forms A ⇒ ¬ B, ¬ AB, and ¬ A ⇒ ¬ B, which indicate negative associations between itemsets. With a pruning strategy and an interestingness measure, our method scales to large databases. The method has been evaluated using both synthetic and real-world databases, and our experimental results demonstrate its effectiveness and efficiency.
  4. J. Han and Y. Fu, Multiple-Level Association Rules, IEEE Transactions on Knowledge and Data Engineering, Volume 11, Number 5, October 1999.
    Spring 2009 Presenter: Ramesh Kumar
    Presentation Date: March 30, 60 min
    2009 presentation slides in PowerPoint: Here.
    Abstract. A top-down progressive deepening method is developed for efficient mining of multiple-level association rules from large transaction databases based on the Apriori principle. A group of variant algorithms is proposed based on the ways of sharing intermediate results, with the relative performance tested and analyzed. The enforcement of different interestingness measurements to find more interesting rules, and the relaxation of rule conditions for finding level- crossing efficient algorithms can be developed from large databases for the discovery of interesting and strong multiple-level association rules.
  5. M. Hernandez and S. Stolfo, Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem, Data Mining and Knowledge Discovery, Volume 2, Issue 1, 1998, 9-37.
    Spring 2007 Presenter: Tristan Schneiter
    Presentation Date: March 29, 60 min
    2007 Presentation slides in PowerPoint: Here.
  6. Chun-Nan Hsu and Graig A. Knoblock, Discovering Robust Knowledge from Databases that Change, Data Mining and Knowledge Discovery, Volume 2, Issue 1, 1998, 69-95.
    Spring 2009 Presenter: Mark Stebbins
    Presentation Date: April 1, 60 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. Many applications of knowledge discovery and data mining such as rule discovery for semantic query optimization, database integration and decision support, require the knowledge to be consistent with the data. However, databases usually change over time and make machine-discovered knowledge inconsistent. Useful knowledge should be robust against database changes so that it is unlikely to become inconsistent after database updates. This paper defines this notion of robustness in the context of relational databases and describes how robustness of first-order Horn-clause rules can be estimated. Experimental results show that our estimation approach can accurately identify robust rules. We also present a rule antecedent pruning algorithm that improves the robustness and applicability of machine discovered rules to demonstrate the usefulness of robustness estimation.
  7. S.D. Lee, David Cheung and Ben Kao, Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules, Data Mining and Knowledge Discovery, Volume 2, Issue 3, 1998, 233-262.
    Spring 2009 Presenter: Matthew Tretin
    Presentation Date: April 6, 60 min
    2009 Presentation slides in PowerPoint: Here.

    AI Specialty

  8. Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). 18, 6 (Jun. 1996), 607-616.
    Spring 2009 Presenter: Jesse Fleming
    Presentation Date: April 8, 60 min
    2009 Presentation slides in PowerPoint: Here.
  9. R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
    2006 Presentation slides in PowerPoint: Here.
  10. Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
    Spring 2009 Presenter: Glenn Rachlin
    Presentation Date: April 13, 60 min
    2009 Presentation slides in PowerPoint: Here.
  11. Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh international Conference on World Wide Web (WWW-7) (Brisbane, Australia). P. H. Enslow and A. Ellis, Eds. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 107-117.
    Spring 2009 Presenter: Michael Karpeles
    Presentation Date: April 15, 60 min
    2009 presentation slides in PDF: Here with an example.
  12. Xindong Wu, Fuzzy Interpretation of Discretized Intervals, IEEE Transactions on Fuzzy Systems, Volume 7, Number 6, 1999, 753-759. 2006 Presentation slides in PowerPoint: Here.
    Abstract. When there are both numerical and nominal attributes in a database, existing data mining systems (such as rule induction and decision tree construction) discretize numerical domains into intervals, and the discretized intervals are treated in a similar way to nominal values during deduction. This paper describes a type of fuzzy intervals implemented in the HCV (Version 2.0) rule induction software for the interpretation of rule induction results when rules with sharp intervals do not clearly apply to a test example at hand. A battery of experiment results with HCV show that these fuzzy intervals are useful.
  13. Tom Fawcett and Foster Provost, Data Mining for Adaptive Fraud Detection, Data Mining and Knowledge Discovery, Volume 1, Issue 3, 1997, 291-316.
    Spring 2009 Presenter: Adam Boye
    Presentation Date: April 20, 60 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions. Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies. Finally, the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms. The system has been applied to the problem of detecting cellular cloning fraud based on a database of call records. Experiments indicate that this automatic approach performs better than hand-crafted methods for detecting fraud. Furthermore, this approach can adapt to the changing conditions typical of fraud detection environments.
  14. J. Dorre, P. Gerstl, and R. Seiffert, Text Mining: Finding Nuggets in Mountains of Textual Data, Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, August 15-18, 1999, 398-401.
    Spring 2009 Presenter: Huimin Ye
    Presentation Date: April 22, 35 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. Text mining applies the same analytical functions of data mining to the domain of textual information, relying on sophisticated text analysis techniques that distill information from free-text documents. IBM's Intelligent Miner for Text provides the necessary tools to unlock the business information that is "trapped" in email, insurance claims, news feeds, or other document repositories. It has been successfully applied in analyzing patent portfolios, customer complaint letters, and even competitors' Web pages. After defining our notion of "text mining", we focus on the differences between text and data mining and describe in some more detail the unique technologies that are key to successful text mining.
  15. R. Kosala and H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations, June 2000. Volume 2, Issue 1.
    Spring 2009 Presenter: Fan Min
    Presentation Date: April 22, 35 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. With the huge amount of information available online, the World Wide Web is a fertile area for data mining research. Web mining research is at the cross roads of research from several research communities, such as database, information retrieval, and within AI, especially the sub-areas of machine learning and natural language processing. However, there is confusion when comparing efforts from different points of view. In this paper, we survey the research in the area of Web mining, point out some confusion regarding usage of the term Web mining and suggest three Web mining categories. We then situate some of the research with respect to these three categories. We also explore the connection between the Web mining categories and the related agent paradigm. For the survey, we focus on representation issues, on the process, on the learning algorithm, and on the application of the recent works as the criteria. We conclude the paper with some research issues.

    Statistics Specialty

  16. Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.
    Spring 2009 Presenter: John DiMona
    Presentation Date: April 27, 60 min
    2009 Presentation slides in PowerPoint: Here.
  17. McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
  18. Douglas Fisher, Iterative Optimization and Simplification of Hierarchical Clusterings, Journal of Artificial Intelligence Research, 4(1996), 147-180.
    Spring 2007 Presenter: Paul Haake
    Presentation Date: April 19, 60 min
    2007 Presentation slides in PowerPoint: Here.
    Abstract. Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a `tentative' clustering for initial examination, followed by iterative optimization, which continues to search in background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been constructed it is judged by analysts -- often according to task-specific criteria. Several authors have abstracted these criteria and posited a generic performance task akin to pattern completion, where the error rate over completed patterns is used to `externally' judge clustering utility. Given this performance task, we adapt resampling-based pruning strategies used by supervised learning systems to the task of simplifying hierarchical clusterings, thus promising to ease post-clustering analysis. Finally, we propose a number of objective functions, based on attribute-selection measures for decision-tree induction, that might perform well on the error rate and simplicity dimensions.
  19. David Heckerman, Bayesian Networks for Data Mining, Data Mining and Knowledge Discovery, Volume 1, Issue 1, 1997.
    2006 Presentation slides in PowerPoint: Here.
    Abstract. Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data mining. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining domain knowledge( which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from domain knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
  20. Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery, Volume 1, Issue 2, 1997, 141-182.
    Spring 2009 Presenter: Zhao Li
    Presentation Date: April 29, 60 min
    2009 Presentation slides in PowerPoint: Here.
    Abstract. Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called *data mining,* and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called *CF-tree.* which serves as an in-memory summary of the data distribution. We have it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.

Please e-mail queries and comments to xwu@cs.uvm.edu.