Project Title: Pattern Matching with Wildcards and Length Constraints
Sponsor: NSF, Award No. CCF-0514819
PI: Xindong Wu; Co-PIs: Abdullah N. Arslan and Xingquan Zhu
Duration: July 15, 2005 - June 30, 2008

Project Summary

This research defines a unique problem of pattern matching with wildcards and length constraints, and aims to design efficient algorithms for the problem. Given a pattern P and a text T, a substring S in T is a matching string of P if (1) the number of wildcards between each two consecutive pattern letters in S and (2) the length of S are both bounded by the user's specifications. The project seeks to find the maximum number of ``distinct'' occurrences of P in T. This is a complex problem that integrates both local constraints (in the form of gaps between consecutive pattern letters) and global length constraints in pattern matching.

The research team will start with an existing preliminary design and further investigate the pattern matching problem, by (1) exploring the time complexity of the problem, (2) designing new, efficient algorithms to deal with some special cases, and (3) applying these efficient algorithms in practical problems in text indexing, gene sequence analysis, network security and stream data mining.

Publications

  1. Xindong Wu and Xingquan Zhu, Mining with Noise Knowledge: Error-Aware Data Mining, IEEE Transactions on System, Man and Cybernetics, Part A, 38(2008), 4: 917-932.
  2. Gong Chen, Xindong Wu, and Xingquan Zhu, Mining Sequential Patterns across Time Sequences, New Generation Computing, 26(2008), 1: 75-96
  3. Xingquan Zhu, Peng Zhang, Xindong Wu, Dan He, Cleansing Noisy Data Streams, Proceedings of the 8th IEEE International Conference on Data Mining, accepted, December 15-19, 2008, Pisa, Italy.
  4. Dan He, Abdullah N. Arslan, Yu He and Xindong Wu, Iterative Refinement of Repeat Sequence Specification Using Constrained Pattern Matching, Proceedings of the IEEE 7th International Symposium on Bioinformatics & Bioengineering (IEEE BIBE 2007), Harvard Medical School Conference Center, Cambridge - Boston, Massachusetts, USA, October 14-17, 2007, 1199-1203.
  5. Dan He, Xindong Wu and Xingquan Zhu, SAIL-APPROX: An Efficient On-line Algorithm for Approximate Pattern Matching with Wildcards and Length Constraints, Proceedings of the 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07) (acceptance rate: 60/133), San Jose, CA, USA, November 2-4, 2007, 151-158.
  6. Yu He, Xindong Wu, Xingquan Zhu and Abdullah N Arslan, Mining Frequent Patterns with Wildcards from Biological Sequences, Proceedings of the 2007 IEEE International Conference on Information Reuse and Integration (IEEE IRI-07), August 13 - 15, 2007, Hilton Hotel, Las Vegas, USA, 329-334.
  7. Xingquan Zhu and Xindong Wu, Discovering Relational Patterns across Multiple Databases, Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007) (acceptance rate: 122/659 for full papers), April 16-20, 2007, The Marmara Hotel, Istanbul, Turkey, 726-735.
  8. Xingquan Zhu and Xindong Wu, Mining Complex Patterns across Sequences with Gap Requirements, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07) (acceptance rate: 212/1353 for regular, oral presentations), Hyderabad, India, January 6-12, 2007, 2934-2940.
  9. Gong Chen, Xindong Wu, Xingquan Zhu, Abdullah N. Arslan, and Yu He, Efficient String Matching with Wildcards and Length Constraints, Knowledge and Information Systems, 10(2006), 4: 399-419.
  10. Dan He, Abdullah N. Arslan, Alan C. H. Ling, "A fast algorithm for the constrained multiple sequence alignment problem", Acta Cybernetica, p. 701, vol. 17, (2006).
  11. Dan He and Xindong Wu, An Efficient Algorithm for Finding Approximate Complex Repetitive Patterns, Proceedings of the 2006 IASTED International Conference on Computational and Systems Biology (CASB 2006), November 13-14, 2006, Dallas, Texas, USA, 44-49.

This page has been accessed times since June 15, 2005.
Last updated: September 26, 2008.