Project Title: Pattern Matching with Wildcards and Length
Sponsor: NSF, Award No. CCF-0514819
PI: Xindong Wu; Co-PIs: Abdullah N. Arslan and Xingquan Zhu
Duration: July 15, 2005 - June 30, 2008
This research defines a unique problem of pattern matching with
wildcards and length constraints, and aims to design efficient
algorithms for the problem. Given a pattern P and a text T, a
substring S in T is a matching string of P if (1) the number of
wildcards between each two consecutive pattern letters in S and (2)
the length of S are both bounded by the user's specifications. The
project seeks to find the maximum number of ``distinct'' occurrences of
P in T. This is a complex problem that integrates
both local constraints (in the form of gaps between consecutive
pattern letters) and global length constraints in pattern
The research team will start with an existing preliminary design and
further investigate the pattern matching problem, by (1) exploring
the time complexity of the problem, (2) designing new, efficient
algorithms to deal with some special cases, and (3) applying these
efficient algorithms in practical problems in text indexing, gene
sequence analysis, network security and stream data mining.
Xindong Wu and Xingquan Zhu, Mining with Noise Knowledge: Error-Aware
Data Mining, IEEE Transactions on System, Man and Cybernetics, Part A,
38(2008), 4: 917-932.
Gong Chen, Xindong Wu, and Xingquan Zhu, Mining Sequential Patterns
across Time Sequences, New Generation Computing, 26(2008), 1: 75-96
Xingquan Zhu, Peng Zhang, Xindong Wu, Dan He, Cleansing Noisy Data
Streams, Proceedings of the 8th IEEE International Conference on Data
Mining, accepted, December 15-19, 2008, Pisa, Italy.
- Dan He, Abdullah N. Arslan, Yu He and Xindong Wu, Iterative
Refinement of Repeat Sequence Specification Using Constrained Pattern
Matching, Proceedings of the IEEE 7th
International Symposium on Bioinformatics & Bioengineering (IEEE BIBE
2007), Harvard Medical School Conference Center, Cambridge -
Boston, Massachusetts, USA, October 14-17, 2007, 1199-1203.
- Dan He, Xindong Wu and Xingquan Zhu, SAIL-APPROX: An Efficient
On-line Algorithm for Approximate Pattern Matching with Wildcards and
Length Constraints, Proceedings of the 2007 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM'07) (acceptance rate: 60/133), San Jose, CA,
USA, November 2-4, 2007, 151-158.
Yu He, Xindong Wu, Xingquan Zhu and Abdullah N Arslan, Mining Frequent
Patterns with Wildcards from Biological Sequences, Proceedings of
the 2007 IEEE
International Conference on Information Reuse and Integration (IEEE
IRI-07), August 13 - 15, 2007, Hilton Hotel, Las Vegas, USA,
- Xingquan Zhu and Xindong Wu, Discovering Relational Patterns
across Multiple Databases,
Proceedings of the 23rd IEEE International Conference on Data
Engineering (ICDE 2007) (acceptance rate: 122/659 for full
papers), April 16-20, 2007, The Marmara Hotel, Istanbul, Turkey,
- Xingquan Zhu and Xindong Wu, Mining
Complex Patterns across Sequences with Gap Requirements,
Proceedings of the
20th International Joint Conference on Artificial Intelligence
(IJCAI-07) (acceptance rate: 212/1353 for regular, oral
presentations), Hyderabad, India, January 6-12, 2007, 2934-2940.
- Gong Chen, Xindong Wu, Xingquan Zhu, Abdullah N. Arslan, and Yu
He, Efficient String Matching with Wildcards and Length
Constraints, Knowledge and
Information Systems, 10(2006), 4: 399-419.
Dan He, Abdullah N. Arslan, Alan C. H. Ling, "A fast algorithm for the
constrained multiple sequence alignment problem", Acta Cybernetica,
p. 701, vol. 17, (2006).
- Dan He and Xindong Wu, An Efficient Algorithm for Finding
Approximate Complex Repetitive Patterns, Proceedings of the 2006 IASTED International Conference on Computational
and Systems Biology (CASB 2006), November 13-14, 2006, Dallas,
Texas, USA, 44-49.
This page has been accessed times since June 15, 2005.
September 26, 2008.