Pattern matching through regular expression [reviewed]
Problem
Pattern Matching
----------------
Characters: a to z
Operators: * +
* -> matches zero or more (of the character that occurs previous to this operator)
+ -> matches one or more (of the character that occurs previous to this operator)
Output if a given pattern matches a string.
Example:
pattern:a*b
string:aaab b, ab, aab, aaab, ab
output:1
pattern:a+aabc
string:ab aabc, aaabc, aaaabc ..
output:0
pattern:aa*b*ab+
string:aab aab, aabab, aaaabbab
output:1
pattern: a+a*b*
string: a ab, aab, aaabb
output: 1
Valid Assumptions: Please assume that both the pattern and string input are valid
This question is similar to "Pattern matching" problem that we previously examined. We had some success with the linear scanning. Why don't we apply that to this problem?
Here is the example of why linear scan what we used in "Pattern matching" problem would like work.
Given a pattern, "aa*b*ab+" and a input string, aab, if we scan from the beginning of the pattern and the input string, the algorithm will fail after scanning "aab" of input string since 'ab+" are still left.
Using linear approach, we can compare the pattern string and the input string as follows:
- Scan the input string and produce the list of nodes each of which has the character and its occurrence count.
For aabbc: the following nodes will be created.
[a, 2], [b, 2], [c, 1]
- Scan the pattern string and produce the list of nodes each of which has the character, its occurrence count and a additional flag to indicate whether '*' or '+' exists in the series.
For. aaa*bb+c* will produce the following nodes.
[a, 2, true], [b, 1, true], [c, 0, true]
Notice that for '*', we did not count the character preceding '*', while we did for '+'.
It is because '*' can be none as well.
Once this preliminary scanning, we can compare two list of nodes following the the rules below:
Case 1) If both node in input and pattern list are the same characters:
1) if a current node in pattern list has false, the occurrence counts in both side should be the same.
[a, 2, false] $=$ [a, 2]
[a, 3, false] $\neq$ [a, 2]
2) if a current node in pattern list has true, the occurrence count of the pattern node should be less or equal to the occurrence count of the input node.
[a, 2, true] $=$ [a, 4]
[a, 4, true] $\neq$ [a, 3]
Case 2) If both node in input and pattern list have different characters:
1) if a current node in pattern has true, meaning contains '*', skip the current node in the pattern list.
2) If a current node in a pattern has false, we returns false for the comparison.
[a, 1, true], [b, 0, true], [a, 1, false], [b, 1, true]
If the string "aab", the comparison will fail because the comparison stop after [b, 0, true]
However, aa*b*ab+ can be aab when a* and b* result in zero.
What is missing is that we might need to come up with possible permutation of the parsed pattern some thing like:
[
[a, 1, true], [b, 0, true], [a, 1, false], [b, 1, true],
[a, 1, true], [a, 1, false], [b, 1, true]
[a,1, true] , [b, 1, true]
]
When creating permutation, we branch on the node with count of zero and "true" (*) is set.
We can create a permutation using DFS without backtracking.
Overall algorithm will be as below:
1. preprocess list of input words.
2. preprocess the pattern strings.
3. create a permutation based on parsed pattern.
4. Go through the permutation list and collapse nodes with the same char.
5. Now, for each parsed word from Step 1, check if the word matches any of pattern permutation.
6. If all words matches one of pattern permutation, return True.
Previous solution in C++ did not consider the edge case.
Here is the solution in python.
Creating pattern permutation will run in $O(2 ^ M)$ where M is the number of unique section with the same char in the pattern.
Another expensive step will be comparing each word with the pattern permutation.
Overall running time will be $O(N \times 2 ^ M)$.
Practice statistics:
Initially, I wrote the very inefficient code comparing the pattern and word each character.
This code was very lengthy and still did not think of edge case (permutation).
1:20 to come up with initial code comparing each char between pattern and word.
1:23:34 : to rewrite the code based on the preprocessing idea. But the permutation case was not considered.
1:00:00: to add additional code to consider permutation and compacting the parsed patterns.
Linear scanning approach
This question is similar to "Pattern matching" problem that we previously examined. We had some success with the linear scanning. Why don't we apply that to this problem?
Here is the example of why linear scan what we used in "Pattern matching" problem would like work.
Given a pattern, "aa*b*ab+" and a input string, aab, if we scan from the beginning of the pattern and the input string, the algorithm will fail after scanning "aab" of input string since 'ab+" are still left.
Using linear approach, we can compare the pattern string and the input string as follows:
- Scan the input string and produce the list of nodes each of which has the character and its occurrence count.
For aabbc: the following nodes will be created.
[a, 2], [b, 2], [c, 1]
- Scan the pattern string and produce the list of nodes each of which has the character, its occurrence count and a additional flag to indicate whether '*' or '+' exists in the series.
For. aaa*bb+c* will produce the following nodes.
[a, 2, true], [b, 1, true], [c, 0, true]
Notice that for '*', we did not count the character preceding '*', while we did for '+'.
It is because '*' can be none as well.
Once this preliminary scanning, we can compare two list of nodes following the the rules below:
Case 1) If both node in input and pattern list are the same characters:
1) if a current node in pattern list has false, the occurrence counts in both side should be the same.
[a, 2, false] $=$ [a, 2]
[a, 3, false] $\neq$ [a, 2]
2) if a current node in pattern list has true, the occurrence count of the pattern node should be less or equal to the occurrence count of the input node.
[a, 2, true] $=$ [a, 4]
[a, 4, true] $\neq$ [a, 3]
Case 2) If both node in input and pattern list have different characters:
1) if a current node in pattern has true, meaning contains '*', skip the current node in the pattern list.
2) If a current node in a pattern has false, we returns false for the comparison.
Important edge case
Consider "aa*b*ab+". If we parse this pattern using the rules above, it will looks like below[a, 1, true], [b, 0, true], [a, 1, false], [b, 1, true]
If the string "aab", the comparison will fail because the comparison stop after [b, 0, true]
However, aa*b*ab+ can be aab when a* and b* result in zero.
What is missing is that we might need to come up with possible permutation of the parsed pattern some thing like:
[
[a, 1, true], [b, 0, true], [a, 1, false], [b, 1, true],
[a, 1, true], [a, 1, false], [b, 1, true]
[a,1, true] , [b, 1, true]
]
When creating permutation, we branch on the node with count of zero and "true" (*) is set.
We can create a permutation using DFS without backtracking.
Overall algorithm will be as below:
1. preprocess list of input words.
2. preprocess the pattern strings.
3. create a permutation based on parsed pattern.
4. Go through the permutation list and collapse nodes with the same char.
5. Now, for each parsed word from Step 1, check if the word matches any of pattern permutation.
6. If all words matches one of pattern permutation, return True.
Previous solution in C++ did not consider the edge case.
Here is the solution in python.
Creating pattern permutation will run in $O(2 ^ M)$ where M is the number of unique section with the same char in the pattern.
Another expensive step will be comparing each word with the pattern permutation.
Overall running time will be $O(N \times 2 ^ M)$.
Practice statistics:
Initially, I wrote the very inefficient code comparing the pattern and word each character.
This code was very lengthy and still did not think of edge case (permutation).
1:20 to come up with initial code comparing each char between pattern and word.
1:23:34 : to rewrite the code based on the preprocessing idea. But the permutation case was not considered.
1:00:00: to add additional code to consider permutation and compacting the parsed patterns.
Comments
Post a Comment