Using Python's finditer for Lexical Analysis
Fredrik Lundh wrote a good article called "Using Regular Expressions for Lexical Analysis", which explains how to use Python regular expressions to read an input string and group its characters into lexical units, or tokens. His first group of examples reads in a simple expression, "b = 2 + a*10", and outputs strings classified as one of three token types: symbols (e.g. a and b), integer literals (e.g. 2 and 10), and operators (e.g. =, +, and *). His first three examples use the findall method, and his fourth uses the undocumented scanner method of compiled pattern objects in the re module. Here is the code from the fourth example. The number in the first column of the results is m.lastindex, the index of the capturing group that matched: "1" corresponds to the integer literals group, "2" to the symbols group, and "3" to the operators group.
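import re

expr = "b = 2 + a*10"
# group 1: integer literal, group 2: symbol, group 3: operator
pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
scan = pattern.scanner(expr)
while True:
    # match() continues from where the previous match ended
    m = scan.match()
    if not m:
        break
    print(m.lastindex, repr(m.group(m.lastindex)))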
Here are the results:
2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'
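For comparison, here is a rough sketch of what findall returns with the same pattern. This is my own illustration, not code from Fredrik's article, and the variable name rows is just something I picked. It shows why lastindex is handy: findall hands back a tuple per match, with empty strings for the groups that did not match, so you have to work out the token type yourself.

```python
import re

expr = "b = 2 + a*10"
# Same pattern as above: group 1 = integer literal, 2 = symbol, 3 = operator.
pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

# With three capturing groups, findall returns one 3-tuple per match.
# Groups that did not take part in a match come back as empty strings.
rows = pattern.findall(expr)
for row in rows:
    print(row)
```

Each tuple has exactly one non-empty entry, and its position in the tuple tells you the token type.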
Since the article is dated 2002 and the author was using Python 2.0, I wondered whether this was still the most current approach. The author notes that recent versions of Python (2.2 or later) provide the finditer method, which uses an internal scanner object. Using finditer makes the example code much simpler. Here is Fredrik's example using finditer:
import re

expr = "b = 2 + a*10"
regex = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
# finditer yields one match object per token
for m in regex.finditer(expr):
    print(m.lastindex, repr(m.group(m.lastindex)))
Running it produces the same results as the original.
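If the numeric group indexes are hard to remember, one possible variation is to name the groups and read m.lastgroup instead of m.lastindex. The sketch below is my own, not from Fredrik's article: the group names (integer, symbol, operator), the TOKEN_RE constant, and the tokenize helper are all names I made up for illustration.

```python
import re

# Hypothetical group names; the original pattern uses unnamed groups 1-3.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<integer>\d+)|(?P<symbol>\w+)|(?P<operator>.))"
)


def tokenize(text):
    """Yield (token_type, token_text) pairs for each match in text."""
    for m in TOKEN_RE.finditer(text):
        # lastgroup is the name of the capturing group that matched.
        yield m.lastgroup, m.group(m.lastgroup)


tokens = list(tokenize("b = 2 + a*10"))
for kind, text in tokens:
    print(kind, repr(text))
```

Wrapping the loop in a generator keeps the finditer version just as short, while letting the rest of a program consume tokens one at a time.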