Using Python's finditer for Lexical Analysis
This is my OLD blog. I've copied this post over to my NEW blog at:
http://www.saltycrane.com/blog/2007/10/using-pythons-finditer-for-lexical/
Fredrik Lundh wrote a good article called Using Regular Expressions for Lexical Analysis, which explains how to use Python regular expressions to read an input string and group characters into lexical units, or tokens. His first group of examples reads in a simple expression, "b = 2 + a*10", and outputs strings classified as one of three token types: symbols (e.g. a and b), integer literals (e.g. 2 and 10), and operators (e.g. =, +, and *). His first three examples use the findall method, and his fourth example uses the undocumented scanner method of the re module's compiled pattern objects. Here is the code from the fourth example. Note that the "1" in the first column of the results corresponds to the integer literals token group, "2" corresponds to the symbols group, and "3" to the operators group.
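    import re

    expr = "b = 2 + a*10"

    # Group 1: integer literals, group 2: symbols, group 3: operators
    pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
    scan = pattern.scanner(expr)

    while True:
        m = scan.match()
        if not m:
            break
        # lastindex is the number of the group that matched
        print(m.lastindex, repr(m.group(m.lastindex)))

Here are the results: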
    2 'b'
    3 '='
    1 '2'
    3 '+'
    2 'a'
    3 '*'
    1 '10'
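The bare group numbers in the results are a little hard to read, so here is a rough sketch of how the same pattern could be used with the documented findall method to label each token by name. This is my own illustration, not code from Fredrik's article: the TOKEN_NAMES labels and the classify function are hypothetical names I made up, and I'm assuming Python 3.

```python
import re

# Same pattern as above: group 1 = integer literal, 2 = symbol, 3 = operator.
TOKEN_PATTERN = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

# Hypothetical labels for the three groups (my own names, not from the article).
TOKEN_NAMES = {1: "integer", 2: "symbol", 3: "operator"}


def classify(expr):
    """Return (token_name, text) pairs built from findall's per-match tuples."""
    tokens = []
    # With several groups, findall returns one tuple of group strings per match.
    for groups in TOKEN_PATTERN.findall(expr):
        # Only the alternative that matched is non-empty in each tuple.
        for index, text in enumerate(groups, start=1):
            if text:
                tokens.append((TOKEN_NAMES[index], text))
    return tokens


for name, text in classify("b = 2 + a*10"):
    print(name, repr(text))
```

With findall you have to search each tuple for the non-empty slot yourself, which is the bookkeeping that lastindex on a match object takes care of.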
Since this article was dated 2002, and the author was using Python 2.0, I wondered if this was the most current approach. The author notes that recent versions of Python (i.e. version 2.2 or later) allow you to use the finditer method, which uses an internal scanner object. Using finditer makes the example code much simpler. Here is Fredrik's example using finditer:
    import re

    expr = "b = 2 + a*10"

    regex = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

    # finditer yields one match object per token
    for m in regex.finditer(expr):
        print(m.lastindex, repr(m.group(m.lastindex)))
Running it produces the same results as the original.
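As a possible next step, the finditer version can be combined with named groups, so each token reports a name instead of a number. The following is only a sketch under my own assumptions: it assumes Python 3, and the group names (integer, symbol, operator) and the tokenize function are hypothetical names I chose, not anything from Fredrik's article. It relies on the match object's lastgroup attribute, which holds the name of the last matched capturing group.

```python
import re

# Same three alternatives as before, but each group is given a name.
# The names are my own labels for illustration.
TOKEN_REGEX = re.compile(
    r"\s*(?:(?P<integer>\d+)|(?P<symbol>\w+)|(?P<operator>.))"
)


def tokenize(expr):
    """Yield (token_type, text) pairs for each token found in expr."""
    for m in TOKEN_REGEX.finditer(expr):
        # lastgroup is the name of the group that matched for this token.
        yield m.lastgroup, m.group(m.lastgroup)


for kind, text in tokenize("b = 2 + a*10"):
    print(kind, repr(text))
```

Wrapping the loop in a generator keeps the regular expression details in one place, so calling code only ever sees (type, text) pairs.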