How to remove C style comments using Python
This is my OLD blog. I've copied this post over to my NEW blog at:
http://www.saltycrane.com/blog/2007/11/remove-c-comments-python/
The Perl FAQ has an entry titled "How do I use a regular expression to strip C style comments from a file?" Since I've switched to Python, I've adapted the Perl solution to Python.
I included two versions from the Perl FAQ. The simple version removes single-line or multi-line C-style comments from a file, but it also removes text that looks like a comment inside a quoted string, which is probably unwanted. Its advantage is that the regular expression is easier to understand.
The second version is more robust: it handles comments within a quoted string properly (i.e. it does not remove them). This is the recommended version.
Simple version
The simple version uses a regular expression that is only 9 characters long.
The key to the regular expression is the use of the lazy, or non-greedy, quantifier, *?. From Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl:
Quantifiers are normally "greedy", and try to match as much as possible. Conversely, these non-greedy versions match as little as possible, just the bare minimum needed to satisfy the match.

Matching a C-style comment using such a simple pattern would not be possible without the use of lazy quantifiers because of the C-style comment's two-character ending.
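To make the difference concrete, here is a small sketch that compares a greedy and a lazy version of the comment pattern. The sample string and the names `greedy` and `lazy` are made up for illustration and are not part of the script.

```python
import re

# Illustrative input: two separate comments on one line.
text = "a = 1; /* first */ b = 2; /* second */ c = 3;"

# Greedy: .* runs to the LAST "*/", swallowing the code between the comments.
greedy = re.compile(r"/\*.*\*/", re.DOTALL)

# Lazy: .*? stops at the FIRST "*/", so each comment is matched on its own.
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)

greedy_result = greedy.sub("", text)
lazy_result = lazy.sub("", text)

print(repr(greedy_result))
print(repr(lazy_result))
```

With the greedy pattern, `b = 2;` disappears along with both comments. With the lazy pattern, only the two comments are removed.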
remove_comments_simple.py:
    # remove_comments_simple.py
    import re
    import sys

    # open file
    filename = sys.argv[1]
    code_with_comments = open(filename).read()

    # strip comments
    regex = re.compile(r"/\*.*?\*/", re.MULTILINE|re.DOTALL)
    code_without_comments = regex.sub("", code_with_comments)

    # write new file
    fh = open(filename+".nocomments", "w")
    fh.write(code_without_comments)
    fh.close()
Example:
To test the script, I created a test file called testfile.c:

    /* This is a C-style comment. */
    This is not a comment.
    /* This is another
     * C-style comment. */
    "This is /* also not a comment */"
Run the script:
To use the script, I put my script, remove_comments_simple.py, and my test file, testfile.c, in the same directory and ran the following command:

    python remove_comments_simple.py testfile.c
Results:
The script created a new file called testfile.c.nocomments:

    This is not a comment.

    "This is "
Robust version (Recommended)
From the Perl FAQ, this version was created by Jeffrey Friedl and later modified by Fred Curtis. I'm not certain, but this version appears to use the "unrolling the loop" technique described in Chapter 6 of Mastering Regular Expressions.
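As a rough illustration of what the unrolled form buys you, the sketch below compares the lazy pattern from the simple version with only the comment-matching part of the robust pattern, written on one line. The helper `first_comment` and the sample strings are hypothetical and exist only for this check.

```python
import re

# Lazy pattern from the simple version (needs DOTALL to cross newlines).
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)

# Comment part of the robust pattern, unrolled: no lazy quantifier and no
# DOTALL needed, because the negated classes [^*] and [^/*] match newlines.
unrolled = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

def first_comment(regex, text):
    """Return the first comment the regex finds in text (illustrative helper)."""
    return regex.search(text).group()

samples = [
    "/* plain */",
    "/** doc **/",
    "/* a * b / c */",
    "/* line one\n * line two\n */",
    "/**/",
]

for comment in samples:
    text = "x " + comment + " y /* second */"
    print(first_comment(lazy, text) == first_comment(unrolled, text) == comment)
```

On these samples both patterns pick out the same comment. The unrolled form just spells out "anything up to a run of stars that is followed by a slash" explicitly instead of relying on backtracking.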
remove_comments.py:
    # remove_comments.py
    import re
    import sys

    def remove_comments(text):
        """ remove c-style comments.
            text: blob of text with comments (can include newlines)
            returns: text with comments removed
        """
        pattern = r"""
                                ##  --------- COMMENT ---------
               /\*              ##  Start of /* ... */ comment
               [^*]*\*+         ##  Non-* followed by 1-or-more *'s
               (                ##
                 [^/*][^*]*\*+  ##
               )*               ##  0-or-more things which don't start with /
                                ##    but do end with '*'
               /                ##  End of /* ... */ comment
             |                  ##  -OR-  various things which aren't comments:
               (                ##
                                ##  ------ " ... " STRING ------
                 "              ##  Start of " ... " string
                 (              ##
                   \\.          ##  Escaped char
                 |              ##  -OR-
                   [^"\\]       ##  Non "\ characters
                 )*             ##
                 "              ##  End of " ... " string
               |                ##  -OR-
                                ##
                                ##  ------ ' ... ' STRING ------
                 '              ##  Start of ' ... ' string
                 (              ##
                   \\.          ##  Escaped char
                 |              ##  -OR-
                   [^'\\]       ##  Non '\ characters
                 )*             ##
                 '              ##  End of ' ... ' string
               |                ##  -OR-
                                ##
                                ##  ------ ANYTHING ELSE -------
                 .              ##  Anything other char
                 [^/"'\\]*      ##  Chars which doesn't start a comment, string
               )                ##    or escape
        """
        regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
        # group 2 is the "not a comment" alternative; comments leave it empty
        noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]

        return "".join(noncomments)

    if __name__ == '__main__':
        filename = sys.argv[1]
        code_w_comments = open(filename).read()
        code_wo_comments = remove_comments(code_w_comments)
        fh = open(filename+".nocomments", "w")
        fh.write(code_wo_comments)
        fh.close()
Example:
To test this script, I used the same test file, testfile.c:

    /* This is a C-style comment. */
    This is not a comment.
    /* This is another
     * C-style comment. */
    "This is /* also not a comment */"
Run the script:
To use the script, I put the script, remove_comments.py, and my test file, testfile.c, in the same directory and ran the following command:

    python remove_comments.py testfile.c
Results:
The script created a new file called testfile.c.nocomments:

    This is not a comment.

    "This is /* also not a comment */"
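For quick experiments without writing files, the same idea can be sketched as a single `sub` call with a replacement function. This is a compact variant, not the code from the Perl FAQ: the name `strip_comments`, the one-line pattern, and the use of non-capturing groups (so the non-comment text ends up in group 1 rather than group 2) are my own assumptions.

```python
import re

# Compact, non-verbose form of the robust pattern.
# Alternative 1: a /* ... */ comment (nothing captured).
# Alternative 2 (group 1): a "..." string, a '...' string, or other text.
PATTERN = re.compile(
    r"""/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"""
    r"""|("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|.[^/"'\\]*)""",
    re.DOTALL,
)

def strip_comments(text):
    """Return text with C-style comments removed (illustrative sketch)."""
    # For a comment match group 1 is None, so the match is replaced by "".
    # For anything else the match is put back unchanged.
    return PATTERN.sub(lambda m: m.group(1) or "", text)

sample = '/* a */ x "s /* b */" /* c\n * d */ y'
print(repr(strip_comments(sample)))
```

The comment-like text inside the double-quoted string survives, while the two real comments are dropped.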
---------------
Minor note on Perl to Python migration:
I modified the original regular expression comments a little bit. In particular, I had to put at least one character after the backslash on the ## Non "\ and ## Non '\ lines because, in Python, the backslash was escaping the following newline character, so the closing parenthesis on the next line was being treated as part of the comment by the regular expression engine. This is the error I got before the fix:
    $ python remove_comments.py
    Traceback (most recent call last):
      File "remove_comments.py", line 39, in <module>
        regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
      File "C:\Programs\Python25\lib\re.py", line 180, in compile
        return _compile(pattern, flags)
      File "C:\Programs\Python25\lib\re.py", line 233, in _compile
        raise error, v # invalid expression
    sre_constants.error: unbalanced parenthesis
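If you want to check how your own Python version treats a backslash at the end of a verbose-mode comment, a small test along these lines may help. The helper `compiles` and the two cut-down patterns are hypothetical. I have only seen the failure on the Python 2.5 install shown above, so treat the first result as something to observe, not a guarantee.

```python
import re

def compiles(pattern):
    """Return True if the verbose pattern compiles, False if re reports an error."""
    try:
        re.compile(pattern, re.VERBOSE)
        return True
    except re.error:
        return False

# Comment line ends with a backslash: the case described above.
trailing_backslash = r"""
    (            ## open group
      [^"\\]     ## Non "\
    )*           ## close group
"""

# Same pattern with a word after the backslash, as in the fixed script.
with_extra_word = r"""
    (            ## open group
      [^"\\]     ## Non "\ characters
    )*           ## close group
"""

print("trailing backslash compiles:", compiles(trailing_backslash))
print("with extra word compiles:   ", compiles(with_extra_word))
```

If the first line prints False, the backslash is still swallowing the newline and the closing parenthesis ends up inside the comment.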