How to remove C style comments using Python



This is my OLD blog. I've copied this post over to my NEW blog at:

http://www.saltycrane.com/blog/2007/11/remove-c-comments-python/




The Perl FAQ has an entry titled "How do I use a regular expression to strip C style comments from a file?" Since I've switched to Python, I've adapted the Perl solution to Python.

I included two versions from the Perl FAQ. The simple version removes single-line and multi-line C-style comments from a file, but it also removes text that looks like a comment inside a quoted string, which may be unwanted. Its advantage is a regular expression that is easier to understand.

The second version is more robust: it handles comment-like text within a quoted string properly (i.e. it does not remove it). This is the recommended version.



Simple version

The simple version uses a regular expression that is only 9 characters long. The key to the regular expression is the use of the lazy, or non-greedy, quantifier, *?. From Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl:

Quantifiers are normally "greedy", and try to match as much as possible. Conversely, these non-greedy versions match as little as possible, just the bare minimum needed to satisfy the match.
Matching a C-style comment using such a simple pattern would not be possible without lazy quantifiers, because the comment ends with a two-character sequence (*/) and a greedy .* would run on to the last */ in the file.
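To make the difference concrete, here is a small sketch comparing a greedy and a lazy version of the same pattern. The sample string and variable names are made up for illustration; only the lazy pattern is the one used in the script below.

```python
import re

# Made-up sample: two comments with real code between them.
source = "a = 1; /* first */ b = 2; /* second */ c = 3;"

# Greedy: .* runs to the LAST */ and swallows the code in between.
greedy = re.compile(r"/\*.*\*/", re.DOTALL)

# Lazy: .*? stops at the FIRST */ after each /*.
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)

print(greedy.sub("", source))  # a = 1;  c = 3;
print(lazy.sub("", source))    # a = 1;  b = 2;  c = 3;
```

The greedy pattern deletes "b = 2;" along with both comments, while the lazy pattern removes each comment separately.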



remove_comments_simple.py:
# remove_comments_simple.py
import re
import sys

# read the source file
filename = sys.argv[1]
with open(filename) as fh:
    code_with_comments = fh.read()

# strip comments; re.DOTALL lets "." match newlines so multi-line comments are removed
regex = re.compile(r"/\*.*?\*/", re.MULTILINE|re.DOTALL)
code_without_comments = regex.sub("", code_with_comments)

# write the result to a new file next to the original
with open(filename + ".nocomments", "w") as fh:
    fh.write(code_without_comments)

Example:
To test the script, I created a test file called testfile.c:
/* This is a C-style comment. */
This is not a comment.
/* This is another
 * C-style comment.
 */
"This is /* also not a comment */"

Run the script:
To use the script, I put my script, remove_comments_simple.py, and my test file, testfile.c, in the same directory and ran the following command:
python remove_comments_simple.py testfile.c

Results:
The script created a new file called testfile.c.nocomments:
This is not a comment.

"This is "


Robust version (Recommended)

According to the Perl FAQ, this version was created by Jeffrey Friedl and later modified by Fred Curtis. I'm not certain, but this version appears to use the "unrolling the loop" technique described in Chapter 6 of Mastering Regular Expressions.
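As a rough check on that reading, the sketch below takes just the comment part of the robust pattern, written on one line, and compares it with the lazy pattern from the simple version. The sample text is made up, and the non-capturing group (?:...) is my substitution for the capturing group in the original so that findall returns whole matches.

```python
import re

# Lazy pattern from the simple version.
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)

# Comment part of the robust pattern, collapsed to one line.
# Assumption for this sketch: (?:...) replaces the original capturing group.
unrolled = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Made-up sample with plain, starred, empty, and multi-line comments.
text = "x /* a */ y /** b **/ z /* c * d ** e */ w /**/ v /* multi\n * line\n */ u"

lazy_matches = lazy.findall(text)
unrolled_matches = unrolled.findall(text)

# Both patterns should pick out the same comments.
print(lazy_matches == unrolled_matches)
```

On this sample the two patterns agree. The unrolled form avoids the lazy quantifier by spelling out what may legally appear between the opening /* and the closing */.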


remove_comments.py:
# remove_comments.py
import re
import sys

def remove_comments(text):
    """ remove c-style comments.
        text: blob of text with comments (can include newlines)
        returns: text with comments removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
           /\*              ##  Start of /* ... */ comment
           [^*]*\*+         ##  Non-* followed by 1-or-more *'s
           (                ##
             [^/*][^*]*\*+  ##
           )*               ##  0-or-more things which don't start with /
                            ##    but do end with '*'
           /                ##  End of /* ... */ comment
         |                  ##  -OR-  various things which aren't comments:
           (                ## 
                            ##  ------ " ... " STRING ------
             "              ##  Start of " ... " string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^"\\]       ##  Non "\ characters
             )*             ##
             "              ##  End of " ... " string
           |                ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
             '              ##  Start of ' ... ' string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^'\\]       ##  Non '\ characters
             )*             ##
             '              ##  End of ' ... ' string
           |                ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
             .              ##  Any other char
             [^/"'\\]*      ##  Chars which do not start a comment, string
           )                ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
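    # group 2 captures the non-comment text (strings and other code);
    # it is empty when the match is a comment, so comments are dropped
    noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]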

    return "".join(noncomments)

if __name__ == '__main__':
    filename = sys.argv[1]
    with open(filename) as fh:
        code_w_comments = fh.read()
    code_wo_comments = remove_comments(code_w_comments)
    with open(filename + ".nocomments", "w") as fh:
        fh.write(code_wo_comments)

Example:
To test this script, I used the same test file, testfile.c:
/* This is a C-style comment. */
This is not a comment.
/* This is another
 * C-style comment.
 */
"This is /* also not a comment */"

Run the script:
To use the script, I put the script, remove_comments.py, and my test file, testfile.c, in the same directory and ran the following command:
python remove_comments.py testfile.c

Results:
The script created a new file called testfile.c.nocomments:
This is not a comment.

"This is /* also not a comment */"
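The idea that makes this version safe, matching quoted strings as whole tokens so that comment-like text inside them is never examined on its own, can also be sketched with re.sub and a replacement function. This is not the Perl FAQ pattern. It is a shorter, hypothetical variant (the names TOKEN, _keep_or_drop and strip_comments are mine) that assumes the input only needs strings, character literals and /* ... */ comments handled.

```python
import re

# Hypothetical compact pattern: each alternative matches one whole token.
TOKEN = re.compile(
    r'''
      "(?:\\.|[^"\\])*"      # double-quoted string, escapes allowed
    | '(?:\\.|[^'\\])*'      # single-quoted literal, escapes allowed
    | /\*.*?\*/              # C-style comment
    ''',
    re.VERBOSE | re.DOTALL,
)

def _keep_or_drop(match):
    """Return strings unchanged and replace comments with nothing."""
    token = match.group(0)
    return "" if token.startswith("/*") else token

def strip_comments(text):
    """Remove /* ... */ comments but leave quoted text untouched."""
    return TOKEN.sub(_keep_or_drop, text)

sample = '/* c1 */\ncode();\n"keep /* this */"\n/* c2\n more */'
print(strip_comments(sample))
```

Because the regular expression engine scans left to right, a quote that opens a string is consumed together with everything up to its closing quote before any /* inside it can start a comment match.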



---------------
Minor note on Perl to Python migration:
I modified the original regular expression comments slightly. In particular, I had to put at least one character after the ## Non "\ and ## Non '\ comments. In Python, the trailing backslash escaped the newline that followed it, so the comment ran on into the next line and the closing parenthesis there was treated as part of the comment by the regular expression engine. This is the error I got before the fix:
$ python remove_comments.py
Traceback (most recent call last):
  File "remove_comments.py", line 39, in <module>
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
  File "C:\Programs\Python25\lib\re.py", line 180, in compile
    return _compile(pattern, flags)
  File "C:\Programs\Python25\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
