Python Regex to Remove Tags
I'm starting a new series of posts on coding tricks that are simple in principle but often take someone new to a technology too long to find on Stack Overflow. This serves two purposes. First, I can find them again, which always makes things easier when I haven't used something for a while. Second, maybe other people on the web will randomly find them useful. I'm calling it ST3R, mostly because I'm a dork and I like cubing things.
In today's entry, I'm going to share a quick regular expression that will capture all the tags on a single page. This is useful for parsing HTML, XML, or other markup languages. I should note, however, that actual text processing of HTML tags is best handled by an HTML parser, not a basic regex.
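For comparison, here is a minimal sketch of the parser route using Python's standard-library html.parser module. The names TextExtractor and strip_tags_with_parser are made up for this illustration, and the sketch only collects text; it makes no attempt to treat script or style blocks specially.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text found between tags.
        self.chunks.append(data)


def strip_tags_with_parser(html):
    # Hypothetical helper: feed the markup in, join the collected text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags_with_parser('<p>Hello <span class="bold">world</span>.</p>'))
```

Because the parser understands where a tag starts and ends, it does not depend on a pattern guessing at the boundaries.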
In this case, however, we're going to play out a scenario where we're writing a Python script that will remove all the tags from an HTML document. Let's say our HTML looks something like this:
<h1>This is an awesome Website</h1>
<p>But I hate all these tags. Wouldn't it be great if we could remove them <span class="bold">all at once</span>.</p>
This is pretty simple HTML, but let's look at how we'd write a Python script to remove the tags:
import re  # import our regex module

htmlFile = "THIS STRING CONTAINS THE HTML"

# now, we substitute a single space for every tag
htmlFile = re.sub(r'<.*?>', ' ', htmlFile)
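To see what that substitution does to the sample HTML from earlier, here is a small runnable sketch. The remove_tags helper and the whitespace cleanup are my own additions for illustration, not part of the script above.

```python
import re

# The sample HTML from this post, written out as one string.
html = (
    "<h1>This is an awesome Website</h1> "
    "<p>But I hate all these tags. Wouldn't it be great if we could "
    'remove them <span class="bold">all at once</span>.</p>'
)


def remove_tags(text):
    # Replace each tag with a space, as in the script above.
    without_tags = re.sub(r"<.*?>", " ", text)
    # Assumption: we also want runs of whitespace collapsed to one space.
    return re.sub(r"\s+", " ", without_tags).strip()


print(remove_tags(html))
```

Note that swapping each tag for a space leaves a stray space before the final period ("all at once ."), since the closing span tag sat directly against it. Substituting an empty string instead avoids that, at the cost of gluing together words that were separated only by a tag.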
Here, we use the regular expression <.*?>, which matches everything from a < to the nearest following >, regardless of what lies between them. Of course, more advanced processing would take into consideration what's actually between them, but our .* matches any run of characters and the ? makes sure that the regex is not greedy (meaning it won't match everything from the first < to the last > in the document). One caveat: by default . does not match a newline, so a tag that spans more than one line is skipped unless you pass the re.DOTALL flag.
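The difference the ? makes is easiest to see side by side. This is a quick sketch with a made-up sample string; it also shows how . behaves when a tag contains a newline, with and without re.DOTALL.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Non-greedy: each match stops at the first > it can reach.
lazy_matches = re.findall(r"<.*?>", sample)
# Greedy: one match runs from the first < to the last >.
greedy_matches = re.findall(r"<.*>", sample)

print(lazy_matches)    # ['<b>', '</b>', '<i>', '</i>']
print(greedy_matches)  # ['<b>bold</b> and <i>italic</i>']

# A tag broken across two lines: . does not match "\n" by default.
multiline = '<a\nhref="x">link</a>'
default_result = re.sub(r"<.*?>", "", multiline)
dotall_result = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)

print(repr(default_result))  # '<a\nhref="x">link'
print(repr(dotall_result))   # 'link'
```

With the greedy pattern, the text between the tags gets swallowed along with them, which is exactly what we don't want when stripping tags.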
That's all for now!