Extract source attribute from IMG tags
Last Saturday I was trying to come up with a simple algorithm to extract the src attribute from the IMG tags of an HTML page. I considered a few options and finally set my sights on RegEx! I went through a few introductory web sites for RegEx and knew I was going to sit in front of my computer for a while. It was like learning a foreign language! Well, I guess learning an actual language like French would be much easier! So my advice: RegEx should be the last resort unless you are really willing to get down and dirty!
After hours and hours of tinkering, I managed to get what I wanted. So here’s my code…
First extract the IMG tag from content
Pattern p = Pattern.compile("<\\s*img\\s*([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(contents);
boolean result = m.find();
StringBuffer sb = new StringBuffer();
Then loop through all the IMG tags found
while(result){
String imageSrc = m.group();
Second parse to extract SRC attribute from all the IMG tags
Pattern p1 = Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
Matcher m1 = p1.matcher(m.group());
m1.find();
StringBuffer sb1 = new StringBuffer();
String imgSrc = m1.group();
imgSrc = imgSrc.substring(imgSrc.indexOf("\"") + 1, imgSrc.lastIndexOf("\""));
Now I can process my SRC attribute and replace with anything I want
m1.appendReplacement(sb1, "src=\"" + newURL + "\"");
Append the modified SRC attribute
m1.appendTail(sb1);
m.appendReplacement(sb, sb1.toString());
result = m.find();
}
Finally append the reconstructed IMG tags
m.appendTail(sb);
return sb.toString();
I know this is not very elegant, but it does the job. I’m open to any suggestions, if you have any…

Spyder
September 23rd, 2007 at 6:12 pm
Suggestions? I have plenty! If you’d asked me while we were at work I could’ve pulled an example out of the ELJ codebase, since we do exactly what you’re trying to do. I’m too lazy to come up with a proper solution now that I’m home, so this is 90% guesswork.
We’ll start with a better regex, that makes sure you get an actual image tag with a src attribute (neither of which your first regex is guaranteed to return):
(<img[^>]*src=")([^"]*)("[^>]*>)
The first thing you’re missing is that you don’t need to do a double parse - you can do it all in one hit using groups. Instead of calling the empty group() function and finding the src attribute within, pass a number to it to receive the contents of the group you created using brackets.
Note that I created 3 groups in my regex - the real power comes from copying those groups into the output. So now you can do funky stuff like this.
String url = m.group(2);
url += "?querystring"; //or whatever you want to do
m.appendReplacement(sb, "$1" + url + "$3");
As you can probably tell, that would modify the url and put it back in the image tag.
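Putting those pieces together, a minimal runnable sketch of the single-pass idea might look like the following. The class and method names (ImgSrcRewriter, appendToSrc) and the sample HTML are made up for illustration, and the regex is the three-group pattern described above written as a Java string literal, assuming it begins with <img. The call to Matcher.quoteReplacement is an extra precaution not in the snippet above: it keeps a $ or \ inside the URL from being read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcRewriter {
    // Group 1: everything from "<img" up to and including src="
    // Group 2: the URL itself
    // Group 3: the closing quote and the rest of the tag
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*src=\")([^\"]*)(\"[^>]*>)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: appends a suffix to the src URL of every matching IMG tag.
    static String appendToSrc(String html, String suffix) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(2) + suffix;
            // $1 and $3 copy the untouched parts of the tag back into the output.
            m.appendReplacement(sb, "$1" + Matcher.quoteReplacement(url) + "$3");
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"logo\" src=\"a.png\" width=\"10\"></p>";
        System.out.println(appendToSrc(html, "?querystring"));
    }
}
```

Tags without a double-quoted src attribute are simply left alone, because the pattern never matches them.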
Regex has this really handy feature where you can use the replace method to include groups from the match. See the java.util.regex.Matcher.appendReplacement() documentation for more info, although any regex manual should include this information (some parsers use \ instead of $ to denote group matches).
Oh, and one thing I should also mention is that your regex won’t match images with src attributes that use single quotes (') instead of double quotes ("). To fix that, use ["'] in place of the literal quote (no | is needed inside a character class; it would just match a literal pipe).
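One way to accept either quote style in a single pass is to capture the opening quote in its own group and require the same character to close the value with a backreference. The sketch below is only an illustration under that assumption; the class name QuoteAwareImgSrc, the method rewriteSrc and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareImgSrc {
    // Group 1: "<img ... src=" plus the opening quote
    // Group 2: the quote character itself (nested inside group 1)
    // Group 3: the URL, matched lazily up to the same quote character
    // Group 4: the closing quote (backreference \2) and the rest of the tag
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*src=([\"']))(.*?)(\\2[^>]*>)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: appends a suffix to each src URL, whichever quote style is used.
    static String rewriteSrc(String html, String suffix) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(3) + suffix;
            m.appendReplacement(sb, "$1" + Matcher.quoteReplacement(url) + "$4");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewriteSrc("<img src='b.png' alt=\"x\">", "?v=2"));
        System.out.println(rewriteSrc("<img src=\"a.png\">", "?v=2"));
    }
}
```

Because the closing quote must equal the opening one, a URL wrapped in single quotes may itself contain a double quote without ending the match early.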
Uma
October 8th, 2007 at 1:56 pm
Will this procedure work when the source supplies img tags with line breaks?
Suneth Mendis
October 8th, 2007 at 4:32 pm
Technically it should. I did not encounter problems with line breaks with my code.
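A quick way to check this is to run the tag-matching pattern against an IMG tag that is split over several lines. In Java, \s and a negated class such as [^>] both match newline characters, so no DOTALL flag is needed here. The snippet is a small sketch only: the class name MultiLineImgDemo, the method findImgTag and the sample markup are invented for the test, and the pattern assumes the tag regex from the post starts with a < character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiLineImgDemo {
    private static final Pattern IMG_TAG = Pattern.compile(
            "<\\s*img\\s*([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first IMG tag found, or null if there is none.
    static String findImgTag(String html) {
        Matcher m = IMG_TAG.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<p><IMG\n  src=\"a.png\"\n  alt=\"x\"></p>";
        // Prints the whole tag, line breaks included.
        System.out.println(findImgTag(html));
    }
}
```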