Extract source attribute from IMG tags

Last Saturday I was trying to come up with a simple algorithm to pull the source (SRC) attribute out of the IMG tags of an HTML page. I considered a few options and finally set my sights on RegEx…! I went through a few introductory web sites for RegEx and knew I was going to sit in front of my computer for a while. It was like learning a foreign language! Well, I guess learning an actual language like French would be much easier!!! So my advice… RegEx should be the last resort unless you are really willing to get down and dirty!

After hours and hours of tinkering, I managed to get what I wanted. So here’s my code…

First, extract the IMG tags from the content

Pattern p = Pattern.compile("<\\s*img\\s*([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(contents);

boolean result = m.find();

StringBuffer sb = new StringBuffer();

Then loop through all the IMG tags found

while(result){

String imageSrc = m.group();

A second pass extracts the SRC attribute from each IMG tag

Pattern p1 = Pattern.compile("src\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

Matcher m1 = p1.matcher(m.group());

m1.find();

StringBuffer sb1 = new StringBuffer();

String imgSrc = m1.group();

imgSrc = imgSrc.substring(imgSrc.indexOf("\"") + 1, imgSrc.lastIndexOf("\""));

Now I can process my SRC attribute and replace it with anything I want

m1.appendReplacement(sb1, Matcher.quoteReplacement("src=\"" + newURL + "\""));

Append the modified SRC attribute

m1.appendTail(sb1);

m.appendReplacement(sb, Matcher.quoteReplacement(sb1.toString()));

result = m.find();

}

Finally append the reconstructed IMG tags

m.appendTail(sb);

return sb.toString();

I know this is not very elegant… but it does the job. I’m open to any suggestions, if you have any…
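
For anyone who wants to try the whole thing in one go, here is one possible tidier version of the same two-pass idea. Treat it as a rough sketch rather than the definitive way to do it. The class name ImgSrcRewriter, the method name replaceSources and the example URL are made up for illustration. It also assumes that SRC values are wrapped in double quotes and that no attribute value contains a ">" character. A tag without a SRC attribute is passed through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class ImgSrcRewriter {

    // Matches one whole IMG tag, e.g. <img alt="x" src="a.png">.
    // Assumption: no attribute value contains a '>' character.
    private static final Pattern IMG_TAG =
            Pattern.compile("<\\s*img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches a double-quoted src attribute inside a single tag.
    // The lookbehind requires whitespace before "src" so that
    // attributes such as data-src are not picked up by mistake.
    private static final Pattern SRC_ATTR =
            Pattern.compile("(?<=\\s)src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Replaces the src value of every IMG tag in html with newUrl. */
    static String replaceSources(String html, String newUrl) {
        Matcher tags = IMG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (tags.find()) {
            Matcher src = SRC_ATTR.matcher(tags.group());
            // replaceFirst returns the tag unchanged when there is no src attribute.
            // quoteReplacement keeps '$' and '\' in the URL from being read
            // as group references or escapes.
            String tag = src.replaceFirst(
                    Matcher.quoteReplacement("src=\"" + newUrl + "\""));
            tags.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        // Copy whatever follows the last IMG tag.
        tags.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><IMG alt=\"logo\" src=\"old.png\"><img src=\"b.gif\" />";
        System.out.println(replaceSources(html, "http://example.com/new.png"));
    }
}
```

The main differences from my version above: the patterns are compiled once instead of on every loop iteration, a tag with no SRC attribute no longer blows up on m1.group(), and the replacement text goes through Matcher.quoteReplacement. For anything beyond simple, well-formed markup, a real HTML parser is probably the safer bet.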

