This script is adapted from here.
HTMLParser usage
From the Python documentation:
Usage:

    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling handle_starttag() or handle_startendtag(); end tags by handle_endtag(). The data between tags is passed from the parser to the derived class by calling handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling handle_entityref() with the entity reference as the argument. Numeric character references are passed to handle_charref() with the string containing the reference as the argument.
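As a minimal sketch of the callbacks described above, the following subclass records each event the parser reports. It assumes Python 3, where the class lives in `html.parser` rather than in the `HTMLParser` module. The class name `EventLogger` and the sample markup are made up for illustration. `convert_charrefs=False` is passed so that entity and character references reach their own handlers instead of being folded into the text data.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: collects one tuple per parser event, in order."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#62; as separate events
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # May be called several times for one run of text
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


p = EventLogger()
p.feed('<a href="x.html">A &amp; B &#62; C</a>')
p.close()
for event in p.events:
    print(event)
```

Because the text between the tags may arrive in several `handle_data()` calls, code that needs the full text should accumulate the chunks rather than rely on a single call.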
    import sys
    import urllib
    import HTMLParser
    import re

    class GetLinks(HTMLParser.HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        if re.search('ArabicLanguageCourseVideos', value):
                            print(value)

    gl = GetLinks()
    url = 'http://www.lqtoronto.com/videodl.html'
    urlconn = urllib.urlopen(url)
    # read and put the downloaded html code into urlcontents
    urlcontents = urlconn.read()
    # input the downloaded material into HTMLParser's member function
    # for parsing
    gl.feed(urlcontents)
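The script above targets Python 2. A rough Python 3 equivalent, using only the standard library, might look like the sketch below. It collects matching links in a list instead of printing them, and it parses an inline sample string so that it runs without network access. The class name `LinkCollector`, the `pattern` parameter, and the sample HTML are illustrative and not part of the original script.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a regular expression."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and self.pattern.search(value):
                self.links.append(value)


# Made-up sample markup standing in for the downloaded page
sample_html = """
<a href="ArabicLanguageCourseVideos/lesson1.wmv">Lesson 1</a>
<a href="about.html">About</a>
<a href="ArabicLanguageCourseVideos/lesson2.wmv">Lesson 2</a>
"""

collector = LinkCollector('ArabicLanguageCourseVideos')
collector.feed(sample_html)
collector.close()
print(collector.links)
```

To parse the live page, the HTML could be fetched with `urllib.request.urlopen(url).read()` and decoded to `str` before calling `feed()`. The right encoding depends on the page.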