This script is adapted from here.
HTMLParser usage
From the Python documentation:
Usage:

    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling handle_starttag() or handle_startendtag(); end tags by handle_endtag(). The data between tags is passed from the parser to the derived class by calling handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling handle_entityref() with the entity reference as the argument. Numeric character references are passed to handle_charref() with the string containing the reference as the argument.
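As a rough illustration of the callbacks described above, here is a minimal sketch that records each event the parser reports. It assumes Python 3, where the class lives in html.parser rather than in the Python 2 HTMLParser module used by the script in this post. The class name EventLogger and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records parser callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # text between tags; may arrive in more than one chunk
        self.events.append(("data", data))


p = EventLogger()
p.feed('<p>Hello <a href="x.html">link</a></p>')
p.close()

for event in p.events:
    print(event)
```

Because handle_data() may receive text in several pieces, code that needs the full text should join the chunks instead of relying on a single call.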
import sys
import urllib
import HTMLParser
import re
class GetLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    if re.search('ArabicLanguageCourseVideos', value):
                        print(value)
gl = GetLinks()
url = 'http://www.lqtoronto.com/videodl.html'
urlconn = urllib.urlopen(url)
# read the downloaded HTML into urlcontents
urlcontents = urlconn.read()
# pass the downloaded HTML to the parser's feed() method,
# which calls handle_starttag() for every start tag it finds
gl.feed(urlcontents)
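The script above targets Python 2 (urllib.urlopen and the HTMLParser module). A possible Python 3 version is sketched below, not as a definitive port but as one way to do it. It assumes the parser collects matching links in a list instead of printing them directly, and it parses a small made-up HTML string (sample_html, with invented file names) so that it runs without network access. To fetch the real page, urllib.request.urlopen(url).read() returns bytes, which would need decoding (UTF-8 is an assumption here) before being passed to feed().

```python
import re
from html.parser import HTMLParser


class GetLinks(HTMLParser):
    """Collect href values of <a> tags that match a regular expression."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None when the attribute has no value, e.g. <a href>
            if name == 'href' and value and self.pattern.search(value):
                self.links.append(value)


# Hypothetical stand-in for the downloaded page; the file names are invented.
sample_html = '''
<a href="ArabicLanguageCourseVideos/lesson1.wmv">Lesson 1</a>
<a href="about.html">About</a>
<a href="ArabicLanguageCourseVideos/lesson2.wmv">Lesson 2</a>
'''

gl = GetLinks('ArabicLanguageCourseVideos')
gl.feed(sample_html)
gl.close()

for link in gl.links:
    print(link)
```

Keeping the links in a list rather than printing inside the callback makes it easier to reuse the result, for example to download each file afterwards.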