Sunday, August 29, 2010

HTML Parsing using HTMLParser

A simple script to get the video download links from here (the page URL is set in the script below); the videos themselves are hosted at archive.org.
This script is adapted from here.

HTMLParser usage
from the Python documentation:

Usage:
    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling handle_starttag() or
handle_startendtag(); end tags by handle_endtag().  The
data between tags is passed from the parser to the derived class
by calling handle_data() with the data as argument (the data
may be split up in arbitrary chunks).  Entity references are
passed by calling handle_entityref() with the entity
reference as the argument.  Numeric character references are
passed to handle_charref() with the string containing the
reference as the argument.
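
To make the callback flow above concrete, here is a minimal sketch that records every handler call in a list. It is written for Python 3, where the class lives in html.parser rather than the HTMLParser module, and it assumes convert_charrefs=False so that entity and character references still arrive through their own callbacks. EventLogger and log_events are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback the parser makes."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references as
        # separate callbacks instead of folding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def log_events(html):
    """Feed html to a fresh parser and return the recorded callbacks."""
    p = EventLogger()
    p.feed(html)
    p.close()
    return p.events


for event in log_events('<p class="x">Tom &amp; Jerry<br/></p>'):
    print(event)
```

Since the text between tags may arrive in several chunks, code that needs the full text should join the pieces passed to handle_data() rather than rely on a single call.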

import re
import urllib
import HTMLParser

class GetLinks(HTMLParser.HTMLParser):
    # Called by the parser for every opening tag; attrs is a list of
    # (name, value) pairs.
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                # value is None when the attribute is written without one
                if name == 'href' and value:
                    if re.search('ArabicLanguageCourseVideos', value):
                        print(value)

gl = GetLinks()
url = 'http://www.lqtoronto.com/videodl.html'

urlconn = urllib.urlopen(url)

# read the downloaded HTML into urlcontents
urlcontents = urlconn.read()
urlconn.close()

# feed the downloaded HTML to the parser; handle_starttag() prints
# each matching link as it is found
gl.feed(urlcontents)
gl.close()
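
The script above targets Python 2 (it imports the HTMLParser module and calls urllib.urlopen). As a rough sketch of the same idea for Python 3, the variant below collects matching links into a list instead of printing them, which makes it easier to reuse and test. LinkCollector, find_links and the SAMPLE markup are made up for this illustration; the sample stands in for the downloaded page so the snippet runs without network access.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values that match a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == "href" and value and self.pattern.search(value):
                self.links.append(value)


def find_links(html, pattern):
    """Return every <a href> in html whose URL matches pattern."""
    parser = LinkCollector(pattern)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup standing in for the real page contents.
SAMPLE = """
<a href="http://example.org/ArabicLanguageCourseVideos/lesson1.avi">Lesson 1</a>
<a href="/about.html">About</a>
<a name="top">no href here</a>
"""

print(find_links(SAMPLE, "ArabicLanguageCourseVideos"))
```

To run this against the live page in Python 3, the download would go through urllib.request.urlopen(url), and the bytes returned by read() need to be decoded to a string before being passed to feed(). The page's encoding is not stated in this post, so the right codec would have to be checked first.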
