Notes: HTML Parsing using HTMLParser

A simple script that gets the video download links from the lqtoronto.com video download page; the videos themselves are hosted at archive.org.
This script is adapted from here.

HTMLParser usage
From the Python documentation:

Usage:
    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling handle_starttag() or
handle_startendtag(); end tags by handle_endtag().  The
data between tags is passed from the parser to the derived class
by calling handle_data() with the data as argument (the data
may be split up in arbitrary chunks).  Entity references are
passed by calling handle_entityref() with the entity
reference as the argument.  Numeric character references are
passed to handle_charref() with the string containing the
reference as the argument.
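To make the callbacks described above concrete, here is a minimal sketch that records every event the parser reports. It is written for Python 3, where the class lives in html.parser, and it assumes convert_charrefs=False so that handle_entityref() and handle_charref() are still called (by default Python 3 converts those references to text and passes them to handle_data()). The names EventLogger and log_events are made up for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record each parser callback as a tuple (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references
        # as separate events instead of folding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Text between tags; it may arrive in more than one chunk.
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # name is the reference without "&" and ";", e.g. "amp".
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # name is the numeric part as a string, e.g. "65" or "x41".
        self.events.append(("charref", name))


def log_events(html):
    """Feed html to a fresh EventLogger and return the recorded events."""
    p = EventLogger()
    p.feed(html)
    p.close()
    return p.events
```

Calling log_events('<p>a &amp; b<br/></p>') returns the events in document order, which is a quick way to see which handler fires for which piece of markup.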

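# Python 2 script: in Python 3 these modules are html.parser and urllib.request.
import urllib
import HTMLParser
import re

class GetLinks(HTMLParser.HTMLParser):
    """Print the href of every <a> tag that points at the course videos."""
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == 'href' and value:
                    if re.search('ArabicLanguageCourseVideos', value):
                        print(value)

gl = GetLinks()
url = 'http://www.lqtoronto.com/videodl.html'

# open the page and read the downloaded HTML into urlcontents
urlconn = urllib.urlopen(url)
urlcontents = urlconn.read()
urlconn.close()

# feed the downloaded HTML to the parser; handle_starttag() prints each match
gl.feed(urlcontents)
gl.close()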
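For anyone running Python 3, the same idea can be sketched as below. This is a port under a few assumptions, not the original script: the links are collected in a list instead of printed, the search pattern is a parameter, and the helper names extract_links and fetch_links are invented for this example. The parsing is kept separate from the download so it can be tried on any HTML string without network access.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class GetLinks(HTMLParser):
    """Collect href values of <a> tags that match a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and self.pattern.search(value):
                self.links.append(value)


def extract_links(html, pattern='ArabicLanguageCourseVideos'):
    """Return the matching links found in an HTML string."""
    gl = GetLinks(pattern)
    gl.feed(html)
    gl.close()
    return gl.links


def fetch_links(url, pattern='ArabicLanguageCourseVideos'):
    """Download url and return the matching links (needs network access)."""
    with urlopen(url) as conn:
        # urlopen returns bytes in Python 3, so decode before parsing;
        # falling back to UTF-8 is an assumption, not something the page states.
        charset = conn.headers.get_content_charset() or 'utf-8'
        html = conn.read().decode(charset, errors='replace')
    return extract_links(html, pattern)
```

Usage would be fetch_links('http://www.lqtoronto.com/videodl.html'), assuming the page is still online; extract_links() can be called directly on saved HTML.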


Sunday, August 29, 2010
