Arthur de Jong

Open Source / Free Software developer

path: root/parsers
Commit message | Author | Date | Files | Lines
* copy-paste fix (thanks Robert M. Jansen <dutch12154@yahoo... | Arthur de Jong | 2008-07-13 | 1 | -1/+1
* call tidy (if available) on HTML content (based on a patc... | Arthur de Jong | 2008-07-04 | 2 | -7/+59
* fix name of file | Arthur de Jong | 2008-07-04 | 1 | -1/+1
* also pick up any style attributes and parse as css, based... | Arthur de Jong | 2008-06-21 | 2 | -0/+10
* add parsing of script tag and background attributes, base... | Arthur de Jong | 2008-06-15 | 2 | -0/+16
* do not require src attribute for parsing inline style tags | Arthur de Jong | 2008-06-15 | 1 | -1/+1
* update copyright year | Arthur de Jong | 2008-06-15 | 1 | -1/+1
* fix parsing of <param> tag | Arthur de Jong | 2008-05-25 | 1 | -1/+1
* support <iframe> and some common usages of <object> | Arthur de Jong | 2008-05-24 | 1 | -0/+15
* add a warning if the used version of BeautifulSoup contai... | Arthur de Jong | 2007-09-17 | 1 | -0/+5
* also handle http-equiv refresh meta header | Arthur de Jong | 2007-07-15 | 1 | -3/+13
* split out URL cleaning code into own module | Arthur de Jong | 2007-07-07 | 2 | -17/+19
* handle ID attribute as anchor on any tag | Arthur de Jong | 2007-04-24 | 1 | -5/+5
* correctly parse author information | Arthur de Jong | 2007-04-20 | 1 | -2/+2
* introduce HTML parsing using BeautifulSoup with a fall-ba... | Arthur de Jong | 2007-04-20 | 3 | -64/+255
* evaluate archive attribute of <applet> tag instead of cod... | Arthur de Jong | 2007-03-31 | 1 | -2/+5
* add set_encoding method to Link object to do some basic e... | Arthur de Jong | 2006-07-13 | 1 | -13/+11
* add TODOs | Arthur de Jong | 2006-05-31 | 1 | -0/+2
* make decoding try/fall-back code a lot simpler and handle... | Arthur de Jong | 2006-05-15 | 1 | -12/+7
* improve warning text and add comment concerning trying of... | Arthur de Jong | 2006-05-12 | 1 | -1/+2
* ignore unknown entities instead of throwing an error | Arthur de Jong | 2006-05-12 | 1 | -2/+5
* move html escaping and unescaping functions to parsers.html | Arthur de Jong | 2006-05-07 | 1 | -11/+52
* use unichr() to generate Unicode characters, not chr() | Arthur de Jong | 2006-05-07 | 1 | -1/+1
* some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -0/+1
* implement checking for id and name tags in anchors | Arthur de Jong | 2006-05-06 | 1 | -12/+39
* code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 3 | -65/+74
* do not fail on unknown encodings (fall back to system enc... | Arthur de Jong | 2006-04-07 | 1 | -3/+6
* split urlescape() from _urlclean() and ensure that all an... | Arthur de Jong | 2006-03-26 | 1 | -2/+2
* revert catching Exception instead of IOError that was the... | Arthur de Jong | 2006-03-11 | 1 | -1/+1
* implement checking of anchors (there should be no double ... | Arthur de Jong | 2006-03-10 | 1 | -4/+20
* trim spaces from title and author fields and check that t... | Arthur de Jong | 2006-03-10 | 1 | -2/+2
* make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -5/+1
* fix typo (thanks Andrew Kim <Andrew.Kim@revolution.com>) | Arthur de Jong | 2006-01-26 | 1 | -2/+2
* quote links so that they do not contain any non-ASCII cha... | Arthur de Jong | 2006-01-19 | 1 | -7/+13
* bug fix to handle numeric character references better (Un... | Arthur de Jong | 2005-12-26 | 1 | -2/+2
* add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 3 | -0/+9
* store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+23
* also try to get character encoding from XML declaration a... | Arthur de Jong | 2005-09-17 | 1 | -0/+22
* parse character entries as normal data, these entities wi... | Arthur de Jong | 2005-09-17 | 1 | -0/+10
* also feed style tag content to the CSS parser to parse in... | Arthur de Jong | 2005-08-20 | 1 | -0/+7
* remove some debugging functions from CSS parser | Arthur de Jong | 2005-08-20 | 1 | -3/+0
* first attempt at a very simple CSS parser that just summa... | Arthur de Jong | 2005-08-20 | 1 | -1/+28
* add checking of unescaped spaces to the html parser, incl... | Arthur de Jong | 2005-08-20 | 1 | -25/+41
* split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -1/+1
* also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -6/+18
* put compiled regular expression on module level so that i... | Arthur de Jong | 2005-08-12 | 1 | -2/+4
* make parsing handle errors a little more gracefully, than... | Arthur de Jong | 2005-08-01 | 1 | -3/+6
* also catch AttributeError for problem in HTMLParser not f... | Arthur de Jong | 2005-07-31 | 1 | -1/+1
* replace numeric entity refs with their proper values base... | Arthur de Jong | 2005-07-31 | 1 | -2/+11
* put new html parser in place | Arthur de Jong | 2005-07-31 | 1 | -88/+113