Arthur de Jong

Open Source / Free Software developer

path: root/parsers
Commit message | Author | Date | Files | Lines (-/+)
* update copyright years | Arthur de Jong | 2010-09-11 | 3 | -3/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@410 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* handle case where inline CSS is used on a page with <base href=".."> | Arthur de Jong | 2009-01-14 | 3 | -7/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@397 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* copy-paste fix (thanks Robert M. Jansen <dutch12154@yahoo.com>) | Arthur de Jong | 2008-07-13 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@386 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* call tidy (if available) on HTML content (based on a patch by Henning Sielaff <hsielaff@eformation.de>) | Arthur de Jong | 2008-07-04 | 2 | -7/+59
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@383 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix name of file | Arthur de Jong | 2008-07-04 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@382 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also pick up any style attributes and parse as css, based on a patch by Robert M. Jansen <dutch12154@yahoo.com> | Arthur de Jong | 2008-06-21 | 2 | -0/+10
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@381 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add parsing of script tag and background attributes, based on a patch by Robert M. Jansen <dutch12154@yahoo.com> | Arthur de Jong | 2008-06-15 | 2 | -0/+16
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@380 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* do not require src attribute for parsing inline style tags | Arthur de Jong | 2008-06-15 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@379 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* update copyright year | Arthur de Jong | 2008-06-15 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@378 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix parsing of <param> tag | Arthur de Jong | 2008-05-25 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@376 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* support <iframe> and some common usages of <object> | Arthur de Jong | 2008-05-24 | 1 | -0/+15
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@373 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add a warning if the used version of BeautifulSoup contains a bug | Arthur de Jong | 2007-09-17 | 1 | -0/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@357 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also handle http-equiv refresh meta header | Arthur de Jong | 2007-07-15 | 1 | -3/+13
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@349 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split out URL cleaning code into own module | Arthur de Jong | 2007-07-07 | 2 | -17/+19
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@339 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* handle ID attribute as anchor on any tag | Arthur de Jong | 2007-04-24 | 1 | -5/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@326 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* correctly parse author information | Arthur de Jong | 2007-04-20 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@324 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* introduce HTML parsing using BeautifulSoup with a fall-back mechanism to the old HTMLParser based solution | Arthur de Jong | 2007-04-20 | 3 | -64/+255
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@323 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* evaluate archive attribute of <applet> tag instead of code attribute if that is present | Arthur de Jong | 2007-03-31 | 1 | -2/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@313 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add set_encoding method to Link object to do some basic encoding sanity checks | Arthur de Jong | 2006-07-13 | 1 | -13/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@297 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add TODOs | Arthur de Jong | 2006-05-31 | 1 | -0/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@282 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make decoding try/fall-back code a lot simpler and handle case where encoding is specified as empty string | Arthur de Jong | 2006-05-15 | 1 | -12/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@264 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* improve warning text and add comment concerning trying of encodings | Arthur de Jong | 2006-05-12 | 1 | -1/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@263 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* ignore unknown entities instead of throwing an error | Arthur de Jong | 2006-05-12 | 1 | -2/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@262 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move html escaping and unescaping functions to parsers.html | Arthur de Jong | 2006-05-07 | 1 | -11/+52
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@255 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* use unichr() to generate Unicode characters, not chr() | Arthur de Jong | 2006-05-07 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@254 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -0/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@252 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* implement checking for id and name tags in anchors | Arthur de Jong | 2006-05-06 | 1 | -12/+39
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@251 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 3 | -65/+74
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@242 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* do not fail on unknown encodings (fall back to system encoding) and add some TODOs to do extra encoding checking | Arthur de Jong | 2006-04-07 | 1 | -3/+6
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@236 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split urlescape() from _urlclean() and ensure that all anchors are consistently URL-encoded | Arthur de Jong | 2006-03-26 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@235 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* revert catching Exception instead of IOError that was there for testing | Arthur de Jong | 2006-03-11 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@231 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* implement checking of anchors (there should be no double anchors and all referenced anchors should exist) | Arthur de Jong | 2006-03-10 | 1 | -4/+20
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@230 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* trim spaces from title and author fields and check that title is not empty string (apart from undefined) | Arthur de Jong | 2006-03-10 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@228 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make sure all URLs are consistently URL-encoded where it counts | Arthur de Jong | 2006-01-29 | 1 | -5/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@221 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo (thanks Andrew Kim <Andrew.Kim@revolution.com>) | Arthur de Jong | 2006-01-26 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@218 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* quote links so that they do not contain any non-ASCII characters to avoid problems later on (and add some more debugging) | Arthur de Jong | 2006-01-19 | 1 | -7/+13
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@214 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* bug fix to handle numeric character references better (Unicode characters) | Arthur de Jong | 2005-12-26 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@191 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add copyright clarification to specify that generated output files are not covered by our copyright | Arthur de Jong | 2005-12-17 | 3 | -0/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@186 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* store author and title in Unicode internally and ensure that they are output as UTF-8 | Arthur de Jong | 2005-09-17 | 1 | -2/+23
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@179 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also try to get character encoding from XML declaration and http-equiv meta tag | Arthur de Jong | 2005-09-17 | 1 | -0/+22
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@178 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* parse character entries as normal data, these entities will be expanded later on (they are also used in attribute values | Arthur de Jong | 2005-09-17 | 1 | -0/+10
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@176 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also feed style tag content to the CSS parser to parse inline CSS | Arthur de Jong | 2005-08-20 | 1 | -0/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@148 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove some debugging functions from CSS parser | Arthur de Jong | 2005-08-20 | 1 | -3/+0
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@147 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* first attempt at a very simple CSS parser that just summarises links to images and imported CSS files | Arthur de Jong | 2005-08-20 | 1 | -1/+28
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@146 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add checking of unescaped spaces to the html parser, including line and column information | Arthur de Jong | 2005-08-20 | 1 | -25/+41
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@144 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split problems into page problems (parsing errors, wrong links, etc) and link problems (errors retreiving the document) | Arthur de Jong | 2005-08-19 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@138 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also pass mimetypes to scheme modules to only fetch content if we can parse the content type | Arthur de Jong | 2005-08-12 | 1 | -6/+18
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@128 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* put compiled regular expression on module level so that it is compiled only once | Arthur de Jong | 2005-08-12 | 1 | -2/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@125 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make parsing handle errors a little more gracefully, thanks to Stefan Schröder <stefan@tokonoma.de> for all the testing | Arthur de Jong | 2005-08-01 | 1 | -3/+6
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@122 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also catch AttributeError for problem in HTMLParser not fully supporting continuing after errors | Arthur de Jong | 2005-07-31 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@119 86f53f14-5ff3-0310-afe5-9b438ce3f40c