Arthur de Jong

Open Source / Free Software developer

path: root/parsers
Commit message | Author | Age | Files | Lines
* make sure all URLs are consistently URL-encoded where it counts | Arthur de Jong | 2006-01-29 | 1 | -5/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@221 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo (thanks Andrew Kim <Andrew.Kim@revolution.com>) | Arthur de Jong | 2006-01-26 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@218 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* quote links so that they do not contain any non-ASCII characters to avoid problems later on (and add some more debugging) | Arthur de Jong | 2006-01-19 | 1 | -7/+13
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@214 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* bug fix to handle numeric character references better (Unicode characters) | Arthur de Jong | 2005-12-26 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@191 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add copyright clarification to specify that generated output files are not covered by our copyright | Arthur de Jong | 2005-12-17 | 3 | -0/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@186 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* store author and title in Unicode internally and ensure that they are output as UTF-8 | Arthur de Jong | 2005-09-17 | 1 | -2/+23
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@179 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also try to get character encoding from XML declaration and http-equiv meta tag | Arthur de Jong | 2005-09-17 | 1 | -0/+22
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@178 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* parse character entries as normal data, these entities will be expanded later on (they are also used in attribute values | Arthur de Jong | 2005-09-17 | 1 | -0/+10
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@176 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also feed style tag content to the CSS parser to parse inline CSS | Arthur de Jong | 2005-08-20 | 1 | -0/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@148 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove some debugging functions from CSS parser | Arthur de Jong | 2005-08-20 | 1 | -3/+0
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@147 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* first attempt at a very simple CSS parser that just summarises links to images and imported CSS files | Arthur de Jong | 2005-08-20 | 1 | -1/+28
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@146 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add checking of unescaped spaces to the html parser, including line and column information | Arthur de Jong | 2005-08-20 | 1 | -25/+41
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@144 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split problems into page problems (parsing errors, wrong links, etc) and link problems (errors retreiving the document) | Arthur de Jong | 2005-08-19 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@138 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also pass mimetypes to scheme modules to only fetch content if we can parse the content type | Arthur de Jong | 2005-08-12 | 1 | -6/+18
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@128 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* put compiled regular expression on module level so that it is compiled only once | Arthur de Jong | 2005-08-12 | 1 | -2/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@125 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make parsing handle errors a little more gracefully, thanks to Stefan Schröder <stefan@tokonoma.de> for all the testing | Arthur de Jong | 2005-08-01 | 1 | -3/+6
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@122 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also catch AttributeError for problem in HTMLParser not fully supporting continuing after errors | Arthur de Jong | 2005-07-31 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@119 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* replace numeric entity refs with their proper values based on patch by Eric W.Brown <eric@saugus.net> | Arthur de Jong | 2005-07-31 | 1 | -2/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@117 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* put new html parser in place | Arthur de Jong | 2005-07-31 | 1 | -88/+113
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@116 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove references to email addresses where they are not useful, based on a partial patch by Evelyn Mitchell <efm@tummy.com> | Arthur de Jong | 2005-07-29 | 3 | -5/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@99 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* empty module as place holder to parse CSS (referenced from __init__.py already) | Arthur de Jong | 2005-07-25 | 1 | -0/+20
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@91 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* don't replace an already set title | Arthur de Jong | 2005-07-25 | 1 | -1/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@90 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@72 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* almost complete rewrite of crawling and site state code making children and parents link objects instead of URLs and giving link member variables better names, change plugins accordingly, make scheme handling more pluggable and only use one function call and have a better pluggable structure for content parsing (currently only html) | Arthur de Jong | 2005-07-22 | 2 | -20/+65
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@66 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move htmlparse to a more generic parsers package, cleaning up the code and simplifying dependencies | Arthur de Jong | 2005-07-09 | 2 | -0/+128
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@58 86f53f14-5ff3-0310-afe5-9b438ce3f40c