Arthur de Jong

Open Source / Free Software developer

Commit message | Author | Age | Files | Lines
...
* add checking of unescaped spaces to the html parser, including line and column information | Arthur de Jong | 2005-08-20 | 1 | -25/+41
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@144 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* pass site as parameter to parse_args() instead of declaring it global | Arthur de Jong | 2005-08-19 | 1 | -4/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@143 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix bug with following redirects where otherwise unreferenced links were removed and implement redirect loop detection | Arthur de Jong | 2005-08-19 | 1 | -4/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@142 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move redirect handling code to crawler module, including redirect loop detection code | Arthur de Jong | 2005-08-19 | 4 | -24/+32
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@141 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix html bug and improve bad link string | Arthur de Jong | 2005-08-19 | 1 | -2/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@140 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* change html display of problems to a nicer list | Arthur de Jong | 2005-08-19 | 6 | -8/+19
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@139 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split problems into page problems (parsing errors, wrong links, etc) and link problems (errors retrieving the document) | Arthur de Jong | 2005-08-19 | 11 | -47/+65
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@138 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* get files ready for 1.9.3 release [tag: 1.9.3] | Arthur de Jong | 2005-08-16 | 5 | -18/+114
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@136 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* pick up configured filenames if present in directories | Arthur de Jong | 2005-08-16 | 3 | -46/+76
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@135 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add extra debugging info | Arthur de Jong | 2005-08-16 | 1 | -8/+15
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@134 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* use a pool of ftp connections to keep ftp connection to a host open to do multiple requests (this greatly speeds up crawling of ftp sites) | Arthur de Jong | 2005-08-13 | 1 | -18/+25
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@133 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* almost complete reimplementation of the ftp scheme, handling errors more gracefully and also crawl normal ftp directories | Arthur de Jong | 2005-08-13 | 1 | -62/+64
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@132 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add missing newline and trim trailing newline of extra link info | Arthur de Jong | 2005-08-13 | 1 | -2/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@131 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* complete reimplementation of file module, reading index.html from directory, otherwise read directory contents | Arthur de Jong | 2005-08-12 | 1 | -21/+49
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@130 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* rename parameter to acceptedtypes to not conflict with mimetypes module | Arthur de Jong | 2005-08-12 | 4 | -6/+6
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@129 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also pass mimetypes to scheme modules to only fetch content if we can parse the content type | Arthur de Jong | 2005-08-12 | 6 | -15/+29
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@128 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* don't print referenced from if there are no parents | Arthur de Jong | 2005-08-12 | 1 | -0/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@127 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add checkurl method to clean up URLs and report problems (currently only checks for spaces in URLs) | Arthur de Jong | 2005-08-12 | 1 | -2/+14
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@126 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* put compiled regular expression on module level so that it is compiled only once | Arthur de Jong | 2005-08-12 | 1 | -2/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@125 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* small fix to render menu better under MSIE | Arthur de Jong | 2005-08-12 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@124 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add some extra information to every link with a nicely formatted size | Arthur de Jong | 2005-08-11 | 1 | -2/+60
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@123 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make parsing handle errors a little more gracefully, thanks to Stefan Schröder <stefan@tokonoma.de> for all the testing | Arthur de Jong | 2005-08-01 | 1 | -3/+6
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@122 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* get files ready for 1.9.2 release [tag: 1.9.2] | Arthur de Jong | 2005-07-31 | 4 | -248/+391
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@120 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also catch AttributeError for problem in HTMLParser not fully supporting continuing after errors | Arthur de Jong | 2005-07-31 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@119 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add note about supported versions of python | Arthur de Jong | 2005-07-31 | 1 | -0/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@118 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* replace numeric entity refs with their proper values based on patch by Eric W.Brown <eric@saugus.net> | Arthur de Jong | 2005-07-31 | 1 | -2/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@117 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* put new html parser in place | Arthur de Jong | 2005-07-31 | 1 | -88/+113
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@116 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add https module as a wrapper to the http module | Arthur de Jong | 2005-07-31 | 1 | -0/+26
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@115 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* while cleaning URLs also make host part lower-case and also clean added internal URLs | Arthur de Jong | 2005-07-31 | 1 | -3/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@114 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@113 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@112 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* follow_link() now returns None when trying to follow a redirect whose target is not crawled, also don't add children and embeds when we are an external link | Arthur de Jong | 2005-07-30 | 1 | -7/+18
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@111 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove version and author from module as no other module has one (except the plugins themselves) | Arthur de Jong | 2005-07-30 | 1 | -3/+0
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@110 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove support for extra configurable headers | Arthur de Jong | 2005-07-30 | 1 | -4/+0
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@109 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* reimplement http module to be a little more generic and clean and handle errors cleaner and more consistently | Arthur de Jong | 2005-07-30 | 1 | -97/+91
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@108 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* give second search through website a slightly different debug message | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@107 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@106 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make a _urlclean() function to always store a proper URL without a fragment and with at least a slash for URLs with path elements | Arthur de Jong | 2005-07-30 | 1 | -2/+12
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@105 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* some minor tweaks in the documentation | Arthur de Jong | 2005-07-30 | 1 | -6/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@104 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@103 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* do an extra breadth first traversal of the site to combine links into pages, combining page children and determining depth of every page and using all this in the sitemap | Arthur de Jong | 2005-07-29 | 2 | -29/+79
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@102 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* change email address from arthur@tiefighter.et.tudelft.nl to arthur@ch.tudelft.nl (including URLs etc) | Arthur de Jong | 2005-07-29 | 4 | -11/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@101 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove another reference of an email address | Arthur de Jong | 2005-07-29 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@100 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove references to email addresses where they are not useful, based on a partial patch by Evelyn Mitchell <efm@tummy.com> | Arthur de Jong | 2005-07-29 | 26 | -73/+73
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@99 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix a couple of typos, also thanks to Scott Kirkwood <scottakirkwood@gmail.com> for spotting another one | Arthur de Jong | 2005-07-27 | 4 | -6/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@98 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@97 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo spotted by Scott Kirkwood <scottakirkwood@gmail.com> | Arthur de Jong | 2005-07-26 | 2 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@96 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* get files ready for 1.9.1 release [tag: 1.9.1] | Arthur de Jong | 2005-07-25 | 3 | -2/+35
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@94 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo, thanks to Stefan Schröder <stefan@tokonoma.de> | Arthur de Jong | 2005-07-25 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@93 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* only report on internal links | Arthur de Jong | 2005-07-25 | 1 | -0/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@92 86f53f14-5ff3-0310-afe5-9b438ce3f40c