Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
Commit message | Author | Age | Files | Lines
* bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@224 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@223 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make sure all URLs are consistently URL-encoded where it counts | Arthur de Jong | 2006-01-29 | 1 | -0/+14
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@221 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@213 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* give some more debugging info while following base URLs and no longer delete unreferenced followed links | Arthur de Jong | 2006-01-15 | 1 | -11/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@212 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@209 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* trim empty ports (http://host:/) from URLs and do not crash on improperly formatted URLs | Arthur de Jong | 2005-12-29 | 1 | -1/+5
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@206 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add --internal option to match internal URLs with a regular expression | Arthur de Jong | 2005-12-28 | 1 | -0/+14
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@204 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add copyright clarification to specify that generated output files are not covered by our copyright | Arthur de Jong | 2005-12-17 | 1 | -0/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@186 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@184 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* store author and title in Unicode internally and ensure that they are output as UTF-8 | Arthur de Jong | 2005-09-17 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@179 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* try to extract character encoding from http response and store it in the link object | Arthur de Jong | 2005-09-17 | 1 | -0/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@175 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@155 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* set status to result of fetching the document (not an error indicator) | Arthur de Jong | 2005-08-20 | 1 | -1/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@145 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix bug with following redirects where otherwise unreferenced links were removed and implement redirect loop detection | Arthur de Jong | 2005-08-19 | 1 | -4/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@142 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move redirect handling code to crawler module, including redirect loop detection code | Arthur de Jong | 2005-08-19 | 1 | -5/+28
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@141 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* split problems into page problems (parsing errors, wrong links, etc) and link problems (errors retrieving the document) | Arthur de Jong | 2005-08-19 | 1 | -6/+15
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@138 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also pass mimetypes to scheme modules to only fetch content if we can parse the content type | Arthur de Jong | 2005-08-12 | 1 | -3/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@128 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add checkurl method to clean up URLs and report problems (currently only checks for spaces in URLs) | Arthur de Jong | 2005-08-12 | 1 | -2/+14
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@126 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* while cleaning URLs also make host part lower-case and also clean added internal URLs | Arthur de Jong | 2005-07-31 | 1 | -3/+11
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@114 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@113 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@112 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* follow_link() now returns None when trying to follow a redirect whose target is not crawled, also don't add children and embeds when we are an external link | Arthur de Jong | 2005-07-30 | 1 | -7/+18
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@111 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* give second search through website a slightly different debug message | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@107 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@106 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* make a _urlclean() function to always store a proper URL without a fragment and with at least a slash for URLs with path elements | Arthur de Jong | 2005-07-30 | 1 | -2/+12
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@105 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@103 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* do an extra breadth first traversal of the site to combine links into pages, combining page children and determining depth of every page and using all this in the sitemap | Arthur de Jong | 2005-07-29 | 1 | -5/+61
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@102 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* remove references to email addresses where they are not useful, based on a partial patch by Evelyn Mitchell <efm@tummy.com> | Arthur de Jong | 2005-07-29 | 1 | -3/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@99 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@97 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* only add links to crawl list if they are not in there already | Arthur de Jong | 2005-07-24 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@79 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix regular expression matching | Arthur de Jong | 2005-07-23 | 1 | -2/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@77 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@72 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add support for sleep between requests | Arthur de Jong | 2005-07-22 | 1 | -0/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@71 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* almost complete rewrite of crawling and site state code making children and parents link objects instead of URLs and giving link member variables better names, change plugins accordingly, make scheme handling more pluggable and only use one function call and have a better pluggable structure for content parsing (currently only html) | Arthur de Jong | 2005-07-22 | 1 | -0/+330
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@66 86f53f14-5ff3-0310-afe5-9b438ce3f40c
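
Several of the commits above (r105, r114 and r206) describe URL clean-up steps: dropping the fragment, ensuring at least a slash as the path, lower-casing the host part and trimming an empty port. The sketch below illustrates those steps together. It is not the code from crawler.py: the function name clean_url is hypothetical, and it uses urllib.parse from the Python 3 standard library, which postdates these commits.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Return a normalised copy of url.

    Hypothetical helper for illustration only; not webcheck's _urlclean().
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Keep any "user:password@" prefix untouched; only the host is lower-cased.
    userinfo, at, hostport = netloc.rpartition("@")
    hostport = hostport.lower()
    # Trim an empty port, turning "host:" into "host".
    if hostport.endswith(":"):
        hostport = hostport[:-1]
    netloc = userinfo + at + hostport
    # A URL with a host but no path gets at least a slash.
    if netloc and not path:
        path = "/"
    # Rebuild the URL with an empty fragment.
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    print(clean_url("http://Example.COM:/index.html#top"))
    print(clean_url("http://example.com"))
```

Normalising URLs this way before storing them means that two spellings of the same address map to a single entry, which is presumably why the clean-up is also applied to internal URLs in r114.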