Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
| Commit message | Author | Date | Files | Lines (-/+) |
|---|---|---|---|---|
| bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1 |
| actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1 |
| make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -0/+14 |
| fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2 |
| give some more debugging info while following base URLs a... | Arthur de Jong | 2006-01-15 | 1 | -11/+11 |
| fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0 |
| trim empty ports (http://host:/) from URLs and do not cra... | Arthur de Jong | 2005-12-29 | 1 | -1/+5 |
| add --internal option to match internal URLs with a regul... | Arthur de Jong | 2005-12-28 | 1 | -0/+14 |
| add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3 |
| fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8 |
| store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+2 |
| try to extract character encoding from http response and ... | Arthur de Jong | 2005-09-17 | 1 | -0/+2 |
| add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3 |
| set status to result of fetching the document (not an err... | Arthur de Jong | 2005-08-20 | 1 | -1/+3 |
| fix bug with following redirects where otherwise unrefere... | Arthur de Jong | 2005-08-19 | 1 | -4/+7 |
| move redirect handling code to crawler module, including ... | Arthur de Jong | 2005-08-19 | 1 | -5/+28 |
| split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -6/+15 |
| also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -3/+3 |
| add checkurl method to clean up URLs and report problems ... | Arthur de Jong | 2005-08-12 | 1 | -2/+14 |
| while cleaning URLs also make host part lower-case and al... | Arthur de Jong | 2005-07-31 | 1 | -3/+11 |
| fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1 |
| fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1 |
| follow_link() now returns None when trying to follow a re... | Arthur de Jong | 2005-07-30 | 1 | -7/+18 |
| give second search through website a slightly different d... | Arthur de Jong | 2005-07-30 | 1 | -1/+1 |
| also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1 |
| make a _urlclean() function to always store a proper URL ... | Arthur de Jong | 2005-07-30 | 1 | -2/+12 |
| import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1 |
| do an extra breadth first traversal of the site to combin... | Arthur de Jong | 2005-07-29 | 1 | -5/+61 |
| remove references to email addresses where they are not u... | Arthur de Jong | 2005-07-29 | 1 | -3/+3 |
| turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1 |
| only add links to crawl list if they are not in there all... | Arthur de Jong | 2005-07-24 | 1 | -2/+2 |
| fix regular expression matching | Arthur de Jong | 2005-07-23 | 1 | -2/+3 |
| Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1 |
| add support for sleep between requests | Arthur de Jong | 2005-07-22 | 1 | -0/+4 |
| almost complete rewrite of crawling and site state code m... | Arthur de Jong | 2005-07-22 | 1 | -0/+330 |