Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
Commit message | Author | Date | Files | Lines
* add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3
* set status to result of fetching the document (not an err... | Arthur de Jong | 2005-08-20 | 1 | -1/+3
* fix bug with following redirects where otherwise unrefere... | Arthur de Jong | 2005-08-19 | 1 | -4/+7
* move redirect handling code to crawler module, including ... | Arthur de Jong | 2005-08-19 | 1 | -5/+28
* split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -6/+15
* also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -3/+3
* add checkurl method to clean up URLs and report problems ... | Arthur de Jong | 2005-08-12 | 1 | -2/+14
* while cleaning URLs also make host part lower-case and al... | Arthur de Jong | 2005-07-31 | 1 | -3/+11
* fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* follow_link() now returns None when trying to follow a re... | Arthur de Jong | 2005-07-30 | 1 | -7/+18
* give second search through website a slightly different d... | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* make a _urlclean() function to always store a proper URL ... | Arthur de Jong | 2005-07-30 | 1 | -2/+12
* import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1
* do an extra breadth first traversal of the site to combin... | Arthur de Jong | 2005-07-29 | 1 | -5/+61
* remove references to email addresses where they are not u... | Arthur de Jong | 2005-07-29 | 1 | -3/+3
* turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1
* only add links to crawl list if they are not in there all... | Arthur de Jong | 2005-07-24 | 1 | -2/+2
* fix regular expression matching | Arthur de Jong | 2005-07-23 | 1 | -2/+3
* Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1
* add support for sleep between requests | Arthur de Jong | 2005-07-22 | 1 | -0/+4
* almost complete rewrite of crawling and site state code m... | Arthur de Jong | 2005-07-22 | 1 | -0/+330