Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
Commit message | Author | Age | Files | Lines
* include list of bases in Site class | Arthur de Jong | 2006-10-23 | 1 | -10/+13
* add set_encoding method to Link object to do some basic e... | Arthur de Jong | 2006-07-13 | 1 | -0/+11
* store internal, external and yanked regular expressions i... | Arthur de Jong | 2006-06-24 | 1 | -9/+9
* split crawler.crawl() function into crawler.crawl() and c... | Arthur de Jong | 2006-05-16 | 1 | -5/+7
* also serialize remaining links after crawl | Arthur de Jong | 2006-05-16 | 1 | -0/+8
* remove anchor debugging statements | Arthur de Jong | 2006-05-16 | 1 | -2/+0
* fix some stupid typos | Arthur de Jong | 2006-05-15 | 1 | -3/+3
* add code to serialize links to a file while crawling the ... | Arthur de Jong | 2006-05-15 | 1 | -2/+16
* add _ischanged attribute to link objects to indicate chan... | Arthur de Jong | 2006-05-15 | 1 | -0/+10
* fix typo in docstring and add comment | Arthur de Jong | 2006-05-07 | 1 | -1/+2
* some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -1/+3
* also add all unfetched links from a site to make this met... | Arthur de Jong | 2006-04-27 | 1 | -0/+5
* make get_link() function a public class function | Arthur de Jong | 2006-04-27 | 1 | -5/+5
* move URL checking bit to right function and improve ancho... | Arthur de Jong | 2006-04-27 | 1 | -5/+5
* support passing a URL to add_reqanchor() plus some minor ... | Arthur de Jong | 2006-04-27 | 1 | -3/+7
* code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 1 | -80/+97
* split urlescape() from _urlclean() and ensure that all an... | Arthur de Jong | 2006-03-26 | 1 | -4/+12
* implement checking of anchors (there should be no double ... | Arthur de Jong | 2006-03-10 | 1 | -3/+38
* bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1
* actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1
* make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -0/+14
* fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2
* give some more debugging info while following base URLs a... | Arthur de Jong | 2006-01-15 | 1 | -11/+11
* fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0
* trim empty ports (http://host:/) from URLs and do not cra... | Arthur de Jong | 2005-12-29 | 1 | -1/+5
* add --internal option to match internal URLs with a regul... | Arthur de Jong | 2005-12-28 | 1 | -0/+14
* add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3
* fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8
* store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+2
* try to extract character encoding from http response and ... | Arthur de Jong | 2005-09-17 | 1 | -0/+2
* add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3
* set status to result of fetching the document (not an err... | Arthur de Jong | 2005-08-20 | 1 | -1/+3
* fix bug with following redirects where otherwise unrefere... | Arthur de Jong | 2005-08-19 | 1 | -4/+7
* move redirect handling code to crawler module, including ... | Arthur de Jong | 2005-08-19 | 1 | -5/+28
* split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -6/+15
* also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -3/+3
* add checkurl method to clean up URLs and report problems ... | Arthur de Jong | 2005-08-12 | 1 | -2/+14
* while cleaning URLs also make host part lower-case and al... | Arthur de Jong | 2005-07-31 | 1 | -3/+11
* fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* follow_link() now returns None when trying to follow a re... | Arthur de Jong | 2005-07-30 | 1 | -7/+18
* give second search through website a slightly different d... | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1
* make a _urlclean() function to always store a proper URL ... | Arthur de Jong | 2005-07-30 | 1 | -2/+12
* import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1
* do an extra breadth first traversal of the site to combin... | Arthur de Jong | 2005-07-29 | 1 | -5/+61
* remove references to email addresses where they are not u... | Arthur de Jong | 2005-07-29 | 1 | -3/+3
* turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1
* only add links to crawl list if they are not in there all... | Arthur de Jong | 2005-07-24 | 1 | -2/+2
* fix regular expression matching | Arthur de Jong | 2005-07-23 | 1 | -2/+3