Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| add workaround for Python 2.3 (based on a patch by Claire... | Arthur de Jong | 2007-10-09 | 1 | -0/+6 |
| output which parser module is used in debug mode | Arthur de Jong | 2007-07-15 | 1 | -0/+1 |
| just ignore setting encoding to None | Arthur de Jong | 2007-07-15 | 1 | -1/+1 |
| fix printing of None encoding | Arthur de Jong | 2007-07-14 | 1 | -1/+1 |
| use sets instead of sequences for children, embedded, etc... | Arthur de Jong | 2007-07-13 | 1 | -49/+39 |
| split out URL cleaning code into own module | Arthur de Jong | 2007-07-07 | 1 | -48/+3 |
| improve deserialization and handling of Unicode strings | Arthur de Jong | 2007-07-06 | 1 | -2/+1 |
| also lower-case reqanchor | Arthur de Jong | 2007-05-12 | 1 | -0/+2 |
| fix some copyright dates | Arthur de Jong | 2007-05-12 | 1 | -1/+1 |
| lower-case anchor and errors to include id as option | Arthur de Jong | 2007-04-24 | 1 | -1/+3 |
| mark encoding problems and output more debugging | Arthur de Jong | 2007-04-20 | 1 | -2/+2 |
| add some comments to the follow_link() method | Arthur de Jong | 2007-04-06 | 1 | -0/+4 |
| make parsing of URLs and conversion to Link objects a lit... | Arthur de Jong | 2007-04-06 | 1 | -9/+28 |
| get rid of old base (singular) as bases is now used every... | Arthur de Jong | 2007-03-14 | 1 | -3/+0 |
| include list of bases in Site class | Arthur de Jong | 2006-10-23 | 1 | -10/+13 |
| add set_encoding method to Link object to do some basic e... | Arthur de Jong | 2006-07-13 | 1 | -0/+11 |
| store internal, external and yanked regular expressions i... | Arthur de Jong | 2006-06-24 | 1 | -9/+9 |
| split crawler.crawl() function into crawler.crawl() and c... | Arthur de Jong | 2006-05-16 | 1 | -5/+7 |
| also serialize remaining links after crawl | Arthur de Jong | 2006-05-16 | 1 | -0/+8 |
| remove anchor debugging statements | Arthur de Jong | 2006-05-16 | 1 | -2/+0 |
| fix some stupid typos | Arthur de Jong | 2006-05-15 | 1 | -3/+3 |
| add code to serialize links to a file while crawling the ... | Arthur de Jong | 2006-05-15 | 1 | -2/+16 |
| add _ischanged attribute to link objects to indicate chan... | Arthur de Jong | 2006-05-15 | 1 | -0/+10 |
| fix typo in docstring and add comment | Arthur de Jong | 2006-05-07 | 1 | -1/+2 |
| some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -1/+3 |
| also add all unfetched links from a site to make this met... | Arthur de Jong | 2006-04-27 | 1 | -0/+5 |
| make get_link() function a public class function | Arthur de Jong | 2006-04-27 | 1 | -5/+5 |
| move URL checking bit to right function and improve ancho... | Arthur de Jong | 2006-04-27 | 1 | -5/+5 |
| support passing a URL to add_reqanchor() plus some minor ... | Arthur de Jong | 2006-04-27 | 1 | -3/+7 |
| code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 1 | -80/+97 |
| split urlescape() from _urlclean() and ensure that all an... | Arthur de Jong | 2006-03-26 | 1 | -4/+12 |
| implement checking of anchors (there should be no double ... | Arthur de Jong | 2006-03-10 | 1 | -3/+38 |
| bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1 |
| actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1 |
| make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -0/+14 |
| fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2 |
| give some more debugging info while following base URLs a... | Arthur de Jong | 2006-01-15 | 1 | -11/+11 |
| fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0 |
| trim empty ports (http://host:/) from URLs and do not cra... | Arthur de Jong | 2005-12-29 | 1 | -1/+5 |
| add --internal option to match internal URLs with a regul... | Arthur de Jong | 2005-12-28 | 1 | -0/+14 |
| add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3 |
| fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8 |
| store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+2 |
| try to extract character encoding from http response and ... | Arthur de Jong | 2005-09-17 | 1 | -0/+2 |
| add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3 |
| set status to result of fetching the document (not an err... | Arthur de Jong | 2005-08-20 | 1 | -1/+3 |
| fix bug with following redirects where otherwise unrefere... | Arthur de Jong | 2005-08-19 | 1 | -4/+7 |
| move redirect handling code to crawler module, including ... | Arthur de Jong | 2005-08-19 | 1 | -5/+28 |
| split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -6/+15 |
| also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -3/+3 |