Arthur de Jong
Open Source / Free Software developer
webcheck
master
A website link and structure checker
path: root/crawler.py
Commit message | Author | Age | Files | Lines
add docstring | Arthur de Jong | 2008-07-19 | 1 | -0/+2
update copyright year | Arthur de Jong | 2008-06-15 | 1 | -1/+1
also log exception information to the output | Arthur de Jong | 2008-05-25 | 1 | -0/+3
catch exceptions from parsing module | Arthur de Jong | 2008-05-24 | 1 | -1/+4
avoid reporting a problem more than once | Arthur de Jong | 2007-12-14 | 1 | -2/+4
add workaround for Python 2.3 (based on a patch by Claire... | Arthur de Jong | 2007-10-09 | 1 | -0/+6
output which parser module is used in debug mode | Arthur de Jong | 2007-07-15 | 1 | -0/+1
just ignore setting encoding to None | Arthur de Jong | 2007-07-15 | 1 | -1/+1
fix printing of None encoding | Arthur de Jong | 2007-07-14 | 1 | -1/+1
use sets instead of sequences for children, embedded, etc... | Arthur de Jong | 2007-07-13 | 1 | -49/+39
split out URL cleaning code into own module | Arthur de Jong | 2007-07-07 | 1 | -48/+3
improve deserialization and handling of Unicode strings | Arthur de Jong | 2007-07-06 | 1 | -2/+1
also lower-case reqanchor | Arthur de Jong | 2007-05-12 | 1 | -0/+2
fix some copyright dates | Arthur de Jong | 2007-05-12 | 1 | -1/+1
lower-case anchor and errors to include id as option | Arthur de Jong | 2007-04-24 | 1 | -1/+3
mark encoding problems and output more debugging | Arthur de Jong | 2007-04-20 | 1 | -2/+2
add some comments to the follow_link() method | Arthur de Jong | 2007-04-06 | 1 | -0/+4
make parsing of URLs and conversion to Link objects a lit... | Arthur de Jong | 2007-04-06 | 1 | -9/+28
get rid of old base (singular) as bases is now used every... | Arthur de Jong | 2007-03-14 | 1 | -3/+0
include list of bases in Site class | Arthur de Jong | 2006-10-23 | 1 | -10/+13
add set_encoding method to Link object to do some basic e... | Arthur de Jong | 2006-07-13 | 1 | -0/+11
store internal, external and yanked regular expressions i... | Arthur de Jong | 2006-06-24 | 1 | -9/+9
split crawler.crawl() function into crawler.crawl() and c... | Arthur de Jong | 2006-05-16 | 1 | -5/+7
also serialize remaining links after crawl | Arthur de Jong | 2006-05-16 | 1 | -0/+8
remove anchor debugging statements | Arthur de Jong | 2006-05-16 | 1 | -2/+0
fix some stupid typos | Arthur de Jong | 2006-05-15 | 1 | -3/+3
add code to serialize links to a file while crawling the ... | Arthur de Jong | 2006-05-15 | 1 | -2/+16
add _ischanged attribute to link objects to indicate chan... | Arthur de Jong | 2006-05-15 | 1 | -0/+10
fix typo in docstring and add comment | Arthur de Jong | 2006-05-07 | 1 | -1/+2
some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -1/+3
also add all unfetched links from a site to make this met... | Arthur de Jong | 2006-04-27 | 1 | -0/+5
make get_link() function a public class function | Arthur de Jong | 2006-04-27 | 1 | -5/+5
move URL checking bit to right function and improve ancho... | Arthur de Jong | 2006-04-27 | 1 | -5/+5
support passing a URL to add_reqanchor() plus some minor ... | Arthur de Jong | 2006-04-27 | 1 | -3/+7
code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 1 | -80/+97
split urlescape() from _urlclean() and ensure that all an... | Arthur de Jong | 2006-03-26 | 1 | -4/+12
implement checking of anchors (there should be no double ... | Arthur de Jong | 2006-03-10 | 1 | -3/+38
bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1
actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1
make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -0/+14
fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2
give some more debugging info while following base URLs a... | Arthur de Jong | 2006-01-15 | 1 | -11/+11
fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0
trim empty ports (http://host:/) from URLs and do not cra... | Arthur de Jong | 2005-12-29 | 1 | -1/+5
add --internal option to match internal URLs with a regul... | Arthur de Jong | 2005-12-28 | 1 | -0/+14
add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3
fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8
store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+2
try to extract character encoding from http response and ... | Arthur de Jong | 2005-09-17 | 1 | -0/+2
add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3