Arthur de Jong
Open Source / Free Software developer
webcheck (branch: master)
A website link and structure checker
path: root/crawler.py
Commit message | Author | Date | Files | Lines
bug fix in matching URL-encoding | Arthur de Jong | 2006-01-29 | 1 | -1/+1
actually decode URL-encoded character as hex not decimal | Arthur de Jong | 2006-01-29 | 1 | -1/+1
make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -0/+14
fix debug message to print url instead of object reference | Arthur de Jong | 2006-01-19 | 1 | -2/+2
give some more debugging info while following base URLs a... | Arthur de Jong | 2006-01-15 | 1 | -11/+11
fix copy-pasto from r204 | Arthur de Jong | 2005-12-30 | 1 | -3/+0
trim empty ports (http://host:/) from URLs and do not cra... | Arthur de Jong | 2005-12-29 | 1 | -1/+5
add --internal option to match internal URLs with a regul... | Arthur de Jong | 2005-12-28 | 1 | -0/+14
add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3
fix wrapping of text in pydoc | Arthur de Jong | 2005-12-17 | 1 | -4/+8
store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+2
try to extract character encoding from http response and ... | Arthur de Jong | 2005-09-17 | 1 | -0/+2
add note about making instances of Link class | Arthur de Jong | 2005-08-25 | 1 | -0/+3
set status to result of fetching the document (not an err... | Arthur de Jong | 2005-08-20 | 1 | -1/+3
fix bug with following redirects where otherwise unrefere... | Arthur de Jong | 2005-08-19 | 1 | -4/+7
move redirect handling code to crawler module, including ... | Arthur de Jong | 2005-08-19 | 1 | -5/+28
split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -6/+15
also pass mimetypes to scheme modules to only fetch conte... | Arthur de Jong | 2005-08-12 | 1 | -3/+3
add checkurl method to clean up URLs and report problems ... | Arthur de Jong | 2005-08-12 | 1 | -2/+14
while cleaning URLs also make host part lower-case and al... | Arthur de Jong | 2005-07-31 | 1 | -3/+11
fix a thinko | Arthur de Jong | 2005-07-30 | 1 | -1/+1
fix typo | Arthur de Jong | 2005-07-30 | 1 | -1/+1
follow_link() now returns None when trying to follow a re... | Arthur de Jong | 2005-07-30 | 1 | -7/+18
give second search through website a slightly different d... | Arthur de Jong | 2005-07-30 | 1 | -1/+1
also ignore io errors when retrieving robots.txt files | Arthur de Jong | 2005-07-30 | 1 | -1/+1
make a _urlclean() function to always store a proper URL ... | Arthur de Jong | 2005-07-30 | 1 | -2/+12
import time as we need it for sleep | Arthur de Jong | 2005-07-29 | 1 | -0/+1
do an extra breadth first traversal of the site to combin... | Arthur de Jong | 2005-07-29 | 1 | -5/+61
remove references to email addresses where they are not u... | Arthur de Jong | 2005-07-29 | 1 | -3/+3
turn tocheck list into fifo queue | Arthur de Jong | 2005-07-27 | 1 | -1/+1
only add links to crawl list if they are not in there all... | Arthur de Jong | 2005-07-24 | 1 | -2/+2
fix regular expression matching | Arthur de Jong | 2005-07-23 | 1 | -2/+3
Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1
add support for sleep between requests | Arthur de Jong | 2005-07-22 | 1 | -0/+4
almost complete rewrite of crawling and site state code m... | Arthur de Jong | 2005-07-22 | 1 | -0/+330
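
Several of the commit messages in this log describe URL clean-up steps: making the host part lower-case, trimming empty ports such as http://host:/, and reading URL-encoded characters as hexadecimal rather than decimal. The sketch below illustrates those three steps in modern Python. It is not the code from crawler.py: the names clean_url_sketch and _normalise_escape are hypothetical, and the choice to decode only unreserved characters (and upper-case the remaining escapes) is an assumption made for the example, since the relevant commit messages are truncated.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helpers for illustration only; not taken from crawler.py.
_ESCAPE = re.compile(r"%([0-9a-fA-F]{2})")
_UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def _normalise_escape(match):
    # The two digits after '%' are hexadecimal, so parse them with base 16.
    # Parsing them as decimal would turn %41 into ')' instead of 'A'.
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        # Assumption: characters that never need escaping are decoded.
        return char
    # Everything else stays escaped, with upper-case hex digits.
    return "%" + match.group(1).upper()


def clean_url_sketch(url):
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Lower-case only the host[:port] part, leaving any user:password intact.
    userinfo, at, hostport = netloc.rpartition("@")
    hostport = hostport.lower()
    # Trim an empty port, e.g. "host:" becomes "host".
    if hostport.endswith(":"):
        hostport = hostport[:-1]
    path = _ESCAPE.sub(_normalise_escape, path)
    return urlunsplit((scheme, userinfo + at + hostport, path, query, fragment))


if __name__ == "__main__":
    print(clean_url_sketch("http://Example.COM:/a%7Eb%2fc"))
```

Normalising URLs this way before storing them means that two spellings of the same address map to one entry, which is presumably why the log also mentions adding links to the crawl list only if they are not there already.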