author    Arthur de Jong <arthur@arthurdejong.org>  2006-01-30 17:27:40 +0100
committer Arthur de Jong <arthur@arthurdejong.org>  2006-01-30 17:27:40 +0100
commit    1ca76b07d03be977f894e724bc5153a2d18d956d
tree      e19eb8c02afc7cf54beed546afe3f4c72db89865
parent    90d6025f1e2b10391a8abd56dc78b3aa76ae0112

get files ready for 1.9.6 release [1.9.6]

git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@225 86f53f14-5ff3-0310-afe5-9b438ce3f40c
-rw-r--r--  ChangeLog          71
-rw-r--r--  NEWS              101
-rw-r--r--  README              2
-rw-r--r--  TODO               19
-rw-r--r--  config.py           4
-rw-r--r--  debian/changelog   18
-rw-r--r--  debian/copyright    2
-rw-r--r--  webcheck.1          6
-rwxr-xr-x  webcheck.py         6
9 files changed, 165 insertions, 64 deletions
diff --git a/ChangeLog b/ChangeLog
index f1d7ce9..851bbfb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,74 @@
+2006-01-29 22:39 arthur
+
+ * [r224] crawler.py: bugfix in matching url encoding
+
+2006-01-29 21:24 arthur
+
+ * [r223] crawler.py: actually decode urlencoded character as hex
+ not decimal
+
+2006-01-29 20:50 arthur
+
+ * [r222] fancytooltips/fancytooltips.js: html escape all content
+ that is retreived from attributes
+
+2006-01-29 20:48 arthur
+
+ * [r221] crawler.py, parsers/html.py: make sure all urls are
+ consistently url encoded where it counts
+
+2006-01-29 20:15 arthur
+
+ * [r220] schemes/http.py: add some more debugging information
+ (cache hit or miss)
+
+2006-01-29 20:14 arthur
+
+ * [r219] plugins/about.py: update copyright notice and indicate
+ that we're using gpl2+
+
+2006-01-25 23:16 arthur
+
+ * [r218] parsers/html.py: fix typo (thanks Andrew Kim
+ <Andrew.Kim@revolution.com>)
+
+2006-01-19 21:38 arthur
+
+ * [r217] plugins/__init__.py: ignore errors when converting to
+ unicode string and uses system encoding instead of utf-8 as
+ default
+
+2006-01-19 21:35 arthur
+
+ * [r216] plugins/__init__.py: also escape the url when generating
+ links
+
+2006-01-19 20:46 arthur
+
+ * [r215] plugins/__init__.py: explictly convert strings to unicode
+ to avoid potential problems with non-ascii charaters in strings
+
+2006-01-19 20:45 arthur
+
+ * [r214] parsers/html.py: quote links so that they do not contain
+ any non-ascii characters to avoid problems later on (and add
+ some more debugging)
+
+2006-01-19 20:32 arthur
+
+ * [r213] crawler.py: fix debug message to print url instead of
+ object reference
+
+2006-01-15 08:44 arthur
+
+ * [r212] crawler.py: give some more debugging info while following
+ base urls and no longer delete unreferenced followed links
+
+2005-12-30 22:33 arthur
+
+ * [r210] ChangeLog, NEWS, TODO, config.py, debian/changelog,
+ webcheck.1: get files ready for 1.9.5 release
+
2005-12-30 22:09 arthur
* [r209] crawler.py: fix copy-pasto from r204
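
Note on r223 in the ChangeLog hunk above ("actually decode urlencoded character as hex not decimal"): the two digits after a "%" in a URL are hexadecimal, so "%41" means "A" (0x41), not ")" (decimal 41). The sketch below only illustrates that distinction. It is not the code from crawler.py, which is not part of this commit; the helper name decode_percent_escapes is hypothetical, and it maps each escape to a single code point without doing any multi-byte decoding.

```python
import re

# Matches one percent escape: "%" followed by exactly two hex digits.
_PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_percent_escapes(text):
    """Replace each %XX escape with the character whose code is XX in base 16.

    Hypothetical helper for illustration only; it treats every escape as a
    single code point and does not handle multi-byte encodings.
    """
    return _PERCENT_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Reading the digits as decimal gives a different (wrong) character:
# int("41", 16) is 65 ("A"), while int("41", 10) is 41 (")").
```

For real decoding, the standard library's urllib.parse.unquote is likely the better choice, since it also handles character encodings.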
diff --git a/NEWS b/NEWS
index ec72cd0..a45f688 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,17 @@
+changes from 1.9.5 to 1.9.6
+---------------------------
+
+* SECURITY FIX: a cross-site scripting vulnerability with content in the
+ tooltips of generated report was fixed by properly escaping
+ all output
+* urls are now url encoded into a consistent form, solving some problems with
+ urls with non-ascii characters
+* no longer remove unreferenced redirects
+* more debugging info in debug mode
+* more fixes for escaping in generated reports and more support for sites in
+ different character sets
+
+
changes from 1.9.4 to 1.9.5
---------------------------
@@ -10,7 +24,8 @@ changes from 1.9.4 to 1.9.5
* documentation improvements
* several bugfixes to get webcheck more robust
* included fancytooltips by Victor Kulinski to have nicer tooltips
-* generated reports now have friendlier messages for when there is nothing to report
+* generated reports now have friendlier messages for when there is nothing to
+ report
* there is a Debian package
@@ -48,7 +63,7 @@ changes from 1.9.2 to 1.9.3
changes from 1.9.1 to 1.9.2
---------------------------
-* complete reimiplementation of the html and http modules
+* complete reimplementation of the html and http modules
* added https support
* some spelling and typo fixes contributed by several people
* site map now does a proper breadth first traversal of the site structure
@@ -92,25 +107,25 @@ b Fixed bug when server redirects to a document in robots.txt (does not show
up as broken (hopefully))
+ Filename mangling in filelink.py to help OS/2 (and Win32) (Patch submitted
by Steffen Siebert)
-+ Added WARN_OLD_VERSION config.py option. If this option is set to true
- (the default) Linbot will check it's version number and the version
- numbers of it's plugins against a global registry on the Net. If it
- finds that a version is not the latest, it will print a warning on the
- reports along with a link you can follow to download the latest version.
- I think it's neat. You might find it annoying.
++ Added WARN_OLD_VERSION config.py option. If this option is set to true (the
+ default) Linbot will check it's version number and the version numbers of
+ it's plugins against a global registry on the Net. If it finds that a
+ version is not the latest, it will print a warning on the reports along with
+ a link you can follow to download the latest version. I think it's neat. You
+ might find it annoying.
+ Added preliminary support for authenticating proxies, though it does not
work correctly yet.
+ Added -r (redirect depth) and REDIRECT_DEPTH option in config.py to indicate
- the amount of redirects Linbot should follow when following a link. Thanks
+ the amount of redirects Linbot should follow when following a link. Thanks
to Andrea Glorioso for the patch.
+ Added debugio module that handles debugging and I/O
-+ Added -q (quiet option). Use it to suppress output
++ Added -q (quiet option). Use it to suppress output
+ Added -d (debug) option and DEBUG_LEVEL variable in config.py for debugging
+ added version module and removed __version__ and __author__ from all the
modules (except plugins).
-b Fixed bug in Linbot using putrequest() instead of putheader() when requesting
- header information. Thanks to Andrea Glorioso for
- fixing this glitch (and Seth Chaiklin for noticing).
+b Fixed bug in Linbot using putrequest() instead of putheader() when
+ requesting header information. Thanks to Andrea Glorioso for fixing this
+ glitch (and Seth Chaiklin for noticing).
changes from 1.0b8 to 1.0b9
@@ -118,27 +133,27 @@ changes from 1.0b8 to 1.0b9
+ If you use the -o command-line option or the OUTPUT_DIR config file option
and the directory does not exist, linbot will create it for you (provided
- that it has the correct permissions, etc.) Thanks to Andrea Glorioso
- for this feature.
-+ Added a CREDITS file and probably left a lot of people out. If you think
- you should be in it let me know.
+ that it has the correct permissions, etc.). Thanks to Andrea Glorioso for
+ this feature.
++ Added a CREDITS file and probably left a lot of people out. If you think you
+ should be in it let me know.
b Linbot will now report to the server that it can accept any MIME type (found
- in mimetypes.py. This should fix the "406: No acceptable objects found"
+ in mimetypes.py. This should fix the "406: No acceptable objects found"
error that some servers report.
-b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests
- as well as GET requests.
+b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests as
+ well as GET requests.
changes from 1.0b6 to 1.0b8
---------------------------
-b Fixed bug when no images are reported for documents having 0 links
- If you don't know what this means it probably wasn't a problem for you.
+b Fixed bug when no images are reported for documents having 0 links If you
+ don't know what this means it probably wasn't a problem for you.
b Fixed code that was messing with arguments passed via -x and -y and caused
unexpected results and/or errors.
b -b flag should work this time (for real)
b Cosmetic changes (reports didn't look the way I thought they should in IE4.
- (and may not still as I havent' had a chance to check it yet)
+ (and may not still as I haven't had a chance to check it yet)
b Linbot won't follow infinite redirects (currently hardcoded to max of 5
redirects per document)
@@ -147,13 +162,13 @@ changes from 1.0b5 to 1.0b6
---------------------------
+ Minor change in ftplink.py should allow better ftp link checking
-+ You can now press CTRL-C (or whatever your operating system supports) to break
- out of a linbot run. However, the work linbot does is not saved (yet).
-b Fixed problem when server redirects a URL to itself. This fix seems to work
- for most servers I've tried but there are a few more out there that I need to
- take a look at.
++ You can now press CTRL-C (or whatever your operating system supports) to
+ break out of a linbot run. However, the work linbot does is not saved (yet).
+b Fixed problem when server redirects a URL to itself. This fix seems to work
+ for most servers I've tried but there are a few more out there that I need
+ to take a look at.
b Fixed bug that caused linbot to not check for yanked URLs
-+ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
++ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
to an image to be used as the report's logo.
b "patched" strings.py so that it can better parse html files created in
Windows/DOS (I think).
@@ -161,27 +176,27 @@ b "patched" strings.py so that it can better parse html files created in
+ httplink does not HEAD a redirected URL if it is already in the link list
(performance improvement)
- Removed LOGO_ALT from config.py
-+ Changed my email address to marduk@python.net. The official home page of
- Linbot will probaby also change with the next release so stay tuned.
++ Changed my email address to marduk@python.net. The official home page of
+ Linbot will probably also change with the next release so stay tuned.
changes from 1.0b4 to 1.0b5
---------------------------
-+ Added a contrib directory. Right now it just contains the about plugin. Other
- plugins will be included if people contribute them. Also, the man page will
- return once I have updated it. Those ugly buttons are obsolete.
-+ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
++ Added a contrib directory. Right now it just contains the about plugin.
+ Other plugins will be included if people contribute them. Also, the man page
+ will return once I have updated it. Those ugly buttons are obsolete.
++ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
of Netscape browsers (so I hear) and 2) I don't have to document to put
linbot.css in the output directory since it grabs it from starship 8*)
-b Handling of error for when robots.txt cannot be retreived.
+b Handling of error for when robots.txt cannot be retrieved.
+ Malformed urls are trapped (sorry, I had that commented out)
-b FTP link handling is totally rewritten. Fortunately it shouldn't crash anymore
- Unfortunately it doesn't really work reliably and probably never will. See
- README.ftp for details.
+b FTP link handling is totally rewritten. Fortunately it shouldn't crash
+ anymore Unfortunately it doesn't really work reliably and probably never
+ will. See README.ftp for details.
b Two bugs in HTTP proxy handling made it almost completely unusable, though
conveniently seemed to cancel each other out when I was testing.
-b Too many files error on large sites should be fixed. Thanks to Andrew Kuchling
- et al for suggestions.
-b Bug when some servers erroneously report (or don't report) Content-Length header
- fixed.
+b Too many files error on large sites should be fixed. Thanks to Andrew
+ Kuchling et al for suggestions.
+b Bug when some servers erroneously report (or don't report) Content-Length
+ header fixed.
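
Note on the SECURITY FIX entry in the NEWS hunk above: the cross-site scripting problem is described as being fixed "by properly escaping all output". According to the ChangeLog, the actual change was made in fancytooltips/fancytooltips.js and plugins/__init__.py, neither of which appears in this diff. The following is therefore only a rough sketch of the general idea in Python, under the assumption that report links are built by string formatting; the function name render_link is hypothetical.

```python
import html


def render_link(url, title):
    """Build an <a> element with every interpolated value HTML-escaped.

    Hypothetical helper for illustration only. html.escape() with quote=True
    also escapes double and single quotes, which matters inside attributes.
    """
    return '<a href="{}" title="{}">{}</a>'.format(
        html.escape(url, quote=True),
        html.escape(title, quote=True),
        html.escape(url),
    )
```

With escaping applied to both the attribute values and the link text, a crawled page title such as '"><script>...' ends up as inert text in the report instead of markup.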
diff --git a/README b/README
index 2918b6a..bde64ec 100644
--- a/README
+++ b/README
@@ -11,7 +11,7 @@
Copyright (C) 1998, 1999 Albert Hopkins (marduk)
Copyright (C) 2002 Mike W. Meyer
- Copyright (C) 2005 Arthur de Jong
+ Copyright (C) 2005, 2006 Arthur de Jong
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/TODO b/TODO
index ee55dc2..7f74d7e 100644
--- a/TODO
+++ b/TODO
@@ -1,6 +1,9 @@
before next release
-------------------
-* go over all FIXMEs in code
+* go over all FIXMEs in code (ftp)
+* check that sleep acutally sleeps for the advertised time
+* follow redirects (to a limit) of external sites
+* check that scheme names are clean so that we do not import strange python modules
probably before 2.0 release
---------------------------
@@ -11,12 +14,10 @@ probably before 2.0 release
* figure out if we need parents and pageparents
* make configurable time-out when retrieving a document
* support for mult-threading (use -t, --threads as option)
-* clean up printing of messages, especially needed for multi-threading
* implement a fix for redirecting stdout and stderr to work properly
* implement a maximum transfer size for downloading files and things over http
* support ftp proxies
* support proxying https traffic
-* check that scheme names are clean so that we do not import strange python modules
wishlist
--------
@@ -25,16 +26,13 @@ wishlist
* new config file format (if we want a configfile at all)
* cookies support (maybe)
* integration with weblint
-* maybe combine with a logfile checker to also show number of hits per link
* do form checking of crawled pages
* do spelling checking of crawled pages
* test w3c conformance of pages (already done a little)
-* maybe make broken links not clickable in report (configurable?)
* maybe store crawled site's data in some format for later processing or continuing after interruption
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
-* add a favicon to reports
-* follow redirects of external links
+* add a favicon to report
* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
* maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
@@ -51,6 +49,9 @@ wishlist
* maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
* maybe add custom bullets in problem lists, depending on problem type
* maybe make -b the default
-* maybe add copyright notice in generated files (that we don't claim copyright)
-* make a test site or test framework (this is a lot of work)
* prompt for authentication (detecting realms)
+* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
+* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
+* give a warning when no encoding is specified, an error if non-ascii characters are used
+* maybe give a warning for urls that have non-ascii characters
+* maybe fetch and store desription and other meta information about page (keywords) (just like author)
diff --git a/config.py b/config.py
index 93e5f50..409788f 100644
--- a/config.py
+++ b/config.py
@@ -3,7 +3,7 @@
#
# Copyright (C) 1998, 1999 Albert Hopkins (marduk)
# Copyright (C) 2002 Mike Meyer
-# Copyright (C) 2005 Arthur de Jong
+# Copyright (C) 2005, 2006 Arthur de Jong
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
import urllib
# Current version of webcheck.
-VERSION = "1.9.5"
+VERSION = "1.9.6"
# The homepage of webcheck.
HOMEPAGE = "http://ch.tudelft.nl/~arthur/webcheck/"
diff --git a/debian/changelog b/debian/changelog
index 6623fd1..7db3ea8 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,15 +1,29 @@
+webcheck (1.9.6) unstable; urgency=low
+
+ * SECURITY FIX: a cross-site scripting vulnerability with content in the
+ tooltips of generated report was fixed by properly escaping
+ all output
+ * urls are now url encoded into a consistent form, solving some problems
+ with urls with non-ascii characters
+ * no longer remove unreferenced redirects
+ * more debugging info in debug mode
+ * more fixes for escaping in generated reports and more support for sites
+ in different character sets
+
+ -- Arthur de Jong <adejong@debian.org> Mon, 30 Jan 2006 17:00:00 +0100
+
webcheck (1.9.5) unstable; urgency=low
* initial (re)release of webcheck Debian package (closes: #326429)
* this release should fix all open bugs on the time of the former webcheck
page removal, except wishlist bugs #71419 (i18n) and #271085 (html
validation), anyone interested can refile those bugs (preferably with
- patches or some pointers on how to implemement the changes)
+ patches or some pointers on how to implement the changes)
* /etc/webcheck is completely removed on upgrades since a site wide
configuration file is no longer supported (webcheck is a user level tool
that should not be configured site wide)
- -- Arthur de Jong <adejong@debian.org> Fre, 30 Dec 2005 23:00:00 +0100
+ -- Arthur de Jong <adejong@debian.org> Fri, 30 Dec 2005 23:00:00 +0100
webcheck (1.0-10) unstable; urgency=low
diff --git a/debian/copyright b/debian/copyright
index 2f6465a..e82120b 100644
--- a/debian/copyright
+++ b/debian/copyright
@@ -20,7 +20,7 @@ LeJacq <jplejacq@quoininc.com>.
Copyright (C) 1998, 1999 Albert Hopkins (marduk)
Copyright (C) 2002 Mike W. Meyer
-Copyright (C) 2005 Arthur de Jong
+Copyright (C) 2005, 2006 Arthur de Jong
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/webcheck.1 b/webcheck.1
index 536ac3d..a6978fc 100644
--- a/webcheck.1
+++ b/webcheck.1
@@ -1,4 +1,4 @@
-.\" Copyright (C) 2005 Arthur de Jong
+.\" Copyright (C) 2005, 2006 Arthur de Jong
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
@@ -15,7 +15,7 @@
.\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.\" .nh
.\"
-.TH "webcheck" "1" "Dec 2005" "Version 1.9.5" "User Commands"
+.TH "webcheck" "1" "Jan 2006" "Version 1.9.6" "User Commands"
.nh
.SH "NAME"
webcheck \- website link checker
@@ -181,7 +181,7 @@ Copyright \(co 1998, 1999 Albert Hopkins (marduk)
.br
Copyright \(co 2002 Mike W. Meyer
.br
-Copyright \(co 2005 Arthur de Jong
+Copyright \(co 2005, 2006 Arthur de Jong
.br
webcheck is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
diff --git a/webcheck.py b/webcheck.py
index 8930805..2c57326 100755
--- a/webcheck.py
+++ b/webcheck.py
@@ -4,7 +4,7 @@
#
# Copyright (C) 1998, 1999 Albert Hopkins (marduk)
# Copyright (C) 2002 Mike W. Meyer
-# Copyright (C) 2005 Arthur de Jong
+# Copyright (C) 2005, 2006 Arthur de Jong
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -39,8 +39,8 @@ def print_version():
"webcheck "+config.VERSION+"\n" \
"Written by Albert Hopkins (marduk), Mike W. Meyer and Arthur de Jong.\n" \
"\n" \
- "Copyright (C) 1998, 1999, 2002, 2005 Albert Hopkins (marduk), Mike W. Meyer\n" \
- "and Arthur de Jong.\n" \
+ "Copyright (C) 1998, 1999, 2002, 2005, 2006 Albert Hopkins (marduk),\n" \
+ "Mike W. Meyer and Arthur de Jong.\n" \
"This is free software; see the source for copying conditions. There is NO\n" \
"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."