author | Arthur de Jong <arthur@arthurdejong.org> | 2006-01-30 17:27:40 +0100 |
---|---|---|
committer | Arthur de Jong <arthur@arthurdejong.org> | 2006-01-30 17:27:40 +0100 |
commit | 1ca76b07d03be977f894e724bc5153a2d18d956d (patch) | |
tree | e19eb8c02afc7cf54beed546afe3f4c72db89865 | |
parent | 90d6025f1e2b10391a8abd56dc78b3aa76ae0112 (diff) |
get files ready for 1.9.6 release (ref: 1.9.6)
git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@225 86f53f14-5ff3-0310-afe5-9b438ce3f40c
-rw-r--r-- | ChangeLog | 71 |
-rw-r--r-- | NEWS | 101 |
-rw-r--r-- | README | 2 |
-rw-r--r-- | TODO | 19 |
-rw-r--r-- | config.py | 4 |
-rw-r--r-- | debian/changelog | 18 |
-rw-r--r-- | debian/copyright | 2 |
-rw-r--r-- | webcheck.1 | 6 |
-rwxr-xr-x | webcheck.py | 6 |
9 files changed, 165 insertions, 64 deletions
@@ -1,3 +1,74 @@
+2006-01-29 22:39 arthur
+
+ * [r224] crawler.py: bugfix in matching url encoding
+
+2006-01-29 21:24 arthur
+
+ * [r223] crawler.py: actually decode urlencoded character as hex
+   not decimal
+
+2006-01-29 20:50 arthur
+
+ * [r222] fancytooltips/fancytooltips.js: html escape all content
+   that is retreived from attributes
+
+2006-01-29 20:48 arthur
+
+ * [r221] crawler.py, parsers/html.py: make sure all urls are
+   consistently url encoded where it counts
+
+2006-01-29 20:15 arthur
+
+ * [r220] schemes/http.py: add some more debugging information
+   (cache hit or miss)
+
+2006-01-29 20:14 arthur
+
+ * [r219] plugins/about.py: update copyright notice and indicate
+   that we're using gpl2+
+
+2006-01-25 23:16 arthur
+
+ * [r218] parsers/html.py: fix typo (thanks Andrew Kim
+   <Andrew.Kim@revolution.com>)
+
+2006-01-19 21:38 arthur
+
+ * [r217] plugins/__init__.py: ignore errors when converting to
+   unicode string and uses system encoding instead of utf-8 as
+   default
+
+2006-01-19 21:35 arthur
+
+ * [r216] plugins/__init__.py: also escape the url when generating
+   links
+
+2006-01-19 20:46 arthur
+
+ * [r215] plugins/__init__.py: explictly convert strings to unicode
+   to avoid potential problems with non-ascii charaters in strings
+
+2006-01-19 20:45 arthur
+
+ * [r214] parsers/html.py: quote links so that they do not contain
+   any non-ascii characters to avoid problems later on (and add
+   some more debugging)
+
+2006-01-19 20:32 arthur
+
+ * [r213] crawler.py: fix debug message to print url instead of
+   object reference
+
+2006-01-15 08:44 arthur
+
+ * [r212] crawler.py: give some more debugging info while following
+   base urls and no longer delete unreferenced followed links
+
+2005-12-30 22:33 arthur
+
+ * [r210] ChangeLog, NEWS, TODO, config.py, debian/changelog,
+   webcheck.1: get files ready for 1.9.5 release
+
 2005-12-30 22:09 arthur
 
  * [r209] crawler.py: fix copy-pasto from r204
@@ -1,3 +1,17 @@
+changes from 1.9.5 to 1.9.6
+---------------------------
+
+* SECURITY FIX: a cross-site scripting vulnerability with content in the
+  tooltips of generated report was fixed by properly escaping
+  all output
+* urls are now url encoded into a consistent form, solving some problems with
+  urls with non-ascii characters
+* no longer remove unreferenced redirects
+* more debugging info in debug mode
+* more fixes for escaping in generated reports and more support for sites in
+  different character sets
+
+
 changes from 1.9.4 to 1.9.5
 ---------------------------
 
@@ -10,7 +24,8 @@ changes from 1.9.4 to 1.9.5
 * documentation improvements
 * several bugfixes to get webcheck more robust
 * included fancytooltips by Victor Kulinski to have nicer tooltips
-* generated reports now have friendlier messages for when there is nothing to report
+* generated reports now have friendlier messages for when there is nothing to
+  report
 * there is a Debian package
 
 
@@ -48,7 +63,7 @@ changes from 1.9.2 to 1.9.3
 
 changes from 1.9.1 to 1.9.2
 ---------------------------
-* complete reimiplementation of the html and http modules
+* complete reimplementation of the html and http modules
 * added https support
 * some spelling and typo fixes contributed by several people
 * site map now does a proper breadth first traversal of the site structure
@@ -92,25 +107,25 @@ b Fixed bug when server redirects to a document in robots.txt (does not show
   up as broken (hopefully))
 + Filename mangling in filelink.py to help OS/2 (and Win32)
   (Patch submitted by Steffen Siebert)
-+ Added WARN_OLD_VERSION config.py option. If this option is set to true
-  (the default) Linbot will check it's version number and the version
-  numbers of it's plugins against a global registry on the Net. If it
-  finds that a version is not the latest, it will print a warning on the
-  reports along with a link you can follow to download the latest version.
-  I think it's neat. You might find it annoying.
++ Added WARN_OLD_VERSION config.py option. If this option is set to true (the
+  default) Linbot will check it's version number and the version numbers of
+  it's plugins against a global registry on the Net. If it finds that a
+  version is not the latest, it will print a warning on the reports along with
+  a link you can follow to download the latest version. I think it's neat. You
+  might find it annoying.
 + Added preliminary support for authenticating proxies, though it does not
   work correctly yet.
 + Added -r (redirect depth) and REDIRECT_DEPTH option in config.py to indicate
-  the amount of redirects Linbot should follow when following a link. Thanks
+  the amount of redirects Linbot should follow when following a link. Thanks
   to Andrea Glorioso for the patch.
 + Added debugio module that handles debugging and I/O
-+ Added -q (quiet option). Use it to suppress output
++ Added -q (quiet option). Use it to suppress output
 + Added -d (debug) option and DEBUG_LEVEL variable in config.py for debugging
 + added version module and removed __version__ and __author__ from all the
   modules (except plugins).
-b Fixed bug in Linbot using putrequest() instead of putheader() when requesting
-  header information. Thanks to Andrea Glorioso for
-  fixing this glitch (and Seth Chaiklin for noticing).
+b Fixed bug in Linbot using putrequest() instead of putheader() when
+  requesting header information. Thanks to Andrea Glorioso for fixing this
+  glitch (and Seth Chaiklin for noticing).
 
 
 changes from 1.0b8 to 1.0b9
@@ -118,27 +133,27 @@ changes from 1.0b8 to 1.0b9
 
 + If you use the -o command-line option or the OUTPUT_DIR config file option
   and the directory does not exist, linbot will create it for you (provided
-  that it has the correct permissions, etc.) Thanks to Andrea Glorioso
-  for this feature.
-+ Added a CREDITS file and probably left a lot of people out. If you think
-  you should be in it let me know.
+  that it has the correct permissions, etc.). Thanks to Andrea Glorioso for
+  this feature.
++ Added a CREDITS file and probably left a lot of people out. If you think you
+  should be in it let me know.
 b Linbot will now report to the server that it can accept any MIME type (found
-  in mimetypes.py. This should fix the "406: No acceptable objects found"
+  in mimetypes.py. This should fix the "406: No acceptable objects found"
   error that some servers report.
-b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests
-  as well as GET requests.
+b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests as
+  well as GET requests.
 
 
 changes from 1.0b6 to 1.0b8
 ---------------------------
 
-b Fixed bug when no images are reported for documents having 0 links
-  If you don't know what this means it probably wasn't a problem for you.
+b Fixed bug when no images are reported for documents having 0 links If you
+  don't know what this means it probably wasn't a problem for you.
 b Fixed code that was messing with arguments passed via -x and -y and caused
   unexpected results and/or errors.
 b -b flag should work this time (for real)
 b Cosmetic changes (reports didn't look the way I thought they should in IE4.
-  (and may not still as I havent' had a chance to check it yet)
+  (and may not still as I haven't had a chance to check it yet)
 b Linbot won't follow infinite redirects (currently hardcoded to max of 5
   redirects per document)
 
@@ -147,13 +162,13 @@ changes from 1.0b5 to 1.0b6
 ---------------------------
 
 + Minor change in ftplink.py should allow better ftp link checking
-+ You can now press CTRL-C (or whatever your operating system supports) to break
-  out of a linbot run. However, the work linbot does is not saved (yet).
-b Fixed problem when server redirects a URL to itself. This fix seems to work
-  for most servers I've tried but there are a few more out there that I need to
-  take a look at.
++ You can now press CTRL-C (or whatever your operating system supports) to
+  break out of a linbot run. However, the work linbot does is not saved (yet).
+b Fixed problem when server redirects a URL to itself. This fix seems to work
+  for most servers I've tried but there are a few more out there that I need
+  to take a look at.
 b Fixed bug that caused linbot to not check for yanked URLs
-+ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
++ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
   to an image to be used as the report's logo.
 b "patched" strings.py so that it can better parse html files created in
   Windows/DOS (I think).
@@ -161,27 +176,27 @@ b "patched" strings.py so that it can better parse html files created in
 + httplink does not HEAD a redirected URL if it is already in the link list
   (performance improvement)
 - Removed LOGO_ALT from config.py
-+ Changed my email address to marduk@python.net. The official home page of
-  Linbot will probaby also change with the next release so stay tuned.
++ Changed my email address to marduk@python.net. The official home page of
+  Linbot will probably also change with the next release so stay tuned.
 
 
 changes from 1.0b4 to 1.0b5
 ---------------------------
 
-+ Added a contrib directory. Right now it just contains the about plugin. Other
-  plugins will be included if people contribute them. Also, the man page will
-  return once I have updated it. Those ugly buttons are obsolete.
-+ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
++ Added a contrib directory. Right now it just contains the about plugin.
+  Other plugins will be included if people contribute them. Also, the man page
+  will return once I have updated it. Those ugly buttons are obsolete.
++ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
   of Netscape browsers (so I hear) and 2) I don't have to document to put
   linbot.css in the output directory since it grabs it from starship 8*)
-b Handling of error for when robots.txt cannot be retreived.
+b Handling of error for when robots.txt cannot be retrieved.
 + Malformed urls are trapped (sorry, I had that commented out)
-b FTP link handling is totally rewritten. Fortunately it shouldn't crash anymore
-  Unfortunately it doesn't really work reliably and probably never will. See
-  README.ftp for details.
+b FTP link handling is totally rewritten. Fortunately it shouldn't crash
+  anymore Unfortunately it doesn't really work reliably and probably never
+  will. See README.ftp for details.
 b Two bugs in HTTP proxy handling made it almost completely unusable, though
   conveniently seemed to cancel each other out when I was testing.
-b Too many files error on large sites should be fixed. Thanks to Andrew Kuchling
-  et al for suggestions.
-b Bug when some servers erroneously report (or don't report) Content-Length header
-  fixed.
+b Too many files error on large sites should be fixed. Thanks to Andrew
+  Kuchling et al for suggestions.
+b Bug when some servers erroneously report (or don't report) Content-Length
+  header fixed.
@@ -11,7 +11,7 @@
 
    Copyright (C) 1998, 1999 Albert Hopkins (marduk)
    Copyright (C) 2002 Mike W. Meyer
-   Copyright (C) 2005 Arthur de Jong
+   Copyright (C) 2005, 2006 Arthur de Jong
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -1,6 +1,9 @@
 before next release
 -------------------
-* go over all FIXMEs in code
+* go over all FIXMEs in code (ftp)
+* check that sleep acutally sleeps for the advertised time
+* follow redirects (to a limit) of external sites
+* check that scheme names are clean so that we do not import strange python modules
 
 probably before 2.0 release
 ---------------------------
@@ -11,12 +14,10 @@ probably before 2.0 release
 * figure out if we need parents and pageparents
 * make configurable time-out when retrieving a document
 * support for mult-threading (use -t, --threads as option)
-* clean up printing of messages, especially needed for multi-threading
 * implement a fix for redirecting stdout and stderr to work properly
 * implement a maximum transfer size for downloading files and things over http
 * support ftp proxies
 * support proxying https traffic
-* check that scheme names are clean so that we do not import strange python modules
 
 wishlist
 --------
@@ -25,16 +26,13 @@ wishlist
 * new config file format (if we want a configfile at all)
 * cookies support (maybe)
 * integration with weblint
-* maybe combine with a logfile checker to also show number of hits per link
 * do form checking of crawled pages
 * do spelling checking of crawled pages
 * test w3c conformance of pages (already done a little)
-* maybe make broken links not clickable in report (configurable?)
 * maybe store crawled site's data in some format for later processing or continuing after interruption
 * add support for fetching gzipped content to improve performance
 * maybe do http pipelining
-* add a favicon to reports
-* follow redirects of external links
+* add a favicon to report
 * make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
 * maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
 * maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
@@ -51,6 +49,9 @@ wishlist
 * maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
 * maybe add custom bullets in problem lists, depending on problem type
 * maybe make -b the default
-* maybe add copyright notice in generated files (that we don't claim copyright)
-* make a test site or test framework (this is a lot of work)
 * prompt for authentication (detecting realms)
+* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
+* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
+* give a warning when no encoding is specified, an error if non-ascii characters are used
+* maybe give a warning for urls that have non-ascii characters
+* maybe fetch and store desription and other meta information about page (keywords) (just like author)
@@ -3,7 +3,7 @@
 #
 # Copyright (C) 1998, 1999 Albert Hopkins (marduk)
 # Copyright (C) 2002 Mike Meyer
-# Copyright (C) 2005 Arthur de Jong
+# Copyright (C) 2005, 2006 Arthur de Jong
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
 import urllib
 
 # Current version of webcheck.
-VERSION = "1.9.5"
+VERSION = "1.9.6"
 
 # The homepage of webcheck.
 HOMEPAGE = "http://ch.tudelft.nl/~arthur/webcheck/"
diff --git a/debian/changelog b/debian/changelog
index 6623fd1..7db3ea8 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,15 +1,29 @@
+webcheck (1.9.6) unstable; urgency=low
+
+  * SECURITY FIX: a cross-site scripting vulnerability with content in the
+    tooltips of generated report was fixed by properly escaping
+    all output
+  * urls are now url encoded into a consistent form, solving some problems
+    with urls with non-ascii characters
+  * no longer remove unreferenced redirects
+  * more debugging info in debug mode
+  * more fixes for escaping in generated reports and more support for sites
+    in different character sets
+
+ -- Arthur de Jong <adejong@debian.org> Mon, 30 Jan 2006 17:00:00 +0100
+
 webcheck (1.9.5) unstable; urgency=low
 
   * initial (re)release of webcheck Debian package (closes: #326429)
   * this release should fix all open bugs on the time of the former webcheck
     page removal, except wishlist bugs #71419 (i18n) and #271085 (html
     validation), anyone interested can refile those bugs (preferably with
-    patches or some pointers on how to implemement the changes)
+    patches or some pointers on how to implement the changes)
   * /etc/webcheck is completely removed on upgrades since a site wide
     configuration file is no longer supported (webcheck is a user level tool
     that should not be configured site wide)
 
- -- Arthur de Jong <adejong@debian.org> Fre, 30 Dec 2005 23:00:00 +0100
+ -- Arthur de Jong <adejong@debian.org> Fri, 30 Dec 2005 23:00:00 +0100
 
 webcheck (1.0-10) unstable; urgency=low
 
diff --git a/debian/copyright b/debian/copyright
index 2f6465a..e82120b 100644
--- a/debian/copyright
+++ b/debian/copyright
@@ -20,7 +20,7 @@ LeJacq <jplejacq@quoininc.com>.
 
 Copyright (C) 1998, 1999 Albert Hopkins (marduk)
 Copyright (C) 2002 Mike W. Meyer
-Copyright (C) 2005 Arthur de Jong
+Copyright (C) 2005, 2006 Arthur de Jong
 
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -1,4 +1,4 @@
-.\" Copyright (C) 2005 Arthur de Jong
+.\" Copyright (C) 2005, 2006 Arthur de Jong
 .\"
 .\" This program is free software; you can redistribute it and/or modify
 .\" it under the terms of the GNU General Public License as published by
@@ -15,7 +15,7 @@
 .\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 .\" .nh
 .\"
-.TH "webcheck" "1" "Dec 2005" "Version 1.9.5" "User Commands"
+.TH "webcheck" "1" "Jan 2006" "Version 1.9.6" "User Commands"
 .nh
 .SH "NAME"
 webcheck \- website link checker
@@ -181,7 +181,7 @@ Copyright \(co 1998, 1999 Albert Hopkins (marduk)
 .br
 Copyright \(co 2002 Mike W. Meyer
 .br
-Copyright \(co 2005 Arthur de Jong
+Copyright \(co 2005, 2006 Arthur de Jong
 .br
 webcheck is free software; see the source for copying conditions. There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
diff --git a/webcheck.py b/webcheck.py
index 8930805..2c57326 100755
--- a/webcheck.py
+++ b/webcheck.py
@@ -4,7 +4,7 @@
 #
 # Copyright (C) 1998, 1999 Albert Hopkins (marduk)
 # Copyright (C) 2002 Mike W. Meyer
-# Copyright (C) 2005 Arthur de Jong
+# Copyright (C) 2005, 2006 Arthur de Jong
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -39,8 +39,8 @@ def print_version():
 "webcheck "+config.VERSION+"\n" \
 "Written by Albert Hopkins (marduk), Mike W. Meyer and Arthur de Jong.\n" \
 "\n" \
- "Copyright (C) 1998, 1999, 2002, 2005 Albert Hopkins (marduk), Mike W. Meyer\n" \
- "and Arthur de Jong.\n" \
+ "Copyright (C) 1998, 1999, 2002, 2005, 2006 Albert Hopkins (marduk),\n" \
+ "Mike W. Meyer and Arthur de Jong.\n" \
 "This is free software; see the source for copying conditions. There is NO\n" \
 "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."
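
For readers who want to see what the NEWS entries about url encoding and output escaping amount to, here is a minimal sketch in modern Python. It is not the code from crawler.py, parsers/html.py or fancytooltips.js (none of which appear in this diff); the helper names `normalize_url` and `tooltip_text`, the choice of unreserved characters and the safe set passed to `quote` are all assumptions made for illustration.

```python
import html
import re
from urllib.parse import quote

# Matches one percent-escape such as %7e or %2F.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

# Characters left literal in this sketch (an assumption: the RFC 3986
# "unreserved" set, not necessarily what webcheck uses).
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_url(url):
    """Hypothetical helper: bring the percent-escapes in url into one form."""

    def fix(match):
        # The two digits are hexadecimal: "%41" is "A" (65), not ")" (41).
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char
        # Anything else stays escaped, with upper-case hex digits.
        return "%{:02X}".format(ord(char))

    url = _ESCAPE.sub(fix, url)
    # Escape whatever is still outside ASCII (as UTF-8 bytes). "%" is kept
    # in the safe set so the escapes produced above are not encoded twice.
    return quote(url, safe="%/:=&?~#+!$,;'@()*[]")


def tooltip_text(value):
    """Hypothetical helper: escape attribute-derived text before output."""
    return html.escape(value, quote=True)
```

With these definitions, `%7e` and `~` end up as the same character and `%2f` and `%2F` as the same escape, so two spellings of one link compare equal, and markup taken from an attribute is rendered as text instead of being interpreted.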