-rw-r--r-- | ChangeLog        | 90
-rw-r--r-- | NEWS             | 49
-rw-r--r-- | README           |  8
-rw-r--r-- | TODO             |  6
-rw-r--r-- | config.py        |  2
-rw-r--r-- | debian/changelog | 15
-rw-r--r-- | webcheck.1       |  2
7 files changed, 147 insertions, 25 deletions
@@ -1,3 +1,93 @@
+2007-07-15 08:23  arthur
+
+	* [r351] crawler.py: output which parser module is used in debug
+	  mode
+
+2007-07-15 08:13  arthur
+
+	* [r350] ChangeLog: fix spelling in ChangeLog messages
+
+2007-07-15 07:56  arthur
+
+	* [r349] parsers/html/beautifulsoup.py: also handle http-equiv
+	  refresh meta header
+
+2007-07-15 07:27  arthur
+
+	* [r348] crawler.py: just ignore setting encoding to None
+
+2007-07-14 18:20  arthur
+
+	* [r347] crawler.py: fix printing of None encoding
+
+2007-07-14 18:18  arthur
+
+	* [r346] myurllib.py: simplify _normalize_escapes() function to
+	  improve performance
+
+2007-07-14 10:26  arthur
+
+	* [r345] myurllib.py: replace double slashes in file URL paths with
+	  single ones
+
+2007-07-13 18:48  arthur
+
+	* [r344] myurllib.py: add note about improving performance more
+
+2007-07-13 18:47  arthur
+
+	* [r343] crawler.py, plugins/__init__.py, plugins/sitemap.py,
+	  serialize.py: use sets instead of sequences for children,
+	  embedded, etc to improve deserialization performance with a
+	  factor 25 but now require python 2.4 of more recent
+
+2007-07-13 13:56  arthur
+
+	* [r342] serialize.py: give the matched URL a name to make code
+	  more readable
+
+2007-07-13 13:55  arthur
+
+	* [r341] serialize.py: be a little more verbose when raising
+	  parsing exceptions
+
+2007-07-13 13:50  arthur
+
+	* [r340] plugins/badlinks.py: get rid of unneeded sort
+
+2007-07-07 14:02  arthur
+
+	* [r339] crawler.py, myurllib.py, parsers/html/beautifulsoup.py,
+	  parsers/html/htmlparser.py: split out URL cleaning code into own
+	  module
+
+2007-07-07 13:54  arthur
+
+	* [r338] schemes/http.py: do not handle control-C and pass it along
+	  to the main exception handler and log http exceptions with a
+	  higher level
+
+2007-07-07 13:39  arthur
+
+	* [r337] debian/control: added XS-Vcs-Svn and XS-Vcs-Browser as
+	  specified in #391023
+
+2007-07-06 14:18  arthur
+
+	* [r336] crawler.py, serialize.py: improve deserialization and
+	  handling of Unicode strings
+
+2007-07-06 13:51  arthur
+
+	* [r335] plugins/problems.py, plugins/size.py: some extra
+	  precautions for handling Unicode data and correct HTML escaping
+
+2007-05-12 20:57  arthur
+
+	* [r333] ChangeLog, NEWS, README, TODO, config.py,
+	  debian/changelog, debian/copyright, webcheck.1: get files ready
+	  for 1.10.0 release
+
 2007-05-12 07:49  arthur
 
 	* [r332] crawler.py: also lower-case reqanchor
@@ -1,3 +1,17 @@
+changes from 1.10.0 to 1.10.1
+-----------------------------
+
+* some extra Unicode handling precautions
+* fix problem in reading webcheck.dat for non-ASCII text
+* be more verbose about HTTP retrieval failures
+* split out URL normalization code into own module and do some basic protocol-
+  specific normalizations
+* a number of big performance improvements
+* fix a bug in handling some zero-size pages
+* parse http-equiv meta HTML header to parse refresh option
+* webcheck now requires python 2.4 or more recent
+
+
 changes from 1.9.8 to 1.10.0
 ----------------------------
 
@@ -12,6 +26,7 @@ changes from 1.9.8 to 1.10.0
 * re-enable robots.txt parsing that was disabled in 1.9.8 and add an
   --ignore-robots option
 
+
 changes from 1.9.7 to 1.9.8
 ---------------------------
 
@@ -19,7 +34,7 @@ changes from 1.9.7 to 1.9.8
   added
 * added proper error handling for SSL related socket problems (exceptions are
   not a subclass of regular socket exceptions)
-* a bugfix for urls that contain a user name without a password or the other
+* a bug fix for URLs that contain a user name without a password or the other
   way around
 * miscellaneous small report improvements
 
@@ -30,12 +45,12 @@ changes from 1.9.6 to 1.9.7
 * site data is now stored to a file while crawling the site, this can be used
   to resume a crawl with the --continue option and for debugging purposes
 * implemented checking of link anchors
-* small improvements to generated reports (favicon included, css fix)
+* small improvements to generated reports (favicon included, CSS fix)
 * documentation improvements
 * properly handle float values for --wait
 * unreachable sites will time out faster
 * added support for plugins that don't output html
-* half a dozen other small bugfixes (stability fixes, code cleanups and
+* half a dozen other small bug fixes (stability fixes, code clean-ups and
   improvements)
 
 
@@ -45,8 +60,8 @@ changes from 1.9.5 to 1.9.6
 * SECURITY FIX: a cross-site scripting vulnerability with content in the
   tooltips of generated report was fixed by properly escaping all output
   (CVE-2006-1321)
-* urls are now url encoded into a consistent form, solving some problems with
-  urls with non-ascii characters
+* URLs are now URL encoded into a consistent form, solving some problems with
+  URLs with non-ASCII characters
 * no longer remove unreferenced redirects
 * more debugging info in debug mode
 * more fixes for escaping in generated reports and more support for sites in
@@ -63,9 +78,9 @@ changes from 1.9.4 to 1.9.5
 * ensure that all generated html output is properly escaped
 * implemented --internal option to flag internal URLs with regular expressions
 * documentation improvements
-* several bugfixes to get webcheck more robust
-* included fancytooltips by Victor Kulinski to have nicer tooltips
-* generated reports now have friendlier messages for when there is nothing to
+* several bug fixes to get webcheck more robust
+* included FancyTooltips by Victor Kulinski to have nicer tooltips
+* generated reports now have friendlier messages for when there is nothing to
   report
 * there is a Debian package
 
@@ -78,9 +93,9 @@ changes from 1.9.3 to 1.9.4
 * some fixes and improvements to the layout of the generated pages
 * redirect loops are now detected
 * transfer result status is now stored
-* addition of a limited css parser that handles imports and url() entries
+* addition of a limited CSS parser that handles imports and url() entries
 * support reading file names for checking from the command line (turning them
-  into file:// urls internally)
+  into file:// URLs internally)
 * better error handling of problems writing generated pages and check that we
   are not overwriting input files
 
@@ -90,14 +105,14 @@ changes from 1.9.2 to 1.9.3
 
 * several improvements to the generated reports, including tooltips with some
   useful information for the links (does not seem to work very well in
-  firefox)
+  Firefox)
 * stability improvements to the html parser (thanks to everyone who reported
   problems) not all problems have been solved but it shouldn't stop webcheck
   any more
 * reimplementation of the file and ftp modules to read directory contents or
   read index.html file if present (there are known problems in the ftp module
   regarding empty directories and recovering from errors)
-* improvements to the url parsing code to warn about spaces in urls
+* improvements to the URL parsing code to warn about spaces in URLs
 * only fetch content if we can parse it
 
 
@@ -105,18 +120,18 @@ changes from 1.9.1 to 1.9.2
 ---------------------------
 
 * complete reimplementation of the html and http modules
-* added https support
+* added HTTPS support
 * some spelling and typo fixes contributed by several people
 * site map now does a proper breadth first traversal of the site structure
 * webcheck homepage has been changed to http://ch.tudelft.nl/~arthur/webcheck/
-* several minor bugfixes and tweaks
+* several minor bug fixes and tweaks
 
 
 changes from 1.9.0 to 1.9.1
 ---------------------------
 
 * ship an empty css.py to actually run
-* small bugfixes for pages with multiple titles and slow plugin
+* small bug fixes for pages with multiple titles and slow plugin
 
 
 changes from 1.0 to 1.9.0
@@ -209,11 +224,11 @@ b Fixed problem when server redirects a URL to itself. This fix seems to
   work for most servers I've tried but there are a few more out there that I
   need to take a look at.
 b Fixed bug that caused linbot to not check for yanked URLs
-+ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
++ Added -l command-line option. Usage: -l <url> where <url> is a URL pointing
   to an image to be used as the report's logo.
 b "patched" strings.py so that it can better parse html files created in
   Windows/DOS (I think).
-+ Made report LOGO a link to the base url
++ Made report LOGO a link to the base URL
 + httplink does not HEAD a redirected URL if it is already in the link list
   (performance improvement)
 - Removed LOGO_ALT from config.py
@@ -69,13 +69,13 @@ Unix-like systems. Other operating systems may differ.
 
 1. Unpack the tarball in the location that you want to have it installed.
    Maybe something like /usr/local/lib/python/site-packages or /opt.
-   % tar -xvzf webcheck-1.10.0.tar.gz
+   % tar -xvzf webcheck-1.10.1.tar.gz
 
 2. Add a symbolic link to some place in your PATH.
-   % ln -s /opt/webcheck-1.10.0/webcheck.py /usr/local/bin/webcheck
+   % ln -s /opt/webcheck-1.10.1/webcheck.py /usr/local/bin/webcheck
 
 3. Put the manual page in the MANPATH.
-   % ln -s /opt/webcheck-1.10.0/webcheck.1 /usr/local/man/man1/webcheck.1
+   % ln -s /opt/webcheck-1.10.1/webcheck.1 /usr/local/man/man1/webcheck.1
 
 
 RUNNING WEBCHECK
@@ -97,7 +97,7 @@ browsers.
 For more information on webcheck usage and command line options see the
 webcheck manual page. If the manual page is not in the MANPATH you can
 probably open the manual with something like:
-  % man -l /opt/webcheck-1.10.0/webcheck.1
+  % man -l /opt/webcheck-1.10.1/webcheck.1
 
 
 FEEDBACK AND BUG REPORTS
@@ -10,11 +10,10 @@ probably before 2.0 release
 * implement a maximum transfer size for downloading
 * support ftp proxies
 * support proxying https traffic
-* give problems different levels (info, warning, error)
+* give problems different levels (info, warning, error) or categories
 * option to only force overwrite generated files and leave static files (css, js) alone
 * implement a --html-only option to not copy css and other files
 * check for missing encoding (report problem)
-* implement parsing of <meta http-equiv="refresh" content="0;url=CHILD">
 * for FTP: don't fail if SIZE is not allowed
 
 wishlist
@@ -60,3 +59,6 @@ wishlist
 * maybe use urllib2 instead of our own custom code (redirects may be a problem here though)
 * add support for robots meta tag: http://www.robotstxt.org/wc/meta-user.html
 * only report multiple definitions of a single anchor once
+* warn if URL contains unencoded characters
+* see section 6 of rfc3986.txt for URL comparison (esp. 6.2.2.)
+* implement paging for huge reports
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
 import urllib
 
 # Current version of webcheck.
-VERSION = '1.10.0'
+VERSION = '1.10.1'
 
 # The homepage of webcheck.
 HOMEPAGE = 'http://ch.tudelft.nl/~arthur/webcheck/'
diff --git a/debian/changelog b/debian/changelog
index d3267f3..4b4051b 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,18 @@
+webcheck (1.10.1) unstable; urgency=low
+
+  * some extra Unicode handling precautions
+  * fix problem in reading webcheck.dat for non-ASCII text (closes: #431625)
+  * be more verbose about HTTP retrieval failures
+  * split out URL normalization code into own module and do some basic
+    protocol-specific normalizations (closes: #425004)
+  * a number of big performance improvements
+  * fix a bug in handling some zero-size pages
+  * parse http-equiv meta HTML header to parse refresh option
+  * webcheck now requires python 2.4 or more recent
+  * added XS-Vcs-Svn and XS-Vcs-Browser as specified in #391023
+
+ -- Arthur de Jong <adejong@debian.org>  Sun, 15 Jul 2007 15:00:00 +0200
+
 webcheck (1.10.0) unstable; urgency=low
 
   * switched HTML parsing to using BeautifulSoup with a fall-back mechanism to
@@ -15,7 +15,7 @@
 .\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 .\" .nh
 .\"
-.TH "webcheck" "1" "May 2007" "Version 1.10.0" "User Commands"
+.TH "webcheck" "1" "Jul 2007" "Version 1.10.1" "User Commands"
 .nh
 .SH "NAME"
 webcheck \- website link checker