Arthur de Jong

Open Source / Free Software developer

-rw-r--r--  ChangeLog         90
-rw-r--r--  NEWS              49
-rw-r--r--  README             8
-rw-r--r--  TODO               6
-rw-r--r--  config.py          2
-rw-r--r--  debian/changelog  15
-rw-r--r--  webcheck.1         2
7 files changed, 147 insertions, 25 deletions
diff --git a/ChangeLog b/ChangeLog
index d82875d..b80db78 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,93 @@
+2007-07-15 08:23 arthur
+
+ * [r351] crawler.py: output which parser module is used in debug
+ mode
+
+2007-07-15 08:13 arthur
+
+ * [r350] ChangeLog: fix spelling in ChangeLog messages
+
+2007-07-15 07:56 arthur
+
+ * [r349] parsers/html/beautifulsoup.py: also handle http-equiv
+ refresh meta header
+
+2007-07-15 07:27 arthur
+
+ * [r348] crawler.py: just ignore setting encoding to None
+
+2007-07-14 18:20 arthur
+
+ * [r347] crawler.py: fix printing of None encoding
+
+2007-07-14 18:18 arthur
+
+ * [r346] myurllib.py: simplify _normalize_escapes() function to
+ improve performance
+
+2007-07-14 10:26 arthur
+
+ * [r345] myurllib.py: replace double slashes in file URL paths with
+ single ones
+
+2007-07-13 18:48 arthur
+
+ * [r344] myurllib.py: add note about improving performance more
+
+2007-07-13 18:47 arthur
+
+ * [r343] crawler.py, plugins/__init__.py, plugins/sitemap.py,
+ serialize.py: use sets instead of sequences for children,
+ embedded, etc to improve deserialization performance with a
+ factor 25 but now require python 2.4 of more recent
+
+2007-07-13 13:56 arthur
+
+ * [r342] serialize.py: give the matched URL a name to make code
+ more readable
+
+2007-07-13 13:55 arthur
+
+ * [r341] serialize.py: be a little more verbose when raising
+ parsing exceptions
+
+2007-07-13 13:50 arthur
+
+ * [r340] plugins/badlinks.py: get rid of unneeded sort
+
+2007-07-07 14:02 arthur
+
+ * [r339] crawler.py, myurllib.py, parsers/html/beautifulsoup.py,
+ parsers/html/htmlparser.py: split out URL cleaning code into own
+ module
+
+2007-07-07 13:54 arthur
+
+ * [r338] schemes/http.py: do not handle control-C and pass it along
+ to the main exception handler and log http exceptions with a
+ higher level
+
+2007-07-07 13:39 arthur
+
+ * [r337] debian/control: added XS-Vcs-Svn and XS-Vcs-Browser as
+ specified in #391023
+
+2007-07-06 14:18 arthur
+
+ * [r336] crawler.py, serialize.py: improve deserialization and
+ handling of Unicode strings
+
+2007-07-06 13:51 arthur
+
+ * [r335] plugins/problems.py, plugins/size.py: some extra
+ precautions for handling Unicode data and correct HTML escaping
+
+2007-05-12 20:57 arthur
+
+ * [r333] ChangeLog, NEWS, README, TODO, config.py,
+ debian/changelog, debian/copyright, webcheck.1: get files ready
+ for 1.10.0 release
+
2007-05-12 07:49 arthur
* [r332] crawler.py: also lower-case reqanchor
diff --git a/NEWS b/NEWS
index 441cd11..01c318e 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,17 @@
+changes from 1.10.0 to 1.10.1
+-----------------------------
+
+* some extra Unicode handling precautions
+* fix problem in reading webcheck.dat for non-ASCII text
+* be more verbose about HTTP retrieval failures
+* split out URL normalization code into own module and do some basic protocol-
+ specific normalizations
+* a number of big performance improvements
+* fix a bug in handling some zero-size pages
+* parse http-equiv meta HTML header to parse refresh option
+* webcheck now requires python 2.4 or more recent
+
+
changes from 1.9.8 to 1.10.0
----------------------------
@@ -12,6 +26,7 @@ changes from 1.9.8 to 1.10.0
* re-enable robots.txt parsing that was disabled in 1.9.8 and add an
--ignore-robots option
+
changes from 1.9.7 to 1.9.8
---------------------------
@@ -19,7 +34,7 @@ changes from 1.9.7 to 1.9.8
added
* added proper error handling for SSL related socket problems (exceptions are
not a subclass of regular socket exceptions)
-* a bugfix for urls that contain a user name without a password or the other
+* a bug fix for URLs that contain a user name without a password or the other
way around
* miscellaneous small report improvements
@@ -30,12 +45,12 @@ changes from 1.9.6 to 1.9.7
* site data is now stored to a file while crawling the site, this can be used
to resume a crawl with the --continue option and for debugging purposes
* implemented checking of link anchors
-* small improvements to generated reports (favicon included, css fix)
+* small improvements to generated reports (favicon included, CSS fix)
* documentation improvements
* properly handle float values for --wait
* unreachable sites will time out faster
* added support for plugins that don't output html
-* half a dozen other small bugfixes (stability fixes, code cleanups and
+* half a dozen other small bug fixes (stability fixes, code clean-ups and
improvements)
@@ -45,8 +60,8 @@ changes from 1.9.5 to 1.9.6
* SECURITY FIX: a cross-site scripting vulnerability with content in the
tooltips of generated report was fixed by properly escaping
all output (CVE-2006-1321)
-* urls are now url encoded into a consistent form, solving some problems with
- urls with non-ascii characters
+* URLs are now URL encoded into a consistent form, solving some problems with
+ URLs with non-ASCII characters
* no longer remove unreferenced redirects
* more debugging info in debug mode
* more fixes for escaping in generated reports and more support for sites in
@@ -63,9 +78,9 @@ changes from 1.9.4 to 1.9.5
* ensure that all generated html output is properly escaped
* implemented --internal option to flag internal URLs with regular expressions
* documentation improvements
-* several bugfixes to get webcheck more robust
-* included fancytooltips by Victor Kulinski to have nicer tooltips
-* generated reports now have friendlier messages for when there is nothing to
+* several bug fixes to get webcheck more robust
+* included FancyTooltips by Victor Kulinski to have nicer tooltips
+* generated reports now have friendlier messages for when there is nothing to
report
* there is a Debian package
@@ -78,9 +93,9 @@ changes from 1.9.3 to 1.9.4
* some fixes and improvements to the layout of the generated pages
* redirect loops are now detected
* transfer result status is now stored
-* addition of a limited css parser that handles imports and url() entries
+* addition of a limited CSS parser that handles imports and url() entries
* support reading file names for checking from the command line (turning them
- into file:// urls internally)
+ into file:// URLs internally)
* better error handling of problems writing generated pages and check that we
are not overwriting input files
@@ -90,14 +105,14 @@ changes from 1.9.2 to 1.9.3
* several improvements to the generated reports, including tooltips with some
useful information for the links (does not seem to work very well in
- firefox)
+ Firefox)
* stability improvements to the html parser (thanks to everyone who reported
problems) not all problems have been solved but it shouldn't stop webcheck
any more
* reimplementation of the file and ftp modules to read directory contents or
read index.html file if present (there are known problems in the ftp module
regarding empty directories and recovering from errors)
-* improvements to the url parsing code to warn about spaces in urls
+* improvements to the URL parsing code to warn about spaces in URLs
* only fetch content if we can parse it
@@ -105,18 +120,18 @@ changes from 1.9.1 to 1.9.2
---------------------------
* complete reimplementation of the html and http modules
-* added https support
+* added HTTPS support
* some spelling and typo fixes contributed by several people
* site map now does a proper breadth first traversal of the site structure
* webcheck homepage has been changed to http://ch.tudelft.nl/~arthur/webcheck/
-* several minor bugfixes and tweaks
+* several minor bug fixes and tweaks
changes from 1.9.0 to 1.9.1
---------------------------
* ship an empty css.py to actually run
-* small bugfixes for pages with multiple titles and slow plugin
+* small bug fixes for pages with multiple titles and slow plugin
changes from 1.0 to 1.9.0
@@ -209,11 +224,11 @@ b Fixed problem when server redirects a URL to itself. This fix seems to work
for most servers I've tried but there are a few more out there that I need
to take a look at.
b Fixed bug that caused linbot to not check for yanked URLs
-+ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
++ Added -l command-line option. Usage: -l <url> where <url> is a URL pointing
to an image to be used as the report's logo.
b "patched" strings.py so that it can better parse html files created in
Windows/DOS (I think).
-+ Made report LOGO a link to the base url
++ Made report LOGO a link to the base URL
+ httplink does not HEAD a redirected URL if it is already in the link list
(performance improvement)
- Removed LOGO_ALT from config.py
diff --git a/README b/README
index 4debdd4..6038f53 100644
--- a/README
+++ b/README
@@ -69,13 +69,13 @@ Unix-like systems. Other operating systems may differ.
1. Unpack the tarball in the location that you want to have it installed.
Maybe something like /usr/local/lib/python/site-packages or /opt.
- % tar -xvzf webcheck-1.10.0.tar.gz
+ % tar -xvzf webcheck-1.10.1.tar.gz
2. Add a symbolic link to some place in your PATH.
- % ln -s /opt/webcheck-1.10.0/webcheck.py /usr/local/bin/webcheck
+ % ln -s /opt/webcheck-1.10.1/webcheck.py /usr/local/bin/webcheck
3. Put the manual page in the MANPATH.
- % ln -s /opt/webcheck-1.10.0/webcheck.1 /usr/local/man/man1/webcheck.1
+ % ln -s /opt/webcheck-1.10.1/webcheck.1 /usr/local/man/man1/webcheck.1
RUNNING WEBCHECK
@@ -97,7 +97,7 @@ browsers.
For more information on webcheck usage and command line options see the
webcheck manual page. If the manual page is not in the MANPATH you can
probably open the manual with something like:
- % man -l /opt/webcheck-1.10.0/webcheck.1
+ % man -l /opt/webcheck-1.10.1/webcheck.1
FEEDBACK AND BUG REPORTS
diff --git a/TODO b/TODO
index 29ea381..3f58ecf 100644
--- a/TODO
+++ b/TODO
@@ -10,11 +10,10 @@ probably before 2.0 release
* implement a maximum transfer size for downloading
* support ftp proxies
* support proxying https traffic
-* give problems different levels (info, warning, error)
+* give problems different levels (info, warning, error) or categories
* option to only force overwrite generated files and leave static files (css, js) alone
* implement a --html-only option to not copy css and other files
* check for missing encoding (report problem)
-* implement parsing of meta http-equiv="refresh" content="0;url=CHILD">
* for FTP: don't fail if SIZE is not allowed
wishlist
@@ -60,3 +59,6 @@ wishlist
* maybe use urllib2 instead of our own custom code (redirects may be a problem here though)
* add support for robots meta tag: http://www.robotstxt.org/wc/meta-user.html
* only report multiple definitions of a single anchor once
+* warn if URL contains unencoded characters
+* see section 6 of rfc3986.txt for URL comparison (esp. 6.2.2.)
+* implement paging for huge reports
diff --git a/config.py b/config.py
index bf93113..a5df060 100644
--- a/config.py
+++ b/config.py
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
import urllib
# Current version of webcheck.
-VERSION = '1.10.0'
+VERSION = '1.10.1'
# The homepage of webcheck.
HOMEPAGE = 'http://ch.tudelft.nl/~arthur/webcheck/'
diff --git a/debian/changelog b/debian/changelog
index d3267f3..4b4051b 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,18 @@
+webcheck (1.10.1) unstable; urgency=low
+
+ * some extra Unicode handling precautions
+ * fix problem in reading webcheck.dat for non-ASCII text (closes: #431625)
+ * be more verbose about HTTP retrieval failures
+ * split out URL normalization code into own module and do some basic
+ protocol-specific normalizations (closes: #425004)
+ * a number of big performance improvements
+ * fix a bug in handling some zero-size pages
+ * parse http-equiv meta HTML header to parse refresh option
+ * webcheck now requires python 2.4 or more recent
+ * added XS-Vcs-Svn and XS-Vcs-Browser as specified in #391023
+
+ -- Arthur de Jong <adejong@debian.org> Sun, 15 Jul 2007 15:00:00 +0200
+
webcheck (1.10.0) unstable; urgency=low
* switched HTML parsing to using BeautifulSoup with a fall-back mechanism to
diff --git a/webcheck.1 b/webcheck.1
index 36d63e1..06fadf9 100644
--- a/webcheck.1
+++ b/webcheck.1
@@ -15,7 +15,7 @@
.\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.\" .nh
.\"
-.TH "webcheck" "1" "May 2007" "Version 1.10.0" "User Commands"
+.TH "webcheck" "1" "Jul 2007" "Version 1.10.1" "User Commands"
.nh
.SH "NAME"
webcheck \- website link checker
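
Note on the http-equiv refresh handling mentioned in NEWS and removed from TODO in this release: the sketch below shows one way the target URL could be extracted from a refresh value such as content="0;url=CHILD". It is not taken from parsers/html/beautifulsoup.py. The function name refresh_target and the regular expression are assumptions made for illustration, and the sketch is written for current Python rather than the Python 2.4 that this release targets.

```python
import re

# Hypothetical helper, not webcheck's actual code: matches values like
# "0;url=CHILD" or '5; URL="http://example.com/"'.
_REFRESH_RE = re.compile(
    r'^\s*\d+(?:\.\d+)?\s*[;,]\s*'      # delay in seconds, then a separator
    r'(?:url\s*=\s*)?'                  # optional "url=" prefix
    r'(["\']?)(?P<url>.*?)\1\s*$',      # optionally quoted target URL
    re.IGNORECASE)


def refresh_target(content):
    """Return the URL named in a meta refresh content value, or None."""
    match = _REFRESH_RE.match(content)
    if match is None:
        return None
    # An empty URL means "reload the current page", so there is no new link.
    return match.group('url') or None
```

A crawler could treat the returned URL as one more child link of the page; a value holding only a delay (for example "5") yields None.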