author     Arthur de Jong <arthur@arthurdejong.org>  2007-05-12 22:57:01 +0200
committer  Arthur de Jong <arthur@arthurdejong.org>  2007-05-12 22:57:01 +0200
commit     9984ec8062a302d03bf68172b0e6fb2d180f62d7 (patch)
tree       d711aabf781800b58082a2ebb36aff9e51c2605a
parent     c4d61641829600fb6b9cd680c3dcc90350a2e340 (diff)

get files ready for 1.10.0 release  (tag: 1.10.0)
git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@333 86f53f14-5ff3-0310-afe5-9b438ce3f40c
-rw-r--r--  ChangeLog          111
-rw-r--r--  NEWS                14
-rw-r--r--  README               8
-rw-r--r--  TODO                16
-rw-r--r--  config.py            2
-rw-r--r--  debian/changelog    15
-rw-r--r--  debian/copyright     2
-rw-r--r--  webcheck.1           4
8 files changed, 154 insertions(+), 18 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index f66009c..31dfee5 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,114 @@
+2007-05-12 07:49 arthur
+
+ * [r332] crawler.py: also lowercase reqanchor
+
+2007-05-11 22:03 arthur
+
+ * [r331] crawler.py, plugins/anchors.py, plugins/badlinks.py,
+ plugins/problems.py, schemes/http.py: fix some copyright dates
+
+2007-05-11 22:01 arthur
+
+ * [r330] config.py, webcheck.1, webcheck.py: switch robots.txt
+ handling to default on again (broken in 1.9.8) and add new
+ --ignore-robots option to be able to ignore robots retrieval
+
+2007-05-09 19:58 arthur
+
+ * [r329] webcheck.py: present the default number of redirects
+
+2007-05-08 21:33 arthur
+
+ * [r328] plugins/about.py: update copyright information
+
+2007-04-24 20:09 arthur
+
+ * [r327] plugins/__init__.py, plugins/badlinks.py,
+ plugins/problems.py: fixes to make output XHTML 1.1 compliant
+
+2007-04-24 18:53 arthur
+
+ * [r326] parsers/html/beautifulsoup.py: handle ID attribute as
+ anchor on any tag
+
+2007-04-24 18:52 arthur
+
+ * [r325] crawler.py, plugins/anchors.py: lowercase anchor and
+ errors to include id as option
+
+2007-04-20 10:11 arthur
+
+ * [r324] parsers/html/beautifulsoup.py: correctly parse author
+ information
+
+2007-04-20 09:42 arthur
+
+ * [r323] debian/control, parsers/html, parsers/html.py,
+ parsers/html/__init__.py, parsers/html/beautifulsoup.py,
+ parsers/html/htmlparser.py: introduce HTML parsing using
+ BeautifulSoup with a fallback mechanism to the old HTMLParser
+ based solution
+
+2007-04-20 09:40 arthur
+
+ * [r322] crawler.py: mark encoding problems and output more
+ debugging
+
+2007-04-20 08:34 arthur
+
+ * [r321] debian/changelog: fix formatting of previous changelog
+ entry
+
+2007-04-20 08:20 arthur
+
+ * [r320] plugins/anchors.py: fix typo
+
+2007-04-06 12:38 arthur
+
+ * [r319] schemes/http.py: add workaround for bug in idna module
+
+2007-04-06 12:31 arthur
+
+ * [r318] crawler.py: add some comments to the follow_link() method
+
+2007-04-06 12:29 arthur
+
+ * [r317] crawler.py: make parsing of urls and conversion to Link
+ objects a little more consistent
+
+2007-04-06 12:02 arthur
+
+ * [r316] plugins/__init__.py: use consistent unicode conversion
+
+2007-04-06 11:46 arthur
+
+ * [r315] webcheck.1: document the fact that --force should be used
+ for non-interactive use
+
+2007-04-06 11:35 arthur
+
+ * [r314] plugins/__init__.py: bail out if reading user input failed
+
+2007-03-31 11:39 arthur
+
+ * [r313] parsers/html.py: evaluate archive attribute of <applet>
+ tag instead of code attribute if that is present
+
+2007-03-14 21:47 arthur
+
+ * [r312] crawler.py: get rid of old base (singular) as bases is now
+ used everywhere
+
+2007-03-10 12:49 arthur
+
+ * [r311] plugins/sitemap.py: clean up a little and simplify
+
+2007-01-15 20:27 arthur
+
+ * [r309] ChangeLog, NEWS, README, TODO, config.py,
+ debian/changelog, webcheck.1, webcheck.py: get files ready for
+ 1.9.8 release
+
2007-01-15 20:26 arthur
* [r308] schemes/http.py: catch any exception in HTTP module and
diff --git a/NEWS b/NEWS
index b20a1c3..441cd11 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,17 @@
+changes from 1.9.8 to 1.10.0
+----------------------------
+
+* switched HTML parsing to using BeautifulSoup with a fall-back mechanism to
+ the old HTMLParser based solution
+* the new parser is much more error-tolerant but is reportedly somewhat slower
+ and does not include line numbers in errors
+* new features will likely only be added to the new parser
+* some small improvements to the output to make it XHTML 1.1 compliant
+* internal improvements for handling Unicode strings
+* better support for parsing <applet> tags and anchors using id attributes
+* re-enable robots.txt parsing that was disabled in 1.9.8 and add an
+ --ignore-robots option
+
changes from 1.9.7 to 1.9.8
---------------------------
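
The fall-back mechanism described in the NEWS hunk above amounts to preferring
the BeautifulSoup-based parser module and dropping back to the old
HTMLParser-based one when BeautifulSoup cannot be loaded. The following is a
minimal illustrative sketch of that selection pattern only, not the code added
in r323; the exact contents of parsers/html/__init__.py and the parse() call
are assumptions:

    # choose an HTML parser module, preferring the error-tolerant one
    try:
        # BeautifulSoup-backed parser (requires the BeautifulSoup library)
        from parsers.html import beautifulsoup as parsermodule
    except ImportError:
        # fall back to the stricter HTMLParser-based module
        from parsers.html import htmlparser as parsermodule

    # callers would then invoke parsermodule.parse(...) the same way
    # regardless of which implementation was picked (assumed API)
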
diff --git a/README b/README
index c7ec8d5..4debdd4 100644
--- a/README
+++ b/README
@@ -69,13 +69,13 @@ Unix-like systems. Other operating systems may differ.
1. Unpack the tarball in the location that you want to have it installed.
Maybe something like /usr/local/lib/python/site-packages or /opt.
- % tar -xvzf webcheck-1.9.8.tar.gz
+ % tar -xvzf webcheck-1.10.0.tar.gz
2. Add a symbolic link to some place in your PATH.
- % ln -s /opt/webcheck-1.9.8/webcheck.py /usr/local/bin/webcheck
+ % ln -s /opt/webcheck-1.10.0/webcheck.py /usr/local/bin/webcheck
3. Put the manual page in the MANPATH.
- % ln -s /opt/webcheck-1.9.8/webcheck.1 /usr/local/man/man1/webcheck.1
+ % ln -s /opt/webcheck-1.10.0/webcheck.1 /usr/local/man/man1/webcheck.1
RUNNING WEBCHECK
@@ -97,7 +97,7 @@ browsers.
For more information on webcheck usage and command line options see the
webcheck manual page. If the manual page is not in the MANPATH you can
probably open the manual with something like:
- % man -l /opt/webcheck-1.9.8/webcheck.1
+ % man -l /opt/webcheck-1.10.0/webcheck.1
FEEDBACK AND BUG REPORTS
diff --git a/TODO b/TODO
index 246add1..29ea381 100644
--- a/TODO
+++ b/TODO
@@ -13,24 +13,21 @@ probably before 2.0 release
* give problems different levels (info, warning, error)
* option to only force overwrite generated files and leave static files (css, js) alone
* implement a --html-only option to not copy css and other files
-* do not overwrite (maybe) webcheck.css if it is already there
* check for missing encoding (report problem)
* implement parsing of meta http-equiv="refresh" content="0;url=CHILD">
-* in --help output: show default number of redirects to follow
+* for FTP: don't fail if SIZE is not allowed
wishlist
--------
* make code for stripping last part of a url (e.g. foo/index.html -> foo/)
* maybe set referer (configurable)
-* cookies support (maybe)
+* cookies support (maybe) (not difficult with urllib2)
* integration with weblint
* do form checking of crawled pages
* do spelling checking of crawled pages
-* test w3c conformance of pages (already done a little)
+* test w3c conformance of pages
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
-* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
-* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
* maybe output a google sitemap file: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
@@ -44,7 +41,6 @@ wishlist
* maybe add custom bullets in problem lists, depending on problem type
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
-* give a warning when no encoding is specified, an error if non-ascii characters are used
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store description and other meta information about page (keywords) (just like author)
* connect to w3c-markup-validator and tidy (and possibly other tools)
@@ -57,10 +53,10 @@ wishlist
* webcheck does not give an error when accessing http://site:443/ ??
* improve data structures (e.g. see if pop() is faster than pop(0))
* do not use string for serializing child, embed, anchor and reqanchor as they are already url-encoded
-* automatically strip beginning and trailing spaces from links (but warn though)
-* try python-beautifulsoup
* there seem to be some issues with generating site maps for ftp directories
* document serialized file format in manual page (if it is stabilized)
* look into python-spf to see how DNS queries are done
-* try to use python-chardet in case of missing encoding
* implement an option to ignore problems on pages (but do consider internal, etc) (e.g. for generated or legacy html)
+* maybe use urllib2 instead of our own custom code (redirects may be a problem here though)
+* add support for robots meta tag: http://www.robotstxt.org/wc/meta-user.html
+* only report multiple definitions of a single anchor once
diff --git a/config.py b/config.py
index 1c379e7..bf93113 100644
--- a/config.py
+++ b/config.py
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
import urllib
# Current version of webcheck.
-VERSION = '1.9.8'
+VERSION = '1.10.0'
# The homepage of webcheck.
HOMEPAGE = 'http://ch.tudelft.nl/~arthur/webcheck/'
diff --git a/debian/changelog b/debian/changelog
index 95a7761..d3267f3 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,18 @@
+webcheck (1.10.0) unstable; urgency=low
+
+ * switched HTML parsing to using BeautifulSoup with a fall-back mechanism to
+ the old HTMLParser based solution
+ * the new parser is much more error-tolerant but is reportedly somewhat
+ slower and does not include line numbers in errors
+ * new features will likely only be added to the new parser
+ * some small improvements to the output to make it XHTML 1.1 compliant
+ * internal improvements for handling Unicode strings
+ * better support for parsing <applet> tags and anchors using id attributes
+ * re-enable robots.txt parsing that was disabled in 1.9.8 and add an
+ --ignore-robots option
+
+ -- Arthur de Jong <adejong@debian.org> Wed, 09 May 2007 21:48:11 +0200
+
webcheck (1.9.8) unstable; urgency=low
* some checks for properly handling unknown and wrong encodings have been
diff --git a/debian/copyright b/debian/copyright
index e82120b..2e88bd8 100644
--- a/debian/copyright
+++ b/debian/copyright
@@ -20,7 +20,7 @@ LeJacq <jplejacq@quoininc.com>.
Copyright (C) 1998, 1999 Albert Hopkins (marduk)
Copyright (C) 2002 Mike W. Meyer
-Copyright (C) 2005, 2006 Arthur de Jong
+Copyright (C) 2005, 2006, 2007 Arthur de Jong
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/webcheck.1 b/webcheck.1
index bfcadf0..36d63e1 100644
--- a/webcheck.1
+++ b/webcheck.1
@@ -15,7 +15,7 @@
.\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.\" .nh
.\"
-.TH "webcheck" "1" "Jan 2007" "Version 1.9.8" "User Commands"
+.TH "webcheck" "1" "May 2007" "Version 1.10.0" "User Commands"
.nh
.SH "NAME"
webcheck \- website link checker
@@ -208,7 +208,7 @@ Copyright \(co 1998, 1999 Albert Hopkins (marduk)
.br
Copyright \(co 2002 Mike W. Meyer
.br
-Copyright \(co 2005, 2006 Arthur de Jong
+Copyright \(co 2005, 2006, 2007 Arthur de Jong
.br
webcheck is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.