author    Arthur de Jong <arthur@arthurdejong.org>  2006-07-02 23:42:49 +0200
committer Arthur de Jong <arthur@arthurdejong.org>  2006-07-02 23:42:49 +0200
commit    6455a467047f57d15f74fd0f7985f559b8f32f57
tree      289074c3b49f6ab1e12ae41a657b3e1642847550
parent    6d8981eaff6a92359282e7ddc6688c6716b888eb

get files ready for 1.9.7 release (1.9.7)

git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@295 86f53f14-5ff3-0310-afe5-9b438ce3f40c

-rw-r--r--  ChangeLog         376
-rw-r--r--  NEWS               17
-rw-r--r--  TODO               49
-rw-r--r--  config.py           2
-rw-r--r--  debian/changelog   22
-rw-r--r--  webcheck.1          2

6 files changed, 442 insertions, 26 deletions

diff --git a/ChangeLog b/ChangeLog
index 851bbfb..a7a4452 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,361 @@
+2006-06-29 21:09 arthur
+
+ * [r294] webcheck.css: always keep navigation on top
+
+2006-06-24 15:35 arthur
+
+ * [r293] crawler.py, serialize.py: store internal, external and
+ yanked regular expressions in a map allowing them to be
+ serialized
+
+2006-06-23 21:01 arthur
+
+ * [r292] debian/control, debian/pycompat, debian/rules: switch to
+ using python-support and follow recent python policy
+
+2006-06-05 20:21 arthur
+
+ * [r291] debian/control: split Build-Depends-Indep into
+ Build-Depends and Build-Depends-Indep
+
+2006-06-05 20:19 arthur
+
+ * [r290] debian/rules: also install favicon.ico in deb package
+ (plus cosmetic fix)
+
+2006-06-04 21:28 arthur
+
+ * [r289] webcheck.1: fix typos and fix example explanation
+
+2006-06-04 21:09 arthur
+
+ * [r288] serialize.py: do not split list of strings on comma's
+ inside the quoted strings
+
+2006-06-04 20:41 arthur
+
+ * [r287] serialize.py: make DeSerializeException a class instead
+ of a function and add fixme
+
+2006-06-04 20:40 arthur
+
+ * [r286] config.py, webcheck.1, webcheck.py: add --continue option
+ to resume the crawling from the point where the previous crawl
+ stopped
+
+2006-06-02 11:47 arthur
+
+ * [r285] webcheck.py: handle break signals in all code
+
+2006-06-02 11:42 arthur
+
+ * [r284] webcheck.py: add code to serialize crawled data during
+ crawl and again after crawl
+
+2006-06-02 11:37 arthur
+
+ * [r283] serialize.py: raise a custom exception instead of IOError
+
+2006-05-31 20:24 arthur
+
+ * [r282] parsers/html.py: add TODOs
+
+2006-05-31 20:23 arthur
+
+ * [r281] debian/control: upgrade to standards-version 3.7.2 (no
+ changes needed)
+
+2006-05-31 20:22 arthur
+
+ * [r280] README: update feature list from deb package description
+
+2006-05-16 19:18 arthur
+
+ * [r279] crawler.py, serialize.py, webcheck.py: split
+ crawler.crawl() function into crawler.crawl() and
+ crawler.postprocess() functions
+
+2006-05-16 19:07 arthur
+
+ * [r278] crawler.py: also serialize remaining links after crawl
+
+2006-05-16 19:05 arthur
+
+ * [r277] crawler.py: remove anchor debugging statements
+
+2006-05-16 18:23 arthur
+
+ * [r276] serialize.py: flag deserialized links as changed so they
+ will be reserialized again
+
+2006-05-16 18:21 arthur
+
+ * [r275] plugins/size.py: fix sorting
+
+2006-05-16 18:19 arthur
+
+ * [r274] plugins/about.py, webcheck.css: update link to
+ fancytooltips
+
+2006-05-15 21:30 arthur
+
+ * [r273] plugins/__init__.py: add makebackup option to open_file()
+ so we can implement updating files (e.g. serialization files)
+
+2006-05-15 21:00 arthur
+
+ * [r272] crawler.py: fix some stupid typos
+
+2006-05-15 20:51 arthur
+
+ * [r271] crawler.py: add code to serialize links to a file while
+ crawling the site
+
+2006-05-15 20:50 arthur
+
+ * [r270] serialize.py: import crawler late as to simplify
+ dependencies
+
+2006-05-15 20:36 arthur
+
+ * [r269] serialize.py: fix typo in fixme
+
+2006-05-15 20:35 arthur
+
+ * [r268] crawler.py: add _ischanged attribute to link objects to
+ indicate change since the constructor (or serialization)
+
+2006-05-15 19:17 arthur
+
+ * [r267] serialize.py: only write serialized data if it is
+ diferent from the constructor's default value
+
+2006-05-15 19:15 arthur
+
+ * [r266] serialize.py: clear anchors, linkproblems and
+ pageproblems from to be deserialized links to avoid duplicates
+ as a link can be deserialized multiple times
+
+2006-05-15 19:13 arthur
+
+ * [r265] serialize.py: remove the call to crawl() from deserialize
+ as this could be a partial deserialize that needs more tweaking
+ to the site before the call to crawl()
+
+2006-05-15 17:26 arthur
+
+ * [r264] parsers/html.py: make decoding try/fallback code a lot
+ simpler and handle case where encoding is specified as empty
+ string
+
+2006-05-12 21:32 arthur
+
+ * [r263] parsers/html.py: improve warning text and add comment
+ concerning trying of encodings
+
+2006-05-12 21:23 arthur
+
+ * [r262] parsers/html.py: ignore unknown entities instead of
+ throwing an error
+
+2006-05-07 10:31 arthur
+
+ * [r261] favicon.ico, plugins/__init__.py, webcheck.py: include
+ favicon.ico file in generated report
+
+2006-05-07 10:31 arthur
+
+ * [r260] schemes/__init__.py: ensure that we are not importing
+ anything weird by using invalid scheme names
+
+2006-05-07 10:26 arthur
+
+ * [r259] webcheck.py: support floats as parameter for --wait
+
+2006-05-07 10:25 arthur
+
+ * [r258] webcheck.1: fix useage of dash
+
+2006-05-07 10:19 arthur
+
+ * [r257] serialize.py: add serialize module that alows serializing
+ and deserializing all crawler state (site and links) to and from
+ a file, this module is not called anywhere yet
+
+2006-05-07 09:56 arthur
+
+ * [r256] crawler.py: fix typo in docstring and add comment
+
+2006-05-07 09:36 arthur
+
+ * [r255] parsers/html.py, plugins/__init__.py: move html escaping
+ and unescaping functions to parsers.html
+
+2006-05-07 09:25 arthur
+
+ * [r254] parsers/html.py: use unichr() to generate unicode
+ characters, not chr()
+
+2006-05-07 09:20 arthur
+
+ * [r253] schemes/file.py: return None explicitly
+
+2006-05-07 09:12 arthur
+
+ * [r252] crawler.py, parsers/html.py, plugins/__init__.py,
+ schemes/file.py, schemes/ftp.py: some more small code
+ improvements thanks to pychecker
+
+2006-05-06 15:44 arthur
+
+ * [r251] parsers/html.py: implement checking for id and name tags
+ in anchors
+
+2006-05-06 15:28 arthur
+
+ * [r250] schemes/ftp.py, schemes/https.py, webcheck.css: bump
+ copyright notices
+
+2006-04-27 21:53 arthur
+
+ * [r249] crawler.py: also add all unfetched links from a site to
+ make this mehtod recallable
+
+2006-04-27 21:47 arthur
+
+ * [r248] crawler.py: make get_link() function a public class
+ function
+
+2006-04-27 21:43 arthur
+
+ * [r247] crawler.py: move url checking bit to right function and
+ improve anchor debugging messages even further
+
+2006-04-27 21:39 arthur
+
+ * [r246] plugins/__init__.py: fix remaining references to escape
+ instead of htmlescape
+
+2006-04-27 21:32 arthur
+
+ * [r245] crawler.py: support passing a url to add_reqanchor() plus
+ some minor comments changes
+
+2006-04-27 21:25 arthur
+
+ * [r244] webcheck.py: handle problems in regular expressions
+ passed on the command line a little more gracefully
+
+2006-04-23 14:52 arthur
+
+ * [r243] plugins/__init__.py, plugins/about.py,
+ plugins/badlinks.py, plugins/problems.py: rename escape()
+ function to htmlescape() to make it a little clearer what we're
+ escaping
+
+2006-04-23 11:31 arthur
+
+ * [r242] TODO, config.py, crawler.py, debugio.py,
+ parsers/__init__.py, parsers/css.py, parsers/html.py,
+ plugins/__init__.py, plugins/about.py, plugins/anchors.py,
+ plugins/badlinks.py, plugins/external.py, plugins/images.py,
+ plugins/new.py, plugins/notchkd.py, plugins/notitles.py,
+ plugins/old.py, plugins/problems.py, plugins/sitemap.py,
+ plugins/size.py, plugins/urllist.py, schemes/__init__.py,
+ schemes/file.py, schemes/ftp.py, schemes/http.py,
+ schemes/https.py, webcheck.py: code improvements thanks to pylint
+
+2006-04-23 11:26 arthur
+
+ * [r241] plugins/__init__.py: also sort parent list by url if
+ titles are the same
+
+2006-04-23 11:25 arthur
+
+ * [r240] schemes/http.py: also properly handle timout problems
+ which only pass one parameter with the exception
+
+2006-04-11 21:40 arthur
+
+ * [r239] config.py, schemes/http.py: implement a timeout setting
+ with a default of 10 seconds
+
+2006-04-11 21:35 arthur
+
+ * [r238] webcheck.css: revert to borderless links as they look
+ ugly in some (most) cases
+
+2006-04-11 21:06 arthur
+
+ * [r237] config.py, plugins/size.py, plugins/slow.py: rename slow
+ plugin to size
+
+2006-04-07 17:58 arthur
+
+ * [r236] parsers/html.py: do not fail on unknown encodings (fall
+ back to system encoding) and add some TODO's to do extra
+ encoding checking
+
+2006-03-26 19:05 arthur
+
+ * [r235] crawler.py, parsers/html.py: split urlescape() from
+ _urlclean() and ensure that all anchors are consitently
+ url-encoded
+
+2006-03-26 19:01 arthur
+
+ * [r234] plugins/anchors.py: only report missing anchors for pages
+ that were fetched and some cleanups
+
+2006-03-26 18:58 arthur
+
+ * [r233] webcheck.css: put a boder around links
+
+2006-03-26 16:47 arthur
+
+ * [r232] plugins/badlinks.py, plugins/external.py,
+ plugins/images.py, plugins/new.py, plugins/notchkd.py,
+ plugins/notitles.py, plugins/old.py, plugins/problems.py,
+ plugins/slow.py: properly close html files on no output
+
+2006-03-10 23:02 arthur
+
+ * [r231] parsers/html.py: revert catching Exception instead of
+ IOError that was there for testing
+
+2006-03-10 22:58 arthur
+
+ * [r230] config.py, crawler.py, parsers/html.py,
+ plugins/anchors.py: implement checking of anchors (there should
+ be no double anchors and all referenced anchors should exist)
+
+2006-03-10 22:56 arthur
+
+ * [r229] plugins/__init__.py: do not include navigation for
+ plugins that do not generate output
+
+2006-03-10 22:48 arthur
+
+ * [r228] parsers/html.py, plugins/notitles.py: trim spaces from
+ title and author fields and check that title is not empty string
+ (apart from undefined)
+
+2006-03-09 22:10 arthur
+
+ * [r227] plugins/__init__.py, plugins/about.py,
+ plugins/badlinks.py, plugins/external.py, plugins/images.py,
+ plugins/new.py, plugins/notchkd.py, plugins/notitles.py,
+ plugins/old.py, plugins/problems.py, plugins/sitemap.py,
+ plugins/slow.py, plugins/urllist.py, webcheck.py: restructure
+ plugin code to open output files from within plugin itself to be
+ able to write different kinds of files
+
+2006-01-30 16:27 arthur
+
+ * [r225] ChangeLog, NEWS, README, TODO, config.py,
+ debian/changelog, debian/copyright, webcheck.1, webcheck.py: get
+ files ready for 1.9.6 release
+
2006-01-29 22:39 arthur
* [r224] crawler.py: bugfix in matching url encoding
@@ -159,6 +517,9 @@
* [r194] plugins/about.py: include reference to FancyTooltips in
about screen
+
+2005-12-27 20:26 arthur
+
* [r193] README: s/contains/includes/ FancyTooltips
2005-12-26 08:47 arthur
@@ -581,6 +942,9 @@
2005-07-30 14:04 arthur
* [r109] config.py: remove support for extra configurable headers
+
+2005-07-30 14:04 arthur
+
* [r108] schemes/http.py: reimplement http module to be a little
more generic and clean and handle errors cleaner and more
consistently
@@ -594,6 +958,9 @@
* [r106] crawler.py: also ignore io errors when retrieving
robots.txt files
+
+2005-07-30 13:59 arthur
+
* [r105] crawler.py: make a _urlclean() function to always store a
proper url without a fragment and with at least a slash for urls
with path elements
@@ -697,6 +1064,9 @@
2005-07-24 08:52 arthur
* [r84] schemes/http.py: handle socket errors properly
+
+2005-07-24 08:52 arthur
+
* [r83] schemes/http.py: fix for incomplete change in r76, now
version should not be referenced any more
@@ -713,6 +1083,9 @@
* [r81] plugins/badlinks.py: remove HTTP status code handling from
here as this should be done by the HTTP module
+
+2005-07-24 08:47 arthur
+
* [r80] plugins/whatsnew.py, plugins/whatsold.py: only report on
internal links
@@ -1051,6 +1424,9 @@
2005-04-13 19:41 arthur
* [r25] contrib/plugins/about.py: general cleanup
+
+2005-04-13 19:41 arthur
+
* [r24] plugins/sitemap.py: rework recursion to make it simpler
plus some general cleanups
diff --git a/NEWS b/NEWS
index a45f688..062b2e7 100644
--- a/NEWS
+++ b/NEWS
@@ -1,9 +1,24 @@
+changes from 1.9.6 to 1.9.7
+---------------------------
+
+* site data is now stored to a file while crawling the site, this can be used
+ to resume a crawl with the --continue option and for debugging purposes
+* implemented checking of link anchors
+* small improvements to generated reports (favicon included, css fix)
+* documentation improvements
+* properly handle float values for --wait
+* unreachable sites will time out faster
+* added support for plugins that don't output html
+* half a dozen other small bugfixes (stability fixes, code cleanups and
+ improvements)
+
+
changes from 1.9.5 to 1.9.6
---------------------------
* SECURITY FIX: a cross-site scripting vulnerability with content in the
tooltips of generated report was fixed by properly escaping
- all output
+ all output (CVE-2006-1321)
* urls are now url encoded into a consistent form, solving some problems with
urls with non-ascii characters
* no longer remove unreferenced redirects
diff --git a/TODO b/TODO
index 7bf5e7a..e341686 100644
--- a/TODO
+++ b/TODO
@@ -1,58 +1,65 @@
before next release
-------------------
* go over all FIXMEs in code (ftp)
-* check that sleep acutally sleeps for the advertised time
* follow redirects (to a limit) of external sites
-* check that scheme names are clean so that we do not import strange python modules
probably before 2.0 release
---------------------------
-* make it possible to copy or reference webcheck.css
-* make it possible to copy http:.../webcheck.css into place (maybe use scheme system, probably just urllib)
-* make more things configurable
-* maybe generate a list of page parents (this is useful to list proper parent links for problem pages)
-* figure out if we need parents and pageparents
-* make configurable time-out when retrieving a document
* support for mult-threading (use -t, --threads as option)
-* implement a fix for redirecting stdout and stderr to work properly
-* implement a maximum transfer size for downloading files and things over http
+* find a fix for redirecting stdout and stderr to work properly
+* implement a maximum transfer size for downloading
* support ftp proxies
* support proxying https traffic
+* give problems different levels (info, warning, error)
+* option to only force overwrite generated files and leave static files (css, js) alone
+* implement a --html-only option to not copy css and other files
+* do not overwrite (maybe) webcheck.css if it is already there
+* check for missing encoding (report problem)
+* implement parsing of meta http-equiv="refresh" content="0;url=CHILD">
+* in --help output: show default number of redirects to follow
wishlist
--------
* make code for stripping last part of a url (e.g. foo/index.html -> foo/)
* maybe set referer (configurable)
-* new config file format (if we want a configfile at all)
* cookies support (maybe)
* integration with weblint
* do form checking of crawled pages
* do spelling checking of crawled pages
* test w3c conformance of pages (already done a little)
-* maybe store crawled site's data in some format for later processing or continuing after interruption
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
-* add a favicon to report
* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
-* maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
+* maybe output a google sitemap file: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
* maybe split out plugins in check() and generate() functions
* make FAQ
-* maybe report unknown/unsupported content in the report
* use gettext to present output to enable translations of messages and html
-* maybe mark embedded content that is external
+* maybe report on embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check of email addresses that they are formatted properly and check that host part has an MX record (make it a problem for no record or only an A record)
-* output a csv file with some useful information
* maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
* maybe add custom bullets in problem lists, depending on problem type
-* maybe make -b the default
-* prompt for authentication (detecting realms)
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* give a warning when no encoding is specified, an error if non-ascii characters are used
* maybe give a warning for urls that have non-ascii characters
-* maybe fetch and store desription and other meta information about page (keywords) (just like author)
-* make it possible to use multiple stylesheets (possibly reference external stylesheets)
+* maybe fetch and store description and other meta information about page (keywords) (just like author)
+* connect to w3c-markup-validator and tidy (and possibly other tools)
+* find out why title does not show up correctly for file?:// urls if they contain non-ascii chars
+* output scan took so long
+* support unicode strings for all string values in link objects (url, status, mimetype, encoding, etc)
+* maybe also serialize robotparsers
+* maybe also add robots.txt to urllist if fetched successfully
+* support CSS encoding: http://www.w3.org/International/questions/qa-css-charset
+* webcheck does not give an error when accessing http://site:443/ ??
+* improve data structures (e.g. see if pop() is faster than pop(0))
+* do not use string for serializing child, embed, anchor and reqanchor as they are already url-encoded
+* automatically strip beginning and trailing spaces from links (but warn though)
+* try python-beautifulsoup
+* there seem to be some issues with generating site maps for ftp directories
+* document serialized file format in manual page (if it is stabilized)
+* look into python-spf to see how DNS queries are done
+* try to use python-chardet in case of missing encoding
diff --git a/config.py b/config.py
index dfb456b..3ea325a 100644
--- a/config.py
+++ b/config.py
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
import urllib
# Current version of webcheck.
-VERSION = '1.9.6'
+VERSION = '1.9.7'
# The homepage of webcheck.
HOMEPAGE = 'http://ch.tudelft.nl/~arthur/webcheck/'
diff --git a/debian/changelog b/debian/changelog
index 7db3ea8..4d309d9 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,10 +1,28 @@
+webcheck (1.9.7) unstable; urgency=low
+
+ * switch to using python-support and follow new policy (closes: #373902)
+ * upgrade to standards-version 3.7.2 (no changes needed)
+ * site data is now stored to a file while crawling the site, this can be
+ used to resume a crawl with the --continue option and for debugging
+ purposes
+ * implemented checking of link anchors
+ * small improvements to generated reports (favicon included, css fix)
+ * documentation improvements
+ * properly handle float values for --wait
+ * unreachable sites will time out faster
+ * added support for plugins that don't output html
+ * half a dozen other small bugfixes (stability fixes, code cleanups and
+ improvements)
+
+ -- Arthur de Jong <adejong@debian.org> Sun, 2 Jul 2006 23:30:00 +0200
+
webcheck (1.9.6) unstable; urgency=low
* SECURITY FIX: a cross-site scripting vulnerability with content in the
tooltips of generated report was fixed by properly escaping
- all output
+ all output (CVE-2006-1321)
* urls are now url encoded into a consistent form, solving some problems
- with urls with non-ascii characters
+ with urls with non-ascii characters (closes: #348377)
* no longer remove unreferenced redirects
* more debugging info in debug mode
* more fixes for escaping in generated reports and more support for sites
diff --git a/webcheck.1 b/webcheck.1
index e67e2bb..ffb4275 100644
--- a/webcheck.1
+++ b/webcheck.1
@@ -15,7 +15,7 @@
.\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.\" .nh
.\"
-.TH "webcheck" "1" "Jan 2006" "Version 1.9.6" "User Commands"
+.TH "webcheck" "1" "Jul 2006" "Version 1.9.7" "User Commands"
.nh
.SH "NAME"
webcheck \- website link checker
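
The r288 entry in the ChangeLog hunk above says that serialize.py no longer splits a list of strings on commas inside the quoted strings. A minimal sketch of that idea follows; it is not webcheck's code. The function name `split_quoted_list` and the input format (double-quoted strings separated by commas) are assumptions made for illustration, and the sketch relies on Python 3's standard `csv` module rather than on whatever serialize.py actually does.

```python
import csv


def split_quoted_list(text):
    """Split a comma-separated list of double-quoted strings.

    Hypothetical helper, not taken from webcheck's serialize.py.
    Commas inside the quotes stay part of the string they belong to.
    """
    # skipinitialspace lets a quoted field start after ", " as well as ","
    rows = list(csv.reader([text], skipinitialspace=True))
    return rows[0] if rows else []


# A plain str.split(",") cuts the first string in two; the csv-based
# helper keeps it whole.
naive = '"a,b", "c"'.split(",")
careful = split_quoted_list('"a,b", "c"')
print(naive)
print(careful)
```

The naive split yields three pieces for two strings, which is the kind of problem the r288 entry describes; the helper returns one item per quoted string.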