author | Arthur de Jong <arthur@arthurdejong.org> | 2006-07-02 23:42:49 +0200 |
---|---|---|
committer | Arthur de Jong <arthur@arthurdejong.org> | 2006-07-02 23:42:49 +0200 |
commit | 6455a467047f57d15f74fd0f7985f559b8f32f57 (patch) | |
tree | 289074c3b49f6ab1e12ae41a657b3e1642847550 | |
parent | 6d8981eaff6a92359282e7ddc6688c6716b888eb (diff) |
get files ready for 1.9.7 release
git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@295 86f53f14-5ff3-0310-afe5-9b438ce3f40c
-rw-r--r-- | ChangeLog | 376 |
-rw-r--r-- | NEWS | 17 |
-rw-r--r-- | TODO | 49 |
-rw-r--r-- | config.py | 2 |
-rw-r--r-- | debian/changelog | 22 |
-rw-r--r-- | webcheck.1 | 2 |
6 files changed, 442 insertions, 26 deletions
@@ -1,3 +1,361 @@
+2006-06-29 21:09 arthur
+
+	* [r294] webcheck.css: always keep navigation on top
+
+2006-06-24 15:35 arthur
+
+	* [r293] crawler.py, serialize.py: store internal, external and
+	  yanked regular expressions in a map allowing them to be
+	  serialized
+
+2006-06-23 21:01 arthur
+
+	* [r292] debian/control, debian/pycompat, debian/rules: switch to
+	  using python-support and follow recent python policy
+
+2006-06-05 20:21 arthur
+
+	* [r291] debian/control: split Build-Depends-Indep into
+	  Build-Depends and Build-Depends-Indep
+
+2006-06-05 20:19 arthur
+
+	* [r290] debian/rules: also install favicon.ico in deb package
+	  (plus cosmetic fix)
+
+2006-06-04 21:28 arthur
+
+	* [r289] webcheck.1: fix typos and fix example explanation
+
+2006-06-04 21:09 arthur
+
+	* [r288] serialize.py: do not split list of strings on comma's
+	  inside the quoted strings
+
+2006-06-04 20:41 arthur
+
+	* [r287] serialize.py: make DeSerializeException a class instead
+	  of a function and add fixme
+
+2006-06-04 20:40 arthur
+
+	* [r286] config.py, webcheck.1, webcheck.py: add --continue option
+	  to resume the crawling from the point where the previous crawl
+	  stopped
+
+2006-06-02 11:47 arthur
+
+	* [r285] webcheck.py: handle break signals in all code
+
+2006-06-02 11:42 arthur
+
+	* [r284] webcheck.py: add code to serialize crawled data during
+	  crawl and again after crawl
+
+2006-06-02 11:37 arthur
+
+	* [r283] serialize.py: raise a custom exception instead of IOError
+
+2006-05-31 20:24 arthur
+
+	* [r282] parsers/html.py: add TODOs
+
+2006-05-31 20:23 arthur
+
+	* [r281] debian/control: upgrade to standards-version 3.7.2 (no
+	  changes needed)
+
+2006-05-31 20:22 arthur
+
+	* [r280] README: update feature list from deb package description
+
+2006-05-16 19:18 arthur
+
+	* [r279] crawler.py, serialize.py, webcheck.py: split
+	  crawler.crawl() function into crawler.crawl() and
+	  crawler.postprocess() functions
+
+2006-05-16 19:07 arthur
+
+	* [r278] crawler.py: also serialize remaining links after crawl
+
+2006-05-16 19:05 arthur
+
+	* [r277] crawler.py: remove anchor debugging statements
+
+2006-05-16 18:23 arthur
+
+	* [r276] serialize.py: flag deserialized links as changed so they
+	  will be reserialized again
+
+2006-05-16 18:21 arthur
+
+	* [r275] plugins/size.py: fix sorting
+
+2006-05-16 18:19 arthur
+
+	* [r274] plugins/about.py, webcheck.css: update link to
+	  fancytooltips
+
+2006-05-15 21:30 arthur
+
+	* [r273] plugins/__init__.py: add makebackup option to open_file()
+	  so we can implement updating files (e.g. serialization files)
+
+2006-05-15 21:00 arthur
+
+	* [r272] crawler.py: fix some stupid typos
+
+2006-05-15 20:51 arthur
+
+	* [r271] crawler.py: add code to serialize links to a file while
+	  crawling the site
+
+2006-05-15 20:50 arthur
+
+	* [r270] serialize.py: import crawler late as to simplify
+	  dependencies
+
+2006-05-15 20:36 arthur
+
+	* [r269] serialize.py: fix typo in fixme
+
+2006-05-15 20:35 arthur
+
+	* [r268] crawler.py: add _ischanged attribute to link objects to
+	  indicate change since the constructor (or serialization)
+
+2006-05-15 19:17 arthur
+
+	* [r267] serialize.py: only write serialized data if it is
+	  diferent from the constructor's default value
+
+2006-05-15 19:15 arthur
+
+	* [r266] serialize.py: clear anchors, linkproblems and
+	  pageproblems from to be deserialized links to avoid duplicates
+	  as a link can be deserialized multiple times
+
+2006-05-15 19:13 arthur
+
+	* [r265] serialize.py: remove the call to crawl() from deserialize
+	  as this could be a partial deserialize that needs more tweaking
+	  to the site before the call to crawl()
+
+2006-05-15 17:26 arthur
+
+	* [r264] parsers/html.py: make decoding try/fallback code a lot
+	  simpler and handle case where encoding is specified as empty
+	  string
+
+2006-05-12 21:32 arthur
+
+	* [r263] parsers/html.py: improve warning text and add comment
+	  concerning trying of encodings
+
+2006-05-12 21:23 arthur
+
+	* [r262] parsers/html.py: ignore unknown entities instead of
+	  throwing an error
+
+2006-05-07 10:31 arthur
+
+	* [r261] favicon.ico, plugins/__init__.py, webcheck.py: include
+	  favicon.ico file in generated report
+
+2006-05-07 10:31 arthur
+
+	* [r260] schemes/__init__.py: ensure that we are not importing
+	  anything weird by using invalid scheme names
+
+2006-05-07 10:26 arthur
+
+	* [r259] webcheck.py: support floats as parameter for --wait
+
+2006-05-07 10:25 arthur
+
+	* [r258] webcheck.1: fix useage of dash
+
+2006-05-07 10:19 arthur
+
+	* [r257] serialize.py: add serialize module that alows serializing
+	  and deserializing all crawler state (site and links) to and from
+	  a file, this module is not called anywhere yet
+
+2006-05-07 09:56 arthur
+
+	* [r256] crawler.py: fix typo in docstring and add comment
+
+2006-05-07 09:36 arthur
+
+	* [r255] parsers/html.py, plugins/__init__.py: move html escaping
+	  and unescaping functions to parsers.html
+
+2006-05-07 09:25 arthur
+
+	* [r254] parsers/html.py: use unichr() to generate unicode
+	  characters, not chr()
+
+2006-05-07 09:20 arthur
+
+	* [r253] schemes/file.py: return None explicitly
+
+2006-05-07 09:12 arthur
+
+	* [r252] crawler.py, parsers/html.py, plugins/__init__.py,
+	  schemes/file.py, schemes/ftp.py: some more small code
+	  improvements thanks to pychecker
+
+2006-05-06 15:44 arthur
+
+	* [r251] parsers/html.py: implement checking for id and name tags
+	  in anchors
+
+2006-05-06 15:28 arthur
+
+	* [r250] schemes/ftp.py, schemes/https.py, webcheck.css: bump
+	  copyright notices
+
+2006-04-27 21:53 arthur
+
+	* [r249] crawler.py: also add all unfetched links from a site to
+	  make this mehtod recallable
+
+2006-04-27 21:47 arthur
+
+	* [r248] crawler.py: make get_link() function a public class
+	  function
+
+2006-04-27 21:43 arthur
+
+	* [r247] crawler.py: move url checking bit to right function and
+	  improve anchor debugging messages even further
+
+2006-04-27 21:39 arthur
+
+	* [r246] plugins/__init__.py: fix remaining references to escape
+	  instead of htmlescape
+
+2006-04-27 21:32 arthur
+
+	* [r245] crawler.py: support passing a url to add_reqanchor() plus
+	  some minor comments changes
+
+2006-04-27 21:25 arthur
+
+	* [r244] webcheck.py: handle problems in regular expressions
+	  passed on the command line a little more gracefully
+
+2006-04-23 14:52 arthur
+
+	* [r243] plugins/__init__.py, plugins/about.py,
+	  plugins/badlinks.py, plugins/problems.py: rename escape()
+	  function to htmlescape() to make it a little clearer what we're
+	  escaping
+
+2006-04-23 11:31 arthur
+
+	* [r242] TODO, config.py, crawler.py, debugio.py,
+	  parsers/__init__.py, parsers/css.py, parsers/html.py,
+	  plugins/__init__.py, plugins/about.py, plugins/anchors.py,
+	  plugins/badlinks.py, plugins/external.py, plugins/images.py,
+	  plugins/new.py, plugins/notchkd.py, plugins/notitles.py,
+	  plugins/old.py, plugins/problems.py, plugins/sitemap.py,
+	  plugins/size.py, plugins/urllist.py, schemes/__init__.py,
+	  schemes/file.py, schemes/ftp.py, schemes/http.py,
+	  schemes/https.py, webcheck.py: code improvements thanks to pylint
+
+2006-04-23 11:26 arthur
+
+	* [r241] plugins/__init__.py: also sort parent list by url if
+	  titles are the same
+
+2006-04-23 11:25 arthur
+
+	* [r240] schemes/http.py: also properly handle timout problems
+	  which only pass one parameter with the exception
+
+2006-04-11 21:40 arthur
+
+	* [r239] config.py, schemes/http.py: implement a timeout setting
+	  with a default of 10 seconds
+
+2006-04-11 21:35 arthur
+
+	* [r238] webcheck.css: revert to borderless links as they look
+	  ugly in some (most) cases
+
+2006-04-11 21:06 arthur
+
+	* [r237] config.py, plugins/size.py, plugins/slow.py: rename slow
+	  plugin to size
+
+2006-04-07 17:58 arthur
+
+	* [r236] parsers/html.py: do not fail on unknown encodings (fall
+	  back to system encoding) and add some TODO's to do extra
+	  encoding checking
+
+2006-03-26 19:05 arthur
+
+	* [r235] crawler.py, parsers/html.py: split urlescape() from
+	  _urlclean() and ensure that all anchors are consitently
+	  url-encoded
+
+2006-03-26 19:01 arthur
+
+	* [r234] plugins/anchors.py: only report missing anchors for pages
+	  that were fetched and some cleanups
+
+2006-03-26 18:58 arthur
+
+	* [r233] webcheck.css: put a boder around links
+
+2006-03-26 16:47 arthur
+
+	* [r232] plugins/badlinks.py, plugins/external.py,
+	  plugins/images.py, plugins/new.py, plugins/notchkd.py,
+	  plugins/notitles.py, plugins/old.py, plugins/problems.py,
+	  plugins/slow.py: properly close html files on no output
+
+2006-03-10 23:02 arthur
+
+	* [r231] parsers/html.py: revert catching Exception instead of
+	  IOError that was there for testing
+
+2006-03-10 22:58 arthur
+
+	* [r230] config.py, crawler.py, parsers/html.py,
+	  plugins/anchors.py: implement checking of anchors (there should
+	  be no double anchors and all referenced anchors should exist)
+
+2006-03-10 22:56 arthur
+
+	* [r229] plugins/__init__.py: do not include navigation for
+	  plugins that do not generate output
+
+2006-03-10 22:48 arthur
+
+	* [r228] parsers/html.py, plugins/notitles.py: trim spaces from
+	  title and author fields and check that title is not empty string
+	  (apart from undefined)
+
+2006-03-09 22:10 arthur
+
+	* [r227] plugins/__init__.py, plugins/about.py,
+	  plugins/badlinks.py, plugins/external.py, plugins/images.py,
+	  plugins/new.py, plugins/notchkd.py, plugins/notitles.py,
+	  plugins/old.py, plugins/problems.py, plugins/sitemap.py,
+	  plugins/slow.py, plugins/urllist.py, webcheck.py: restructure
+	  plugin code to open output files from within plugin itself to be
+	  able to write different kinds of files
+
+2006-01-30 16:27 arthur
+
+	* [r225] ChangeLog, NEWS, README, TODO, config.py,
+	  debian/changelog, debian/copyright, webcheck.1, webcheck.py: get
+	  files ready for 1.9.6 release
+
 2006-01-29 22:39 arthur

 	* [r224] crawler.py: bugfix in matching url encoding
@@ -159,6 +517,9 @@

 	* [r194] plugins/about.py: include reference to FancyTooltips in
 	  about screen
+
+2005-12-27 20:26 arthur
+
 	* [r193] README: s/contains/includes/ FancyTooltips

 2005-12-26 08:47 arthur
@@ -581,6 +942,9 @@
 2005-07-30 14:04 arthur

 	* [r109] config.py: remove support for extra configurable headers
+
+2005-07-30 14:04 arthur
+
 	* [r108] schemes/http.py: reimplement http module to be a little
 	  more generic and clean and handle errors cleaner and more
 	  consistently
@@ -594,6 +958,9 @@

 	* [r106] crawler.py: also ignore io errors when retrieving
 	  robots.txt files
+
+2005-07-30 13:59 arthur
+
 	* [r105] crawler.py: make a _urlclean() function to always store a
 	  proper url without a fragment and with at least a slash for urls
 	  with path elements
@@ -697,6 +1064,9 @@
 2005-07-24 08:52 arthur

 	* [r84] schemes/http.py: handle socket errors properly
+
+2005-07-24 08:52 arthur
+
 	* [r83] schemes/http.py: fix for incomplete change in r76, now
 	  version should not be referenced any more

@@ -713,6 +1083,9 @@

 	* [r81] plugins/badlinks.py: remove HTTP status code handling from
 	  here as this should be done by the HTTP module
+
+2005-07-24 08:47 arthur
+
 	* [r80] plugins/whatsnew.py, plugins/whatsold.py: only report on
 	  internal links

@@ -1051,6 +1424,9 @@
 2005-04-13 19:41 arthur

 	* [r25] contrib/plugins/about.py: general cleanup
+
+2005-04-13 19:41 arthur
+
 	* [r24] plugins/sitemap.py: rework recursion to make it simpler
 	  plus some general cleanups

@@ -1,9 +1,24 @@
+changes from 1.9.6 to 1.9.7
+---------------------------
+
+* site data is now stored to a file while crawling the site, this can be used
+  to resume a crawl with the --continue option and for debugging purposes
+* implemented checking of link anchors
+* small improvements to generated reports (favicon included, css fix)
+* documentation improvements
+* properly handle float values for --wait
+* unreachable sites will time out faster
+* added support for plugins that don't output html
+* half a dozen other small bugfixes (stability fixes, code cleanups and
+  improvements)
+
+
 changes from 1.9.5 to 1.9.6
 ---------------------------

 * SECURITY FIX: a cross-site scripting vulnerability with content in
   the tooltips of generated report was fixed by properly escaping
-  all output
+  all output (CVE-2006-1321)
 * urls are now url encoded into a consistent form, solving some problems
   with urls with non-ascii characters
 * no longer remove unreferenced redirects
@@ -1,58 +1,65 @@
 before next release
 -------------------
 * go over all FIXMEs in code (ftp)
-* check that sleep acutally sleeps for the advertised time
 * follow redirects (to a limit) of external sites
-* check that scheme names are clean so that we do not import strange python modules

 probably before 2.0 release
 ---------------------------
-* make it possible to copy or reference webcheck.css
-* make it possible to copy http:.../webcheck.css into place (maybe use scheme system, probably just urllib)
-* make more things configurable
-* maybe generate a list of page parents (this is useful to list proper parent links for problem pages)
-* figure out if we need parents and pageparents
-* make configurable time-out when retrieving a document
 * support for mult-threading (use -t, --threads as option)
-* implement a fix for redirecting stdout and stderr to work properly
-* implement a maximum transfer size for downloading files and things over http
+* find a fix for redirecting stdout and stderr to work properly
+* implement a maximum transfer size for downloading
 * support ftp proxies
 * support proxying https traffic
+* give problems different levels (info, warning, error)
+* option to only force overwrite generated files and leave static files (css, js) alone
+* implement a --html-only option to not copy css and other files
+* do not overwrite (maybe) webcheck.css if it is already there
+* check for missing encoding (report problem)
+* implement parsing of meta http-equiv="refresh" content="0;url=CHILD">
+* in --help output: show default number of redirects to follow

 wishlist
 --------
 * make code for stripping last part of a url (e.g. foo/index.html -> foo/)
 * maybe set referer (configurable)
-* new config file format (if we want a configfile at all)
 * cookies support (maybe)
 * integration with weblint
 * do form checking of crawled pages
 * do spelling checking of crawled pages
 * test w3c conformance of pages (already done a little)
-* maybe store crawled site's data in some format for later processing or continuing after interruption
 * add support for fetching gzipped content to improve performance
 * maybe do http pipelining
-* add a favicon to report
 * make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
 * maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
-* maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
+* maybe output a google sitemap file: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
 * maybe trim titles that are too long
 * maybe check that documents referenced in <img> tags are really images
 * maybe split out plugins in check() and generate() functions
 * make FAQ
-* maybe report unknown/unsupported content in the report
 * use gettext to present output to enable translations of messages and html
-* maybe mark embedded content that is external
+* maybe report on embedded content that is external
 * present an overview of problem pages: "100 problems in 10 pages" (per author)
 * check of email addresses that they are formatted properly and check that host part has an MX record (make it a problem for no record or only an A record)
-* output a csv file with some useful information
 * maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
 * maybe add custom bullets in problem lists, depending on problem type
-* maybe make -b the default
-* prompt for authentication (detecting realms)
 * present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
 * maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
 * give a warning when no encoding is specified, an error if non-ascii characters are used
 * maybe give a warning for urls that have non-ascii characters
-* maybe fetch and store desription and other meta information about page (keywords) (just like author)
-* make it possible to use multiple stylesheets (possibly reference external stylesheets)
+* maybe fetch and store description and other meta information about page (keywords) (just like author)
+* connect to w3c-markup-validator and tidy (and possibly other tools)
+* find out why title does not show up correctly for file?:// urls if they contain non-ascii chars
+* output scan took so long
+* support unicode strings for all string values in link objects (url, status, mimetype, encoding, etc)
+* maybe also serialize robotparsers
+* maybe also add robots.txt to urllist if fetched successfully
+* support CSS encoding: http://www.w3.org/International/questions/qa-css-charset
+* webcheck does not give an error when accessing http://site:443/ ??
+* improve data structures (e.g. see if pop() is faster than pop(0))
+* do not use string for serializing child, embed, anchor and reqanchor as they are already url-encoded
+* automatically strip beginning and trailing spaces from links (but warn though)
+* try python-beautifulsoup
+* there seem to be some issues with generating site maps for ftp directories
+* document serialized file format in manual page (if it is stabilized)
+* look into python-spf to see how DNS queries are done
+* try to use python-chardet in case of missing encoding
@@ -30,7 +30,7 @@ items should be changeble from the command line."""
 import urllib

 # Current version of webcheck.
-VERSION = '1.9.6'
+VERSION = '1.9.7'

 # The homepage of webcheck.
 HOMEPAGE = 'http://ch.tudelft.nl/~arthur/webcheck/'
diff --git a/debian/changelog b/debian/changelog
index 7db3ea8..4d309d9 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,10 +1,28 @@
+webcheck (1.9.7) unstable; urgency=low
+
+  * switch to using python-support and follow new policy (closes: #373902)
+  * upgrade to standards-version 3.7.2 (no changes needed)
+  * site data is now stored to a file while crawling the site, this can be
+    used to resume a crawl with the --continue option and for debugging
+    purposes
+  * implemented checking of link anchors
+  * small improvements to generated reports (favicon included, css fix)
+  * documentation improvements
+  * properly handle float values for --wait
+  * unreachable sites will time out faster
+  * added support for plugins that don't output html
+  * half a dozen other small bugfixes (stability fixes, code cleanups and
+    improvements)
+
+ -- Arthur de Jong <adejong@debian.org> Sun, 2 Jul 2006 23:30:00 +0200
+
 webcheck (1.9.6) unstable; urgency=low

   * SECURITY FIX: a cross-site scripting vulnerability with content in
     the tooltips of generated report was fixed by properly escaping
-    all output
+    all output (CVE-2006-1321)
   * urls are now url encoded into a consistent form, solving some problems
-    with urls with non-ascii characters
+    with urls with non-ascii characters (closes: #348377)
   * no longer remove unreferenced redirects
   * more debugging info in debug mode
   * more fixes for escaping in generated reports and more support for sites
@@ -15,7 +15,7 @@
 .\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 .\" .nh
 .\"
-.TH "webcheck" "1" "Jan 2006" "Version 1.9.6" "User Commands"
+.TH "webcheck" "1" "Jul 2006" "Version 1.9.7" "User Commands"
 .nh
 .SH "NAME"
 webcheck \- website link checker