before next release
-------------------
* go over all FIXMEs in code (ftp)
* check that sleep actually sleeps for the advertised time
* follow redirects (to a limit) of external sites
* check that scheme names are clean so that we do not import strange python modules

probably before 2.0 release
---------------------------
* make it possible to copy or reference webcheck.css
* make it possible to copy http:.../webcheck.css into place (maybe use scheme system, probably just urllib)
* make more things configurable
* maybe generate a list of page parents (this is useful to list proper parent links for problem pages)
* figure out if we need both parents and pageparents
* make the time-out for retrieving a document configurable
* support for multi-threading (use -t, --threads as option)
* implement a fix so that redirecting stdout and stderr works properly
* implement a maximum transfer size for downloading files and things over http
* support ftp proxies
* support proxying https traffic

wishlist
--------
* make code for stripping last part of a url (e.g. foo/index.html -> foo/)
* maybe set referer (configurable)
* new config file format (if we want a configfile at all)
* cookies support (maybe)
* integration with weblint
* do form checking of crawled pages
* do spelling checking of crawled pages
* test w3c conformance of pages (already done a little)
* maybe store crawled site's data in some format for later processing or continuing after interruption
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
* add a favicon to report
* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
* maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in tags are really images
* maybe split out plugins in check() and generate() functions
* make FAQ
* maybe report unknown/unsupported content in the report
* use gettext to present output to enable translations of messages and html
* maybe mark embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check that email addresses are formatted properly and that the host part has an MX record (make it a problem for no record or only an A record)
* output a csv file with some useful information
* maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
* maybe add custom bullets in problem lists, depending on problem type
* maybe make -b the default
* prompt for authentication (detecting realms)
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* give a warning when no encoding is specified, an error if non-ascii characters are used
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store description and other meta information about page (keywords) (just like author)
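
A possible starting point for the wishlist item on stripping the last part of a url (foo/index.html -> foo/). This is only a sketch: the function name strip_last_part is hypothetical, it uses Python 3's urllib.parse rather than whatever the code base currently imports, and dropping the query string and fragment is an assumption about the wanted behaviour.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_last_part(url):
    """Return url with its last path component removed.

    Hypothetical helper, not part of the existing code.
    """
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    # Keep everything up to and including the final slash of the path;
    # if the path has no slash at all, this leaves an empty path.
    path = path[:path.rfind('/') + 1]
    # Assumption: query string and fragment belong to the stripped
    # document, so they are dropped as well.
    return urlunsplit((scheme, netloc, path, '', ''))
```

A url that already ends in a slash is returned unchanged, so the function can be applied repeatedly without walking further up the tree.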