TODO



before next release
-------------------
* go over all FIXMEs in code (ftp)
* check that sleep actually sleeps for the advertised time
* follow redirects (to a limit) of external sites
* check that scheme names are clean so that we do not import strange python modules
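
A minimal sketch of the scheme name check from the last item, assuming scheme handlers are loaded as python modules named after the scheme. The function name is_clean_scheme and the letters-and-digits-only rule are assumptions made for illustration (stricter than RFC 3986, which also allows "+", "-" and "."), not existing webcheck code:

```python
import re

# Hypothetical rule: an ASCII letter followed by ASCII letters or digits.
# This rejects dots, slashes, whitespace and anything else that could
# turn a scheme name into an unexpected module path.
_CLEAN_SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9]*')


def is_clean_scheme(scheme):
    """Return True if scheme looks safe to use as a module name."""
    return _CLEAN_SCHEME.fullmatch(scheme) is not None
```

A caller would test the scheme of each url before trying to import a handler for it and treat anything that fails the check as an unsupported scheme.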

probably before 2.0 release
---------------------------
* make it possible to copy or reference webcheck.css
* make it possible to copy http:.../webcheck.css into place (maybe use scheme system, probably just urllib)
* make more things configurable
* maybe generate a list of page parents (this is useful to list proper parent links for problem pages)
* figure out if we need parents and pageparents
* make the time-out for retrieving a document configurable
* support for multi-threading (use -t, --threads as option)
* implement a fix for redirecting stdout and stderr to work properly
* implement a maximum transfer size for downloading files and things over http
* support ftp proxies
* support proxying https traffic
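
One possible shape for the maximum transfer size item, as a sketch only. It assumes the document is available as a file-like object with a read() method (for example an http response); read_limited, max_bytes and chunk_size are hypothetical names, not part of webcheck:

```python
def read_limited(fileobj, max_bytes, chunk_size=8192):
    """Read at most max_bytes from fileobj.

    Returns a (data, truncated) tuple, where truncated is True if
    more data was available than the limit allowed.
    """
    chunks = []
    remaining = max_bytes
    while remaining > 0:
        chunk = fileobj.read(min(chunk_size, remaining))
        if not chunk:
            # end of data reached before hitting the limit
            return b''.join(chunks), False
        chunks.append(chunk)
        remaining -= len(chunk)
    # limit reached: probe one more byte to see whether data was cut off
    truncated = bool(fileobj.read(1))
    return b''.join(chunks), truncated
```

Reading in chunks keeps memory use bounded by the limit, and the truncated flag lets the report mention that a document was only partially fetched.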

wishlist
--------
* write code for stripping the last part of a url (e.g. foo/index.html -> foo/)
* maybe set referer (configurable)
* new config file format (if we want a configfile at all)
* cookies support (maybe)
* integration with weblint
* do form checking of crawled pages
* do spell checking of crawled pages
* test w3c conformance of pages (already done a little)
* maybe store crawled site's data in some format for later processing or continuing after interruption
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
* add a favicon to report
* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
* maybe have a way to output google sitemap files: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
* maybe split out plugins in check() and generate() functions
* make FAQ
* maybe report unknown/unsupported content in the report
* use gettext to present output to enable translations of messages and html
* maybe mark embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check that email addresses are formatted properly and that the host part has an MX record (make it a problem if there is no record or only an A record)
* output a csv file with some useful information
* maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
* maybe add custom bullets in problem lists, depending on problem type
* maybe make -b the default
* prompt for authentication (detecting realms)
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* give a warning when no encoding is specified, an error if non-ascii characters are used
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store description and other meta information about page (keywords) (just like author)
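
The url stripping item at the top of the wishlist (foo/index.html -> foo/) could look roughly like the sketch below. It assumes Python 3's urllib.parse and assumes that the query string and fragment should be dropped along with the last path component; strip_last_part is a hypothetical name:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_last_part(url):
    """Return url with everything after the last '/' in its path removed."""
    parts = urlsplit(url)
    # keep the path up to and including the last slash (empty if none)
    directory = parts.path[:parts.path.rfind('/') + 1]
    # assumption: query and fragment belong to the stripped part
    return urlunsplit((parts.scheme, parts.netloc, directory, '', ''))
```

Splitting the url first ensures that slashes in the query string or in the "//" after the scheme are never mistaken for path separators.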