before next release
-------------------
* go over all FIXMEs in code (ftp)
* follow redirects of external sites, up to a limit (see the sketch after this list)
* add -U, --user-agent=AGENT to identify as AGENT instead of webcheck VERSION
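
A minimal sketch of what the redirect limit could look like, assuming plain
http.client is used for fetching (webcheck's real fetch code differs; the
function name and the limit below are made up for illustration):

    # follow HTTP redirects for url, giving up after max_redirects hops
    import http.client
    import urllib.parse

    MAX_REDIRECTS = 5  # assumed limit

    def fetch_with_redirect_limit(url, max_redirects=MAX_REDIRECTS):
        """Fetch url following at most max_redirects redirects.
        Returns (final_url, body) or raises IOError."""
        for _ in range(max_redirects + 1):
            parts = urllib.parse.urlsplit(url)
            if parts.scheme == 'https':
                conn = http.client.HTTPSConnection(parts.netloc)
            else:
                conn = http.client.HTTPConnection(parts.netloc)
            path = parts.path or '/'
            if parts.query:
                path += '?' + parts.query
            conn.request('GET', path)
            response = conn.getresponse()
            if response.status in (301, 302, 303, 307, 308):
                location = response.getheader('Location')
                if not location:
                    raise IOError('redirect without Location header')
                # resolve relative Location values against the current url
                url = urllib.parse.urljoin(url, location)
                conn.close()
                continue
            return url, response.read()
        raise IOError('more than %d redirects' % max_redirects)
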
probably before 3.0 release
---------------------------
* support for multi-threading (use -t, --threads as option; see the sketch after this list)
* implement a maximum transfer size for downloading
* support ftp proxies
* support proxying https traffic
* add an option to force overwriting only generated files, leaving static files (css, js) alone
* implement a --html-only option to not copy css and other files
* check for a missing character encoding (report it as a problem)
* for FTP: don't fail if SIZE is not allowed
* record the parameters webcheck was started with
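
For the multi-threading item a simple worker pool over a url queue would
probably do; a rough sketch (fetch_url and the default thread count are
placeholders, and webcheck's shared link data would still need locking):

    # crawl urls with a fixed number of worker threads
    import threading
    import queue

    def crawl_parallel(urls, fetch_url, num_threads=4):
        """Fetch every url in urls using num_threads worker threads."""
        work = queue.Queue()
        for url in urls:
            work.put(url)

        def worker():
            while True:
                try:
                    url = work.get_nowait()
                except queue.Empty:
                    return
                fetch_url(url)

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
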
wishlist
--------
* add code for stripping the last part of a url, e.g. foo/index.html -> foo/ (see the sketches after this list)
* integration with weblint
* do form checking of crawled pages
* do spell checking of crawled pages
* add support for fetching gzipped content to improve performance
* maybe output a google sitemap file: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
* use gettext for output to enable translation of messages and html
* maybe report on embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check that email addresses are formatted properly and that the host part has an MX record; report a problem if there is no record or only an A record (see the sketches after this list)
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store the description and other meta information (keywords) about a page, just like the author
* output how long the scan took
* maybe also add robots.txt to urllist if fetched successfully
* support CSS encoding: http://www.w3.org/International/questions/qa-css-charset
* webcheck does not give an error when accessing http://site:443/ ??
* look into python-spf to see how DNS queries are done
* implement an option to ignore problems on certain pages while still considering them internal, etc. (e.g. for generated or legacy html)
* add support for robots meta tag: http://www.robotstxt.org/wc/meta-user.html
* only report multiple definitions of a single anchor once
* warn if URL contains unencoded characters
* see section 6 of rfc3986.txt for URL comparison (esp. 6.2.2.)
* implement paging for huge reports
* output timing information on scan (e.g. scan took 30 minutes)
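
Sketch for the url stripping item above (the list of index document names is
an assumption, not something webcheck defines):

    # strip a trailing index document from a url, e.g. foo/index.html -> foo/
    import posixpath
    import urllib.parse

    INDEX_NAMES = ('index.html', 'index.htm')  # assumed index documents

    def strip_index(url):
        """Return url without a trailing index document, otherwise unchanged."""
        parts = urllib.parse.urlsplit(url)
        directory, name = posixpath.split(parts.path)
        if name in INDEX_NAMES:
            if not directory.endswith('/'):
                directory += '/'
            return urllib.parse.urlunsplit(parts._replace(path=directory))
        return url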
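
Sketch for the email address check above, assuming the third-party dnspython
package is used for the MX lookup (call names are those of dnspython 2.x; the
regular expression is deliberately loose):

    # report malformed addresses and host parts without an MX record
    import re
    import dns.resolver  # third-party: dnspython

    _ADDRESS_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

    def check_email(address):
        """Return None if the address looks fine, otherwise a problem string."""
        if not _ADDRESS_RE.match(address):
            return 'malformed email address'
        host = address.rsplit('@', 1)[1]
        try:
            dns.resolver.resolve(host, 'MX')
        except dns.resolver.NXDOMAIN:
            return 'host %s does not exist' % host
        except dns.resolver.NoAnswer:
            # no MX record (possibly only an A record) is a problem per the item above
            return 'no MX record for %s' % host
        return None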