Arthur de Jong

Open Source / Free Software developer

before next release
-------------------
* go over all FIXMEs in code (ftp)
* follow redirects (to a limit) of external sites
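
A minimal sketch of what following redirects up to a limit could look like.
This is not existing webcheck code: the fetch callable, the
TooManyRedirects exception and the default limit of 5 are all assumptions
made for illustration, and the sketch is written for Python 3.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = (301, 302, 303, 307, 308)


class TooManyRedirects(Exception):
    """Raised when a redirect chain loops or exceeds the limit."""


def follow_redirects(url, fetch, max_redirects=5):
    """Return (final_url, chain) after following at most max_redirects.

    fetch is a hypothetical callable that takes a url and returns a
    (status, location) tuple; location may be None or a relative url.
    """
    chain = [url]
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, chain
        # Location headers may be relative, so resolve against the current url.
        url = urljoin(url, location)
        if url in chain:
            raise TooManyRedirects('redirect loop at %s' % url)
        chain.append(url)
    raise TooManyRedirects('more than %d redirects' % max_redirects)
```

Passing the fetch function in keeps the limit logic independent of the
scheme-specific code, so the same loop could serve http and ftp alike.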

probably before 2.0 release
---------------------------
* support for multi-threading (use -t, --threads as option)
* find a fix for redirecting stdout and stderr to work properly
* implement a maximum transfer size for downloading
* support ftp proxies
* support proxying https traffic
* give problems different levels (info, warning, error)
* option to only force overwrite generated files and leave static files (css, js) alone
* implement a --html-only option to not copy css and other files
* do not overwrite (maybe) webcheck.css if it is already there
* check for missing encoding (report problem)
* implement parsing of <meta http-equiv="refresh" content="0;url=CHILD">
* in --help output: show default number of redirects to follow
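
For the meta refresh item above, the content attribute could be split into
a delay and a target url roughly as follows. The function name and the
regular expression are assumptions for illustration only (Python 3); real
pages use more variations than this sketch accepts.

```python
import re

# Matches "<delay>" optionally followed by ";" or "," and a target,
# with or without a leading "url=" (case-insensitive).
_REFRESH_RE = re.compile(
    r'^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*?))?\s*$',
    re.IGNORECASE)


def parse_meta_refresh(content):
    """Return (delay, url) for a refresh content value, or None.

    url is None when the value only specifies a delay, which means the
    page reloads itself.
    """
    match = _REFRESH_RE.match(content)
    if match is None:
        return None
    delay = int(match.group(1))
    # Some pages wrap the target in quotes; strip them if present.
    url = (match.group(2) or '').strip('\'"') or None
    return delay, url
```

A returned url could then be added as a child link of the page, in the
same way as an ordinary anchor.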

wishlist
--------
* make code for stripping last part of a url (e.g. foo/index.html -> foo/)
* maybe set referer (configurable)
* cookies support (maybe)
* integration with weblint
* do form checking of crawled pages
* do spelling checking of crawled pages
* test w3c conformance of pages (already done a little)
* add support for fetching gzipped content to improve performance
* maybe do http pipelining
* make error handling of HTMLParser more robust (maybe send a patch for html parser upstream)
* maybe use this as a html parser: http://www.crummy.com/software/BeautifulSoup/examples.html
* maybe output a google sitemap file: http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
* maybe split out plugins in check() and generate() functions
* make FAQ
* use gettext to present output to enable translations of messages and html
* maybe report on embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check that email addresses are formatted properly and that the host part has an MX record (report a problem if there is no record or only an A record)
* maybe implement news, nntp, gopher and telnet schemes (if there is anyone that wants them)
* maybe add custom bullets in problem lists, depending on problem type
* present age for times long ago in a friendlier format (.. days ago, .. months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* give a warning when no encoding is specified, an error if non-ascii characters are used
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store description and other meta information about page (keywords) (just like author)
* connect to w3c-markup-validator and tidy (and possibly other tools)
* find out why title does not show up correctly for file?:// urls if they contain non-ascii chars
* output how long the scan took
* support unicode strings for all string values in link objects (url, status, mimetype, encoding, etc)
* maybe also serialize robotparsers
* maybe also add robots.txt to urllist if fetched successfully
* support CSS encoding: http://www.w3.org/International/questions/qa-css-charset
* webcheck does not give an error when accessing http://site:443/ ??
* improve data structures (e.g. see if pop() is faster than pop(0))
* do not use string for serializing child, embed, anchor and reqanchor as they are already url-encoded
* automatically strip beginning and trailing spaces from links (but warn though)
* try python-beautifulsoup
* there seem to be some issues with generating site maps for ftp directories
* document serialized file format in manual page (if it is stabilized)
* look into python-spf to see how DNS queries are done
* try to use python-chardet in case of missing encoding
* implement an option to ignore problems on pages (but do consider internal, etc) (e.g. for generated or legacy html)
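
For the friendlier age format mentioned above (.. days ago, .. months ago,
.. years ago), a rough sketch could look like this. The function name, the
"today" wording and the 30-day month and 365-day year approximations are
assumptions, not existing webcheck behaviour (Python 3).

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def friendly_age(timestamp, now=None):
    """Describe how long ago timestamp (seconds since epoch) was."""
    if now is None:
        now = time.time()
    # Timestamps in the future are treated as "today".
    days = max(0, int(now - timestamp)) // SECONDS_PER_DAY
    if days < 1:
        return 'today'
    # Largest unit first; months and years are approximations.
    for unit, length in (('year', 365), ('month', 30)):
        count = days // length
        if count >= 1:
            return '%d %s%s ago' % (count, unit, '' if count == 1 else 's')
    return '%d day%s ago' % (days, '' if days == 1 else 's')
```

If gettext support is added, the strings here would need plural-aware
translation rather than the simple "s" suffix.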