[Webcheck]
------------------------------------------------------------------------
Installing Webcheck
Installation is relatively easy. Note that these installation instructions
are for Unix-like systems. Other operating systems may differ.
1. Unpack the gzipped tar archive and add the resulting directory to your
PYTHONPATH environment variable.
$ tar zxvf webcheck-1.0b6.tar.gz -C /usr/local/lib
$ PYTHONPATH="/usr/local/lib/webcheck:$PYTHONPATH"
$ export PYTHONPATH
2. Add a symbolic link somewhere in your PATH.
$ ln -s /usr/local/lib/webcheck/webcheck.py /usr/local/bin/webcheck
3. Edit the config.py file to your liking. Most of the defaults are
safe. The important ones can be overridden with command-line flags. You
may want to keep a copy of the original config.py file just in case.
The config.py options are documented within the file.
------------------------------------------------------------------------
Running Webcheck
It is simple to run Webcheck.
Executing Webcheck without any command-line arguments will cause it to give a
simple synopsis of its usage and then quit.
$ webcheck
webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location]...
Before running Webcheck on a site, you need to do a little preparation.
One thing that Webcheck needs is a directory in which to publish its reports.
It is recommended that you choose a directory which is empty and will only
contain Webcheck reports. This directory must exist and be writable by the
user running Webcheck before Webcheck is run.
$ mkdir /usr/local/apache/share/htdocs/webcheck
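The directory requirement above can be verified ahead of time. The following
is a minimal Python sketch, not part of Webcheck. The function name
report_dir_ready is hypothetical, and it only tests that the path is an
existing directory that the current user can write to.

```python
import os


def report_dir_ready(path):
    """Return True if path is an existing directory the current user can write to."""
    # os.access tests permissions using the real user id of this process.
    return os.path.isdir(path) and os.access(path, os.W_OK)
```

For example, report_dir_ready("/usr/local/apache/share/htdocs/webcheck")
should be true once the mkdir above has succeeded, provided the user running
the check has write permission there.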
The report can be viewed using most web browsers. Browsers that support
frames can initially open the "index.html" file. Browsers that do not support
frames, or users who do not like frames, can initially open the "navbar.html"
file. Note that these are the default filenames for Webcheck and may be
changed via the config file.
It should be decided beforehand which documents on your site should be
considered "internal" and which should be considered "external". Webcheck
defines internal and external documents as such:
An internal document is a part of your site that you have control of.
Webcheck checks it, as well as the links that it points to. Basically an
internal document is one that, if broken, you have the power to fix.
An external document is one that an internal document points to but you have
no jurisdiction over. It can also be a document that you have the power to
change, but need not be checked, such as documents pointed to by CGI scripts
or other automated tools such as Webcheck.
Your base url is the url pointing to the document that is the top level of
your site. Commonly referred to as the "home page", it is the url that
points to all other urls, either directly or indirectly. The base url can be
on one web server but point to documents on another server that hosts other
internal documents. An example would be a main server
www.someplaceonthenet.com in which there may be links to an alternate server
called www2.someplaceonthenet.com. In this case www2.someplaceonthenet.com
would host internal documents even though your "home page" is on
www.someplaceonthenet.com.
That said, you should have a basic idea of what you do and do not want
Webcheck to check. Don't be surprised if you do not get it exactly right the
first time. Also, consider using a robots.txt file, as explained at
http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html.
Currently Webcheck identifies itself as User-Agent: Webcheck.
You can allow Webcheck to search a directory but restrict other bots, for
example, like this:
User-agent: *
Disallow: /

User-agent: Webcheck
Allow: /
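To see how a robots.txt parser is likely to read these rules, here is a small
sketch using Python's standard urllib.robotparser module. It is not
Webcheck's own parser, which may behave differently. "OtherBot" and the
/docs/ path are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, with a blank line between the two records.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Webcheck
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.someplaceonthenet.com/docs/"
# The record naming Webcheck takes precedence over the catch-all record.
webcheck_allowed = parser.can_fetch("Webcheck", url)
# Any other agent falls back to the "*" record, which disallows everything.
other_allowed = parser.can_fetch("OtherBot", url)
print(webcheck_allowed, other_allowed)
```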
Okay, you have heard enough and you just want to run the darn thing. The
simplest way to run Webcheck is:
$ webcheck http://www.someplaceonthenet.com/
This will first read the robots.txt file at www.someplaceonthenet.com, if
that file exists, and then proceed to examine every link on that site except
documents denied by robots.txt.
The exact usage for webcheck is given below.
------------------------------------------------------------------------
Synopsis
webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location[:port]]...
-x regex   Use this option to tell Webcheck to consider any url matching
           regex to be external. Uses Perl-style regular expressions. Can
           be used multiple times.
-y regex   Like the -x flag, except that Webcheck will not check the link
           matched by regex at all, whereas -x will check the link but not
           its children. Uses Perl-style regular expressions. Can be used
           multiple times.
-l url     Use url for the logo image on all reports. The url should point
           to a valid image.
-b         Base urls only. Tells Webcheck to consider any url that does not
           start with the base url to be external. For example, if you run
           webcheck -b http://www.someplaceonthenet.com/~somebody/foo.html
           then http://www.someplaceonthenet.com/~somebody/misc/index.html
           will be considered internal whereas
           http://www.someplaceonthenet.com/ will be considered external.
-a         Avoid external links. Normally if Webcheck is examining an HTML
           page and it finds a link that points to an external document, it
           will check to see if that external document exists. This flag
           disables that action. External links will not be checked.
-q         Quiet. Do not print out the progress as Webcheck traverses a site
           (equivalent to -d 0).
-o dir     Output directory. Use to specify the directory where Webcheck
           will dump its reports. The default is the current directory or
           as specified by config.py. If this directory does not exist it
           will be created for you (if possible).
-r depth   Redirect depth. The number of redirects Webcheck should follow
           when following a link. 0 means follow all redirects.
-w secs    Wait secs seconds between link checks. Usually Webcheck will
           process a url and immediately move on to the next. However, on
           some loaded systems it may be desirable to have Webcheck pause
           between requests. This option can be set to any non-negative
           number.
-d level   Set debug level to level. For programmer-level debugging use a
           level > 1.
url        The base url. Webcheck checks this link first, then all the
           links it points to on down the "tree".
location   This specifies the hosts pointed to that are to be considered
           internal. By default Webcheck only considers URLs pointing to
           the host of the base url to be internal. However, if your site
           resides on multiple servers, use this parameter to tell Webcheck
           what other servers should be considered internal. May be used
           multiple times, but must follow url.
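The way the base url, the location arguments, -b and -x combine to decide
whether a document is internal can be sketched roughly as follows. This is an
illustration of the rules as described above, not Webcheck's actual code. The
function name is_internal and its parameters are hypothetical, and treating
"starts with the base url" as "starts with the base url's directory" is an
assumption drawn from the -b example.

```python
import re
from urllib.parse import urlsplit


def is_internal(url, base_url, locations=(), base_only=False, external_patterns=()):
    """Classify url as internal (True) or external (False)."""
    # -x: a URL matching any of the patterns is treated as external.
    if any(re.search(pattern, url) for pattern in external_patterns):
        return False
    base = urlsplit(base_url)
    if base_only:
        # -b: internal only if the URL starts with the base URL's directory.
        directory = base.path[: base.path.rfind("/") + 1] or "/"
        prefix = f"{base.scheme}://{base.netloc}{directory}"
        return url.startswith(prefix)
    # Default: internal if the host is the base URL's host or a listed location.
    return urlsplit(url).netloc in {base.netloc, *locations}
```

With the -b example above, a base url of
http://www.someplaceonthenet.com/~somebody/foo.html makes
http://www.someplaceonthenet.com/~somebody/misc/index.html internal and
http://www.someplaceonthenet.com/ external.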
------------------------------------------------------------------------
Examples
Here are some examples of running Webcheck.
$ webcheck http://manson.ddns.org/ -x /webcheck starship.skyport.net
$ webcheck -o /stats/altavista/ http://altavista.digital.com/
$ webcheck -o ~/Lang/Python/webcheck -b -l http://manson.ddns.org/images/marduk.gif http://manson.ddns.org/~marduk/
------------------------------------------------------------------------
Running Periodically
Webcheck may be safely run unattended, either periodically or during off-peak
hours, using cron or at. You may want to redirect Webcheck's output to the
null device or a log file, or have it emailed to an account. Consult your
operating system manuals for how this can be done on your system.
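If you would rather not rely on shell redirection, a small wrapper script can
capture the output in a log file. The sketch below is one possible approach,
not something shipped with Webcheck. The function name run_and_log and the
log path in the usage note are hypothetical.

```python
import subprocess


def run_and_log(command, log_path):
    """Run command, appending its stdout and stderr to log_path.

    Returns the exit status of the command.
    """
    with open(log_path, "a") as log:
        # stderr is merged into stdout so both end up in the same log file.
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A scheduled job could then call, for example,
run_and_log(["webcheck", "-q", "-o", "/usr/local/apache/share/htdocs/webcheck",
"http://www.someplaceonthenet.com/"], "/var/log/webcheck.log"), assuming
webcheck is on the PATH of the user running the job.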
------------------------------------------------------------------------
Feedback
When asking a question about Webcheck or reporting a bug, it helps a
lot to include a url where the problem can be found, an HTML file where
the error occurs, or a (small) tar of the site where the error occurs.
Suggestions for improvements are also welcome. Patches and code
contributions are even better.