webcheck - a website link checker
webcheck was originally called linbot, which was developed by Albert
Hopkins (marduk) <marduk@python.net>.
Versions up till 1.0 were maintained by Mike W. Meyer <mwm@mired.org> who
changed the name to webcheck, http://www.mired.org/webcheck/.
After that Arthur de Jong <arthur@tiefighter.et.tudelft.nl>
took up the work and did a complete rewrite,
http://tiefighter.et.tudelft.nl/~arthur/webcheck/.
Copyright (C) 1998, 1999 Albert Hopkins (marduk) <marduk@python.net>
Copyright (C) 2002 Mike W. Meyer <mwm@mired.org>
Copyright (C) 2005 Arthur de Jong <arthur@tiefighter.et.tudelft.nl>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
INTRODUCTION
============
webcheck is a website checking tool for webmasters. It crawls a given website
and generates a number of reports. The whole system is pluggable, so extra
reports and checks are easy to add.
Features of webcheck include:
* view the structure of a site
* track down broken links
* find potentially outdated web pages
* list links pointing to external sites
* view portfolio of inline images
* do all this periodically and without user intervention
INSTALLING WEBCHECK
===================
Installation is relatively easy. Note these installation instructions are for
Unix-like systems. Other operating systems may differ.
1. Unpack the gzipped tar archive. Be sure to add the directory to your
PYTHONPATH environment variable.
% tar zxvf webcheck-1.0b6.tar.gz -C /usr/local/lib
% PYTHONPATH="/usr/local/lib/webcheck:$PYTHONPATH"
% export PYTHONPATH
2. Add a symbolic link to some place in your PATH
% ln -s /usr/local/lib/webcheck/webcheck.py /usr/local/bin/webcheck
3. Edit the config.py file to your choosing. Most of the defaults are safe.
The important ones can be overridden with command-line flags. You may
want to keep a copy of the original config.py file just in case. The
config.py options are documented within the file.
RUNNING WEBCHECK
================
Executing webcheck without any command-line arguments will cause it to give a
simple synopsis of its usage and then quit.
% webcheck
webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location]...
Before running webcheck on a site, you need to do a little preparation.
One thing that webcheck needs is a directory in which to publish its reports.
It is recommended that you choose a directory which is empty and will only
contain webcheck reports. This directory must exist and be writeable by the
user running webcheck before webcheck is run.
% mkdir /usr/local/apache/share/htdocs/webcheck
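The requirement above (the directory must exist and be writable before the run) can be checked up front. The following is only a sketch in Python, not part of webcheck itself; the function name report_dir_is_usable is made up for this illustration.

```python
import os


def report_dir_is_usable(path):
    """Return True if path is an existing directory the current user can write to.

    Hypothetical helper for illustration; webcheck does not provide it.
    """
    # W_OK: may create files in the directory; X_OK: may enter it.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

Calling report_dir_is_usable("/usr/local/apache/share/htdocs/webcheck") before starting a run would catch a missing or read-only report directory early.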
The report can be viewed using most web browsers. Browsers using frames can
initially open the "index.html" file. Browsers not supporting frames or
users who do not like frames can initially open the "navbar.html" file. Note
these are default filenames for webcheck and may be changed via the config
file.
It should be decided beforehand which documents on your site should be
considered "internal" and which should be considered "external". webcheck
defines internal and external documents as such:
An internal document is a part of your site that you have control of and
want checked, along with the links that it points to. Basically an internal
document is one that, if broken, you have the power to fix.
An external document is one that an internal document points to but you have
no jurisdiction over. It can also be a document that you have the power to
change, but need not be checked, such as documents pointed to by CGI scripts
or other automated tools such as webcheck.
Your base url is the url pointing to the document that is the top level of
your site. Commonly referred to as the "home page", it is the url that
points to all other urls, either directly or indirectly. The base url can be
on one web server but point to documents on another server that hosts other
internal documents. An example would be a main server
www.someplaceonthenet.com in which there may be links to an alternate server
called www2.someplaceonthenet.com. In this case www2.someplaceonthenet.com
would host internal documents even though your "home page" is on
www.someplaceonthenet.com.
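To make the distinction concrete, here is a rough Python sketch that labels a url as internal or external by its host name alone. It is a simplification and not how webcheck itself decides: the INTERNAL_HOSTS set and the classify function are hypothetical, and a host-based test cannot express the case of a document you control but do not want checked.

```python
from urllib.parse import urlsplit

# Hypothetical set of hosts treated as internal; not read from webcheck's config.
INTERNAL_HOSTS = {"www.someplaceonthenet.com", "www2.someplaceonthenet.com"}


def classify(url, internal_hosts=INTERNAL_HOSTS):
    """Return "internal" if the url's host is in internal_hosts, else "external"."""
    # urlsplit lower-cases the host name; it is None for urls without a host.
    host = urlsplit(url).hostname
    return "internal" if host in internal_hosts else "external"
```

With these assumptions, a page on www2.someplaceonthenet.com is classified as internal even though the home page lives on www.someplaceonthenet.com.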
That said, you should have a basic idea of what you do and do not want
webcheck to check. Don't be surprised if you do not get it exactly right the
first time.
Okay you have heard enough and you just want to run the darn thing. The
simplest way to run webcheck is:
% webcheck http://www.someplaceonthenet.com/
This will first read the robots.txt file at www.someplaceonthenet.com, if
that file exists, and then proceed to examine every link on that site except
documents denied by robots.txt.
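The effect of robots.txt on a crawl can be illustrated with Python's standard urllib.robotparser module. This is a sketch of the general idea only, not webcheck's own code: the robots.txt content, the allowed_urls function, and the "webcheck" user agent string are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, urls, user_agent="webcheck"):
    """Return the urls that the given robots.txt text permits user_agent to fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

Given the sample rules, a url under /private/ is dropped from the list while other pages on the same site are kept.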
------------------------------------------------------------------------
Running Periodically
webcheck may be safely run unattended, either periodically or during off-peak
hours, using cron or at. You may want to redirect webcheck's output to the
null device or a log file, or have it emailed to an account. Consult your
operating system manuals for how this can be done on your system.
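One possible way to capture the output of an unattended run is a small wrapper script that a scheduler can call. The sketch below is an assumption-laden example, not something shipped with webcheck: run_and_log is a made-up name, and the log path in the comment is only a placeholder.

```python
import subprocess


def run_and_log(command, log_path):
    """Run command, append its stdout and stderr to log_path, return the exit status."""
    with open(log_path, "a") as log:
        # Send both output streams to the same log file.
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)


# Example call (hypothetical log location):
# run_and_log(["webcheck", "http://www.someplaceonthenet.com/"], "/var/log/webcheck.log")
```

Appending rather than overwriting keeps the output of earlier runs, so the log may need rotating on a busy schedule.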
------------------------------------------------------------------------
Feedback
If you have any questions about webcheck or would like to report a
bug, it helps a lot to include a url where the problem can be found,
an HTML file where the error occurs or a (small) tar of the site where
the error occurs. Suggestions for improvements are also welcomed.
Patches and code contributions are even better.