.\" Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 Arthur de Jong
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
.\" .nh
.\"
.TH "webcheck" "1" "Sep 2010" "Version 1.10.4" "User Commands"
.nh
.SH "NAME"
webcheck \- website link checker

.SH "SYNOPSIS"
.B webcheck
.RI [ OPTION ]...
.I URL

.SH "DESCRIPTION"
\fBwebcheck\fP will check the document at the specified URL for links to other
documents, follow these links recursively and generate an HTML report.

.TP
.BI "\-i, \-\-internal=" "PATTERN"
Mark URLs matching the
.I PATTERN
(perl\-type regular expression) as an internal link.
Can be used multiple times.
Note that the PATTERN is matched against the full URL.
URLs matching this PATTERN will be considered internal, even if they match one
of the \-\-external PATTERNs.

.TP
.BI "\-x, \-\-external=" "PATTERN"
Mark URLs matching the
.I PATTERN
(perl\-type regular expression) as an external link.
Can be used multiple times.
Note that the PATTERN is matched against the full URL.

.TP
.BI "\-y, \-\-yank=" "PATTERN"
Do not check URLs matching the
.I PATTERN
(perl\-type regular expression).
This is similar to the \-x option, except that webcheck will not
check the matched link at all, whereas with \-x the link itself is
checked but the links it contains are not.
Can be used multiple times.
Note that the PATTERN is matched against the full URL.

.TP
.B \-b, \-\-base\-only
Consider any URL not starting with the base URL to be external.
For example, if you run
.ft B
    webcheck \-b http://www.example.com/foo
.ft R
.br
then http://www.example.com/foo/bar will be
considered internal whereas http://www.example.com/ will
be considered external.
By default all the pages on the site will be considered internal.

.TP
.B \-a, \-\-avoid\-external
Avoid external links.
Normally if webcheck is examining an HTML page
and it finds a link that points to an external document, it will
check to see if that external document exists.
This flag disables that action.

.TP
.B \-\-ignore\-robots
Do not retrieve and parse robots.txt files.
By default robots.txt files are retrieved and honored.
If you are sure you want to ignore and override the webmaster's
decision this option can be used.
.br
For more information on robots.txt handling see the NOTES section below.

.TP
.B \-q, \-\-quiet, \-\-silent
Do not print out progress as webcheck traverses a site.

.TP
.B \-d, \-\-debug
Print debugging information while crawling the site.
This option is mainly useful for developers.

.TP
.BI "\-o, \-\-output=" "DIRECTORY"
Output directory. Specifies the directory where webcheck will
write its reports. The default is the current directory, or the
directory specified in config.py. If this directory does not exist it will
be created for you (if possible).

.TP
.BI "\-c, \-\-continue"
Try to continue from a previous run. When using this option webcheck
will look for a webcheck.dat in the output directory.
This file is read to restore the state from the previous run.
This allows webcheck to continue a previously interrupted run.
When this option is used, the \-\-internal, \-\-external and \-\-yank
options will be ignored as well as any URL arguments.
The \-\-base\-only and \-\-avoid\-external options should be the same
as the previous run.
.br
Note that this option is experimental and its semantics may change
with coming releases (especially in relation to other options).
Also note that the stored files are not guaranteed to be compatible
between releases.

.TP
.B \-f, \-\-force
Overwrite files without asking.
This option is required for running webcheck non-interactively.

.TP
.BI "\-r, \-\-redirects=" "N"
Redirect depth. The number of redirects webcheck should follow when
following a link. A value of 0 means that all redirects are followed.

.TP
.BI "\-l, \-\-max\-depth=" "N" ", \-\-levels=" "N"
Recursion depth. The maximum number of links to follow from the base URLs.
By default there is no limit on the depth.

.TP
.BI "\-w, \-\-wait=" "SECONDS"
Wait
.I SECONDS
between document retrievals. Normally webcheck processes a URL
and immediately moves on to the next. On heavily loaded
systems, however, it may be desirable to have webcheck pause between requests.
This option can be set to any non-negative number.

.TP
.B \-v, \-\-version
Show version of program.

.TP
.B \-h, \-\-help
Show short summary of options.

.SH "URL CLASSES"

URLs are divided into two classes:

.B Internal
URLs are retrieved and the retrieved content is checked for problems.
Also, the retrieved item is searched for links
and these links are followed.

.B External
URLs are only retrieved to test whether they are valid and to gather some
basic information from them (title, size, content-type, etc).
The retrieved items are not inspected for links to other items.

Apart from their class, URLs can also be considered
.B yanked
(as specified with the \-\-yank or \-\-avoid\-external options).
Yanked URLs can be either internal or external and are not retrieved or
checked at all.
URLs of unsupported schemes are also considered yanked.

.SH "EXAMPLES"

Check the site www.example.com but consider any path with "/webcheck" in it
to be external.
.ft B
    webcheck http://www.example.com/ \-x /webcheck
.ft R
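
The following invocations are illustrative sketches that only combine the
options described above; the paths, the pattern and the directory name
are placeholders.

Check only the pages below http://www.example.com/docs/ and do not retrieve
any URL containing "/private".
.ft B
    webcheck \-b \-y /private http://www.example.com/docs/
.ft R

Run without prompting or progress output, wait 2 seconds between retrievals
and write the report to the directory report.
.ft B
    webcheck \-f \-q \-w 2 \-o report http://www.example.com/
.ft R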

.SH "NOTES"

When checking internal URLs webcheck honors the robots.txt file, identifying
itself as user-agent webcheck. Disallowed links are not checked at all, as
if the \-y option had been specified for those URLs. To allow webcheck to crawl
parts of a site that are disallowed for other robots, use something like:
.ft B
    User-agent: *
    Disallow: /foo

    User-agent: webcheck
    Allow: /foo
.ft R

.SH "ENVIRONMENT"

.TP
.BI <scheme>_proxy
Proxy URL for <scheme>.
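.br
As a sketch, with <scheme> replaced by http and a placeholder proxy address,
a Bourne-style shell can set the variable for a single run like this:
.ft B
    http_proxy=http://proxy.example.com:3128/ webcheck http://www.example.com/
.ft R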

.SH "REPORTING BUGS"

Bug reports should be sent to the mailing list <webcheck-users@lists.arthurdejong.org>.
More information on reporting bugs can be found on the webcheck homepage:
.br
http://arthurdejong.org/webcheck/

.SH "COPYRIGHT"
Copyright \(co 1998, 1999 Albert Hopkins (marduk)
.br
Copyright \(co 2002 Mike W. Meyer
.br
Copyright \(co 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 Arthur de Jong
.br
webcheck is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
.br
The files produced as output from the software do not automatically fall
under the copyright of the software, unless explicitly stated otherwise.