   webcheck - a website link checker

   webcheck was originally named linbot and was developed by Albert
   Hopkins (marduk) <marduk@python.net>.

   Versions up to 1.0 were maintained by Mike W. Meyer <mwm@mired.org>, who
   changed the name to webcheck, http://www.mired.org/webcheck/.

   After that Arthur de Jong <arthur@ch.tudelft.nl> took up the work and did
   a complete rewrite, http://ch.tudelft.nl/~arthur/webcheck/.

   Copyright (C) 1998, 1999 Albert Hopkins (marduk)
   Copyright (C) 2002 Mike W. Meyer
   Copyright (C) 2005, 2006 Arthur de Jong

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA

   The files produced as output from the software do not automatically fall
   under the copyright of the software, unless explicitly stated otherwise.

   webcheck includes the FancyTooltips javascript library to display readable
   tooltips. FancyTooltips is distributed under the MIT license and has the
   following copyright notices (see fancytooltips/fancytooltips.js for
   details):

   Copyright (C) 2005 Victor Kulinski
   Copyright (C) 2003 Dunstan Orchard, Ethan Marcotte, Mark Wubben
   Copyright (C) 2003 Stuart Langridge, Paul McLanahan, Peter Janes,
                 Brad Choate

INTRODUCTION
============

webcheck is a website checking tool for webmasters. It crawls a given website
and generates a number of reports. The whole system is pluggable, which makes
it easy to add extra reports and checks.

Features of webcheck include:
 * view the structure of a site
 * track down broken links
 * find potentially outdated web pages
 * list links pointing to external sites
 * do all this periodically and without user intervention

webcheck is written in Python and is developed on a Debian system with Python
2.3. Earlier versions of Python are not tested, but Python 2.4 is tested
occasionally. Patches to support a wider range of Python releases are welcome
(provided they are not too intrusive).

INSTALLING WEBCHECK
===================

Installation is relatively easy. These installation instructions are for
Unix-like systems. Other operating systems may differ.

  1. Unpack the tarball in the location where you want it installed, for
     example /usr/local/lib/python/site-packages or /opt.
     % tar -xvzf webcheck-1.9.?.tar.gz

  2. Add a symbolic link to some place in your PATH.
     % ln -s /opt/webcheck-1.9.?/webcheck.py /usr/local/bin/webcheck

  3. Put the manual page in the MANPATH.
     % ln -s /opt/webcheck-1.9.?/webcheck.1 /usr/local/man/man1/webcheck.1

RUNNING WEBCHECK
================

Executing webcheck without any command line arguments will cause it to give a
simple synopsis of its usage and then quit. Giving it the --help option will
cause it to print out all command line options.

Webcheck writes its reports to an output directory. This is the current
directory by default. Running webcheck with something like:

  % webcheck -o /tmp/myreport http://www.example.com/

should crawl the site and write the reports to the /tmp/myreport directory.
The reports are simple HTML pages that should look fine with most modern
browsers.
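
For unattended runs (for example from a scheduler) the invocation above can
be wrapped in a small script. The following is only a minimal sketch: it
assumes webcheck is available on the PATH as described under INSTALLING
WEBCHECK, it uses only the -o option shown above, and the helper names
build_command and run_webcheck are hypothetical and not part of webcheck. It
is written for a current Python 3 rather than the Python 2.3 that webcheck
itself targets.

```python
import subprocess


def build_command(url, output_dir, executable="webcheck"):
    """Return the argument list for: webcheck -o OUTPUT_DIR URL."""
    return [executable, "-o", output_dir, url]


def run_webcheck(url, output_dir):
    """Run webcheck and return its exit status.

    Assumption: a 'webcheck' executable can be found on the PATH.
    """
    return subprocess.call(build_command(url, output_dir))


# Example call (not executed here because it needs webcheck installed):
#   run_webcheck("http://www.example.com/", "/tmp/myreport")
```

Passing the arguments as a list avoids shell quoting problems when the URL
contains characters such as & or ?.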

For more information on webcheck usage and command line options see the
webcheck manual page. If the manual page is not in the MANPATH you can
probably open the manual with something like:
  % man -l /opt/webcheck-1.9.?/webcheck.1

FEEDBACK AND BUG REPORTS
========================

Questions about webcheck and bug reports are welcome. For bug reports it
helps a lot to include a URL where the problem can be found, an HTML file
where the error occurs or a (small) tar of the site where the error occurs.
Suggestions for improvements are also welcome. Patches and code
contributions are even better.

Please send all your reports to Arthur de Jong <arthur@ch.tudelft.nl>.

WEBCHECK DESIGN OVERVIEW
========================

Webcheck has grown and been refactored over time, so there is not really a
single design. The functions are grouped in modules according to their
purpose. The graph below presents a simple overview of the modules and the
order in which their functions are called.

webcheck.py                 - main program, command line parsing, etc
 \- config.py               - configuration settings (imported from most other
 |                            modules)
 \- debugio.py              - functions for printing output (imported from
 |                            most other modules)
 \- crawler.py              - module with loop and logic for traversing a
 |   |                        website and storing all the information about
 |   |                        the website that is used later
 |   \- schemes/__init__.py - front-end module to make available scheme
 |   |   |                    modules for fetching content
 |   |   \- schemes/*.py    - per scheme (ftp/file/http) a module
 |   \- parsers/__init__.py - front-end module to handle parsing of content
 |       \- parsers/*.py    - parser modules for content (html and dummy css
 |                            currently)
 \- plugins/__init__.py     - front-end module for plugin modules, this calls
     |                        all configured plugins and has some helper
     |                        functions for plugins
     \- plugins/*.py        - per report one plugin that does some specific
                              checking and outputs some html code
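
The flow in the graph above (crawl first, then hand the collected
information to each plugin) can be illustrated with a small stand-alone
sketch. This is not webcheck's actual code: the names SITE, crawl,
badlinks_plugin, sitemap_plugin and generate are hypothetical, fetching is
replaced by a dictionary lookup, and links are found with a simple regular
expression instead of a real HTML parser.

```python
import re

# Hypothetical in-memory site: URL -> HTML content. A missing key stands in
# for a fetch that failed (the role the scheme modules play in the graph).
SITE = {
    "http://www.example.com/": (
        '<a href="http://www.example.com/about.html">About</a>'
        '<a href="http://www.example.com/missing.html">Gone</a>'
    ),
    "http://www.example.com/about.html": (
        '<a href="http://www.example.com/">Home</a>'
    ),
}

# Simplification: only matches double-quoted absolute href values.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(start, fetch):
    """Follow links from start and return {url: links, or None if not fetched}."""
    seen = {}
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        content = fetch(url)
        if content is None:
            seen[url] = None  # remember the failure for the reports
            continue
        links = HREF_RE.findall(content)
        seen[url] = links
        queue.extend(links)
    return seen


def badlinks_plugin(site):
    """Report URLs that could not be fetched."""
    return sorted(url for url, links in site.items() if links is None)


def sitemap_plugin(site):
    """Report URLs that were fetched successfully."""
    return sorted(url for url, links in site.items() if links is not None)


# One entry per report; adding a report means adding one function here.
PLUGINS = [("broken links", badlinks_plugin), ("site map", sitemap_plugin)]


def generate(site, plugins=PLUGINS):
    """Call every configured plugin on the crawled data."""
    return {title: plugin(site) for title, plugin in plugins}


reports = generate(crawl("http://www.example.com/", SITE.get))
```

Keeping the crawl result separate from the plugins means each report only
reads the stored data, so new checks can be added without touching the
traversal loop.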