webcheck - a website link checker

webcheck was originally called linbot, which was developed by
Albert Hopkins (marduk). Versions up to 1.0 were maintained by
Mike W. Meyer, who changed the name to webcheck,
http://www.mired.org/webcheck/. After that Arthur de Jong took up the
work and did a complete rewrite,
http://tiefighter.et.tudelft.nl/~arthur/webcheck/.

Copyright (C) 1998, 1999 Albert Hopkins (marduk)
Copyright (C) 2002 Mike W. Meyer
Copyright (C) 2005 Arthur de Jong

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA


INTRODUCTION
============

webcheck is a website checking tool for webmasters. It crawls a given
website and generates a number of reports. The whole system is
pluggable, so extra reports and checks are easy to add.

Features of webcheck include:

  * view the structure of a site
  * track down broken links
  * find potentially outdated web pages
  * list links pointing to external sites
  * view portfolio of inline images
  * do all this periodically and without user intervention


INSTALLING WEBCHECK
===================

Installation is relatively easy. Note that these installation
instructions are for Unix-like systems; other operating systems may
differ.

1. Unpack the gzipped tar archive. Be sure to add the directory to your
   PYTHONPATH environment variable.

     % tar zxvf webcheck-1.0b6.tar.gz -C /usr/local/lib
     % PYTHONPATH="/usr/local/lib/webcheck:$PYTHONPATH"
     % export PYTHONPATH

2. Add a symbolic link to some place in your PATH.

     % ln -s /usr/local/lib/webcheck/webcheck.py /usr/local/bin/webcheck

3. Edit the config.py file to your liking. Most of the defaults are
   safe. The important ones can be overridden with command-line flags.
   You may want to keep a copy of the original config.py file just in
   case. The config.py options are documented within the file.


RUNNING WEBCHECK
================

Executing webcheck without any command-line arguments will cause it to
print a simple synopsis of its usage and then quit.

     % webcheck
     webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir]
              [-w sec][-d level] url [location]...

Before running webcheck on a site, you need to do a little
preparation.

One thing that webcheck needs is a directory in which to publish its
reports. It is recommended that you choose a directory which is empty
and will only contain webcheck reports. This directory must exist and
be writeable by the user running webcheck before webcheck is run.

     % mkdir /usr/local/apache/share/htdocs/webcheck

The report can be viewed using most web browsers. Browsers that support
frames can initially open the "index.html" file. Browsers that do not
support frames, or users who do not like frames, can initially open the
"navbar.html" file. Note that these are the default filenames for
webcheck and may be changed via the config file.

You should decide beforehand which documents on your site should be
considered "internal" and which should be considered "external".
webcheck defines internal and external documents as follows:

An internal document is a part of your site that you have control of;
it is checked, as are the links that it points to. Basically, an
internal document is one that, if broken, you have the power to fix.

An external document is one that an internal document points to but
that you have no jurisdiction over.
It can also be a document that you have the power to change but that
need not be checked, such as documents pointed to by CGI scripts or
other automated tools such as webcheck.

Your base url is the url pointing to the document that is the top level
of your site. Commonly referred to as the "home page", it is the url
that points to all other urls, either directly or indirectly.

The base url can be on one web server but point to documents on another
server that hosts other internal documents. An example would be a main
server www.someplaceonthenet.com which links to an alternate server
called www2.someplaceonthenet.com. In this case
www2.someplaceonthenet.com would host internal documents even though
your "home page" is on www.someplaceonthenet.com.

That said, you should have a basic idea of what you do and do not want
webcheck to check. Don't be surprised if you do not get it exactly
right the first time.

Okay, you have heard enough and you just want to run the darn thing.
The simplest way to run webcheck is:

     % webcheck http://www.someplaceonthenet.com/

This will first read the robots.txt file at www.someplaceonthenet.com,
if that file exists, and then proceed to examine every link found on
that site except documents denied by robots.txt.

------------------------------------------------------------------------
Running Periodically

webcheck may be safely run periodically or during off-peak hours using
cron or at. It may be safely run unattended. You may want to redirect
webcheck's output to the null device or a log file, or have it emailed
to an account. Consult your operating system manuals for how this can
be done on your system.

------------------------------------------------------------------------
Feedback

If you have a question about webcheck or would like to report a bug, it
helps a lot to include a url where the problem can be found, an HTML
file where the error occurs, or a (small) tar of the site where the
error occurs.
Suggestions for improvements are also welcomed. Patches and code contributions are even better.
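
------------------------------------------------------------------------
Example: internal and external documents

To make the internal/external distinction from RUNNING WEBCHECK more
concrete, here is a minimal Python sketch. It is NOT webcheck's own
code, and webcheck's actual rules (set through its command-line flags
and config.py) may differ. The function name is_internal, its
parameters, and the cgi-bin prefix are assumptions made up for this
illustration; the host names are the ones used in the example above.

```python
from urllib.parse import urlsplit


def is_internal(url, internal_bases, external_prefixes=()):
    """Return True if url would be treated as internal in this sketch.

    Hypothetical helper for illustration only; not part of webcheck.
    internal_bases:    base urls of servers whose documents you control
    external_prefixes: url prefixes you control but do not want checked
    """
    # Documents you could fix but choose not to check count as external.
    if any(url.startswith(prefix) for prefix in external_prefixes):
        return False
    parts = urlsplit(url)
    for base in internal_bases:
        b = urlsplit(base)
        same_host = (parts.scheme == b.scheme
                     and parts.netloc.lower() == b.netloc.lower())
        # An empty path is treated the same as "/".
        if same_host and (parts.path or "/").startswith(b.path or "/"):
            return True
    # Everything else is outside your jurisdiction.
    return False


if __name__ == "__main__":
    bases = ["http://www.someplaceonthenet.com/",
             "http://www2.someplaceonthenet.com/"]
    skipped = ["http://www.someplaceonthenet.com/cgi-bin/"]
    for link in ["http://www.someplaceonthenet.com/about.html",
                 "http://www2.someplaceonthenet.com/docs/",
                 "http://www.someplaceonthenet.com/cgi-bin/search",
                 "http://www.example.org/"]:
        kind = "internal" if is_internal(link, bases, skipped) else "external"
        print(kind, link)
```

In this sketch both servers count as internal, while the cgi-bin urls
are treated as external even though they live on your own server.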