Arthur de Jong

Open Source / Free Software developer

author Arthur de Jong <arthur@arthurdejong.org> 2005-03-29 14:08:05 +0200
committer Arthur de Jong <arthur@arthurdejong.org> 2005-03-29 14:08:05 +0200
commit 61846f4d01e6e9a15b78f4c82e80fa6e711c3cd8
tree 4248c91ac2c78ddb33a1cfe160c26234c3e70e7c
parent 4be06ae90467f04335d031ec8e78525167941d45
import of release 1.01.0
git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@2 86f53f14-5ff3-0310-afe5-9b438ce3f40c
-rw-r--r-- BUGS 16
-rw-r--r-- CHANGES 136
-rw-r--r-- COPYING 340
-rw-r--r-- CREDITS 16
-rw-r--r-- HISTORY 567
-rw-r--r-- HISTORY.linbot 262
-rw-r--r-- INSTALL 180
-rw-r--r-- README 41
-rw-r--r-- TODO 24
-rw-r--r-- config.py 157
-rw-r--r-- contrib/plugins/about.py 47
-rw-r--r-- debugio.py 34
-rw-r--r-- htmlparse.py 129
-rw-r--r-- httpcodes.py 58
-rw-r--r-- myUrlLib.py 303
-rw-r--r-- plugins/__init__.py 17
-rw-r--r-- plugins/badlinks.py 56
-rw-r--r-- plugins/external.py 40
-rw-r--r-- plugins/images.py 58
-rw-r--r-- plugins/notchkd.py 46
-rw-r--r-- plugins/notitles.py 47
-rw-r--r-- plugins/problems.py 53
-rw-r--r-- plugins/rptlib.py 290
-rw-r--r-- plugins/sitemap.py 79
-rw-r--r-- plugins/slow.py 61
-rw-r--r-- plugins/whatsnew.py 49
-rw-r--r-- plugins/whatsold.py 50
-rw-r--r-- robotparser.py 103
-rw-r--r-- schemes/__init__.py 18
-rw-r--r-- schemes/filelink.py 57
-rw-r--r-- schemes/ftplink.py 125
-rw-r--r-- schemes/httplink.py 167
-rw-r--r-- version.py 24
-rw-r--r-- webcheck.css 126
-rwxr-xr-x webcheck.py 145
-rwxr-xr-x webcheck.sh 4
36 files changed, 3925 insertions, 0 deletions
diff --git a/BUGS b/BUGS
new file mode 100644
index 0000000..e668091
--- /dev/null
+++ b/BUGS
@@ -0,0 +1,16 @@
+Bug reports should be sent to the webcheck mailing list. If you absolutely
+cannot subscribe to the mailing list then you may report bugs to
+mwm@mired.org. See INSTALL for details.
+
+Known bugs:
+
+I tried webcheck on a site that used FrontPage publishing on IIS and
+IIS reports error 406 whenever webcheck attempts to retrieve the HEADers
+for a document in one of the "underscore" directories. I'm not yet
+sure why this happens, but I doubt it's really webcheck's fault. I
+might just have to code a way around it. In the meantime, you can
+usually yank these URLs with -y '/_' or something similar.
+
+Some (IIS?) servers seem to be reporting -1 as an HTTP status code.
+I'm not sure what that means or what to do about it.
+
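The `-y '/_'` workaround mentioned in BUGS can be pictured with a small sketch. It assumes that a yank pattern is a regular expression searched for anywhere in the URL; the BUGS text does not say so, and the function name `is_yanked` and the example URLs are hypothetical, not taken from webcheck itself.

```python
import re

def is_yanked(url, yank_patterns):
    """Return True if any yank pattern matches somewhere in the URL.

    Assumption: patterns are regular expressions, as '/_' would be.
    """
    return any(re.search(pattern, url) for pattern in yank_patterns)

# Hypothetical URLs of a FrontPage-published site.
urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/_vti_bin/shtml.exe",
    "http://www.example.com/_private/form.htm",
]

# Keep only the URLs that are not yanked by the '/_' pattern.
kept = [u for u in urls if not is_yanked(u, ["/_"])]
print(kept)
```

Under this assumption, only URLs with a path segment starting with an underscore are dropped; an underscore elsewhere in a name (for example `a_b.html`) is left alone.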
diff --git a/CHANGES b/CHANGES
new file mode 100644
index 0000000..f4eb872
--- /dev/null
+++ b/CHANGES
@@ -0,0 +1,136 @@
+Changes in webcheck 1.0
+
++ Don't send accept headers, as they weren't valid.
+
++ WARN_OLD_VERSION no longer works, until I decide what to do about it.
+
++ Name changed to webcheck.
+
++ Fixed typos in INSTALL.
+
++ Changes so it works with python 2.0.
+
+Changes in 1.0b10
+
+b Fixed bug when server redirects to a document in robots.txt (does not show
+ up as broken (hopefully))
+
++ Filename mangling in filelink.py to help OS/2 (and Win32) (Patch submitted
+ by Steffen Siebert <siebert@logware.de>)
+
++ Added WARN_OLD_VERSION config.py option. If this option is set to true
+ (the default) Linbot will check its version number and the version
+ numbers of its plugins against a global registry on the Net. If it
+ finds that a version is not the latest, it will print a warning on the
+ reports along with a link you can follow to download the latest version.
+ I think it's neat. You might find it annoying.
+
++ Added preliminary support for authenticating proxies, though it does not
+ work correctly yet.
+
++ Added -r (redirect depth) and REDIRECT_DEPTH option in config.py to indicate
+ the number of redirects Linbot should follow when following a link. Thanks
+ to Andrea Glorioso <sama@intercity.it> for the patch.
+
++ Added debugio module that handles debugging and I/O
+
++ Added -q (quiet option). Use it to suppress output
+
++ Added -d (debug) option and DEBUG_LEVEL variable in config.py for debugging
+
++ added version module and removed __version__ and __author__ from all the
+ modules (except plugins).
+
+b Fixed bug in Linbot using putrequest() instead of putheader() when requesting
+ header information. Thanks to Andrea Glorioso <sama@intercity.it> for
+ fixing this glitch (and Seth Chaiklin <seth@psy.au.dk> for noticing).
+
+Changes in 1.0b9
+
++ If you use the -o command-line option or the OUTPUT_DIR config file option
+ and the directory does not exist, linbot will create it for you (provided
+ that it has the correct permissions, etc.) Thanks to Andrea Glorioso
+ <sama@intercity.it> for this feature.
+
++ Added a CREDITS file and probably left a lot of people out. If you think
+ you should be in it let me know (marduk@python.net).
+
+b Linbot will now report to the server that it can accept any MIME type (found
+ in mimetypes.py). This should fix the "406: No acceptable objects found"
+ error that some servers report.
+
+b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests
+ as well as GET requests.
+
+Changes in 1.0b8
+
+b Fixed bug where no images were reported for documents having 0 links.
+ If you don't know what this means it probably wasn't a problem for you.
+
+b Fixed code that was messing with arguments passed via -x and -y and caused
+ unexpected results and/or errors.
+
+b -b flag should work this time (for real)
+
+b Cosmetic changes (reports didn't look the way I thought they should in IE4,
+ and may still not, as I haven't had a chance to check it yet)
+
+b Linbot won't follow infinite redirects (currently hardcoded to max of 5
+ redirects per document)
+
+Changes in 1.0b6
+
++ Minor change in ftplink.py should allow better ftp link checking
+
++ You can now press CTRL-C (or whatever your operating system supports) to break
+ out of a linbot run. However, the work linbot does is not saved (yet).
+
+b Fixed problem when server redirects a URL to itself. This fix seems to work
+ for most servers I've tried but there are a few more out there that I need to
+ take a look at.
+
+b Fixed bug that caused linbot to not check for yanked URLs
+
++ Added -l command-line option. Usage: -l <url> where <url> is a url pointing
+ to an image to be used as the report's logo.
+
+b "patched" strings.py so that it can better parse html files created in
+ Windows/DOS (I think).
+
++ Made report LOGO a link to the base url
+
++ httplink does not HEAD a redirected URL if it is already in the link list
+ (performance improvement)
+
+- Removed LOGO_ALT from config.py
+
++ Changed my email address to marduk@python.net. The official home page of
+ Linbot will probably also change with the next release so stay tuned.
+
+Changes in 1.0b5 (from 1.0b4)
+
++ Added a contrib directory. Right now it just contains the about plugin. Other
+ plugins will be included if people contribute them. Also, the man page will
+ return once I have updated it. Those ugly buttons are obsolete.
+
++ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
+ of Netscape browsers (so I hear) and 2) I don't have to document to put
+ linbot.css in the output directory since it grabs it from starship 8*)
+
+b Handling of error for when robots.txt cannot be retrieved.
+
++ Malformed urls are trapped (sorry, I had that commented out)
+
+b FTP link handling is totally rewritten. Fortunately it shouldn't crash anymore.
+ Unfortunately it doesn't really work reliably and probably never will. See
+ README.ftp for details.
+
+b Two bugs in HTTP proxy handling made it almost completely unusable, though
+ conveniently seemed to cancel each other out when I was testing.
+
+b Too many files error on large sites should be fixed. Thanks to Andrew Kuchling
+ et al for suggestions.
+
+b Bug when some servers erroneously report (or don't report) Content-Length header
+ fixed.
+
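The redirect entries in CHANGES (a cap of 5 redirects in 1.0b8, the `-r` / `REDIRECT_DEPTH` option in 1.0b10, and the self-redirect fix in 1.0b6) describe a bounded redirect walk. The following is a minimal sketch of that idea only, not the code in httplink.py: the function `resolve_redirects` and the use of a plain dict in place of real HTTP responses are assumptions made for illustration.

```python
def resolve_redirects(url, redirects, max_depth=5):
    """Follow redirects at most max_depth times.

    `redirects` is a hypothetical dict mapping a URL to its redirect target;
    a URL that is not a key is treated as a final document.
    Returns (last_url, ok). ok is False when the chain loops back on itself
    or when a redirect still remains after max_depth hops.
    """
    seen = {url}
    for _ in range(max_depth):
        target = redirects.get(url)
        if target is None:
            return url, True      # reached a document that does not redirect
        if target in seen:
            return target, False  # loop, including a URL redirecting to itself
        seen.add(target)
        url = target
    # Depth budget used up: fine only if the last URL is not itself a redirect.
    return url, url not in redirects

# A short chain resolves; a self-redirect is reported as a problem.
print(resolve_redirects("a", {"a": "b", "b": "c"}))
print(resolve_redirects("a", {"a": "a"}))
```

Tracking the visited URLs in a set catches loops immediately, while the depth cap bounds chains that never repeat a URL.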
diff --git a/COPYING b/COPYING
new file mode 100644
index 0000000..60549be
--- /dev/null
+++ b/COPYING
@@ -0,0 +1,340 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) 19yy <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) 19yy name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/CREDITS b/CREDITS
new file mode 100644
index 0000000..fb21a32
--- /dev/null
+++ b/CREDITS
@@ -0,0 +1,16 @@
+The following entities have contributed to webcheck in some fashion. Please
+know that this is not a complete list. Most likely I've forgotten someone.
+If you would like to be included/removed from this list, send email to
+mwm@mired.org. Thank you to all the contributors to webcheck.
+
+
+ Contributors
+ ----------------------------------------
+
+Mike Meyer mwm@mired.org
+Marduk marduk@python.net
+Oleg Broytmann phd2@earthlink.net
+Andrea Glorioso sama@intercity.it
+Andrew Kuchling akuchlin@cnri.reston.va.us
+Jean Pierre LeJacq jplejacq@quoininc.com
+Steffen Siebert siebert@logware.de
diff --git a/HISTORY b/HISTORY
new file mode 100644
index 0000000..7203fe4
--- /dev/null
+++ b/HISTORY
@@ -0,0 +1,567 @@
+//depot/mwm/webcheck/BUGS
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/CHANGES
+... #3 change 2087 edit on 2002/04/02 by mwm@guru (text)
+
+ Note that we don't send accept headers any more, and fix the URL for
+ linkbot in README.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/COPYING
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/CREDITS
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/HISTORY.linbot
+... #1 change 2185 add on 2002/05/04 by mwm@guru (text)
+
+ Add the linbot history file.
+
+//depot/mwm/webcheck/INSTALL
+... #3 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #2 change 2079 edit on 2002/04/02 by mwm@guru (text)
+
+ Apply the patches from the FreeBSD port.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/README
+... #3 change 2087 edit on 2002/04/02 by mwm@guru (text)
+
+ Note that we don't send accept headers any more, and fix the URL for
+ linkbot in README.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/TODO
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/config.py
+... #4 change 2083 edit on 2002/04/02 by mwm@guru (text)
+
+ Change config.py to match my own version.
+
+... #3 change 2082 edit on 2002/04/02 by mwm@guru (text)
+
+ Move the stylesheet and LOGO references from marduk's - now
+ non-existent - site.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/debugio.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/htmlparse.py
+... #4 change 2156 edit on 2002/04/28 by mwm@guru (text)
+
+ Deal with ambiguous tabs in the source.
+
+... #3 change 2090 edit on 2002/04/02 by mwm@guru (text)
+
+ Fix "import *"'s that caused 2.2 to choke.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/httpcodes.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/myUrlLib.py
+... #5 change 2156 edit on 2002/04/28 by mwm@guru (text)
+
+ Deal with ambiguous tabs in the source.
+
+... #4 change 2085 edit on 2002/04/02 by mwm@guru (text)
+
+ Change the "import *"'s that were causing problems to import just the
+ one name we needed.
+
+... #3 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #2 change 2079 edit on 2002/04/02 by mwm@guru (text)
+
+ Apply the patches from the FreeBSD port.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/robotparser.py
+... #3 change 2156 edit on 2002/04/28 by mwm@guru (text)
+
+ Deal with ambiguous tabs in the source.
+
+... #2 change 2079 edit on 2002/04/02 by mwm@guru (text)
+
+ Apply the patches from the FreeBSD port.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/version.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/webcheck.css
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2080 branch on 2002/04/02 by mwm@guru (text)
+
+ Stage one of the rename - fix the file names.
+
+... ... branch from //depot/mwm/webcheck/linbot.css#1
+//depot/mwm/webcheck/webcheck.py
+... #4 change 2156 edit on 2002/04/28 by mwm@guru (xtext)
+
+ Deal with ambiguous tabs in the source.
+
+... #3 change 2091 edit on 2002/04/02 by mwm@guru (xtext)
+
+ Change one last "import *".
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (xtext)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2080 branch on 2002/04/02 by mwm@guru (xtext)
+
+ Stage one of the rename - fix the file names.
+
+... ... branch from //depot/mwm/webcheck/linbot.py#1
+//depot/mwm/webcheck/webcheck.sh
+... #4 change 2157 edit on 2002/04/28 by mwm@guru (xtext)
+
+ Add the directory the python binary resides in to the PATH.
+
+... #3 change 2084 edit on 2002/04/02 by mwm@guru (xtext)
+
+ Fix the program name to be src, not external-src.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (xtext)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2080 branch on 2002/04/02 by mwm@guru (xtext)
+
+ Stage one of the rename - fix the file names.
+
+... ... branch from //depot/mwm/webcheck/linbot.sh#1
+//depot/mwm/webcheck/plugins/__init__.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/badlinks.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/external.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/images.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/notchkd.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/notitles.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/problems.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/rptlib.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/sitemap.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/slow.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/whatsnew.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/plugins/whatsold.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/schemes/__init__.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/schemes/filelink.py
+... #3 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #2 change 2079 edit on 2002/04/02 by mwm@guru (text)
+
+ Apply the patches from the FreeBSD port.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/schemes/ftplink.py
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
+//depot/mwm/webcheck/schemes/httplink.py
+... #5 change 2186 edit on 2002/05/04 by mwm@guru (text)
+
+ Revert the accept headers - leave them out.
+
+... #4 change 2184 edit on 2002/05/04 by mwm@guru (text)
+
+ Put back the accept headers. The problem appeared to be elsewhere.
+
+... #3 change 2086 edit on 2002/04/02 by mwm@guru (text)
+
+ Rip out the accept: headers. They are making some servers choke for
+ some reason.
+
+... #2 change 2081 edit on 2002/04/02 by mwm@guru (text)
+
+ Rename, phase 2 - change internal references from "linbot" to "webcheck".
+
+ Also add my copyright, standardize the GNU copyright header, and rip out
+ the CVS cruft that we're not going to use.
+
+ Document this in changes.
+
+... #1 change 2078 add on 2002/04/02 by mwm@guru (text)
+
+ Check in linbot with the webcheck name, in preparation for the
+ rename for my distribution.
+
diff --git a/HISTORY.linbot b/HISTORY.linbot
new file mode 100644
index 0000000..24ffc3a
--- /dev/null
+++ b/HISTORY.linbot
@@ -0,0 +1,262 @@
+# $Log: debugio.py,v $
+# Revision 1.1 1999/03/11 02:29:50 marduk
+# Added debugio module to handle debugging and IO
+
+# $Log: htmlparse.py,v $
+# Revision 1.5 1999/03/11 04:51:25 marduk
+# Added version module.
+#
+# Revision 1.4 1999/03/11 02:29:50 marduk
+# Added debugio module to handle debugging and IO
+#
+# Revision 1.3 1999/02/21 16:39:24 marduk
+# 1.0b8
+#
+# Revision 1.2 1999/01/10 01:01:44 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1 1998/12/23 02:12:15 marduk
+# This is 1.0b1
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+
+
+# $Log: httpcodes.py,v $
+# Revision 1.2 1999/01/10 01:01:44 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+
+# $Log: __init__.py,v $
+# Revision 1.2 1999/01/10 01:02:02 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: badlinks.py,v $
+# Revision 1.2 1999/01/10 01:02:02 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: external.py,v $
+# Revision 1.2 1999/01/10 01:02:02 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: images.py,v $
+# Revision 1.3 1999/02/21 16:39:43 marduk
+# 1.0b8
+#
+# Revision 1.2 1999/01/10 01:02:03 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: notchkd.py,v $
+# Revision 1.2 1999/01/10 01:02:03 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: notitles.py,v $
+# Revision 1.2 1999/01/10 01:02:03 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: problems.py,v $
+# Revision 1.2 1999/01/10 01:02:03 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: rptlib.py,v $
+# Revision 1.8 1999/03/12 04:56:05 marduk
+# Added ability to warn of old versions of linbot/plugins
+# Added patch to enable file:// to work with OS/2
+#
+# Revision 1.7 1999/03/11 04:51:28 marduk
+# Added version module.
+#
+# Revision 1.6 1999/03/11 02:30:00 marduk
+# Added debugio module to handle debugging and IO
+#
+# Revision 1.5 1999/02/26 01:12:15 marduk
+# -o option: create the directory if it does not exist.
+#
+# Revision 1.4 1999/02/21 16:39:44 marduk
+# 1.0b8
+#
+# Revision 1.3 1999/01/10 01:02:04 marduk
+# Linbot 1.0b6
+#
+# Revision 1.2 1998/12/31 03:49:08 marduk
+# This is linbot 1.0b5. See CHANGES
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: sitemap.py,v $
+# Revision 1.2 1999/01/10 01:02:04 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: slow.py,v $
+# Revision 1.2 1999/01/10 01:02:04 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: whatsnew.py,v $
+# Revision 1.2 1999/01/10 01:02:05 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: whatsold.py,v $
+# Revision 1.2 1999/01/10 01:02:05 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Log: __init__.py,v $
+# Revision 1.3 1999/03/11 04:51:32 marduk
+# Added version module.
+#
+# Revision 1.2 1999/01/10 01:02:15 marduk
+# Linbot 1.0b6
+#
+# Revision 1.1.1.1 1998/12/20 23:17:13 marduk
+# initial 1.0
+#
+
+# $Log: filelink.py,v $
+# Revision 1.5 1999/03/12 04:56:07 marduk
+# Added ability to warn of old versions of linbot/plugins
+# Added patch to enable file:// to work with OS/2
+#
+# Revision 1.4 1999/03/11 04:51:32 marduk
+# Added version module.
+#
+# Revision 1.3 1999/01/10 01:02:15 marduk
+# Linbot 1.0b6
+#
+# Revision 1.2 1998/12/31 03:49:14 marduk
+# This is linbot 1.0b5. See CHANGES
+#
+# Revision 1.1.1.1 1998/12/20 23:17:13 marduk
+# initial 1.0
+#
+
+# $Log: ftplink.py,v $
+# Revision 1.6 1999/03/11 04:51:32 marduk
+# Added version module.
+#
+# Revision 1.5 1999/03/11 02:30:05 marduk
+# Added debugio module to handle debugging and IO
+#
+# Revision 1.4 1999/01/10 01:02:15 marduk
+# Linbot 1.0b6
+#
+# Revision 1.3 1998/12/31 03:49:14 marduk
+# This is linbot 1.0b5. See CHANGES
+#
+# Revision 1.2 1998/12/23 07:38:35 marduk
+# Fix bug: NameError: myUrlLib
+#
+# Revision 1.1.1.1 1998/12/20 23:17:14 marduk
+# initial 1.0
+#
+
+# $Log: httplink.py,v $
+# Revision 1.11 1999/03/14 19:24:25 marduk
+# Fixed bug when server redirects to a document in robots.txt
+#
+# Revision 1.10 1999/03/12 01:48:21 marduk
+# Preliminary support for authenticating proxies added.
+# Added Andrea's redirect-depth patch.
+#
+# Revision 1.9 1999/03/11 04:51:33 marduk
+# Added version module.
+#
+# Revision 1.8 1999/03/11 02:30:05 marduk
+# Added debugio module to handle debugging and IO
+#
+# Revision 1.7 1999/02/27 16:31:35 marduk
+# Use putheader("User-Agent:"...) instead of putrequest(...)
+#
+# Revision 1.6 1999/02/26 01:55:08 marduk
+# ACCEPTS all mime types in mimetypes.py
+#
+# Identify itself as Linbot x.x in HEAD requests
+#
+# Revision 1.5 1999/02/21 16:39:51 marduk
+# 1.0b8
+#
+# Revision 1.4 1999/01/10 21:58:19 marduk
+# Changed self.* to link.* @line 86 in httplink.py
+#
+# Revision 1.3 1999/01/10 01:02:16 marduk
+# Linbot 1.0b6
+#
+# Revision 1.2 1998/12/31 03:49:14 marduk
+# This is linbot 1.0b5. See CHANGES
+#
+# Revision 1.1.1.1 1998/12/20 23:17:12 marduk
+# initial 1.0
+#
+
+# $Id: linbot.py,v 1.8 1999/03/12 04:56:01 marduk Exp $
+# Revision 1.8 1999/03/12 04:56:01 marduk
+# Added ability to warn of old versions of linbot/plugins
+# Added patch to enable file:// to work with OS/2
+#
+# Revision 1.7 1999/03/12 01:48:14 marduk
+# Preliminary support for authenticating proxies added.
+# Added Andrea's redirect-depth patch.
+#
+# Revision 1.6 1999/03/11 04:51:25 marduk
+# Added version module.
+#
+# Revision 1.5 1999/03/11 02:29:51 marduk
+# Added debugio module to handle debugging and IO
+#
+# Revision 1.4 1999/02/21 16:39:25 marduk
+# 1.0b8
+#
+# Revision 1.3 1999/01/10 01:01:44 marduk
+# Linbot 1.0b6
+#
+# Revision 1.2 1998/12/23 07:34:59 marduk
+# Fixed problem in linbot.py "import parser"
+#
+# Revision 1.1.1.2 1998/12/20 23:27:50 marduk
+# This is pre 1.0, I hope
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..5182d22
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,180 @@
+ [Webcheck]
+
+
+ ------------------------------------------------------------------------
+
+Installing Webcheck
+
+Installation is relatively easy. Note these installation instructions are
+for Unix-like systems. Other operating systems may differ.
+
+ 1. Unpack the gzipped tarchive. Be sure to add the directory to your
+ PYTHONPATH environment variable.
+
+ $ tar zxvf webcheck-1.0b6.tar.gz -C /usr/local/lib
+ $ PYTHONPATH="/usr/local/lib/webcheck:$PYTHONPATH"
+ $ export PYTHONPATH
+
+ 2. Add a symbolic link to some place in your PATH
+
+ $ ln -s /usr/local/lib/webcheck/webcheck.py /usr/local/bin/webcheck
+
+ 3. Edit the config.py file to your choosing. Most of the defaults are
+ safe. The important ones can be overridden with command-line flags. You
+ may want to keep a copy of the original config.py file just in case.
+ The config.py options are documented within the file.
+
+ ------------------------------------------------------------------------
+
+Running Webcheck
+
+It is simple to run Webcheck.
+
+Executing Webcheck without any command-line arguments will cause it to give a
+simple synopsis of its usage and then quit.
+
+$ webcheck
+webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location]...
+
+Before running Webcheck on a site, you need to do a little preparation.
+
+One thing that Webcheck needs is a directory in which to publish its reports.
+It is recommended that you choose a directory which is empty and will only
+contain webcheck reports. This directory must exist and be writeable by the
+user running webcheck before webcheck is run.
+
+$ mkdir /usr/local/apache/share/htdocs/webcheck
+
+The report can be viewed using most web browsers. Browsers using frames can
+initially open the "index.html" file. Browsers not supporting frames or
+users who do not like frames can initially open the "navbar.html" file. Note
+these are default filenames for Webcheck and may be changed via the config
+file.
+
+It should be decided beforehand which documents on your site should be
+considered "internal" and which should be considered "external". Webcheck
+defines internal and external documents as such:
+
+An internal document is a part of your site that you have control of and
+that is checked, along with the links that it points to. Basically an internal
+document is one that, if broken, you have the power to fix.
+
+An external document is one that an internal document points to but you have
+no jurisdiction over. It can also be a document that you have the power to
+change, but need not be checked, such as documents pointed to by CGI scripts
+or other automated tools such as Webcheck.
+
+Your base url is the url pointing to the document that is the top level of
+your site. Commonly referred to as the "home page", it is the url that
+points to all other urls, either directly or indirectly. The base url can be
+on one web server but point to documents on another server that hosts other
+internal documents. An example would be a main server
+www.someplaceonthenet.com in which there may be links to an alternate server
+called www2.someplaceonthenet.com. In this case www2.someplaceonthenet.com
+would host internal documents even though your "home page" is on
+www.someplaceonthenet.com.
+
+That said, you should have a basic idea of what you do and do not want
+Webcheck to check. Don't be surprised if you do not get it exactly right the
+first time. Also, consider using a robots.txt file, as explained at
+http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html.
+Currently Webcheck identifies itself as User-Agent: Webcheck.
+
+You can allow Webcheck to search a directory but restrict other bots, for
+example, like this:
+
+User-agent: *
+Disallow: /
+
+User-agent: Webcheck
+Allow: /
+
+Okay you have heard enough and you just want to run the darn thing. The
+simplest way to run Webcheck is:
+
+$ webcheck http://www.someplaceonthenet.com/
+
+This will first read the robots.txt file at www.someplaceonthenet.com and
+then proceed to examine every link pointed to on that site except documents
+denied by robots.txt if that file exists.
+
+The exact usage for webcheck is given below.
+
+ ------------------------------------------------------------------------
+
+Synopsis
+
+ webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w
+ sec][-d level] url [location[:port]]...
+
+
+
+ -x regex Use this option to tell Webcheck to consider any url matching regex
+ to be external. Uses perl-type regular expressions. Can be used
+ multiple times.
+ -y regex Like the -x flag, though this option will cause Webcheck to not
+ check the link matched by regex whereas -x will check the link but
+ not its children. Uses perl-type regular expressions. Can be used
+ multiple times.
+ -l url Use url for the logo image on all reports. The url should point to
+ a valid image.
+ -b Base urls only. Tells Webcheck to consider any url that does not
+ start with the base url to be external. For example, if
+ you run webcheck -b
+ http://www.someplaceonthenet.com/~somebody/foo.html then
+ http://www.someplaceonthenet.com/~somebody/misc/index.html will be
+ considered internal whereas http://www.someplaceonthenet.com/ will
+ be considered external.
+ -a Avoid external links. Normally if Webcheck is examining an HTML page
+ and it finds a link that points to an external document, it will
+ check to see if that external document exists. This flag disables
+ that action. External links will not be checked.
+ -q Quiet. Do not print out the progress as Webcheck traverses a site
+ (equivalent to -d 0).
+ -o dir Output directory. Use to specify the directory where Webcheck will
+ dump its reports. The default is the current directory or as
+ specified by config.py. If this directory does not exist it will
+ be created for you (if possible).
+ -r depth Redirect depth. The number of redirects Webcheck should follow when
+ following a link. 0 implies follow all redirects.
+ -w secs Wait secs between link checking. Usually Webcheck will process a url
+ and immediately move on to the next. However on some loaded
+ systems it may be desirable to have Webcheck pause between requests.
+ This option can be set to any non-negative number.
+ -d level Set debug level to level. For programmer-level debugging use a
+ level > 1.
+ url The base url. Webcheck checks this link first, then all the links it
+ points to on down the "tree".
+ location This specifies the hosts pointed to that are to be considered
+ internal. By default Webcheck only considers URLs pointing to the
+ host of the base url to be internal. However if your site resides
+ on multiple servers use this parameter to tell Webcheck what other
+ servers should be considered internal. May be used multiple times,
+ but must follow url.
+ ------------------------------------------------------------------------
+
+Examples
+
+Here are some examples of running Webcheck.
+
+$ webcheck http://manson.ddns.org/ -x /webcheck starship.skyport.net
+$ webcheck -o /stats/altavista/ http://altavista.digital.com/
+$ webcheck -o ~/Lang/Python/webcheck -b -l http://manson.ddns.org/images/marduk.gif http://manson.ddns.org/~marduk/
+ ------------------------------------------------------------------------
+
+Running Periodically
+
+Webcheck may be safely run periodically or on off-peak hours using cron or at.
+It may be safely run unattended. You may want to redirect Webcheck's output to
+the null device, log file, or have it emailed to an account. Consult your
+operating system manuals for how this can be done on your system.
+
+ ------------------------------------------------------------------------
+
+Feedback
+
+If you have any questions about Webcheck or would like to report a
+bug, it helps a lot to include a url where the problem can be found,
+an HTML file where the error occurs or a (small) tar of the site where
+the error occurs. Suggestions for improvements are also welcomed.
+Patches and code contributions are even better.
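
The robots.txt rules quoted in the INSTALL text above (all bots shut out, Webcheck let in) can be checked with a short sketch. This is only an illustration: it assumes the `urllib.robotparser` module from the Python 3 standard library, not the `robotparser.py` bundled with this release, and the `allowed` helper and `ROBOTS_TXT` constant are hypothetical names that do not appear in webcheck.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in INSTALL: every bot is denied except Webcheck.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Webcheck
Allow: /
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://www.someplaceonthenet.com/index.html"
    print("Webcheck:", allowed("Webcheck", url))
    print("SomeOtherBot:", allowed("SomeOtherBot", url))
```

Under these rules the sketch reports that Webcheck may fetch the page and any other agent may not. How the bundled parser treats the `Allow:` line is not confirmed by the text above.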
diff --git a/README b/README
new file mode 100644
index 0000000..802a264
--- /dev/null
+++ b/README
@@ -0,0 +1,41 @@
+
+Webcheck is the amazing Site Management Tool for webmasters. Downloads and more
+information at:
+
+ http://www.mired.org/webcheck/
+
+Webcheck allows webmasters to:
+
+* View The Structure Of A Site
+
+* Track Down Broken Links
+
+* Find Potentially Outdated Web Pages
+
+* List Links Pointing To External Sites
+
+* View Portfolio Of Inline Images
+
+* Do All This Periodically And Without User Intervention
+
+
+
+Changes to v. 1.0 include:
+
+* Faster checking of sites (only downloads files when it needs).
+
+* Supported schemes (http, ftp, file) handled more efficiently.
+
+* More modular design allows other schemes to be added easily.
+
+* Plug-in support: third-party reports can be added to webcheck easily!
+
+* Themes (TM) support via Cascading StyleSheets.
+
+* Lots of bug fixes, including the infamous proxy bug.
+
+* and more!
+
+Webcheck is a FREE clone of Linkbot <URL:
+http://www.watchfire.com/solutions/linkbot.asp > and incorporates many
+of Linkbot's features as well as enhancements of its own.
diff --git a/TODO b/TODO
new file mode 100644
index 0000000..e412afd
--- /dev/null
+++ b/TODO
@@ -0,0 +1,24 @@
+
+
+**************************************************************************
+* I'm running out of ideas ;-). If you have any suggestions for *
+* improvement, please let me know. mwm@mired.org *
+**************************************************************************
+
+Support for authenticating proxies
+
+New config file format.
+
+Configurable time-out when retrieving a document.
+
+Cookies support (maybe)
+
+Integration with weblint
+
+If using 'file' scheme, clicking on a link will bring up the file in an editor
+
+Support for multi-threading (maybe)
+
+Option to put # of hits of a document in the Site Map obtained from log file.
+
+Export to database file.
diff --git a/config.py b/config.py
new file mode 100644
index 0000000..54a5746
--- /dev/null
+++ b/config.py
@@ -0,0 +1,157 @@
+"""
+
+ Webcheck Configuration file
+ Edit this file to your choosing. This is just a regular Python module, so
+ if you want to do something fancy with it, go right ahead. Just make sure
+ that all variables are defined and have an appropriate value.
+
+"""
+
+
+# if this is true, webcheck will consider external all links that do not start in
+# the same directory level as the base url. For example, given
+# webcheck http://www.myhost.com/~me/
+# 'http://www.myhost.com/~me/stuff/index.html' would be considered internal while
+# 'http://www.myhost.com/index.html' would be considered external.
+# The default is false (0). Note this can be overridden with the -b command-line
+# flag.
+BASE_URLS_ONLY=0
+
+# This is a (Python) list of URLs that should not be explored. This can also
+# be passed to webcheck via the -x command line switch. Note this should be a
+# VALID REGULAR EXPRESSION. See also YANKED_URLS below.
+EXCLUDED_URLS = [r'.*\.gif',r'.*\.tar\.gz',r'.*\.jpeg',r'.*\.jpg',
+ r'http://www.mired.org/cgi-bin/', r'http://www.mired.org/ATCPFAQ/']
+
+# This is like EXCLUDED_URLS, but YANKED_URLS are not checked at all. Also
+# you can use the -y command line switch.
+# When using the below parameter, make sure that the regular expressions are
+# raw Python strings (beginning quote preceded with an "r"). Regular expressions
+# are case insensitive.
+YANKED_URLS = [r'http://www.amazon.com/exec/obidos/',
+ r'http://www.mired.org/home/mwm/&me;.txt']
+
+# Normally webcheck will check links to "external" sites at the top level to
+# ensure that your pages don't refer to broken links that are not at your
+# site. However, you may not want this. Setting this option to 1 will cause
+# webcheck to not check external links. This can also be set with the
+# command-line -a switch.
+#
+# Note that a link that is part of the EXCLUDED_URLS list is considered external.
+AVOID_EXTERNAL_LINKS = 0
+
+# Currently, Webcheck can check http:, ftp:, and file: schemes. However, you may
+# want to avoid certain schemes (such as file: or ftp:). Remove the scheme
+# from this list and Webcheck will avoid it. Avoided URLs are treated as external.
+# Default is to not avoid any.
+# Examples:
+#SCHEMES = ['http']
+#SCHEMES = ['http','ftp','file']
+SCHEMES = ['http','ftp']
+
+
+
+# You can define proxies for the individual schemes above. The PROXIES config
+# variable is a python dictionary or 'None', for example:
+# PROXIES = {'http':'http://localhost:3128'}
+PROXIES = None
+# Note: according to the urllib documentation, you should also be able to set
+# proxies according to your system's environment variables, for example:
+# $ HTTP_PROXY='http://localhost:3128' ; export HTTP_PROXY # using Bourne Shell
+# $ FTP_PROXY='http://localhost:3128' ; export FTP_PROXY
+# proxies in the configuration take precedence over environment settings
+
+
+# Hostnames (for example, www.myhost.com) which are to be considered local to
+# your site. Note that by default, the base URL of your site is considered
+# local. This can also be passed via command-line (see documentation for details).
+HOSTS = ['www.mired.org','mired.org']
+
+
+# Directory where files generated by webcheck will be placed. This can also be
+# specified via the -o command-line flag.
+OUTPUT_DIR = '.'
+
+# When listing a broken link in its published report, Webcheck can either make the
+# broken link 'active' or simply list the URL. Most users will probably not
+# want the broken link to be active.
+ANCHOR_BAD_LINKS = 1
+
+# Usually, Webcheck will process a URL and immediately move on to the next one.
+# However, on some loaded systems, it may be more desirable to have Webcheck wait
+# a while between requests. This option should be set to any non-negative number
+# (in seconds). This can also be set using the command-line -w <secs> flag.
+WAIT_BETWEEN_REQUESTS = 0
+
+# When Webcheck encounters a 301 or 302 response from the server, it
+# needs to decide how many times it will follow the indications of the
+# server. By setting this option, you may change it to your
+# tastes. Setting it to -1 means "infinite redirection" (don't say I
+# didn't warn you, when your sysadm tries to make you eat the 10^6
+# network logs you produced and he printed... :)
+REDIRECT_DEPTH = 5
+
+# Webcheck has the option of checking a registry to determine whether you are
+# using the latest version of Webcheck and of its plugin reports. If
+# this option is set to true (not 0) it will check the registry and print a
+# message on the reports to notify you along with a link to where you
+# can download the latest version of the plugin (or Webcheck). Note that
+# this feature requires that Webcheck have access to the Internet.
+#
+# **** THIS FEATURE IS CURRENTLY NONFUNCTIONAL ****
+WARN_OLD_VERSION = 0
+
+# Debug level. For normal output, set to 1. The higher the number, the more
+# output. A setting of 0 produces no output.
+DEBUG_LEVEL = 1
+
+################ The section below is for report plugins ################
+
+# This is the list of report plugins to display. The elements are strings and
+# there should be a corresponding .py file in the WEBCHECKHOME/plugins directory
+# else bad things will occur ;-). Place them in the order in which you would like
+# to see them in the navigation bar.
+# Note: Do not include the 'problems' report as it will appear (last) on all
+# reports automatically
+PLUGINS = ['sitemap',
+ 'badlinks',
+ 'images',
+ 'whatsold',
+ 'whatsnew',
+ 'slow',
+ 'notitles',
+ 'external',
+ 'notchkd']
+
+# This is a URL (absolute or relative) of a level 1 Cascading Stylesheet to be
+# used in all reports. See the default webcheck.css as well as the HTML source
+# for ideas on making your own .css for Webcheck.
+STYLESHEET = ''
+
+##### The Navigation (menu) frame/page ############
+NAVBAR_FILENAME = 'navbar.html'
+NAVBAR_WIDTH = '150'
+NAVBAR_PADDING = 4
+NAVBAR_SPACING = 0
+
+MAIN_FILENAME = 'index.html'
+
+# url to logo (image) shown on all pages. If you change this you will also
+# want to change the LOGO_ALT option below
+LOGO_HREF="http://www.mired.org/webcheck/webcheck.gif"
+
+##### Configuration for specific plugins #####
+REPORT_SITEMAP_LEVEL = 5 # How many levels deep to display the site map
+
+# number of columns in thumbnail image page
+REPORT_IMAGES_COLS=5
+# width of thumbnail images
+REPORT_IMAGES_WIDTH=100
+# height of thumbnail images
+REPORT_IMAGES_HEIGHT=100
+
+REPORT_WHATSOLD_URL_AGE = 700
+REPORT_WHATSNEW_URL_AGE = 7
+
+REPORT_SLOW_URL_SIZE = 76
+
diff --git a/contrib/plugins/about.py b/contrib/plugins/about.py
new file mode 100644
index 0000000..470d02e
--- /dev/null
+++ b/contrib/plugins/about.py
@@ -0,0 +1,47 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Plugins used in this report"""
+
+# This is a trivial plugin to aid developers of linbot plugins
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = "About&nbsp;Plugins"
+
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ print '<tr><th>Plugin</th><th>Version</th><th>Author</th></tr>'
+ for plugin in config.PLUGINS + ['problems']:
+ report = __import__('plugins.%s' % plugin,globals(),locals(),[plugin])
+ author = report.__author__
+ version = report.__version__
+ print '<tr><td class="pluginname">%s</td>' % plugin,
+ print '<td class="pluginversion">%s</td>' % version,
+ print '<td class="pluginauthor">%s</td></tr>' % author
+ print '</table>'
+ print '</div>'
diff --git a/debugio.py b/debugio.py
new file mode 100644
index 0000000..3ef3d76
--- /dev/null
+++ b/debugio.py
@@ -0,0 +1,34 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+
+"""debugio.py: debugging and input/output module
+
+ This module contains facilities for printing to standard output. The use
+ of this module is really simple: import it, set DEBUG_LEVEL, and use write()
+ whenever you want to print something. The write() function will print to
+ standard output only if DEBUG_LEVEL is at least the message's level.
+"""
+import sys
+
+DEBUG_LEVEL=1
+
+def write(s, level=1, file=sys.stdout):
+    """Write s to file (stdout by default) if DEBUG_LEVEL is >= level"""
+
+    if DEBUG_LEVEL >= level: file.write("%s\n" % s)
diff --git a/htmlparse.py b/htmlparse.py
new file mode 100644
index 0000000..38e1f13
--- /dev/null
+++ b/htmlparse.py
@@ -0,0 +1,129 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+"""Utilities for parsing HTML and urls"""
+
+import htmllib
+import string
+import debugio
+
+def urlformat(url,parent=None):
+    """Return a formatted version of url: partial urls are expanded
+    relative to parent (if given) and the fragment identifier ('#...') is
+    removed. The query string is kept."""
+
+    from urlparse import urlparse, urljoin, urlunparse
+
+    method=urlparse(url)[0]
+    if (method=='') and (parent != None):
+        url=urljoin(parent,url)
+        #url=basejoin(parent,url)
+    parsedlist = list(urlparse(url))
+    parsedlist[5]='' # remove fragment
+    # parsedlist[4]='' # remove query string
+    url = urlunparse(tuple(parsedlist))
+    return url
+
+
+class MyHTMLParser(htmllib.HTMLParser):
+
+ def __init__(self,formatter):
+ self.imagelist = []
+ self.title = None
+ self.author = None
+ self.base = None
+ htmllib.HTMLParser.__init__(self,formatter)
+
+ # override handle_image()
+ def handle_image(self,src,alt,*stuff):
+ if src not in self.imagelist: self.imagelist.append(src)
+
+ def do_frame(self,attrs):
+ for name, val in attrs:
+ if name=="src":
+ self.anchorlist.append(val)
+
+ def save_bgn(self):
+ self.savedata = ''
+
+
+ def save_end(self):
+ data = self.savedata
+ self.savedata = None
+ return data
+
+ def start_title(self, attrs):
+ self.save_bgn()
+
+ def end_title(self):
+ #if not self.savedata:
+ # self.title = None
+ # return
+ self.title = string.join(string.split(self.save_end()))
+
+ def do_meta(self,attrs):
+ fields={}
+ for name, value in attrs:
+ fields[name]=value
+ if fields.has_key('name'):
+ if string.lower(fields['name']) == 'author':
+ if fields.has_key('content'):
+ author = fields['content']
+ self.author = author
+ debugio.write('\tauthor: ' + author)
+
+ # <AREA> for client-side image maps
+ def do_area(self,attrs):
+ for name, val in attrs:
+ if name=="href":
+ if val not in self.anchorlist:
+ self.anchorlist.append(val)
+
+ def do_base(self,attrs):
+ for name,val in attrs:
+ if name=="href":
+ self.base = val
+
+def pageLinks(url,page):
+ """ returns the tuple (urllist, imagelist, title, author) for a page; page
+ is the document content passed to the parser's feed(). Partial urls are
+ expanded using the <url> parameter unless the page has a <BASE HREF=> tag."""
+ import htmllib
+ from formatter import NullFormatter
+
+ parser = MyHTMLParser(NullFormatter())
+ parser.feed(page)
+ parser.close()
+ urllist = []
+ imagelist = []
+
+ title = parser.title
+ author = parser.author
+ if parser.base is not None:
+ parent = parser.base
+ else:
+ parent = url
+ for anchor in parser.anchorlist:
+ anchor=urlformat(anchor,parent)
+ if anchor not in urllist: urllist.append(anchor)
+
+ for image in parser.imagelist:
+ image=urlformat(image,parent)
+ if image not in imagelist: imagelist.append(image)
+
+ return (urllist, imagelist, title, author)
diff --git a/httpcodes.py b/httpcodes.py
new file mode 100644
index 0000000..6060c67
--- /dev/null
+++ b/httpcodes.py
@@ -0,0 +1,58 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+__version__='0.10'
+__author__ = 'Mike Meyer <mwm@mired.org>'
+
+HTTP_STATUS_CODES= {'100':"Continue",
+ '101':"Switching Protocols",
+ '200':"OK",
+ '201':"Created",
+ '202':"Accepted",
+ '204':"No Content",
+ '205':"Reset Content",
+ '206':"Partial Content",
+ '300':"Multiple Choices",
+ '301':"Moved Permanently",
+ '302':"Moved Temporarily",
+ '303':"See Other",
+ '304':"Not Modified",
+ '305':"Use Proxy",
+ '400':"Bad Request",
+ '401':"Unauthorized",
+ '402':"Payment Required",
+ '403':"Forbidden",
+ '404':"Not Found",
+ '405':"Method Not Allowed",
+ '406':"Not Acceptable",
+ '407':"Proxy Authentication Required",
+ '408':"Request Time-out",
+ '409':"Conflict",
+ '410':"Gone",
+ '411':"Length Required",
+ '412':"Precondition Failed",
+ '413':"Request Entity Too Large",
+ '414':"Request-URI Too Large",
+ '415':"Unsupported Media Type",
+ '500':"Internal Server Error",
+ '501':"Not Implemented",
+ '502':"Bad Gateway",
+ '503':"Service Unavailable",
+ '504':"Gateway Time-out",
+ '505':"HTTP Version not supported"
+ }
diff --git a/myUrlLib.py b/myUrlLib.py
new file mode 100644
index 0000000..fc7804d
--- /dev/null
+++ b/myUrlLib.py
@@ -0,0 +1,303 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Generic library for handling urls and links"""
+
+config = None
+robot_parsers={}
+SECS_PER_DAY=60*60*24
+compiled_ex = []
+compiled_yanked = []
+linkmodules={}
+
+from urllib import *
+from types import *
+import htmllib
+import httplib
+import robotparser
+import string
+# The following is to help sgmllib parse DOS/Windows text files
+string.whitespace = string.whitespace + '\012\015'
+import time
+import re
+import stat
+import htmlparse
+import debugio
+import sys
+import socket
+
+
+def get_robots(location):
+ global robot_parsers
+ debugio.write('\tGetting robots.txt for %s' % location)
+ rp=robotparser.RobotFileParser(config.PROXIES)
+ try:
+ rp.set_url('http://' + location + '/robots.txt')
+ rp.read()
+ robot_parsers[location]=rp
+ except TypeError:
+ pass
+
+def can_fetch(location, url):
+ """Return true if url is allowed at location, else return 0"""
+ if robot_parsers.has_key(location):
+ return robot_parsers[location].can_fetch('Webcheck',url)
+ return 1
+
+############################################################################
+class Link:
+ """ my class of url's which includes parents, HTTP status number, and
+ a list of the URLs found in that link.
+ """
+
+ linkList = {}
+ badLinks = []
+ notChecked = []
+ images = {}
+ baseurl = ""
+ base=""
+
+ # This is a static variable to indicate if the config.EXCLUDED urls have been
+ # compiled as regular expressions.
+ re_compiled = 0
+
+ def __init__(self,url,parent):
+ self.init()
+
+ debugio.write('\tparent = ' + str(parent),2)
+ from urlparse import urlparse
+
+ parsed = urlparse(url)
+ self.scheme = parsed[0]
+ location = parsed[1]
+
+ if parent not in self.parents:
+ if parent: self.parents.append(parent)
+
+ self.URL = url
+ Link.linkList[self.URL]=self
+
+ modname = self.scheme + 'link'
+ if linkmodules.has_key(modname): linkmodule = linkmodules[modname]
+ else:
+ try:
+ linkmodule = linkmodules[modname] = __import__('schemes.'+modname, globals(),locals(),[modname])
+ except ImportError:
+ self.status="Not Checked"
+ self.external=1
+ self.URL=url
+ Link.notChecked.append(self.URL)
+ Link.linkList[self.URL]=self
+ debugio.write('\tNot checked: URL scheme ' + self.scheme + ' ignored.')
+ return
+
+ if (parent is None):
+ Link.baseurl=self.URL
+ if hasattr(self.URL, 'rfind'):
+ Link.base=self.URL[:self.URL.rfind('/')+1]
+ else:
+ Link.base=self.URL[:string.rfind(self.URL,'/')+1]
+ if Link.base[-2:] == '//': Link.base = self.URL
+ debugio.write('\tbase: %s' % Link.base)
+ if self.scheme == 'http':
+ base_location = parsed[1]
+ if base_location not in config.HOSTS:
+ config.HOSTS.append(base_location)
+ if not robot_parsers.has_key(location):
+ try:
+ get_robots(location)
+ except IOError:
+ pass
+
+ # see if robots.txt will let us in
+ if self.scheme == 'http':
+ if not can_fetch(location, url):
+ debugio.write('\tRobot Restricted')
+ self.status = 'Not Checked'
+ self.message = 'Robot Restricted'
+ Link.notChecked.append(url)
+ return
+
+ try:
+ linkmodule.init(self, url, parent)
+ if (self.URL not in Link.badLinks) and (self.type == 'text/html'):
+ page = linkmodule.get_document(self.URL)
+ self._handleHTML(self.URL, page)
+ except IOError, data:
+ self.set_bad_link(url,str(data.errno) + ': ' + str(data.strerror))
+ return
+ except socket.error, data:
+ if type(data) is StringType:
+ self.set_bad_link(url, data)
+ elif type(data) is TupleType:
+ errno, strerror = data
+ self.set_bad_link(url,str(errno) + ': ' + strerror)
+ else:
+ self.set_bad_link(url,str(data))
+ except KeyboardInterrupt:
+ raise KeyboardInterrupt
+ except:
+ self.set_bad_link(url,"Error: Malformed URL?")
+ debugio.write("\t%s: %s" % (sys.exc_type, sys.exc_value),3)
+ return
+
+ def explore_children(self):
+ for child in self.children:
+ if not Link.linkList.has_key(child):
+ if config.WAIT_BETWEEN_REQUESTS > 0:
+ debugio.write('sleeping %s seconds' % config.WAIT_BETWEEN_REQUESTS)
+ time.sleep(config.WAIT_BETWEEN_REQUESTS)
+ debugio.write("adding url: %s" % child)
+ if is_yanked(child):
+ Link.linkList[child]=ExternalLink(child,self.URL,1)
+ elif is_external(child) or is_excluded(child):
+ Link.linkList[child]=ExternalLink(child,self.URL)
+ else:
+ Link.linkList[child]=Link(child,self.URL)
+ elif self.URL not in Link.linkList[child].parents:
+ Link.linkList[child].parents.append(self.URL)
+ return # explore_children
+
+ def init(self):
+ """ initialize some variables """
+ self.age = None
+ self.scheme = None
+ self.headers = None
+ self.parents= []
+ self.children = []
+ self.status = None
+ self.title = None
+ self.external = 0
+ self.html = 0
+ self.size = 0
+ self.totalSize = 0
+ self.author = None
+
+ def __repr__(self):
+ return self.URL
+
+ def set_bad_link(self,url,status):
+ """ flags the link as bad """
+ debugio.write('\t' + str(status))
+ self.status = str(status)
+ self.URL=url
+ Link.linkList[self.URL]=self
+ Link.badLinks.append(self.URL)
+
+ def _handleHTML(self,url,htmlfile):
+ """examines an html file and updates the Link object"""
+ # get anchorlist
+ (anchorlist, imagelist, title, author) = htmlparse.pageLinks(url,htmlfile)
+
+ debugio.write('\ttitle: %s' % str(title))
+ for child in anchorlist:
+ if child not in self.children:
+ self.children.append(child)
+
+ self.totalSize = self.size
+ self.title = title
+ self.author = author
+ self.html = 1
+ # get image list
+ for image in imagelist:
+ if image not in Link.images.keys():
+ debugio.write('\tadding image: %s' % image)
+ Link.images[image] = Image(image, self.URL)
+ self.totalSize = self.totalSize + int(Link.images[image].size)
+ if not self.external: self.explore_children()
+ return
+
+
+
+class ExternalLink(Link):
+ """ this class is just like Link, but it does not explore its children """
+
+ def __init__(self,url,parent,yanked=0):
+
+ if config.AVOID_EXTERNAL_LINKS or yanked:
+ self.init()
+ self.status="Not Checked"
+ self.external=1
+ debugio.write('\tNot checked')
+ if yanked: debugio.write('\tYanked')
+ if parent not in self.parents:
+ if parent: self.parents.append(parent)
+ Link.notChecked.append(url)
+ return
+ Link.__init__(self,url,parent)
+ self.external=1
+
+
+ def _handleHTML(self,url,htmlfile):
+ # ignore links and images, but use the title
+ self.title = htmlparse.pageLinks(url,htmlfile)[2]
+ debugio.write('\ttitle: %s' % str(self.title))
+ self.children=[]
+
+class Image(Link):
+ """ This class is just like link, but different :-)"""
+ def __init__(self, url, parent):
+ #self.init()
+ Link.__init__(self, url, parent)
+ #self.age = getAge(self)
+
+ def _handleHTML(self,url,htmlfile):
+ """Don't handle HTML, this is an image"""
+ self.set_bad_link(url,"HTML file used in IMG tag?")
+ return
+
+def is_external(url):
+ """ returns true if url is an external link """
+ from urlparse import urlparse
+ parsed = urlparse(url)
+ scheme = parsed[0]
+ location = parsed[1]
+ if (location not in config.HOSTS) and (scheme in ['http','ftp']):
+ return 1
+ if config.BASE_URLS_ONLY and (Link.base!=url[:len(Link.base)]):
+ return 1
+ return 0
+
+def compile_re():
+ """Compile EXCLUDED_URLS and YANKED_URLS and set flag"""
+ global compiled_ex
+ for i in config.EXCLUDED_URLS:
+ debugio.write('compiling %s' % i,3)
+ compiled_ex.append(re.compile(i,re.IGNORECASE))
+ for i in config.YANKED_URLS:
+ debugio.write('compiling %s' % i,3)
+ compiled_yanked.append(re.compile(i,re.IGNORECASE))
+ Link.re_compiled = 1
+
+def is_excluded(url):
+ """ Returns true if url is part of the EXCLUDED_URLS list """
+ if not Link.re_compiled:
+ compile_re()
+ for x in compiled_ex:
+ if x.search(url) is not None:
+ return 1
+ return 0
+
+def is_yanked(url):
+ """ Returns true if url is part of YANKED_URLS list"""
+ if not Link.re_compiled:
+ compile_re()
+ for x in compiled_yanked:
+ if x.search(url) is not None:
+ return 1
+ return 0
+
diff --git a/plugins/__init__.py b/plugins/__init__.py
new file mode 100644
index 0000000..2a586eb
--- /dev/null
+++ b/plugins/__init__.py
@@ -0,0 +1,17 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
diff --git a/plugins/badlinks.py b/plugins/badlinks.py
new file mode 100644
index 0000000..ef1f229
--- /dev/null
+++ b/plugins/badlinks.py
@@ -0,0 +1,56 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Listing of bad links"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'Bad Links'
+
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellspacing=2 width="75%">'
+ for link in Link.badLinks:
+ print '\t<tr><td class="blank" colspan=3>&nbsp;</td></tr>'
+ if config.ANCHOR_BAD_LINKS:
+ print '\t<tr class="link"><th>Link</th>',
+ print '<td colspan=2 align=left>' +make_link(link,link) +'</td></tr>'
+ else:
+ print '\t<tr class="link"><th>Link</th>',
+ print '<td colspan=2 align=left>%s</td></tr>' % link
+ status = str(linkList[link].status)
+ if status in HTTP_STATUS_CODES.keys():
+ status = status + ": " + HTTP_STATUS_CODES[status]
+ print '\t<tr class="status"><th>Status</th><td colspan=2>%s</td></tr>' % status
+ print '\t<tr class="parent"><th rowspan="%s">Parents</th>' % len(linkList[link].parents)
+ parents = linkList[link].parents
+ parents.sort(sort_by_author)
+ for parent in parents:
+ print '\t\t<td>%s</td>' % make_link(parent,get_title(parent)),
+ print '<td>%s</td>\n\t</tr>' % (str(linkList[parent].author))
+ add_problem("Bad Link: " + link,linkList[parent])
+ print '</table>'
+ print '</div>'
diff --git a/plugins/external.py b/plugins/external.py
new file mode 100644
index 0000000..44e11f3
--- /dev/null
+++ b/plugins/external.py
@@ -0,0 +1,40 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""External links"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'External Links'
+
+def generate():
+ print '<ol>'
+ for url in linkList.keys():
+ link=linkList[url]
+ if link.external:
+ print '\t<li>%s' % make_link(url,get_title(url))
+ print '</ol>'
diff --git a/plugins/images.py b/plugins/images.py
new file mode 100644
index 0000000..7d65d0b
--- /dev/null
+++ b/plugins/images.py
@@ -0,0 +1,58 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Image Catalog"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'Images'
+
+# images
+def generate():
+    import math
+    imagelist=Link.images.keys()
+
+    currentPic=0
+    rows = int(math.ceil(len(imagelist)/float(config.REPORT_IMAGES_COLS)))
+    print '<div class="table">'
+    print '<table border=0 cellspacing="1" cellpadding="0">'
+
+    for row in range(rows):
+        print'\t<tr>'
+        for col in range(config.REPORT_IMAGES_COLS):
+            if currentPic==len(imagelist): break
+            image=imagelist[currentPic]
+            print '\t\t<td>' + \
+                  make_link(image,
+                            '<img src="%s" width="%d" height="%d" alt="%s">' \
+                            % (image,config.REPORT_IMAGES_WIDTH,
+                               config.REPORT_IMAGES_HEIGHT, image)),
+            print '</td>'
+            currentPic = currentPic + 1
+
+        print '\t</tr>'
+    print '</table>'
+    print '</div>'
diff --git a/plugins/notchkd.py b/plugins/notchkd.py
new file mode 100644
index 0000000..d9c08d0
--- /dev/null
+++ b/plugins/notchkd.py
@@ -0,0 +1,46 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Pages which were not checked"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'Not Checked'
+
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ for url in Link.notChecked:
+ print '\t<tr><th colspan=4>%s</th></tr>' % make_link(url,url)
+ print '\t<tr class="parent"><th rowspan="%s">Parent</th>' % len(linkList[url].parents)
+ for parent in linkList[url].parents:
+ print '\t\t',
+ if parent != linkList[url].parents[0]: print '<tr>',
+ print '<td colspan=2>%s</td>' % make_link(parent,get_title(parent)),
+ print '<td>%s</td></tr>' % (linkList[parent].author)
+ print '\n\t<tr><td class="blank" colspan=4>&nbsp;</td></tr>\n'
+ print '</table>'
+ print '</div>'
diff --git a/plugins/notitles.py b/plugins/notitles.py
new file mode 100644
index 0000000..aba829a
--- /dev/null
+++ b/plugins/notitles.py
@@ -0,0 +1,47 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Pages with no titles"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'No Titles'
+
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ print '\t<tr><th>URL</th><th>Author</th></tr>'
+ urls = linkList.keys()
+ urls.sort(sort_by_author)
+ for url in urls:
+ link = linkList[url]
+ if link.external: continue
+ if link.html and (link.title is None):
+ print '\t<tr><td>%s</td><td>%s</td></tr>' \
+ % (make_link(url,url), link.author)
+ add_problem("No Title",link)
+ print '</table>'
+ print '</div>'
diff --git a/plugins/problems.py b/plugins/problems.py
new file mode 100644
index 0000000..2a42f99
--- /dev/null
+++ b/plugins/problems.py
@@ -0,0 +1,53 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Breakdown of links with problems"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = 'Problems (By&nbsp;Author)'
+
+def generate():
+ authors=problem_db.keys()
+ authors.sort()
+ if len(authors) > 1:
+ print '<p class="authorlist">'
+ for author in authors[:-1]:
+ print '<a href="#%s">%s</a>' % (author, author),
+ print " | "
+ print '<a href="#%s">%s</a>' % (authors[-1], authors[-1]),
+ print '</p>'
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ for author in authors:
+ print '<tr><th><a name="%s">%s</a></th></tr>' % (author,author)
+ for type,link in problem_db[author]:
+ url=`link`
+ title=get_title(url)
+ print '<tr><td>%s <br>%s</td></tr>' % (make_link(url,title), type)
+ print '<tr><td class="blank">&nbsp;</td></tr>\n'
+ print '</table>'
+ print '</div>'
diff --git a/plugins/rptlib.py b/plugins/rptlib.py
new file mode 100644
index 0000000..101a56e
--- /dev/null
+++ b/plugins/rptlib.py
@@ -0,0 +1,290 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import sys
+import webcheck
+import urllib
+import string
+import os
+import debugio
+import version
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+proxies = config.PROXIES
+
+problem_db = {}
+
+# get the stylesheet for insertion,
+# Note that I do it this way for two reasons. One is that Netscape reportedly
+# handles stylesheets better when they are inlined. Two is that people often
+# forget to put webcheck.css in the output directory.
+if proxies is None:
+ proxies = urllib.getproxies()
+opener = urllib.FancyURLopener(proxies)
+opener.addheaders = [('User-agent','Webcheck ' + version.webcheck)]
+try:
+ stylesheet = opener.open(config.STYLESHEET).read()
+except:
+ stylesheet = ''
+
+def get_title(url):
+ """ returns the title of a url if it is not None, else returns url
+ note that this implies linkList[url] """
+ link=linkList[url]
+ if link.title is None:
+ return url
+ return link.title
+
+def make_link(url,text):
+ """Return an <A>nchor to a url with <text>. If url is in the Linklist and
+ is external, insert "class=external" in the <A> tag."""
+ url = str(url) # because sometimes I lazily pass a Link object.
+ mystring = '<a href="' + url + '"'
+ try:
+ external = linkList[url].external
+ except KeyError:
+ external = 0
+ if external:
+ mystring = mystring + ' class="external"'
+ else:
+ mystring = mystring + ' class="internal"'
+ mystring = mystring + '>' + text + '</a>'
+ return mystring
+
+def add_problem(type,link):
+ """ add a problem to the 'problems' database. Will not add external links"""
+ if link.external: return
+ global problem_db
+ author = link.author
+ if problem_db.has_key(author):
+ problem_db[author].append((type,link))
+ else:
+ problem_db[author]=[(type,link)]
+
+def sort_by_age(a,b):
+ """ sort helper for url's age. a and b are urls in linkList """
+ aage, bage = linkList[a].age, linkList[b].age
+ if aage < bage:
+ return -1
+ if aage == bage:
+ return sort_by_author(a,b)
+ return 1
+
+def sort_by_rev_age(a,b):
+ aage, bage = linkList[a].age, linkList[b].age
+ if aage > bage:
+ return -1
+ if aage == bage:
+ return sort_by_author(a,b)
+ return 1
+
+def sort_by_author(a,b):
+ aauthor,bauthor = `linkList[a].author`, `linkList[b].author`
+ if aauthor < bauthor:
+ return -1
+ if aauthor == bauthor:
+ return 0
+ return 1
+
+def sort_by_size(a,b):
+ asize, bsize = linkList[a].totalSize, linkList[b].totalSize
+ if asize < bsize:
+ return 1
+ if asize == bsize:
+ return 0
+ return -1
+
+def main_index():
+ tmp = sys.stdout
+ fp = open_file(config.MAIN_FILENAME)
+ sys.stdout=fp
+
+ print '<html>'
+ print '<head>'
+ print '<title>Webcheck report for "%s"</title>' % get_title(`Link.base`)
+ print '<style type="text/css">'
+ print '<!-- /* hide from old browsers */'
+ print stylesheet
+ print ' --> </style>'
+ print '</head>'
+ print '<frameset COLS="%s,*" border=0 framespacing=0>' \
+ % config.NAVBAR_WIDTH
+ print '<frame name="navbar" src="%s" marginwidth=0 marginheight=0 frameborder=0>' \
+ % config.NAVBAR_FILENAME
+ print '<frame name="main" src="%s" frameborder=0>' % (webcheck.plugins[0]+'.html')
+ print '</frameset>'
+ print '</html>'
+ fp.close()
+ sys.stdout = tmp
+
+
+def nav_bar(plugins):
+ # navigation bar
+ fp=open_file(config.NAVBAR_FILENAME)
+ stdout = sys.stdout
+ sys.stdout = fp
+ print '<html>\n<head>'
+ print '\t<title>navbar</title>'
+ print '<style type="text/css">'
+ print '<!-- /* hide from old browsers */'
+ print stylesheet
+ print ' --> </style>'
+ print '\t<base target="main">'
+ print '</head>'
+ print '<body class="navbar">'
+ print '<div align=center>'
+ print '<table cellpadding="%s" cellspacing="%s">' \
+ % (config.NAVBAR_PADDING, config.NAVBAR_SPACING)
+ # title
+ print '<tr><th class="home">',
+ print '<a target="_top" href="%s" onMouseOver="window.status=\'Webcheck Home Page\'; return true;">Webcheck %s</a></th></tr>' \
+ % (version.home, version.webcheck)
+
+ # labels pointing to each individual page
+ for plugin in plugins + ['problems']:
+ debugio.write('\t' + plugin,file=stdout)
+ filename = plugin + '.html'
+ print '<tr><th>',
+ report = __import__('plugins.' + plugin, globals(), locals(), [plugin])
+ print '<strong><a href="%s" onMouseOver="window.status=\'%s\'; return true">%s</a></strong>' \
+ % (filename, report.__doc__, report.title),
+ print '</th></tr>'
+
+ # create the file we just pointed to
+ tmp = sys.stdout
+ fp = open_file(filename)
+ sys.stdout = fp
+ doTopMain(report)
+ report.generate()
+ report_version = report.__version__
+ if config.WARN_OLD_VERSION:
+ check_and_warn(plugin,report_version)
+ doBotMain()
+ fp.close()
+ sys.stdout = tmp
+
+ print
+ print '</table>'
+ print '</div>'
+ print '</body>'
+ print '</html>'
+
+ fp.close()
+ sys.stdout = stdout
+
+def open_file(filename):
+ """ given config.OUTPUT_DIR checks if the directory already exists; if not, it creates it, and then opens filename for writing and returns the file object """
+ if os.path.isdir (config.OUTPUT_DIR) == 0:
+ os.mkdir(config.OUTPUT_DIR)
+ return open(config.OUTPUT_DIR + filename,'w')
+
+def doTopMain(report):
+ """top part of html files in main frame prints to stdout"""
+ print '<html>'
+ print '<head><title>%s</title>' % report.title
+ print '<style type="text/css">'
+ print '<!-- /* hide from old browsers */'
+ print stylesheet
+ print ' --> </style>'
+ print '<meta name="Author" content="Webcheck ' + version.webcheck + '">'
+ print '</head>'
+ print '<body class="%s">' % string.split(report.__name__,'.')[1]
+ print '<p class="logo"><a '
+ print 'href="%s"><img src="%s" border=0 alt=""></a></p>' % (Link.base, config.LOGO_HREF)
+ print '\n<h1 class="basename">'
+ print '\t<a href="%s">%s</a>' \
+ % (`Link.base`, get_title(`Link.base`))
+ print '</h1>'
+ print '\n\n<table width="100%" cellpadding=4>'
+ print '\t<tr><th class="title">%s</th></tr>\n</table>\n' % report.title
+
+def doBotMain():
+ """ bottom part of html files in main frame"""
+ print
+ print '<hr>'
+ print '<p class="footer">'
+ print '<em>Generated %s by <a target="_top" href="%s">Webcheck %s</a></em></p>' \
+ % (webcheck.start_time,version.home, version.webcheck)
+ print '</body>'
+ print '</html>'
+
+
+def read_registry(url):
+ """Read file referenced by url and return a registry object.
+
+ The registry object is just a dictionary. The key is an individual
+ module name. The value is a tuple consisting of the latest version
+ and the url where it can be retrieved. e.g.:
+ registry['mymodule'] = ('1.0','http://www.mymodule.com/')
+ """
+ registry = {}
+ lines = opener.open(url).readlines()
+ opener.close()
+ for line in lines:
+ fields = string.split(line)
+ if len(fields) != 3: continue
+ registry[fields[0]] = fields[1:]
+
+ return registry
+
+def check_and_warn(plugin,plugin_version):
+ """Check whether Webcheck and the plugin are up to date; if either is
+ outdated, write a warning in the report.
+ """
+
+ old_webcheck = 0
+ old_plugin = 0
+
+ # first check to see if webcheck is up to date
+ try:
+ if version.webcheck != registry['webcheck'][0]:
+ old_webcheck = 1
+ except KeyError:
+ pass
+ try:
+ if plugin_version != registry[plugin][0]:
+ old_plugin = 1
+ except KeyError:
+ pass
+
+ if (old_plugin + old_webcheck):
+ print '<table class="warning" cellpadding="4" cellspacing="0" border="0">'
+ print '<tr><td><strong>Warning:</strong> ',
+ if old_webcheck:
+ print 'The version of Webcheck you are using (%s) is outdated.' \
+ % version.webcheck,
+ print 'You may download the latest version, %s, at ' \
+ % registry['webcheck'][0],
+ print '<a href="%s" target="_top">%s</a>.<br><br>' \
+ % (registry['webcheck'][1],registry['webcheck'][1])
+ if old_plugin:
+ print 'The %s plugin used to generate this report is outdated.' \
+ % plugin,
+ print 'This version is %s. The latest version is %s ' \
+ % (plugin_version, registry[plugin][0]),
+ print 'And may be downloaded at <a href="%s" target="_top">%s</a>.<br>' \
+ % (registry[plugin][1],registry[plugin][1])
+ print '</td></tr></table>'
+
+if config.WARN_OLD_VERSION:
+ registry = read_registry(version.registry)
+ debugio.write('registry = %s' % registry,4)
diff --git a/plugins/sitemap.py b/plugins/sitemap.py
new file mode 100644
index 0000000..8338e4e
--- /dev/null
+++ b/plugins/sitemap.py
@@ -0,0 +1,79 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Your site at-a-glance"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from rptlib import *
+
+title = 'Site Map'
+level = 0
+
+def explore(link, explored):
+ """Recursively do a breadth-first traversal of the graph of links
+ on the site. Returns a list of HTML fragments that can be printed
+ to produce a site map."""
+
+ global level
+ if level > webcheck.config.REPORT_SITEMAP_LEVEL: return []
+ # XXX I assume an object without a .URL is something
+ # uninteresting? --amk
+ if not hasattr(link, 'URL'): return []
+
+ level=level+1
+ explored[ link.URL ] = 1
+ to_explore = []
+ L = ['<ul>']
+
+ # We need to do a breadth-first traversal. This requires two
+ # steps for any given page. First, we need to make a list of
+ # links to be traversed; links that have already been explored can
+ # be ignored.
+
+ for i in link.children:
+ # Skip pages that have already been traversed
+ if explored.has_key( i ): continue
+ if (i in webcheck.Link.badLinks) and not webcheck.config.ANCHOR_BAD_LINKS:
+ L.append('<li>%s' % i)
+ else:
+ to_explore.append(i)
+ explored[ i ] = 1 # Mark the link as explored
+
+ # Now we loop over the list of links; the traversal will not go to
+ # any pages that are marked as having already been traversed.
+ for i in to_explore:
+ child = webcheck.Link.linkList[i]
+ L.append('<li>%s' % (make_link(i,get_title(i))))
+ L = L + explore(child, explored)
+
+ L.append( '</ul>' )
+ level=level-1
+
+ # If no sub-pages were traversed at all, just return an empty list
+ # to avoid redundant <UL>...</UL> pairs
+ if len(L) == 2: return []
+
+ return L
+
+# site map
+def generate():
+ print make_link(webcheck.Link.base,'Starting Page')
+ L = explore(webcheck.Link.base, {})
+ for i in L: print i
diff --git a/plugins/slow.py b/plugins/slow.py
new file mode 100644
index 0000000..ab18d2f
--- /dev/null
+++ b/plugins/slow.py
@@ -0,0 +1,61 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Pages that are slow to download"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = "What's Slow"
+
+def generate():
+ import time
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ print '\t<tr><th rowspan=2>Link</th>',
+ print '<th rowspan=2>Size <br>(Kb)</th>',
+ print '<th colspan=3>Time (HH:MM:SS)</th></tr>'
+ print '\t<tr><th>28.8</th><th>ISDN</th><th>T1</th></tr>'
+
+ urls = linkList.keys()
+ urls.sort(sort_by_size)
+ for url in urls:
+ link = linkList[url]
+ if not link.html: continue
+ sizeK = link.totalSize / 1024
+ sizek = link.totalSize * 8 / 1000
+ if sizeK < config.REPORT_SLOW_URL_SIZE:
+ break
+ print '\t<tr><td>%s</td>' % make_link(url, get_title(url)),
+ print '<td>%s</td><td class="time">%s</td>' \
+ % (sizeK, time.strftime('%H:%M:%S',time.gmtime(int(sizek/28.8)))),
+ print '<td class="time">%s</td>' \
+ % time.strftime('%H:%M:%S',time.gmtime(int(sizek/56))),
+ print '<td class="time">%s</td>' \
+ % time.strftime('%H:%M:%S',time.gmtime(int(sizek/1500))),
+ print '</tr>'
+ add_problem('Slow Link: %sK' % sizeK, link)
+ print '</table>'
+ print '</div>'
diff --git a/plugins/whatsnew.py b/plugins/whatsnew.py
new file mode 100644
index 0000000..1c655af
--- /dev/null
+++ b/plugins/whatsnew.py
@@ -0,0 +1,49 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Recently modified pages"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = "What's New"
+
+# what's new
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ print '\t<tr><th>Link</th><th>Author</th><th>Age</th></tr>'
+ urls = linkList.keys()
+ urls.sort(sort_by_age)
+ for url in urls:
+ link=linkList[url]
+ if not link.html: continue
+ age = link.age
+ if (age is not None)and (age <= config.REPORT_WHATSNEW_URL_AGE):
+ print '\t<tr><td>%s</td>' % make_link(url,get_title(url)),
+ print '<td>%s</td>' % link.author,
+ print '<td class="time">%s</td></tr>' % age
+ print '</table>'
+ print '</div>'
diff --git a/plugins/whatsold.py b/plugins/whatsold.py
new file mode 100644
index 0000000..51d1ad2
--- /dev/null
+++ b/plugins/whatsold.py
@@ -0,0 +1,50 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""Potentially outdated pages"""
+
+__version__ = '1.0'
+__author__ = 'mwm@mired.org'
+
+import webcheck
+from httpcodes import HTTP_STATUS_CODES
+from rptlib import *
+
+Link = webcheck.Link
+linkList = Link.linkList
+config = webcheck.config
+
+title = "What's Old"
+
+# what's old
+def generate():
+ print '<div class="table">'
+ print '<table border=0 cellpadding=2 cellspacing=2 width="75%">'
+ print '\t<tr><th>Link</th><th>Author</th><th>Age</th></tr>'
+ urls = linkList.keys()
+ urls.sort(sort_by_rev_age)
+ for url in urls:
+ link=linkList[url]
+ if not link.html: continue
+ age = link.age
+ if age and (age >= config.REPORT_WHATSOLD_URL_AGE):
+ print '\t<tr><td>%s</td>' % make_link(url,get_title(url)),
+ print '<td>%s</td>' % (link.author),
+ print '<td class="time">%s</td></tr>' % age
+ add_problem('Old Link: %s days old' % age ,link)
+ print '</table>'
+ print '</div>'
diff --git a/robotparser.py b/robotparser.py
new file mode 100644
index 0000000..b479dd6
--- /dev/null
+++ b/robotparser.py
@@ -0,0 +1,103 @@
+"""
+
+Robots.txt file parser class. Accepts a list of lines or robots.txt URL as
+input, builds a set of rules from that list, then answers questions about
+fetchability of other URLs.
+
+Change made by marduk@python.net to support proxies.
+RobotFileParser class can be instantiated with optional proxies parameter,
+just like FancyURLopener in urllib.
+
+"""
+
+class RobotFileParser:
+
+ def __init__(self, proxies = None):
+ self.proxies = proxies
+ self.rules = {}
+ self.debug = 0
+ self.url = ''
+ self.last_checked = 0
+
+ def mtime(self):
+ return self.last_checked
+
+ def modified(self):
+ import time
+ self.last_checked = time.time()
+
+ def set_url(self, url):
+ self.url = url
+## import urlmisc
+## self.url = urlmisc.canonical_url(url)
+
+ def read(self):
+ import urllib
+ urlopener = urllib.FancyURLopener(self.proxies)
+ self.parse(urlopener.open(self.url).readlines())
+
+ def parse(self, lines):
+ import re, string
+ active = []
+ for line in lines:
+ if self.debug: print '>', line,
+ # blank line terminates current record
+ if not line[:-1]:
+ active = []
+ continue
+ # remove optional comment and strip line
+ line = string.strip(line[:string.find(line, '#')])
+ if not line:
+ continue
+ line = re.split(' *: *', line)
+ if len(line) == 2:
+ line[0] = string.lower(line[0])
+ if line[0] == 'user-agent':
+ # this record applies to this user agent
+ if self.debug: print '>> user-agent:', line[1]
+ active.append(line[1])
+ if not self.rules.has_key(line[1]):
+ self.rules[line[1]] = []
+ elif line[0] == 'disallow':
+ if line[1]:
+ if self.debug: print '>> disallow:', line[1]
+ for agent in active:
+ self.rules[agent].append(re.compile(line[1]))
+ else:
+ pass
+ for agent in active:
+ if self.debug: print '>> allow', agent
+ self.rules[agent] = []
+ else:
+ if self.debug: print '>> unknown:', line
+
+ self.modified()
+
+ # returns true if agent is allowed to fetch url
+ def can_fetch(self, agent, url):
+ import urlparse
+ ag = agent
+ if not self.rules.has_key(ag): ag = '*'
+ if not self.rules.has_key(ag):
+ if self.debug: print '>> allowing', url, 'fetch by', agent
+ return 1
+ path = urlparse.urlparse(url)[2]
+ for rule in self.rules[ag]:
+ if rule.match(path):
+ if self.debug: print '>> disallowing', url, 'fetch by', agent
+ return 0
+ if self.debug: print '>> allowing', url, 'fetch by', agent
+ return 1
+
+def test():
+ rp = RobotFileParser()
+ rp.debug = 1
+ rp.set_url('http://www.automatrix.com/robots.txt')
+ rp.read()
+ print rp.rules
+ print rp.can_fetch('*', 'http://www.calendar.com/concerts/')
+ print rp.can_fetch('Musi-Cal-Robot',
+ 'http://dolphin:80/cgi-bin/music-search?performer=Rolling+Stones')
+
+ print rp.can_fetch('Lycos', 'http://www/~skip/volkswagen/')
+ print rp.can_fetch('Lycos', 'http://www/~skip/volkswagen/vanagon-list-001')
diff --git a/schemes/__init__.py b/schemes/__init__.py
new file mode 100644
index 0000000..4b915ca
--- /dev/null
+++ b/schemes/__init__.py
@@ -0,0 +1,18 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+# hi mom
diff --git a/schemes/filelink.py b/schemes/filelink.py
new file mode 100644
index 0000000..0c0cb7c
--- /dev/null
+++ b/schemes/filelink.py
@@ -0,0 +1,57 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""This module defines the functions needed for creating Link objects for urls
+using the file scheme"""
+
+import urlparse
+import os
+import time
+import mimetypes
+import myUrlLib
+import re
+
+mimetypes.types_map['.shtml']='text/html'
+
+def init(self, url, parent):
+ self.URL = myUrlLib.basejoin(parent,url)
+ parsed = urlparse.urlparse(self.URL,'file',0)
+ filename = parsed[2]
+ if os.name != 'posix':
+ filename = re.sub("^/(//)?([a-zA-Z])[|:]","\\2:",filename)
+ try:
+ stats = os.stat(filename)
+ except os.error:
+ self.set_bad_link(self.URL, "No such file or directory")
+ return
+
+ self.size = stats[6]
+
+ lastMod = stats[8]
+ self.age = int((time.time()-lastMod)/myUrlLib.SECS_PER_DAY)
+
+ self.type = mimetypes.guess_type(url)[0]
+ if self.type is None: self.type = 'application/octet-stream' # good enough?
+
+def get_document(url):
+ parsed = urlparse.urlparse(url,'file',0)
+ filename = parsed[2]
+ if os.name != 'posix':
+ filename = re.sub("^/(//)?([a-zA-Z])[|:]","\\2:",filename)
+
+ return open(filename,'r').read()
+
diff --git a/schemes/ftplink.py b/schemes/ftplink.py
new file mode 100644
index 0000000..8f09d0a
--- /dev/null
+++ b/schemes/ftplink.py
@@ -0,0 +1,125 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 1998,1999 Mike Meyer <mwm@mired.org>
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""This module defines the functions needed for creating Link objects for urls
+using the ftp scheme"""
+
+import urllib
+import mimetypes
+import ftplib
+import urlparse
+import myUrlLib
+import string
+import posixpath
+import debugio
+
+Link = myUrlLib.Link
+
+def init(self, url, parent):
+
+ self.URL = myUrlLib.basejoin(parent,url)
+ self.type = mimetypes.guess_type(url)[0]
+
+ host, port, user, passwd, pathname = parseurl(url)
+ try:
+ ftp = ftplib.FTP(host,user,passwd)
+ stat(pathname, ftp)
+ except ftplib.all_errors, errtext:
+ self.set_bad_link(self.URL, str(errtext))
+ return
+
+ self.size = size(pathname,ftp)
+ if self.size is None: self.size = 0
+
+def callback(line):
+ """Read a line of text and do nothing with it"""
+ return
+
+def stat(pathname, ftpobject):
+ # This is not completely implemented
+ # Note: ftp servers do not respond with a 5xx error when a file does not
+ # exist except for GET, which I'm trying to GET around ;-) Anyway, an
+ # error code will be reported if you try to change to a directory that
+ # does not exist, so this is not totally useless
+ # In addition to the above, all of the ftp servers i tested this on
+ # did not report the correct code (211,212,213) when responding to STAT
+ # per RFC959. What the hell is up with that? Can checking ftp links be
+ # done reliably?
+ # FTP should be replaced by a new protocol that produces machine-readable
+ # responses and actually lets you get the status of a file without having to
+ # download it. Oh wait, that's what HTTP is.
+ dirs, filename = split_dirs(pathname)
+ cwd(dirs, ftpobject)
+ response = ftpobject.retrlines('NLST %s' % filename,callback)
+ debugio.write(response,2)
+
+def get_document(url):
+ host, port, user, passwd, pathname = parseurl(url)
+ dirs, filename = split_dirs(pathname)
+ ftp = ftplib.FTP(host,user,passwd)
+ cwd(dirs, ftp)
+ chunks = []; ftp.retrbinary('RETR %s' % filename, chunks.append); return string.join(chunks, '')
+
+def split_dirs(pathname):
+ """Given pathname, split it into a tuple consisting of a list of dirs and
+ a filename"""
+
+ dirs, filename = posixpath.split(pathname)
+ dirs = string.split(dirs,'/')
+ if dirs[0] == '': dirs[0] = '/'
+ if not filename:
+ filename = dirs[-1]
+ dirs = dirs[:-1]
+ return (dirs, filename)
+
+def size(pathname,ftpobject):
+ if pathname == '': pathname = '/'
+ dirs, filename = split_dirs(pathname)
+ debugio.write('pathname =%s' % pathname,3)
+ debugio.write('dirs= %s' % dirs,3)
+ debugio.write('filename= %s' % filename,3)
+ cwd(dirs, ftpobject)
+ return ftpobject.size(filename)
+
+def cwd(dirs, ftpobject):
+ for dir in dirs:
+ ftpobject.cwd(dir)
+
+def parseurl(url):
+ parsed = urlparse.urlparse(url)
+ host = parsed[1]
+ if '@' in host:
+ userpass, host = string.split(host,'@')
+ if ':' in userpass:
+ user, passwd = string.split(userpass,':')
+ else:
+ user = userpass
+ passwd = None
+ else:
+ user = 'anonymous'
+ # this is bad, i'll change it later
+ passwd = 'mwm@mired.org'
+
+ if ':' in host:
+ host, port = string.split(host,':')
+ port = int(port)
+ else:
+ port = ftplib.FTP_PORT
+
+ pathname = parsed[2]
+ if not port: port = ftplib.FTP_PORT
+ return (host, port, user, passwd, pathname)
diff --git a/schemes/httplink.py b/schemes/httplink.py
new file mode 100644
index 0000000..c792d84
--- /dev/null
+++ b/schemes/httplink.py
@@ -0,0 +1,167 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+"""This module defines the functions needed for creating Link objects for urls
+using the http scheme"""
+
+import myUrlLib
+import string
+import httplib
+import urllib
+import time
+import urlparse
+import base64
+import mimetypes
+import debugio
+import version
+
+config = myUrlLib.config
+Link = myUrlLib.Link
+proxies = config.PROXIES
+if proxies is None:
+ proxies = urllib.getproxies()
+redirect_depth = 0
+
+opener = urllib.FancyURLopener(proxies)
+opener.addheaders = [('User-agent','Webcheck ' + version.webcheck)]
+
+def get_reply(url):
+    """Open connection to url and report information given by HEAD command"""
+
+    global redirect_depth
+    parsed = urlparse.urlparse(url)
+    if proxies and proxies.has_key('http'):
+        host = urlparse.urlparse(proxies['http'])[1]
+        document = url
+
+    else:
+        host = parsed[1]
+        document = string.join(parsed[2:4],'')
+
+    if not document: document = '/'
+    debugio.write('document= %s' % document,3)
+
+    (username, passwd, realhost, port) = parse_host(host)
+
+    h = httplib.HTTP()
+    if port:
+        h.connect(realhost, int(port)) # parse_host returns the port as a string
+    else:
+        h.connect(realhost)
+
+    h.putrequest('HEAD', document)
+    if username and passwd:
+        auth = string.strip(base64.encodestring(username + ":" + passwd))
+        h.putheader('Authorization', 'Basic %s' % auth)
+    h.putheader('User-Agent','Webcheck %s' % version.webcheck)
+    h.putheader('Host',realhost)
+
+    h.endheaders()
+
+    errcode, errmsg, headers = h.getreply()
+    h.close()
+    debugio.write(errcode,2)
+    debugio.write(errmsg,2)
+    if errcode == 301 or errcode == 302:
+        redirect_depth = redirect_depth + 1
+        if redirect_depth > config.REDIRECT_DEPTH:
+            debugio.write('\tToo many redirects!')
+            redirect_depth = 0
+            return (errcode, errmsg, headers, url)
+        redirect = headers['location']
+        redirect = urlparse.urljoin(url,redirect)
+        if redirect == url:
+            debugio.write('\tRedirect same as source: %s' % redirect)
+            redirect_depth = 0
+            return (errcode, errmsg, headers, url)
+        debugio.write('\tRedirected to: ' + redirect)
+        if Link.linkList.has_key(redirect):
+            link = Link.linkList[redirect]
+            return (link.status, link.message, link.headers, link.URL)
+        return get_reply(redirect)
+    return (errcode, errmsg, headers, url)
+
+def init(self, url, parent):
+    """ Here, self is a reference to the link object that is calling this
+    pseudo-method"""
+
+    (self.status, self.message, self.headers, self.URL) = get_reply(myUrlLib.basejoin(parent,url))
+    Link.linkList[self.URL] = self
+    try:
+        self.type = self.headers.gettype()
+    except AttributeError:
+        self.type = 'text/html' # is this a good enough default?
+
+    debugio.write('\tContent-type: ' + self.type,2)
+    try:
+        self.size = int(self.headers['content-length'])
+    except (KeyError, TypeError, ValueError):
+        self.size = 0
+
+    if (self.status != 200) and (self.status != 'Not Checked'):
+        self.set_bad_link(self.URL,str(self.status) + ": " + self.message)
+        return
+
+    try:
+        lastMod = time.mktime(self.headers.getdate('Last-Modified'))
+    except (OverflowError, TypeError, ValueError):
+        lastMod = None
+    if lastMod:
+        self.age = int((time.time()-lastMod)/myUrlLib.SECS_PER_DAY)
+
+def get_document(url):
+    document = opener.open(url).read()
+    opener.cleanup()
+    return document
+
+def parse_host(location):
+    """Return a tuple (user, password, host, port)
+
+    Takes a network location of the form user:password@hostname:hostport.
+    A field that is missing from the string is returned as None; the port,
+    when present, is returned as a string.
+    """
+
+    #location = urlparse.urlparse(host)[1]
+    debugio.write("network location= %s" % location,3)
+
+    at = string.find(location, "@")
+    if at > -1:
+        userpass = location[:at]
+        colon = string.find(userpass, ":")
+        if colon > -1:
+            user = userpass[:colon]
+            passw = userpass[colon+1:]
+        else:
+            user = userpass
+            passw = None
+        hostport = location[at+1:]
+    else:
+        user = passw = None
+        hostport = location
+
+    colon = string.find(hostport, ":")
+    if colon > -1:
+        hostname = hostport[:colon]
+        port = hostport[colon+1:]
+    else:
+        hostname = hostport
+        port = None
+
+    debugio.write("parse_host = %s %s %s %s" % (user, passw, hostname, port),3)
+    return (user, passw, hostname, port)
+
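Editor's note on schemes/httplink.py: parse_host splits a network location by hand, and get_reply builds the Basic Authorization header from the result. The following is a minimal sketch of the same two steps in modern Python 3, not part of this commit. It assumes urllib.parse is an acceptable substitute for the string-module splitting, and basic_auth_header is a hypothetical helper name. Unlike the original, the port comes back as an int and the hostname is lowercased.

```python
import base64
from urllib.parse import urlsplit


def parse_host(location):
    """Split 'user:password@host:port' into (user, password, host, port).

    Missing fields come back as None, as in the original parse_host.
    """
    # urlsplit only treats the text as a network location after a leading '//'.
    parts = urlsplit('//' + location)
    return (parts.username, parts.password, parts.hostname, parts.port)


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header (hypothetical helper)."""
    token = base64.b64encode(('%s:%s' % (user, password)).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


if __name__ == '__main__':
    user, password, host, port = parse_host('user:secret@example.org:8080')
    print(host, port, basic_auth_header(user, password))
```

The standard-library parser also handles cases the hand-written version does not, such as bracketed IPv6 literals, at the cost of raising ValueError for a non-numeric port.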
diff --git a/version.py b/version.py
new file mode 100644
index 0000000..2c33a07
--- /dev/null
+++ b/version.py
@@ -0,0 +1,24 @@
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+"""Contains version and other static information"""
+
+webcheck="1.0"
+authors='Mike Meyer <mwm@mired.org>'
+home='http://www.mired.org/webcheck/'
+#registry='http://starship.python.net/crew/marduk/webcheck/registry'
diff --git a/webcheck.css b/webcheck.css
new file mode 100644
index 0000000..ba94323
--- /dev/null
+++ b/webcheck.css
@@ -0,0 +1,126 @@
+/* "Global" Settings */
+BODY {
+ background: #ffffff;
+ font-size: 10pt;
+}
+
+A:link {
+ color: #0000cd;
+}
+
+A.external:link {
+ font-style: italic;
+}
+
+A:active {
+ color: #0000ff;
+}
+
+A:visited {
+ color: #bc0000;
+}
+
+TH {
+ background: #cccc99;
+ color: #000000;
+}
+
+TD {
+ background: #eeeee0;
+ color: #000000;
+}
+
+BODY.navbar {
+ background:
+ url(http://www.mired.org/webcheck/blackbar.png);
+}
+.navbar TH {
+ background: #000000 url(http://www.mired.org/webcheck/blackbar.png);
+ color: #ffffff;
+ font-family: arial, sans-serif;
+}
+
+.highlight TH {
+ background: #ffffff;
+ color: #000000;
+}
+
+.navbar A:link {
+ background: none;
+ color: #ffffff;
+}
+
+.navbar A:active {
+ background: none;
+ color: #00ff00;
+}
+
+.navbar A:visited {
+ background: none;
+ color: #ffffff;
+}
+
+.navbar TH.home {
+ background: #cd0000;
+ color: #ffffff;
+}
+
+P.logo {
+ text-align: center;
+}
+
+H1 {
+ font-size: 1.5em;
+}
+H1.basename {
+ text-align: center;
+}
+
+TH.title {
+ background: #cd0000;
+ color: #ffffff;
+ font-family: "comic sans ms", verdana, sans-serif;
+ font-size: 1.3em;
+ text-align: left;
+}
+
+TR.link {
+ background: #cccc99;
+ color: #000000;
+}
+
+TR.status {
+ background: #bbbbb0;
+ color: #000000;
+}
+
+TR.parent {
+ background: #ddddd0;
+ color: #000000;
+}
+
+/* time/age fields */
+TD.time {
+
+ background: #ddddd0;
+ color: #000000;
+ text-align: right;
+}
+
+TD.blank {
+ background: #ffffff;
+}
+
+DIV.table {
+ /* the only way I know of to align tables via CSS */
+ text-align: center;
+}
+
+
+P.authorlist {
+ font-family: helvetica, sans-serif;
+ font-size: smaller;
+ font-weight: bold;
+ text-align: center;
+}
+
diff --git a/webcheck.py b/webcheck.py
new file mode 100755
index 0000000..f2d5fd1
--- /dev/null
+++ b/webcheck.py
@@ -0,0 +1,145 @@
+#!/usr/bin/env python
+
+# Copyright (C) 1998,1999 marduk <marduk@python.net>
+# Copyright (C) 2002 Mike Meyer <mwm@mired.org>
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+USAGE='webcheck [-abvq][-l url][-x url]... [-y url]... [-r depth][-o dir][-w sec][-d level] url [location]...'
+PYTHON_VERSION=1.5 # not used right now
+explored = []
+problem_db = {}
+linkList = {}
+
+import sys
+import time
+
+
+start_time = time.ctime(time.time())
+
+# importing the config.py file is a real problem if the user did not install
+# the files EXACTLY the way I said to... or even using the frozen version is
+# becoming a real bitch. I will just have to tell them right out how to fix it.
+try:
+    sys.path = ['.'] + sys.path
+    import config
+except ImportError:
+    sys.stdout.write('Please verify that PYTHONPATH knows where to find "config.py"\n')
+    sys.exit(1)
+
+import myUrlLib
+Link=myUrlLib.Link
+
+# myUrlLib will be looking for a 'config' module. set it up here.
+myUrlLib.config=config
+
+import debugio
+debugio.DEBUG_LEVEL = config.DEBUG_LEVEL
+
+import version
+
+def parse_args():
+    import getopt
+    global URL
+    try:
+        optlist, args = getopt.getopt(sys.argv[1:],'vl:x:y:ar:o:bw:d:q')
+    except getopt.error, reason:
+        print reason
+        print USAGE
+        sys.exit(1)
+    for flag,arg in optlist:
+        if flag=='-v':
+            print_version()
+            sys.exit(0)
+        elif flag=='-x':
+            config.EXCLUDED_URLS.append(arg)
+        elif flag=='-y':
+            config.YANKED_URLS.append(arg)
+        elif flag=='-a':
+            config.AVOID_EXTERNAL_LINKS=1
+        elif flag=='-r':
+            config.REDIRECT_DEPTH=int(arg)
+        elif flag=='-o':
+            config.OUTPUT_DIR=arg
+        elif flag=='-b':
+            config.BASE_URLS_ONLY=1
+        elif flag=='-w':
+            config.WAIT_BETWEEN_REQUESTS=int(arg)
+        elif flag=='-l':
+            config.LOGO_HREF=arg
+        elif flag=='-d':
+            debugio.DEBUG_LEVEL=int(arg)
+        elif flag=='-q':
+            debugio.DEBUG_LEVEL=0
+
+    if len(args)==0:
+        print USAGE
+        sys.exit(1)
+    else: URL = args[0]
+    config.HOSTS=args[1:]
+
+def print_version():
+    """Print version information"""
+    import os
+    print " Webcheck: " + version.webcheck
+    print " Python: " + sys.version
+    print " OS: " + os.name
+    print
+
+def warn():
+    """Warn the user that something has gone wrong."""
+    print "*******************************************"
+    print "* *"
+    print "* Warning, Webcheck has found nothing to *"
+    print "* report for this site. If you feel this *"
+    print "* is in error, please contact *"
+    print "* %s. *" % version.authors
+    print "* and specify the environment that caused *"
+    print "* this to occur. *"
+    print "* *"
+    print "* Webcheck %s *" % version.webcheck
+    print "* *"
+    print "*******************************************"
+
+# set up the pages
+plugins = config.PLUGINS
+
+if __name__ == '__main__':
+
+    parse_args()
+    config.OUTPUT_DIR=config.OUTPUT_DIR + '/'
+
+    debugio.write('checking site....')
+    try:
+        Link.base = Link(URL,None) # this will take a while
+    except KeyboardInterrupt:
+        sys.stderr.write("Interrupted\n")
+        sys.exit(1)
+    debugio.write('done.')
+    if not hasattr(Link.base,"URL"):
+        warn()
+        sys.exit(1)
+
+    linkList = Link.linkList
+
+    # now we can write out the files
+    # start with the frame-description page
+    debugio.write('Generating reports...')
+    from plugins.rptlib import main_index, nav_bar
+    main_index()
+    nav_bar(plugins)
+    debugio.write('done.')
+
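Editor's note on webcheck.py: parse_args walks the getopt result and writes each flag into the shared config module. Below is a small Python 3 sketch of the same pattern that returns a plain dict instead of mutating globals, which makes it easier to test. It is not part of this commit. The option string is copied from webcheck.py; the dict keys and the default values are hypothetical, since the real defaults live in config.py, which is not shown here. Flags the sketch does not handle (-v, -l, -o, -b, -d, -q) are accepted and ignored.

```python
import getopt

OPTSTRING = 'vl:x:y:ar:o:bw:d:q'  # same option string as webcheck.py


def parse_args(argv):
    """Return a settings dict built from argv (program name already removed)."""
    # Hypothetical defaults for illustration only.
    settings = {'excluded': [], 'yanked': [], 'avoid_external': False,
                'redirect_depth': 5, 'wait': 0}
    optlist, args = getopt.getopt(argv, OPTSTRING)
    for flag, arg in optlist:
        if flag == '-x':
            settings['excluded'].append(arg)    # repeatable option
        elif flag == '-y':
            settings['yanked'].append(arg)      # repeatable option
        elif flag == '-a':
            settings['avoid_external'] = True
        elif flag == '-r':
            settings['redirect_depth'] = int(arg)
        elif flag == '-w':
            settings['wait'] = int(arg)
    if not args:
        raise SystemExit('a start url is required')
    # First positional argument is the start url, the rest are extra hosts.
    settings['url'] = args[0]
    settings['hosts'] = args[1:]
    return settings


if __name__ == '__main__':
    print(parse_args(['-x', 'http://a/', '-r', '3', 'http://site/']))
```

getopt stops at the first non-option argument, so all flags must precede the start url, which matches the USAGE string in webcheck.py.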
diff --git a/webcheck.sh b/webcheck.sh
new file mode 100755
index 0000000..b472e87
--- /dev/null
+++ b/webcheck.sh
@@ -0,0 +1,4 @@
+#! /bin/sh
+PYTHONPATH="/home/mwm/src/webcheck:$PYTHONPATH"; export PYTHONPATH
+PATH="/usr/opt/bin:$PATH"
+/home/mwm/src/webcheck/webcheck.py "$@"