Arthur de Jong

Open Source / Free Software developer

News

  • 2010-09-11 release 1.10.4 of webcheck
    This is more or less a maintenance release that gathers some outstanding fixes.
    An overview of the changes since the last release:
    • remove some left-over debugging code
    • several small bugfixes which more or less drop support for Python 2.3
    • limit "referenced from" list to 10 items
    • pass char_encoding option to tidy to fix some tidy-related errors
    • add a Referer header if possible (thanks Devin Bayer)
    • Debian packaging improvements
    Get this release from the downloads section.
  • 2009-06-14 webcheck homepage moved
    Since I haven't been studying at the Delft University for quite some time, the webcheck homepage has been moved to https://arthurdejong.org/webcheck/. The contact email address has also been changed to arthur@arthurdejong.org.
    The subversion repository and viewvc URLs have also changed (see the downloads section for details). If you were using the svn repository before you can do
    svn switch --relocate http://arthurenhella.demon.nl/ https://arthurdejong.org/
    to relocate your working copy.
    In the meantime, webcheck has migrated to Git. The Subversion repository will no longer be updated and will likely go away at some point; see the downloads page for more information.
  • 2008-07-19 release 1.10.3 of webcheck
    This is a minor update release that fixes some smaller outstanding issues and adds a couple of new features.
    An overview of the changes since the last release:
    • support <iframe> and some common usages of <object>
    • fix bug in command-line parsing of short -r option
    • implement the --userpass option to pass username and password information to specific sites based on a patch by Chris Shenton
    • handle errors while parsing more gracefully
    • add parsing of <script> tag and background attributes, based on a patch by Robert M. Jansen
    • fix in parsing <style> tags and support style attributes
    • call tidy (if available) on HTML content, based on a patch by Henning Sielaff
    • fix problem with port numbers in host headers
    • Debian package improvements
    Get this release from the downloads section.
  • 2007-11-04 release 1.10.2 of webcheck
    This is a minor update release that fixes some smaller outstanding issues.
    An overview of the changes since the last release:
    • add checking for bug in BeautifulSoup and issue warning if bug is found
    • added support for Python 2.3 (although more recent versions of Python are recommended)
    • small documentation improvements
    • Debian package improvements
    Get this release from the downloads section.
  • 2007-07-15 release 1.10.1 of webcheck
    This release includes some big performance improvements (especially for very large sites) as well as some small bug fixes. This release requires Python 2.4 or more recent to work.
    An overview of the changes since the last release:
    • some extra Unicode handling precautions
    • fix problem in reading webcheck.dat for non-ASCII text
    • be more verbose about HTTP retrieval failures
    • split out URL normalization code into own module and do some basic protocol-specific normalizations
    • a number of big performance improvements
    • fix a bug in handling some zero-size pages
    • parse http-equiv meta HTML header to parse refresh option
    • webcheck now requires Python 2.4 or more recent
    Get this release from the downloads section.
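The URL normalization mentioned in the 1.10.1 notes above can be illustrated with a small sketch. This is not webcheck's own module: the function name `normalize`, the `DEFAULT_PORTS` table and the specific rules (lower-casing scheme and host, dropping default ports, using `/` for an empty HTTP path) are assumptions chosen for illustration, written for current Python 3.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports per protocol (an assumption for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def normalize(url):
    """Return a normalized form of url (sketch; ignores user info and IPv6 hosts)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the protocol's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Protocol-specific rule: an empty HTTP(S) path is equivalent to "/".
    path = parts.path
    if not path and scheme in ("http", "https"):
        path = "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize("HTTP://Example.COM:80"))
```

With rules like these, two spellings of the same address map to one key, so a crawler would not need to fetch the same page twice.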
  • 2007-05-12 release 1.10.0 of webcheck
    This release changes the HTML parser under the hood (hence the version bump) and includes some further general improvements.
    An overview of the changes since the last release:
    • switched HTML parsing to using BeautifulSoup with a fall-back mechanism to the old HTMLParser based solution
    • the new parser is much more error-tolerant but is reportedly somewhat slower and does not include line numbers in errors
    • new features will likely only be added to the new parser
    • some small improvements to the output to make it XHTML 1.1 compliant
    • internal improvements for handling Unicode strings
    • better support for parsing <applet> tags and anchors using id attributes
    • re-enable robots.txt parsing that was disabled in 1.9.8 and add an --ignore-robots option
    Get this release from the downloads section.
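As an illustration of the robots.txt handling and the --ignore-robots option described above, here is a minimal sketch using the standard library's `urllib.robotparser` in current Python 3. It is not webcheck's implementation; the helper name `may_fetch`, its `ignore_robots` parameter and the `webcheck` user agent string are assumptions.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, robots_lines, ignore_robots=False, agent="webcheck"):
    """Decide whether url may be crawled given the lines of a robots.txt file."""
    if ignore_robots:
        # Mirrors the idea of an --ignore-robots switch: skip the check entirely.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


robots = ["User-agent: *", "Disallow: /private/"]
print(may_fetch("http://example.com/private/x", robots))
print(may_fetch("http://example.com/private/x", robots, ignore_robots=True))
```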
  • 2007-01-15 release 1.9.8 of webcheck
    This is a long overdue development release that should mainly include some stability improvements.
    An overview of the changes since the last release:
    • some checks for properly handling unknown and wrong encodings have been added
    • added proper error handling for SSL related socket problems (exceptions are not a subclass of regular socket exceptions)
    • a bugfix for urls that contain a user name without a password or the other way around
    • miscellaneous small report improvements
    Get this release from the downloads section.
  • 2006-07-02 release 1.9.7 of webcheck
    This is another development release that should improve stability but also adds some new functionality.
    Any feedback is still very much appreciated (thanks for all the feedback I already got). An overview of the changes since the last release:
    • site data is now stored to a file while crawling the site; this can be used to resume a crawl with the --continue option and for debugging purposes
    • implemented checking of link anchors
    • small improvements to generated reports (favicon included, css fix)
    • documentation improvements
    • properly handle float values for --wait
    • unreachable sites will time out faster
    • added support for plugins that don't output html
    • half a dozen other small bugfixes (stability fixes, code cleanups and improvements)
    Get this release from the downloads section.
  • 2006-05-08 svn access available
    Public read-only access to the webcheck subversion development repository is now available. The repository is also browsable through viewcvs. More details in the downloads section.
    svn access: http://arthurenhella.demon.nl/svn/webcheck/
    viewcvs: http://arthurenhella.demon.nl/viewvc/webcheck/
    The development repository currently includes a number of improvements over release 1.9.6 including support for outputting different file types in the report, some minor stability fixes, a transfer timeout and checking for anchors in pages.
  • 2006-01-30 release 1.9.6 of webcheck (security update)
    This release fixes a cross site scripting vulnerability. Content from crawled pages was insufficiently escaped in the tooltips of the generated report. A carefully crafted url, title or author name could allow a website operator to insert html code into the generated report. Users of webcheck 1.9.5 are urged to upgrade to this release.
    The CVE project has assigned id CVE-2006-1321 to this problem.
    Further improvements to stability were also made. Thanks for all the bug reports that help improve webcheck (more feedback is always appreciated).
    Changes since release 1.9.5:
    • a cross-site scripting vulnerability with content in the tooltips of generated report was fixed by properly escaping all output
    • urls are now url encoded into a consistent form, solving some problems with urls with non-ascii characters
    • no longer remove unreferenced redirects
    • more debugging info in debug mode
    • more fixes for escaping in generated reports and more support for sites in different character sets
    Get this release from the downloads section.
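The kind of escaping that closes the hole described in the 1.9.6 notes above can be sketched as follows. This is an illustration in current Python 3 using `html.escape`, not the code shipped in webcheck; the function name `tooltip_attr` and the tooltip layout are assumptions.

```python
from html import escape


def tooltip_attr(title, author):
    """Build an HTML title attribute from untrusted page metadata."""
    text = f"{title} (by {author})"
    # quote=True also escapes double and single quotes, so the value
    # cannot terminate the attribute and inject markup into the report.
    return f'title="{escape(text, quote=True)}"'


print(tooltip_attr('"><script>alert(1)</script>', "x"))
```

The point is that every value taken from a crawled page is escaped at the moment it is written into the report, regardless of where it came from.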
  • 2005-12-30 release 1.9.5 of webcheck
    This is another development release that should improve stability somewhat but also has some new functionality.
    Any feedback is still very much appreciated (thanks for all the feedback I already got). An overview of the changes since the last release:
    • about page now has some more useful information
    • proxy authentication is implemented
    • fix for using relative paths as output directory
    • add support for parsing html documents in different encodings
    • ensure that all generated html output is properly escaped
    • implemented --internal option to flag internal URLs with regular expressions
    • documentation improvements
    • several bugfixes to make webcheck more robust
    • included fancytooltips by Victor Kulinski to have nicer tooltips
    • generated reports now have friendlier messages for when there is nothing to report
    • there is a Debian package
  • 2005-09-03 release 1.9.4 of webcheck
    This is another development update that introduces some new functionality. There were some small stability improvements but no need to fix any major bugs.
    Any feedback is still very much appreciated. An overview of the changes since the last release:
    • split problems into link problems (errors retrieving the document) and page problems (parsing errors, wrong links, etc)
    • some fixes and improvements to the layout of the generated pages
    • redirect loops are now detected
    • transfer result status is now stored
    • addition of a limited css parser that handles imports and url() entries
    • support reading file names for checking from the command line (turning them into file:// urls internally)
    • better error handling of problems writing generated pages and check that we are not overwriting input files
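One simple way to detect the redirect loops mentioned in the 1.9.4 notes above is to remember every URL visited while following a chain of redirects. The sketch below assumes the redirects are already known as a plain dictionary; the function name `find_redirect_loop` and that data layout are hypothetical and not taken from webcheck.

```python
def find_redirect_loop(redirects, start):
    """Return the chain of URLs if following redirects from start loops, else None."""
    chain = [start]
    visited = {start}
    url = start
    while url in redirects:
        url = redirects[url]
        chain.append(url)
        if url in visited:
            # We came back to a URL already on the chain: this is a loop.
            return chain
        visited.add(url)
    return None


redirects = {"a": "b", "b": "c", "c": "a", "x": "y"}
print(find_redirect_loop(redirects, "a"))
print(find_redirect_loop(redirects, "x"))
```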
  • 2005-08-16 release 1.9.3 of webcheck
    This release introduces some more rewritten parts as well as some bug and stability fixes. These releases are still more development snapshots than real releases, although they should be usable.
    Please report any problems, ideas and/or improvements. An overview of the changes since the last release:
    • several improvements to the generated reports, including tooltips with some useful information for the links (does not seem to work very well in firefox)
    • stability improvements to the html parser (thanks to everyone who reported problems); not all problems have been solved but they shouldn't stop webcheck any more
    • reimplementation of the file and ftp modules to read directory contents or read index.html file if present (there are known problems in the ftp module regarding empty directories and recovering from errors)
    • improvements to the url parsing code to warn about spaces in urls
    • only fetch content if we can parse it, based on the content type
  • 2005-07-31 release 1.9.2 of webcheck
    This is another development release of webcheck with some more structural changes. Please report any problems. An overview of the changes since the last release:
    • complete reimplementation of the html and http modules
    • added https support
    • some spelling and typo fixes contributed by several people
    • site map now does a proper breadth first traversal of the site structure
    • several minor bugfixes and tweaks
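A breadth-first traversal of a site structure, as mentioned for the site map in the 1.9.2 notes above, visits pages level by level so that each page appears at its shortest distance from the start page. The sketch below assumes the site is available as a dictionary mapping each page to the pages it links to; the name `breadth_first` and this layout are hypothetical.

```python
from collections import deque


def breadth_first(links, start):
    """Return pages reachable from start in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in links.get(page, []):
            # Each page is queued once, the first time it is discovered.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


links = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"]}
print(breadth_first(links, "/"))
```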
  • 2005-07-29 webcheck homepage moved
    Since the old server is no longer available, the webcheck homepage has been moved to http://ch.tudelft.nl/~arthur/webcheck/. As a consequence, the contact email address has been changed to arthur@ch.tudelft.nl. The old homepage and email address no longer work.
  • 2005-07-25 release 1.9.1 of webcheck
    This is a quick fix for a showstopper in release 1.9.0. An overview of the changes since the last release:
    • ship an empty css.py to actually run
    • small bugfixes for pages with multiple titles and slow plugin
  • 2005-07-24 release 1.9.0 of webcheck
    This is the first release of webcheck from my hand. Some major parts have been rewritten and some other parts have yet to be rewritten, so this is a development release. Some parts of the system do not work 100% yet (e.g. there are known problems with ftp) but these are being worked on. Please send feedback on problems and wishes. The goal for now is to work towards a stable 2.0 release. A rough overview of the changes since the 1.0 release on which this release is based:
    • integrated several patches from Debian and Ubuntu packages
    • major rewrite of website crawling code allowing for easier change of request model (e.g. multi-threaded crawling)
    • documentation has been rewritten, including a new manual page
    • clean up of config.py which isn't really a configuration file any more
    • complete rewrite of output now generating valid XHTML 1.1 with CSS for style information
    • general refactoring of most parts of the code