Arthur de Jong

Open Source / Free Software developer

path: root/webcheck
Commit message | Author | Age | Files | Lines
* Provide a CSV file plugin [HEAD, master] | Arthur de Jong | 2013-12-15 | 2 | -1/+64
  This plugin generates a simple CSV file with all the URLs in the system and some basic information about them.
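To illustrate the CSV plugin entry above, here is a minimal Python 3 sketch of such an export using the standard csv module. The column names and the dict-shaped link records are assumptions made for this example; the plugin's real columns and data model are not shown in this log.

```python
import csv
import io

# Hypothetical column set; the real plugin may export different fields.
FIELDS = ['url', 'status', 'mimetype', 'size']


def write_csv(links, out):
    """Write a header row and then one row per link to the file object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    for link in links:
        # Missing keys are written as empty cells, extra keys are ignored.
        writer.writerow(link)


def render(links):
    """Return the CSV output as a string (convenient for testing)."""
    buf = io.StringIO(newline='')
    write_csv(links, buf)
    return buf.getvalue()
```

Calling render() with a list of dicts returns the complete CSV text; to write a real file, open it with newline='' and pass it to write_csv() instead.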
* Remove duplicate column definition | Arthur de Jong | 2013-12-15 | 1 | -1/+0
* Split functionality into Link.get_or_create() | Arthur de Jong | 2013-12-15 | 2 | -17/+21
  This splits some common functionality from Link._get_child() and Crawler.get_link() to the new Link.get_or_create() function.
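The get-or-create pattern named in the entry above can be sketched as follows. This is not the project's Link.get_or_create() (which, judging by other entries, works on SQLAlchemy sessions); it is a stand-in using the standard sqlite3 module, with a hypothetical `links` table, to show the lookup-then-insert idea.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical links table."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)')
    return conn


def get_or_create(conn, url):
    """Return (link_id, created) for url, inserting a row only if needed."""
    row = conn.execute(
        'SELECT id FROM links WHERE url = ?', (url,)).fetchone()
    if row is not None:
        return row[0], False
    cursor = conn.execute('INSERT INTO links (url) VALUES (?)', (url,))
    return cursor.lastrowid, True
```

Having one function for this means every caller that needs a link for a URL goes through the same lookup, so a URL cannot end up with two rows.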
* Rename some functions | Arthur de Jong | 2013-12-15 | 2 | -16/+16
  This should make some functions clearer and marks internal functions with a leading underscore.
* Small simplification | Arthur de Jong | 2013-12-15 | 1 | -1/+1
* Move SQLite initialisation to db module | Arthur de Jong | 2013-12-15 | 2 | -14/+13
* Remove annoying debug log message | Arthur de Jong | 2013-12-15 | 1 | -2/+1
* Store link and page problems as unicode | Arthur de Jong | 2013-12-02 | 1 | -4/+10
  This converts problems to unicode so they can be stored correctly by SQLAlchemy. This amongst other things fixes a problem when the web server returns a status message with non-ASCII characters.
* Only convert content if link has encoding | Arthur de Jong | 2013-12-02 | 1 | -1/+2
  This fixes an issue for calling tidy when the character encoding of the page could not be determined.
* Move static files to webcheck/static | Arthur de Jong | 2013-12-02 | 7 | -28/+783
  This moves all static files to be installed into the webcheck Python path and uses pkg_resources to load the files.
* Fix missing import | Arthur de Jong | 2013-12-02 | 1 | -0/+1
* Fix setuptools entry point invocation | Arthur de Jong | 2013-12-02 | 1 | -1/+1
  Reported by Emmanuel Blot, fixes 24e191f.
* Support older versions of Jinja | Arthur de Jong | 2013-11-18 | 1 | -2/+5
  This tries to gracefully support older versions of Jinja that don't provide the trim_blocks, lstrip_blocks or keep_trailing_newline options.
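One way to degrade gracefully when a constructor does not know newer keyword options is sketched below. How the commit above actually does this is not shown in the log, so this is only an assumed approach: pass all optional settings, and on a TypeError drop the most recently introduced one and retry. `LegacyEnvironment` is a hypothetical stand-in for an old template environment class so that the example runs without Jinja installed.

```python
# Optional settings, listed oldest first so the newest is dropped first.
OPTIONAL = ('trim_blocks', 'lstrip_blocks', 'keep_trailing_newline')


def make_environment(factory, **required):
    """Call factory with as many optional settings as it accepts."""
    optional = {name: True for name in OPTIONAL}
    while True:
        try:
            return factory(**required, **optional)
        except TypeError:
            if not optional:
                raise  # failure is not caused by the optional settings
            optional.popitem()  # drop the last (newest) option and retry


class LegacyEnvironment:
    """Hypothetical old environment that only knows trim_blocks."""

    def __init__(self, autoescape=False, trim_blocks=False):
        self.options = {'autoescape': autoescape, 'trim_blocks': trim_blocks}
```

With a real installation, the environment class itself would be passed as `factory`. Note that catching TypeError this broadly can also hide unrelated errors raised inside the constructor, which is the main drawback of this approach.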
* Optimise count_parents() | Arthur de Jong | 2013-10-06 | 1 | -11/+4
  This combines two queries using a union that already does distinct. This also removes the distinct from the parents() function because it uses a union which is supposed to use distinct already.
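The point of the entry above is that SQL's UNION (unlike UNION ALL) already removes duplicate rows, so no extra DISTINCT is needed. A small sketch with the standard sqlite3 module shows this; the `children` and `embedded` tables and their contents are hypothetical and only loosely modelled on the idea of a link having parents through two kinds of relation.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database with two hypothetical relation tables."""
    conn = sqlite3.connect(':memory:')
    conn.executescript('''
        CREATE TABLE children (parent_id INTEGER, child_id INTEGER);
        CREATE TABLE embedded (parent_id INTEGER, child_id INTEGER);
        INSERT INTO children VALUES (1, 3), (2, 3);
        INSERT INTO embedded VALUES (2, 3), (4, 3);
    ''')
    return conn


def count_parents(conn, link_id):
    """Count distinct parents of link_id across both tables in one query."""
    sql = '''
        SELECT COUNT(*) FROM (
            SELECT parent_id FROM children WHERE child_id = :id
            UNION
            SELECT parent_id FROM embedded WHERE child_id = :id
        )
    '''
    return conn.execute(sql, {'id': link_id}).fetchone()[0]
```

In this data, link 2 is a parent of link 3 through both tables but is counted once, because UNION collapses the duplicate row.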
* Use crawler.base_urls instead of crawler.bases | Arthur de Jong | 2013-09-28 | 2 | -35/+32
  Exposing crawler.bases leaks the sqlalchemy session to the plugins which seems to cause problems in some cases. As a consequence of this change, the sitemap plugin now uses its own session.
* Introduce a site_name in the crawler | Arthur de Jong | 2013-09-28 | 3 | -5/+7
* Fix old and new templates to use datetime objects | Arthur de Jong | 2013-09-28 | 4 | -14/+8
* Fix time formatting | Arthur de Jong | 2013-09-28 | 1 | -1/+1
* Get response size and modified date from request | Arthur de Jong | 2013-09-28 | 1 | -3/+9
* Add missing template changes from Jinja merge | Arthur de Jong | 2013-09-22 | 3 | -13/+40
  This also fixes newlines in link meta information that were incorrectly escaped.
* Remove unused code | Arthur de Jong | 2013-09-22 | 3 | -259/+0
  Most of this is removed because of the switch to the Jinja template engine.
* Switch plugins to use template | Arthur de Jong | 2013-09-22 | 24 | -458/+671
  The sitemap module has been somewhat rewritten to use generators to provide the structure of the website. The problems module has also been simplified a bit.
* Introduce template macros for rendering links | Arthur de Jong | 2013-09-22 | 2 | -0/+86
* Introduce a base template | Arthur de Jong | 2013-09-22 | 2 | -0/+64
  This sets up the basic layout for the report. The plugins are expected to supply a crawler instance.
* Provide function for template-based report rendering | Arthur de Jong | 2013-09-22 | 3 | -3/+25
  This uses the Jinja template engine to produce the report HTML files. This also renames the util module to output to better describe its purpose.
* Properly write a UTF-8 encoded output file | Arthur de Jong | 2013-09-22 | 1 | -8/+9
  Write output using codecs.open() with the UTF-8 encoding. This also introduces a consistency improvement in argument naming.
* Explicitly close database sessions | Arthur de Jong | 2013-09-22 | 13 | -11/+27
  This tries to close the session when the function is done with it to avoid using too much memory.
* Initialise crawler with a configuration | Arthur de Jong | 2013-09-20 | 2 | -74/+57
  This changes the constructor to accept a dict configuration of the crawler. This is currently combined with the configuration in the config module but the goal is to replace it completely.
* Expose configured plugins via crawler.plugins | Arthur de Jong | 2013-09-20 | 3 | -34/+29
  This avoids having module loading code in different places.
* Get default configuration from config module | Arthur de Jong | 2013-09-20 | 2 | -10/+20
* Use the argparse Python module | Arthur de Jong | 2013-09-20 | 1 | -136/+110
  This greatly simplifies the command line parsing.
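For readers unfamiliar with argparse, a short sketch of what declarative command line parsing looks like. The options below (`--output`, `--quiet` and positional URLs) are illustrative assumptions, not webcheck's actual option set, which this log does not list.

```python
import argparse


def build_parser():
    """Build a parser with a few hypothetical crawler options."""
    parser = argparse.ArgumentParser(
        prog='webcheck',
        description='Check a website and generate a report (sketch).')
    parser.add_argument(
        'urls', metavar='URL', nargs='+',
        help='base URLs to start crawling from')
    parser.add_argument(
        '-o', '--output', metavar='DIRECTORY', default='.',
        help='directory to write the report to')
    parser.add_argument(
        '-q', '--quiet', action='store_true',
        help='only log errors')
    return parser
```

argparse derives the usage text, the --help output and the error handling from these declarations, which is the kind of code a hand-written parser has to spell out itself.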
* pass a string to RobotFileParser because of problems with unicode | Arthur de Jong | 2012-08-29 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@471 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* now setup.py gets homepage and version from webcheck/__init__.py | Devin Bayer | 2011-11-19 | 2 | -4/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@470 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* cleanup after introduction of entry_point | Devin Bayer | 2011-11-16 | 1 | -11/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@468 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move cmd.py to package to support an entry point called webcheck | Devin Bayer | 2011-11-16 | 1 | -0/+208
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@466 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* in old html parser, handle more invalid encodings | Devin Bayer | 2011-11-16 | 1 | -4/+3
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@465 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* support MAX_DEPTH == 0 | Devin Bayer | 2011-11-16 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@464 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* detect self-referencing redirects even with intermediate pages | Devin Bayer | 2011-11-16 | 1 | -7/+10
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@463 86f53f14-5ff3-0310-afe5-9b438ce3f40c
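Detecting a redirect that eventually leads back to itself, including through intermediate pages, amounts to cycle detection along the redirect chain. The sketch below shows that idea on a plain dict mapping each URL to its redirect target; the function name and data shape are assumptions for illustration and not the code changed in the commit above.

```python
def find_redirect_loop(redirects, url):
    """Follow redirects from url; return the looping URLs, or None.

    `redirects` maps a URL to the URL it redirects to. URLs that are
    not keys in the mapping are treated as final pages.
    """
    seen = []
    while url in redirects:
        if url in seen:
            # We came back to a URL already on the chain: report the loop,
            # starting at the first URL that is part of it.
            return seen[seen.index(url):]
        seen.append(url)
        url = redirects[url]
    return None
```

A check that only compares a redirect target with its immediate source would catch `a -> a` but miss `a -> b -> c -> a`; remembering the whole chain catches both.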
* fix encoding issues with strings passed to/from tidy | Arthur de Jong | 2011-11-08 | 2 | -2/+4
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@460 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* implement a MAX_DEPTH configuration option to limit crawling based on a patch by Devin Bayer | Arthur de Jong | 2011-11-04 | 3 | -3/+14
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@459 86f53f14-5ff3-0310-afe5-9b438ce3f40c
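A depth limit on crawling can be sketched as a breadth-first traversal that stops following links once a page is at the maximum depth. The semantics assumed here (None means no limit, 0 means only the starting page is visited) are a guess for illustration; the log does not say how webcheck interprets MAX_DEPTH. The site is modelled as a plain dict of URL to linked URLs.

```python
from collections import deque


def crawl(graph, start, max_depth=None):
    """Return a dict of reachable URL -> depth, honouring max_depth.

    `graph` maps each URL to the URLs it links to. The comparison uses
    `is not None` on purpose, so that max_depth=0 is a real limit and
    is not mistaken for "no limit".
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if max_depth is not None and depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        for child in graph.get(url, ()):
            if child not in depths:
                depths[child] = depth + 1
                queue.append(child)
    return depths
```

Writing the test as `if max_depth and ...` would silently turn a limit of 0 into an unlimited crawl, which is a common pitfall with numeric options.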
* simplification in size calculation | Arthur de Jong | 2011-10-14 | 1 | -9/+7
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@458 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* switch to using the logging framework | Arthur de Jong | 2011-10-14 | 7 | -119/+66
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@457 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* simplify logging of depth | Arthur de Jong | 2011-10-14 | 1 | -2/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@456 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix typo that resulted in bad links not being reported as page problems | Arthur de Jong | 2011-10-08 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@455 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix missing import (broken in r452) | Arthur de Jong | 2011-10-08 | 1 | -1/+1
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@454 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also handle exceptions while parsing (e.g. issue when reading the response times out) | Arthur de Jong | 2011-10-08 | 1 | -6/+9
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@453 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* ensure that the database is emptied completely and move the code to webcheck.db | Arthur de Jong | 2011-10-08 | 2 | -12/+21
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@452 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* switch to using MozillaCookieJar because LWPCookieJar has issues with some dates (http://bugs.python.org/issue5537) | Arthur de Jong | 2011-10-08 | 1 | -2/+2
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@451 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* rename Crawler.add_internal() to Crawler.add_base() and automatically initialise database connection when needed | Arthur de Jong | 2011-10-07 | 1 | -9/+25
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@450 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* rename Site to Crawler | Arthur de Jong | 2011-10-07 | 17 | -38/+38
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@448 86f53f14-5ff3-0310-afe5-9b438ce3f40c