Arthur de Jong

Open Source / Free Software developer

Commit message (Author, Age, Files, Lines)
* Provide a CSV file plugin [HEAD, master] (Arthur de Jong, 2013-12-15, 2 files, -1/+64)
|   This plugin generates a simple CSV file with all the URLs in the system and some basic information about them.
* Remove duplicate column definition (Arthur de Jong, 2013-12-15, 1 file, -1/+0)
|
* Split functionality into Link.get_or_create() (Arthur de Jong, 2013-12-15, 2 files, -17/+21)
|   This splits some common functionality from Link._get_child() and Crawler.get_link() to the new Link.get_or_create() function.
* Rename some functions (Arthur de Jong, 2013-12-15, 2 files, -16/+16)
|   This should make some functions clearer and marks internal functions with a leading underscore.
* Small simplification (Arthur de Jong, 2013-12-15, 1 file, -1/+1)
|
* Move SQLite initialisation to db module (Arthur de Jong, 2013-12-15, 2 files, -14/+13)
|
* Remove annoying debug log message (Arthur de Jong, 2013-12-15, 1 file, -2/+1)
|
* Store link and page problems as unicode (Arthur de Jong, 2013-12-02, 1 file, -4/+10)
|   This converts problems to unicode so they can be stored correctly by SQLAlchemy. This amongst other things fixes a problem when the web server returns a status message with non-ASCII characters.
* Only convert content if link has encoding (Arthur de Jong, 2013-12-02, 1 file, -1/+2)
|   This fixes an issue for calling tidy when the character encoding of the page could not be determined.
* Fix setup.py script (Arthur de Jong, 2013-12-02, 3 files, -14/+45)
|   This makes the script executable, adds copyright headers and ensures that all needed files are installed and shipped in the source package.
* Move static files to webcheck/static (Arthur de Jong, 2013-12-02, 7 files, -28/+6)
|   This moves all static files to be installed into the webcheck Python path and uses pkg_resources to load the files.
* Fix missing import (Arthur de Jong, 2013-12-02, 1 file, -0/+1)
|
* Fix setuptools entry point invocation (Arthur de Jong, 2013-12-02, 1 file, -1/+1)
|   Reported by Emmanuel Blot, fixes 24e191f.
* Remove Debian packaging (Arthur de Jong, 2013-12-02, 7 files, -558/+0)
|   Packaging will be moved to the Debian Python Applications Packaging Team (PAPT) repository.
* Update documentation (Arthur de Jong, 2013-11-25, 5 files, -62/+70)
|   This updates the README, HACKING and other documentation to be more in line with the current software set-up. This also updates the TODO list with current changes.
* Support older versions of Jinja (Arthur de Jong, 2013-11-18, 1 file, -2/+5)
|   This tries to gracefully support older versions of Jinja that don't provide the trim_blocks, lstrip_blocks or keep_trailing_newline options.
* Optimise count_parents() (Arthur de Jong, 2013-10-06, 1 file, -11/+4)
|   This combines two queries using a union that already does distinct. This also removes the distinct from the parents() function because it uses a union which is supposed to use distinct already.
* Use crawler.base_urls instead of crawler.bases (Arthur de Jong, 2013-09-28, 2 files, -35/+32)
|   Exposing crawler.bases leaks the sqlalchemy session to the plugins which seems to cause problems in some cases. As a consequence of this change, the sitemap plugin now uses its own session.
* Introduce a site_name in the crawler (Arthur de Jong, 2013-09-28, 3 files, -5/+7)
|
* Fix old and new templates to use datetime objects (Arthur de Jong, 2013-09-28, 4 files, -14/+8)
|
* Fix time formatting (Arthur de Jong, 2013-09-28, 1 file, -1/+1)
|
* Get response size and modified date from request (Arthur de Jong, 2013-09-28, 1 file, -3/+9)
|
* Add missing template changes from Jinja merge (Arthur de Jong, 2013-09-22, 3 files, -13/+40)
|   This also fixes newlines in link meta information that were incorrectly escaped.
* Use Jinja templates to render report (Arthur de Jong, 2013-09-22, 31 files, -719/+845)
|\
| |   The switch to Jinja removes the need for custom escaping and Python code to write HTML output and instead uses easy to read templates. As a result of the switch, this drops more than 450 lines of Python code while adding a little over 400 lines of HTML template code.
| * Remove unused code (Arthur de Jong, 2013-09-22, 3 files, -259/+0)
| |   Most of this is removed because of the switch to the Jinja template engine.
| * Switch plugins to use template (Arthur de Jong, 2013-09-22, 24 files, -458/+671)
| |   The sitemap module has been somewhat rewritten to use generators to provide the structure of the website. The problems module has also been simplified a bit.
| * Introduce template macros for rendering links (Arthur de Jong, 2013-09-22, 2 files, -0/+86)
| |
| * Introduce a base template (Arthur de Jong, 2013-09-22, 2 files, -0/+64)
| |   This sets up the basic layout for the report. The plugins are expected to supply a crawler instance.
| * Provide function for template-based report rendering (Arthur de Jong, 2013-09-22, 3 files, -3/+25)
|/
|   This uses the Jinja template engine to produce the report HTML files. This also renames the util module to output to better describe its purpose.
* Properly write a UTF-8 encoded output file (Arthur de Jong, 2013-09-22, 1 file, -8/+9)
|   Write output using codecs.open() with the UTF-8 encoding. This also introduces a consistency improvement in argument naming.
* Explicitly close database sessions (Arthur de Jong, 2013-09-22, 13 files, -11/+27)
|   This tries to close the session when the function is done with it to avoid using too much memory.
* Initialise crawler with a configuration (Arthur de Jong, 2013-09-20, 3 files, -92/+67)
|   This changes the constructor to accept a dict configuration of the crawler. This is currently combined with the configuration in the config module but the goal is to replace it completely.
* Expose configured plugins via crawler.plugins (Arthur de Jong, 2013-09-20, 3 files, -34/+29)
|   This avoids having module loading code in different places.
* Get default configuration from config module (Arthur de Jong, 2013-09-20, 2 files, -10/+20)
|
* Use the argparse Python module (Arthur de Jong, 2013-09-20, 1 file, -136/+110)
|   This greatly simplifies the command line parsing.
* Add a .gitignore file (Arthur de Jong, 2013-06-07, 1 file, -0/+20)
|
* pass a string to RobotFileParser because of problems with unicode (Arthur de Jong, 2012-08-29, 1 file, -1/+1)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@471 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* now setup.py gets homepage and version from webcheck/__init__.py (Devin Bayer, 2011-11-19, 4 files, -7/+10)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@470 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* update NEWS, README and HACKING (Devin Bayer, 2011-11-16, 3 files, -15/+25)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@469 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* cleanup after introduction of entry_point (Devin Bayer, 2011-11-16, 2 files, -16/+20)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@468 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add entry_point webcheck (Devin Bayer, 2011-11-16, 2 files, -2/+57)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@467 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move cmd.py to package to support an entry point called webcheck (Devin Bayer, 2011-11-16, 1 file, -30/+10)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@466 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* in old html parser, handle more invalid encodings (Devin Bayer, 2011-11-16, 1 file, -4/+3)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@465 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* support MAX_DEPTH == 0 (Devin Bayer, 2011-11-16, 1 file, -1/+1)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@464 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* detect self-referencing redirects even with intermediate pages (Devin Bayer, 2011-11-16, 1 file, -7/+10)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@463 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* add setup.py (Devin Bayer, 2011-11-16, 1 file, -0/+20)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@462 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* support --level cmdline option (Devin Bayer, 2011-11-16, 1 file, -1/+4)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@461 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix encoding issues with strings passed to/from tidy (Arthur de Jong, 2011-11-08, 2 files, -2/+4)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@460 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* implement a MAX_DEPTH configuration option to limit crawling based on a patch by Devin Bayer (Arthur de Jong, 2011-11-04, 4 files, -3/+15)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@459 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* simplification in size calculation (Arthur de Jong, 2011-10-14, 1 file, -9/+7)
|   git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@458 86f53f14-5ff3-0310-afe5-9b438ce3f40c