Arthur de Jong

Open Source / Free Software developer

path: root/webcheck/crawler.py
Commit message (author, date, files changed, lines -removed/+added)
* Split functionality into Link.get_or_create() (Arthur de Jong, 2013-12-15, 1 file, -8/+1)
  This splits some common functionality from Link._get_child() and Crawler.get_link() to the new Link.get_or_create() function.
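The get-or-create pattern named in this commit can be sketched as follows. This is only an illustration of the idea, not the code from the commit: it uses the standard-library sqlite3 module instead of webcheck's SQLAlchemy models, and the table layout and function signature are assumptions.

```python
import sqlite3


def get_or_create(conn, url):
    """Return the id of the row for url, inserting the row if it is missing."""
    row = conn.execute('SELECT id FROM links WHERE url = ?', (url,)).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute('INSERT INTO links (url) VALUES (?)', (url,))
    return cursor.lastrowid


# hypothetical schema, for illustration only
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT UNIQUE)')
first = get_or_create(conn, 'http://example.com/')
second = get_or_create(conn, 'http://example.com/')
```

Both calls return the same id, so callers do not need to know whether the link already existed.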
* Rename some functions (Arthur de Jong, 2013-12-15, 1 file, -13/+13)
  This should make some functions clearer and marks internal functions with a leading underscore.
* Small simplification (Arthur de Jong, 2013-12-15, 1 file, -1/+1)
* Move SQLite initialisation to db module (Arthur de Jong, 2013-12-15, 1 file, -10/+2)
* Move static files to webcheck/static (Arthur de Jong, 2013-12-02, 1 file, -3/+3)
  This moves all static files to be installed into the webcheck Python path and uses pkg_resources to load the files.
* Fix missing import (Arthur de Jong, 2013-12-02, 1 file, -0/+1)
* Use crawler.base_urls instead of crawler.bases (Arthur de Jong, 2013-09-28, 1 file, -33/+27)
  Exposing crawler.bases leaks the sqlalchemy session to the plugins, which seems to cause problems in some cases. As a consequence of this change, the sitemap plugin now uses its own session.
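The idea of exposing plain URLs instead of session-bound objects might look roughly like this. The Link stand-in, the private attribute names and the placeholder session are all assumptions made for the sketch; only base_urls and add_base() are names taken from this log.

```python
class Link:
    """Stand-in for a session-bound ORM object (hypothetical)."""

    def __init__(self, url, session):
        self.url = url
        self.session = session


class Crawler:
    def __init__(self):
        self._session = object()  # placeholder for a database session
        self._bases = []

    def add_base(self, url):
        self._bases.append(Link(url, self._session))

    @property
    def base_urls(self):
        # plain strings carry no reference to the session
        return [link.url for link in self._bases]


crawler = Crawler()
crawler.add_base('http://example.com/')
urls = crawler.base_urls
```

A plugin that only receives these strings has to open its own session if it needs database access, which matches the note about the sitemap plugin.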
* Introduce a site_name in the crawler (Arthur de Jong, 2013-09-28, 1 file, -0/+5)
* Get response size and modified date from request (Arthur de Jong, 2013-09-28, 1 file, -3/+9)
* Provide function for template-based report rendering (Arthur de Jong, 2013-09-22, 1 file, -1/+1)
  This uses the Jinja template engine to produce the report HTML files. This also renames the util module to output to better describe its purpose.
* Explicitly close database sessions (Arthur de Jong, 2013-09-22, 1 file, -0/+2)
  This tries to close the session when the function is done with it to avoid using too much memory.
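Closing a database handle as soon as a function is done with it can be sketched like this. A sqlite3 connection stands in for the SQLAlchemy session here, and the function name and table are hypothetical.

```python
import contextlib
import sqlite3


def count_links(path):
    """Open a connection, use it, and always close it afterwards."""
    # contextlib.closing() calls conn.close() even if the body raises
    with contextlib.closing(sqlite3.connect(path)) as conn:
        conn.execute('CREATE TABLE IF NOT EXISTS links (url TEXT)')
        return conn.execute('SELECT COUNT(*) FROM links').fetchone()[0]


total = count_links(':memory:')
```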
* Initialise crawler with a configuration (Arthur de Jong, 2013-09-20, 1 file, -31/+44)
  This changes the constructor to accept a dict configuration of the crawler. This is currently combined with the configuration in the config module but the goal is to replace it completely.
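A constructor that combines a caller-supplied dict with module-level defaults could look something like the sketch below. The configuration keys and the attribute name cfg are assumptions, not webcheck's actual option names.

```python
# hypothetical defaults, standing in for values from the config module
DEFAULTS = {'max_depth': None, 'wait': 0.0}


class Crawler:
    def __init__(self, cfg=None):
        # start from the defaults and let the caller's dict override them
        self.cfg = dict(DEFAULTS)
        self.cfg.update(cfg or {})


crawler = Crawler({'max_depth': 3})
```

Copying the defaults first means the module-level dict is never modified by an individual crawler.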
* Expose configured plugins via crawler.plugins (Arthur de Jong, 2013-09-20, 1 file, -13/+14)
  This avoids having module loading code in different places.
* Get default configuration from config module (Arthur de Jong, 2013-09-20, 1 file, -1/+11)
* pass a string to RobotFileParser because of problems with unicode (Arthur de Jong, 2012-08-29, 1 file, -1/+1)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@471 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* support MAX_DEPTH == 0 (Devin Bayer, 2011-11-16, 1 file, -1/+1)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@464 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* implement a MAX_DEPTH configuration option to limit crawling based on a patch by Devin Bayer (Arthur de Jong, 2011-11-04, 1 file, -1/+4)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@459 86f53f14-5ff3-0310-afe5-9b438ce3f40c
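A depth limit that also works for a value of 0 can be sketched as below. The diffs themselves are not shown in this log, so this is a guess at the kind of check involved: the function name is hypothetical, and the assumption is that an unset limit is represented by None.

```python
def should_follow(depth, max_depth):
    """Decide whether a link found at the given depth should be crawled."""
    # compare against None so that a limit of 0 is honoured
    # (a plain truth test would treat 0 the same as "no limit")
    if max_depth is None:
        return True
    return depth <= max_depth
```

With this check, `should_follow(0, 0)` is true and `should_follow(1, 0)` is false, so a limit of 0 restricts crawling to the starting pages.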
* switch to using the logging framework (Arthur de Jong, 2011-10-14, 1 file, -30/+25)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@457 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* simplify logging of depth (Arthur de Jong, 2011-10-14, 1 file, -2/+1)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@456 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* fix missing import (broken in r452) (Arthur de Jong, 2011-10-08, 1 file, -1/+1)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@454 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* also handle exceptions while parsing (e.g. issue when reading the response times out) (Arthur de Jong, 2011-10-08, 1 file, -6/+9)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@453 86f53f14-5ff3-0310-afe5-9b438ce3f40c
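Wrapping both the fetch and the parse step in one exception handler might look like the following sketch. The function names, the logger name and the return convention are assumptions for illustration; the failing parser simply simulates a read that times out.

```python
import logging

logger = logging.getLogger('webcheck.crawler')  # hypothetical logger name


def fetch_and_parse(fetch, parse, url):
    """Run fetch and parse for url, recording failures instead of raising."""
    try:
        content = fetch(url)
        # parsing is inside the try block too, so a late read error is caught
        return parse(content), None
    except Exception as exc:
        logger.warning('problem processing %s: %s', url, exc)
        return None, str(exc)


def failing_parse(content):
    # simulate a response read that times out during parsing
    raise TimeoutError('timed out')


result, problem = fetch_and_parse(
    lambda url: '<html>', failing_parse, 'http://example.com/')
```

The crawl can then continue with the next link and report the recorded problem later.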
* ensure that the database is emptied completely and move the code to webcheck.db (Arthur de Jong, 2011-10-08, 1 file, -12/+2)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@452 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* switch to using MozillaCookieJar because LWPCookieJar has issues with some dates (http://bugs.python.org/issue5537) (Arthur de Jong, 2011-10-08, 1 file, -2/+2)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@451 86f53f14-5ff3-0310-afe5-9b438ce3f40c
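For reference, wiring a MozillaCookieJar into a URL opener looks roughly like this. The sketch uses the Python 3 module names (http.cookiejar and urllib.request) rather than the Python 2 modules the code used at the time, and the cookie file name is hypothetical.

```python
import http.cookiejar
import urllib.request

# MozillaCookieJar reads and writes the Netscape cookies.txt format;
# constructing it does not touch the file until load() or save() is called
cookiejar = http.cookiejar.MozillaCookieJar('cookies.txt')  # hypothetical file name
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookiejar))
```

Because both jar classes share the FileCookieJar interface, switching between them is a small change for the calling code.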
* rename Crawler.add_internal() to Crawler.add_base() and automatically initialise database connection when needed (Arthur de Jong, 2011-10-07, 1 file, -9/+25)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@450 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* rename Site to Crawler (Arthur de Jong, 2011-10-07, 1 file, -4/+3)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@448 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move some more initialisation from cmd to crawler and make imports of config and debugio consistent (Arthur de Jong, 2011-10-07, 1 file, -17/+28)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@447 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move some file-handling functions to webcheck.util (Arthur de Jong, 2011-10-07, 1 file, -0/+5)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@446 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* move version and homepage definition from config to the webcheck package (Arthur de Jong, 2011-10-07, 1 file, -1/+2)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@441 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* pass the IO timeout to urllib2 (Arthur de Jong, 2011-09-16, 1 file, -3/+2)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@438 86f53f14-5ff3-0310-afe5-9b438ce3f40c
* use fully qualified plugin names (Arthur de Jong, 2011-09-16, 1 file, -10/+10)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@436 86f53f14-5ff3-0310-afe5-9b438ce3f40c
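Loading plugins by their full dotted module names can be sketched with importlib. The helper name is hypothetical, and standard-library modules stand in for plugin names here; a real webcheck plugin path such as 'webcheck.plugins.sitemap' is an assumption about the package layout.

```python
import importlib


def load_plugins(names):
    """Import each plugin by its full dotted module name."""
    return [importlib.import_module(name) for name in names]


# standard-library modules used as stand-ins for fully qualified plugin names
plugins = load_plugins(['json.decoder', 'xml.dom.minidom'])
```

With fully qualified names the loader does not need to prepend a package prefix, so plugins from outside the main package can be configured in the same way.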
* move all the code except the command-line handling to the webcheck package and reorganise imports accordingly (Arthur de Jong, 2011-09-16, 1 file, -0/+422)
  git-svn-id: http://arthurdejong.org/svn/webcheck/webcheck@435 86f53f14-5ff3-0310-afe5-9b438ce3f40c