Arthur de Jong

Open Source / Free Software developer

path: root/crawler.py
Commit message | Author | Age | Files | Lines
* Do not cache full backup contents | Arthur de Jong | 2015-06-30 | 1 | -5/+2
| Storing this in SQLite is slow and grows the cache to a huge size. Reading these file lists instead may be a bit slower, but it saves a lot of space and overhead and removes considerable complexity.
* Change metadata information | Arthur de Jong | 2015-06-25 | 1 | -20/+35
| This changes the information in the metadata dict to include the file type in a separate field and to limit the mode information to standard permissions only. When file lists are read from the repository, the old format is converted automatically.
| This also changes the local cache file to ensure all information is re-read (the previous commit already required this).
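A minimal sketch of what such a metadata split could look like. The field names, the type labels and the helper names (`split_mode`, `make_metadata`, `upgrade_metadata`) are assumptions for illustration and are not taken from crawler.py; only the `os` and `stat` calls are standard library.

```python
import os
import stat


def split_mode(st_mode):
    """Split a raw st_mode into a type label and permission bits."""
    if stat.S_ISDIR(st_mode):
        ftype = 'directory'
    elif stat.S_ISLNK(st_mode):
        ftype = 'symlink'
    elif stat.S_ISREG(st_mode):
        ftype = 'file'
    else:
        ftype = 'other'
    # S_IMODE() keeps only the permission bits (including setuid,
    # setgid and sticky) and drops the file type bits.
    return ftype, stat.S_IMODE(st_mode)


def make_metadata(path):
    """Build a metadata dict for path (hypothetical layout)."""
    st = os.lstat(path)
    ftype, mode = split_mode(st.st_mode)
    return {
        'path': path,
        'type': ftype,
        'mode': mode,
        'size': st.st_size,
        'mtime': int(st.st_mtime),
    }


def upgrade_metadata(meta):
    """Convert an old-style dict (type encoded in mode) to the new layout."""
    if 'type' not in meta:
        meta = dict(meta)
        meta['type'], meta['mode'] = split_mode(meta['mode'])
    return meta
```

Keeping the conversion in one helper means old file lists can be upgraded while they are read, without a separate migration step.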
* Ignore errors in crawling directories | Arthur de Jong | 2015-04-02 | 1 | -5/+11
|
* Use fetchall() where needed | Arthur de Jong | 2015-03-08 | 1 | -22/+2
| Use fetchall() on a cursor because SQLite cannot handle partial reads from a cursor if the database is being modified through another cursor. This uses more memory during the backup.
| Set the synchronous and journal_mode SQLite pragmas for better performance at the cost of safety. Since this is a cache, the data can be reconstructed from the repository if needed.
| This also uses the connection as a context manager instead of manually calling commit, rearranges some of the transactions for better performance and includes a few consistency improvements.
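The pattern described in this commit message can be sketched roughly as follows. The table and column names are hypothetical, and the pragma values (`OFF`, `MEMORY`) are assumptions, since the message does not say which settings were chosen.

```python
import sqlite3


def open_cache(filename):
    """Open the cache database with speed favoured over durability."""
    db = sqlite3.connect(filename)
    # Assumed values: the commit only says these pragmas are set to be
    # faster but less safe.
    db.execute('PRAGMA synchronous = OFF')
    db.execute('PRAGMA journal_mode = MEMORY')
    return db


def mark_all_seen(db):
    """Update every row while iterating over a fully fetched result."""
    # The connection as a context manager commits on success and rolls
    # back if an exception is raised.
    with db:
        # fetchall() reads the complete result first, so the UPDATE
        # statements below do not interfere with a half-read cursor.
        rows = db.execute('SELECT path FROM files').fetchall()
        for (path,) in rows:
            db.execute('UPDATE files SET seen = 1 WHERE path = ?', (path,))
```

The trade-off is the one the message names: the whole result set is held in memory for the duration of the loop.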
* Use better names for tables | Arthur de Jong | 2015-03-02 | 1 | -2/+2
| The names now better reflect the purpose and contents of the tables.
* Refactor out path-handling functions | Arthur de Jong | 2015-03-01 | 1 | -45/+5
|
* Implement --exclude option | Arthur de Jong | 2015-03-01 | 1 | -10/+64
| This instructs the crawler to skip files that match certain patterns. It supports * to match any part of a file name, ** to also match /, a trailing / to match only directories, and a leading / to match the full path.
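One way to implement the pattern rules listed above is to translate each pattern into a regular expression. This is a sketch under assumptions, not the code from crawler.py: the function names are made up, paths are taken to be relative to the backup root without a leading slash, and an unanchored pattern is assumed to match the trailing path components at any depth.

```python
import re


def compile_exclude(pattern):
    """Translate an exclude pattern into (compiled regex, dirs_only)."""
    dirs_only = pattern.endswith('/')
    if dirs_only:
        pattern = pattern[:-1]
    anchored = pattern.startswith('/')
    if anchored:
        pattern = pattern[1:]
    regex = ''
    # Split on wildcards, trying ** before * so it is kept as one token.
    for part in re.split(r'(\*\*|\*)', pattern):
        if part == '**':
            regex += '.*'        # may cross directory separators
        elif part == '*':
            regex += '[^/]*'     # stays within one path component
        else:
            regex += re.escape(part)
    prefix = '^' if anchored else '(?:^|/)'
    return re.compile(prefix + regex + '$'), dirs_only


def is_excluded(path, is_dir, excludes):
    """Return True if path matches any of the compiled patterns."""
    for regex, dirs_only in excludes:
        if dirs_only and not is_dir:
            continue
        if regex.search(path):
            return True
    return False
```

For example, with the patterns `*.o` and `/tmp/` compiled, `src/main.o` is excluded, a top-level directory `tmp` is excluded, and a directory `var/tmp` is kept.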
* Support both python 2 and 3 | Arthur de Jong | 2015-03-01 | 1 | -4/+4
|
* Implement a command to list backup contents | Arthur de Jong | 2015-03-01 | 1 | -1/+2
| This reads the snapshot file list, then filters and formats the output to resemble that of ls.
* Improve database performance | Arthur de Jong | 2015-02-21 | 1 | -4/+0
| Create indexes in the tables to speed up queries (some are created after crawling, which is a minor improvement) and use explicit transactions (a small improvement). Also, move temporary table creation into the functions that use the tables (instead of doing it globally).
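As a rough illustration of the two ideas in this message, the sketch below loads rows inside a single transaction and only creates the index afterwards. The table, column and index names are hypothetical.

```python
import sqlite3


def store_files(db, rows):
    """Bulk-load rows in one transaction, then build the index."""
    # One transaction for the whole load avoids a commit per statement.
    with db:
        db.execute(
            'CREATE TABLE IF NOT EXISTS files (path TEXT, size INTEGER)')
        db.executemany(
            'INSERT INTO files (path, size) VALUES (?, ?)', rows)
    # Creating the index after the load means it is built once instead
    # of being updated for every inserted row.
    with db:
        db.execute(
            'CREATE INDEX IF NOT EXISTS files_path_idx ON files (path)')
```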
* Change directory while crawling | Arthur de Jong | 2015-02-21 | 1 | -8/+14
| The crawler now changes to the directories that are crawled and uses stat() on relative paths instead of using absolute paths for all operations. This brings about a 10% reduction in crawling time.
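The idea of changing into each directory and calling stat() on relative names can be sketched as below. This is an assumed structure rather than the code from this commit; `crawl_chdir` is a hypothetical name, and `os.lstat()` is used so that symbolic links are not followed.

```python
import os
import stat


def crawl_chdir(top):
    """Return a list of (relative path, stat result) for everything under top."""
    results = []

    def visit(prefix):
        # All names are relative to the current working directory, so
        # the kernel does not have to resolve a long absolute path.
        for name in sorted(os.listdir('.')):
            st = os.lstat(name)
            path = os.path.join(prefix, name)
            results.append((path, st))
            if stat.S_ISDIR(st.st_mode):
                os.chdir(name)
                try:
                    visit(path)
                finally:
                    os.chdir('..')

    saved = os.getcwd()
    os.chdir(top)
    try:
        visit('')
    finally:
        # Always restore the caller's working directory.
        os.chdir(saved)
    return results
```

Because the working directory is process-wide state, a crawler built this way should not run concurrently with other code that relies on relative paths.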
* Handle path encoding errors | Arthur de Jong | 2015-02-21 | 1 | -5/+10
| This currently ignores files whose names have an unknown encoding, which is far from ideal.
* Use a slightly more efficient crawler | Arthur de Jong | 2015-02-12 | 1 | -35/+45
| This replaces a call to os.walk() with one to os.listdir() to avoid calling stat() twice on each file and directory encountered.
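A small sketch of the approach: os.walk() has to determine which entries are directories, and a caller that also wants metadata then stats each entry again. Listing the directory and calling lstat() once per entry lets the same result serve both purposes. The function name `crawl_listdir` is hypothetical.

```python
import os
import stat


def crawl_listdir(top):
    """Yield (path, stat result) for every entry below top."""
    todo = [top]
    while todo:
        directory = todo.pop()
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # A single lstat() provides the metadata and also tells us
            # whether to descend into the entry.
            st = os.lstat(path)
            yield path, st
            if stat.S_ISDIR(st.st_mode):
                todo.append(path)
```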
* Initial version of command-line handling | Arthur de Jong | 2015-02-12 | 1 | -2/+2
|
* Move crawler-related functions to new module | Arthur de Jong | 2015-02-12 | 1 | -0/+90