| Commit message | Author | Age | Files | Lines |
* | introduce HTML parsing using BeautifulSoup with a fall-ba... | Arthur de Jong | 2007-04-20 | 1 | -343/+0 |
* | evaluate archive attribute of <applet> tag instead of cod... | Arthur de Jong | 2007-03-31 | 1 | -2/+5 |
* | add set_encoding method to Link object to do some basic e... | Arthur de Jong | 2006-07-13 | 1 | -13/+11 |
* | add TODOs | Arthur de Jong | 2006-05-31 | 1 | -0/+2 |
* | make decoding try/fall-back code a lot simpler and handle... | Arthur de Jong | 2006-05-15 | 1 | -12/+7 |
* | improve warning text and add comment concerning trying of... | Arthur de Jong | 2006-05-12 | 1 | -1/+2 |
* | ignore unknown entities instead of throwing an error | Arthur de Jong | 2006-05-12 | 1 | -2/+5 |
* | move html escaping and unescaping functions to parsers.html | Arthur de Jong | 2006-05-07 | 1 | -11/+52 |
* | use unichr() to generate Unicode characters, not chr() | Arthur de Jong | 2006-05-07 | 1 | -1/+1 |
* | some more small code improvements thanks to pychecker | Arthur de Jong | 2006-05-07 | 1 | -0/+1 |
* | implement checking for id and name tags in anchors | Arthur de Jong | 2006-05-06 | 1 | -12/+39 |
* | code improvements thanks to pylint | Arthur de Jong | 2006-04-23 | 1 | -54/+55 |
* | do not fail on unknown encodings (fall back to system enc... | Arthur de Jong | 2006-04-07 | 1 | -3/+6 |
* | split urlescape() from _urlclean() and ensure that all an... | Arthur de Jong | 2006-03-26 | 1 | -2/+2 |
* | revert catching Exception instead of IOError that was the... | Arthur de Jong | 2006-03-11 | 1 | -1/+1 |
* | implement checking of anchors (there should be no double ... | Arthur de Jong | 2006-03-10 | 1 | -4/+20 |
* | trim spaces from title and author fields and check that t... | Arthur de Jong | 2006-03-10 | 1 | -2/+2 |
* | make sure all URLs are consistently URL-encoded where it ... | Arthur de Jong | 2006-01-29 | 1 | -5/+1 |
* | fix typo (thanks Andrew Kim <Andrew.Kim@revolution.com>) | Arthur de Jong | 2006-01-26 | 1 | -2/+2 |
* | quote links so that they do not contain any non-ASCII cha... | Arthur de Jong | 2006-01-19 | 1 | -7/+13 |
* | bug fix to handle numeric character references better (Un... | Arthur de Jong | 2005-12-26 | 1 | -2/+2 |
* | add copyright clarification to specify that generated out... | Arthur de Jong | 2005-12-17 | 1 | -0/+3 |
* | store author and title in Unicode internally and ensure t... | Arthur de Jong | 2005-09-17 | 1 | -2/+23 |
* | also try to get character encoding from XML declaration a... | Arthur de Jong | 2005-09-17 | 1 | -0/+22 |
* | parse character entries as normal data, these entities wi... | Arthur de Jong | 2005-09-17 | 1 | -0/+10 |
* | also feed style tag content to the CSS parser to parse in... | Arthur de Jong | 2005-08-20 | 1 | -0/+7 |
* | add checking of unescaped spaces to the html parser, incl... | Arthur de Jong | 2005-08-20 | 1 | -25/+41 |
* | split problems into page problems (parsing errors, wrong ... | Arthur de Jong | 2005-08-19 | 1 | -1/+1 |
* | put compiled regular expression on module level so that i... | Arthur de Jong | 2005-08-12 | 1 | -2/+4 |
* | make parsing handle errors a little more gracefully, than... | Arthur de Jong | 2005-08-01 | 1 | -3/+6 |
* | also catch AttributeError for problem in HTMLParser not f... | Arthur de Jong | 2005-07-31 | 1 | -1/+1 |
* | replace numeric entity refs with their proper values base... | Arthur de Jong | 2005-07-31 | 1 | -2/+11 |
* | put new html parser in place | Arthur de Jong | 2005-07-31 | 1 | -88/+113 |
* | remove references to email addresses where they are not u... | Arthur de Jong | 2005-07-29 | 1 | -3/+3 |
* | don't replace an already set title | Arthur de Jong | 2005-07-25 | 1 | -1/+2 |
* | Mike Meyer -> Mike W. Meyer | Arthur de Jong | 2005-07-23 | 1 | -1/+1 |
* | almost complete rewrite of crawling and site state code m... | Arthur de Jong | 2005-07-22 | 1 | -19/+24 |
* | move htmlparse to a more generic parsers package, cleanin... | Arthur de Jong | 2005-07-09 | 1 | -0/+126 |