changes from 1.10.4 to 1.10.5 (alpha)
-------------------------------------
* added setup.py for pypi/egg-based installation
* support --max-depth option to control max depth
* detect and report on endless redirects
* move to sqlite for storing crawler state
changes from 1.10.3 to 1.10.4
-----------------------------
* remove some left-over debugging code
* several small bugfixes which more or less drop support for Python 2.3
* limit the list of "referenced from" to 10 items
* pass char_encoding option to tidy to fix some tidy-related errors
* add a Referer header if possible (thanks Devin Bayer)
* Debian packaging improvements
changes from 1.10.2 to 1.10.3
-----------------------------
* support <iframe> and some common usages of <object>
* fix bug in command-line parsing of short -r option
* implement the --userpass option to pass username and password information to
specific sites based on a patch by Chris Shenton
* handle errors while parsing more gracefully
* add parsing of <script> tag and background attributes, based on a patch by
Robert M. Jansen
* fix in parsing <style> tags and support style attributes
* call tidy (if available) on HTML content, based on a patch by Henning
Sielaff
* fix problem with port numbers in host headers
changes from 1.10.1 to 1.10.2
-----------------------------
* add checking for bug in BeautifulSoup and issue warning if bug is found
* added support for Python 2.3 (although more recent versions of Python
are recommended)
* small documentation improvements
* Debian package improvements
changes from 1.10.0 to 1.10.1
-----------------------------
* some extra Unicode handling precautions
* fix problem in reading webcheck.dat for non-ASCII text
* be more verbose about HTTP retrieval failures
* split out URL normalization code into own module and do some basic protocol-
specific normalizations
* a number of big performance improvements
* fix a bug in handling some zero-size pages
* parse the http-equiv meta HTML header to handle the refresh option
* webcheck now requires Python 2.4 or more recent
changes from 1.9.8 to 1.10.0
----------------------------
* switched HTML parsing to using BeautifulSoup with a fall-back mechanism to
the old HTMLParser based solution
* the new parser is much more error-tolerant but is reportedly somewhat slower
and does not include line numbers in errors
* new features will likely only be added to the new parser
* some small improvements to the output to make it XHTML 1.1 compliant
* internal improvements for handling Unicode strings
* better support for parsing <applet> tags and anchors using id attributes
* re-enable robots.txt parsing that was disabled in 1.9.8 and add an
--ignore-robots option
changes from 1.9.7 to 1.9.8
---------------------------
* some checks for properly handling unknown and wrong encodings have been
added
* added proper error handling for SSL related socket problems (exceptions are
not a subclass of regular socket exceptions)
* a bug fix for URLs that contain a user name without a password or the other
way around
* miscellaneous small report improvements
changes from 1.9.6 to 1.9.7
---------------------------
* site data is now stored to a file while crawling the site; this can be used
to resume a crawl with the --continue option and for debugging purposes
* implemented checking of link anchors
* small improvements to generated reports (favicon included, CSS fix)
* documentation improvements
* properly handle float values for --wait
* unreachable sites will time out faster
* added support for plugins that don't output html
* half a dozen other small bug fixes (stability fixes, code clean-ups and
improvements)
changes from 1.9.5 to 1.9.6
---------------------------
* SECURITY FIX: a cross-site scripting vulnerability with content in the
tooltips of generated reports was fixed by properly escaping
all output (CVE-2006-1321)
* URLs are now URL encoded into a consistent form, solving some problems with
URLs with non-ASCII characters
* no longer remove unreferenced redirects
* more debugging info in debug mode
* more fixes for escaping in generated reports and more support for sites in
different character sets
changes from 1.9.4 to 1.9.5
---------------------------
* about page now has some more useful information
* proxy authentication is implemented
* fix for using relative paths as output directory
* add support for parsing html documents in different encodings
* ensure that all generated html output is properly escaped
* implemented --internal option to flag internal URLs with regular expressions
* documentation improvements
* several bug fixes to make webcheck more robust
* included FancyTooltips by Victor Kulinski to have nicer tooltips
* generated reports now have friendlier messages for when there is nothing to
report
* there is a Debian package
changes from 1.9.3 to 1.9.4
---------------------------
* split problems into link problems (errors retrieving the document) and page
problems (parsing errors, wrong links, etc)
* some fixes and improvements to the layout of the generated pages
* redirect loops are now detected
* transfer result status is now stored
* addition of a limited CSS parser that handles imports and url() entries
* support reading file names for checking from the command line (turning them
into file:// URLs internally)
* better error handling of problems writing generated pages and check that we
are not overwriting input files
changes from 1.9.2 to 1.9.3
---------------------------
* several improvements to the generated reports, including tooltips with some
useful information for the links (does not seem to work very well in
Firefox)
* stability improvements to the html parser (thanks to everyone who reported
problems); not all problems have been solved, but they should no longer stop
webcheck
* reimplementation of the file and ftp modules to read directory contents or
read index.html file if present (there are known problems in the ftp module
regarding empty directories and recovering from errors)
* improvements to the URL parsing code to warn about spaces in URLs
* only fetch content if we can parse it
changes from 1.9.1 to 1.9.2
---------------------------
* complete reimplementation of the html and http modules
* added HTTPS support
* some spelling and typo fixes contributed by several people
* site map now does a proper breadth-first traversal of the site structure
* webcheck homepage has been changed to http://ch.tudelft.nl/~arthur/webcheck/
* several minor bug fixes and tweaks
changes from 1.9.0 to 1.9.1
---------------------------
* ship an empty css.py so that webcheck actually runs
* small bug fixes for pages with multiple titles and slow plugin
changes from 1.0 to 1.9.0
-------------------------
* maintainership transferred to Arthur de Jong
* major structural rewrites of crawling code and plugin structure
* the documentation was combined and partially rewritten in the README for
installation instructions and the manual page for usage information
* changed output to no longer use frames and produce valid XHTML 1.1 and use
CSS for layout
* config.py is no longer really a configuration file
changes from 1.0b10 to 1.0
--------------------------
+ Don't send accept headers, as they weren't valid.
+ WARN_OLD_VERSION no longer works, until I decide what to do about it.
+ Name changed to webcheck.
+ Fixed typos in INSTALL.
+ Changes so it works with python 2.0.
changes from 1.0b9 to 1.0b10
----------------------------
b Fixed bug when server redirects to a document in robots.txt (does not show
up as broken (hopefully))
+ Filename mangling in filelink.py to help OS/2 (and Win32) (Patch submitted
by Steffen Siebert)
+ Added WARN_OLD_VERSION config.py option. If this option is set to true (the
default) Linbot will check its version number and the version numbers of
its plugins against a global registry on the Net. If it finds that a
version is not the latest, it will print a warning on the reports along with
a link you can follow to download the latest version. I think it's neat. You
might find it annoying.
+ Added preliminary support for authenticating proxies, though it does not
work correctly yet.
+ Added -r (redirect depth) and REDIRECT_DEPTH option in config.py to indicate
the number of redirects Linbot should follow when following a link. Thanks
to Andrea Glorioso for the patch.
+ Added debugio module that handles debugging and I/O
+ Added -q (quiet option). Use it to suppress output
+ Added -d (debug) option and DEBUG_LEVEL variable in config.py for debugging
+ Added version module and removed __version__ and __author__ from all the
modules (except plugins).
b Fixed bug in Linbot using putrequest() instead of putheader() when
requesting header information. Thanks to Andrea Glorioso for fixing this
glitch (and Seth Chaiklin for noticing).
changes from 1.0b8 to 1.0b9
---------------------------
+ If you use the -o command-line option or the OUTPUT_DIR config file option
and the directory does not exist, linbot will create it for you (provided
that it has the correct permissions, etc.). Thanks to Andrea Glorioso for
this feature.
+ Added a CREDITS file and probably left a lot of people out. If you think you
should be in it let me know.
b Linbot will now report to the server that it can accept any MIME type (found
in mimetypes.py). This should fix the "406: No acceptable objects found"
error that some servers report.
b Linbot correctly identifies itself as "Linbot <version>" on HEAD requests as
well as GET requests.
changes from 1.0b6 to 1.0b8
---------------------------
b Fixed bug when no images are reported for documents having 0 links. If you
don't know what this means it probably wasn't a problem for you.
b Fixed code that was messing with arguments passed via -x and -y and caused
unexpected results and/or errors.
b -b flag should work this time (for real)
b Cosmetic changes (reports didn't look the way I thought they should in IE4,
and may still not, as I haven't had a chance to check yet)
b Linbot won't follow infinite redirects (currently hardcoded to max of 5
redirects per document)
changes from 1.0b5 to 1.0b6
---------------------------
+ Minor change in ftplink.py should allow better ftp link checking
+ You can now press CTRL-C (or whatever your operating system supports) to
break out of a linbot run. However, the work linbot does is not saved (yet).
b Fixed problem when server redirects a URL to itself. This fix seems to work
for most servers I've tried but there are a few more out there that I need
to take a look at.
b Fixed bug that caused linbot to not check for yanked URLs
+ Added -l command-line option. Usage: -l <url> where <url> is a URL pointing
to an image to be used as the report's logo.
b "patched" strings.py so that it can better parse html files created in
Windows/DOS (I think).
+ Made report LOGO a link to the base URL
+ httplink does not HEAD a redirected URL if it is already in the link list
(performance improvement)
- Removed LOGO_ALT from config.py
+ Changed my email address to marduk@python.net. The official home page of
Linbot will probably also change with the next release so stay tuned.
changes from 1.0b4 to 1.0b5
---------------------------
+ Added a contrib directory. Right now it just contains the about plugin.
Other plugins will be included if people contribute them. Also, the man page
will return once I have updated it. Those ugly buttons are obsolete.
+ Linbot now "inlines" stylesheets. This has the benefits of 1) better support
of Netscape browsers (so I hear) and 2) I don't have to document to put
linbot.css in the output directory since it grabs it from starship 8*)
b Handling of error for when robots.txt cannot be retrieved.
+ Malformed urls are trapped (sorry, I had that commented out)
b FTP link handling is totally rewritten. Fortunately it shouldn't crash
anymore. Unfortunately it doesn't really work reliably and probably never
will. See README.ftp for details.
b Two bugs in HTTP proxy handling made it almost completely unusable, though
conveniently seemed to cancel each other out when I was testing.
b Too many files error on large sites should be fixed. Thanks to Andrew
Kuchling et al for suggestions.
b Bug when some servers erroneously report (or don't report) Content-Length
header fixed.