-*- indented-text -*-

Id: TODO,v 1.29 2001/03/18 03:12:20 mbp Exp

* Most urgent: do rolling checksums rather than from-scratch if
  possible.

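  A minimal sketch of the rolling update, assuming an rsync-style
  two-part weak sum; weak_init, weak_rotate and weak_digest are
  illustrative names, not existing librsync functions:

      #include <stddef.h>
      #include <stdint.h>

      typedef struct {
          uint32_t s1, s2;        /* the two running halves */
          size_t   len;           /* window length in bytes */
      } weak_sum_t;

      /* Compute the sum from scratch over buf[0..len-1]. */
      static void weak_init(weak_sum_t *sum,
                            const unsigned char *buf, size_t len)
      {
          size_t i;
          sum->s1 = sum->s2 = 0;
          sum->len = len;
          for (i = 0; i < len; i++) {
              sum->s1 += buf[i];
              sum->s2 += sum->s1;
          }
      }

      /* Slide the window one byte: drop `out', take in `in'.
       * O(1) instead of recomputing the whole window. */
      static void weak_rotate(weak_sum_t *sum,
                              unsigned char out, unsigned char in)
      {
          sum->s1 += in - out;
          sum->s2 += sum->s1 - (uint32_t) (sum->len * out);
      }

      /* 32-bit digest in the usual (s2 << 16) | s1 layout. */
      static uint32_t weak_digest(const weak_sum_t *sum)
      {
          return ((sum->s2 & 0xffff) << 16) | (sum->s1 & 0xffff);
      }
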
* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name.  I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not.  Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have a lot of identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  really consumed and produced.

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded.  I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.

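  As a sketch, the proposed call might be driven like this (rs_job_t
  and the rs_result values are librsync's; this rs_job_iter prototype
  is the proposal above, not the current interface):

      #include <stddef.h>

      typedef struct rs_job rs_job_t;
      typedef enum { RS_DONE = 0, RS_BLOCKED = 1 } rs_result;

      /* Proposed: on entry *in_len and *out_len hold the available
       * input and output space; on return, the amounts actually
       * consumed and produced. */
      rs_result rs_job_iter(rs_job_t *job,
                            const void *in_buf, size_t *in_len,
                            int in_is_ending,
                            void *out_buf, size_t *out_len);

      /* The caller's loop then just shuffles the remainders. */
      static rs_result drive(rs_job_t *job,
                             const char *in, size_t in_avail,
                             char *out, size_t out_space)
      {
          rs_result result;
          do {
              size_t in_len  = in_avail;
              size_t out_len = out_space;
              result = rs_job_iter(job, in, &in_len, 1,
                                   out, &out_len);
              in += in_len;
              in_avail -= in_len;
              /* ... hand off out_len bytes of output here ... */
          } while (result == RS_BLOCKED);
          return result;
      }
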
* Meta-programming

  * Plot lengths of each function

  * Some kind of statistics on delta each day

* Encoding format

  * Include a version in the signature and difference fields.

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data.

  * abstract encoding

    In fact, we can vary on several different variables:

    * what signature format are we using?

    * what command protocol are we using?

    * what search algorithm are we using?

    * what implementation version are we?

    Some are more likely to change than others.  We need a chart
    showing which source files depend on which variable.

* Error handling

  * What happens if the user terminates the request?

* Do HTTP CONNECT

  * This might be a nice place to use select!

* Encoding implementation

  * Join up copy commands through the copyq if this is not done yet.

  * Join up signature commands.

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks.  The gdiff format
    allows COPY commands to extend into the *output* file so that
    they can easily point this out.  By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block
    size) and add them to the sum set.  Then matches will fall out
    automatically, although we might have to specially allow for
    short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know if it would be worthwhile.

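    For reference, a minimal sketch of the absorption step, with
    sum_set_t, weak_sum() and sumset_add() as hypothetical stand-ins
    for whatever the signature code already uses:

        #include <stddef.h>
        #include <stdint.h>
        #include <sys/types.h>          /* off_t */

        typedef struct sum_set sum_set_t;
        uint32_t weak_sum(const unsigned char *buf, size_t len);
        void sumset_add(sum_set_t *sums, uint32_t weak,
                        const unsigned char *block, size_t len,
                        off_t offset);

        /* After emitting `len' bytes of output starting at absolute
         * output offset `offset', checksum each whole search-block
         * in it and add it to the sum set, so later input can match
         * a COPY into the output file.  Short trailing blocks are
         * skipped, which is the special case mentioned above. */
        static void absorb_output(sum_set_t *sums,
                                  const unsigned char *out, size_t len,
                                  size_t block_len, off_t offset)
        {
            size_t i;
            for (i = 0; i + block_len <= len; i += block_len)
                sumset_add(sums, weak_sum(out + i, block_len),
                           out + i, block_len, offset + (off_t) i);
        }
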
  * Extended files

    Suppose the new file just has data added to the end.  At the
    moment, we'll match everything but the last block of the old
    file.  It won't match, because at the moment the search block
    size is only reduced at the end of the *new* file.  This is a
    little inefficient, because ideally we'd know to look for the
    last block using the shortened length.

    This is a little hard to implement, though perhaps not
    impossible.  The current rolling search algorithm can only look
    for one block size at any time.  Can we do better?  Can we look
    for all block lengths that could match anything?

    Remember also that at the moment we don't send the block length
    in the signature; it's implied by the length of the new block
    that it matches.  This is kind of cute, and importantly helps
    reduce the length of the signature.

  * State-machine searching

    Building a state machine from a regular expression is a brilliant
    idea.  (I think `The Practice of Programming' walks through the
    construction of one at a fairly simple level.)

    In particular, we can search for any of a large number of
    alternatives in a very efficient way, with much less effort than
    it would take to search for each alternative separately.
    Remember also the classic string-searching algorithms and how
    much time they can take.

    I wonder if we can use similar principles here rather than the
    current simple rolling-sum mechanism?  Could it let us match
    variable-length signatures?

  * Cross-file matches

    If the downstream server had many similar URLs, it might be nice
    if it could draw on all of them as a basis.  At the moment
    there's no way to express this, and I think the work of sending
    up signatures for all of them may be too hard.

    Better just to make sure we choose the best basis if no exact
    copy is present.  Perhaps this needs to weigh several factors.

    One factor might be that larger files are better because they're
    more likely to have a match.  I'm not sure if that's very strong,
    because they'll just bloat the request.  Another is that more
    recent files might be more useful.

* Support gzip compression of the difference stream.  Does this
  belong here, or should it be in the client, with librsync just
  having an interface that lets it cleanly plug in?

* Licensing

  * Will the GNU Lesser GPL work?  Specifically, will it be a problem
    in distributing this with Mozilla or Apache?

* Checksums

  * Do we really need to require that signatures arrive after the
    data they describe?  Does it make sense in HTTP to resume an
    interrupted transfer?

    I hope we can do this.  If we can't, however, then we should
    relax this constraint and allow signatures to arrive before the
    data they describe.  (Really?  Do we care?)

  * Allow variable-length checksums in the signature; the signature
    will have to describe the length of the sums, and we must compare
    them taking this into account.

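    A sketch of the comparison, assuming the signature gains a field
    recording how many bytes of strong sum are stored per block;
    these names are illustrative, not the current format:

        #include <string.h>

        #define MAX_STRONG_LEN 16       /* a full MD4 digest */

        typedef struct {
            size_t        strong_len;   /* from the signature */
            unsigned char strong[MAX_STRONG_LEN];
        } block_sig_t;

        /* Compare a freshly computed full digest against the
         * (possibly truncated) stored sum: only the stored prefix
         * takes part in the comparison. */
        static int strong_match(const block_sig_t *sig,
                                const unsigned char *full_digest)
        {
            return memcmp(sig->strong, full_digest,
                          sig->strong_len) == 0;
        }
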
* Testing

  * Test broken pipes.

  * Test files >2GB and >4GB.  Presumably these must be generated as
    streams so that the disk requirements for running the test suite
    are not too ridiculous.  I wonder if these tests will take too
    long to run?  Probably, but perhaps we can afford to run just one
    carefully-chosen test.

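    A sketch of a disk-free data source for such a test: generate a
    deterministic pseudo-random stream of any length and feed it
    through the job in chunks (rand_r is POSIX; the point is the
    chunking, not the generator):

        #include <stddef.h>
        #include <stdlib.h>

        /* Fill buf with `len' reproducible bytes; `state' is the
         * generator seed, advanced as we go. */
        static void gen_chunk(unsigned *state,
                              unsigned char *buf, size_t len)
        {
            size_t i;
            for (i = 0; i < len; i++)
                buf[i] = (unsigned char) (rand_r(state) & 0xff);
        }
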
* Use slprintf not strnprintf, etc.

* Long files

  * How do we handle the large signatures required to support large
    files?  In particular, how do we choose an appropriate block size
    when the length is unknown?  Perhaps we should allow a way for
    the signature to scale up as it grows.

  * What do we need to do to compile in support for this?

    * On GNU, defining _LARGEFILE_SOURCE as we now do should be
      sufficient; see the sketch after this list.

    * SCO and similar 32-bit platforms may be more difficult.  Some
      SCO systems have no 64-bit types at all, so there we will have
      to do without.

    * On larger Unix platforms we hope that large file support will
      be the default.

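    As a sketch of the GNU case (the _FILE_OFFSET_BITS setting is an
    assumption beyond the note above, but on glibc it is what makes
    off_t itself 64 bits):

        #define _LARGEFILE_SOURCE       /* exposes fseeko()/ftello() */
        #define _FILE_OFFSET_BITS 64    /* 64-bit off_t on glibc */

        #include <stdio.h>
        #include <sys/types.h>

        /* Seeking past 2GB works when off_t is 64 bits; where no
         * 64-bit type exists this approach is simply unavailable. */
        static int seek_far(FILE *f, off_t where)
        {
            return fseeko(f, where, SEEK_SET);
        }
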
* Perhaps make extracted signatures still be wrapped in commands.
  What would this lead to?

  * We'd know how much signature data we expect to read, rather than
    requiring it to be terminated by the caller.

* Selective trace of particular areas of the library.