Path: spam!ross
From: ross@spam.ua.oz.au (Ross Williams)
Newsgroups: comp.compression
Subject: *New*, *Improved*, DC Interface Standard (proposed)!
Summary: Revised version of a proposed data compression interface standard.
Keywords: data compression stream transformation proposed interface standard
Message-ID: <933@spam.ua.oz>
Date: 12 Jul 91 06:34:03 GMT
Sender: ross@spam.ua.oz
Followup-To: comp.compression
Distribution: world
Organization: Statistics, Pure & Applied Mathematics, University of Adelaide
Lines: 1207


                            Announcing the

                          *NEW*, *IMPROVED*

                 Data Compression Interface Standard:

    Revised, generalized, and now being flogged under the name of

          STREAM TRANSFORMATION ALGORITHM INTERFACE STANDARD


Hi there compressor heads. A few weeks ago I posted a proposed data
compression interface standard for comment in the hope that some sort
of consensus could be reached in standardizing the interfaces of data
compression algorithms. I got lots of comments and used these to
construct a revised standard. Here is a table of the people who
replied and a list of their comments.

Short Name    Full Name           Email Address
------------  ------------------  -------------------------
Douglas       Douglas W. Jones    jones@pyrite.cs.uiowa.edu
Ian           Ian Lance Taylor    ian@airs.com
Jean-loup     Jean-loup Gailly    jloup@chorus.fr
Brad          Brad Templeton      brad@looking.on.ca
Dan           Dan Bernstein       brnstnd@kramden.acf.nyu.edu
Gordon        Gordon Irlam        gordoni@berlioz.ua.oz
Ronald        Ronald van Eijck    ronald@ecl014.UUCP
Graham        Graham Toal         gtoal@tharr.UUCP
David         David Gudeman       gudeman@cs.arizona.edu
Marcus        Marcus J. Ranum     mjr@hussar.dco.dec.com

20-Jun-1991  Douglas    Use more than one procedure.
20-Jun-1991  Douglas    Use a stream interface rather than a block interface.
21-Jun-1991  Ian        Be less draconican about memory restrictions.
                        Include a version number for the standard in ID record.
21-Jun-1991  Jean-loup  Algorithm should allocate its own memory.
21-Jun-1991  Brad       Allow statically allocated memory. (Be less draconian).
21-Jun-1991  Jean-loup  Algorithm should be allowed to use lots of stack.
21-Jun-1991  Dan        Algorithms should be allowed to have parameters.
21-Jun-1991  Dan        Use a stream interface rather than a block interface.
22-Jun-1991  Gordon     Use an ISO standard date format in the ID record.
22-Jun-1991  Ian        Don't aim for cross machine portability.
22-Jun-1991  Ronald     Standardize an archiver interface instead.
22-Jun-1991  Dan        Allow parameters. Use a stream model.
22-Jun-1991  Graham     Standard is pretentious twaddle.
22-Jun-1991  David      Use global variables to implement parameters.
                        Allow both block and stream interface.
                        Use more than one procedure.
23-Jun-1991  Dan        Agreement with David's message of 22-Jun-1991.
23-Jun-1991  Graham     Standard should be implemented before being imposed.
                        A data format would be a more useful standard.
23-Jun-1991  Marcus     Use magic numbers to identify algorithms.
27-Jun-1991  Dan        Implement standard before releasing it.
                        Algorithm should obtain its own memory.
                        Stack and heap memory consumption quoted separately.
                        Allow parameters.
28-Jun-1991  Jean-loup  Use a process model. Do IO through standard procedures.
28-Jun-1991  Dan        Using IO procedures is not inefficient.

PRIVATE EMAIL SUGGESTIONS FOLLOWING PUBLIC DISCUSSION
01-Jul-1991  Jean-loup  Process model is best.
01-Jul-1991  Jean-loup  Multiple instances are a problem in process model.
02-Jul-1991  Jean-loup  Don't place a limit on the compression factor.
                        Parameterized IO routines are not hard to use.
02-Jul-1991  Jean-loup  Both process and ADT models have their problems, but
                        agreed that ADT model is more general in use.


When I drafted the standard, I thought it was a fairly clean, good
standard, and I still believe that. However, once posted to the
network it drew comments which made it clear to me that I had
neglected some classes of user and had (with all best intentions of
encouraging portability) cast a standard that was too fastidious about
tying down things like memory consumption.

Reluctantly, I revised the standard to satisfy the revised
requirements. The result is a standard that is much more
comprehensive, but not as clean and elegant. A good change, but a sad
change.

Here is the list of comments again with a summary of my final (i.e. at
the end of the review process) decision on each comment.

20-Jun-1991  Douglas    Use more than one procedure.
Done.

20-Jun-1991  Douglas    Use a stream interface rather than a block interface.
Done. But data is still passed in blocks.

21-Jun-1991  Ian        Be less draconican about memory restrictions.
                        Include a version number for the standard in ID record.
Done.

21-Jun-1991  Jean-loup  Algorithm should allocate its own memory.
Done.

21-Jun-1991  Brad       Allow statically allocated memory. (Be less draconian).
Done. (Just specify a single-instance abstraction).

21-Jun-1991  Jean-loup  Algorithm should be allowed to use lots of stack.
Done.

21-Jun-1991  Dan        Algorithms should be allowed to have parameters.
Done.

21-Jun-1991  Dan        Use a stream interface rather than a block interface.
Done. But data is still passed in blocks.

22-Jun-1991  Gordon     Use an ISO standard date format in the ID record.
Done.

22-Jun-1991  Ian        Don't aim for cross machine portability.
This is still important, but not as emphasized in the revised
standard.

22-Jun-1991  Ronald     Standardize an archiver interface instead.
I think an internal interface standard is a better first step.

22-Jun-1991  Dan        Allow parameters. Use a stream model.
Done.

22-Jun-1991  Graham     Standard is pretentious twaddle.
Sorry, - the revised standard is even worse!

22-Jun-1991  David      Use global variables to implement parameters.
                        Allow both block and stream interface.
                        Use more than one procedure.
Parameters are now supported using a string interface.
The revised standard supports a stream interface of which block
interfaces are a special case.
The revised standard uses ten procedures per data compression
algorithm!

23-Jun-1991  Dan        Agreement with David's message of 22-Jun-1991.
23-Jun-1991  Graham     Standard should be implemented before being imposed.
                        A data format would be a more useful standard.
Disagree. People need standards to cling to to make them feel warm and
fuzzy. Without standards everyone goes in different directions.

23-Jun-1991  Marcus     Use magic numbers to identify algorithms.
This was already in the original standard I think.

27-Jun-1991  Dan        Implement standard before releasing it.
                        Algorithm should obtain its own memory.
                        Stack and heap memory consumption quoted separately.
                        Allow parameters.
All but the first request done.

28-Jun-1991  Jean-loup  Use a process model. Do IO through standard procedures.
Rejected because of the messiness of linking in procedures and
because of the way such an arrangement dominates control (e.g. you
can't hook them together in chains).

28-Jun-1991  Dan        Using IO procedures is not inefficient.
Fine, but I chose not to go that way.

PRIVATE EMAIL SUGGESTIONS FOLLOWING PUBLIC DISCUSSION
01-Jul-1991  Jean-loup  Process model is best.
Possibly but it leads to a nasty interface. A procedure interface
implemented using generators gives the best of both worlds.

01-Jul-1991  Jean-loup  Multiple instances are a problem in process model.
Revised standard allows multiple instances (as did original one).

02-Jul-1991  Jean-loup  Don't place a limit on the compression factor.
                        Parameterized IO routines are not hard to use.
The revised standard places no limit on compression or on the length
of streams. I think paramterized IO is messy.

02-Jul-1991  Jean-loup  Both process and ADT models have their problems, but
                        agreed that ADT model is more general in use.
Agreed; the ADT model requires that algorithms that are most naturally
described as processes be inverted into data abstractions. However, it
is more general and in my opinion, its benefits outweigh its
disadvantages.


I feel that the revised standard addresses most of the comments
above made during the last round.

The revised standard is not as tight as the original, partly because
it is inherently messier and partly because (given the response to my
previous effort) I do not wish to put any further effort into tying up
the details until I feel that the net is happy with its overall
framework. The revised standard is by no means watertight, but it is
well fleshed out.

Once I had switched over to a stream model, I realized that with a few
minor changes, the standard could be used to describe the interface of
almost any stream model. I pursued this and the result is that the new
standard is generalized to apply to stream transformation algorithms
in general.

The following areas caused particular difficulty during the revision
process.

FLUSH FLAG: The semantics of the flush flag were one of the hardest parts
of forming this standard. All comments welcome.

IDENTITY and FORMAT: The whole issue of the identity of algorithms and
the data formats that they manipulate is extremely messy and it is
very hard to arrive at a satisfactory solution. Algorithm parameters
do not help the situation. I would greatly appreciate any ideas here
for cleaning up this part of the standard.

PARAMETERS IN INVOCATION GROUP: Is it better for the algorithm to tell
the user how many input bytes it processed after each invocation
(within an invocation group) and then have the user pass a reduced
input block during the next call, or is the current scheme of passing
the whole block each time better? I chose the whole block because I
though it would allow the user to more simply place the call in an
invocation group loop. It would also assist algorithms such as LZRW1
that wish to hold memory pointers to the input.

------

Below is the revised standard. I hope that it is closer to what people
want. I look forward to your comments.

Ross Williams
ross@spam.ua.oz.au
12-Jul-1991.


STREAM TRANSFORMATION ALGORITHM INTERFACE STANDARD
==================================================
Author : Ross Williams (ross@spam.ua.oz.au).
Date   : 12-Jul-1991.

CONTENTS
--------
1. INTRODUCTION
   1.1 Summary of the Standard
   1.2 Motivation for the Standard
   1.3 Example Applications
   1.4 Benefits of the Standard
   1.5 Scope of the Standard
   1.6 Portability

2. INTERFACE SPECIFICATION
   2.1 Structure of the Interface
       2.1.1 Overview
       2.1.2 Instances
       2.1.3 Statuses
   2.2 Procedures
       2.2.1 The Identification Procedure alg_inf
       2.2.2 The Instance Creation Procedure alg_cre
       2.2.3 The Instance Destruction Procedure alg_des
       2.2.4 The Reset Procedure alg_res
       2.2.5 The Processing Procedure alg_pro
   2.3 Parameter String Format
   2.4 Information Record
   2.5 Data Formats

3. ADMINISTRATION
   3.1 Introduction
   3.2 Revision History
   3.3 Child Standards
   3.4 Benchmark Implementations in C

4. EXAMPLE APPLICATIONS
   4.1 Compressing a Single Block of Memory
   4.2 Process a Stream of Bytes
   4.3 Compressing and Encrypting in a Modem

5. GLOSSARY

6. REFERENCES


1. INTRODUCTION
---------------

1.1 Summary of Standard
------------------------
This standard defines an interface for the communication between
computer programs and implementations of stream transformation
algorithms. Here the term "stream transformation algorithm" is used to
mean an algorithm whose sole function is to read a (possibly infinite)
stream of input bytes and produce a (possibly infinite) stream of
output bytes.

The standard constrains each stream transformation algorithm
implementation to take the form of a well-defined set of procedures
whose parameter lists are tightly defined. The standard is programming
language independent, seeking not to control the details of the
interface, but rather the structure of the interface and the nature of
the information passing through it.

1.2 Motivation for the Standard
-------------------------------
Many algorithms that read input and write output can be viewed as
stream transformers. So common is this structure in computer programs
that the Unix operating system embraces it by providing command
language constructs for connecting stream processing programs together
into sequences. At a lower programming level, there is no such
construct, and use of the higher level construct entails the overhead
of creating and destroying a process, an overhead that is likely to be
prohibitive for small amounts of data. At the lower programming level
there is also a need to keep control of the flow of data.

This standard describes a data abstraction (with a procedural
interface) that is general enough to allow a variety of stream
transformations to be implemented with a uniform interface. The
benefit of using the same interface for many stream algorithms is that
it will allow stream algorithms to be easily interchanged at edit,
compile, or even run time.

1.3 Example Applications
------------------------
This standard is highly applicable to, but not limited to, the
following applications:

   * Data compression.
   * Cryptography.
   * Error correction and detection.
   * General coding (e.g. to text file format for transmission).
   * Miscellaneous "Unix filter" transformations.

1.4 Benefits of the Standard
----------------------------
Some benefits that will flow from this standard are listed below.

  * By separating user programs and stream transformers, the standard will
    yield a high degree of interchangeability of each.

  * The standard will assist those cataloging and comparing algorithms.

  * Just by existing, the standard will encourage the designers of new
    algorithms to provide conforming implementations.

  * The standard will enable programers to write programs that
    select from one of many algorithms at run time.

1.5 Scope of the Standard
-------------------------
This standard defines a generic procedural interface for stream
transformation algorithms. The standard specifies:

   * A set of interface procedures and the information content of their
     parameter lists.

   * The generic behaviour of the procedures (stopping short of specifying
     the actual transformation performed!).

The standard does NOT specify:

   * The details of the interface (which are necessarily language specific).

   * The programming language or machine to be used.

   * The format of the data stream.

1.6 Portability
---------------
A goal of this standard is to provide a uniform interface that will
allow different stream algorithms to be easily interchanged. However,
in striving to support this "portability" of algorithms, other
portability issues arise:

   1) Portability across machines and machine environments.
   2) Portability across programming languages.

Machine portability cannot be controlled by this standard and is left
up to the implementor who may make as machine portable or as
non-machine portable an implementation as desired.

Language portability is more difficult as it is usually impossible to
write a program that is simultaneously legal in more than one
programming language! Because the details of procedure calls,
abstraction, and interfacing will vary in each language, this standard
does not attempt to define the details of the interface --- it
couldn't. Instead it defines the behaviour of procedures and the
nature of the information flowing through procedure parameter lists.

It is proposed that the interface details be standardized for each
programming language in a programming language specific child
standard.


                          +---------------------+
                          |     This Language   |
                          | Inspecific Standard |
                          +---------------------+
                                     |
              +----------------------+-----------------------+
              |                      |                       |
     +----------------+     +----------------+      +----------------+
     | Child Standard |     | Child Standard |      | Child Standard |
     | for Language 1 |     | for Language 2 |      | for Language 3 |
     +----------------+     +----------------+      +----------------+
        /     |    \           /     |    \            /     |    \
      Alg1  Alg2  Alg3       Alg1  Alg2  Alg3        Alg1  Alg2  Alg3


Each algorithm can then be implemented in each different language
using the specific interface for that language. Algorithms written in
a particular language will then be easily interchangeable with other
algorithms written in the same language.

At the next step up, the conformance of each child standard to the
generic standard will ensure that the task of translating an
algorithm from one language to another will be simplified.

On some systems (e.g. vaxen) that support cross-language procedure
call standards, it may be possible to define a machine specific
convention to allow implementations of various stream processing
algorithms written in different languages to be interchanged at link
time or run time.

There may also arise particular conventions within this standard for
particular classes of stream transformation algorithms. For example, a
convention might arise for the method of specifying a private key in
cryptography algorithms.

2. INTERFACE SPECIFICATION
--------------------------

2.1 Structure of the Interface
------------------------------

2.1.1 Overview
--------------
Stream transformation algorithms often come in pairs. For example,
there may be a pair of data compression algorithms, one to compress
the data and the other to decompress the data. This standard treats
these two aspects of a single system of compression as separate
algorithms. Thus, there will be one defined algorithm which performs
compression and one defined algorithm that will perform decompression.

A conforming implementation of a stream transformation algorithm may
be implemented in any manner, but must present to the user program an
interface consisting of five procedures whose parameter lists are
tightly defined. These procedures could connect to:

   * Straight procedures.
   * An object in an object oriented language.
   * A process whose interface is cast as procedure calls.
   * A generator[Marlin80].

A generator implementation would be by far the best, as it has the
procedure interface required by the standard, avoids the overheads
usually associated with processes, and does not require the algorithm
implementer to convert (in fact, "invert") the algorithm from process
form to ADT/procedure form.

Each interface procedure communicates only through its parameter list.

2.1.2 Instances
---------------
This standard allows stream transformation algorithm implementations
to allow more than one copy of an algorithm to exist at any one time.
These activations are called instances, and the data associated with a
particular instance is stored in an instance variable.

User program can manipulate instance variables, but will usually
manipulate references to instances. Such references are stored in
variables of type instance_t. An instance reference may be implemented
as:

   * The instance data itself.
   * A reference to the instance data.
   * A reference to a process.
   * A reference to a generator.
   * An integer identification number.
   * A dummy value in the case where there is at most a single instance.

(Note: In the case of the integer identification number, the algorithm
 abstraction must maintain a table that maps between integers and
 instances).

The particular system for referring to instances should be fixed for
each programming language and defined in that language's child
standard.

Perhaps the best system is to use integers (an integer type long
enough to hold pointer values if possible) as instance references and
let the each individual algorithm determine (out of sight of the
interface) how these integers are to be mapped into instances.

If the implementation places any limit on the number of instances
(e.g. 1 if the implementation uses static data structures for its
instance data), then this limit must be advertised in the
'max_instances' field in the information record.

2.1.3 Statuses
--------------
A number of things can go wrong during execution of a stream
transformation, and it is necessary to provide a channel for status
information in the interface.

When a call to one of the standard interface procedures is made, the
procedure conveys one of the following values upon return.

   SUCCESS            - The procedure call was successful.
   PARAM_BAD          - Specified parameter string illegal.
   DATA_BAD           - Data being processed is bad.
   MEMORY_OUT         - Algorithm has run out of memory.
   TOO_MANY_INSTANCES - Too many instances to allow creation of a new one.
   UNKNOWN_INSTANCE   - The instance id is unknown.
   OVERRUN            - Algorithm overrun by input data.
   CANT_FLUSH         - Algorithm is buffering non-byte-aligned data.
   MISC_ERROR         - An unspecified error occurred (e.g. assertion failed).

Any value except SUCCESS indicates that a error has occurred that is
fatal to the execution of the stream transformation algorithm and that
the user should not attempt any more operations on that instance.

The status channel can be implemented in a variety of ways. Most
simply it can be implemented as an integer or enumeration type which
can be returned by each procedure (as well as other information).
However, the status could also be implemented by other mechanisms such
as Ada exceptions. The exact mechanism should be fixed for each
programming language by the child standard for each language.

2.2 Procedures
--------------
The interface consists of five procedures. The naming of the
procedures is up to the child standard for each particular programming
language. However, it is suggested that the name of each procedure
consist first of the algorithm name followed by one of the following
three-letter codes or (if the identifier significance of the language
allows it) words.

   inf - information - Return information about the stream transformer.
   cre - create      - Create an instance of a stream transformer.
   des - destroy     - Destroy an instance of a stream transformer.
   res - reset       - Reset an instance of a stream transformer.
   pro - process     - Process a block of bytes.

For example, if the stream algorithm were called "alg", the procedures
exported from the "alg" module might be called:

   alg_inf
   alg_cre
   alg_des
   alg_res
   alg_pro

These names will be used in future examples.

The keywords IN, INOUT, and OUT are used below in the Ada sense to
describe the flow of information at the procedure interface:

   IN    allows the procedure only to read a parameter.
   OUT   allows the procedure only to write the parameter.
   INOUT allows the procedure to both read and write the parameter.

The fact that the procedure has a particular power (i.e. reading and
or writing) over a parameter does not require the procedure to
exercise that power.

2.2.1 The Identification Procedure alg_inf
------------------------------------------
The information procedure alg_inf accepts a parameter text string in
'param' containing parameters to the algorithm and returns an
information record containing general information about the algorithm
as well as more specific information determined by the parameter
settings. See the section on the information record for more
information.

This procedure may be called at any time.

   alg_inf(IN  param : string;
           OUT id    : information_record)
   begin alg_inf
      id:=ALG_ID;
      if legal(param)
         id.status:=SUCCESS;
         <fill in parameter specific fields of 'id'>
      else
         id.status:=PARAM_BAD;
         id.error_string:="<reason why parameter string is bad>";
      endif
   end alg_inf

2.2.2 The Instance Creation Procedure alg_cre
---------------------------------------------
The instance creation procedure uses a parameter string passed by the
user to create a new, freshly-initialized instance of the algorithm. A
reference to the new instance is placed in 'instance'. If a problem
arises during creation, 'status' is set to other than SUCCESS and
'instance' remains undefined.

   alg_cre(IN  param    : string_t;
           OUT instance : instance_t;
           OUT status   : status_t);
   begin alg_cre
      if not legal(param) then
         status:=PARAM_BAD;
      else
         if <there are too many instances> then
            status:=TOO_MANY_INSTANCES;
         else
            instance:=new(instance_t);
            if instance=NULL then
               status:=MEMORY_OUT;
            else
               status:=SUCCESS
            endif
         endif
      endif
   end alg_cre

2.2.3 The Instance Destruction Procedure alg_des
------------------------------------------------
This procedure destroys a specified instance, if possible destroying
and deallocating the instance's data structures. This procedure may be
called regardless of the state of the instance (e.g. the instance may
be part way through an invocation group at the time of destruction).

   alg_des(INOUT instance : instance_t;
           OUT   status   : status_t);
   begin alg_des
      if not known(instance) then
         status:=UNKNOWN_INSTANCE;
      else
         deep_dispose(instance);
         status:=SUCCESS;
      endif
   end alg_des

2.2.4 The Reset Procedure alg_res
---------------------------------
This procedure is used to reset an instance back to its initial (just
created) state. alg_res is functionally the same as a call to alg_des
followed by a call to alg_cre (with the same parameters as the initial
alg_cre), but has been included so as to allow algorithm
implementations to provide a way to reset instances without using
alg_des and alg_cre, (which are likely to contain (possibly) expensive
heap deallocation and allocations). This procedure may be called at
any time after an alg_cre, regardless of the state of the instance.

   alg_res(INOUT instance : instance_t;
           OUT   status   : status_t);
   begin alg_res
      if not known(instance) then
         status:=UNKNOWN_INSTANCE;
      else
         Set instance back to the newly created state;
         status:=SUCCESS;
      endif
   end alg_res

2.2.5 The Processing Procedure alg_pro
--------------------------------------
The processing procedure is the heart of the implementation and
performs the actual data transformation. The input data stream is
processed by dividing it up into input blocks which are fed to the
procedure one at a time with the 'instance' variable maintaining the
stream algorithm's context between calls to alg_pro.

To allow the algorithm to produce an arbitrary amount of output for a
given input, but still place a bound on the output it can generate
during any one call, a multiple invocation system is used to process
each input block. To process a block, the user hands over the block of
input data, space for the output data (e.g. a pointer to memory or an
array of bytes or something), and specifies a maximum limit to the
length of the output data. The procedure then returns with zero or
more output bytes. If 'more_output'=false, the algorithm has finished
on that block of input data and the user program can then use the instance
to process further input data. If 'more_output'=true, the user program
must place a similar call to alg_pro to receive further output bytes.
Similar calls must be made until alg_pro returns with
'more_output'=false. A series of such calls is called an "invocation
group".

   alg_pro(INOUT instance      : instance_t;
           IN    inblock       : array of bytes;
           IN    flush         : boolean;
           IN    max_out_bytes : natural;
           OUT   outblock      : array of bytes;
           OUT   more_output   : boolean;
           OUT   status        : status_t;
           OUT   status_string : string_t);
   begin alg_pro
      <stream process code>
   end alg_pro

INOUT 'instance' is the reference to the instance of the stream
transformation algorithm that is to process the data.

IN 'inblock' is a block of zero or more bytes which are to be fed to
the instance during this invocation group. 'inblock' must have the
same value throughout all the calls in a particular invocation group.

Conforming algorithms on a particular implementation must be able to
process the largest blocks of input that could possibly be handed to
them. If an algorithm can process (say) only 64K, then it should be
wrapped in a loop before being encapsulated in a conforming procedure.

IN 'flush' is a boolean that instructs the algorithm to clear its
buffers. By setting this flag, the user program instructs the
algorithm to completely finish processing all the input given to the
algorithm (including the current input block) by the end of the
invocation group. Further semantic details of this flag are left up to
each algorithm.

In some cases, it may be impossible to flush the input without
compromising the operation of the algorithm (e.g. an LZW data
decompression algorithm requested to flush at a point when it has only
read one of two bytes forming a 16-bit code word). The user program
should know better than to cause such a situation to arise, but if it
does, the algorithm can return the status CANT_FLUSH.

IN 'max_out_bytes' is the maximum number of bytes that the algorithm
is allowed to place in the output block during this particular
invocation. This value can change from call to call, even within an
invocation group.

OUT 'outblock' is a block of zero to 'max_out_bytes' bytes which have
been produced by the transformation algorithm implementation. The
bytes in 'outblock' may have been buffered by the instance and do not
necessarily directly correspond in any way to the input bytes in
'inblock'. This parameter can change (e.g. its address or length)
within an invocation group.

OUT 'more_output' is true after a call to alg_pro iff the conforming
algorithm implementation needs to write more output before receiving
the next block of input (see general discussion above). The algorithm
is allowed to set this output flag to true, even if it has not placed
max_out_bytes bytes into the output block.

OUT 'status' indicates if anything went wrong during the call. A
status other than SUCCESS indicates that an error fatal to the
transformation occurred.

OUT 'status_string' is an informational string that is always well
defined. For success it is the empty string. If an error occurs, this
string gives a more detailed description of what happened. It may also
be used for other purposes (e.g. for noting errors corrected in an
error correction transformation).

2.3 Parameter String Format
---------------------------
The parameter string alg_inf contains algorithm parameters (e.g. the
code width parameter in Unix compress). A variable-length string
format was chosen to provide flexibility in the number, kind, and
information content of parameters, to provide a human-readable
representation, and to allow for easy transmission through text-only
channels.

Parameter strings should conform to the following syntax:

   <param_string> ::= <empty_string> | <param-set>
   <param-set>    ::= ( "+" | "=" ) "(" [<param> { ","<param> }] ")"
   <param>        ::= <name> "=" """" <value> """"
   <name>         ::= <letdig> { <letdig> }
   <letdig>       ::= "a".."z" | "A".."Z" | "0".."9"
   <value>        ::= { <any-printable-except-,> }

The semantics of all this syntax should be self evident, except
possibly for the "+" and "=" at the beginning of the string. These are
to indicate whether the user intends the list given to be a complete
list of the parameters of the algorithm or just a partial list. The
algorithm should always reject a parameter list beginning with "="
that does not specify a value for every parameter of the algorithm.

For parameters that consist of binary fields (e.g. a DES key), the
convention is suggested of using two-digit hexadecimal digits in text
form.

The strings are case sensitive.

Example parameter strings:
  ""                         ! Use default parameter settings.
  "+(mem=65536,depth=3)"     ! Modify params: memory and depth.
  "=(key=0A4BF63099A7B4B4)"  ! Specify crypto key. Assert only parameter.
  "=()"                      ! Assert that algorithm has no parameters.

2.4 Information Record
----------------------
This section defines the information contained in information records
returned by the alg_inf procedure. The standard does not define the
exact details of the record, which will vary from language to
language, but it is expected that each language-specific record will
conform quite closely to the "ideal" record specified below.

The record is divided into two parts. The first part gives information
about the algorithm implementation that is not dependent on the
parameters that are passed to the algorithm implementation (e.g. the
algorithm's name). The second part gives information that may change
with parameters (e.g. the amount of memory required).

   record

     /* Part 1: Information that does not depend on parameter values. */
     stanver         : natural
     name            : string
     version         : string
     date            : string
     format_in       : natural
     format_out      : natural
     dist_name       : string
     dist_contact    : string
     reference       : string
     mechanism       : string
     usage           : string
     legal           : string
     note            : string

     /* Part 2: Information that does depend on parameter values. */
     status          : status
     error_string    : string
     memory_total    : natural
     memory_stack    : natural
     memory_heap     : natural
     deterministic   : boolean
     sizechange      : boolean
     informationloss : boolean
     max_instances   : natural
     block_sensitive : boolean

   end record

stanver: The version number of the interface standard to which this
record conforms.

name: The name of the  algorithm (e.g. "LZRW1").

version: A string giving the version number of this particular piece
of code.

date: The date of release of this particular piece of code. The date
string must be of length 10 and have the format "yyyy-mm-dd", where yy
is a year in the range "1900" and "1999", mm is a month in the range
"01".."12" (Jan..Dec), and dd is a day number in the range "01".."31".

format_in: A data format number (see later) specifying the format in
which the algorithm expects input data to appear.

format_out: A data format number (see later) specifying the format of
the output data generated by the algorithm.

dist_name: The name of the person or organization distributing the
code.

dist_contact: How to contact the distributor.

reference: A reference to any publication describing the algorithm.

mechanism: A brief description of the way the algorithm works.

usage: Hints on how the algorithm is best used.

legal: A message about the legal status of the code and the algorithm.

note: Miscellaneous information about the algorithm.

The remaining fields are all parameter dependent. That is, their
values may vary depending on the parameter string passed to the
algorithm.

status: This is in fact the status value from the call to alg_inf. It
can be one of two values SUCCESS or PARAM_BAD depending on whether the
parameter string passed to the call is acceptable to the algorithm.

error_string: If status=PARAM_BAD then this field contains a string
describing what is wrong with the parameter string.

memory_total: An estimate of the maximum total number of bytes that
will be used by an instance of the algorithm if created with the
specified parameter list.

memory_stack: An estimate of the maximum number of bytes of stack that
will be used by an instance of the algorithm if created with the
specified parameter string.

memory_heap: An estimate of the maximum total number of bytes of heap
that will be used by an instance of the algorithm if created with the
specified parameter string.

deterministic: True iff the algorithm can only yield a single output
stream value for each possible input stream value (regardless of the
size of input blocks).

sizechange: True iff the transformation is capable of writing more or
less bytes that it reads (at flush points).

informationloss: True iff the transformation loses information in the
sense that it is a many to one mapping.

max_instances: The maximum number of instances of the algorithm that
can exist at any one time. This value can be from zero to infinity. A
value of zero indicates no limit to the number of instances. A value
of one can be used in algorithms that use static data structures.

block_sensitive: True iff the way in which the input stream is split
up into input blocks will affect the output produced. For example,
some data compression algorithms might compress better if handed the
input in 16K input blocks instead of 1 byte input blocks. Others will
produce the same output regardless of how the input is fed to them.

2.5 Data Formats
----------------
An earlier version of this standard required that each algorithm be
given a unique identification code which could be used to tag data
processed by the algorithm so that the same algorithm could be used to
"unprocess" the data (e.g. compress and decompress). This turned out
to be a mistake because (for example) it did not allow data by many
data compression algorithms using the same coding format to be
decompressed by the same decompressor.

In this revised standard, the "to" and "from" aspects of stream
processing algorithm pairs have been separated into distinct
algorithms. This means that (for example) a data compression algorithm
can exist independently from its corresponding decompression
algorithm. To allow a degree of independence between algorithms, but
still enable the user to match pairs, we focus not on the algorithms
but on the format of the data streams that they manipulate.

Each algorithm has two fields (in_format and out_format) of its
information record that specify the input and output data formats. The
format number 0 is reserved as a wildcard and represents any data (or
in algorithm pairs "the original data"). Any other number indicates a
particular format (such as a compressed data format). Thus for a data
compression algorithm consisting of a compression algorithm and a
decompression algorithm, with a compressed data format number of
12345, the compression algorithm would have (in=0,out=12345) and the
decompressor would have (in=12345,out=0). The user program would see
the 0 and realize that raw data can be fed in, and would also be able
to match the two algorithms and use them together.

By allowing the designer to specify an input and an output format,
this system allows complicated networks of transformations to be
constructed and used sensibly by the user program. It allows a single
algorithm to "decode" the data of dozens of "coder" algorithms, each
with its own performance characteristics.

A data format number is a 31-bit number whose bits are numbered 31..0.
Bit 31 is always zero (so as to allow signed 32-bit integers to be
used). Bit 30 is 0 if the format is unparameterized and 1 if the
format is parameterized. A parameterized format is one which may be
impossible to process further (e.g. decode) without supplying extra
information (e.g. most encrypted formats will be parameterized because
the decrypting algorithm requires the key). The remaining bits (0-29)
carry no semantic meaning but merely serve to distinguish from other
formats.

Designers of algorithm pairs not using any well-established data
format can create a new data format number by tossing a fair coin 30
times, once for each bit of the id. Bit30 should be set depending on
whether it is a parameterized format. The tossing of a coin is a
requirement of this standard. Under no circumstances is a computer's
random number generator to be used to generate an id. In the unlikely
event that two algorithm designers have chosen the same number, one of
the algorithm's numbers can be changed.

3. ADMINISTRATION
-----------------

3.1 Introduction
----------------
This section contains the administrative details of the standard. It
outlines what is expected of those constructing conforming data
compression algorithm implementations and records the details of the
standard's history.

3.2 Revision History
--------------------
This section gives the revision history of this standards document.

YYYY-MM-DD   Name  Action
----------   ----  ------
1991-06-19   RW    Created the first  version of the standard.
1991-07-12   RW    Created the second version of the standard.

3.3 Child Standards
-------------------
As stated earlier, a child standard should be created for each
programming language in which implementations are to be written. The
child standard need not duplicate the information in this standard.
Rather, it should be like a subclass in an object-oriented programming
language which adds detail and modifies some information in its
superclass.

In most cases, the child standard need consist only of a few type
declarations and some example procedure declarations (and lots of
comments!).

3.4 Benchmark Implementations in C
----------------------------------
Although algorithm designers are free to implement conforming
procedures in whatever language they choose, it is hoped that,
regardless of the language they choose, that they will also create a
version in C conforming to the C child-standard. If everyone who
creates a new algorithm does this there will soon be a big pool of
highly interchangeable benchmark implementations that users can use to
determine quickly the best algorithm/implementation for their needs.

The C programming language was chosen as a benchmark language because
it is so widely available.

Eventually, if enough algorithms are standardized, it may be possible
to construct an expert system to help users choose the best algorithms
for their application. Users could submit a sample of data to the
expert system and say how much memory and time they have during
compression and decompression, whether they are prepared to apy patent
royalties, and other such information. The expert system could then
test various standardized algorithms on the submitted data and
determine the best algorithm for the user's needs.


4. EXAMPLE APPLICATIONS
-----------------------
The following code fragments indicate how the interface could be used
in various stream processing applications.

To simplify the presentation, error handling has been omitted and
liberties have been taken with the programming notation which draws on
self-explanatory features from a variety of programming languages.

Comma (",") is used as a concatenation operator.

4.1 Compressing a Single Block of Memory
----------------------------------------
This piece of code uses a conformant implementation of a stream
transformation algorithm to perform data compression on a single block
of memory. The code might be found inside a random access file
compression system which has to compress fixed-length blocks of data.
Here the invocation group mechanism works to advantage. By feeding the
algorithm equal-sized input and output blocks, and setting flush=true,
the algorithm indicates in 'more_output' whether it has expanded the
data. The user program uses this information to decide whether to use
the output of the compression algorithm, simply to copy the data over.
The first byte of the (final) output is used as a flag to indicate
which choice was taken.

const BLOCK_SIZE  : integer = 16*1024;

var instance      : instance_t;
var inblock       : array[BLOCK_SIZE] of byte;
var outblock      : array[BLOCK_SIZE] of byte;
var more_output   : boolean;
var status        : status_t;
var status_string : string;

compress_cre("",status,instance);
compress_pro(instance,inblock,flush=>true,
             BLOCK_SIZE,outblock,more_output,status,status_string);
compress_des(instance,status)
if more_output then
   outblock:=0,inblock;   -- Copy over.
else
   outblock:=1,outblock;  -- Use compressed representation.
endif

4.2 Process a Stream of Bytes
-----------------------------
This example demonstrates how a simple stream of bytes can be fed
through an algorithm. The input and output buffers INSIZE and OUTSIZE
can be set to any positive values, including 1 (which might be used in
a Unix getc/putc interface).

const INSIZE      : natural :=1;
const OUTSIZE     : natural :=1;

var instance      : instance_t;
var inblock       : array[INSIZE ] of byte;
var outblock      : array[OUTSIZE] of byte;
var more_output   : boolean;
var status        : status_t;
var status_string : status_string_t;

open(instream);
create(outstream);
alg_cre("",status,instance)
while not eos(instream) do
   read(instream,inblock);
   repeat
      alg_pro(instance,inblock,flush=>false,
              OUTSIZ,outblock,more_output,status,status_string);
      write(outstream,outblock);
   until not more_output;
endwhile
inblock=empty;
repeat
   alg_pro(instance,inblock,flush=>true,
           OUTSIZ,outblock,more_output,status,status_string);
   write(outstream,outblock);
until not more_output;
alg_des(instance,status)
close(instream)
close(outstream)

4.3 Compressing and Encrypting in a Modem
-----------------------------------------
The following example demonstrates how the interface might be used
inside a modem to perform data compression and encryption. Two stream
algorithms are used in this example, one called 'compress' that
performs data compression, and one called 'crypt' that performs
ciphering. This example shows how the inverted procedure/instance/ADT
form used in this interface standard allows what are essentially
multiple processes to execute in interleaved fashion without the user
program ever losing control of the data flows.

More stream processors could be added into this example, by increasing
the number of nested repeat loops.

const INSIZE      : natural := 2*1024;  -- Any positive integer.
const OUTSIZE     : natural := 2*1024;  -- Any positive integer.
const TIMEOUT     : natural := 500;     -- 500 miliseconds.
var compress      : instance_t;
var encrypt       : instance_t;
var inblock       : array[INSIZE ] of byte;
var outblock      : array[OUTSIZE] of byte;
var outblock2     : array[OUTSIZE] of byte;
var more_output   : boolean;
var more_output_2 : boolean;
var status        : status_t;
var status_string : status_string_t;
var instream      : stream_t;
var outstream     : stream_t;
var time_snapshot : time_t;

open(instream);
create(outstream);
compress_cre("",status,compress)
crypt_cre("=(Key=SLOTH)",status,crypt)
while not eos(instream) do

   -- Accumulate bytes until we can't hold any more or until timeout occurs.
   time_snapshot=time();
   inblock:=empty;
   repeat
      read(instream,inbyte) OR timeout if time()>=time_snapshot+TIMEOUT;
      if not timeout then
         inblock:=inblock,inbyte;
      endif
   until size(inblock)=INSIZE or timeout;

   - Process buffered bytes.
   repeat
      compress_pro(compress,inblock,flush=>timeout,
              OUTSIZ,outblock,more_output,status,status_string);
      repeat
         crypt_pro(crypt,outblock,flush=>(timeout AND not more_output),
              OUTSIZ,outblock2,more_output_2,status,status_string);
         write(outstream,outblock2);
      until not more_output_2;
   until not more_output;

endwhile
compress_des(compress)
crypt_des(crypt)
close(instream)
close(outstream)


5. GLOSSARY
-----------

Algorithm - In this document short for "stream transformation
algorithm" which is an abstract procedural specification for a
transformation of a stream of bytes. It is important to distinguish
between an algorithm and an implementation of an algorithm.

Algorithm implementation - Code written in a particular programming
language that actually implements an algorithm.

Byte - In this document, a byte is defined to be eight bits. Memory is
measured in bytes.

Conforming Implementation - An implementation of a stream
transformation algorithm that conforms to this standard.

Instance - In this document, an instance refers to an instance of a
stream transformation algorithm. The term is often used to refer to
the data that is used to store the state of the instance between
activations. (Note: The term "instance" is used in the book "Adaptive
Data Compression" by Ross Williams to mean an instance of a symbol and
the two meanings should not be confused. In this documen the term
"byte" is used to refer to the entity referred to in the book as
"instance").

Invocation Group - A consecutive sequence of invocations of the "_pro"
procedure which collectively process a single block of input.

Printable character - A printable character is defined to be a byte in
the range [32,126]. In the ASCII character set, this corresponds to
the set of printable characters [' '..'~']. Note: On non-ASCII
machines this definition still stands --- the byte values are still to
be used. To print such a character on a non-ASCII machine will require
some kind of conversion.

Procedure - A procedure written in a conventional imperative
programming language.

RW - Ross Williams (ross@spam.ua.oz.au).

String - A string is a variable length array of printable characters.

User program - Any computer program that uses a conformant
implementation of a stream transformation algorithm.


6. REFERENCES
-------------

[Marlin80] C.D.Marlin, "Coroutines: A Programming Methodology, a Language
Design and an Implementation", Lecture Notes in Computer Science #95,
Springer Verlag, 1980.


--<End of Proposed Stream Transformation Algorithm Interface Standard>--