Path: spam!ross From: ross@spam.ua.oz.au (Ross Williams) Newsgroups: comp.compression Subject: *New*, *Improved*, DC Interface Standard (proposed)! Summary: Revised version of a proposed data compression interface standard. Keywords: data compression stream transformation proposed interface standard Message-ID: <933@spam.ua.oz> Date: 12 Jul 91 06:34:03 GMT Sender: ross@spam.ua.oz Followup-To: comp.compression Distribution: world Organization: Statistics, Pure & Applied Mathematics, University of Adelaide Lines: 1207 Announcing the *NEW*, *IMPROVED* Data Compression Interface Standard: Revised, generalized, and now being flogged under the name of STREAM TRANSFORMATION ALGORITHM INTERFACE STANDARD Hi there compressor heads. A few weeks ago I posted a proposed data compression interface standard for comment in the hope that some sort of consensus could be reached in standardizing the interfaces of data compression algorithms. I got lots of comments and used these to construct a revised standard. Here is a table of the people who replied and a list of their comments. Short Name Full Name Email Address ------------ ------------------ ------------------------- Douglas Douglas W. Jones jones@pyrite.cs.uiowa.edu Ian Ian Lance Taylor ian@airs.com Jean-loup Jean-loup Gailly jloup@chorus.fr Brad Brad Templeton brad@looking.on.ca Dan Dan Bernstein brnstnd@kramden.acf.nyu.edu Gordon Gordon Irlam gordoni@berlioz.ua.oz Ronald Ronald van Eijck ronald@ecl014.UUCP Graham Graham Toal gtoal@tharr.UUCP David David Gudeman gudeman@cs.arizona.edu Marcus Marcus J. Ranum mjr@hussar.dco.dec.com 20-Jun-1991 Douglas Use more than one procedure. 20-Jun-1991 Douglas Use a stream interface rather than a block interface. 21-Jun-1991 Ian Be less draconican about memory restrictions. Include a version number for the standard in ID record. 21-Jun-1991 Jean-loup Algorithm should allocate its own memory. 21-Jun-1991 Brad Allow statically allocated memory. (Be less draconian). 21-Jun-1991 Jean-loup Algorithm should be allowed to use lots of stack. 21-Jun-1991 Dan Algorithms should be allowed to have parameters. 21-Jun-1991 Dan Use a stream interface rather than a block interface. 22-Jun-1991 Gordon Use an ISO standard date format in the ID record. 22-Jun-1991 Ian Don't aim for cross machine portability. 22-Jun-1991 Ronald Standardize an archiver interface instead. 22-Jun-1991 Dan Allow parameters. Use a stream model. 22-Jun-1991 Graham Standard is pretentious twaddle. 22-Jun-1991 David Use global variables to implement parameters. Allow both block and stream interface. Use more than one procedure. 23-Jun-1991 Dan Agreement with David's message of 22-Jun-1991. 23-Jun-1991 Graham Standard should be implemented before being imposed. A data format would be a more useful standard. 23-Jun-1991 Marcus Use magic numbers to identify algorithms. 27-Jun-1991 Dan Implement standard before releasing it. Algorithm should obtain its own memory. Stack and heap memory consumption quoted separately. Allow parameters. 28-Jun-1991 Jean-loup Use a process model. Do IO through standard procedures. 28-Jun-1991 Dan Using IO procedures is not inefficient. PRIVATE EMAIL SUGGESTIONS FOLLOWING PUBLIC DISCUSSION 01-Jul-1991 Jean-loup Process model is best. 01-Jul-1991 Jean-loup Multiple instances are a problem in process model. 02-Jul-1991 Jean-loup Don't place a limit on the compression factor. Parameterized IO routines are not hard to use. 02-Jul-1991 Jean-loup Both process and ADT models have their problems, but agreed that ADT model is more general in use. When I drafted the standard, I thought it was a fairly clean, good standard, and I still believe that. However, once posted to the network it drew comments which made it clear to me that I had neglected some classes of user and had (with all best intentions of encouraging portability) cast a standard that was too fastidious about tying down things like memory consumption. Reluctantly, I revised the standard to satisfy the revised requirements. The result is a standard that is much more comprehensive, but not as clean and elegant. A good change, but a sad change. Here is the list of comments again with a summary of my final (i.e. at the end of the review process) decision on each comment. 20-Jun-1991 Douglas Use more than one procedure. Done. 20-Jun-1991 Douglas Use a stream interface rather than a block interface. Done. But data is still passed in blocks. 21-Jun-1991 Ian Be less draconican about memory restrictions. Include a version number for the standard in ID record. Done. 21-Jun-1991 Jean-loup Algorithm should allocate its own memory. Done. 21-Jun-1991 Brad Allow statically allocated memory. (Be less draconian). Done. (Just specify a single-instance abstraction). 21-Jun-1991 Jean-loup Algorithm should be allowed to use lots of stack. Done. 21-Jun-1991 Dan Algorithms should be allowed to have parameters. Done. 21-Jun-1991 Dan Use a stream interface rather than a block interface. Done. But data is still passed in blocks. 22-Jun-1991 Gordon Use an ISO standard date format in the ID record. Done. 22-Jun-1991 Ian Don't aim for cross machine portability. This is still important, but not as emphasized in the revised standard. 22-Jun-1991 Ronald Standardize an archiver interface instead. I think an internal interface standard is a better first step. 22-Jun-1991 Dan Allow parameters. Use a stream model. Done. 22-Jun-1991 Graham Standard is pretentious twaddle. Sorry, - the revised standard is even worse! 22-Jun-1991 David Use global variables to implement parameters. Allow both block and stream interface. Use more than one procedure. Parameters are now supported using a string interface. The revised standard supports a stream interface of which block interfaces are a special case. The revised standard uses ten procedures per data compression algorithm! 23-Jun-1991 Dan Agreement with David's message of 22-Jun-1991. 23-Jun-1991 Graham Standard should be implemented before being imposed. A data format would be a more useful standard. Disagree. People need standards to cling to to make them feel warm and fuzzy. Without standards everyone goes in different directions. 23-Jun-1991 Marcus Use magic numbers to identify algorithms. This was already in the original standard I think. 27-Jun-1991 Dan Implement standard before releasing it. Algorithm should obtain its own memory. Stack and heap memory consumption quoted separately. Allow parameters. All but the first request done. 28-Jun-1991 Jean-loup Use a process model. Do IO through standard procedures. Rejected because of the messiness of linking in procedures and because of the way such an arrangement dominates control (e.g. you can't hook them together in chains). 28-Jun-1991 Dan Using IO procedures is not inefficient. Fine, but I chose not to go that way. PRIVATE EMAIL SUGGESTIONS FOLLOWING PUBLIC DISCUSSION 01-Jul-1991 Jean-loup Process model is best. Possibly but it leads to a nasty interface. A procedure interface implemented using generators gives the best of both worlds. 01-Jul-1991 Jean-loup Multiple instances are a problem in process model. Revised standard allows multiple instances (as did original one). 02-Jul-1991 Jean-loup Don't place a limit on the compression factor. Parameterized IO routines are not hard to use. The revised standard places no limit on compression or on the length of streams. I think paramterized IO is messy. 02-Jul-1991 Jean-loup Both process and ADT models have their problems, but agreed that ADT model is more general in use. Agreed; the ADT model requires that algorithms that are most naturally described as processes be inverted into data abstractions. However, it is more general and in my opinion, its benefits outweigh its disadvantages. I feel that the revised standard addresses most of the comments above made during the last round. The revised standard is not as tight as the original, partly because it is inherently messier and partly because (given the response to my previous effort) I do not wish to put any further effort into tying up the details until I feel that the net is happy with its overall framework. The revised standard is by no means watertight, but it is well fleshed out. Once I had switched over to a stream model, I realized that with a few minor changes, the standard could be used to describe the interface of almost any stream model. I pursued this and the result is that the new standard is generalized to apply to stream transformation algorithms in general. The following areas caused particular difficulty during the revision process. FLUSH FLAG: The semantics of the flush flag were one of the hardest parts of forming this standard. All comments welcome. IDENTITY and FORMAT: The whole issue of the identity of algorithms and the data formats that they manipulate is extremely messy and it is very hard to arrive at a satisfactory solution. Algorithm parameters do not help the situation. I would greatly appreciate any ideas here for cleaning up this part of the standard. PARAMETERS IN INVOCATION GROUP: Is it better for the algorithm to tell the user how many input bytes it processed after each invocation (within an invocation group) and then have the user pass a reduced input block during the next call, or is the current scheme of passing the whole block each time better? I chose the whole block because I though it would allow the user to more simply place the call in an invocation group loop. It would also assist algorithms such as LZRW1 that wish to hold memory pointers to the input. ------ Below is the revised standard. I hope that it is closer to what people want. I look forward to your comments. Ross Williams ross@spam.ua.oz.au 12-Jul-1991. STREAM TRANSFORMATION ALGORITHM INTERFACE STANDARD ================================================== Author : Ross Williams (ross@spam.ua.oz.au). Date : 12-Jul-1991. CONTENTS -------- 1. INTRODUCTION 1.1 Summary of the Standard 1.2 Motivation for the Standard 1.3 Example Applications 1.4 Benefits of the Standard 1.5 Scope of the Standard 1.6 Portability 2. INTERFACE SPECIFICATION 2.1 Structure of the Interface 2.1.1 Overview 2.1.2 Instances 2.1.3 Statuses 2.2 Procedures 2.2.1 The Identification Procedure alg_inf 2.2.2 The Instance Creation Procedure alg_cre 2.2.3 The Instance Destruction Procedure alg_des 2.2.4 The Reset Procedure alg_res 2.2.5 The Processing Procedure alg_pro 2.3 Parameter String Format 2.4 Information Record 2.5 Data Formats 3. ADMINISTRATION 3.1 Introduction 3.2 Revision History 3.3 Child Standards 3.4 Benchmark Implementations in C 4. EXAMPLE APPLICATIONS 4.1 Compressing a Single Block of Memory 4.2 Process a Stream of Bytes 4.3 Compressing and Encrypting in a Modem 5. GLOSSARY 6. REFERENCES 1. INTRODUCTION --------------- 1.1 Summary of Standard ------------------------ This standard defines an interface for the communication between computer programs and implementations of stream transformation algorithms. Here the term "stream transformation algorithm" is used to mean an algorithm whose sole function is to read a (possibly infinite) stream of input bytes and produce a (possibly infinite) stream of output bytes. The standard constrains each stream transformation algorithm implementation to take the form of a well-defined set of procedures whose parameter lists are tightly defined. The standard is programming language independent, seeking not to control the details of the interface, but rather the structure of the interface and the nature of the information passing through it. 1.2 Motivation for the Standard ------------------------------- Many algorithms that read input and write output can be viewed as stream transformers. So common is this structure in computer programs that the Unix operating system embraces it by providing command language constructs for connecting stream processing programs together into sequences. At a lower programming level, there is no such construct, and use of the higher level construct entails the overhead of creating and destroying a process, an overhead that is likely to be prohibitive for small amounts of data. At the lower programming level there is also a need to keep control of the flow of data. This standard describes a data abstraction (with a procedural interface) that is general enough to allow a variety of stream transformations to be implemented with a uniform interface. The benefit of using the same interface for many stream algorithms is that it will allow stream algorithms to be easily interchanged at edit, compile, or even run time. 1.3 Example Applications ------------------------ This standard is highly applicable to, but not limited to, the following applications: * Data compression. * Cryptography. * Error correction and detection. * General coding (e.g. to text file format for transmission). * Miscellaneous "Unix filter" transformations. 1.4 Benefits of the Standard ---------------------------- Some benefits that will flow from this standard are listed below. * By separating user programs and stream transformers, the standard will yield a high degree of interchangeability of each. * The standard will assist those cataloging and comparing algorithms. * Just by existing, the standard will encourage the designers of new algorithms to provide conforming implementations. * The standard will enable programers to write programs that select from one of many algorithms at run time. 1.5 Scope of the Standard ------------------------- This standard defines a generic procedural interface for stream transformation algorithms. The standard specifies: * A set of interface procedures and the information content of their parameter lists. * The generic behaviour of the procedures (stopping short of specifying the actual transformation performed!). The standard does NOT specify: * The details of the interface (which are necessarily language specific). * The programming language or machine to be used. * The format of the data stream. 1.6 Portability --------------- A goal of this standard is to provide a uniform interface that will allow different stream algorithms to be easily interchanged. However, in striving to support this "portability" of algorithms, other portability issues arise: 1) Portability across machines and machine environments. 2) Portability across programming languages. Machine portability cannot be controlled by this standard and is left up to the implementor who may make as machine portable or as non-machine portable an implementation as desired. Language portability is more difficult as it is usually impossible to write a program that is simultaneously legal in more than one programming language! Because the details of procedure calls, abstraction, and interfacing will vary in each language, this standard does not attempt to define the details of the interface --- it couldn't. Instead it defines the behaviour of procedures and the nature of the information flowing through procedure parameter lists. It is proposed that the interface details be standardized for each programming language in a programming language specific child standard. +---------------------+ | This Language | | Inspecific Standard | +---------------------+ | +----------------------+-----------------------+ | | | +----------------+ +----------------+ +----------------+ | Child Standard | | Child Standard | | Child Standard | | for Language 1 | | for Language 2 | | for Language 3 | +----------------+ +----------------+ +----------------+ / | \ / | \ / | \ Alg1 Alg2 Alg3 Alg1 Alg2 Alg3 Alg1 Alg2 Alg3 Each algorithm can then be implemented in each different language using the specific interface for that language. Algorithms written in a particular language will then be easily interchangeable with other algorithms written in the same language. At the next step up, the conformance of each child standard to the generic standard will ensure that the task of translating an algorithm from one language to another will be simplified. On some systems (e.g. vaxen) that support cross-language procedure call standards, it may be possible to define a machine specific convention to allow implementations of various stream processing algorithms written in different languages to be interchanged at link time or run time. There may also arise particular conventions within this standard for particular classes of stream transformation algorithms. For example, a convention might arise for the method of specifying a private key in cryptography algorithms. 2. INTERFACE SPECIFICATION -------------------------- 2.1 Structure of the Interface ------------------------------ 2.1.1 Overview -------------- Stream transformation algorithms often come in pairs. For example, there may be a pair of data compression algorithms, one to compress the data and the other to decompress the data. This standard treats these two aspects of a single system of compression as separate algorithms. Thus, there will be one defined algorithm which performs compression and one defined algorithm that will perform decompression. A conforming implementation of a stream transformation algorithm may be implemented in any manner, but must present to the user program an interface consisting of five procedures whose parameter lists are tightly defined. These procedures could connect to: * Straight procedures. * An object in an object oriented language. * A process whose interface is cast as procedure calls. * A generator[Marlin80]. A generator implementation would be by far the best, as it has the procedure interface required by the standard, avoids the overheads usually associated with processes, and does not require the algorithm implementer to convert (in fact, "invert") the algorithm from process form to ADT/procedure form. Each interface procedure communicates only through its parameter list. 2.1.2 Instances --------------- This standard allows stream transformation algorithm implementations to allow more than one copy of an algorithm to exist at any one time. These activations are called instances, and the data associated with a particular instance is stored in an instance variable. User program can manipulate instance variables, but will usually manipulate references to instances. Such references are stored in variables of type instance_t. An instance reference may be implemented as: * The instance data itself. * A reference to the instance data. * A reference to a process. * A reference to a generator. * An integer identification number. * A dummy value in the case where there is at most a single instance. (Note: In the case of the integer identification number, the algorithm abstraction must maintain a table that maps between integers and instances). The particular system for referring to instances should be fixed for each programming language and defined in that language's child standard. Perhaps the best system is to use integers (an integer type long enough to hold pointer values if possible) as instance references and let the each individual algorithm determine (out of sight of the interface) how these integers are to be mapped into instances. If the implementation places any limit on the number of instances (e.g. 1 if the implementation uses static data structures for its instance data), then this limit must be advertised in the 'max_instances' field in the information record. 2.1.3 Statuses -------------- A number of things can go wrong during execution of a stream transformation, and it is necessary to provide a channel for status information in the interface. When a call to one of the standard interface procedures is made, the procedure conveys one of the following values upon return. SUCCESS - The procedure call was successful. PARAM_BAD - Specified parameter string illegal. DATA_BAD - Data being processed is bad. MEMORY_OUT - Algorithm has run out of memory. TOO_MANY_INSTANCES - Too many instances to allow creation of a new one. UNKNOWN_INSTANCE - The instance id is unknown. OVERRUN - Algorithm overrun by input data. CANT_FLUSH - Algorithm is buffering non-byte-aligned data. MISC_ERROR - An unspecified error occurred (e.g. assertion failed). Any value except SUCCESS indicates that a error has occurred that is fatal to the execution of the stream transformation algorithm and that the user should not attempt any more operations on that instance. The status channel can be implemented in a variety of ways. Most simply it can be implemented as an integer or enumeration type which can be returned by each procedure (as well as other information). However, the status could also be implemented by other mechanisms such as Ada exceptions. The exact mechanism should be fixed for each programming language by the child standard for each language. 2.2 Procedures -------------- The interface consists of five procedures. The naming of the procedures is up to the child standard for each particular programming language. However, it is suggested that the name of each procedure consist first of the algorithm name followed by one of the following three-letter codes or (if the identifier significance of the language allows it) words. inf - information - Return information about the stream transformer. cre - create - Create an instance of a stream transformer. des - destroy - Destroy an instance of a stream transformer. res - reset - Reset an instance of a stream transformer. pro - process - Process a block of bytes. For example, if the stream algorithm were called "alg", the procedures exported from the "alg" module might be called: alg_inf alg_cre alg_des alg_res alg_pro These names will be used in future examples. The keywords IN, INOUT, and OUT are used below in the Ada sense to describe the flow of information at the procedure interface: IN allows the procedure only to read a parameter. OUT allows the procedure only to write the parameter. INOUT allows the procedure to both read and write the parameter. The fact that the procedure has a particular power (i.e. reading and or writing) over a parameter does not require the procedure to exercise that power. 2.2.1 The Identification Procedure alg_inf ------------------------------------------ The information procedure alg_inf accepts a parameter text string in 'param' containing parameters to the algorithm and returns an information record containing general information about the algorithm as well as more specific information determined by the parameter settings. See the section on the information record for more information. This procedure may be called at any time. alg_inf(IN param : string; OUT id : information_record) begin alg_inf id:=ALG_ID; if legal(param) id.status:=SUCCESS; else id.status:=PARAM_BAD; id.error_string:=""; endif end alg_inf 2.2.2 The Instance Creation Procedure alg_cre --------------------------------------------- The instance creation procedure uses a parameter string passed by the user to create a new, freshly-initialized instance of the algorithm. A reference to the new instance is placed in 'instance'. If a problem arises during creation, 'status' is set to other than SUCCESS and 'instance' remains undefined. alg_cre(IN param : string_t; OUT instance : instance_t; OUT status : status_t); begin alg_cre if not legal(param) then status:=PARAM_BAD; else if then status:=TOO_MANY_INSTANCES; else instance:=new(instance_t); if instance=NULL then status:=MEMORY_OUT; else status:=SUCCESS endif endif endif end alg_cre 2.2.3 The Instance Destruction Procedure alg_des ------------------------------------------------ This procedure destroys a specified instance, if possible destroying and deallocating the instance's data structures. This procedure may be called regardless of the state of the instance (e.g. the instance may be part way through an invocation group at the time of destruction). alg_des(INOUT instance : instance_t; OUT status : status_t); begin alg_des if not known(instance) then status:=UNKNOWN_INSTANCE; else deep_dispose(instance); status:=SUCCESS; endif end alg_des 2.2.4 The Reset Procedure alg_res --------------------------------- This procedure is used to reset an instance back to its initial (just created) state. alg_res is functionally the same as a call to alg_des followed by a call to alg_cre (with the same parameters as the initial alg_cre), but has been included so as to allow algorithm implementations to provide a way to reset instances without using alg_des and alg_cre, (which are likely to contain (possibly) expensive heap deallocation and allocations). This procedure may be called at any time after an alg_cre, regardless of the state of the instance. alg_res(INOUT instance : instance_t; OUT status : status_t); begin alg_res if not known(instance) then status:=UNKNOWN_INSTANCE; else Set instance back to the newly created state; status:=SUCCESS; endif end alg_res 2.2.5 The Processing Procedure alg_pro -------------------------------------- The processing procedure is the heart of the implementation and performs the actual data transformation. The input data stream is processed by dividing it up into input blocks which are fed to the procedure one at a time with the 'instance' variable maintaining the stream algorithm's context between calls to alg_pro. To allow the algorithm to produce an arbitrary amount of output for a given input, but still place a bound on the output it can generate during any one call, a multiple invocation system is used to process each input block. To process a block, the user hands over the block of input data, space for the output data (e.g. a pointer to memory or an array of bytes or something), and specifies a maximum limit to the length of the output data. The procedure then returns with zero or more output bytes. If 'more_output'=false, the algorithm has finished on that block of input data and the user program can then use the instance to process further input data. If 'more_output'=true, the user program must place a similar call to alg_pro to receive further output bytes. Similar calls must be made until alg_pro returns with 'more_output'=false. A series of such calls is called an "invocation group". alg_pro(INOUT instance : instance_t; IN inblock : array of bytes; IN flush : boolean; IN max_out_bytes : natural; OUT outblock : array of bytes; OUT more_output : boolean; OUT status : status_t; OUT status_string : string_t); begin alg_pro end alg_pro INOUT 'instance' is the reference to the instance of the stream transformation algorithm that is to process the data. IN 'inblock' is a block of zero or more bytes which are to be fed to the instance during this invocation group. 'inblock' must have the same value throughout all the calls in a particular invocation group. Conforming algorithms on a particular implementation must be able to process the largest blocks of input that could possibly be handed to them. If an algorithm can process (say) only 64K, then it should be wrapped in a loop before being encapsulated in a conforming procedure. IN 'flush' is a boolean that instructs the algorithm to clear its buffers. By setting this flag, the user program instructs the algorithm to completely finish processing all the input given to the algorithm (including the current input block) by the end of the invocation group. Further semantic details of this flag are left up to each algorithm. In some cases, it may be impossible to flush the input without compromising the operation of the algorithm (e.g. an LZW data decompression algorithm requested to flush at a point when it has only read one of two bytes forming a 16-bit code word). The user program should know better than to cause such a situation to arise, but if it does, the algorithm can return the status CANT_FLUSH. IN 'max_out_bytes' is the maximum number of bytes that the algorithm is allowed to place in the output block during this particular invocation. This value can change from call to call, even within an invocation group. OUT 'outblock' is a block of zero to 'max_out_bytes' bytes which have been produced by the transformation algorithm implementation. The bytes in 'outblock' may have been buffered by the instance and do not necessarily directly correspond in any way to the input bytes in 'inblock'. This parameter can change (e.g. its address or length) within an invocation group. OUT 'more_output' is true after a call to alg_pro iff the conforming algorithm implementation needs to write more output before receiving the next block of input (see general discussion above). The algorithm is allowed to set this output flag to true, even if it has not placed max_out_bytes bytes into the output block. OUT 'status' indicates if anything went wrong during the call. A status other than SUCCESS indicates that an error fatal to the transformation occurred. OUT 'status_string' is an informational string that is always well defined. For success it is the empty string. If an error occurs, this string gives a more detailed description of what happened. It may also be used for other purposes (e.g. for noting errors corrected in an error correction transformation). 2.3 Parameter String Format --------------------------- The parameter string alg_inf contains algorithm parameters (e.g. the code width parameter in Unix compress). A variable-length string format was chosen to provide flexibility in the number, kind, and information content of parameters, to provide a human-readable representation, and to allow for easy transmission through text-only channels. Parameter strings should conform to the following syntax: ::= | ::= ( "+" | "=" ) "(" [ { "," }] ")" ::= "=" """" """" ::= { } ::= "a".."z" | "A".."Z" | "0".."9" ::= { } The semantics of all this syntax should be self evident, except possibly for the "+" and "=" at the beginning of the string. These are to indicate whether the user intends the list given to be a complete list of the parameters of the algorithm or just a partial list. The algorithm should always reject a parameter list beginning with "=" that does not specify a value for every parameter of the algorithm. For parameters that consist of binary fields (e.g. a DES key), the convention is suggested of using two-digit hexadecimal digits in text form. The strings are case sensitive. Example parameter strings: "" ! Use default parameter settings. "+(mem=65536,depth=3)" ! Modify params: memory and depth. "=(key=0A4BF63099A7B4B4)" ! Specify crypto key. Assert only parameter. "=()" ! Assert that algorithm has no parameters. 2.4 Information Record ---------------------- This section defines the information contained in information records returned by the alg_inf procedure. The standard does not define the exact details of the record, which will vary from language to language, but it is expected that each language-specific record will conform quite closely to the "ideal" record specified below. The record is divided into two parts. The first part gives information about the algorithm implementation that is not dependent on the parameters that are passed to the algorithm implementation (e.g. the algorithm's name). The second part gives information that may change with parameters (e.g. the amount of memory required). record /* Part 1: Information that does not depend on parameter values. */ stanver : natural name : string version : string date : string format_in : natural format_out : natural dist_name : string dist_contact : string reference : string mechanism : string usage : string legal : string note : string /* Part 2: Information that does depend on parameter values. */ status : status error_string : string memory_total : natural memory_stack : natural memory_heap : natural deterministic : boolean sizechange : boolean informationloss : boolean max_instances : natural block_sensitive : boolean end record stanver: The version number of the interface standard to which this record conforms. name: The name of the algorithm (e.g. "LZRW1"). version: A string giving the version number of this particular piece of code. date: The date of release of this particular piece of code. The date string must be of length 10 and have the format "yyyy-mm-dd", where yy is a year in the range "1900" and "1999", mm is a month in the range "01".."12" (Jan..Dec), and dd is a day number in the range "01".."31". format_in: A data format number (see later) specifying the format in which the algorithm expects input data to appear. format_out: A data format number (see later) specifying the format of the output data generated by the algorithm. dist_name: The name of the person or organization distributing the code. dist_contact: How to contact the distributor. reference: A reference to any publication describing the algorithm. mechanism: A brief description of the way the algorithm works. usage: Hints on how the algorithm is best used. legal: A message about the legal status of the code and the algorithm. note: Miscellaneous information about the algorithm. The remaining fields are all parameter dependent. That is, their values may vary depending on the parameter string passed to the algorithm. status: This is in fact the status value from the call to alg_inf. It can be one of two values SUCCESS or PARAM_BAD depending on whether the parameter string passed to the call is acceptable to the algorithm. error_string: If status=PARAM_BAD then this field contains a string describing what is wrong with the parameter string. memory_total: An estimate of the maximum total number of bytes that will be used by an instance of the algorithm if created with the specified parameter list. memory_stack: An estimate of the maximum number of bytes of stack that will be used by an instance of the algorithm if created with the specified parameter string. memory_heap: An estimate of the maximum total number of bytes of heap that will be used by an instance of the algorithm if created with the specified parameter string. deterministic: True iff the algorithm can only yield a single output stream value for each possible input stream value (regardless of the size of input blocks). sizechange: True iff the transformation is capable of writing more or less bytes that it reads (at flush points). informationloss: True iff the transformation loses information in the sense that it is a many to one mapping. max_instances: The maximum number of instances of the algorithm that can exist at any one time. This value can be from zero to infinity. A value of zero indicates no limit to the number of instances. A value of one can be used in algorithms that use static data structures. block_sensitive: True iff the way in which the input stream is split up into input blocks will affect the output produced. For example, some data compression algorithms might compress better if handed the input in 16K input blocks instead of 1 byte input blocks. Others will produce the same output regardless of how the input is fed to them. 2.5 Data Formats ---------------- An earlier version of this standard required that each algorithm be given a unique identification code which could be used to tag data processed by the algorithm so that the same algorithm could be used to "unprocess" the data (e.g. compress and decompress). This turned out to be a mistake because (for example) it did not allow data by many data compression algorithms using the same coding format to be decompressed by the same decompressor. In this revised standard, the "to" and "from" aspects of stream processing algorithm pairs have been separated into distinct algorithms. This means that (for example) a data compression algorithm can exist independently from its corresponding decompression algorithm. To allow a degree of independence between algorithms, but still enable the user to match pairs, we focus not on the algorithms but on the format of the data streams that they manipulate. Each algorithm has two fields (in_format and out_format) of its information record that specify the input and output data formats. The format number 0 is reserved as a wildcard and represents any data (or in algorithm pairs "the original data"). Any other number indicates a particular format (such as a compressed data format). Thus for a data compression algorithm consisting of a compression algorithm and a decompression algorithm, with a compressed data format number of 12345, the compression algorithm would have (in=0,out=12345) and the decompressor would have (in=12345,out=0). The user program would see the 0 and realize that raw data can be fed in, and would also be able to match the two algorithms and use them together. By allowing the designer to specify an input and an output format, this system allows complicated networks of transformations to be constructed and used sensibly by the user program. It allows a single algorithm to "decode" the data of dozens of "coder" algorithms, each with its own performance characteristics. A data format number is a 31-bit number whose bits are numbered 31..0. Bit 31 is always zero (so as to allow signed 32-bit integers to be used). Bit 30 is 0 if the format is unparameterized and 1 if the format is parameterized. A parameterized format is one which may be impossible to process further (e.g. decode) without supplying extra information (e.g. most encrypted formats will be parameterized because the decrypting algorithm requires the key). The remaining bits (0-29) carry no semantic meaning but merely serve to distinguish from other formats. Designers of algorithm pairs not using any well-established data format can create a new data format number by tossing a fair coin 30 times, once for each bit of the id. Bit30 should be set depending on whether it is a parameterized format. The tossing of a coin is a requirement of this standard. Under no circumstances is a computer's random number generator to be used to generate an id. In the unlikely event that two algorithm designers have chosen the same number, one of the algorithm's numbers can be changed. 3. ADMINISTRATION ----------------- 3.1 Introduction ---------------- This section contains the administrative details of the standard. It outlines what is expected of those constructing conforming data compression algorithm implementations and records the details of the standard's history. 3.2 Revision History -------------------- This section gives the revision history of this standards document. YYYY-MM-DD Name Action ---------- ---- ------ 1991-06-19 RW Created the first version of the standard. 1991-07-12 RW Created the second version of the standard. 3.3 Child Standards ------------------- As stated earlier, a child standard should be created for each programming language in which implementations are to be written. The child standard need not duplicate the information in this standard. Rather, it should be like a subclass in an object-oriented programming language which adds detail and modifies some information in its superclass. In most cases, the child standard need consist only of a few type declarations and some example procedure declarations (and lots of comments!). 3.4 Benchmark Implementations in C ---------------------------------- Although algorithm designers are free to implement conforming procedures in whatever language they choose, it is hoped that, regardless of the language they choose, that they will also create a version in C conforming to the C child-standard. If everyone who creates a new algorithm does this there will soon be a big pool of highly interchangeable benchmark implementations that users can use to determine quickly the best algorithm/implementation for their needs. The C programming language was chosen as a benchmark language because it is so widely available. Eventually, if enough algorithms are standardized, it may be possible to construct an expert system to help users choose the best algorithms for their application. Users could submit a sample of data to the expert system and say how much memory and time they have during compression and decompression, whether they are prepared to apy patent royalties, and other such information. The expert system could then test various standardized algorithms on the submitted data and determine the best algorithm for the user's needs. 4. EXAMPLE APPLICATIONS ----------------------- The following code fragments indicate how the interface could be used in various stream processing applications. To simplify the presentation, error handling has been omitted and liberties have been taken with the programming notation which draws on self-explanatory features from a variety of programming languages. Comma (",") is used as a concatenation operator. 4.1 Compressing a Single Block of Memory ---------------------------------------- This piece of code uses a conformant implementation of a stream transformation algorithm to perform data compression on a single block of memory. The code might be found inside a random access file compression system which has to compress fixed-length blocks of data. Here the invocation group mechanism works to advantage. By feeding the algorithm equal-sized input and output blocks, and setting flush=true, the algorithm indicates in 'more_output' whether it has expanded the data. The user program uses this information to decide whether to use the output of the compression algorithm, simply to copy the data over. The first byte of the (final) output is used as a flag to indicate which choice was taken. const BLOCK_SIZE : integer = 16*1024; var instance : instance_t; var inblock : array[BLOCK_SIZE] of byte; var outblock : array[BLOCK_SIZE] of byte; var more_output : boolean; var status : status_t; var status_string : string; compress_cre("",status,instance); compress_pro(instance,inblock,flush=>true, BLOCK_SIZE,outblock,more_output,status,status_string); compress_des(instance,status) if more_output then outblock:=0,inblock; -- Copy over. else outblock:=1,outblock; -- Use compressed representation. endif 4.2 Process a Stream of Bytes ----------------------------- This example demonstrates how a simple stream of bytes can be fed through an algorithm. The input and output buffers INSIZE and OUTSIZE can be set to any positive values, including 1 (which might be used in a Unix getc/putc interface). const INSIZE : natural :=1; const OUTSIZE : natural :=1; var instance : instance_t; var inblock : array[INSIZE ] of byte; var outblock : array[OUTSIZE] of byte; var more_output : boolean; var status : status_t; var status_string : status_string_t; open(instream); create(outstream); alg_cre("",status,instance) while not eos(instream) do read(instream,inblock); repeat alg_pro(instance,inblock,flush=>false, OUTSIZ,outblock,more_output,status,status_string); write(outstream,outblock); until not more_output; endwhile inblock=empty; repeat alg_pro(instance,inblock,flush=>true, OUTSIZ,outblock,more_output,status,status_string); write(outstream,outblock); until not more_output; alg_des(instance,status) close(instream) close(outstream) 4.3 Compressing and Encrypting in a Modem ----------------------------------------- The following example demonstrates how the interface might be used inside a modem to perform data compression and encryption. Two stream algorithms are used in this example, one called 'compress' that performs data compression, and one called 'crypt' that performs ciphering. This example shows how the inverted procedure/instance/ADT form used in this interface standard allows what are essentially multiple processes to execute in interleaved fashion without the user program ever losing control of the data flows. More stream processors could be added into this example, by increasing the number of nested repeat loops. const INSIZE : natural := 2*1024; -- Any positive integer. const OUTSIZE : natural := 2*1024; -- Any positive integer. const TIMEOUT : natural := 500; -- 500 miliseconds. var compress : instance_t; var encrypt : instance_t; var inblock : array[INSIZE ] of byte; var outblock : array[OUTSIZE] of byte; var outblock2 : array[OUTSIZE] of byte; var more_output : boolean; var more_output_2 : boolean; var status : status_t; var status_string : status_string_t; var instream : stream_t; var outstream : stream_t; var time_snapshot : time_t; open(instream); create(outstream); compress_cre("",status,compress) crypt_cre("=(Key=SLOTH)",status,crypt) while not eos(instream) do -- Accumulate bytes until we can't hold any more or until timeout occurs. time_snapshot=time(); inblock:=empty; repeat read(instream,inbyte) OR timeout if time()>=time_snapshot+TIMEOUT; if not timeout then inblock:=inblock,inbyte; endif until size(inblock)=INSIZE or timeout; - Process buffered bytes. repeat compress_pro(compress,inblock,flush=>timeout, OUTSIZ,outblock,more_output,status,status_string); repeat crypt_pro(crypt,outblock,flush=>(timeout AND not more_output), OUTSIZ,outblock2,more_output_2,status,status_string); write(outstream,outblock2); until not more_output_2; until not more_output; endwhile compress_des(compress) crypt_des(crypt) close(instream) close(outstream) 5. GLOSSARY ----------- Algorithm - In this document short for "stream transformation algorithm" which is an abstract procedural specification for a transformation of a stream of bytes. It is important to distinguish between an algorithm and an implementation of an algorithm. Algorithm implementation - Code written in a particular programming language that actually implements an algorithm. Byte - In this document, a byte is defined to be eight bits. Memory is measured in bytes. Conforming Implementation - An implementation of a stream transformation algorithm that conforms to this standard. Instance - In this document, an instance refers to an instance of a stream transformation algorithm. The term is often used to refer to the data that is used to store the state of the instance between activations. (Note: The term "instance" is used in the book "Adaptive Data Compression" by Ross Williams to mean an instance of a symbol and the two meanings should not be confused. In this documen the term "byte" is used to refer to the entity referred to in the book as "instance"). Invocation Group - A consecutive sequence of invocations of the "_pro" procedure which collectively process a single block of input. Printable character - A printable character is defined to be a byte in the range [32,126]. In the ASCII character set, this corresponds to the set of printable characters [' '..'~']. Note: On non-ASCII machines this definition still stands --- the byte values are still to be used. To print such a character on a non-ASCII machine will require some kind of conversion. Procedure - A procedure written in a conventional imperative programming language. RW - Ross Williams (ross@spam.ua.oz.au). String - A string is a variable length array of printable characters. User program - Any computer program that uses a conformant implementation of a stream transformation algorithm. 6. REFERENCES ------------- [Marlin80] C.D.Marlin, "Coroutines: A Programming Methodology, a Language Design and an Implementation", Lecture Notes in Computer Science #95, Springer Verlag, 1980. ----