FAST_COPY.68000 =============== Author : Ross N. Williams. Release : 16-Jun-1991. Modified : 25-Jun-1991: Included the header file fast_copy.h. 1. This file contains three sub files: a. A fast memory block copy routine written in 68000 machine code. b. A header file for a. c. A test package for a that is written in C. 2. The block copy routine has been written for speed. The routine examines the relative alignment of the source and destination blocks (byte, word, or longword) and uses the widest of three move instructions (move.b, move.w, or move.l) possible on a 68000. The chosen instruction is then used in an unrolled (by 16) move loop. There is a slightly faster technique (about 13% faster) that uses the MOVEM instruction, but I found out about it after I wrote this code. 3. WARNING: On later versions of the 680x0 family that do lots of sophisticated speed tricks (such as prefetch and caching) the unrolling used in this package may actually slow the code down. I don't know. The simplest way to find out is to try it. Perhaps someone could release a version that takes the processor type into account as well as the block alignment. 4. The code was written under Lightspeed C on a Macintosh. Minor changes may be required to port it to another environment (e.g. expand macros). In particular, I have omitted some header files for the test module (for assertions and portable type definitions and so on) that are probably more easily rewritten for the target environment than copied. The code was never designed for portability. Nevertheless, it should be fairly easy to port. 5. This code is public domain. 6. If you use this code in anger (e.g. in a product) drop me a note at ross@spam.ua.oz.au and I will put you on a mailing list which will be invoked if anyone finds a bug in this code. 7. The internet newsgroup comp.compression might also carry information on this algorithm from time to time. 8. This code was developed as part of the fast data compression algorithm LZRW1. When the algorithm finds that it has expanded the data instead of compressing it, it starts over using a copy operation instead of a compression one. /******************************************************************************/ /* fast_copy.h */ /******************************************************************************/ void fast_copy(void *src_adr, void *dst_adr, unsigned long src_len); /* This function copies a block of memory very quickly. */ /* The exact speed depends on the relative alignment of the blocks of memory. */ /* PRE : 0<=src_len<=(2^32)-1 . */ /* PRE : Source and destination blocks must not overlap. */ /* POST : MEM[dst_adr,dst_adr+src_len-1]=MEM[src_adr,src_adr+src_len-1]. */ /* POST : MEM[dst_adr,dst_adr+src_len-1] is the only memory changed. */ /******************************************************************************/ /* End of fast_copy.h */ /******************************************************************************/ /******************************************************************************/ /* */ /* FAST_COPY.C */ /* */ /******************************************************************************/ /* Author : Ross Williams. */ /* Date : 13-Apr-1990. */ /* */ /* This module contains a function called fast_copy that copies a block of */ /* memory extremely quickly using unrolled loops. The actual speed of the */ /* copy depends on the relative alignment of the source and destination */ /* blocks of memory. If the source and destination blocks are relatively */ /* longword aligned, the copy will go faster than if they are merely */ /* relatively byte aligned. */ /******************************************************************************/ #include "fast_copy.h" /* Just the function prototype. */ void fast_copy(src_adr,dst_adr,src_len) /* This function copies a block of memory very quickly. */ /* The exact speed depends on the relative alignment of the blocks of memory. */ /* PRE : 0<=src_len<=(2^32)-1 . */ /* PRE : Source and destination blocks must not overlap. */ /* POST : MEM[dst_adr,dst_adr+src_len-1]=MEM[src_adr,src_adr+src_len-1]. */ /* POST : MEM[dst_adr,dst_adr+src_len-1] is the only memory changed. */ void *src_adr; void *dst_adr; unsigned long src_len; { asm 68000 { ;Outline of Algorithm ;-------------------- ;1. Copy from 0 to 3 bytes to make A_SRC longword aligned. ;2. Choose byte, word, or longword move depending on alignment of A_DST. ;3. Execute a high speed unrolled loop to move most of the bytes. ;4. Finish off the remainder of the bytes one at a time. ;Register Map ;------------ #define D_LEN d0 ;Number of bytes left to copy. #define D_UNROLL d1 ;Counts unrolled loop body executions. #define D_T1 d2 ;Temporary register. #define A_SRC a0 ;Points to the next source byte. #define A_DST a1 ;Points to the next destination byte. ;Note: Lightspeed C doesn't mind us using a0-a1 and d0-d2. ; So there is no need to save any registers. ;Load key registers with parameters. move.l src_adr, A_SRC move.l dst_adr, A_DST move.l src_len, D_LEN ;Skip to the finishing off loop if there is not even enough ;bytes to allow a longword alignment (i.e. jump if D_LEN<3). cmp.l #3, D_LEN blo @finish_off @longword_align: ;Copy bytes (up to 3) until A_SRC is longword aligned. move.l A_SRC, D_T1 and.l #3, D_T1 beq @choose_copy_size move.b (A_SRC)+,(A_DST)+ sub.l #1, D_LEN bra @longword_align @choose_copy_size: ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN). ;Assert: A_SRC is longword aligned. ;Choose the longest unit move loop based on the alignment of A_DST. move.l A_DST, D_T1 and.l #3, D_T1 beq @copy_by_longword and.l #1, D_T1 beq @copy_by_word bra @copy_by_byte #define COPY_UNROLLED(SIZE,SHIFT,LABEL) \ ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN). \ ;Assert: A_SRC and A_DST are SIZE aligned. \ ;Perform D_LEN/(2^SHIFT) (2^SHIFT)-byte copy operations. \ ;Each (2^SHIFT) bytes is moved as 16 SIZE. \ ;D_UNROLL counts down the number of blocks. \ move.l D_LEN ,D_UNROLL \ lsr.l SHIFT ,D_UNROLL \ beq @finish_off \ ;Assert: D_UNROLL>0 \ move.l D_UNROLL,D_T1 \ lsl.l SHIFT ,D_T1 \ sub.l D_T1 ,D_LEN \ LABEL: \ move.SIZE (A_SRC)+,(A_DST)+ ; 1 \ move.SIZE (A_SRC)+,(A_DST)+ ; 2 \ move.SIZE (A_SRC)+,(A_DST)+ ; 3 \ move.SIZE (A_SRC)+,(A_DST)+ ; 4 \ move.SIZE (A_SRC)+,(A_DST)+ ; 5 \ move.SIZE (A_SRC)+,(A_DST)+ ; 6 \ move.SIZE (A_SRC)+,(A_DST)+ ; 7 \ move.SIZE (A_SRC)+,(A_DST)+ ; 8 \ move.SIZE (A_SRC)+,(A_DST)+ ; 9 \ move.SIZE (A_SRC)+,(A_DST)+ ;10 \ move.SIZE (A_SRC)+,(A_DST)+ ;11 \ move.SIZE (A_SRC)+,(A_DST)+ ;12 \ move.SIZE (A_SRC)+,(A_DST)+ ;13 \ move.SIZE (A_SRC)+,(A_DST)+ ;14 \ move.SIZE (A_SRC)+,(A_DST)+ ;15 \ move.SIZE (A_SRC)+,(A_DST)+ ;16 \ sub.l #1, D_UNROLL \ bne LABEL \ bra @finish_off ;Note: In an earlier version of this module, the faster DBRA instruction ; was used. However, it imposed a maximum of (2^15-1)*16 on src_len ; so I changed it to the sub/bne combination. I suppose I could have ; nested DBRAs but this is simpler and not much slower. @copy_by_longword: COPY_UNROLLED(l,#6,@copy_16_longwords) @copy_by_word: COPY_UNROLLED(w,#5,@copy_16_words) @copy_by_byte: COPY_UNROLLED(b,#4,@copy_16_bytes) @finish_off: ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN). ;Assert: D_LEN<64. ;The following rolled, single-byte copy loop finishes off the copy. tst.l D_LEN beq @finished_copy sub.l #1, D_LEN @copy_1_byte: move.b (A_SRC)+,(A_DST)+ dbra D_LEN ,@copy_1_byte @finished_copy: } /* End of assembly language code. */ } /* End fast_copy */ /******************************************************************************/ /* End of FAST_COPY.C */ /******************************************************************************/ /******************************************************************************/ /* */ /* FAST_COPY_TEST.C */ /* */ /******************************************************************************/ /* Author : Ross Williams. */ /* Date : 13-April-1990. */ /* */ /* This file contains a C program (a main() program) whose sole purpose is to */ /* test the function "fast_copy". The program was developed in the THINK C */ /* environment. */ /* */ /* The technique used is to allocate two slabs of memory. Then a number of */ /* copy operations are performed copying a portion of the first block to */ /* the second. After each copy the copy is checked. In addition to checking */ /* the actual data copied, test blocks are placed at either end of the source */ /* and destination blocks and tested for corruption. Tests are made using */ /* blocks of a variety of alignments and lengths. */ /* */ /******************************************************************************/ /* INCLUDE FILES */ /* ------------- */ #include "port.h" /* Portable definitions. */ #include /* IO stuff. */ #include /* Standard library headers. */ #include /* Contains memory routines. */ #include "assert.h" /* Contains error handling. */ #include "my_malloc.h" /* Contains custom memory routines. */ #include "fast_copy.h" /* Contains the routine under test. */ #define CHECK_LENGTH 300 /* Length of checking blocks. */ #define MAX_LENGTH 80000 /* Maximum length of blocks. */ #define BUFFER_SIZE (MAX_LENGTH+2*CHECK_LENGTH+3) /* Total buffer size. */ /* The 3 allows for alignment. */ UBYTE *p_src_buffer; /* Holds src block (and checkers). */ UBYTE *p_dst_buffer; /* Holds dst block (and checkers). */ UBYTE *p_src_chk_buffer; /* Copy of src block (and checkers). */ UBYTE *p_dst_chk1_buffer; /* Copy of left dest block checker. */ UBYTE *p_dst_chk2_buffer; /* Copy of right dest block checker. */ /******************************************************************************/ void allocate_buffers(void); void perform_test(ULONG length); void perform_aligned_test(ULONG length,UWORD src_align,UWORD dst_align); UBYTE random_byte(void); void fill_random(UBYTE *adr,UWORD len); /******************************************************************************/ main() /* Main program of copy routine test program. */ { ULONG length; UWORD i; printf("Welcome to the fast_copy program Tester\n"); printf("=======================================\n"); printf("Each of the following length tests is tested with\n"); printf("all 16 combinations of alignments.\n"); printf("\n"); allocate_buffers(); /* Test the routine on all small lengths. */ printf("Testing Lengths 0 to 300\n"); printf("------------------------\n"); for (length=0;length<=300;length++) perform_test(length); printf("\n"); /* Test the routine on and around powers of two. */ printf("Testing Powers of Two\n"); printf("---------------------\n"); for (length=256;length=255,"random_byte: RND_MAX is too small."); } result=(rand() & 0xFF); ass((0 <= result) && (result <= 255),"random_byte: Result out of range."); /* printf("Random byte = %u.\n",result); */ return result; } /******************************************************************************/ void fill_random(adr,len) /* Fills the memory at [adr..adr+len-1] with pseudo random bytes. */ UBYTE *adr; UWORD len; { UWORD i; UWORD seed; seed=random_byte(); /* Proper random numbers take too long to generate and we don't really need */ /* them here so we fudge it. */ for (i=0;i