2013-12-13 23 views
9

Ho bisogno di hash su file piuttosto grandi che sono memorizzati su FS distribuiti. Sono in grado di elaborare parti di file con prestazioni molto migliori rispetto all'intero file, quindi mi piacerebbe poter calcolare l'hash su parti e quindi sommarlo.Come ottenere il calcolo distribuito CRC64 (usa la sua proprietà linearità)?

Sto pensando a CRC64 come algoritmo di hash ma non ho idea di come utilizzare la sua proprietà teorica "funzione lineare" in modo da poter sommare CRC su parti di file. Qualche raccomandazione? Qualcosa che ho perso qui?

Note aggiuntive per questo che sto guardando CRC64:

  • posso controllare blocchi di file, ma a causa della applicazione della natura hanno bisogno di avere diverse dimensioni (fino a 1 byte, senza eventuali blocchi fissi sono possibili).
  • Conosco l'implementazione CRC32 (zlib) che include il modo di sommare CRC su parti ma vorrei qualcosa di più ampio. 8 byte sono belli per me.
  • So che CRC è piuttosto veloce. Mi piacerebbe trarne profitto perché il file può essere davvero enorme (fino a pochi GB).

risposta

55

deciso che questo era generalmente utile sufficiente per scrivere e mettere a disposizione:

/* crc64.c -- compute CRC-64 
* Copyright (C) 2013 Mark Adler 
* Version 1.4 16 Dec 2013 Mark Adler 
*/ 

/* 
    This software is provided 'as-is', without any express or implied 
    warranty. In no event will the author be held liable for any damages 
    arising from the use of this software. 

    Permission is granted to anyone to use this software for any purpose, 
    including commercial applications, and to alter it and redistribute it 
    freely, subject to the following restrictions: 

    1. The origin of this software must not be misrepresented; you must not 
    claim that you wrote the original software. If you use this software 
    in a product, an acknowledgment in the product documentation would be 
    appreciated but is not required. 
    2. Altered source versions must be plainly marked as such, and must not be 
    misrepresented as being the original software. 
    3. This notice may not be removed or altered from any source distribution. 

    Mark Adler 
    [email protected] 
*/ 

/* Compute CRC-64 in the manner of xz, using the ECMA-182 polynomial, 
    bit-reversed, with one's complement pre and post processing. Provide a 
    means to combine separately computed CRC-64's. */ 

/* Version history: 
    1.0 13 Dec 2013 First version 
    1.1 13 Dec 2013 Fix comments in test code 
    1.2 14 Dec 2013 Determine endianess at run time 
    1.3 15 Dec 2013 Add eight-byte processing for big endian as well 
        Make use of the pthread library optional 
    1.4 16 Dec 2013 Make once variable volatile for limited thread protection 
*/ 

#include <stdio.h> 
#include <inttypes.h> 
#include <assert.h> 

/* The include of pthread.h below can be commented out in order to not use the 
    pthread library for table initialization. In that case, the initialization 
    will not be thread-safe. That's fine, so long as it can be assured that 
    there is only one thread using crc64(). */ 
#include <pthread.h>   /* link with -lpthread */ 

/* 64-bit CRC polynomial with these coefficients, but reversed: 
    64, 62, 57, 55, 54, 53, 52, 47, 46, 45, 40, 39, 38, 37, 35, 33, 32, 
    31, 29, 27, 24, 23, 22, 21, 19, 17, 13, 12, 10, 9, 7, 4, 1, 0 */ 
#define POLY UINT64_C(0xc96c5795d7870f42) 

/* Tables for CRC calculation -- filled in by initialization functions that are 
    called once. These could be replaced by constant tables generated in the 
    same way. There are two tables, one for each endianess. Since these are 
    static, i.e. local, one should be compiled out of existence if the compiler 
    can evaluate the endianess check in crc64() at compile time. */ 
static uint64_t crc64_little_table[8][256]; 
static uint64_t crc64_big_table[8][256]; 

/* Fill in the CRC-64 constants table. */ 
static void crc64_init(uint64_t table[][256]) 
{ 
    unsigned n, k; 
    uint64_t crc; 

    /* generate CRC-64's for all single byte sequences */ 
    for (n = 0; n < 256; n++) { 
     crc = n; 
     for (k = 0; k < 8; k++) 
      crc = crc & 1 ? POLY^(crc >> 1) : crc >> 1; 
     table[0][n] = crc; 
    } 

    /* generate CRC-64's for those followed by 1 to 7 zeros */ 
    for (n = 0; n < 256; n++) { 
     crc = table[0][n]; 
     for (k = 1; k < 8; k++) { 
      crc = table[0][crc & 0xff]^(crc >> 8); 
      table[k][n] = crc; 
     } 
    } 
} 

/* This function is called once to initialize the CRC-64 table for use on a 
    little-endian architecture. */ 
static void crc64_little_init(void) 
{ 
    crc64_init(crc64_little_table); 
} 

/* Reverse the bytes in a 64-bit word. */ 
static inline uint64_t rev8(uint64_t a) 
{ 
    uint64_t m; 

    m = UINT64_C(0xff00ff00ff00ff); 
    a = ((a >> 8) & m) | (a & m) << 8; 
    m = UINT64_C(0xffff0000ffff); 
    a = ((a >> 16) & m) | (a & m) << 16; 
    return a >> 32 | a << 32; 
} 

/* This function is called once to initialize the CRC-64 table for use on a 
    big-endian architecture. */ 
static void crc64_big_init(void) 
{ 
    unsigned k, n; 

    crc64_init(crc64_big_table); 
    for (k = 0; k < 8; k++) 
     for (n = 0; n < 256; n++) 
      crc64_big_table[k][n] = rev8(crc64_big_table[k][n]); 
} 

/* Run the init() function exactly once. If pthread.h is not included, then 
    this macro will use a simple static state variable for the purpose, which is 
    not thread-safe. The init function must be of the type void init(void). */ 
#ifdef PTHREAD_ONCE_INIT 
# define ONCE(init) \ 
    do { \ 
     static pthread_once_t once = PTHREAD_ONCE_INIT; \ 
     pthread_once(&once, init); \ 
    } while (0) 
#else 
# define ONCE(init) \ 
    do { \ 
     static volatile int once = 1; \ 
     if (once) { \ 
      if (once++ == 1) { \ 
       init(); \ 
       once = 0; \ 
      } \ 
      else \ 
       while (once) \ 
        ; \ 
     } \ 
    } while (0) 
#endif 

/* Calculate a CRC-64 eight bytes at a time on a little-endian architecture. */ 
static inline uint64_t crc64_little(uint64_t crc, void *buf, size_t len) 
{ 
    unsigned char *next = buf; 

    ONCE(crc64_little_init); 
    crc = ~crc; 
    while (len && ((uintptr_t)next & 7) != 0) { 
     crc = crc64_little_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    while (len >= 8) { 
     crc ^= *(uint64_t *)next; 
     crc = crc64_little_table[7][crc & 0xff]^
       crc64_little_table[6][(crc >> 8) & 0xff]^
       crc64_little_table[5][(crc >> 16) & 0xff]^
       crc64_little_table[4][(crc >> 24) & 0xff]^
       crc64_little_table[3][(crc >> 32) & 0xff]^
       crc64_little_table[2][(crc >> 40) & 0xff]^
       crc64_little_table[1][(crc >> 48) & 0xff]^
       crc64_little_table[0][crc >> 56]; 
     next += 8; 
     len -= 8; 
    } 
    while (len) { 
     crc = crc64_little_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    return ~crc; 
} 

/* Calculate a CRC-64 eight bytes at a time on a big-endian architecture. */ 
static inline uint64_t crc64_big(uint64_t crc, void *buf, size_t len) 
{ 
    unsigned char *next = buf; 

    ONCE(crc64_big_init); 
    crc = ~rev8(crc); 
    while (len && ((uintptr_t)next & 7) != 0) { 
     crc = crc64_big_table[0][(crc >> 56)^*next++]^(crc << 8); 
     len--; 
    } 
    while (len >= 8) { 
     crc ^= *(uint64_t *)next; 
     crc = crc64_big_table[0][crc & 0xff]^
       crc64_big_table[1][(crc >> 8) & 0xff]^
       crc64_big_table[2][(crc >> 16) & 0xff]^
       crc64_big_table[3][(crc >> 24) & 0xff]^
       crc64_big_table[4][(crc >> 32) & 0xff]^
       crc64_big_table[5][(crc >> 40) & 0xff]^
       crc64_big_table[6][(crc >> 48) & 0xff]^
       crc64_big_table[7][crc >> 56]; 
     next += 8; 
     len -= 8; 
    } 
    while (len) { 
     crc = crc64_big_table[0][(crc >> 56)^*next++]^(crc << 8); 
     len--; 
    } 
    return ~rev8(crc); 
} 

/* Return the CRC-64 of buf[0..len-1] with initial crc, processing eight bytes 
    at a time. This selects one of two routines depending on the endianess of 
    the architecture. A good optimizing compiler will determine the endianess 
    at compile time if it can, and get rid of the unused code and table. If the 
    endianess can be changed at run time, then this code will handle that as 
    well, initializing and using two tables, if called upon to do so. */ 
uint64_t crc64(uint64_t crc, void *buf, size_t len) 
{ 
    uint64_t n = 1; 

    return *(char *)&n ? crc64_little(crc, buf, len) : 
         crc64_big(crc, buf, len); 
} 

#define GF2_DIM 64  /* dimension of GF(2) vectors (length of CRC) */ 

static uint64_t gf2_matrix_times(uint64_t *mat, uint64_t vec) 
{ 
    uint64_t sum; 

    sum = 0; 
    while (vec) { 
     if (vec & 1) 
      sum ^= *mat; 
     vec >>= 1; 
     mat++; 
    } 
    return sum; 
} 

static void gf2_matrix_square(uint64_t *square, uint64_t *mat) 
{ 
    unsigned n; 

    for (n = 0; n < GF2_DIM; n++) 
     square[n] = gf2_matrix_times(mat, mat[n]); 
} 

/* Return the CRC-64 of two sequential blocks, where crc1 is the CRC-64 of the 
    first block, crc2 is the CRC-64 of the second block, and len2 is the length 
    of the second block. */ 
uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2) 
{ 
    unsigned n; 
    uint64_t row; 
    uint64_t even[GF2_DIM];  /* even-power-of-two zeros operator */ 
    uint64_t odd[GF2_DIM];  /* odd-power-of-two zeros operator */ 

    /* degenerate case */ 
    if (len2 == 0) 
     return crc1; 

    /* put operator for one zero bit in odd */ 
    odd[0] = POLY;    /* CRC-64 polynomial */ 
    row = 1; 
    for (n = 1; n < GF2_DIM; n++) { 
     odd[n] = row; 
     row <<= 1; 
    } 

    /* put operator for two zero bits in even */ 
    gf2_matrix_square(even, odd); 

    /* put operator for four zero bits in odd */ 
    gf2_matrix_square(odd, even); 

    /* apply len2 zeros to crc1 (first square will put the operator for one 
     zero byte, eight zero bits, in even) */ 
    do { 
     /* apply zeros operator for this bit of len2 */ 
     gf2_matrix_square(even, odd); 
     if (len2 & 1) 
      crc1 = gf2_matrix_times(even, crc1); 
     len2 >>= 1; 

     /* if no more bits set, then done */ 
     if (len2 == 0) 
      break; 

     /* another iteration of the loop with odd and even swapped */ 
     gf2_matrix_square(odd, even); 
     if (len2 & 1) 
      crc1 = gf2_matrix_times(odd, crc1); 
     len2 >>= 1; 

     /* if no more bits set, then done */ 
    } while (len2 != 0); 

    /* return combined crc */ 
    crc1 ^= crc2; 
    return crc1; 
} 

/* Test crc64() on vector[0..len-1] which should have CRC-64 crc. Also test 
    crc64_combine() on vector[] split in two. */ 
static void crc64_test(void *vector, size_t len, uint64_t crc) 
{ 
    uint64_t crc1, crc2; 

    /* test crc64() */ 
    crc1 = crc64(0, vector, len); 
    if (crc1^crc) 
     printf("mismatch: %" PRIx64 ", should be %" PRIx64 "\n", crc1, crc); 

    /* test crc64_combine() */ 
    crc1 = crc64(0, vector, (len + 1) >> 1); 
    crc2 = crc64(0, vector + ((len + 1) >> 1), len >> 1); 
    crc1 = crc64_combine(crc1, crc2, len >> 1); 
    if (crc1^crc) 
     printf("mismatch: %" PRIx64 ", should be %" PRIx64 "\n", crc1, crc); 
} 

/* Test vectors. */ 
#define TEST1 "123456789" 
#define TESTLEN1 9 
#define TESTCRC1 UINT64_C(0x995dc9bbdf1939fa) 
#define TEST2 "This is a test of the emergency broadcast system." 
#define TESTLEN2 49 
#define TESTCRC2 UINT64_C(0x27db187fc15bbc72) 

int main(void) 
{ 
    crc64_test(TEST1, TESTLEN1, TESTCRC1); 
    crc64_test(TEST2, TESTLEN2, TESTCRC2); 
    return 0; 
} 
+0

Perché il polinomio invertito? È a causa del modo in cui vengono generate le tabelle? O è per nessuna ragione specifica? – Mecki

+0

È possibile definire un CRC specifico da invertire o meno. Questo è definito per essere invertito, quindi è necessario generarlo in questo modo. I CRC storicamente invertiti sono più comuni, poiché la loro implementazione bit-bit e le implementazioni in termini di byte nel software sono un po 'più semplici e quindi un po' più veloci. –

+0

È possibile creare una funzione CRC per supportare qualsiasi polinomio CRC, indipendentemente dalla lunghezza? Non sono sicuro che sia possibile, e immagino di non capire come funziona abbastanza bene. – MarcusJ

3

OK, il mio contributo a questo. Portato su Java.

  • Non riesco a vincere da blocchi di 8 byte senza fare cose non sicure, quindi ho rimosso il calcolo dei blocchi.
  • Rimango con polynom ECMA - ISO uno sembra troppo trasparente come per me.
  • Ovviamente nella versione finale sposterò il codice di test sotto JUnit.

ecco il codice:

package com.test; 

import java.util.Arrays; 

/** 
* CRC-64 implementation with ability to combine checksums calculated over different blocks of data. 
**/ 
public class CRC64 { 

    private final static long POLY = (long) 0xc96c5795d7870f42L; // ECMA-182 

    /* CRC64 calculation table. */ 
    private final static long[] table; 

    /* Current CRC value. */ 
    private long value; 

    static { 
     table = new long[256]; 

     for (int n = 0; n < 256; n++) { 
      long crc = n; 
      for (int k = 0; k < 8; k++) { 
       if ((crc & 1) == 1) { 
        crc = (crc >>> 1)^POLY; 
       } else { 
        crc = (crc >>> 1); 
       } 
      } 
      table[n] = crc; 
     } 
    } 

    public CRC64() { 
     this.value = 0; 
    } 

    public CRC64(long value) { 
     this.value = value; 
    } 

    public CRC64(byte [] b, int len) { 
     this.value = 0; 
     update(b, len); 
    } 

    /** 
    * Construct new CRC64 instance from byte array. 
    **/ 
    public static CRC64 fromBytes(byte [] b) { 
     long l = 0; 
     for (int i = 0; i < 4; i++) { 
      l <<= 8; 
      l ^= (long) b[i] & 0xFF; 
     } 
     return new CRC64(l); 
    } 

    /** 
    * Get 8 byte representation of current CRC64 value. 
    **/ 
    public byte[] getBytes() { 
     byte [] b = new byte[8]; 
     for (int i = 0; i < 8; i++) { 
      b[7 - i] = (byte) (this.value >>> (i * 8)); 
     } 
     return b; 
    } 

    /** 
    * Get long representation of current CRC64 value. 
    **/ 
    public long getValue() { 
     return this.value; 
    } 

    /** 
    * Update CRC64 with new byte block. 
    **/ 
    public void update(byte [] b, int len) { 

     int idx = 0; 
     this.value = ~this.value; 
     while (len > 0) { 
      this.value = table[((int) (this.value^b[idx])) & 0xff]^(this.value >>> 8); 
      idx++; 
      len--; 
     } 
     this.value = ~this.value; 
    } 

    private static final int GF2_DIM = 64; /* dimension of GF(2) vectors (length of CRC) */ 

    private static long gf2MatrixTimes(long [] mat, long vec) 
    { 
     long sum = 0; 
     int idx = 0; 
     while (vec != 0) { 
      if ((vec & 1) == 1) 
       sum ^= mat[idx]; 
      vec >>>= 1; 
      idx++; 
     } 
     return sum; 
    } 

    private static void gf2MatrixSquare(long [] square, long [] mat) 
    { 
     for (int n = 0; n < GF2_DIM; n++) 
      square[n] = gf2MatrixTimes(mat, mat[n]); 
    } 

    /* 
    * Return the CRC-64 of two sequential blocks, where summ1 is the CRC-64 of the 
    * first block, summ2 is the CRC-64 of the second block, and len2 is the length 
    * of the second block. 
    */ 
    static public CRC64 combine(CRC64 summ1, CRC64 summ2, long len2) 
    { 
     // degenerate case. 
     if (len2 == 0) 
      return new CRC64(summ1.getValue()); 

     int n; 
     long row; 
     long [] even = new long[GF2_DIM]; // even-power-of-two zeros operator 
     long [] odd = new long[GF2_DIM]; // odd-power-of-two zeros operator 

     // put operator for one zero bit in odd 
     odd[0] = POLY;  // CRC-64 polynomial 

     row = 1; 
     for (n = 1; n < GF2_DIM; n++) { 
      odd[n] = row; 
      row <<= 1; 
     } 

     // put operator for two zero bits in even 
     gf2MatrixSquare(even, odd); 

     // put operator for four zero bits in odd 
     gf2MatrixSquare(odd, even); 

     // apply len2 zeros to crc1 (first square will put the operator for one 
     // zero byte, eight zero bits, in even) 
     long crc1 = summ1.getValue(); 
     long crc2 = summ2.getValue(); 
     do { 
      // apply zeros operator for this bit of len2 
      gf2MatrixSquare(even, odd); 
      if ((len2 & 1) == 1) 
       crc1 = gf2MatrixTimes(even, crc1); 
      len2 >>>= 1; 

      // if no more bits set, then done 
      if (len2 == 0) 
       break; 

      // another iteration of the loop with odd and even swapped 
      gf2MatrixSquare(odd, even); 
      if ((len2 & 1) == 1) 
       crc1 = gf2MatrixTimes(odd, crc1); 
      len2 >>>= 1; 

      // if no more bits set, then done 
     } while (len2 != 0); 

     // return combined crc. 
     crc1 ^= crc2; 
     return new CRC64(crc1); 
    } 

    private static void test(byte [] b, int len, long crcValue) throws Exception { 

     /* Test CRC64 default calculation. */ 
     CRC64 crc = new CRC64(b, len); 
     if (crc.getValue() != crcValue) { 
      throw new Exception("mismatch: " + String.format("%016x", crc.getValue()) 
       + " should be " + String.format("%016x", crcValue)); 
     } 

     /* test combine() */ 
     CRC64 crc1 = new CRC64(b, (len + 1) >>> 1); 
     CRC64 crc2 = new CRC64(Arrays.copyOfRange(b, (len + 1) >>> 1, b.length), len >>> 1); 
     crc = CRC64.combine(crc1, crc2, len >>> 1); 

     if (crc.getValue() != crcValue) { 
      throw new Exception("mismatch: " + String.format("%016x", crc.getValue()) 
       + " should be " + String.format("%016x", crcValue)); 
     } 
    } 

    public static void main(String [] args) throws Exception { 

     final byte[] TEST1 = "123456789".getBytes(); 
     final int TESTLEN1 = 9; 
     final long TESTCRC1 = 0x995dc9bbdf1939faL; // ECMA. 
     test(TEST1, TESTLEN1, TESTCRC1); 

     final byte[] TEST2 = "This is a test of the emergency broadcast system.".getBytes(); 
     final int TESTLEN2 = 49; 
     final long TESTCRC2 = 0x27db187fc15bbc72L; // ECMA. 
     test(TEST2, TESTLEN2, TESTCRC2); 

     final byte[] TEST3 = "IHATEMATH".getBytes(); 
     final int TESTLEN3 = 9; 
     final long TESTCRC3 = 0x3920e0f66b6ee0c8L; // ECMA. 
     test(TEST3, TESTLEN3, TESTCRC3); 
    } 
} 
+1

Mantieni inversioni pre e post. Senza di essi, una stringa di zeri ha sempre il CRC di zero, indipendentemente dalla lunghezza. Con loro, il CRC degli zeri non è zero e dipende da quanti zeri. –

+0

E dovresti amare la matematica. –

+0

Grazie, vengo alla stessa conclusione dopo ulteriori indagini. –