Implementing a Recursive Hash Algorithm
Keywords: algorithm, hash, recursion, implementation | Updated: 2023-09-27 18:16:55
Suppose the bytes of file A are:

2 5 8 0 33 90 1 3 200 201 23 12 55

I have a simple hash algorithm that stores the sum of every three consecutive bytes, so:

2, 5, 8      -> 2+5+8     = 15
0, 33, 90    -> 0+33+90   = 123
1, 3, 200    -> 1+3+200   = 204
201, 23, 12  -> 201+23+12 = 236

So I can represent file A as 15, 123, 204, 236.
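A minimal C# sketch of this chunked sum (the class and method names are mine, just for illustration):

using System;
using System.Collections.Generic;

class ChunkSums
{
    // Sum every chunk of `chunkSize` consecutive bytes; leftover bytes that
    // don't fill a whole chunk are ignored, matching the example above.
    static List<long> ChunkHashes(byte[] data, int chunkSize)
    {
        var hashes = new List<long>();
        for (int start = 0; start + chunkSize <= data.Length; start += chunkSize)
        {
            long sum = 0;
            for (int i = start; i < start + chunkSize; i++)
                sum += data[i];
            hashes.Add(sum);
        }
        return hashes;
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 };
        // prints 15, 123, 204, 236 (the trailing 55 never fills a chunk)
        Console.WriteLine(string.Join(", ", ChunkHashes(fileA, 3)));
    }
}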
Suppose I copy that file to a new computer B and make a few small modifications; the bytes of file B are:

255 2 5 8 0 33 90 1 3 200 201 23 12 55 255 255

Note that the difference is one extra byte at the beginning of the file and two extra bytes at the end, but the rest is very similar.
So I can run the same algorithm to determine whether parts of the file are the same. Remember that file A is represented by the hashes 15, 123, 204, 236; let's see whether file B gives me some of those hashes!

In file B I have to compute this for every 3 consecutive bytes:
int[] sums; // array where we will hold the running sums
255  sums[0]  = 255
2    sums[1]  = 2   + sums[0]  = 257
5    sums[2]  = 5   + sums[1]  = 262
8    sums[3]  = 8   + sums[2]  = 270    hash = sums[3]  - sums[0]  = 15   --> MATCHES FILE A!
0    sums[4]  = 0   + sums[3]  = 270    hash = sums[4]  - sums[1]  = 13
33   sums[5]  = 33  + sums[4]  = 303    hash = sums[5]  - sums[2]  = 41
90   sums[6]  = 90  + sums[5]  = 393    hash = sums[6]  - sums[3]  = 123  --> MATCHES FILE A!
1    sums[7]  = 1   + sums[6]  = 394    hash = sums[7]  - sums[4]  = 124
3    sums[8]  = 3   + sums[7]  = 397    hash = sums[8]  - sums[5]  = 94
200  sums[9]  = 200 + sums[8]  = 597    hash = sums[9]  - sums[6]  = 204  --> MATCHES FILE A!
201  sums[10] = 201 + sums[9]  = 798    hash = sums[10] - sums[7]  = 404
23   sums[11] = 23  + sums[10] = 821    hash = sums[11] - sums[8]  = 424
12   sums[12] = 12  + sums[11] = 833    hash = sums[12] - sums[9]  = 236  --> MATCHES FILE A!
55   sums[13] = 55  + sums[12] = 888    hash = sums[13] - sums[10] = 90
255  sums[14] = 255 + sums[13] = 1143   hash = sums[14] - sums[11] = 322
255  sums[15] = 255 + sums[14] = 1398   hash = sums[15] - sums[12] = 565
So from this table I know that file B contains file A plus some additional bytes, because the hashes match.

The reason I show this algorithm is that it is of order n; in other words, I can compute the hash of the last 3 consecutive bytes without having to iterate over them!

If I used a more sophisticated algorithm, such as an MD5 of the last m bytes (m = 3 here), the scan would be of order n·m, because as I walk through file B I would need an inner for loop to hash the last m bytes at every position.
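For illustration, the prefix-sum trick from the walkthrough above fits in a few lines of C# (a sketch; the array literal just repeats the example):

using System;

class PrefixSumHashes
{
    static void Main()
    {
        byte[] fileB = { 255, 2, 5, 8, 0, 33, 90, 1,
                         3, 200, 201, 23, 12, 55, 255, 255 };
        const int window = 3;

        long[] sums = new long[fileB.Length]; // sums[i] = fileB[0] + ... + fileB[i]
        long total = 0;
        for (int i = 0; i < fileB.Length; i++)
        {
            total += fileB[i];
            sums[i] = total;
            // the sum of the last `window` bytes falls out of two prefix sums
            // in O(1); no inner loop over the window is needed
            // (the very first window's sum is simply sums[window - 1])
            if (i >= window)
                Console.WriteLine("hash ending at " + i + ": " + (sums[i] - sums[i - window]));
        }
    }
}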
My question is:

How can I improve the algorithm so that it stays of order n, i.e. each hash is computed only once? If I used an existing hash algorithm such as MD5, I would have to put an inner loop into the algorithm, which would significantly increase its order.

Note that the same thing could be done with multiplication instead of addition, but the counter grows very quickly. Maybe I could combine multiplication, addition and subtraction...
Edit

If I search for:

recursive hash functions

a lot of information comes up, and I find those algorithms hard to understand... I have to implement this for a project, which is why I am reinventing the wheel... I know there are plenty of algorithms out there.

Another solution I am considering is to run the same algorithm plus a second, stronger one. So on file A I would run this algorithm over every 3 bytes, plus an MD5 of every 3 bytes. On the second file, whenever the first algorithm matches, I would then run the second algorithm...
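One way to sketch that two-level check in C# (the method and structure are hypothetical; only the strong-hash confirmation is shown): the cheap rolling sum selects candidate offsets, and only those pay for an MD5 over the window.

using System;
using System.Linq;
using System.Security.Cryptography;

class TwoLevelCheck
{
    // The cheap rolling sum is only a filter; a window whose sum matches a
    // known chunk hash is then confirmed with a strong hash (MD5 here).
    static bool ConfirmChunk(byte[] data, int offset, int count, byte[] expectedMd5)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(data, offset, count).SequenceEqual(expectedMd5);
        }
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8 };
        byte[] fileB = { 255, 2, 5, 8, 0 };
        byte[] expected;
        using (var md5 = MD5.Create())
            expected = md5.ComputeHash(fileA); // strong hash of file A's chunk
        // the rolling sum matched at offset 1 (2+5+8 = 15); confirm with MD5
        Console.WriteLine(ConfirmChunk(fileB, 1, 3, expected)); // True
    }
}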
The more I think about what you mean by "recursive", the more I doubt that the solution I proposed earlier is the useful thing you should implement.

You probably want to implement a hash tree algorithm, which is a recursive operation.

To do this you hash the whole list, split the list in two, and recurse into the two sublists. Terminate when the list is of size 1, or of the minimum desired hash size, because each level of recursion doubles the size of the total hash output.

Pseudocode:
create-hash-tree(input list, minimum size: default = 1):
    initialize the output list
    hash-sublist(input list, output list, minimum size)
    return output list

hash-sublist(input list, output list, minimum size):
    add sum-based-hash(list) result to output list   // easily swap hash styles here
    if size(input list) > minimum size:
        split the list into two halves
        hash-sublist(first half of list, output list, minimum size)
        hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
    initialize the running total to 0
    for each item in the list:
        add the current item to the running total
    return the running total
I believe the running time of the whole algorithm is O(hash(m)), where m = n * (log(n) + 1) and hash(m) is usually linear time.

The storage is roughly O(hash * s), where s = 2n - 1 and the hash is usually of constant size.

Note that for C# I would make the output list a List<HashType>, but I would make the input list an IEnumerable<ItemType> to save storage, and use Linq to "split" the list quickly without allocating two new sublists.
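Here is a minimal C# sketch of that pseudocode (names are mine). Instead of Linq's Skip/Take I pass index ranges into the recursion, which serves the same goal of not allocating two new sublists:

using System;
using System.Collections.Generic;

static class HashTree
{
    // one hash for the whole list, then recurse into the two halves,
    // stopping once a sublist is no bigger than minimumSize
    public static List<long> Create(byte[] input, int minimumSize = 1)
    {
        var output = new List<long>();
        HashSublist(input, 0, input.Length, output, minimumSize);
        return output;
    }

    static void HashSublist(byte[] input, int start, int count,
                            List<long> output, int minimumSize)
    {
        output.Add(SumBasedHash(input, start, count)); // easily swap hash styles here
        if (count > minimumSize)
        {
            int half = count / 2;
            HashSublist(input, start, half, output, minimumSize);
            HashSublist(input, start + half, count - half, output, minimumSize);
        }
    }

    // the simple sum hash from the question; any stronger hash drops in here
    static long SumBasedHash(byte[] input, int start, int count)
    {
        long total = 0;
        for (int i = start; i < start + count; i++)
            total += input[i];
        return total;
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 };
        Console.WriteLine(string.Join(", ", Create(fileA))); // root hash first
    }
}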
I think you can get this down to O(n + m) execution time, where n is the size of the list and m is the size of the running window, with m < n (otherwise all the sums are equal).

The memory consumption will be the stack size plus m size of temporary storage.

To do this, use a double-ended queue and a running total. Push newly encountered values onto the back of the deque while adding them to the running total, and once the deque reaches size m, pop values off the front and subtract them from the running total.

Here is some pseudocode:
initialize the running total to 0

for each item in the list:
    add the current item to the running total
    push the current value onto the end of the deque
    if deque.length > m:
        pop off the front of the deque
        subtract the popped value from the running total
    assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
    add the item in the list at the current index to the running total
    pop off the front of the deque
    subtract the popped value from the running total
    assign the running total to the current sum slot in the list
    increment the index
This is iterative rather than recursive. A run of this algorithm looks like this (for m = 4):
value  sum slot  overwritten sum slot
2      2         92
5      7         74
8      15        70
0      15        15
33     46
90     131
1      124
3      127
200    294
201    405
23     427
12     436
55     291
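A runnable C# sketch of that approach (class and method names are mine; a Queue<byte> suffices because values only enter at the back and leave at the front). With m = 4 it reproduces the table above, including the wrapped-around slots:

using System;
using System.Collections.Generic;

class DequeRollingSum
{
    // overwrite sums[i] with the total of the m values ending at i,
    // wrapping around to the front of the list for the last windows
    static long[] RollingSums(byte[] values, int m)
    {
        var sums = new long[values.Length];
        var window = new Queue<byte>();
        long total = 0;
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
            window.Enqueue(values[i]);
            if (window.Count > m)
                total -= window.Dequeue(); // the value that just left the window
            sums[i] = total;
        }
        // second pass: drain the window by wrapping around to the start
        int index = 0;
        while (window.Count > 0)
        {
            total += values[index];
            total -= window.Dequeue();
            sums[index] = total;
            index++;
        }
        return sums;
    }

    static void Main()
    {
        byte[] values = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 };
        // prints 92, 74, 70, 15, 46, 131, 124, 127, 294, 405, 427, 436, 291
        Console.WriteLine(string.Join(", ", RollingSums(values, 4)));
    }
}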
With indexing you can drop the queue and avoid reassigning any slots: start by taking the sum of the last m values, then use an index offset instead of popping from a queue, e.g. array[i - m].

This won't reduce your execution time, since you still have two loops, one building the running count and one filling in all the values. But it does reduce your memory use to stack space only (effectively O(1)).

Here is some pseudocode:
initialize the running total to 0
for the last m items in the list:
    add those items to the running total

for each item in the list:
    add the current item to the running total
    subtract the value of the item m slots earlier from the running total
    assign the running total to the current sum slot in the list
The "m slots earlier" part is the tricky bit. You can split it into two loops:

- one that indexes from the end of the list, minus m, plus i
- one that indexes from i - m

or you can handle the case i - m < 0 with modular arithmetic:

int valueToSubtract = array[(i - m) % n];
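One caveat: in C# the % operator keeps the sign of its left operand, so (i - m) % n is negative whenever i < m and would throw an IndexOutOfRangeException. A safe variant of that line (same names as above):

int valueToSubtract = array[((i - m) % n + n) % n]; // wraps negative offsets back into [0, n)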
http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm uses an updatable hash function, which it calls a http://en.wikipedia.org/wiki/Rolling_hash. This is much easier to compute than MD5/SHA and is probably no worse for this purpose.

You can prove something about it: it is a polynomial of degree d in a chosen constant a. Suppose someone supplies two pieces of text and you choose a at random. What is the probability of a collision? Well, if the hash values are equal, subtracting them gives a polynomial that has a as a root. Since a nonzero polynomial of degree d has at most d roots and a was chosen at random, the probability is at most d/modulus, which is very small for a large modulus.

Of course MD5/SHA are secure, but see http://cr.yp.to/mac/poly1305-20050329.pdf for a secure variant.
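As an illustration, here is a minimal C# sketch of such a polynomial rolling hash; the base a and modulus q below are my own arbitrary choices, not values prescribed by the algorithm:

using System;

class PolynomialRollingHash
{
    static void Main()
    {
        const long a = 256;        // base of the polynomial
        const long q = 1000000007; // a large prime modulus
        const int m = 3;           // window size
        byte[] data = { 255, 2, 5, 8, 0, 33, 90, 1,
                        3, 200, 201, 23, 12, 55, 255, 255 };

        long aPowM = 1; // a^m mod q, the coefficient of the outgoing byte
        for (int i = 0; i < m; i++)
            aPowM = aPowM * a % q;

        long hash = 0;
        for (int i = 0; i < data.Length; i++)
        {
            hash = (hash * a + data[i]) % q;  // shift the window, bring in data[i]
            if (i >= m)                       // drop the byte that left the window
                hash = ((hash - data[i - m] * aPowM) % q + q) % q;
            if (i >= m - 1)
                Console.WriteLine("hash of bytes " + (i - m + 1) + ".." + i + ": " + hash);
        }
    }
}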
This is what I have so far. I am only missing the steps that shouldn't take any time, such as comparing the hash arrays and opening a file for reading.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace RecursiveHashing
{
static class Utilities
{
// used for circular arrays. If my circular array is of size 5 and its
// current position is 2, then shifting 3 units to the left should put me
// at index 4 of the array.
public static int Shift(this int number, int shift, int divisor)
{
var tempa = (number + shift) % divisor;
if (tempa < 0)
tempa = divisor + tempa;
return tempa;
}
}
class Program
{
const int CHUNCK_SIZE = 4; // split the files into chunks of 4 bytes
/*
 * formula that I will use to compute the hash
 *
 * formula = sum(chunk) * (a[c]+1)*(a[c-1]+1)*(a[c-2]+1)*(-1^a[c])
 *
 * where:
 * sum(chunk) = sum of the current chunk
 * a[c]       = current byte
 * a[c-1]     = previous byte
 * a[c-2]     = the byte before that
 * -1^a[c]    = either -1 or +1
 *
 * this formula is efficient because I can get the sum of any current chunk
 * by keeping track of the overall running sum, so the algorithm should be
 * of order n
 */
static void Main(string[] args)
{
Part1(); // Missing implementation to open file for reading
Part2();
}
// first part: compute hashes for the first file
static void Part1()
{
// pretend the file read produced these bytes
byte[] FileB = new byte[]{2,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11,};
// create an array in which to store the hashes
// column 0 holds the fast hash; column 1 will hold a more secure hash
Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];
// used to track on what index of the file we are at
int counter = 0;
byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array needed to remember the last few bytes
UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array needed to remember the last sums
int index = 0; // current position in the circular arrays
int numberOfHashes = 0; // number of hashes created so far
while (counter < FileB.Length)
{
int i = 0;
for (; i < CHUNCK_SIZE; i++)
{
if (counter == 0)
{
sum[index] = FileB[counter];
}
else
{
sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
}
current[index] = FileB[counter];
counter++;
if (counter % CHUNCK_SIZE == 0 || counter == FileB.Length)
{
// get the sum of the last chunk
var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
Int64 tempHash = (Int64)a;
// compute my hash function
tempHash = tempHash * ((Int64)current[index] + 1)
* ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
* ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
* (Int64)(Math.Pow(-1, current[index]));
// add the hashes to the array (both hashes belong to the same row)
hashes[numberOfHashes, 0] = tempHash;
hashes[numberOfHashes, 1] = -1; // later store a stronger hash here
numberOfHashes++;
// MISSING IMPLEMENTATION TO STORE A SECOND STRONGER HASH FUNCTION
if (counter == FileB.Length)
break;
}
index++;
index = index % (CHUNCK_SIZE + 1); // wrap around to the start of the circular array
}
}
}
static void Part2()
{
// simulate file read of a similar file
byte[] FileB = new byte[]{1,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11};
// place where we will place all matching hashes
Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];
int counter = 0;
byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array
UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array
int index = 0; // current position in the circular arrays
while (counter < FileB.Length)
{
int i = 0;
for (; i < CHUNCK_SIZE; i++)
{
if (counter == 0)
{
sum[index] = FileB[counter];
}
else
{
sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
}
current[index] = FileB[counter];
counter++;
// here we compute the hash at every position; the implementation that
// checks whether the hash appears in the other file is still missing
if (counter >= CHUNCK_SIZE)
{
var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
Int64 tempHash = (Int64)a;
tempHash = tempHash * ((Int64)current[index] + 1)
* ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
* ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
* (Int64)(Math.Pow(-1, current[index]));
if (counter == FileB.Length)
break;
}
index++;
index = index % (CHUNCK_SIZE + 1);
}
}
}
}
}
Representing the same file in a table using the same algorithm:

hashes
bytes  sum  Ac  A[c-1]  A[c-2]  -1^Ac  sum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1)
2      2
3      5
5      10   5   3       2       -1
8      18   8   5       3        1     3888
2      20   2   8       5        1
0      20   0   2       8        1
1      21   1   0       2       -1
0      21   0   1       0        1     6
0      21   0   0       1        1
0      21   0   0       0        1
1      22   1   0       0       -1
2      24   2   1       0        1     18
4      28   4   2       1        1
5      33   5   4       2       -1
6      39   6   5       4        1
7      46   7   6       5       -1     -7392
8      54   8   7       6        1
2      56   2   8       7        1
3      59   3   2       8       -1
4      63   4   3       2        1     1020
5      68   5   4       3       -1
6      74   6   5       4        1
7      81   7   6       5       -1
8      89   8   7       6        1     13104
11     100  11  8       7       -1     -27648
file B
rolling hashes
bytes  sum  Ac  A[c-1]  A[c-2]  -1^Ac  sum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1)
1      1
3      4
5      9    5   3       1       -1
8      17   8   5       3        1     3672
2      19   2   8       5        1     2916
0      19   0   2       8        1     405
1      20   1   0       2       -1     -66
0      20   0   1       0        1     6
0      20   0   0       1        1     2
0      20   0   0       0        1     1
1      21   1   0       0       -1     -2
2      23   2   1       0        1     18
4      27   4   2       1        1     210
5      32   5   4       2       -1     -1080
6      38   6   5       4        1     3570
7      45   7   6       5       -1     -7392
8      53   8   7       6        1     13104
2      55   2   8       7        1     4968
3      58   3   2       8       -1     -2160
4      62   4   3       2        1     1020
5      67   5   4       3       -1     -1680
6      73   6   5       4        1     3780
7      80   7   6       5       -1     -7392
8      88   8   7       6        1     13104
11     99   11  8       7       -1     -27648