在字节数组中查找模式的最有效方法

本文关键字：有效方法模式查找字节字节数数组 | 更新日期: 2023-09-27 18:18:44

我有以下代码:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

我需要在bytes中搜索指定字节序列的第一次出现(如果有的话):0xff, 0xd8。(这样做的目的是查找嵌入在文件中的图像)

因此，如果例如bytes[6501]包含0xff和bytes[6502]包含0xd8，这是一个匹配，我需要返回位置的索引(6501)，或一个新数组，它是字节数组的副本，除了它没有旧数组中6501以下的键。

我当前的解决方案是循环:

 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2))
    ...

但是处理较大的文件时速度很慢。

有没有更有效的方法来处理这个问题?

在字节数组中查找模式的最有效方法

如果这是时间紧迫的代码，我发现c#编译器(包括Mono的实现和Microsoft的)有特殊的逻辑来优化简单的扫描循环。

因此，从分析经验来看，我将使用硬编码的第一元素搜索实现序列搜索，如下所示:

/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence of -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing
  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }
  // end of array reached without match
  return -1;
}

比其他建议长得多，而且容易出现off-by-1错误，但如果你正在扫描大量数据，或者为了频繁的设备IO而这样做，这种设置将避免给垃圾收集器提供食物，并且优化得非常好。

编辑2019-10-03:修复了Warren Rox指出的问题。谢谢!测试:https://ideone.com/mmACYj

您希望使用for循环来检查您的数组。代码运行缓慢的原因很简单。

反编译说明原因:

public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.SkipIterator<TSource>(source, count);
}
private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    while (count > 0 && enumerator.MoveNext())
      --count;
    if (count <= 0)
    {
      while (enumerator.MoveNext())
        yield return enumerator.Current;
    }
  }
}

对于你循环的每一个For，你都在执行一次跳跃，基本上是不必要地再次遍历你的数组。

一些Linq操作包含尽可能使用索引器的优化-不幸的是跳过不是其中之一。

PS:

如果我是你，我会把你的代码改成

var search = new byte[] {0xff, 0xd8};
var current = new byte[2];
var maxSearchRange = bytes.Length -1;
for (var index = 0; index < maxSearchRange; index++)
{
   current[0] = bytes[index];
   current[1] = bytes[index+1];
   if((search).SequenceEqual(current))
       ...

简单的线性搜索有缺点吗?
如果找到则返回起始索引，否则返回-1

private const byte First = 0x0ff;
private const byte Second = 0x0d8;
private static int FindImageStart(IList<byte> bytes) {
    for (var index = 0; index < bytes.Count - 1; index++) {
        if (bytes[index] == First && bytes[index + 1] == Second) {
            return index;
        }
    }
    return -1;
}

public int FindSequence(byte[] source, byte[] seq)
{
    var start = -1;
    for (var i = 0; i < source.Length - seq.Length + 1 && start == -1; i++)
    {
        var j = 0;
        for (; j < seq.Length && source[i+j] == seq[j]; j++) {}
        if (j == seq.Length) start = i;
    }
    return start;
}

简单的…怎么样?

bytes[] pattern = new bytes[] { 1, 2, 3, 4, 5 };
for (var index = 0, end = bytes.Length - pattern.length; index < end; index++)
{
    bool found = false;
    for(int j = 0; j < pattern.Length && !found; j++)
    {
        found = bytes[index + j] == pattern[j];
    }
    if(found)
        return index;
}

请注意，我没有在c#代码很长一段时间，所以请原谅语法错误，如果有任何。将此视为伪代码(不再抛出索引错误):)