有没有更快的方法来搜索大文件,而不把它加载到内存中

本文关键字:加载 内存 文件 方法 搜索 有没有 | 更新日期: 2023-09-27 18:12:12

我有一个巨大的文件,我不能加载到内存和字节序列,我需要找到里面。

这是我现在使用的:

public static byte[] GetRangeFromStream(ref FileStream fs, long start_index, long count)
{
    byte[] data = new byte[count];
    long prev_pos = fs.Position;
    fs.Position = start_index;
    fs.Read(data, 0, data.Length);
    fs.Position = prev_pos;
    return data;
}
public static long GetSequenceIndexInFileStream(byte[] seq, ref FileStream fs, long index, bool get_beginning = true)
{
    if (index >= fs.Length)
        return -1;
    fs.Position = index;
    for (long i = index; i < fs.Length; i++)
    {
        byte temp_byte = (byte)fs.ReadByte();
        if (temp_byte == seq[0] && IsArraysSame(seq, GetRangeFromStream(ref fs, i, seq.Length))) //compare just first bytes and then compare seqs if needed
            return i;
    }
    return -1;
}

有没有更快的方法来搜索大文件,而不把它加载到内存中

执行此操作的最佳方法是逐字节读取文件流,仅查找正在搜索的字符串的第一个字节。当你得到匹配时,你就知道你可能找到了。读取接下来的N个字节(其中N是正在搜索的字符串的长度),并对从文件中读取的字节的内容和正在搜索的字节的内容进行数组比较。

让。net通过在打开文件缓冲区时设置适当的FileStream缓冲区来处理文件读缓冲区。不要担心读取转发来创建你自己的缓冲区——这是浪费时间——.net做得很好。

这种方法意味着你不需要对每个字节进行数组比较,也不需要关心缓冲区之间的分割等。

如果文件真的很大,而你不是I/O绑定,那么你可以考虑使用。net任务库创建多个任务,每个任务从文件流中的不同位置开始,并在所有任务完成后将每个任务的结果关联起来。

我建议使用修改版的Knuth Moriss Pratt算法。算法kmp_search:输入:字符流S(要搜索的文本)一个字符数组,W(被搜索的单词)输出:整数(W在S中从零开始的位置)

define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)
while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        if T[i] > -1 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1, i ← 0
(if we reach here, we have searched all of S unsuccessfully)
return the length of S

文本字符串可以流进,因为KMP算法不会在文本中回溯。(这是对朴素算法的另一个改进,朴素算法本身不支持流。)如果是流,处理传入字符的平摊时间是Ɵ(1),但最坏情况的时间是Ɵ(min(m, n ')),其中n '是迄今为止看到的文本字符的数量。源

参考(Java)实现可以在这里找到

package com.twitter.elephantbird.util;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {
  protected byte[] pattern_;
  protected int[] borders_;
  // An upper bound on pattern length for searching. Results are undefined for longer patterns.
  public static final int MAX_PATTERN_LENGTH = 1024;
  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }
  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }
  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;
    int b;
    int j = 0;
    while ((b = stream.read()) != -1) {
      bytesRead++;
      while (j >= 0 && (byte)b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;
      // If we've matched up to the full pattern length, we found it.  Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }
    // No dice, Note that the stream is now completely consumed.
    return -1;
  }
  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Moore-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

类似问题:在流中搜索字符串

的有效方法

您可能想要检查文件映射。它允许您基本上将大文件视为内存缓冲区,从而允许在磁盘文件上使用任何基于内存的API。不要乱搞显式文件I/o

MSDN:

文件映射是一个文件的内容与一个进程的虚拟地址空间的一部分的关联。系统创建一个文件映射对象(也称为节对象)来维护这种关联。文件视图是进程用来访问文件内容的虚拟地址空间的一部分。文件映射允许进程使用随机输入和输出(I/O)和顺序I/O。它还允许进程有效地处理大型数据文件(如数据库),而不必将整个文件映射到内存中。多个进程也可以使用内存映射文件共享数据....告诉我更多…