How to extract a string from multiple txt files

Keywords: file, extract, string, txt | Last updated: 2023-09-27 18:23:57

I have thousands of .log files, and I need to find certain strings in all of them.

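Let me explain with an example: every .log file contains a string called "AAA", and after that string there is a number that can differ from one log file to the next. I know how to search for the AAA string. What I don't know is how to extract only the number that follows it.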

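Update: each .log file contains many lines. Only one line in a given .log file contains the string "A12A". After that string there is a number (for example: 5465). What I need is to extract the number that follows A12A. Note: there is a space between A12A and the number 5465.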

Example .log file content: "assddsf dfdfsd dfd A12A 5465 dffdsfsdf dfdfdf". What I need to extract is: 5465.

Here is what I have so far:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Modify this path as necessary.
        string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\";
        // Take a snapshot of the file system.
        System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);
        // This method assumes that the application has discovery permissions
        // for all folders under the specified path.
        IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories);
        string searchTerm = @"Visual Studio";
        // Search the contents of each file.
        // A regular expression created with the RegEx class
        // could be used instead of the Contains method.
        // queryMatchingFiles is an IEnumerable<string>.
        var queryMatchingFiles =
            from file in fileList
            where file.Extension == ".htm"
            let fileText = GetFileText(file.FullName)
            where fileText.Contains(searchTerm)
            select file.FullName;
        // Execute the query.
        Console.WriteLine("The term \"{0}\" was found in:", searchTerm);
        foreach (string filename in queryMatchingFiles)
        {
            Console.WriteLine(filename);
        }
        // Keep the console window open in debug mode.
        Console.WriteLine("Press any key to exit");
        Console.ReadKey();
    }

    // Read the contents of the file.
    static string GetFileText(string name)
    {
        string fileContents = String.Empty;
        // If the file has been deleted since we took
        // the snapshot, ignore it and return the empty string.
        if (System.IO.File.Exists(name))
        {
            fileContents = System.IO.File.ReadAllText(name);
        }
        return fileContents;
    }
}

I suggest using the following code for the search:

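// Requires: using System; using System.Collections.Generic; using System.IO;
// using System.Text; using System.Text.RegularExpressions;
private static readonly string _SearchPattern = "A12A";
private static readonly Regex _NumberExtractor = new Regex(@"\d+");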
private static IEnumerable<Tuple<String, int>> FindMatches()
{
    var startFolder = @"D:\";
    // The files in the question are .log files; use "*.log" to search those.
    var filePattern = @"*.htm";
    var matchingFiles = Directory.EnumerateFiles(startFolder, filePattern, SearchOption.AllDirectories);
    foreach (var file in matchingFiles)
    {
        // What encoding do your files use?
        var lines = File.ReadLines(file, Encoding.UTF8);
        foreach (var line in lines)
        {
            int number;
            if (TryGetNumber(line, out number))
            {
                yield return Tuple.Create(file, number);
                // Stop searching that file and continue with the next one.
                break;
            }
        }
    }
}
private static bool TryGetNumber(string line, out int number)
{
    number = 0;
    // Should casing be ignored??
    var index = line.IndexOf(_SearchPattern, StringComparison.InvariantCultureIgnoreCase);
    if (index >= 0)
    {
        var numberRaw = line.Substring(index + _SearchPattern.Length);
        var match = _NumberExtractor.Match(numberRaw);
        return Int32.TryParse(match.Value, out number);
    }
    return false;
}
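
As an alternative sketch, the marker and the number could be matched in a single regular expression instead of IndexOf followed by a second match. The class name MarkerNumber and the method TryExtract are hypothetical, and the pattern assumes the number follows "A12A" after whitespace only, as in the example line from the question. Unlike TryGetNumber above, it is case-sensitive and does not skip over other text between the marker and the digits.

```csharp
using System;
using System.Text.RegularExpressions;

static class MarkerNumber
{
    // Hypothetical pattern: the marker, at least one whitespace character,
    // then a captured run of digits.
    private static readonly Regex Pattern = new Regex(@"A12A\s+(\d+)");

    public static bool TryExtract(string line, out int number)
    {
        number = 0;
        Match match = Pattern.Match(line);
        // Group 1 holds only the digits, so no Substring call is needed.
        return match.Success && int.TryParse(match.Groups[1].Value, out number);
    }

    static void Main()
    {
        // Sample line taken from the question.
        string line = "assddsf dfdfsd dfd A12A 5465 dffdsfsdf dfdfdf";
        int number;
        if (TryExtract(line, out number))
        {
            Console.WriteLine(number); // prints 5465
        }
    }
}
```

Which variant fits better probably depends on how strictly the log format is defined: the single pattern rejects lines where something other than whitespace sits between the marker and the number.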

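The reason is that when performing I/O operations, the drive itself is usually the bottleneck. So it makes no sense to do anything in parallel, or to read large amounts of file data into memory that is never used.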

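By using the Directory.EnumerateFiles method, the drive is searched lazily, and you can start examining the first file as soon as it has been found. The same is true for the File.ReadLines method: it iterates over the file lazily while you search for the pattern.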
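
A minimal sketch of that lazy behaviour for a single file, assuming a temporary file with made-up contents. The class LazyReadDemo and the method FirstLineContaining are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

static class LazyReadDemo
{
    // Returns the first line containing the marker, or null if none does.
    public static string FirstLineContaining(string path, string marker)
    {
        // File.ReadLines yields lines on demand, so FirstOrDefault stops
        // pulling further lines once a match is found.
        return File.ReadLines(path)
                   .FirstOrDefault(line => line.Contains(marker));
    }

    static void Main()
    {
        // Hypothetical sample data written to a temporary file.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "first line", "dfd A12A 5465 dff", "never inspected" });
            Console.WriteLine(FirstLineContaining(path, "A12A")); // prints: dfd A12A 5465 dff
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

By contrast, File.ReadAllLines would load every line into an array before the search could start.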

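With this approach you should get the maximum speed (depending on your hard drive's performance), because it needs only a minimum of I/O calls to get the files and their contents into your code.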