Reading a very large single-line txt file and splitting it

Keywords: file, txt, single line, split, read | Updated: 2023-09-27 18:13:46

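I have the following problem: I have a file that is nearly 500 MB in size. It is text, all on a single line. The text is separated by a virtual line ending called ROW_DEL, and looks like this: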

this is a line ROW_DEL and this is a line

Now I need to do the following: I want to split this file into lines, so that I end up with a file like this:

this is a line
and this is a line

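The problem is that even opening it in the Windows text editor makes the editor crash, because the file is too big.

Is it possible to split this file as described using C#, Java, or Python? Preferably without putting too much load on my CPU.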


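Actually, 500 MB of text is not that big; Notepad is just terrible. You probably don't have sed available since you're on Windows, but at least try the naive solution in Python; I think it will work fine: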

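with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
    # Write '\n': text mode already converts it to the platform line ending,
    # so writing os.linesep here would produce '\r\r\n' on Windows.
    f_out.write(f_in.read().replace('ROW_DEL ', '\n'))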

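Read the file in chunks, for example with StreamReader.ReadBlock in C#. There you can set the maximum number of characters to read.

For each chunk you read, you can replace ROW_DEL with \r\n and append the result to a new file.

Just remember to increase the current index by the number of characters you just read.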
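As a rough illustration of this chunked approach, here is a Python 3 sketch rather than the C# StreamReader code the answer refers to. The function name split_rows, its parameters, and the file names in the usage line are made up for this example. It also holds back the last few bytes of every chunk, on the assumption that ROW_DEL may be cut in half by a chunk boundary.

```python
def split_rows(src_path, dst_path, sep=b"ROW_DEL", eol=b"\r\n", chunk_size=1 << 20):
    """Copy src_path to dst_path chunk by chunk, replacing each sep with eol."""
    keep = len(sep) - 1      # a separator cut by a chunk edge leaves at most this many bytes
    tail = b""               # raw bytes held back from the previous chunk
    with open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                f_out.write(tail)        # end of file: the held-back bytes are final
                break
            parts = (tail + chunk).split(sep)
            tail = parts.pop()           # data after the last complete separator
            for part in parts:
                f_out.write(part + eol)
            safe = len(tail) - keep      # bytes that cannot start a split separator
            if safe > 0:
                f_out.write(tail[:safe])
                tail = tail[safe:]
```

A call such as split_rows('infile.txt', 'outfile.txt') would then keep only about one chunk in memory at a time, instead of the whole 500 MB.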

Here is my solution.
In principle it is simple (ŁukaszW.pl gave it), but it is not so easy to code if you want to handle the special cases (which ŁukaszW.pl did not).

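The special cases are when the separator ROW_DEL is split across two read chunks (as I4V pointed out) and, more subtly, when there are two consecutive ROW_DELs and the second one is split across two read chunks.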

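Since ROW_DEL is longer than any possible line ending ('\r', '\n', '\r\n'), it can be replaced inside the file by the line ending used by the OS. That is why I chose to rewrite the file itself.
To do this I use mode 'r+', which does not create a new file.
Using binary mode 'b' is also absolutely mandatory.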
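To make the in-place idea concrete, here is a minimal Python 3 sketch. It is not the author's code: it uses a single file handle with separate read and write positions and holds back a short tail between chunks, and the name rewrite_in_place and its parameters are invented for the example. It relies on the point made above: because the line ending is no longer than ROW_DEL, the write position can never overtake the read position.

```python
import os

def rewrite_in_place(path, sep=b"ROW_DEL", eol=os.linesep.encode("ascii"),
                     chunk_size=262144):
    """Replace every sep with eol inside the file itself (no second file)."""
    if len(eol) > len(sep):
        raise ValueError("eol must not be longer than sep for in-place rewriting")
    keep = len(sep) - 1              # a split separator leaves at most this many bytes
    read_pos = write_pos = 0
    tail = b""                       # raw bytes already read but not yet written
    with open(path, "r+b") as f:     # read/write, binary, no new file created
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            read_pos += len(chunk)
            parts = (tail + chunk).split(sep)
            tail = parts.pop()       # data after the last complete separator
            out = b"".join(part + eol for part in parts)
            # At end of file everything left is final; otherwise hold back
            # the last `keep` bytes, which may begin a split separator.
            safe = len(tail) if not chunk else len(tail) - keep
            if safe > 0:
                out, tail = out + tail[:safe], tail[safe:]
            f.seek(write_pos)
            f.write(out)
            write_pos += len(out)
            if not chunk:
                break
        f.truncate(write_pos)        # drop the leftover bytes at the end
```

Since the file is overwritten as it is read, an interrupted run leaves it half converted, so trying this on a copy first seems prudent.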

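The principle is to read a chunk (in real life its size would be, for example, 262144) plus x additional characters, where x is the length of the separator minus 1.
Then it checks whether the separator is present in the end of the chunk + the x characters.
Depending on whether it is present or not, the chunk is shortened or not before the ROW_DEL conversion is performed, and it is rewritten in place.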
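The boundary check can be illustrated in isolation with a small Python 3 sketch. The helper boundary_cut and its argument names are hypothetical. It assumes a separator of at least two bytes that cannot overlap itself (true for ROW_DEL), and a chunk at least as long as the separator.

```python
def boundary_cut(chunk, lookahead, sep):
    """Return how many leading bytes of chunk can be converted right away.

    chunk     -- the bytes just read
    lookahead -- the next len(sep) - 1 bytes of the file (fewer at end of file)
    """
    if len(sep) < 2:
        raise ValueError("a one-byte separator cannot straddle two chunks")
    x = len(sep) - 1
    window = chunk[-x:] + lookahead[:x]   # end of the chunk + x extra characters
    pt = window.find(sep)
    if pt == -1:
        return len(chunk)                 # no straddling separator: keep the whole chunk
    return len(chunk) - x + pt            # stop just before the straddling separator
```

For instance, with chunk b"a line ROW_" and lookahead b"DEL an", the window is b"e ROW_DEL an", the separator is found at index 2, and only the first 11 - 6 + 2 = 7 bytes (b"a line ") are converted in this round.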

The bare code is:

# -*- coding: utf-8 -*-
# Python 2 script: the sample text contains non-ASCII characters.
text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')
with open('eessaa.txt','w') as f:
    f.write(text)
with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)
#==========================================
from os.path import getsize
from os import fsync,linesep
def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return
    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()
            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)
            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
            else:
                fW.truncate()
                break
rewrite('eessaa.txt','ROW_DEL',14)
with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

To follow the execution, here is another version of the code that prints messages all along:

# -*- coding: utf-8 -*-
# Python 2 script: the sample text contains non-ASCII characters.
text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')
with open('eessaa.txt','w') as f:
    f.write(text)
with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)
#==========================================
from os.path import getsize
from os import fsync,linesep
def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return
    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()
            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)
            print ('chunk  == %r   %d chars\n'
                   ' -> fR now at position  %d\n'
                   'twelve == %r   %d chars   %s\n'
                   ' -> fR now at position  %d'
                   % (chunk ,len(chunk),      pch,
                      twelve,len(twelve),m,   ptw) )
            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            print ('          %r   %d long\n'
                   ' has been written from position %d\n'
                   ' => fW now at position  %d'
                   % (y,len(y),pos,fW.tell()))
            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
                print ' -> fR moved %d characters back to position %d'\
                       % (x2-pt,fR.tell())
            else:
                print (" => fR is at position %d == file's size\n"
                       '    File has thoroughly been read'
                       % fR.tell())
                fW.truncate()
                break
            raw_input('\npress any key to continue')

rewrite('eessaa.txt','ROW_DEL',14)
with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

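To detect whether ROW_DEL straddles two chunks and whether two ROW_DELs are adjacent, the treatment of the end of each chunk involves some subtlety. That is why it took me a long time to post my solution: in the end I had to write fR.seek(-x2+pt,1), rather than just fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep straddles or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anyone interested in this point can test it by changing the code in the branches that depend on whether 'ROW_DEL' is in twelve.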
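The arithmetic behind fR.seek(-x2+pt,1) can be checked with a few lines of Python 3. The function next_chunk_start is only an illustration written for this purpose, not part of the code above. It assumes the reader has just done read(chunk_length) followed by read(x) and that neither read hit the end of the file.

```python
def next_chunk_start(chunk_start, chunk_length, sep_length, pt=None):
    """Absolute position at which the next chunk read begins.

    pt is the index of the separator inside the 2*x window,
    or None when the window does not contain it.
    """
    x = sep_length - 1
    x2 = 2 * x
    if pt is None:
        pt = x                                     # same default as the else branch
    reader_pos = chunk_start + chunk_length + x    # after read(chunk_length) + read(x)
    return reader_pos - x2 + pt                    # effect of fR.seek(-x2+pt, 1)
```

With ROW_DEL (x = 6) and chunks of 14 characters starting at 0, no straddling separator gives 0 + 14 + 6 - 12 + 6 = 14, so the next chunk starts right after the current one. A separator found at pt = 2 gives 0 + 14 + 6 - 12 + 2 = 10, which is exactly where the straddling ROW_DEL begins, so the next chunk starts with the complete separator.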