Split a large txt file into small files based on specific content

Keywords: file, txt, split, small files | Updated: 2023-09-27 18:31:15

I have a large genome sequence that I need to split into small .txt files.

The sequence looks like this:

>supercont1.1 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.2 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.3 of Geomyces destructans 20631-21
AGATTTT (...)

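It should be split into small files named "1.1-Geomyces-destructans--20631-21", "1.2-Geomyces...", each filled with the genome data of its record.

With help from @JimMischel, my code now looks like this: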

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;
namespace genom1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        string filter = "Textové soubory|*.txt|Soubory FASTA|*.fasta|Všechny soubory|*.*";
        private void doit_Click(object sender, EventArgs e)
        {
            bar.Value = 0;
            OpenFileDialog opf = new OpenFileDialog();
            // filter for choosing file types
            opf.Filter = filter;
            string lineo = "error"; // test
            if (opf.ShowDialog() == DialogResult.OK)
            {
                var lineCount = 0;
                using (var reader = File.OpenText(opf.FileName))
                {
                    while (reader.ReadLine() != null)
                    {
                        lineCount++;
                    }
                }
                bar.Maximum = lineCount;
                bar.Step = 1;
                FolderBrowserDialog fbd = new FolderBrowserDialog();
                fbd.Description = "Vyber složku, do které chceš rozdělit načtený soubor: \n\n" + opf.FileName; // dialog description: "Choose the folder to split the loaded file into"
                if (fbd.ShowDialog() == DialogResult.OK)
                {
                    List<string> lines = new List<string>();
                    foreach (var line in File.ReadLines(opf.FileName))
                    {
                        bar.PerformStep();
                        if (line[0] == '>')
                        {
                           if (lines.Count >= 0)
                            {
                                // write contents of lines list to file
                                //quicker replace for better file name
                                StringBuilder prep = new StringBuilder(line);
                                prep.Replace(">supercont", "");
                                prep.Replace("of", "");
                                prep.Replace(" ", "-");
                                lineo = prep.ToString();
                                // append or writeall? how to writeall lines without append?
                                //System.IO.File.WriteAllText(fbd.SelectedPath + "\\" + lineo + ".txt", lineo);
                                StreamWriter SW;
                                SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                                foreach (string s in lines)
                                    {
                                        SW.WriteLine(s);
                                    }
                                SW.Close();
                                // and clear the list.
                                lines.Clear();
                            }
                        }
                        lines.Add(line);
                    }
                    // here, do the last part
                    if (lines.Count >= 0)
                    {
                        // write contents of lines list to file.
                        /* starts being little buggy here...
                        StreamWriter SW;
                        SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                        foreach (string s in lines)
                        {
                            SW.WriteLine(s);
                        }
                        SW.Close();
                        */
                    }
                }
            }
        }
    }
}

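If the file is small enough to fit in memory, you can call File.ReadAllText to load it into a string. Then you walk through the string and extract the text between the > characters. Something like this: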

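string s = File.ReadAllText("filename");
int pos = s.IndexOf('>');
while (pos != -1)
{
    int newpos = s.IndexOf('>', pos + 1);
    if (newpos == -1)
    {
        // no further '>': the last record is handled after the loop
        break;
    }
    // text between this '>' and the next one, excluding both
    string text = s.Substring(pos + 1, newpos - pos - 1);
    // now write text to a file
    // update current position
    pos = newpos;
}
// here you have to handle the last part of the file specially
if (pos != -1)
{
    string last = s.Substring(pos + 1);
    // write last to a file
}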

I assume you can figure out how to name the files properly.
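For the naming step, here is one possible sketch. The class FastaNames and the method ToFileName are hypothetical names, not part of the question or of any library. It assumes headers look like the sample in the question, drops the "supercont" prefix and the word "of", and joins the remaining words with single hyphens, so it does not reproduce the double hyphen shown in the question's example name.

```csharp
using System;
using System.IO;
using System.Linq;

static class FastaNames
{
    // Hypothetical helper: turns a header such as
    // ">supercont1.1 of Geomyces destructans 20631-21" into a file name.
    public static string ToFileName(string header)
    {
        // Drop the leading '>' and the "supercont" prefix if present.
        string name = header.TrimStart('>');
        if (name.StartsWith("supercont", StringComparison.Ordinal))
            name = name.Substring("supercont".Length);

        // Split on whitespace, drop the filler word "of", join with hyphens.
        var parts = name
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(p => p != "of");
        name = string.Join("-", parts);

        // Replace any character the file system would reject.
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');

        return name + ".txt";
    }
}
```

Replacing invalid characters matters because a header is free text, and a stray character such as ':' or '/' would otherwise make the file creation fail.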

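If you can't fit the whole file in memory, you can read the file character by character or do some kind of buffering. The problem is easier if you know that > always appears at the start of a line. Then you can write: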

List<string> lines = new List<string>();
foreach (var line in File.ReadLines("filename"))
{
    // check the length first: line[0] throws on an empty line
    if (line.Length > 0 && line[0] == '>')
    {
        if (lines.Count > 0)
        {
            // write contents of lines list to file.
            // and clear the list.
            lines.Clear();
        }
    }
    lines.Add(line);
}
// here, do the last part
if (lines.Count > 0)
{
    // write contents of lines list to file.
}
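The line-by-line idea above can also be written without the intermediate list, by keeping one output file open per record. The following is only a sketch under assumptions: FastaSplitter, Split and the nameFor parameter are hypothetical names, and nameFor stands in for whatever naming rule you choose. Each file is opened when its own header is seen, so a record's lines always end up in the file named after that record, and new StreamWriter(path) overwrites an existing file rather than appending to it.

```csharp
using System;
using System.IO;

static class FastaSplitter
{
    // Splits inputPath into one file per record inside outputDir.
    // nameFor maps a header line to a file name (a placeholder for your own rule).
    // Returns the number of files written.
    public static int Split(string inputPath, string outputDir, Func<string, string> nameFor)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;
        StreamWriter writer = null;
        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                if (line.Length > 0 && line[0] == '>')
                {
                    // A new record starts: close the previous file, open the next.
                    writer?.Dispose();
                    writer = new StreamWriter(Path.Combine(outputDir, nameFor(line)));
                    count++;
                }
                // Lines before the first header have no file and are skipped.
                writer?.WriteLine(line);
            }
        }
        finally
        {
            writer?.Dispose();
        }
        return count;
    }
}
```

Because File.ReadLines streams the input, memory use stays small no matter how large the genome file is.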

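I think the simplest approach is to first read the whole file with File.ReadAllText(). Then just call String.Split('>'), which returns an array whose elements I believe are the contents of your new files.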
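A minimal sketch of that Split-based idea follows. FastaText and SplitRecords are hypothetical names, and the code assumes that the file fits in memory, that it begins with a > header, and that > never occurs inside a header or in the sequence data. Note that Split removes the > itself, so the returned headers do not include it.

```csharp
using System;
using System.Collections.Generic;

static class FastaText
{
    // Splits the whole text on '>' and returns (header, body) pairs.
    public static List<KeyValuePair<string, string>> SplitRecords(string text)
    {
        var records = new List<KeyValuePair<string, string>>();
        // RemoveEmptyEntries drops the empty piece before the first '>'.
        foreach (string chunk in text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // The header is the first line of the chunk; the rest is sequence data.
            int eol = chunk.IndexOf('\n');
            string header = (eol == -1 ? chunk : chunk.Substring(0, eol)).TrimEnd('\r');
            string body = eol == -1 ? "" : chunk.Substring(eol + 1);
            records.Add(new KeyValuePair<string, string>(header, body));
        }
        return records;
    }
}
```

Each pair can then be written out with File.WriteAllText, using the header to build the file name.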