Splitting a large txt file into small files based on specific content
Last updated: 2023-09-27 18:31:15
I have a large genome sequence and I need to split it into small .txt files.
The sequence looks like this:
>supercont1.1 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.2 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.3 of Geomyces destructans 20631-21
AGATTTT (...)
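It should be split into small files named "1.1-Geomyces-destructans--20631-21", "1.2-Geomyces..." and so on, each containing the genome data of its own record.
After help from @JimMischel, my code looks like this: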
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace genom1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Filter for choosing file types (Czech UI strings: text files, FASTA files, all files).
        string filter = "Textové soubory|*.txt|Soubory FASTA|*.fasta|Všechny soubory|*.*";

        private void doit_Click(object sender, EventArgs e)
        {
            bar.Value = 0;
            OpenFileDialog opf = new OpenFileDialog();
            opf.Filter = filter;
            if (opf.ShowDialog() != DialogResult.OK)
            {
                return;
            }

            // First pass: count the lines so the progress bar has a maximum.
            var lineCount = 0;
            using (var reader = File.OpenText(opf.FileName))
            {
                while (reader.ReadLine() != null)
                {
                    lineCount++;
                }
            }
            bar.Maximum = lineCount;
            bar.Step = 1;

            FolderBrowserDialog fbd = new FolderBrowserDialog();
            // Dialog description (Czech: "Choose the folder to split the loaded file into").
            fbd.Description = "Vyber složku, do které chceš rozdělit načtený soubor: \n\n" + opf.FileName;
            if (fbd.ShowDialog() != DialogResult.OK)
            {
                return;
            }

            List<string> lines = new List<string>();
            foreach (var line in File.ReadLines(opf.FileName))
            {
                bar.PerformStep();
                // A header line starts a new record, so flush the previous record first.
                if (line.Length > 0 && line[0] == '>' && lines.Count > 0)
                {
                    WriteRecord(fbd.SelectedPath, lines);
                    lines.Clear();
                }
                lines.Add(line);
            }

            // Flush the last record, which has no following header to trigger the write.
            if (lines.Count > 0)
            {
                WriteRecord(fbd.SelectedPath, lines);
            }
        }

        // Writes one record to its own file. The name comes from the record's own
        // header (lines[0]), not from the header of the record that follows it.
        private static void WriteRecord(string folder, List<string> lines)
        {
            // Quick replace for a better file name.
            StringBuilder prep = new StringBuilder(lines[0]);
            prep.Replace(">supercont", "");
            prep.Replace("of", "");
            prep.Replace(" ", "-");
            // WriteAllLines overwrites an existing file instead of appending to it.
            File.WriteAllLines(Path.Combine(folder, prep.ToString() + ".txt"), lines);
        }
    }
}
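If the file is small enough to fit in memory, you can call File.ReadAllText to load it into a string. Then you walk through the string and extract the text between the > characters. Like this: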
string s = File.ReadAllText("filename");
int pos = s.IndexOf('>');
while (pos != -1)
{
    int newpos = s.IndexOf('>', pos + 1);
    // The last record has no following '>', so it runs to the end of the string.
    int end = (newpos == -1) ? s.Length : newpos;
    // Text between this '>' and the next one, excluding both markers.
    string text = s.Substring(pos + 1, end - pos - 1);
    // now write text to a file
    // update current position
    pos = newpos;
}
I assume you can figure out how to name the files properly.
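For the naming step, here is one possible sketch. It assumes headers look like the ones in the question (`>supercont1.1 of Geomyces destructans 20631-21`); `FastaNaming` and `FileNameFromHeader` are hypothetical names chosen for illustration, not part of any library. It drops the `supercont` prefix and the word "of", joins the remaining words with single dashes, and replaces any character the platform does not allow in file names.

```csharp
using System;
using System.IO;
using System.Linq;

public static class FastaNaming
{
    // Hypothetical helper: turns a header line into a file name (without extension).
    public static string FileNameFromHeader(string header)
    {
        string name = header.TrimStart('>');

        // Assumption: records are prefixed with "supercont", as in the question.
        const string prefix = "supercont";
        if (name.StartsWith(prefix, StringComparison.Ordinal))
        {
            name = name.Substring(prefix.Length);
        }

        // Split on spaces and tabs, drop the standalone word "of", join with dashes.
        var parts = name
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(p => p != "of");
        name = string.Join("-", parts);

        // Replace characters that are not valid in file names on this platform.
        foreach (char c in Path.GetInvalidFileNameChars())
        {
            name = name.Replace(c, '_');
        }
        return name;
    }
}
```

Filtering out "of" as a whole word, rather than calling `Replace("of", "")`, avoids cutting those two letters out of the middle of a species name.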
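If you can't fit the whole file in memory, you can read the file character by character or do some kind of buffering. The problem is easier if you know that > is always at the beginning of a line. Then you can write: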
List<string> lines = new List<string>();
foreach (var line in File.ReadLines("filename"))
{
    // Check the length first so an empty line does not throw on line[0].
    if (line.Length > 0 && line[0] == '>')
    {
        if (lines.Count > 0)
        {
            // write contents of lines list to file.
            // and clear the list.
            lines.Clear();
        }
    }
    lines.Add(line);
}
// here, do the last part
if (lines.Count > 0)
{
    // write contents of lines list to file.
}
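The skeleton above leaves the actual writing as comments. One way to fill it in, offered as a sketch rather than the answerer's own code, is to skip the list entirely and stream each record straight into its own file. `FastaSplitter`, `SplitByHeader`, and the `nameFromHeader` callback are hypothetical names; the callback stands in for whatever naming rule you choose.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FastaSplitter
{
    // Splits inputPath into one .txt file per record and returns the paths written.
    // Assumes every record starts with a line whose first character is '>'.
    public static List<string> SplitByHeader(
        string inputPath, string outputFolder, Func<string, string> nameFromHeader)
    {
        var written = new List<string>();
        StreamWriter writer = null;
        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                if (line.Length > 0 && line[0] == '>')
                {
                    // New header: close the previous file and open the next one.
                    writer?.Dispose();
                    string path = Path.Combine(outputFolder, nameFromHeader(line) + ".txt");
                    writer = new StreamWriter(path, false); // false = overwrite, do not append
                    written.Add(path);
                }
                // Lines before the first header have no file to go to and are skipped.
                writer?.WriteLine(line);
            }
        }
        finally
        {
            // Closes the last record's file, even if reading failed part-way.
            writer?.Dispose();
        }
        return written;
    }
}
```

Because only one line is held at a time, memory use stays flat however large a single record is, and the last record needs no special handling beyond closing the writer.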
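I think the easiest way is to first read the whole file with File.ReadAllText(). Then just call String.Split('>'), which returns an array whose elements should be the contents of your new files.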
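A minimal sketch of that idea, assuming the file fits in memory and that > appears only at the start of header lines. `FastaText` and `SplitRecords` are hypothetical names. Note that Split removes the > itself, so each chunk begins with the header text, and any non-empty text before the first > would come back as a record of its own.

```csharp
using System;
using System.Collections.Generic;

public static class FastaText
{
    // Splits the whole text on '>' and returns (header, body) pairs.
    public static List<KeyValuePair<string, string>> SplitRecords(string text)
    {
        var records = new List<KeyValuePair<string, string>>();
        // RemoveEmptyEntries drops the empty chunk in front of the first '>'.
        foreach (string chunk in text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // The first line of the chunk is the header, the rest is sequence data.
            int eol = chunk.IndexOf('\n');
            string header = (eol == -1 ? chunk : chunk.Substring(0, eol)).TrimEnd('\r');
            string body = eol == -1 ? "" : chunk.Substring(eol + 1);
            records.Add(new KeyValuePair<string, string>(header, body));
        }
        return records;
    }
}
```

A caller would pass in the result of `File.ReadAllText(path)` and write each pair out with `File.WriteAllText`, using a file name derived from the header.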