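Counting/sorting characters in a text file
Keywords: sorting, characters, text, file | Updated: 2023-09-27 17:49:38
I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.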
class Program
{
    static void Main(string[] args)
    {
        CharFrequency[] Charfreq = new CharFrequency[128];
        try
        {
            string line;
            System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
            while ((line = file.ReadLine()) != null)
            {
                int ch = file.Read();
                if (Charfreq.Contains(ch))
                {
                }
            }
            file.Close();
            Console.ReadLine();
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}
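My question is: what should go inside the if statement here?
I also have a CharFrequency class, which I am including here in case it is helpful or necessary. (And yes, I am required to use an array rather than a List or ArrayList.)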
public class CharFrequency
{
    private char m_character;
    private long m_count;

    public CharFrequency(char ch)
    {
        Character = ch;
        Count = 0;
    }

    public CharFrequency(char ch, long charCount)
    {
        Character = ch;
        Count = charCount;
    }

    public char Character
    {
        set
        {
            m_character = value;
        }
        get
        {
            return m_character;
        }
    }

    public long Count
    {
        get
        {
            return m_count;
        }
        set
        {
            if (value < 0)
                value = 0;
            m_count = value;
        }
    }

    public void Increment()
    {
        m_count++;
    }

    public override bool Equals(object obj)
    {
        bool equal = false;
        CharFrequency cf = new CharFrequency('\0', 0);
        cf = (CharFrequency)obj;
        if (this.Character == cf.Character)
            equal = true;
        return equal;
    }

    public override int GetHashCode()
    {
        return m_character.GetHashCode();
    }

    public override string ToString()
    {
        String s = String.Format("'{0}' ({1}) = {2}", m_character, (byte)m_character, m_count);
        return s;
    }
}
Take a look at this post:
https://codereview.stackexchange.com/questions/63872/counting-the-number-of-character-occurrences
It uses LINQ to achieve your goal.
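For illustration, a LINQ-based count could look roughly like the sketch below. This is not the code from the linked post. The names LinqCharCount and CountChars are made up for this example, and the hard-coded string stands in for text that would normally come from File.ReadAllText.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqCharCount
{
    // Hypothetical helper (not from the linked post): groups equal characters
    // and counts each group, most frequent first, ties broken by character code.
    public static List<KeyValuePair<char, int>> CountChars(string text)
    {
        return text
            .GroupBy(c => c)
            .Select(g => new KeyValuePair<char, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key)
            .ToList();
    }

    public static void Main()
    {
        // Assumption: in the real program the text would be read from the file.
        foreach (var pair in CountChars("hello world"))
        {
            Console.WriteLine("'{0}' = {1}", pair.Key, pair.Value);
        }
    }
}
```

Note that this relies on LINQ's own collections, so it does not satisfy a requirement to store the counts in a plain array.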
You should not use Contains.
First you need to initialize your Charfreq array:
CharFrequency[] Charfreq = new CharFrequency[128];
// Fill every slot; a freshly allocated array of a class type contains only nulls
for (int i = 0; i < Charfreq.Length; i++)
{
    Charfreq[i] = new CharFrequency((char)i);
}
try
Then you can do:
int ch;
// -1 means that there are no more characters to read,
// otherwise ch is the char read
while ((ch = file.Read()) != -1)
{
    CharFrequency cf = new CharFrequency((char)ch);
    // This works because CharFrequency overrides the
    // Equals method, and the Equals method checks only
    // the Character property of CharFrequency
    int ix = Array.IndexOf(Charfreq, cf);
    // if there is a matching CharFrequency in the array
    if (ix != -1)
    {
        Charfreq[ix].Increment();
    }
}
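Note that this is not how I would write the program. It is the minimum change needed to make your program work.
As a side note, this program only counts the "frequency" of ASCII characters (characters with a code <= 127); any other character is not found in the array and is ignored.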
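Because slot i of the array holds the entry for character code i, the character code can also serve as the array index directly, which avoids a linear search for every character read. The sketch below shows that variant and then sorts the array by count, since the question also asks about sorting. It is only one possible approach: AsciiCounter and its Count method are hypothetical names, the CharFrequency class is a cut-down copy so the example compiles on its own, and a StringReader stands in for the StreamReader over the real file.

```csharp
using System;
using System.IO;

// Cut-down copy of the CharFrequency class, only so this sketch is self-contained.
public class CharFrequency
{
    public char Character { get; private set; }
    public long Count { get; private set; }
    public CharFrequency(char ch) { Character = ch; }
    public void Increment() { Count++; }
}

public static class AsciiCounter
{
    // Hypothetical helper: counts ASCII characters read from any TextReader.
    public static CharFrequency[] Count(TextReader reader)
    {
        var table = new CharFrequency[128];
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = new CharFrequency((char)i);
        }

        int ch;
        while ((ch = reader.Read()) != -1)
        {
            // Slot i holds character i, so the code is the index.
            // Characters outside the ASCII range are skipped.
            if (ch < table.Length)
            {
                table[ch].Increment();
            }
        }

        // Most frequent first; ties are ordered by character code.
        Array.Sort(table, (a, b) =>
        {
            int byCount = b.Count.CompareTo(a.Count);
            return byCount != 0 ? byCount : a.Character.CompareTo(b.Character);
        });
        return table;
    }

    public static void Main()
    {
        // Assumption: a StringReader replaces the StreamReader over the real file.
        using (var reader = new StringReader("abracadabra"))
        {
            foreach (var cf in Count(reader))
            {
                if (cf.Count > 0)
                {
                    Console.WriteLine("'{0}' = {1}", cf.Character, cf.Count);
                }
            }
        }
    }
}
```

After the sort, the position of an entry no longer matches its character code, so the sort should happen only once all counting is finished.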
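Also, in your Equals method:
CharFrequency cf = new CharFrequency('\0', 0);
cf = (CharFrequency)obj;
The first line is a useless initialization.
CharFrequency cf = (CharFrequency)obj;
is enough. Otherwise you create a CharFrequency only to discard it on the next line.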
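Going one step further than the fix above, the direct cast throws an InvalidCastException when obj is some other type, and the following property access throws a NullReferenceException when obj is null. One common way to guard against both is the as operator, sketched below. This is a suggestion rather than part of the original program; the class is trimmed to the members Equals needs, and EqualsDemo is a hypothetical name.

```csharp
using System;

// Trimmed version of CharFrequency: only what the Equals sketch needs.
public class CharFrequency
{
    private readonly char m_character;

    public CharFrequency(char ch) { m_character = ch; }

    public char Character { get { return m_character; } }

    public override bool Equals(object obj)
    {
        // "as" yields null instead of throwing when obj is null or another type
        var cf = obj as CharFrequency;
        return cf != null && Character == cf.Character;
    }

    public override int GetHashCode()
    {
        return m_character.GetHashCode();
    }
}

public static class EqualsDemo
{
    public static void Main()
    {
        var a = new CharFrequency('x');
        Console.WriteLine(a.Equals(new CharFrequency('x'))); // True
        Console.WriteLine(a.Equals("x"));                    // False
        Console.WriteLine(a.Equals(null));                   // False
    }
}
```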
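A dictionary is a great fit for a task like this. You did not say which character set and encoding the file uses. Since Unicode is so widespread, let's assume the Unicode character set and UTF-8 encoding. (After all, it is the default for .NET, Java, JavaScript, HTML, XML, ....) If that is not the case, read the file with the applicable encoding and fix your code, because your StreamReader is currently using UTF-8.
The next step is to iterate over the "characters" and increment the count for each "character" in the dictionary as it is seen in the text.
Unicode does have some complicating features. One is combining characters, where a base character can be overlaid with diacritics and the like. Users perceive such a combination as a single "character", or, as Unicode calls it, a grapheme. Thankfully, .NET provides the StringInfo class, which iterates over them as "text elements".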
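To make the difference between chars and text elements concrete, here is a small sketch. TextElementDemo and CountTextElements are hypothetical names; StringInfo.GetTextElementEnumerator is the .NET API mentioned above. The sample string is the letter e followed by a combining acute accent, which a reader sees as a single character.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: counts text elements (user-perceived characters).
    public static int CountTextElements(string s)
    {
        int n = 0;
        var itor = StringInfo.GetTextElementEnumerator(s);
        while (itor.MoveNext())
        {
            n++;
        }
        return n;
    }

    public static void Main()
    {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        string s = "e\u0301";
        Console.WriteLine(s.Length);             // 2 (UTF-16 code units)
        Console.WriteLine(CountTextElements(s)); // 1 (one text element)
    }
}
```

Counting by char would therefore report the base letter and the accent as two separate entries, while counting by text element keeps them together.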
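So, if you think about it, doing this with an array is quite hard: the keys are now strings rather than small integer codes, so you would have to build your own dictionary on top of your array.
The example below uses a Dictionary and can be run as a LINQPad script. After it builds the dictionary, it sorts the entries and dumps them in a nicely formatted display.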
var path = Path.GetTempFileName();
// Get some text we know is encoded in UTF-8 to simplify the code below
// and contains combining codepoints as a matter of example.
using (var web = new WebClient())
{
    web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path);
}
// since the question asks to analyze a file
var content = File.ReadAllText(path, Encoding.UTF8);
var frequency = new Dictionary<String, int>();
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
while (itor.MoveNext())
{
    var element = (String)itor.Current;
    if (!frequency.ContainsKey(element))
    {
        frequency.Add(element, 0);
    }
    frequency[element]++;
}
var histogram = frequency
    .OrderByDescending(f => f.Value)
    // jazz it up with the list of codepoints in each text element
    .Select(pair =>
    {
        var bytes = Encoding.UTF32.GetBytes(pair.Key);
        var codepoints = new UInt32[bytes.Length / 4];
        Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
        return new
        {
            Count = pair.Value,
            textElement = pair.Key,
            codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp))
        };
    });
histogram.Dump(); // For use in LINQPad