Min-max normalization of data points

Keywords: normalization, data, min | Updated: 2023-09-27 18:35:28

I have a list of data points, for example

List<DataPoint> newpoints=new List<DataPoint>(); 

where DataPoint is a class made up of nine double features, A to I, and

newpoints.Count = 100000 points (i.e. each point consists of nine double features, A to I)

I need to normalize the list newpoints using min-max normalization with a scale_range of 0 to 1.

So far I have implemented the following steps:

  1. Each data point feature is copied into a one-dimensional array. For example, the code for feature A:

    for (int i = 0; i < newpoints.Count; i++)
    {
        array_A[i] = newpoints[i].A;
    }
    // ...and so on for all nine double features
    
  2. I applied the min-max normalization method. For example, the code for feature A:

    normilized_featureA= (((array_A[i] - array_A.Min()) * (1 - 0)) / 
                      (array_A.Max() - array_A.Min()))+0;
    

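The method completes successfully, but it takes too long (3 minutes 45 seconds).

How can I apply min-max normalization using LINQ in C# so that the time drops to a few seconds? I found this question on Stack Overflow, "How to normalize a list of int values", but my problem is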

double valueMax = list.Max(); // I need the max of feature A across all 100000 points
double valueMin = list.Min(); // I need the min of feature A across all 100000 points

and so on for each of the nine features. Your help will be highly appreciated.

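As an alternative to modeling the 9 features as double properties on the class DataPoint, you could model each data point as an array of 9 doubles. The benefit is that you can then use LINQ to do all 9 calculations in one go: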

var newpoints = new List<double[]>
{
    new []{1.23, 2.34, 3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12},
    new []{2.34, 3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23},
    new []{3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23, 13.34},
    new []{4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23, 13.34, 15.32}
};
var featureStats = newpoints
    // Assumes every row contains all 9 features.
    .First()
    // First projection: min and max of each column (feature).
    .Select((np, idx) => new
    {
        Idx = idx,
        Max = newpoints.Max(x => x[idx]),
        Min = newpoints.Min(x => x[idx])
    })
    // Second projection: add the range of each column.
    .Select(x => new
    {
        x.Idx,
        x.Max,
        x.Min,
        Range = x.Max - x.Min
    })
    // Materialize to an array for O(1) lookups.
    .ToArray();

// Normalize every column of every row.
var normalizedFeatures = newpoints
    .Select(np => np.Select(
        (i, idx) => (i - featureStats[idx].Min) / featureStats[idx].Range));

foreach (var datapoint in normalizedFeatures)
{
    Console.WriteLine(string.Join(",", datapoint.Select(x => x.ToString("0.00"))));
}

Result:

0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
0.33,0.33,0.33,0.33,0.34,0.47,0.23,0.05,0.50
0.67,0.67,0.67,0.67,0.69,0.91,0.28,0.75,0.68
1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00
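
One caveat with the snippet above: if a column holds the same value in every row, Range is 0 and the division produces NaN. Here is a minimal sketch of one way to guard against that. It assumes it is acceptable to map a constant column to 0, and the names ColumnNormalizer and NormalizeColumns are illustrative only, not part of the answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnNormalizer
{
    // Hypothetical helper: min-max normalizes every column of a row-major table.
    // Assumes at least one row and that all rows have the same length.
    public static double[][] NormalizeColumns(IReadOnlyList<double[]> rows)
    {
        int columns = rows[0].Length;
        var min = new double[columns];
        var max = new double[columns];
        for (int c = 0; c < columns; c++)
        {
            // One scan per column for each bound, done once up front.
            min[c] = rows.Min(r => r[c]);
            max[c] = rows.Max(r => r[c]);
        }

        return rows
            .Select(r => r.Select((v, c) =>
            {
                double range = max[c] - min[c];
                // Assumption: a constant column carries no information, so map it to 0.
                return range == 0 ? 0.0 : (v - min[c]) / range;
            }).ToArray())
            .ToArray();
    }
}
```

Calling `ColumnNormalizer.NormalizeColumns(newpoints)` also materializes the result, whereas `normalizedFeatures` above is a lazy sequence that is recomputed each time it is enumerated.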

Stop recomputing the max/min over and over again; it doesn't change.

double maxInFeatureA = array_A.Max();
double minInFeatureA = array_A.Min();
// somewhere in the loop:
normilized_featureA= (((array_A[i] - minInFeatureA ) * (1 - 0)) / 
                  (maxInFeatureA  - minInFeatureA ))+0;

Max()/Min() each scan the whole array, so calling them inside a foreach/for loop over many elements gets very expensive.
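
To make the cost concrete: the loop body in the question calls Min() twice and Max() once per element, so for n elements it does on the order of n² comparisons instead of n. Here is a minimal sketch that finds both bounds in a single pass and then normalizes in a second pass. The class and method names are illustrative, and mapping a constant array to the lower bound is my assumption:

```csharp
using System;

public static class MinMax
{
    // Scales values into [lo, hi]; min and max are each computed exactly once.
    public static double[] Normalize(double[] values, double lo = 0.0, double hi = 1.0)
    {
        if (values.Length == 0) return Array.Empty<double>();

        double min = values[0], max = values[0];
        foreach (double v in values)      // single O(n) pass for both bounds
        {
            if (v < min) min = v;
            if (v > max) max = v;
        }

        double range = max - min;
        var result = new double[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            // Assumption: a constant array maps to lo instead of dividing by zero.
            result[i] = range == 0 ? lo : lo + (values[i] - min) * (hi - lo) / range;
        }
        return result;
    }
}
```

With the defaults, `MinMax.Normalize(array_A)` corresponds to the `(1 - 0)` and `+0` terms in the question's formula.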

I suggest you take the code from this question: "Array data normalization"

and use it like this:

// Assumes NormalizeData(min, max) takes the target range; the question asks for 0 to 1.
var normalizedPoints = newpoints.Select(x => x.A)
            .NormalizeData(0, 1)
            .ToList();

// Compute min and max once, then reuse them for every point.
double min = newpoints.Min(p => p.A);
double max = newpoints.Max(p => p.A);
double normalizer = 1 / (max - min);
var normalizedFeatureA = newpoints.Select(p => (p.A - min) * normalizer);
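
The same idea can be extended to all nine features without copying the code nine times. The following is only a sketch: the DataPoint definition is an assumption (the question does not show it), and FeatureNormalizer and NormalizeAll are hypothetical names. It passes one selector per feature and computes each min and max once:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's class: nine double features, A to I.
public class DataPoint
{
    public double A, B, C, D, E, F, G, H, I;
}

public static class FeatureNormalizer
{
    // One selector per feature, in column order A..I.
    private static readonly Func<DataPoint, double>[] Features =
    {
        p => p.A, p => p.B, p => p.C, p => p.D, p => p.E,
        p => p.F, p => p.G, p => p.H, p => p.I
    };

    // Returns one double[9] per point, each feature scaled to [0, 1].
    // Throws InvalidOperationException if points is empty (Min/Max on empty input).
    public static List<double[]> NormalizeAll(IReadOnlyList<DataPoint> points)
    {
        // Min and max per feature are computed once, up front.
        double[] min = Features.Select(f => points.Min(f)).ToArray();
        double[] max = Features.Select(f => points.Max(f)).ToArray();

        return points
            .Select(p => Features
                .Select((f, k) => max[k] == min[k]
                    ? 0.0 // assumption: a constant feature maps to 0
                    : (f(p) - min[k]) / (max[k] - min[k]))
                .ToArray())
            .ToList();
    }
}
```

Usage would be `var normalized = FeatureNormalizer.NormalizeAll(newpoints);`. For 100000 points this does 18 scans to find the bounds plus one pass to scale the values.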