How to create a colormap of confidence estimates for k-nearest-neighbour classification

What I want:

To display the results of my simple classification algorithm (see below) as a colormap in Python (the data is in 2D), where each class is assigned a color, and the confidence of a prediction anywhere on the 2D map is proportional to the saturation of the color associated with the class prediction. The image below sort of illustrates what I want for a binary (two-class) problem, in which the red parts might suggest strong confidence in class 1, whereas blue parts would speak for class 2, and the intermediate colors would suggest uncertainty about either. Obviously I want the color scheme to generalize to multiple classes, so I would need many colors, and the scale would then go from white (uncertainty) to the fully saturated color associated with a class.

Some Sample Code:

My sample code just uses a simple kNN algorithm, where the nearest k data points are allowed to 'vote' on the class of a new point on the map. The confidence of the prediction is simply given by the relative frequency of the winning class among the k votes. I haven't dealt with ties, and I know there are better probabilistic versions of this method, but all I want is to visualize my data to show a viewer the chance of a class being present in a particular part of the 2D plane.

import numpy as np
import matplotlib.pyplot as plt


# Generate some training data from three classes
n = 100 # Number of sample points per class in the training set.
mean1, mean2, mean3 = [-1.5,0], [1.5, 0], [0,1.5]
cov1, cov2, cov3 = [[1,0],[0,1]], [[1,0],[0,1]], [[1,0],[0,1]]
X1 = np.asarray(np.random.multivariate_normal(mean1,cov1,n))
X2 = np.asarray(np.random.multivariate_normal(mean2,cov2,n))
X3 = np.asarray(np.random.multivariate_normal(mean3,cov3,n))


plt.plot(X1[:,0], X1[:,1], 'ro', X2[:,0], X2[:,1], 'bo', X3[:,0], X3[:,1], 'go' )

plt.axis('equal'); plt.show() #Display training data


# Prepare the data set as a 3n*3 array where each row is a data point and its associated class
D = np.zeros((3*n,3))
D[0:n,0:2] = X1; D[0:n,2] = 1
D[n:2*n,0:2] = X2; D[n:2*n,2] = 2
D[2*n:3*n,0:2] = X3; D[2*n:3*n,2] = 3

def kNN(x, D, k=3):
    x = np.asarray(x)
    dist = np.linalg.norm(x-D[:,0:2], axis=1)
    i = dist.argsort()[:k] # Indices of the k smallest distances
    counts = np.bincount(D[i,2].astype(int))
    predicted_class = np.argmax(counts) 
    confidence = float(np.max(counts))/k
    return predicted_class, confidence 

print(kNN([-2,0], D, 20))

Solution

So, you can calculate two numbers for each point in the 2D plane:

confidence (0..1)
class (an integer)

One possibility is to calculate your own RGB map and show it with imshow. Like this:

import numpy as np
import matplotlib.pyplot as plt

# N x 3 array of RGB colors, one row per class (N = maximum number of classes)
mycolors = np.array([
  [ 0, 0, 1],
  [ 0, 1, 0],
  [ 1, 0, 1],
  [ 1, 1, 0],
  [ 0, 1, 1],
  [ 0, 0, 0],
  [ 0, .5, 1]])

# negate the colors
mycolors = 1 - mycolors 

# extents of the area
x0 = -2
x1 = 2
y0 = -2
y1 = 2

# grid over the area
X, Y = np.meshgrid(np.linspace(x0, x1, 1000), np.linspace(y0, y1, 1000))

# calculate the classification and probabilities
classes = classify_func(X, Y)
probabilities = prob_func(X, Y)

# create the basic color map by the class
img = mycolors[classes]

# fade the color by the probability (zero probability goes to black here,
# which becomes white after the final inversion)
img *= probabilities[:,:,None]

# reverse the negative image back
img = 1 - img

# draw it
plt.imshow(img, extent=[x0,x1,y0,y1], origin='lower')
plt.axis('equal')

# save it
plt.savefig("mymap.png")

The trick of making negative colors is there just to make the maths a bit easier to understand. The code can, of course, be written much more densely.
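For instance, skipping the in-place negation, the negate/fade/invert steps collapse into a single equivalent expression (a sketch, assuming mycolors still holds the original, un-negated colors):

# equivalent one-liner: fade each class color toward white by the probability
img = 1 - (1 - mycolors[classes]) * probabilities[:, :, None]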

I created two very simple functions to mimic the classification and probabilities:

def classify_func(X, Y):
    return np.round(abs(X+Y)).astype('int')

def prob_func(X,Y):
    return 1 - 2*abs(abs(X+Y)-classify_func(X,Y))

The former gives integer values from 0 to 4 over the given area, and the latter gives smoothly changing probabilities.

The result:

If you do not like the way the colors fade towards zero probability, you can always create some non-linearity which is then applied when multiplying with the probabilities.
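For example, raising the probabilities to a power before the multiplication changes the shape of the fade (the exponent 2.0 here is an arbitrary illustrative choice, not from the original answer):

# sketch: apply a power-law curve to the probabilities before fading
# (the exponent is chosen only for illustration)
gamma = 2.0
img *= (probabilities ** gamma)[:, :, None]

This line would replace the plain img *= probabilities[:,:,None] step above.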

Here the functions classify_func and prob_func are given two arrays as arguments, the first being the X coordinates at which the values are to be calculated, and the second the Y coordinates. This works well if the underlying calculations are fully vectorized. With the code in the question this is not the case, as it only calculates single values.

In that case the code changes slightly:

x = np.linspace(x0, x1, 1000)
y = np.linspace(y0, y1, 1000)
classes = np.empty((len(y), len(x)), dtype='int')
probabilities = np.empty((len(y), len(x)))
for yi, yv in enumerate(y):
    for xi, xv in enumerate(x):
        classes[yi, xi], probabilities[yi, xi] = kNN((xv, yv), D)

Also, as your confidence estimates do not span the full 0..1 range (with majority voting over three classes the minimum is about 1/3), they need to be scaled:

probabilities -= np.amin(probabilities)
probabilities /= np.amax(probabilities)

After this is done, your map should look like this with extents -4,-4..4,4 (as per the color map: green=1, magenta=2, yellow=3):

To vectorize or not to vectorize - that is the question

This question pops up from time to time. There is a lot of information about vectorization on the web, but as a quick search did not reveal any short summaries, I'll give some thoughts here. This is quite a subjective matter, so everything here just represents my humble opinions; other people may have different views.

There are three factors to consider:

performance
legibility
memory use

Usually (but not always) vectorization makes code faster, more difficult to understand, and more memory-hungry. Memory use is not usually a big problem, but with large arrays it is something to think about (hundreds of megabytes are usually fine; gigabytes are troublesome).

Trivial cases aside (element-wise simple operations, simple matrix operations), my approach is:

1. write the code without vectorization and check that it works
2. profile the code
3. vectorize the inner loops if needed and possible (1D vectorization)
4. create a 2D vectorization if it is simple

For example, a pixel-by-pixel image processing operation may lead to a situation where I end up with one-dimensional vectorizations (for each row). Then the inner loop (for each pixel) is fast, and the outer loop (for each row) does not really matter. The code may look much simpler if it does not try to be usable with all possible input dimensions.

I am such a lousy algorithmist that in more complex cases I like to verify my vectorized code against the non-vectorized versions. Hence I almost invariably first create the non-vectorized code before optimizing it at all.

Sometimes vectorization does not offer any performance benefit. For example, the handy function numpy.vectorize can be used to vectorize practically any function, but its documentation states:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

(This function could have been used in the code above, as well. I chose the loop version for legibility, for readers not very familiar with numpy.)
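As a sketch of what that alternative might look like (my illustration, not code from the answer; the otypes argument tells vectorize that kNN returns an int and a float):

# sketch: numpy.vectorize alternative to the explicit double loop
kNN_vec = np.vectorize(lambda xv, yv: kNN((xv, yv), D), otypes=[int, float])
classes, probabilities = kNN_vec(X, Y)  # X, Y from np.meshgrid as above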

Vectorization gives more performance only if the underlying vectorized functions are faster. They sometimes are, sometimes aren't; only profiling and experience will tell. Also, it is not always necessary to vectorize everything. You may have an image processing algorithm which has both vectorized and pixel-by-pixel operations. There, numpy.vectorize is very useful.

I would try to vectorize the kNN search algorithm above at least to one dimension. There is no conditional code (it wouldn't be a show-stopper, but it would complicate things), and the algorithm is rather straightforward. The memory consumption will go up, but with one-dimensional vectorization it does not matter.
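A sketch of that one-dimensional vectorization (my illustration of the idea, not code from the answer; the helper name kNN_row is made up):

# sketch: classify a whole row of the grid at once
def kNN_row(xs, y, D, k=3):
    pts = np.column_stack([xs, np.full(len(xs), y)])   # (m, 2) query points
    # distances from every query point to every training point: shape (m, 3n)
    dist = np.linalg.norm(pts[:, None, :] - D[None, :, 0:2], axis=2)
    idx = np.argpartition(dist, k, axis=1)[:, :k]      # k nearest per query
    votes = D[idx, 2].astype(int)                      # (m, k) class labels
    # tally the votes for every label in one broadcast operation
    counts = (votes[:, :, None] == np.arange(votes.max() + 1)).sum(axis=1)
    return counts.argmax(axis=1), counts.max(axis=1) / k

# the double loop then collapses to a single loop over rows
for yi, yv in enumerate(y):
    classes[yi, :], probabilities[yi, :] = kNN_row(x, yv, D, k=20)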

And it may happen that along the way you notice that an n-dimensional generalization is not much more complicated. Then do that, if memory allows.