Determining the optimal number of features in PySpark PCA

2023-09-03 13:55:36 Author: 樱花落心扉

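With scikit-learn, we can decide how many features (principal components) to keep from the cumulative explained variance plot, as shown below.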

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA()  # initialise PCA, keeping all components
pca.fit(dataset)  # fit the PCA model to the dataset

# explained_variance_ratio_ (note the trailing underscore) shows how much
# variance is explained by each of the seven individual components
pca.explained_variance_ratio_

We can plot the cumulative value as below:

plt.figure(figsize=(10, 8))  # size of the chart
cumulativeValue = pca.explained_variance_ratio_.cumsum()  # get the cumulative sum

# x axis: number of components, 1..7 for the seven components here
plt.plot(range(1, len(cumulativeValue) + 1), cumulativeValue, marker='o', linestyle='--')
plt.show()

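The number of components at which the cumulative explained variance comes close to 80% is then the number of features we can choose for PCA.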
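The 80% rule can also be applied without reading the value off the plot. The following is a minimal sketch, not part of the original question: `components_for_threshold` and `example_ratios` are hypothetical names, and the ratios are made-up values standing in for `pca.explained_variance_ratio_`. It returns the smallest number of components whose cumulative ratio reaches the threshold.

```python
from itertools import accumulate


def components_for_threshold(ratios, threshold=0.8):
    """Return the smallest number of components whose cumulative
    explained-variance ratio reaches `threshold`."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    # Threshold never reached: keep every component.
    return len(ratios)


# Hypothetical ratios (assumption), standing in for pca.explained_variance_ratio_
example_ratios = [0.55, 0.27, 0.10, 0.05, 0.03]
chosen = components_for_threshold(example_ratios)
print(chosen)
```

With a fitted scikit-learn model, the input would be `pca.explained_variance_ratio_`. scikit-learn's `PCA` also accepts a fraction such as `n_components=0.8` (with the full SVD solver) and then selects the component count itself, which may be simpler if the plot is not needed.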

My question is how to determine the optimal number of features using PySpark.

Recommended answer

We can determine this with explainedVariance. Here is how I did it.

import matplotlib.pyplot as plt
from pyspark.ml.feature import PCA, VectorAssembler

# use a vector assembler to build the input vector column
# (placeholder column names; list all of your feature columns here)
inputCols = ['inputCol1', 'inputCol2', 'inputCol3', 'inputCol4']
vectorAssembler = VectorAssembler(inputCols=inputCols, outputCol='pcaInput')

df = vectorAssembler.transform(dataset)  # assemble the feature columns into one vector
# k is the maximum number of components, i.e. the number of features I have
# (8 in my data); k must not exceed the number of input columns
k = len(inputCols)
pca = PCA(k=k, inputCol="pcaInput", outputCol="features")
pcaModel = pca.fit(df)  # fit the data to build the PCA model
print(pcaModel.explainedVariance)  # proportion of variance explained by each component
cumValues = pcaModel.explainedVariance.toArray().cumsum()  # get the cumulative values
# plot the graph
plt.figure(figsize=(10, 8))
plt.plot(range(1, k + 1), cumValues, marker='o', linestyle='--')
plt.title('variance by components')
plt.xlabel('num of components')
plt.ylabel('cumulative explained variance')
plt.show()

Choose the number of components at which the cumulative explained variance is close to 80%.

So in this case, the optimal number of components is 2.
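Reading "close to 80%" off a chart is a judgment call, so it may help to compute it. The following is a rough sketch under stated assumptions, not the answerer's code: `closest_to_target` is a hypothetical helper, and `example_variances` are invented numbers for eight components, not the values from the answer's data. It picks the component count whose cumulative explained variance lies nearest to the target, which can land slightly below 80%.

```python
from itertools import accumulate


def closest_to_target(variances, target=0.8):
    """Return (k, cumulative), where k is the number of components whose
    cumulative explained variance is nearest to `target`."""
    cumulative = list(accumulate(variances))
    best_index = min(range(len(cumulative)),
                     key=lambda i: abs(cumulative[i] - target))
    return best_index + 1, cumulative


# Invented per-component variance proportions (assumption) for 8 components
example_variances = [0.62, 0.17, 0.09, 0.06, 0.03, 0.02, 0.007, 0.003]
k, cumulative = closest_to_target(example_variances)
print(k, cumulative[k - 1])
```

With a fitted model, the input could be `pcaModel.explainedVariance.toArray().tolist()`. If the requirement is to explain at least 80% rather than roughly 80%, choosing the first count whose cumulative value reaches the target is the stricter rule and may give one more component.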
