Spark - Which instance type is preferred for an AWS EMR cluster? (Tags: Spark, cluster, instance, type)

2023-09-11 09:16:24 Author: ゝ真心投入却狼狈退出

I am running some machine learning algorithms on an EMR Spark cluster. Which kind of instance should I use to get the best cost/performance trade-off?

At the same price level, I can choose among:

Instance   vCPU  ECU  Memory (GiB)
m3.xlarge  4     13   15
c4.xlarge  4     16   7.5
r3.xlarge  4     13   30.5

Which kind of instance should be used in an EMR Spark cluster?

Recommended answer

Generally, it depends on your use case, your needs, and so on. But I can suggest a minimum configuration based on the information you shared.

You seem to be trying to train an ALS factorization or SVD on matrices holding roughly 2 to 4 GB of data. That is actually not much data.

You'll need at least 1 master and 2 slave nodes to set up and configure a small distributed cluster. The master won't be doing any computing itself, so it won't need many resources, but it will of course be handling task scheduling and the like. You can add more slaves (instances) according to your needs.

1 x master : m3.xlarge  - vCPU: 4,  RAM: 15 GB, 2 x 40 GB SSDs
2 x slaves : c3.4xlarge - vCPU: 16, RAM: 30 GB, 2 x 160 GB SSDs
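The layout above could be written as an EMR request roughly as follows. This is a minimal sketch, assuming boto3's `run_job_flow` call and the default EMR roles (`EMR_DefaultRole`, `EMR_EC2_DefaultRole`). The cluster name, release label, and region are placeholders that do not come from the original answer, and the slaves are mapped to EMR's CORE instance group as an assumption.

```python
def build_cluster_request(release_label, name="spark-ml-cluster"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    `name` and `release_label` are placeholders (assumptions); pick values
    that match your account and the instance types you use.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                # The master only schedules tasks, so a small instance is enough.
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                # The two compute-optimized nodes do the actual work.
                {"Name": "slaves", "InstanceRole": "CORE",
                 "InstanceType": "c3.4xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR roles; assumed to already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region_name):
    """Submit the request to EMR and return the new cluster id."""
    import boto3  # imported here so building the request needs no AWS setup

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Adding more slaves later only means raising `InstanceCount` on the second group; `launch_cluster` is not called here because it needs AWS credentials.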

C3 and C4 are Compute Optimized instances featuring high-performance processors and the lowest price/compute performance in EC2. R3, by comparison, is memory optimized, and its recommended use cases are distributed memory caches and in-memory analytics. Still, C3 will do the job for you at a lower price.
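To make the compute-versus-memory trade-off concrete, the sketch below derives per-vCPU ratios from the three instance types listed in the question. It uses only the figures from that table; no prices are included because the question states only that they are at the same level. The function and variable names are illustrative.

```python
# Specs copied from the table in the question: (vCPU, ECU, memory in GiB).
INSTANCES = {
    "m3.xlarge": (4, 13, 15.0),
    "c4.xlarge": (4, 16, 7.5),
    "r3.xlarge": (4, 13, 30.5),
}


def per_vcpu(specs):
    """Return compute (ECU) and memory (GiB) available per vCPU for each type."""
    ratios = {}
    for name, (vcpu, ecu, mem_gib) in specs.items():
        ratios[name] = {
            "ecu_per_vcpu": ecu / vcpu,
            "gib_per_vcpu": mem_gib / vcpu,
        }
    return ratios


ratios = per_vcpu(INSTANCES)
most_compute = max(ratios, key=lambda n: ratios[n]["ecu_per_vcpu"])
most_memory = max(ratios, key=lambda n: ratios[n]["gib_per_vcpu"])
print("most compute per vCPU:", most_compute)
print("most memory per vCPU:", most_memory)
```

At equal price, the compute-optimized type offers the most ECU per vCPU and the R3 type the most memory per vCPU, which is the trade-off the answer weighs.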

Performance optimization:

Amazon EMR charges in hourly increments. This means that once you run a cluster, you pay for the entire hour. That is important to remember: if you are paying for a full hour of an Amazon EMR cluster anyway, improving your data processing time by a matter of minutes may not be worth your time and effort.
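The effect of hourly increments can be sketched with a little arithmetic. This is an illustration under the billing model described above; the node count and the hourly rate in the usage lines are made-up numbers, not real AWS prices.

```python
import math


def billed_hours(runtime_minutes):
    """Round a runtime up to whole hours; any run is billed at least one hour."""
    return max(1, math.ceil(runtime_minutes / 60))


def cluster_cost(runtime_minutes, node_count, hourly_rate_per_node):
    """Cost of a run when every node is billed in whole-hour increments."""
    return billed_hours(runtime_minutes) * node_count * hourly_rate_per_node


# Hypothetical example: 3 nodes at a made-up rate of 1.0 per node-hour.
before = cluster_cost(58, node_count=3, hourly_rate_per_node=1.0)
after = cluster_cost(50, node_count=3, hourly_rate_per_node=1.0)
print(before, after)
```

Cutting the job from 58 to 50 minutes leaves the bill unchanged, because both runs fall inside the same billed hour. A saving only appears when the runtime drops below an hour boundary, for example from 61 minutes to 59.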

Don't forget that adding more nodes to increase performance is cheaper than spending time optimizing your cluster.

Reference: Amazon EMR Best Practices - Parviz Deyhim