Cassandra read benchmark with Spark

2023-09-11 08:53:29 Author: 你青丝如乌

I am benchmarking Cassandra's read performance. In the test-setup step I created clusters of 1, 2, and 4 EC2 instances acting as data nodes. I wrote one table with 100 million entries (a ~3 GB CSV file). Then I launched a Spark application that reads the data into an RDD using the spark-cassandra-connector.

I expected the following behavior: the more instances Cassandra uses (with the same number of instances for Spark), the faster the reads. For writes this seems to hold (about 2 times faster when the cluster is 2 times larger).

But in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!

My test results:

Cluster size 4: Write: 1750 seconds / Read: 360 seconds

Cluster size 2: Write: 3446 seconds / Read: 420 seconds

Cluster size 1: Write: 7595 seconds / Read: 284 seconds
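To make the mismatch between writes and reads explicit, the timings can be turned into speedups relative to the single-node cluster. This is only a small arithmetic sketch over the numbers reported above; the names `timings` and `speedup` are made up for illustration.

```python
# Timings in seconds, copied from the results above.
timings = {
    1: {"write": 7595, "read": 284},
    2: {"write": 3446, "read": 420},
    4: {"write": 1750, "read": 360},
}

def speedup(op, nodes, baseline_nodes=1):
    """How many times faster `op` is on `nodes` nodes than on the baseline cluster."""
    return timings[baseline_nodes][op] / timings[nodes][op]

for nodes in sorted(timings):
    print(f"{nodes} node(s): write x{speedup('write', nodes):.2f}, "
          f"read x{speedup('read', nodes):.2f}")
```

Writes come out at roughly 2.2x and 4.3x for 2 and 4 nodes, while the read speedup stays below 1x for both larger clusters.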

Another try: with the cassandra-stress tool

I ran the "cassandra-stress" tool against the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with the following results:

Cluster size   Threads     Ops/sec  Time
1              4           10146    30,1
               8           15612    30,1
              16           20037    30,2
              24           24483    30,2
             121           43403    30,5
             913           50933    31,7
2              4            8588    30,1
               8           15849    30,1
              16           24221    30,2
              24           29031    30,2
             121           59151    30,5
             913           73342    31,8
3              4            7984    30,1
               8           15263    30,1
              16           25649    30,2
              24           31110    30,2
             121           58739    30,6
             913           75867    31,8
4              4            7463    30,1
               8           14515    30,1
              16           25783    30,3
              24           31128    31,1
             121           62663    30,9
             913           80656    32,4

Results: with 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!

Results as a diagram: the data series are the cluster sizes (1/2/3/4), the x-axis shows the threads, and the y-axis the ops/sec.
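The same comparison can be read off the table numerically by dividing each throughput by the single-node throughput at the same thread count. The sketch below only re-uses the figures from the table; `ops_per_sec` and `relative_throughput` are illustrative names, not part of cassandra-stress.

```python
# Ops/sec from the cassandra-stress table: cluster size -> {threads: ops/sec}.
ops_per_sec = {
    1: {4: 10146, 8: 15612, 16: 20037, 24: 24483, 121: 43403, 913: 50933},
    2: {4: 8588, 8: 15849, 16: 24221, 24: 29031, 121: 59151, 913: 73342},
    3: {4: 7984, 8: 15263, 16: 25649, 24: 31110, 121: 58739, 913: 75867},
    4: {4: 7463, 8: 14515, 16: 25783, 24: 31128, 121: 62663, 913: 80656},
}

def relative_throughput(nodes, threads):
    """Throughput of an n-node cluster divided by the 1-node throughput."""
    return ops_per_sec[nodes][threads] / ops_per_sec[1][threads]

for threads in sorted(ops_per_sec[1]):
    row = ", ".join(
        f"{n} nodes: {relative_throughput(n, threads):.2f}" for n in (2, 3, 4)
    )
    print(f"{threads:>3} threads -> {row}")
```

A ratio below 1 means the larger cluster was slower than the single node at that thread count. That is the case at 4 threads for every larger cluster, and the ratios only climb clearly above 1 from 16 threads upward.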

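-> Question: are these cluster-wide results, or is this a test against the local node only (and therefore the result of just one instance of the ring)???

Can anyone explain this? Thanks!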

Recommended answer

I ran a similar test with a Spark worker running on each Cassandra node.

Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a Spark job that creates an RDD from the table with each row as a string, and then prints a count of the number of rows.

Here are the times I got:

1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds

So it seems to scale pretty well with the number of nodes when the Spark workers are co-located with the C* nodes.
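As a rough check of "scales pretty well", the three timings can be converted into a speedup and a parallel efficiency (speedup divided by node count). This is plain arithmetic on the numbers above; `run_seconds` and `scaling` are illustrative names.

```python
# Wall-clock times in seconds from the runs above (1 min 42 s = 102 s).
run_seconds = {1: 102, 2: 55, 4: 35}

def scaling(nodes):
    """Return (speedup, efficiency) of `nodes` nodes relative to one node."""
    speedup = run_seconds[1] / run_seconds[nodes]
    return speedup, speedup / nodes

for nodes in sorted(run_seconds):
    s, e = scaling(nodes)
    print(f"{nodes} node(s): speedup x{s:.2f}, efficiency {e:.0%}")
```

Two nodes land just under a 2x speedup and four nodes just under 3x, so the scaling is good but not perfectly linear.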

By not co-locating your workers with Cassandra, you force all the table data to go across the network. That will be slow and may be a bottleneck in your environment. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the tokens that are local to each machine.
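The locality argument can be illustrated with a toy model. This is not how the spark-cassandra-connector is implemented; it is a simplified sketch that assumes a replication factor of 1, a hypothetical 4-node ring with host names `c1` to `c4`, and separate Spark hosts `s1` to `s4`. A token range counts as a local read only if a Spark worker runs on the host that owns it.

```python
def plan_reads(token_owners, worker_hosts):
    """Split token ranges into locally readable ones and ones that cross the network.

    token_owners: dict mapping a token-range id to the Cassandra host owning it
    worker_hosts: set of hosts that run a Spark worker
    """
    local, remote = [], []
    for token_range, owner in sorted(token_owners.items()):
        (local if owner in worker_hosts else remote).append(token_range)
    return local, remote

# Hypothetical ring: 8 token ranges spread round-robin over 4 Cassandra hosts.
cassandra_hosts = ["c1", "c2", "c3", "c4"]
token_owners = {i: cassandra_hosts[i % len(cassandra_hosts)] for i in range(8)}

# Workers on the Cassandra hosts versus workers on separate machines.
co_located = plan_reads(token_owners, set(cassandra_hosts))
separate = plan_reads(token_owners, {"s1", "s2", "s3", "s4"})

print("co-located:", len(co_located[0]), "local,", len(co_located[1]), "remote")
print("separate:  ", len(separate[0]), "local,", len(separate[1]), "remote")
```

In this model every range is read locally when the workers share hosts with Cassandra, and every range has to cross the network when they do not.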

You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage, since C* doesn't like network storage.
