Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

2023-09-11 08:19:13

I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.
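For context, here is a minimal sketch of what such an API-side upload can look like with boto3 (this is not the author's actual code; the item source is a made-up stand-in, while the table name, region, and key fields follow the Hive script below):

import boto3

# Hypothetical stand-in for the real 130-million-item feed.
def read_items():
    for i in range(1000):
        yield {"hash_key": f"key-{i}", "range_key": i}

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("my_ddb_table")

# batch_writer groups puts into BatchWriteItem calls of up to 25 items
# and automatically resubmits unprocessed items.
with table.batch_writer() as batch:
    for item in read_items():
        batch.put_item(Item=item)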

Long story short, importing that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and it didn't manage to finish the 700 MB test file in 12 hours).

I have already contacted Amazon Premium Support, but so far all they could tell me was that "for some reason DynamoDB import is slow".

I have tried the following instructions in my interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

Various flags don't seem to have any visible effect. I have tried the following settings instead of the default ones:

SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;
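For scale, a quick back-of-envelope (Python; the provisioned-capacity figures are illustrative) shows how a 130-million-item import relates to write capacity, since each item under 1 KB costs one write capacity unit and dynamodb.throughput.write.percent = 1.0 only lets the job target 100% of whatever the table has provisioned:

# Illustrative only: the provisioned-WCU figures are made up.
items = 130_000_000  # from the question

for wcu in (1_000, 10_000, 40_000):
    hours = items / wcu / 3600
    print(f"{wcu:>6} WCU -> at best {hours:,.1f} hours")

#   1000 WCU -> at best 36.1 hours
#  10000 WCU -> at best 3.6 hours
#  40000 WCU -> at best 0.9 hours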

The same commands, run against HDFS instead of the DynamoDB target, complete in seconds.

That seems to be a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

Accepted answer

Here is the answer I finally got from AWS support recently. I hope it helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes one write capacity unit (IOP).
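In boto3 terms, the one-by-one pattern described above corresponds roughly to the following (a sketch for illustration, not EMR's actual internals; the attribute names follow the Hive mapping above):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

# One PutItem request per item: each call pays a full network round
# trip and consumes at least one write capacity unit, as described above.
def put_one_by_one(items):
    for item in items:
        client.put_item(
            TableName="my_ddb_table",
            Item={
                "hash_key": {"S": item["hash_key"]},
                "range_key": {"N": str(item["range_key"])},
            },
        )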

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it right). This is something we are aware of and will probably implement in EMR in the future. We can't offer a timeline, though.
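The BatchWriteItem alternative the answer refers to would look roughly like this (again a sketch; the 25-item limit per call is DynamoDB's documented maximum, and real code would retry UnprocessedItems with backoff):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

def batch_write(items):
    # BatchWriteItem accepts at most 25 put/delete requests per call.
    for start in range(0, len(items), 25):
        chunk = items[start:start + 25]
        response = client.batch_write_item(RequestItems={
            "my_ddb_table": [
                {"PutRequest": {"Item": {
                    "hash_key": {"S": it["hash_key"]},
                    "range_key": {"N": str(it["range_key"])},
                }}}
                for it in chunk
            ]
        })
        # Throttled writes come back in UnprocessedItems; resubmitted
        # once here for brevity, with backoff in real code.
        unprocessed = response.get("UnprocessedItems", {})
        if unprocessed:
            client.batch_write_item(RequestItems=unprocessed)

Note that a batch still consumes one write capacity unit per item, which is why it helps with latency and connection overhead but not with the capacity bill.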

As stated before, the main problem here is that your table in DynamoDB is reaching the provisioned throughput, so try to increase it temporarily for the import; afterwards, feel free to decrease it to whatever level you need.
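With provisioned capacity, that advice maps to something like the following boto3 call (the capacity numbers are placeholders; also note that DynamoDB limits how often provisioned throughput can be decreased per day):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

def set_capacity(table_name, read_units, write_units):
    # Raise write capacity before the bulk import, then call this again
    # afterwards to drop back to the normal level.
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

set_capacity("my_ddb_table", 50, 10_000)  # placeholder numbers, before the import
# ... run the Hive INSERT OVERWRITE ...
set_capacity("my_ddb_table", 50, 500)     # back down afterwards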

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received one. The problem has since been fixed.