Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

2023-09-11 08:19:13

I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.
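For context, here is a minimal sketch of what such an API-side upload can look like with boto3 (this is not the author's actual code; the item source is a made-up stand-in, while the table name, region, and key fields follow the Hive script below):

import boto3

# Hypothetical stand-in for the real 130-million-item feed.
def read_items():
    for i in range(1000):
        yield {"hash_key": f"key-{i}", "range_key": i}

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("my_ddb_table")

# batch_writer groups puts into BatchWriteItem calls of up to 25 items
# and automatically resubmits unprocessed items.
with table.batch_writer() as batch:
    for item in read_items():
        batch.put_item(Item=item)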

Long story short, importing that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and it didn't manage to finish the 700 MB test file in 12 hours).

I have already contacted Amazon Premium Support, but so far all they could tell me was that "for some reason DynamoDB import is slow".

I have tried the following instructions in my interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

Various flags don't seem to have any visible effect. I have tried the following settings instead of the default ones:

SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;
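For scale, a quick back-of-envelope (Python; the provisioned-capacity figures are illustrative) shows how a 130-million-item import relates to write capacity, since each item under 1 KB costs one write capacity unit and dynamodb.throughput.write.percent = 1.0 only lets the job target 100% of whatever the table has provisioned:

# Illustrative only: the provisioned-WCU figures are made up.
items = 130_000_000  # from the question

for wcu in (1_000, 10_000, 40_000):
    hours = items / wcu / 3600
    print(f"{wcu:>6} WCU -> at best {hours:,.1f} hours")

#   1000 WCU -> at best 36.1 hours
#  10000 WCU -> at best 3.6 hours
#  40000 WCU -> at best 0.9 hours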

The same commands, run against HDFS instead of the DynamoDB target, complete in seconds.

That seems to be a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

Accepted answer

Here is the answer I finally got from AWS support recently. I hope it helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes one write capacity unit (IOP).
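In boto3 terms, the one-by-one pattern described above corresponds roughly to the following (a sketch for illustration, not EMR's actual internals; the attribute names follow the Hive mapping above):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

# One PutItem request per item: each call pays a full network round
# trip and consumes at least one write capacity unit, as described above.
def put_one_by_one(items):
    for item in items:
        client.put_item(
            TableName="my_ddb_table",
            Item={
                "hash_key": {"S": item["hash_key"]},
                "range_key": {"N": str(item["range_key"])},
            },
        )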

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it right). This is something we are aware of and will probably implement in EMR in the future. We can't offer a timeline, though.
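The BatchWriteItem alternative the answer refers to would look roughly like this (again a sketch; the 25-item limit per call is DynamoDB's documented maximum, and real code would retry UnprocessedItems with backoff):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

def batch_write(items):
    # BatchWriteItem accepts at most 25 put/delete requests per call.
    for start in range(0, len(items), 25):
        chunk = items[start:start + 25]
        response = client.batch_write_item(RequestItems={
            "my_ddb_table": [
                {"PutRequest": {"Item": {
                    "hash_key": {"S": it["hash_key"]},
                    "range_key": {"N": str(it["range_key"])},
                }}}
                for it in chunk
            ]
        })
        # Throttled writes come back in UnprocessedItems; resubmitted
        # once here for brevity, with backoff in real code.
        unprocessed = response.get("UnprocessedItems", {})
        if unprocessed:
            client.batch_write_item(RequestItems=unprocessed)

Note that a batch still consumes one write capacity unit per item, which is why it helps with latency and connection overhead but not with the capacity bill.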

As stated before, the main problem here is that your table in DynamoDB is reaching the provisioned throughput, so try to increase it temporarily for the import; afterwards, feel free to decrease it to whatever level you need.
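With provisioned capacity, that advice maps to something like the following boto3 call (the capacity numbers are placeholders; also note that DynamoDB limits how often provisioned throughput can be decreased per day):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

def set_capacity(table_name, read_units, write_units):
    # Raise write capacity before the bulk import, then call this again
    # afterwards to drop back to the normal level.
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

set_capacity("my_ddb_table", 50, 10_000)  # placeholder numbers, before the import
# ... run the Hive INSERT OVERWRITE ...
set_capacity("my_ddb_table", 50, 500)     # back down afterwards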

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received one. The problem has since been fixed.