How to handle fields enclosed in quotes (CSV) when importing data from S3 into DynamoDB using EMR/Hive

2023-09-11 08:56:57 Author: 浪子多情

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields that are enclosed in double quotes and separated by commas. When creating the external table in Hive, I can specify the comma as the delimiter, but how do I specify that the fields are enclosed in quotes?

If I don't specify this, the values in DynamoDB end up wrapped in two double quotes (""value""), which seems wrong.

I am using the following command to create the external table. Is there a way to specify that fields are enclosed in double quotes?

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';

Any suggestions would be appreciated. Thanks, Jitendra

Recommended answer

If you're stuck with the CSV file format, you'll have to use a custom SerDe; some work based on the opencsv library already exists.
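As a rough sketch of the SerDe approach: Hive 0.14 and later bundle an opencsv-based SerDe, org.apache.hadoop.hive.serde2.OpenCSVSerde. This is not necessarily the SerDe the answer had in mind. Assuming the Hive version on your EMR cluster includes it, the table from the question could be declared roughly like this (the table name, columns, and S3 location are taken from the question; the property values are the ones that match a comma-separated, double-quoted file):

```sql
-- Sketch only: assumes the cluster's Hive ships OpenCSVSerde (Hive 0.14+).
CREATE EXTERNAL TABLE emrS3_import_1 (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- fields are separated by commas
  "quoteChar"     = "\"",  -- fields are enclosed in double quotes
  "escapeChar"    = "\\"   -- backslash escapes characters inside a field
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder';
```

Note that this SerDe reads every column as a string, which happens to match the all-string schema in the question.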

But if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite the files to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';