I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields that are enclosed in double quotes and separated by commas. When creating the external table in Hive, I can specify the delimiter as a comma, but how do I specify that the fields are enclosed in quotes?
If I don't specify this, the values in DynamoDB end up wrapped in two sets of double quotes (""value""), which seems wrong.
I am using the following command to create the external table. Is there a way to specify that fields are enclosed within double quotes?
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';
Any suggestions would be appreciated. Thanks, Jitendra
If you're stuck with the CSV file format, you'll have to use a custom SerDe; here's some work based on the opencsv library.
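As a rough sketch of what the SerDe route can look like: this assumes a Hive version (0.14 or later) that bundles an opencsv-based SerDe as org.apache.hadoop.hive.serde2.OpenCSVSerde, which may not be the same project as the one referred to above. On older Hive versions you would need to add a third-party SerDe jar instead, and the class name would differ. The table name, columns, and location are taken from the question.

```sql
-- Sketch only: assumes Hive 0.14+ where OpenCSVSerde ships with Hive.
-- The SerDe strips the enclosing quotes and honours commas inside quoted fields.
CREATE EXTERNAL TABLE emrS3_import_1 (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- field delimiter
  "quoteChar"     = "\"",  -- fields are enclosed in double quotes
  "escapeChar"    = "\\"   -- escape character inside quoted fields
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder';
```

Note that this SerDe reads every column as a string, which happens to match the table in the question; other column types would need a cast in a later query.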
But, if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';