Loading data with Hive, S3, EMR, and recover partitions
Tags: hive, partitions, loading, data

2023-09-11 23:42:02 · by ら. 心不動、則不痛

SOLVED: See Update #2 below for the 'solution' to this issue.

~~~~~~~

In s3, I have some log*.gz files stored in a nested directory structure like:

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I'm attempting to load these into Hive on Elastic Map Reduce (EMR), using a multi-level partition spec like:

create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

Creation of the table works. I then attempt to recover all of the existing partitions:

alter table logs recover partitions;

This seems to work and it does drill down through my s3 structure and add all the various levels of directories:

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
y=2012/m=11/d=06/h=09
y=2012/m=11/d=06/h=10
y=2012/m=11/d=06/h=11
y=2012/m=11/d=06/h=12
y=2012/m=11/d=06/h=13
y=2012/m=11/d=06/h=14
y=2012/m=11/d=06/h=15
y=2012/m=11/d=06/h=16
...

So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try to do a simple count or select *, I get nothing:

hive> select count(*) from logs;
...
OK
0

hive> select * from logs limit 10;
OK

hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10;
OK

Thoughts? Am I missing some additional command to load data beyond recovering the partitions?

If I manually add a partition with an explicit location, then that works:

alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/'

I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t. 'recover partitions'.
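
For what it's worth, here is a minimal sketch of what such a script might emit, leaning on Hive's support for listing several partition specs in a single ALTER TABLE ... ADD PARTITION statement (the second hour directory below is purely illustrative):

-- Enumerate the S3 prefixes out-of-band (e.g. from a bucket listing) and
-- emit one explicit-location clause per hour directory; 'if not exists'
-- makes the script safe to re-run.
alter table logs2 add if not exists
  partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/'
  partition (y='2012', m='11', d='09', h='11') location 's3://($BUCKET)/y=2012/m=11/d=09/H=11/';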

Update #1

Thanks to a brilliant and keen observation by Joe K in a comment below, I think that case sensitivity issues might be involved here.

The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to ISO 8601 formatting):

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I create my external table with a partition spec that does the proper capitalization:

partitioned by (y string, m string, d string, H string)

(Notice the 'H'). I do a recover partitions, which does seem to recurse through the directories and find the partitions appropriately, but somehow (despite using 'H' in all instructive places so far), it indeed seems that Hive saves it as a lower case 'h':

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08

(Note the 'h'). So it seems that Hive is able to discover the partitions, but then stores them in a lowercase form ... Later when it goes to look for data, these paths are (of course) empty because S3 is case sensitive.
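
One way to confirm what got stored (a diagnostic sketch, using a partition value taken from the show partitions output above) is to ask Hive for the metadata it recorded for a recovered partition:

-- The Location field in the output shows the path Hive will actually read;
-- if the case bug described above is in play, it points at a lowercase
-- .../h=08/ prefix that doesn't exist in S3.
describe formatted logs partition (y='2012', m='11', d='06', h='08');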

I am going to move my data into an all-lowercase directory structure and see if that works...

Update #2

Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening:

- My s3 layout has a case-sensitive partition name (H=)
- Running recover partitions does correctly discover these partitions...
- ...but they are then stored internally as lowercase (h)

The 'recover partitions' command is an extension of Hive authored by Amazon. I strongly suspect that the bug is in this component. To my knowledge native Hive has no concept of exploring a file root for partition discovery...
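
(For the record, stock Apache Hive does ship a comparable facility, MSCK REPAIR TABLE, which scans the table's root directory and registers any partition directories it finds; EMR's RECOVER PARTITIONS is Amazon's own take on the same idea.)

-- Apache Hive's built-in rough equivalent of EMR's 'recover partitions':
msck repair table logs;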

Answer

It's a case issue on the hour field!
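
To spell out the fix as code (a sketch that assumes the log files have been re-uploaded under lowercase h= prefixes, per Update #2 above):

-- With S3 keys like s3://($BUCKET)/y=2012/m=11/d=09/h=10/log.gz, the
-- lowercase partition names Hive stores internally now match the
-- (case-sensitive) S3 paths, so recovered partitions return data.
create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

alter table logs recover partitions;
select count(*) from logs;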