Copying from HDFS to S3 with Hadoop (tags: HDFS, Hadoop)

2023-09-11 08:24:32 Author: 壹陣感情風


I've successfully completed a Mahout vectorizing job on Amazon EMR (using Mahout on Elastic MapReduce as a reference). Now I want to copy the results from HDFS to S3 (to use them in future clustering).

For that I used hadoop distcp:

den@aws:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
> --arg hdfs://my.bucket/prj1/seqfiles \
> --arg s3n://ACCESS_KEY:SECRET_KEY@my.bucket/prj1/seqfiles \
> -j $JOBID
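
As a side note, the distcp jar submitted above as a job-flow step performs the same parallel copy that a plain distcp run on the cluster itself does. Below is a minimal sketch of that direct form, assuming SSH access to the master node, with a placeholder source path and with the S3 credentials passed as configuration properties instead of being embedded in the URI (the fs.s3n.* property names are the classic Hadoop 1.x keys and are an assumption here, not something taken from this post):

# Illustrative sketch only: run distcp directly on the EMR master node
# instead of submitting the distcp jar as a job-flow step.
# The source path is a placeholder on the cluster's default filesystem;
# the fs.s3n.* credential keys are assumed (Hadoop 1.x era).
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=ACCESS_KEY \
  -Dfs.s3n.awsSecretAccessKey=SECRET_KEY \
  hdfs:///prj1/seqfiles \
  s3n://my.bucket/prj1/seqfiles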

It failed. I found a suggestion to use s3distcp instead and tried that as well:

elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs://my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'

In both cases I get the same error: java.net.UnknownHostException: unknown host: my.bucket. Below is the full error output for the second case.

2012-09-06 13:25:08,209 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.net.UnknownHostException: unknown host: my.bucket
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1193)
    at org.apache.hadoop.ipc.Client.call(Client.java:1047)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:127)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:249)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:214)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1413)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:68)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1431)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:431)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

Solution

I've found the problem:

The main problem is not

java.net.UnknownHostException: unknown host: my.bucket

but:

2012-09-06 13:27:33,909 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system

So, after adding one more slash to the source path, the job started without problems. The correct command is:

elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs:///my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'
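
For what it's worth, the extra slash seems to matter because of URI parsing: in hdfs://my.bucket/prj1/seqfiles Hadoop takes my.bucket as the NameNode hostname (hence the UnknownHostException), while hdfs:///my.bucket/prj1/seqfiles uses the cluster's default filesystem and treats /my.bucket/prj1/seqfiles as an ordinary absolute HDFS path. An illustrative way to check the source path from the master node before launching the copy, assuming the files really live under /my.bucket/prj1/seqfiles on HDFS as in the command above:

# Quick sanity check on the EMR master node (illustrative):
hadoop fs -ls hdfs://my.bucket/prj1/seqfiles     # fails: "my.bucket" is parsed as a NameNode hostname
hadoop fs -ls hdfs:///my.bucket/prj1/seqfiles    # lists the sequence files on the default filesystem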

P.S. It works: the job finished correctly, and I successfully copied a directory with 30 GB of files.

 