Can I stream a file upload to S3 without a Content-Length header?

2023-09-11 23:37:48 · Author: 声音太销魂


I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file in a streaming manner to S3. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using multipart or chunked content-type.


S3 can support streaming uploads. For example, see here:

http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/


My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?

Recommended Answer


You have to upload your file in 5MiB+ chunks via S3's multipart API. Each of those chunks requires a Content-Length but you can avoid loading huge amounts of data (100MiB+) into memory.

1. Initiate the S3 Multipart Upload.
2. Gather data into a buffer until that buffer reaches S3's lower chunk-size limit (5MiB). Generate the MD5 checksum while building up the buffer.
3. Upload that buffer as a Part and store the ETag (read the docs on that one).
4. Once you reach the EOF of your data, upload the last chunk (which can be smaller than 5MiB).
5. Finalize the Multipart Upload.
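As a concrete illustration of that loop, here is a minimal sketch in Python using boto3 (my choice of library; the bucket and key names are placeholders, not anything from the original answer). It holds at most one 5MiB part in memory, uploads each part as it fills, and keeps a running MD5 of the whole stream:

import hashlib
import boto3  # assumption: boto3; the original answer does not name a library

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "generated/output.bin"  # placeholder names
PART_SIZE = 5 * 1024 * 1024                        # S3's minimum part size (5MiB)

def stream_to_s3(chunks):
    """Upload an iterable of byte chunks whose total length is unknown."""
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]
    parts, buffer, part_number = [], bytearray(), 1
    md5 = hashlib.md5()                            # checksum of the whole stream

    def upload_part(data):
        nonlocal part_number
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                              PartNumber=part_number, Body=bytes(data))
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

    for chunk in chunks:
        md5.update(chunk)
        buffer.extend(chunk)
        while len(buffer) >= PART_SIZE:            # flush full 5MiB parts as they fill
            upload_part(buffer[:PART_SIZE])
            del buffer[:PART_SIZE]

    if buffer:                                     # the last part may be smaller than 5MiB
        upload_part(buffer)

    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
    return md5.hexdigest()

If anything fails partway through, you should also call abort_multipart_upload so S3 does not keep the orphaned parts around (and keep billing you for them).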


S3 allows up to 10,000 parts. So by choosing a part-size of 5MiB you will be able to upload dynamic files of up to 50GiB. Should be enough for most use-cases.


However: if you need more, you have to increase your part-size, either by using a larger fixed part-size (10MiB, for example) or by increasing it during the upload, like so:

First 25 parts:   5MiB (total:  125MiB)
Next 25 parts:   10MiB (total:  375MiB)
Next 25 parts:   25MiB (total:    1GiB)
Next 25 parts:   50MiB (total: 2.25GiB)
After that:     100MiB
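One way to express such a growing schedule in code (purely illustrative; the breakpoints simply mirror the table above and are not anything S3 requires) is a small helper that maps a part number to its size:

def part_size_mib(part_number):
    """Part size in MiB for the growing schedule above:
    25 parts each at 5, 10, 25 and 50MiB, then 100MiB for the rest."""
    schedule = [(25, 5), (25, 10), (25, 25), (25, 50)]
    for count, size in schedule:
        if part_number <= count:
            return size
        part_number -= count
    return 100

# Capacity check: with S3's 10,000-part limit this schedule covers
# 25*5 + 25*10 + 25*25 + 25*50 + 9900*100 = 992,250 MiB, roughly 1TB.
total_mib = sum(part_size_mib(n) for n in range(1, 10001))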


This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.


As for the blog post you linked to: the author's problem is different from yours - he knows and uses the Content-Length before the upload. He wants to improve on this situation: many libraries handle uploads by loading all of the data from a file into memory. In pseudo-code that would be something like this:

data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()


His solution does this by getting the Content-Length via the filesystem API, then streaming the data from disk into the request stream. In pseudo-code:

upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()

input = File.open(file_name, File::READONLY_FLAG)

while (data = input.read())
  upload.write(data)
end

upload.flush()
upload.close()
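For comparison, a rough Python version of that disk-streaming pattern (assuming the requests library and a pre-signed PUT URL for the object, neither of which is part of the original answer): passing an open file object lets the client take the Content-Length from the file's size and stream the body from disk instead of loading it into memory.

import requests  # assumption: a pre-signed PUT URL for the target object is available

def put_file_streaming(presigned_url, file_name):
    # requests determines Content-Length from the file size on disk and
    # streams the request body from the open handle, not from memory.
    with open(file_name, "rb") as body:
        response = requests.put(presigned_url, data=body)
    response.raise_for_status()
    return response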