Iterating through all items in a DynamoDB table

I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process but am doing this one-time to build an index table.)

I understand that DynamoDB's scan() function returns the lesser of 1MB or a supplied limit. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" result and re-queries starting from the LastEvaluatedKey to get all the results.

Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating up my allocated read units. It's extremely slow.

Here's my code:

def search(self, table, scan_filter=None, range_key=None,
           attributes_to_get=None,
           limit=None):
    """ Scan a table for values and return
        a list of items.
    """

    start_key = None
    num_results = 0
    total_results = []
    loop_iterations = 0
    request_limit = limit

    while num_results < limit:
        results = self.conn.layer1.scan(table_name=table,
                                        attributes_to_get=attributes_to_get,
                                        exclusive_start_key=start_key,
                                        limit=request_limit)
        num_results = num_results + len(results['Items'])
        # Note: raises KeyError on the final page, when DynamoDB
        # omits LastEvaluatedKey from the response.
        start_key = results['LastEvaluatedKey']
        total_results = total_results + results['Items']
        loop_iterations = loop_iterations + 1
        request_limit = request_limit - results['Count']

        print "Count: " + str(results['Count'])
        print "Scanned Count: " + str(results['ScannedCount'])
        print "Last Evaluated Key: " + str(results['LastEvaluatedKey']['HashKeyElement']['S'])
        print "Capacity: " + str(results['ConsumedCapacityUnits'])
        print "Loop Iterations: " + str(loop_iterations)

    return total_results

Calling the function:

db = DB()
results = db.search(table='media',limit=500,attributes_to_get=['id'])

And my output:

Count: 96
Scanned Count: 96
Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
Capacity: 517.5
Loop Iterations: 1
Count: 109
Scanned Count: 109
Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
Capacity: 516.5
Loop Iterations: 2
Count: 104
Scanned Count: 104
Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
Capacity: 516.0
Loop Iterations: 3
Count: 104
Scanned Count: 104
Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
Capacity: 513.0
Loop Iterations: 4
Count: 100
Scanned Count: 100
Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
Capacity: 516.5
Loop Iterations: 5

Is this expected behavior? Or, what am I doing wrong?

Accepted answer

You are not doing anything wrong.

This is closely related to the way Amazon computes the capacity unit. First, it is extremely important to understand that:

capacity units == reserved computational units
capacity units != reserved network transit

Well, even that is not strictly exact, but it is quite close, especially when it comes to Scan.

During a Scan operation, there are:

- scanned Items: the cumulated size is at most 1MB; it may be below that size if the limit is already reached
- returned Items: all the matching Items among the scanned Items

As the capacity unit is a compute unit, you pay for the scanned Items. Well, actually, you pay for the cumulated size of the scanned Items. Beware that this size includes all the storage and index overhead: 0.5 capacity units per cumulated KB.

The scanned size does not depend on any filter, be it a field selector or a result filter.
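
One way to see this is to scan the same page twice, once plain and once with a field selector, and compare the reported capacity. This is a minimal sketch, assuming conn is the same boto layer1-style connection object the question's code wraps in self.conn:

full = conn.layer1.scan(table_name='media', limit=100)
thin = conn.layer1.scan(table_name='media', limit=100,
                        attributes_to_get=['id'])

# You pay for what is scanned, not for what is returned, so both
# calls should report (roughly) the same ConsumedCapacityUnits.
print "Full scan capacity: " + str(full['ConsumedCapacityUnits'])
print "Thin scan capacity: " + str(thin['ConsumedCapacityUnits'])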

From your results, I would guess that your Items require ~10KB each, which your comment on their actual payload size tends to confirm.
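
Reversing the 0.5 capacity / KB rule against the first loop iteration of your output gives the same estimate; a back-of-the-envelope sketch, with the numbers copied from the output above:

consumed_capacity = 517.5              # ConsumedCapacityUnits for the page
scanned_count = 96                     # ScannedCount for the same page

cumulated_kb = consumed_capacity / 0.5        # ~1035 KB actually scanned
kb_per_item = cumulated_kb / scanned_count    # ~10.8 KB per item

print "Cumulated KB scanned: " + str(cumulated_kb)
print "Estimated KB per item: " + str(kb_per_item)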

I have a test table which contains only very small elements. A Scan consumes only 1.0 capacity unit to retrieve 100 Items because the cumulated size is < 2KB.
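
For completeness, here is a leaner version of the question's loop. It stops as soon as the response omits LastEvaluatedKey (DynamoDB drops that key on the last page) rather than relying on an item count. This is a sketch against the same boto layer1-style API as the question's code; scan_all and its parameters are illustrative names:

def scan_all(conn, table, attributes_to_get=None, page_limit=None):
    """Walk an entire table page by page until DynamoDB stops
    returning a LastEvaluatedKey."""
    items = []
    start_key = None
    while True:
        results = conn.layer1.scan(table_name=table,
                                   attributes_to_get=attributes_to_get,
                                   exclusive_start_key=start_key,
                                   limit=page_limit)
        items += results['Items']
        # Absent once the scan has covered the whole table.
        start_key = results.get('LastEvaluatedKey')
        if start_key is None:
            break
    return items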