DynamoDB - Incremental Export

This POC validates the Incremental Export feature that DynamoDB launched in September 2023.

The gen_data.py script generates one item per second and writes it to DynamoDB.

The export_data.py script runs an initial load (full export) and an incremental load (incremental export), then reads back the exported data.

export_data.py

# -*- coding: utf-8 -*-

import gzip
import polars as pl
from datetime import datetime, timezone

from s3pathlib import S3Path
from aws_dynamodb_io.api import ExportJob, ExportFormatEnum, ExportTypeEnum
from simpletype.api import String, Integer
from dynamodb_json_seder.api import deserialize_df

from gen_data import (
    Measurement,
    tb_name,
    bsm,
)


bucket = f"{bsm.aws_account_alias}-{bsm.aws_region}-data"
s3dir_root = S3Path(
    f"s3://{bucket}/projects/learn_aws/dynamodb-solutions/incremental-export/"
).to_dir()
table_arn = f"arn:aws:dynamodb:{bsm.aws_region}:{bsm.aws_account_id}:table/{tb_name}"
simple_schema = {
    "id": String(),
    "time": String(),
    "value": Integer(),
}
polars_schema = {k: v.to_dynamodb_json_polars() for k, v in simple_schema.items()}


def export_initial_data():
    """Start a full export of the table as of a fixed export time."""
    export_job = ExportJob.export_table_to_point_in_time(
        dynamodb_client=bsm.dynamodb_client,
        table_arn=table_arn,
        s3_bucket=s3dir_root.bucket,
        s3_prefix=s3dir_root.key,
        # A naive datetime is treated as local time by astimezone(), then converted to UTC.
        export_time=datetime(2024, 9, 14, 23, 5, 0).astimezone(timezone.utc),
        export_format=ExportFormatEnum.DYNAMODB_JSON.value,
    )
    print(f"export_arn = {export_job.arn}")


def export_incremental_data():
    """Start an incremental export covering a fixed 15-minute window."""
    export_job = ExportJob.export_table_to_point_in_time(
        dynamodb_client=bsm.dynamodb_client,
        table_arn=table_arn,
        s3_bucket=s3dir_root.bucket,
        s3_prefix=s3dir_root.key,
        export_format=ExportFormatEnum.DYNAMODB_JSON.value,
        export_type=ExportTypeEnum.INCREMENTAL_EXPORT.value,
        incremental_export_specification=dict(
            ExportFromTime=datetime(2024, 9, 14, 23, 5, 0).astimezone(timezone.utc),
            ExportToTime=datetime(2024, 9, 14, 23, 20, 0).astimezone(timezone.utc),
        ),
    )
    print(f"export_arn = {export_job.arn}")


def read_df_from_init_export(export_job: ExportJob):
    """Read every data file of a full export; each record sits under the "Item" field."""
    data_file_list = export_job.get_data_files(bsm.dynamodb_client, bsm.s3_client)
    sub_df_list = list()
    for data_file in data_file_list:
        s3path = S3Path(f"s3://{bucket}/{data_file.s3_key}")
        b = s3path.read_bytes(bsm=bsm)
        sub_df = pl.read_ndjson(
            gzip.decompress(b),
            schema={"Item": pl.Struct(polars_schema)},
        )
        sub_df_list.append(sub_df)
    df = pl.concat(sub_df_list)
    df = deserialize_df(df, simple_schema, dynamodb_json_col="Item")
    return df


def read_df_from_incr_export(export_job: ExportJob):
    """Read every data file of an incremental export; each record sits under "NewImage"."""
    data_file_list = export_job.get_data_files(bsm.dynamodb_client, bsm.s3_client)
    sub_df_list = list()
    for data_file in data_file_list:
        s3path = S3Path(f"s3://{bucket}/{data_file.s3_key}")
        b = s3path.read_bytes(bsm=bsm)
        sub_df = pl.read_ndjson(
            gzip.decompress(b),
            schema={"NewImage": pl.Struct(polars_schema)},
        )
        sub_df_list.append(sub_df)
    df = pl.concat(sub_df_list)
    df = deserialize_df(df, simple_schema, dynamodb_json_col="NewImage")
    return df


def exam_overlap(init_arn: str, incr_arn: str):
    """Compare the last item of the full export with the first item of the
    incremental export to see how the time boundaries are handled."""
    init_export = ExportJob.describe_export(bsm.dynamodb_client, init_arn)
    incr_export = ExportJob.describe_export(bsm.dynamodb_client, incr_arn)
    df_init = read_df_from_init_export(init_export)
    df_incr = read_df_from_incr_export(incr_export)
    df_init = df_init.sort("time")
    df_incr = df_incr.sort("time")
    print(df_init.tail(1).to_dicts())
    print(df_incr.head(1).to_dicts())



if __name__ == "__main__":
    # export_initial_data()
    init_arn = "arn:aws:dynamodb:us-east-1:878625312159:table/incremental_export_poc-measurement/export/01726369892818-7b359b68"
    # export_incremental_data()
    incr_arn = "arn:aws:dynamodb:us-east-1:878625312159:table/incremental_export_poc-measurement/export/01726370424000-3dcd3992"
    exam_overlap(init_arn, incr_arn)

Key Findings

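- For a full export, the export time is the end time and it is exclusive (items written exactly at the export time are not included).
- For an incremental export, the start time is inclusive and the end time is exclusive.
- Full export data is stored under the Item field.
- Incremental export data is stored under the NewImage field.
- The window of an incremental export must be at least 15 minutes.
- An incremental export creates a data directory directly under the S3 prefix to store its data, whereas a full export automatically creates a subfolder based on the timestamp. It is therefore a good idea to include the timestamps of the export period in the folder name of the incremental export prefix, so that different incremental exports are less likely to interfere with each other.
- Even for a very small amount of data, an export usually takes about 5 minutes.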
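
The findings about the 15-minute minimum window and the shared data directory can be folded into a small helper that validates the window and builds a period-stamped prefix. This is only a minimal sketch, not the code used in this POC: the helper `build_incremental_export_kwargs`, the prefix layout, and the example ARN and bucket are hypothetical, while the keyword names are meant to match boto3's `export_table_to_point_in_time`.

```python
from datetime import datetime, timedelta, timezone

MIN_WINDOW = timedelta(minutes=15)  # minimum incremental window noted above
TS_FORMAT = "%Y%m%dT%H%M%SZ"        # hypothetical timestamp format for folder names


def build_incremental_export_kwargs(
    table_arn: str,
    s3_bucket: str,
    base_prefix: str,
    start: datetime,
    end: datetime,
) -> dict:
    """Build keyword arguments for an incremental export request.

    Hypothetical helper: ``start`` is inclusive and ``end`` is exclusive.
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if end - start < MIN_WINDOW:
        raise ValueError("incremental export window must be at least 15 minutes")
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    # Put the export period into the prefix so that each export has its own folder.
    period = f"{start_utc.strftime(TS_FORMAT)}-{end_utc.strftime(TS_FORMAT)}"
    return {
        "TableArn": table_arn,
        "S3Bucket": s3_bucket,
        "S3Prefix": f"{base_prefix.rstrip('/')}/{period}/",
        "ExportFormat": "DYNAMODB_JSON",
        "ExportType": "INCREMENTAL_EXPORT",
        "IncrementalExportSpecification": {
            "ExportFromTime": start_utc,
            "ExportToTime": end_utc,
            "ExportViewType": "NEW_IMAGE",
        },
    }


start = datetime(2024, 9, 14, 23, 5, tzinfo=timezone.utc)
kwargs = build_incremental_export_kwargs(
    table_arn="arn:aws:dynamodb:us-east-1:111122223333:table/example-table",
    s3_bucket="example-bucket",
    base_prefix="exports/incremental/",
    start=start,
    end=start + timedelta(minutes=15),
)
print(kwargs["S3Prefix"])
```

With a real client, the dictionary would be passed as `boto3.client("dynamodb").export_table_to_point_in_time(**kwargs)`. Keeping the request construction in a pure function makes the window check easy to test without touching AWS.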

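From these findings, syncing DynamoDB to a data warehouse no longer requires fancy stream processing: a data latency of no more than 20 to 30 minutes is achievable, which is enough for most applications. The 20-minute figure assumes an item written at 8:00. The earliest you can start an incremental export with a 15-minute window covering it is 8:15, the export itself takes about 5 minutes, and data processing takes about 1 minute, so the total latency is roughly 20 minutes. You can also run an export for 7:50 - 8:05 as early as 8:05 and filter out the data you do not need during processing, which brings the latency down to about 10 minutes. Either way, the latency can never be lower than the time the export itself takes (about 5 minutes).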
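
The latency arithmetic above can be checked with a few lines of Python. This is a rough sketch under the same assumptions as the text (15-minute window, about 5 minutes per export, about 1 minute of processing); the function names `floor_to_window` and `available_at` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)
EXPORT_DURATION = timedelta(minutes=5)      # observed in this POC, not guaranteed
PROCESSING_DURATION = timedelta(minutes=1)  # assumed processing time


def floor_to_window(ts: datetime, window: timedelta = WINDOW) -> datetime:
    """Round an aware datetime down to the previous window boundary."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return ts - (ts - epoch) % window


def available_at(window_end: datetime) -> datetime:
    """Time at which data from a window ending at ``window_end`` is queryable.

    The export can only start once the window has ended.
    """
    return window_end + EXPORT_DURATION + PROCESSING_DURATION


written = datetime(2024, 9, 14, 8, 0, tzinfo=timezone.utc)

# Aligned windows: the item written at 8:00 falls into [8:00, 8:15).
aligned_end = floor_to_window(written) + WINDOW
aligned_latency = available_at(aligned_end) - written

# Overlapping window 7:50 - 8:05, started at 8:05, with filtering afterwards.
overlap_end = datetime(2024, 9, 14, 8, 5, tzinfo=timezone.utc)
overlap_latency = available_at(overlap_end) - written

print(aligned_latency, overlap_latency)
```

Under these assumptions the aligned schedule gives 21 minutes and the overlapping schedule gives 11 minutes for the item written at 8:00, which matches the rough 20-minute and 10-minute figures in the text.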

DynamoDB to Data Lake Solution

This solution does not need any orchestration. Lambda functions alone are enough.

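Initial Export Lambda:
- Description: This Lambda is responsible for enabling PITR and then starting the Full Export job at the right time.
- Detail: It keeps checking whether PITR is enabled on the target table. Once it is, the Lambda starts an export job after a 15-minute boundary has passed and writes a tracker file to S3, recording that time has advanced to this initial-load export time. When the Incremental Export Lambda later sees this S3 file, it knows that incremental exports can begin.
- Schedule: runs every 15 minutes.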
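Initial Export Data Processing Lambda:
- Description: This Lambda is responsible for processing the Full Export data and writing it to the Data Lake.
- Detail: This Lambda runs every 5 minutes and checks whether the Full Export has finished. If it has, the Lambda reads the Full Export data and writes it to the Data Lake. The directory structure in the Data Lake is determined by the timestamp of the Full Export.
- Schedule: runs every 15 minutes.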
Incremental Export Lambda:
- Schedule: runs every 5 minutes.
Incremental Export Data Processing Lambda.
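
The tracker hand-off between the Initial Export Lambda and the Incremental Export Lambda could look roughly like the sketch below. Everything here is an assumption made for illustration: the JSON layout with an `exported_until` key and the functions `dump_tracker`, `load_tracker`, and `next_window` are hypothetical, and reading or writing the tracker object in S3 is left out.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

WINDOW = timedelta(minutes=15)


def dump_tracker(exported_until: datetime) -> str:
    """Serialize the point in time up to which data has been exported."""
    as_utc = exported_until.astimezone(timezone.utc)
    return json.dumps({"exported_until": as_utc.isoformat()})


def load_tracker(body: str) -> datetime:
    """Parse the tracker file content back into an aware datetime."""
    return datetime.fromisoformat(json.loads(body)["exported_until"])


def next_window(
    exported_until: datetime,
    now: datetime,
    window: timedelta = WINDOW,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the next (start, end) window, or None if it has not ended yet."""
    end = exported_until + window
    if end > now:
        return None
    return exported_until, end


# The Initial Export Lambda ran a full export as of 8:00 and wrote the tracker.
tracker_body = dump_tracker(datetime(2024, 9, 14, 8, 0, tzinfo=timezone.utc))

# The Incremental Export Lambda wakes up at 8:10 and again at 8:16.
exported_until = load_tracker(tracker_body)
too_early = next_window(exported_until, datetime(2024, 9, 14, 8, 10, tzinfo=timezone.utc))
ready = next_window(exported_until, datetime(2024, 9, 14, 8, 16, tzinfo=timezone.utc))
print(too_early, ready)
```

In this sketch the Lambda does nothing at 8:10 because the 8:00 - 8:15 window is still open, and at 8:16 it gets that window back. After starting the export it would write the tracker again with the window end, so the next run continues from 8:15.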

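First enable PITR. Then wait about 15 minutes, find the most recent 15-minute boundary before the current time, and run a Full Export as of that boundary. For example, if you enabled PITR at 7:55, then at 8:00 you run a Full Export of all data written before 8:00, and then run once with a locally executed program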

You need two Lambda functions that run on a schedule.