Run Remote Command on EC2: The Ultimate Solution#

Keywords: AWS, EC2, System, Systems, Manager, SSM, Python, Remote, Command

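What We Want to Do#

For many reasons, most often network-related ones, code frequently has to run inside an EC2 environment. As developers, how can we run commands inside EC2 as easily as we run a Python automation script on a local machine? If we can do that, it opens up a wide range of possibilities. Let's break the requirements down in detail: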

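In terms of the complexity of the command to run, there are two cases:

A single terminal command, for example aws s3 ls.
A command that takes the form of a Python script, where the actual logic is defined in the script. The script is not prepared on the instance in advance; in other words, we first have to upload it to the EC2 environment before running it.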

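In terms of the feedback we need, there are three cases:

I only need the command to run; I don't need any feedback.
I need to know whether the return code indicates success (0) or failure (non-zero).
Beyond the execution status, the command may also return some data, and I need that data too.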

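In terms of who initiates the command, there are two cases:

Only my development machine needs to be able to initiate the command.
The command must be launchable from anywhere that has permission, for example another EC2 instance or a Lambda function.

These requirements combine freely, giving 2 * 3 * 2 = 12 cases. Is there a single solution that covers all 12? The answer is yes, and the following sections describe it in detail.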

Exploring Possible Solutions#

Let's go through the requirements above one by one and see what each of them really comes down to.

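A single terminal command, for example aws s3 ls.

Nothing special here: it is just one remote command.
A command that takes the form of a Python script, where the actual logic is defined in the script. The script is not prepared on the instance in advance; in other words, we first have to upload it to the EC2 environment before running it.

This means we need a simple, repeatable, and secure way to upload an arbitrary script to the EC2 environment.
I only need the command to run; I don't need any feedback.

Nothing special here either: just run it.
I need to know whether the return code indicates success (0) or failure (non-zero).

This requires us to capture the return code.
Beyond the execution status, the command may also return some data, and I need that data too.

Either the command is designed to write its result to stdout, in which case we only need to capture stdout; or it uploads the data to an intermediate store such as S3 at run time, and we read the data back from S3 afterwards.
Only my development machine needs to be able to initiate the command.

Either my machine can SSH into the EC2 instance, or it has the relevant AWS permissions, chiefly the permission to use AWS Systems Manager Run Command. This is an AWS-managed service that uses the SSM Agent to execute arbitrary commands on EC2.
The command must be launchable from anywhere that has permission, for example another EC2 instance or a Lambda function.

The initiator only needs the AWS Systems Manager Run Command permission mentioned above. The development machine can of course have this permission as well.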

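We now have a rough idea of how to address each requirement. The next step is to combine these pieces into one complete solution. Before that, let's look at the core technology involved: AWS SSM Run Command.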

AWS SSM Run Command#

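AWS Systems Manager is a long-standing AWS service, mainly used to manage EC2 instances in bulk. You can think of it as the AWS counterpart of Ansible. Its core component is the Systems Manager Agent (SSM Agent), which is essentially server-side software installed on the EC2 machine and running as a system service. Much of AWS's own management of EC2 goes through the SSM Agent. "Run Command" is a feature of SSM that executes remote commands through the SSM Agent.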

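In short, we chose SSM Run Command as the core of our solution for the following reasons:

SSM Run Command is protected by IAM Role permissions, which makes it secure and flexible. It works with all kinds of AWS services, so we can initiate a Run Command from within any of them.
SSM Run Command is free of charge and supports very high concurrency.
SSM Run Command can capture the return code, stdout, and stderr, which lets us satisfy all of the requirements above.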
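
To make the third point concrete, here is a minimal sketch of pulling the return code, stdout, and stderr out of a get_command_invocation response. The helper name summarize_invocation and the sample response are hypothetical and hand-written for illustration; only the response keys (Status, ResponseCode, StandardOutputContent, StandardErrorContent) come from the boto3 API.

```python
import typing as T


def summarize_invocation(response: T.Dict[str, T.Any]) -> T.Dict[str, T.Any]:
    """Pick the fields of a get_command_invocation response that matter here."""
    return {
        "status": response.get("Status"),
        "return_code": response.get("ResponseCode"),
        "stdout": response.get("StandardOutputContent", ""),
        "stderr": response.get("StandardErrorContent", ""),
    }


# A hand-written sample response (not real output), trimmed to the relevant keys.
sample_response = {
    "CommandId": "hypothetical-command-id",
    "InstanceId": "i-0123456789abcdef0",
    "Status": "Success",
    "ResponseCode": 0,
    "StandardOutputContent": "hello\n",
    "StandardErrorContent": "",
}

summary = summarize_invocation(sample_response)
print(summary)
```

With a real client, the response would come from ssm_client.get_command_invocation(CommandId=..., InstanceId=...).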

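SSM Run Command does have some limitations.

A Run Command sent through the API is limited in size and cannot exceed 100KB. If you need to send a large amount of data, change your remote program so that it accepts an S3 URI as an argument and reads its input from that URI.
Stdout is also limited in size: the API returns at most 24000 characters. If you need to capture a large amount of data, change your remote program so that it saves its result to S3.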
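
A caller may want to guard against both limits before trusting a result. The sketch below is one possible way to do that; the function names are hypothetical, and the two constants simply restate the figures quoted above, so they should be treated as assumptions rather than authoritative quotas.

```python
import typing as T

# Figures quoted in the text above; treat them as assumptions and check the
# current AWS documentation before relying on them.
STDOUT_API_LIMIT = 24000  # characters of stdout returned by the API
COMMAND_SIZE_LIMIT = 100 * 1024  # bytes allowed for the commands sent via the API


def stdout_may_be_truncated(stdout: str) -> bool:
    """Return True if stdout is long enough that the API may have cut it off."""
    return len(stdout) >= STDOUT_API_LIMIT


def commands_fit_in_request(commands: T.List[str]) -> bool:
    """Return True if the UTF-8 size of all commands stays under the limit."""
    total = sum(len(command.encode("utf-8")) for command in commands)
    return total < COMMAND_SIZE_LIMIT


print(stdout_may_be_truncated("ok"))
print(stdout_may_be_truncated("x" * 24000))
print(commands_fit_in_request(["aws s3 ls"]))
```

If stdout_may_be_truncated returns True, parsing the output as JSON is likely to fail, which is the signal to switch the remote program to writing its result to S3.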

I won't go into the details of SSM Run Command here. I recommend reading the blog post Run Remote Command on EC2 via SSM first to get a basic understanding of it.

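The Final Solution#

To run a single terminal command, use SSM Run Command directly.
To run a more complex Python script, first upload the local script to S3, then use Run Command to run a first command, aws s3 cp s3://... /tmp/...script.py, that downloads it onto the EC2 instance, and then run the script with a Python interpreter of your choice. If the script is a command-line tool, we can also pass arguments. Note that the AWS CLI must be preinstalled on the EC2 instance.
To capture the result a command returns, either make sure the command prints structured data such as JSON to stdout (note that logging may interfere with the returned value), or have it upload the data to S3 during the run and then read the data back from S3.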
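
The two-command sequence for the script case can be sketched as a small pure function. The name build_commands, its parameters, and the example bucket and paths are hypothetical. The sketch quotes every argument with shlex so that values containing spaces survive the shell, and it discards the stdout of the download so it does not mix with the script's own output.

```python
import shlex
import typing as T


def build_commands(
    s3_uri: str,
    path_python: str,
    path_code: str,
    args: T.List[str],
    path_aws: str = "aws",
) -> T.List[str]:
    """Build the download command and the run command for SSM Run Command."""
    # Step 1: download the script from S3; drop the CLI's progress output on stdout.
    download = (
        f"{shlex.quote(path_aws)} s3 cp "
        f"{shlex.quote(s3_uri)} {shlex.quote(path_code)} > /dev/null"
    )
    # Step 2: run the script with the chosen interpreter and shell-quoted arguments.
    run = shlex.join([path_python, path_code, *args])
    return [download, run]


commands = build_commands(
    s3_uri="s3://my-bucket/script.py",  # hypothetical bucket
    path_python="/usr/bin/python3",  # hypothetical interpreter path
    path_code="/tmp/script.py",
    args=["--name", "hello world"],
)
for command in commands:
    print(command)
```

The returned list can be passed as the commands parameter of the AWS-RunShellScript document.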

A Practical Example#

script.py is the command we want to run on EC2. The script shown later uploads it to S3, then downloads and runs it on the EC2 instance.

# -*- coding: utf-8 -*-

"""
A script that needs to run on EC2. It prints a JSON document containing
a string with special characters to stdout.
"""

import sys
import json


def run() -> dict:
    print("start")
    print("done")
    return {
        "python": sys.executable,
        "weird_string": "\\a\nb\tc\"d'e@f#g:h/i"
    }


if __name__ == "__main__":
    print(json.dumps(run()))
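
The script prints plain log lines before the JSON, and the weird string itself contains a newline, so it is worth checking that reading the last line of stdout is still safe. The sketch below simulates the captured stdout locally (no EC2 involved). It relies on json.dumps with its default settings escaping newlines and other control characters, so the JSON document always stays on a single line.

```python
import json

payload = {"weird_string": "\\a\nb\tc\"d'e@f#g:h/i"}

# Simulate what the captured stdout would look like: log lines, then the JSON.
stdout = "start\ndone\n" + json.dumps(payload) + "\n"

# The newline inside the string is escaped by json.dumps, so the JSON
# document occupies exactly one line and the last line is the full result.
lines = stdout.splitlines()
parsed = json.loads(lines[-1])

print(len(lines))
print(parsed == payload)
```

This only holds as long as nothing is printed after the JSON; logging that goes to stdout after the final print would break the last-line convention.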

ssm_remote_command_helpers.py is a library that makes it easy to call Run Command.

# -*- coding: utf-8 -*-

"""
This module allows you to run remote commands on an EC2 instance via SSM in 'sync' mode.
The original ssm_client.send_command() is an 'async' call, which means you have to
poll the status of the command execution via ssm_client.get_command_invocation().
This module hides the complexity of polling and provides a simple interface.

Requirements:

    func_args>=0.1.1,<1.0.0

.. _send_command: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm/client/send_command.html
.. _get_command_invocation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm/client/get_command_invocation.html
"""

import typing as T
import sys
import enum
import time
import itertools
import dataclasses

from func_args import resolve_kwargs, NOTHING

if T.TYPE_CHECKING:
    from mypy_boto3_ssm.client import SSMClient  # pip install "boto3_stubs[ssm]"


class Waiter:
    """
    Simple retry / polling with progress status. It is common to check
    whether a long-running job is done every X seconds and to time out after
    Y seconds. This class allows you to customize the polling interval and timeout.

    Example:

    .. code-block:: python

        print("before waiter")

        for attempt, elapse in Waiter(
            delays=1,
            timeout=10,
            verbose=True,
        ):
            # check if should jump out of the polling loop
            if elapse >= 5:
                print("")
                break

        print("after waiter")
    """

    def __init__(
        self,
        delays: T.Union[int, float],
        timeout: T.Union[int, float],
        indent: int = 0,
        verbose: bool = True,
    ):
        self._delays = delays
        self.delays = itertools.repeat(delays)
        self.timeout = timeout
        self.tab = " " * indent
        self.verbose = verbose

    def __iter__(self):
        if self.verbose:  # pragma: no cover
            sys.stdout.write(
                f"start waiter, polling every {self._delays} seconds, "
                f"timeout in {self.timeout} seconds.\n"
            )
            sys.stdout.flush()
            sys.stdout.write(
                f"\r{self.tab}on 0 th attempt, "
                f"elapsed 0 seconds, "
                f"remain {self.timeout} seconds ..."
            )
            sys.stdout.flush()
        start = time.time()
        end = start + self.timeout
        yield 0, 0
        for attempt, delay in enumerate(self.delays, 1):
            now = time.time()
            remaining = end - now
            if remaining < 0:
                raise TimeoutError(f"timed out in {self.timeout} seconds!")
            else:
                time.sleep(min(delay, remaining))
                elapsed = int(now - start + delay)
                if self.verbose:  # pragma: no cover
                    sys.stdout.write(
                        f"\r{self.tab}on {attempt} th attempt, "
                        f"elapsed {elapsed} seconds, "
                        f"remain {self.timeout - elapsed} seconds ..."
                    )
                    sys.stdout.flush()
                yield attempt, int(elapsed)


def send_command(
    ssm_client: "SSMClient",
    instance_id: str,
    commands: T.List[str],
    comment: str = NOTHING,
    output_s3_bucket_name: str = NOTHING,
    output_s3_key_prefix: str = NOTHING,
) -> str:
    """
    A simple wrapper of ``ssm_client.send_command`` that executes a sequence
    of commands on one EC2 instance.

    Reference:

    - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm/client/send_command.html
    """
    res = ssm_client.send_command(
        **resolve_kwargs(
            InstanceIds=[
                instance_id,
            ],
            DocumentName="AWS-RunShellScript",
            DocumentVersion="1",
            Parameters={"commands": commands},
            Comment=comment,
            OutputS3BucketName=output_s3_bucket_name,
            OutputS3KeyPrefix=output_s3_key_prefix,
        )
    )
    command_id = res["Command"]["CommandId"]
    return command_id


class CommandInvocationStatusEnum(str, enum.Enum):
    """
    Reference:

    - get_command_invocation_
    """

    Pending = "Pending"
    InProgress = "InProgress"
    Delayed = "Delayed"
    Success = "Success"
    Cancelled = "Cancelled"
    TimedOut = "TimedOut"
    Failed = "Failed"
    Cancelling = "Cancelling"


@dataclasses.dataclass
class CommandInvocation:
    """
    Represents a Command Invocation details returned from a
    get_command_invocation_ API call.
    """

    CommandId: T.Optional[str] = dataclasses.field(default=None)
    InstanceId: T.Optional[str] = dataclasses.field(default=None)
    Comment: T.Optional[str] = dataclasses.field(default=None)
    DocumentName: T.Optional[str] = dataclasses.field(default=None)
    DocumentVersion: T.Optional[str] = dataclasses.field(default=None)
    PluginName: T.Optional[str] = dataclasses.field(default=None)
    ResponseCode: T.Optional[int] = dataclasses.field(default=None)
    ExecutionStartDateTime: T.Optional[str] = dataclasses.field(default=None)
    ExecutionElapsedTime: T.Optional[str] = dataclasses.field(default=None)
    ExecutionEndDateTime: T.Optional[str] = dataclasses.field(default=None)
    Status: T.Optional[str] = dataclasses.field(default=None)
    StatusDetails: T.Optional[str] = dataclasses.field(default=None)
    StandardOutputContent: T.Optional[str] = dataclasses.field(default=None)
    StandardOutputUrl: T.Optional[str] = dataclasses.field(default=None)
    StandardErrorContent: T.Optional[str] = dataclasses.field(default=None)
    StandardErrorUrl: T.Optional[str] = dataclasses.field(default=None)
    CloudWatchOutputConfig: T.Optional[dict] = dataclasses.field(default=None)

    @classmethod
    def from_get_command_invocation_response(
        cls,
        response: dict,
    ) -> "CommandInvocation":
        """
        Reference:

        - get_command_invocation_
        """
        kwargs = {
            field.name: response.get(field.name) for field in dataclasses.fields(cls)
        }
        return cls(**kwargs)

    @classmethod
    def get(
        cls,
        ssm_client: "SSMClient",
        command_id: str,
        instance_id: str,
    ) -> "CommandInvocation":
        """
        A wrapper around get_command_invocation_ API call.

        Reference:

        - get_command_invocation_
        """
        response = ssm_client.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id,
        )
        return cls.from_get_command_invocation_response(response)

    def to_dict(self) -> dict:
        return dataclasses.asdict(self)


def wait_until_command_succeeded(
    ssm_client: "SSMClient",
    command_id: str,
    instance_id: str,
    delays: int = 3,
    timeout: int = 60,
    verbose: bool = True,
) -> CommandInvocation:
    """
    After you call send_command_ API, you can use this function to wait until
    it succeeds. If it fails, it will raise an exception.

    Reference:

    - get_command_invocation_
    """
    for _ in Waiter(delays=delays, timeout=timeout, verbose=verbose):
        command_invocation = CommandInvocation.get(
            ssm_client=ssm_client,
            command_id=command_id,
            instance_id=instance_id,
        )
        if command_invocation.Status == CommandInvocationStatusEnum.Success.value:
            if verbose:
                # terminate the waiter's in-place progress line
                sys.stdout.write("\n")
            return command_invocation
        elif command_invocation.Status in [
            CommandInvocationStatusEnum.Cancelled.value,
            CommandInvocationStatusEnum.TimedOut.value,
            CommandInvocationStatusEnum.Failed.value,
            CommandInvocationStatusEnum.Cancelling.value,
        ]:
            raise Exception(f"Command failed, status: {command_invocation.Status}")
        else:
            pass
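
Because wait_until_command_succeeded raises on any non-success status, the caller never sees the return code or stderr of a failed command. If you need those (the "success or failure" requirement above), one option is a variant that returns the raw response for every terminal status. The sketch below is an assumption-laden illustration, not part of the library: wait_until_command_finished and StubSSMClient are hypothetical names, and the stub stands in for a real boto3 client so the example runs without AWS.

```python
import time

# Statuses after which the invocation will not change any more.
TERMINAL_STATUSES = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_until_command_finished(
    ssm_client,
    command_id: str,
    instance_id: str,
    delay: float = 3,
    timeout: float = 60,
) -> dict:
    """Poll until the invocation reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        response = ssm_client.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id,
        )
        if response["Status"] in TERMINAL_STATUSES:
            return response
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"timed out in {timeout} seconds!")
        time.sleep(delay)


class StubSSMClient:
    """Hypothetical stand-in for a boto3 SSM client, for local illustration only."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def get_command_invocation(self, CommandId, InstanceId):
        return next(self._responses)


stub = StubSSMClient(
    [
        {"Status": "InProgress"},
        {"Status": "Failed", "ResponseCode": 2, "StandardErrorContent": "boom"},
    ]
)
result = wait_until_command_finished(stub, "cmd-id", "i-0123", delay=0.01, timeout=1)
print(result["Status"], result["ResponseCode"], result["StandardErrorContent"])
```

The caller can then decide what to do with a non-zero ResponseCode instead of handling a generic exception.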

example.py is the final code that implements our solution.

# -*- coding: utf-8 -*-

"""
Requirements::

    pathlib_mate>=1.2.1,<2.0.0
    s3pathlib>=2.0.1,<3.0.0
    boto_session_manager>=1.5.1,<2.0.0
    rich
"""

import typing as T
import time
import json
import uuid

from pathlib_mate import Path
from s3pathlib import S3Path
from boto_session_manager import BotoSesManager
from rich import print as rprint

# import the functions we need from ssm_remote_command_helpers.py
from ssm_remote_command_helpers import (
    send_command,
    wait_until_command_succeeded,
)


def run(
    bsm: BotoSesManager,
    instance_id: str,
    path_python: Path,
    code: str,
    s3_path: S3Path,
    args: T.List[str],
):
    """
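    The main function of our solution. It wraps the functions in
    ssm_remote_command_helpers.py and automatically ships the script to EC2
    through S3 and runs it there.

    :param bsm: the boto session manager object
    :param instance_id: the EC2 instance id
    :param path_python: path of the Python interpreter on the EC2 instance;
        this lets you choose which interpreter runs the command
    :param code: source code (as a string) of the script to run on EC2
    :param s3_path: where on S3 to upload the source code
    :param args: extra arguments for the Python script, if any, given as a list,
        just like you would write subprocess.run([...]).
    """
    # upload the source code to S3
    s3_path.write_text(code)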

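    # generate a random path on EC2 to store the code
    path_code = f"/tmp/{uuid.uuid4().hex}.py"
    # download the code with the aws cli and discard its progress output on stdout,
    # so that it does not mix with the script's own output; errors stay on stderr
    command1 = f"/home/ubuntu/.pyenv/shims/aws s3 cp {s3_path.uri} {path_code} > /dev/null"
    # assemble the final command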
    args_ = [
        f"{path_python}",
        f"{path_code}",
    ]
    args_.extend(args)
    command2 = " ".join(args_)
    print(command1)
    print(command2)
    # run the commands remotely via SSM
    command_id = send_command(
        ssm_client=bsm.ssm_client,
        instance_id=instance_id,
        commands=[
            command1,
            command2,
        ],
    )
    # then wait for the command to finish
    time.sleep(1)  # wait a second first; a get issued right after send may not find the invocation yet
    command_invocation = wait_until_command_succeeded(
        ssm_client=bsm.ssm_client,
        command_id=command_id,
        instance_id=instance_id,
    )
    rprint(command_invocation)
    # inspect the return code and standard output, and parse the JSON our script printed
    print(command_invocation.ResponseCode)
    lines = command_invocation.StandardOutputContent.splitlines()
    output_data = json.loads(lines[-1])
    rprint(output_data)


if __name__ == "__main__":
    bsm = BotoSesManager(profile_name="bmt_app_dev_us_east_1")
    instance_id = "i-00f591fc972902fc5"
    path_python = Path("/home/ubuntu/.pyenv/shims/python")
    code = Path("script.py").read_text()
    s3path = S3Path(
        f"s3://{bsm.aws_account_id}-{bsm.aws_region}-data/projects/dev-exp-share/script.py"
    )
    args = []
    run(
        bsm=bsm,
        instance_id=instance_id,
        path_python=path_python,
        code=code,
        s3_path=s3path,
        args=args,
    )