ClawPool Agent · Pool Management Console

🖥 Tenants Management

Hosts (Group)

shared skill(s) · scope per tenant with Skill Groups Manage →

Tenants

Status Tag

ID	Name	Status	Group	Tags	vCPU	Memory	Disk	Guest IP	Port	Rootfs	VM Health	Gateway	Actions
No tenants

⚙ Agent Configuration

Config Templates

OpenClaw configuration templates for different LLM providers and models.

No custom templates. Tenants use the built-in default config.

MCP Tools (via AgentCore Gateway)

All tenant VMs auto-connect to the AgentCore Gateway and gain these MCP tools. Tool definitions live in deploy/lambda/agentcore_tools/ and deploy/stack.py.

AgentCore not enabled, or no tools registered yet. Set agentcore.enabled: true in config.yml and redeploy.

Skill Groups

Groups bundle skills together so a tenant can subscribe via group: "team-sre" instead of listing every skill. A tenant's effective skill set = tenant.skills ∪ group.skills. Tenants without scoping fields get every skill (legacy broadcast).
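The resolution rule above can be sketched as follows. This is a minimal illustration, not the console's actual code: the `Tenant` dataclass, the `effective_skills` function, and the convention that `None` means "no scoping field set" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tenant:
    # Hypothetical shape: None means the scoping field is absent.
    skills: Optional[set] = None
    group: Optional[str] = None


def effective_skills(tenant, groups, all_skills):
    """Return tenant.skills ∪ group.skills, or every skill for unscoped tenants."""
    if tenant.skills is None and tenant.group is None:
        # Legacy broadcast: no scoping fields, so the tenant gets everything.
        return set(all_skills)
    own = tenant.skills or set()
    from_group = groups.get(tenant.group, set()) if tenant.group else set()
    return own | from_group


# Invented skill and group names, for illustration only.
all_skills = {"runbook", "oncall", "sql-tuning", "release-notes"}
groups = {"team-sre": {"runbook", "oncall"}}

scoped = effective_skills(Tenant(skills={"sql-tuning"}, group="team-sre"), groups, all_skills)
legacy = effective_skills(Tenant(), groups, all_skills)
```

Here `scoped` holds the tenant's own skill plus the two from `team-sre`, while `legacy` receives all four skills because neither scoping field is set.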

No groups defined. Click + New Group to bundle skills for a team.

Shared Skills

Skills are shared across all tenants. They're plain markdown files in s3://${ASSETS_BUCKET}/skills/<name>/SKILL.md and are synced to every host every 5 minutes via cron, then injected into VMs at launch. Click a row to view / edit. Use Groups (above) to scope skills to specific tenants.

No skills configured. Click + New Skill above (or upload to s3://${ASSETS_BUCKET}/skills/<name>/ directly).

Observability

Status

Prometheus (AMP)

Grafana (AMG) Workspace

SNS Notifications

💡 To enable: set metrics.enabled: true in config.yml and redeploy. The stack provisions Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG), and the ADOT collectors on each host start scraping about 3 minutes after rollout.

Grafana Data Source

In Grafana → Connections → Data sources → Add data source → Prometheus, fill in the values below. The Grafana workspace's IAM role already has read access to AMP, so no static keys are needed.

`Prometheus server URL`
`SigV4 auth`	On
`Authentication Provider`	AWS SDK Default (choosing "Workspace IAM Role" or "Access & secret key" returns 403)
`Default Region`

⚠️ Use the server URL above (workspace root) — not the remote_write URL. Grafana appends /api/v1/query itself.
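If only the remote_write URL is at hand, the workspace root can be derived by stripping the API suffix. A minimal sketch, assuming the usual AMP URL shape (`.../workspaces/<id>/api/v1/remote_write`); the workspace ID and region below are placeholders, and `workspace_root` is a hypothetical helper name.

```python
REMOTE_WRITE_SUFFIX = "api/v1/remote_write"


def workspace_root(remote_write_url: str) -> str:
    """Strip the remote_write suffix so the URL can serve as the Prometheus server URL."""
    url = remote_write_url.rstrip("/")
    if not url.endswith(REMOTE_WRITE_SUFFIX):
        raise ValueError("not a remote_write URL: " + remote_write_url)
    return url[: -len(REMOTE_WRITE_SUFFIX)]


# Placeholder workspace ID and region, for illustration only.
rw = "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write"
root = workspace_root(rw)

# Grafana adds the query path itself; built here only to show the relationship.
query_url = root + "api/v1/query"
```

Only `root` goes into the Prometheus server URL field. Entering the remote_write URL instead would make Grafana query a path that does not exist.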

ADOT remote_write:

Per-VM Metrics

Each host's host-agent exposes these gauges on :8899/metrics. ADOT scrapes every 30s and remote-writes to AMP via SigV4 (no static credentials).

Metric	Type	Labels	Description
`openclaw_vm_health`	gauge (0/1)	tenant	1 if VM responded to ping, else 0
`openclaw_vm_cpu_pct`	gauge	tenant	Per-VM CPU usage (percent of allocated vCPUs)
`openclaw_vm_memory_used_mb`	gauge	tenant	Per-VM memory in active use (MB, from VmRSS)
`openclaw_vm_memory_balloon_mib`	gauge	tenant	Balloon size held by the host (MiB)
`openclaw_vm_disk_used_mb`	gauge	tenant	Per-VM data disk used (MB)
`openclaw_vm_disk_total_mb`	gauge	tenant	Per-VM data disk capacity (MB)
`openclaw_vm_disk_used_pct`	gauge	tenant	Per-VM data disk used (percent)
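To spot-check a host without going through AMP, the endpoint's output can be parsed directly. The sketch below parses a hand-written sample in the Prometheus text exposition format. The sample values and tenant names are invented, and `parse_gauges` is a hypothetical helper that handles only the simple `name{tenant="..."} value` lines shown here.

```python
import re

# Hand-written sample of what :8899/metrics might return (values are invented).
SAMPLE = '''\
# TYPE openclaw_vm_health gauge
openclaw_vm_health{tenant="acme"} 1
openclaw_vm_health{tenant="globex"} 0
# TYPE openclaw_vm_disk_used_pct gauge
openclaw_vm_disk_used_pct{tenant="acme"} 42.5
openclaw_vm_disk_used_pct{tenant="globex"} 93
'''

# Matches: metric_name{tenant="value"} number
LINE = re.compile(r'^(\w+)\{tenant="([^"]*)"\}\s+(\S+)$')


def parse_gauges(text):
    """Return {metric: {tenant: value}}; comment lines and other shapes are skipped."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            name, tenant, value = m.groups()
            out.setdefault(name, {})[tenant] = float(value)
    return out


gauges = parse_gauges(SAMPLE)
```

In a real check, `SAMPLE` would be replaced by the body fetched from the host's `:8899/metrics` endpoint.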

Sample PromQL

Copy into Grafana → Explore → AMP datasource.

Memory used by all running VMs of a tenant

sum by (tenant) (openclaw_vm_memory_used_mb)

VMs (and therefore their hosts) that were unhealthy at least once in the last minute

min_over_time(openclaw_vm_health[1m]) == 0

Tenants over 90% disk usage

openclaw_vm_disk_used_pct > 90
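For readers less familiar with PromQL, the following sketch reproduces what the three queries compute, using plain Python over invented samples. It only illustrates the semantics and does not query AMP. With a 30s scrape interval, a 1m window holds about two samples per series.

```python
from collections import defaultdict

# sum by (tenant) (openclaw_vm_memory_used_mb)
# Invented latest samples, one tuple per series: (tenant, MB in use).
memory_used_mb = [("acme", 512.0), ("acme", 256.0), ("globex", 1024.0)]
memory_by_tenant = defaultdict(float)
for tenant, mb in memory_used_mb:
    memory_by_tenant[tenant] += mb

# min_over_time(openclaw_vm_health[1m]) == 0
# Invented samples from the last minute, keyed by tenant.
health_last_minute = {"acme": [1, 1], "globex": [1, 0]}
unhealthy = [t for t, samples in health_last_minute.items() if min(samples) == 0]

# openclaw_vm_disk_used_pct > 90
disk_used_pct = {"acme": 42.5, "globex": 93.0}
over_90 = [t for t, pct in disk_used_pct.items() if pct > 90]
```

Note that `min_over_time` flags a series if any sample in the window was 0, so a VM that has already recovered still shows up until the failed sample ages out.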

💾 Backups

Filter

Tenant	Source Status	Backup Time	Size	Actions

Backups are retained for 7 days (S3 lifecycle). Orphan backups are from tenants that have been deleted — restoring creates a new tenant with the backup's data volume.
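The two rules above can be sketched as follows. The `Backup` record, `is_orphan`, and `earliest_expiry` are hypothetical names for illustration. S3 lifecycle expiration runs asynchronously, so the computed time is a lower bound on when a backup becomes eligible for removal, not an exact deletion time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # from the 7-day S3 lifecycle rule


@dataclass
class Backup:
    tenant_id: str
    created_at: datetime


def is_orphan(backup, live_tenant_ids):
    """A backup is an orphan when its source tenant no longer exists."""
    return backup.tenant_id not in live_tenant_ids


def earliest_expiry(backup):
    """Lower bound on when the lifecycle rule may remove the backup."""
    return backup.created_at + RETENTION


# Invented tenant IDs and timestamp.
b = Backup("t-old", datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
orphan = is_orphan(b, live_tenant_ids={"t-new"})
expires = earliest_expiry(b)
```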

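🛠 Pool Ops · Cluster Scheduling Hub

Batch scheduling operations for the entire openclaw pool are gathered here. All of them call real control-plane endpoints (no mocks): batch lifecycle (POST /batch/tenants), rolling image rebuild (POST /hosts/refresh-rootfs), and capacity ledger reconciliation. Requires the operator or admin role.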

Batch Lifecycle

Action
Filter tag (optional; empty = all)

Rolling Image Rebuild

Pulls the latest golden image (rootfs + data template) to all hosts. Newly created or rebuilt nodes inherit the latest image.

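Capacity Ledger Reconciliation

Ledger vm_count vs. actual running/creating VMs. Zombies stuck in creating for more than 15 minutes have their capacity reclaimed automatically by the health_check reaper.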
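A minimal sketch of this reconciliation, assuming a per-host ledger count and a list of VM records that carry a state and a creation time. The `Vm` record and the `reconcile` function are hypothetical; the actual health_check reaper may work differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=15)  # threshold stated in the console text


@dataclass
class Vm:
    name: str
    state: str            # e.g. "running" or "creating"
    created_at: datetime


def reconcile(ledger_vm_count, vms, now):
    """Compare the ledger with reality and list creating VMs stuck past the threshold."""
    live = [v for v in vms if v.state in ("running", "creating")]
    zombies = [v.name for v in live
               if v.state == "creating" and now - v.created_at > STUCK_AFTER]
    return {
        "ledger": ledger_vm_count,
        "actual": len(live),
        "drift": ledger_vm_count - len(live),
        "zombies": zombies,
    }


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
vms = [
    Vm("a", "running", now - timedelta(hours=2)),
    Vm("b", "creating", now - timedelta(minutes=20)),  # stuck past 15 minutes
    Vm("c", "creating", now - timedelta(minutes=5)),   # still within the window
]
report = reconcile(ledger_vm_count=4, vms=vms, now=now)
```

A non-zero `drift` means the ledger and the real VM set disagree, and each name in `zombies` is a candidate for capacity reclamation.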

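🔥 Load Test · Control-Plane End-to-End Load Test

Fires N concurrent POST /tenants requests to measure the real p50/p99 latency of the control-plane registration API and the creating→running time-to-ready. Use it to verify how long N simultaneous launches take to become usable, and to expose host capacity, overcommit, and scheduling-contention bottlenecks. Load-test nodes are named with the prefix lt- and can be cleaned up with one click after the run.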

Concurrency N
vCPU
Memory (MB)

POST succeeded / total

POST p50

POST p99

POST max

Time until all running

running / succeeded

Failed
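The POST p50, p99, and max figures are order statistics over the per-request latencies. A minimal sketch, assuming the nearest-rank percentile definition (the console may interpolate differently); `percentile` is a hypothetical helper and the latencies are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Invented POST /tenants latencies in milliseconds for N = 10 concurrent requests.
latencies_ms = [120, 95, 110, 400, 105, 98, 130, 101, 99, 250]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)
```

With small N, p99 is simply the slowest request, so p99 and max only diverge once a run has more than 100 samples.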

🔧 Settings

API Connection

API URL API Key

Infrastructure

Optional features and their current status. Toggle in config.yml and re-run ./setup.sh.

Multi-AZ HA

Prometheus + Grafana

AWS WAF

Console Login (Cognito)

RBAC (role-gating)

SNS Lifecycle Events

Per-tenant Quotas

AgentCore

Host Overcommit Ratios

Allocatable resources = physical × ratio. Tune in config.yml under host:.

CPU overcommit:

Memory overcommit:

Default per tenant:
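As a worked example of the formula, the sketch below computes allocatable capacity for one host. The physical sizes, ratios, and per-tenant defaults are invented for illustration and are not the project's defaults; `allocatable` is a hypothetical helper.

```python
def allocatable(physical, ratio):
    """Allocatable resources = physical × ratio."""
    return physical * ratio


# Invented host: 16 physical vCPUs and 65536 MB of RAM, with invented ratios.
vcpu_alloc = allocatable(16, 2.0)        # vCPUs that can be handed out
mem_alloc_mb = allocatable(65536, 1.5)   # MB that can be handed out

# How many tenants of an invented default size (2 vCPU, 4096 MB) fit on the host.
# The scarcer resource decides.
fits = int(min(vcpu_alloc // 2, mem_alloc_mb // 4096))
```

In this example CPU is the limiting resource: memory alone would allow 24 tenants, but the vCPU budget caps the host at 16.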

Fleet by AZ

Live distribution of registered hosts and their tenants across Availability Zones. Set multi_az.enabled: true in config.yml to spread the ASG.

Availability Zone	Hosts	VMs	vCPU used / total

System

Site URL:

GitHub: aws-samples/sample-multi-tenant-openclaw-on-firecracker