AI teams need to move fast. Engineers need the freedom to quickly spin up GPU clusters to run experiments without too much red tape. Governance layers like approval queues, manual reviews, and rigid allowlists add friction that can slow down iteration speed and productivity. But unconstrained self-service creates its own set of problems like runaway costs, security risks, and compliance failures.
SkyPilot Admin Policies enable you to enforce the right constraints automatically, without requiring engineers to know or care about them. Set Admin Policies at the API server and/or client level and they will be applied to all SkyPilot tasks.
Admin policies are packaged as Python modules that can be configured using the admin_policy field in your SkyPilot config:
admin_policy: example_policy.EnforceAutostopPolicy
This article will take a closer look at what Admin Policies are, what you can use them for, and how you can create your own.
Run safe workloads faster
An Admin Policy is a Python class that intercepts every SkyPilot request (e.g. a sky launch API call) and can inspect, modify, or reject it before it executes.
Admin policies are a great way to manage governance concerns like cost control, security and compliance, and to enhance the developer experience. The SkyPilot repository contains example policies for things like:
- Enforce autostop on all clusters – EnforceAutostopPolicy
- Enforce spot instances for all GPU tasks – UseSpotForGpuPolicy
- Disable public IPs on AWS – DisablePublicIpPolicy
- Auto-label Kubernetes workloads – AddLabelsPolicy
- Reject outdated clients – RejectOldClientsPolicy
All of these are testable code that can be chained together. Find all of the ready-made Admin Policies in the SkyPilot repo.
How do Admin Policies work?
Admin Policies implement a single interface method:
def validate_and_mutate(cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
The method receives a UserRequest object containing the full task definition and the active SkyPilot configuration, and returns a MutatedUserRequest with whatever changes (or none) the policy applies. To reject a request outright, the method raises an exception.
This design has a few important properties:
- Policies are code that can be unit-tested and can express arbitrary logic.
- Policies are modular and can be chained sequentially.
- Mutations are transparent to the user as the policy enforces requirements (e.g. spot instances, injected labels, or verified quotas) automatically behind the scenes.
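To make these properties concrete, here is a minimal sketch that runs without SkyPilot installed. Everything in it (`Request`, `add_team_label`, `require_spot_for_gpus`, `chain`, and the `team` label value) is a hypothetical stand-in for illustration, not the SkyPilot API. A real policy would subclass sky.AdminPolicy and work on sky.UserRequest instead.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


# Hypothetical stand-in for a request; NOT the real sky.UserRequest.
@dataclass(frozen=True)
class Request:
    use_spot: bool = False
    accelerators: int = 0
    labels: Dict[str, str] = field(default_factory=dict)


Policy = Callable[[Request], Request]


def add_team_label(request: Request) -> Request:
    """Mutating policy: inject a label the user never has to set."""
    return replace(request, labels={**request.labels, 'team': 'ml-platform'})


def require_spot_for_gpus(request: Request) -> Request:
    """Validating policy: reject GPU requests that are not on spot."""
    if request.accelerators > 0 and not request.use_spot:
        raise RuntimeError('GPU tasks must use spot instances.')
    return request


def chain(policies: List[Policy]) -> Policy:
    """Apply policies in order; each one sees the previous one's output."""

    def apply(request: Request) -> Request:
        for policy in policies:
            request = policy(request)
        return request

    return apply


combined = chain([add_team_label, require_spot_for_gpus])
result = combined(Request(use_spot=True, accelerators=8))
print(result.labels)
```

Because each policy in the sketch is a plain function from request to request, it can be tested in isolation, and a rejection surfaces as an ordinary exception from whichever policy in the chain raised it.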
Step-by-Step Walkthrough
The EnforceAutostopPolicy is a good example of a policy that enforces a platform rule (i.e. autostop must be set on every cluster) while handling the various edge cases that arise in a real deployment. You can find the full policy in the SkyPilot example policy script.
Let’s take a close look at how it’s built. This will help you understand how to build your own.
What It Does
When a cluster is launched or a job is executed against a cluster, the policy checks whether autostop is required for that cluster. If the cluster doesn’t exist yet, is currently stopped, or is running without autostop configured, the policy requires the incoming request to set an autostop value. If the request doesn’t include one, it is rejected.
Step 1: Filter by Request Type
Not every request needs autostop enforcement. The policy first checks whether the incoming request is a cluster launch or an exec against a cluster, and passes all other request types through unchanged:
if user_request.request_name not in [
        sky.AdminPolicyRequestName.CLUSTER_LAUNCH,
        sky.AdminPolicyRequestName.CLUSTER_EXEC,
]:
    return sky.MutatedUserRequest(
        task=user_request.task,
        skypilot_config=user_request.skypilot_config)
Step 2: Look Up the Current Cluster State
To know whether autostop is required, the policy needs to know the current state of the cluster. It queries the API using the cluster name from request_options:
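# `common` is assumed to be imported with `from sky.utils import common`.
request_options = user_request.request_options
cluster_name = request_options.cluster_name
cluster_records = sky.get(
    sky.status([cluster_name],
               refresh=common.StatusRefreshMode.AUTO,
               all_users=True))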
Step 3: Determine Whether Autostop Is Required
The policy applies the requirement in three cases:
- the cluster does not yet exist,
- the cluster is stopped, or
- the cluster is running but has no autostop set (autostop < 0):
# Check if the user request should specify autostop settings.
need_autostop = False
if not cluster_records:
    # Cluster does not exist
    need_autostop = True
elif cluster_records[0]['status'] == sky.ClusterStatus.STOPPED:
    # Cluster is stopped
    need_autostop = True
elif cluster_records[0]['autostop'] < 0:
    # Cluster is running but autostop is not set
    need_autostop = True
Step 4: Check Whether the Request Satisfies the Requirement
The policy reads the autostop setting from request_options:
# Check if the user request is setting autostop settings.
is_setting_autostop = False
idle_minutes_to_autostop = request_options.idle_minutes_to_autostop
is_setting_autostop = (idle_minutes_to_autostop is not None and
                       idle_minutes_to_autostop >= 0)
Step 5: Reject or Pass Through
If autostop is required but not provided, the policy raises an exception. Otherwise, it returns the request unchanged:
# If the cluster requires autostop but the user request is not setting
# autostop settings, raise an error.
if need_autostop and not is_setting_autostop:
    raise RuntimeError('Autostop/down must be set for all clusters.')
return sky.MutatedUserRequest(
    task=user_request.task,
    skypilot_config=user_request.skypilot_config)
Note that the sky.status() call adds a few seconds of latency to each affected request. This is a real trade-off to account for when designing policies that require live cluster state.
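Steps 3 to 5 are plain Python, so one way to keep them testable without a live API server is to factor the decision into pure functions. The sketch below assumes the record shape used in the walkthrough (a list of dicts with 'status' and 'autostop' keys). `ClusterStatus` here is a local stand-in for sky.ClusterStatus, and `needs_autostop`, `sets_autostop`, and `check_request` are hypothetical helper names, not part of the example policy.

```python
import enum
from typing import Any, Dict, List, Optional


# Local stand-in for sky.ClusterStatus so the sketch runs on its own.
class ClusterStatus(enum.Enum):
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def needs_autostop(cluster_records: List[Dict[str, Any]]) -> bool:
    """Step 3: is the request required to set autostop?"""
    if not cluster_records:
        return True  # Cluster does not exist yet.
    record = cluster_records[0]
    if record['status'] == ClusterStatus.STOPPED:
        return True  # Cluster is stopped.
    return record['autostop'] < 0  # Running, but autostop is not set.


def sets_autostop(idle_minutes_to_autostop: Optional[int]) -> bool:
    """Step 4: does the request carry a usable autostop value?"""
    return (idle_minutes_to_autostop is not None and
            idle_minutes_to_autostop >= 0)


def check_request(cluster_records: List[Dict[str, Any]],
                  idle_minutes_to_autostop: Optional[int]) -> None:
    """Step 5: raise if autostop is required but missing."""
    if needs_autostop(cluster_records) and not sets_autostop(
            idle_minutes_to_autostop):
        raise RuntimeError('Autostop/down must be set for all clusters.')


# A running cluster that already has autostop accepts a request without one.
check_request([{'status': ClusterStatus.UP, 'autostop': 10}], None)
# A new cluster is accepted only when the request sets autostop.
check_request([], 5)
```

With this split, only the sky.status() lookup needs a running server. The decision logic can be covered by ordinary unit tests that pass in hand-built records.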
Build Your Own Admin Policy
Admin policies are implemented by extending the sky.AdminPolicy interface. A custom admin policy should look like this:
import sky


class MyPolicy(sky.AdminPolicy):

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        # Logic to validate and modify the user request goes here.
        ...
        return sky.MutatedUserRequest(user_request.task,
                                      user_request.skypilot_config)
The UserRequest object exposes everything you need to build your policies:
- task: the full task definition
- skypilot_config: the active SkyPilot config for this request
- request_options: cluster name, autostop settings, dry-run flag
- user: the requesting user (server-side only)
- request_name: type of request
- client_api_version: client API version integer (server-side only)
- at_client_side: whether the policy is running on the client side
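As a rough illustration of how these fields combine, the sketch below gates requests on the client API version. It uses a stand-in dataclass rather than the real sky.UserRequest so it runs without SkyPilot installed. `FakeUserRequest`, `check_client_version`, and the `MIN_CLIENT_API_VERSION` threshold are hypothetical, and treating a missing version as too old is a design choice of this sketch, not documented SkyPilot behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold, chosen for illustration only.
MIN_CLIENT_API_VERSION = 20


# Stand-in mirroring a few of the fields listed above;
# NOT the real sky.UserRequest.
@dataclass
class FakeUserRequest:
    request_name: str
    at_client_side: bool
    client_api_version: Optional[int] = None
    user: Optional[str] = None


def check_client_version(request: FakeUserRequest) -> FakeUserRequest:
    """Reject requests from clients older than MIN_CLIENT_API_VERSION."""
    # client_api_version is only populated on the server side, so the
    # check is skipped when the policy runs inside the client.
    if request.at_client_side:
        return request
    version = request.client_api_version
    # Sketch-level choice: an unknown version is treated as too old.
    if version is None or version < MIN_CLIENT_API_VERSION:
        raise RuntimeError(
            f'Client API version {version} is too old; '
            f'at least {MIN_CLIENT_API_VERSION} is required.')
    return request


server_side = FakeUserRequest(request_name='cluster_launch',
                              at_client_side=False,
                              client_api_version=25,
                              user='alice')
check_client_version(server_side)  # Accepted: 25 >= 20.
```

Branching on at_client_side first keeps the same policy class usable in both places: the client-side run passes through, and the server-side run does the actual enforcement.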
In addition to local Python modules, SkyPilot also supports RESTful admin policies, allowing you to host your governance logic as a remote service. Learn more about hosting an admin policy as a server.
Check the docs for precise definitions of the UserRequest and MutatedUserRequest objects.
Share your custom Admin Policies in the SkyPilot Slack - we’d love to see what you’ve built!
