Create OSS Skill
Create Agent Skills that comply with the agentskills.io spec and follow authoring best practices.
Directory Structure
skill-name/
├── SKILL.md # Required: metadata + instructions
├── scripts/ # Optional: executable code
├── references/ # Optional: documentation
├── assets/ # Optional: templates, resources
└── evals/ # Optional: eval test cases
    ├── evals.json
    └── files/
Place skills in skills/<skill-name>/SKILL.md (project-level) or ~/.agents/skills/<skill-name>/SKILL.md (user-level cross-client).
Frontmatter
| Field | Required | Constraints |
|---|---|---|
| name | Yes | 1-64 chars. Lowercase a-z, numbers, hyphens. No leading/trailing/consecutive hyphens. Must match parent directory name. |
| description | Yes | 1-1024 chars. What it does and when to use it. Include trigger keywords. |
| license | No | License name or reference to bundled file. |
| compatibility | No | 1-500 chars. Environment requirements. Only if needed. |
| allowed-tools | No | Space-delimited pre-approved tools. |
| metadata | No | Arbitrary key-value map. |
Quote YAML values containing colons: description: "Use when: the user asks".
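The constraints on the two required fields can be checked mechanically before publishing. The sketch below is a minimal illustration, not the spec's reference validator: `check_frontmatter` and its messages are hypothetical names chosen for this example, and it assumes the frontmatter has already been parsed into plain strings.

```python
import re

# Hyphen-separated groups of lowercase letters/digits. One pattern rules out
# leading, trailing, and consecutive hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_frontmatter(name: str, description: str, parent_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the required fields pass."""
    problems = []
    if not 1 <= len(name) <= 64:
        problems.append("name must be 1-64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be lowercase a-z, digits, and single hyphens")
    if name != parent_dir:
        problems.append("name must match the parent directory name")
    if not 1 <= len(description) <= 1024:
        problems.append("description must be 1-1024 characters")
    return problems
```

For example, `check_frontmatter("csv-analysis", "Use this skill when ...", "csv-analysis")` returns an empty list, while a name such as `CSV_Analysis` or `a--b` yields a problem entry.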
Writing Effective Descriptions
The description carries the entire burden of triggering — agents see only name + description at startup (~50-100 tokens) and decide whether to activate from this alone.
Principles:
- Imperative phrasing: “Use this skill when…” not “This skill does…”
- Focus on user intent: Describe what the user is trying to achieve, not internal mechanics
- Be pushy: Explicitly list contexts, including cases where the user doesn’t name the domain directly
- Include keywords: Specific terms agents will match against
Good:
description: >
  Analyze CSV and tabular data files: compute summary statistics,
  add derived columns, generate charts, and clean messy data. Use this
  skill when the user has a CSV, TSV, or Excel file and wants to
  explore, transform, or visualize the data, even if they don't
  explicitly mention "CSV" or "analysis."
Bad: description: Helps with PDFs.
To test and optimize descriptions systematically, see references/optimizing-descriptions.md.
Writing Effective Instructions
Start from Real Expertise
Don’t rely solely on LLM general knowledge. Ground skills in:
- Hands-on tasks: Complete a real task with an agent, note corrections and context you provided, extract the reusable pattern
- Project artifacts: Internal docs, runbooks, API specs, code review comments, incident reports, version control history
Spend Context Wisely
The SKILL.md body competes for attention with everything else in the context window.
- Add what the agent lacks, omit what it knows. Don’t explain what a PDF is. Do specify which library to use and why.
- Aim for moderate detail. Concise stepwise guidance with a working example outperforms exhaustive documentation.
- Design coherent units. Too narrow = multiple skills load for one task. Too broad = triggers on wrong tasks.
Calibrate Control
- Give freedom when multiple approaches are valid (explain why so the agent adapts)
- Be prescriptive when operations are fragile or a specific sequence must be followed
- Provide defaults, not menus: Pick one tool/approach, mention alternatives briefly
Patterns for Instructions
Gotchas sections — highest-value content. Concrete corrections to mistakes the agent will make:
## Gotchas
- The `users` table uses soft deletes. Include `WHERE deleted_at IS NULL`.
- User ID is `user_id` in DB, `uid` in auth, `accountId` in billing.
Output templates — more reliable than prose descriptions:
## Report structure
Use this template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Checklists — for multi-step workflows with dependencies:
## Workflow
- [ ] Step 1: Analyze (run `scripts/analyze.py`)
- [ ] Step 2: Validate (run `scripts/validate.py`)
- [ ] Step 3: Execute (run `scripts/execute.py`)
Validation loops — instruct the agent to verify its own work:
1. Make edits
2. Run `python scripts/validate.py output/`
3. If validation fails: fix issues, re-validate
4. Only proceed when validation passes
Plan-validate-execute — for batch/destructive operations:
1. Generate plan → `plan.json`
2. Validate plan against source of truth
3. Only execute after validation passes
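The plan-validate-execute pattern above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the in-memory `records` dict stands in for a real source of truth, the delete operation and all function names are hypothetical, and a real script would also write the plan to `plan.json` so it can be reviewed.

```python
def make_plan(stale_ids):
    """Step 1: describe the intended operations as data, without performing them."""
    return [{"op": "delete", "id": record_id} for record_id in stale_ids]


def validate_plan(plan, existing_ids):
    """Step 2: check every planned operation against the source of truth."""
    errors = []
    seen = set()
    for step in plan:
        if step["id"] not in existing_ids:
            errors.append(f"id {step['id']} not found in source of truth")
        elif step["id"] in seen:
            errors.append(f"id {step['id']} appears more than once")
        seen.add(step["id"])
    return errors


def execute_plan(plan, records):
    """Step 3: apply the operations. Call only after validation passes."""
    for step in plan:
        del records[step["id"]]


def run(stale_ids, records):
    plan = make_plan(stale_ids)
    errors = validate_plan(plan, set(records))
    if errors:
        # Nothing is modified when any step fails validation.
        return {"executed": False, "errors": errors}
    execute_plan(plan, records)
    return {"executed": True, "errors": []}
```

Because validation covers the whole plan before any step runs, a single bad entry leaves the data untouched instead of producing a half-applied batch.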
Favor Procedures over Declarations
Teach how to approach a class of problems, not what to produce for a specific instance. The approach should generalize even when individual details are specific.
Progressive Disclosure Budget
| Tier | Content | Budget |
|---|---|---|
| 1. Catalog | name + description | ~50-100 tokens |
| 2. Instructions | SKILL.md body | <5000 tokens / <500 lines |
| 3. Resources | scripts/, references/, assets/ | On demand |
Keep SKILL.md under 500 lines. Move detailed reference material to separate files. Tell the agent when to load each file: “Read references/api-errors.md if the API returns a non-200 status” is better than “see references/ for details.”
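A rough check of the tier 2 budget can be scripted. The sketch below assumes about 4 characters per token, which is only a heuristic (real tokenizers vary by model), and `check_body_budget` is a hypothetical helper name.

```python
def check_body_budget(body: str, max_lines: int = 500, max_tokens: int = 5000) -> dict:
    """Estimate whether a SKILL.md body fits the line and token budget."""
    lines = len(body.splitlines())
    # Assumption: roughly 4 characters per token. Treat the result as an estimate.
    estimated_tokens = len(body) // 4
    return {
        "lines": lines,
        "estimated_tokens": estimated_tokens,
        "within_budget": lines < max_lines and estimated_tokens < max_tokens,
    }
```

A body that fails this check is a candidate for moving detail into references/ with a conditional load instruction.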
Bundled Scripts
One-off Commands
Reference existing packages directly — no scripts/ needed:
- uvx ruff@0.8.0 check . (Python)
- npx eslint@9 --fix . (Node)
- go run golang.org/x/tools/cmd/goimports@v0.28.0 . (Go)
Pin versions for reproducibility. State prerequisites in compatibility.
Self-contained Scripts
Bundle in scripts/ with inline dependency declarations:
# /// script
# dependencies = ["beautifulsoup4"]
# ///
from bs4 import BeautifulSoup
# ... run with: uv run scripts/extract.py
Designing Scripts for Agents
- No interactive prompts: agents can’t respond to TTY input. Use flags/env vars/stdin.
- --help output: primary way agents learn the interface. Include description, flags, examples.
- Helpful error messages: say what went wrong, what was expected, what to try.
- Structured output: JSON/CSV over free-form text. Data to stdout, diagnostics to stderr.
- Idempotent: “create if not exists” over “create and fail on duplicate.”
- Dry-run support: --dry-run for destructive operations.
- Predictable output size: default to summary/limit, support --offset for pagination.
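The guidelines above might combine as follows. This is a sketch, not a reference implementation: `cleanup.py`, the `--ids` and `--limit` flags, and the in-memory `records` dict (standing in for a real datastore) are all hypothetical.

```python
import argparse
import json
import sys


def build_parser() -> argparse.ArgumentParser:
    # --help is generated from these strings, so keep them descriptive.
    parser = argparse.ArgumentParser(
        prog="cleanup.py",  # hypothetical script name
        description="Delete records by ID and report the result as JSON.",
        epilog="Example: cleanup.py --ids 1 2 3 --dry-run",
    )
    parser.add_argument("--ids", nargs="+", type=int, default=[],
                        help="record IDs to delete")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be deleted without deleting")
    parser.add_argument("--limit", type=int, default=20,
                        help="maximum results to print (default: 20)")
    parser.add_argument("--offset", type=int, default=0,
                        help="number of results to skip (default: 0)")
    return parser


def main(argv: list[str], records: dict[int, str]) -> int:
    args = build_parser().parse_args(argv)
    if not args.ids:
        # Diagnostics go to stderr: what went wrong, what was expected, what to try.
        print("error: no record IDs given. Expected --ids ID [ID ...]. "
              "Try: cleanup.py --ids 1 2 --dry-run", file=sys.stderr)
        return 2
    results = []
    for record_id in args.ids:
        if record_id not in records:
            # Idempotent: a missing record is reported, not treated as a failure.
            status = "already_absent"
        elif args.dry_run:
            status = "would_delete"
        else:
            del records[record_id]
            status = "deleted"
        results.append({"id": record_id, "status": status})
    # Predictable output size: print one page of results as JSON on stdout.
    page = results[args.offset:args.offset + args.limit]
    print(json.dumps({"total": len(results), "offset": args.offset, "results": page}))
    return 0
```

To use this as a command-line script, call `main(sys.argv[1:], records)` under an `if __name__ == "__main__":` guard with a real datastore in place of `records`. Running the same command twice succeeds both times, which lets an agent retry safely.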
When to Bundle
Compare agent execution traces across test cases. If the agent independently reinvents the same logic each run (chart building, format parsing, output validation), write a tested script once and bundle it.
Evaluating Skills
Test whether a skill produces good outputs using structured evals. See references/evaluating-skills.md for the full eval workflow.
Core loop:
- Write test cases in evals/evals.json (prompt + expected output + optional files)
- Run each test with and without the skill
- Add assertions after seeing first outputs
- Grade assertions (PASS/FAIL with evidence)
- Aggregate into benchmark.json (pass rate, tokens, time deltas)
- Human review for qualities assertions can’t capture
- Iterate: feed failures + feedback + transcripts into skill improvements
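The aggregation step can be sketched as below. The record layout is an assumption made for illustration only: the keys `with_skill`, `passed`, `total`, `tokens`, and `seconds` are hypothetical, and the actual benchmark.json schema is whatever references/evaluating-skills.md defines.

```python
def aggregate(runs: list[dict]) -> dict:
    """Summarize graded runs with and without the skill, plus the deltas."""

    def summarize(subset: list[dict]) -> dict:
        total = sum(r["total"] for r in subset)
        return {
            "pass_rate": sum(r["passed"] for r in subset) / total if total else 0.0,
            "tokens": sum(r["tokens"] for r in subset),
            "seconds": sum(r["seconds"] for r in subset),
        }

    with_skill = summarize([r for r in runs if r["with_skill"]])
    without_skill = summarize([r for r in runs if not r["with_skill"]])
    # Positive pass_rate delta means the skill helped; positive tokens/seconds
    # deltas are the cost of that improvement.
    delta = {key: with_skill[key] - without_skill[key] for key in with_skill}
    return {"with_skill": with_skill, "without_skill": without_skill, "delta": delta}
```

Reading the pass-rate delta next to the token and time deltas shows whether the skill's benefit justifies its added context cost.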
Checklist
Before publishing:
- name matches parent directory exactly
- name is lowercase alphanumeric + hyphens, 1-64 chars
- description is non-empty, describes what AND when, at most 1024 chars
- description uses imperative phrasing with trigger keywords
- YAML frontmatter is valid (colons quoted)
- Body is under 500 lines / ~5000 tokens
- Reference material split into separate files with conditional load instructions
- Scripts are self-contained, non-interactive, with --help and structured output
- compatibility included only if environment-specific requirements exist
- Validate with skills-ref validate ./my-skill if available
Template
---
name: <skill-name>
description: >
  <What it does: capabilities>. Use when <trigger conditions>,
  even if the user doesn't explicitly mention <domain keywords>.
license: Apache-2.0
---
# <Skill Title>
## When to Use
<Clear trigger conditions.>
## Instructions
<Step-by-step guidance. Explain *why* for non-obvious steps.>
## Gotchas
<Concrete corrections to mistakes the agent will make without being told.>
## Examples
<Input/output examples demonstrating expected behavior.>
Common Mistakes
| Mistake | Fix |
|---|---|
| Vague description | Add trigger keywords, capability list, imperative phrasing |
| Description says “This skill does…” | Use “Use this skill when…” |
| Name doesn’t match directory | Rename directory or name field to match |
| Uppercase/underscores in name | Lowercase + hyphens only |
| Unquoted YAML colons | Quote: description: "Use when: the user asks" |
| Everything in one file | Split into references/, scripts/, assets/ |
| Missing “when to use” | Always include trigger conditions in description |
| Body too long | Target <500 lines, offload detail to reference files |
| Generic LLM-generated content | Ground in real expertise, project artifacts, hands-on tasks |
| Exhaustive rules over examples | Concise steps + working example outperform rule lists |
| Menus of equal options | Pick a default, mention alternatives briefly |
| Scripts with interactive prompts | Use flags/env vars/stdin, never TTY prompts |
| No validation step | Add validation loops or plan-validate-execute patterns |