Category 2: Workflow Automation
Used for: Multi-step processes that benefit from consistent methodology, including coordination across multiple MCP servers.
Real example: skill-creator skill
"Interactive guide for creating new skills. Walks the user through use case definition, frontmatter generation, instruction writing, and validation."
Key techniques:
- Step-by-step workflow with validation gates
- Templates for common structures
- Built-in review and improvement suggestions
- Iterative refinement loops
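A workflow of this shape can be sketched as a list of steps, each followed by a validation gate that must pass before the next step runs. This is a minimal illustration, not the skill-creator's actual implementation; the `Step` class and the two example steps are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]      # advances the workflow state
    gate: Callable[[dict], bool]     # validation gate: must pass to continue

def run_workflow(steps: list[Step], state: dict) -> dict:
    for step in steps:
        state = step.run(state)
        if not step.gate(state):
            # Stop early rather than carrying a bad state forward.
            raise ValueError(f"Validation gate failed after step: {step.name}")
    return state

# Two illustrative steps loosely modeled on the skill-creator flow.
steps = [
    Step("define use case",
         run=lambda s: {**s, "use_case": "summarize PRs"},
         gate=lambda s: bool(s.get("use_case"))),
    Step("write frontmatter",
         run=lambda s: {**s, "frontmatter": {"name": "pr-summarizer"}},
         gate=lambda s: "name" in s.get("frontmatter", {})),
]

final = run_workflow(steps, {})
```

The gates are what make the methodology consistent: a step either produces state the next step can rely on, or the workflow stops and asks for refinement.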
Category 3: MCP Enhancement
Used for: Workflow guidance that enhances the tool access an MCP server provides.
Real example: sentry-code-review skill (from Sentry)
"Automatically analyzes and fixes detected bugs in GitHub Pull Requests using Sentry's error monitoring data via their MCP server."
Key techniques:
- Coordinates multiple MCP calls in sequence
- Embeds domain expertise
- Provides context users would otherwise need to specify
- Error handling for common MCP issues
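Sequencing MCP calls with error handling can be sketched as below. This is an assumption-laden illustration: `call_mcp`, the tool names, and the retry policy are stand-ins, not Sentry's actual MCP interface.

```python
import time

def call_mcp(tool: str, args: dict) -> dict:
    # Stand-in for a real MCP client invocation.
    return {"tool": tool, "ok": True, "args": args}

def call_with_retry(tool: str, args: dict, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            return call_mcp(tool, args)
        except ConnectionError:       # a common transient MCP failure
            time.sleep(2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError(f"MCP call failed after {retries} attempts: {tool}")

# The skill sequences calls the user would otherwise have to orchestrate:
issue = call_with_retry("sentry.get_issue", {"issue_id": "PROJ-123"})
events = call_with_retry("sentry.get_events", {"issue_id": "PROJ-123"})
fix = call_with_retry("github.create_review_comment",
                      {"pr": 42, "body": "Suggested fix based on Sentry data"})
```

The value is in the sequence and the recovery behavior: the skill knows which call comes next and how to react to a transient failure, so the user doesn't have to.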
Define success criteria
How will you know your skill is working?
These are aspirational targets - rough benchmarks rather than precise thresholds. Aim for rigor but accept that there will be an element of vibes-based assessment. We are actively developing more robust measurement guidance and tooling.
Quantitative metrics:
- Skill triggers on 90% of relevant queries
  - How to measure: Run 10-20 test queries that should trigger your skill. Track how many times it loads automatically vs. requires explicit invocation.
- Completes workflow in X tool calls
  - How to measure: Compare the same task with and without the skill enabled. Count tool calls and total tokens consumed.
- 0 failed API calls per workflow
  - How to measure: Monitor MCP server logs during test runs. Track retry rates and error codes.
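The trigger-rate measurement above can be automated with a small harness. This is a hypothetical sketch: `skill_triggered` is a placeholder for however you detect that the skill loaded (e.g. by inspecting the session transcript).

```python
def skill_triggered(query: str) -> bool:
    # Placeholder detection logic; replace with a real check against
    # your session transcript or logs.
    return "pdf" in query.lower()

test_queries = [
    "Extract the tables from this PDF",
    "Summarize quarterly-report.pdf",
    "What's the weather today?",   # control query: should NOT trigger
]
relevant = test_queries[:2]

hits = sum(skill_triggered(q) for q in relevant)
rate = hits / len(relevant)
print(f"Triggered on {hits}/{len(relevant)} relevant queries ({rate:.0%})")
```

Including a few control queries that should *not* trigger the skill helps catch over-broad descriptions as well as under-triggering.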
Qualitative metrics:
- Users don't need to prompt Claude about next steps
  - How to assess: During testing, note how often you need to redirect or clarify. Ask beta users for feedback.
- Workflows complete without user correction
  - How to assess: Can a new user accomplish the task on first try with minimal guidance?
- Consistent results across sessions
  - How to assess: Run the same request 3-5 times. Compare outputs for structural consistency and quality.
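The "run the same request 3-5 times" check can be partially scripted. A rough sketch follows, assuming the outputs are markdown and defining "structure" as the set of headings; both are illustrative choices, not a standard.

```python
def structure(output: str) -> list[str]:
    # Treat markdown headings as the output's structural skeleton.
    return [line for line in output.splitlines() if line.startswith("#")]

# Outputs from repeated runs of the same request (illustrative data).
runs = [
    "# Summary\nAll good.\n# Next steps\nShip it.",
    "# Summary\nLooks fine.\n# Next steps\nMerge.",
    "# Summary\nOK.\n# Next steps\nDone.",
]

baseline = structure(runs[0])
consistent = all(structure(r) == baseline for r in runs[1:])
print("Structurally consistent:", consistent)
```

A check like this only covers structure; judging the quality of each run still requires human review, which is where the vibes-based element comes in.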