
A.I. Automation

Turn repetitive tasks into autonomous workflows that run while you sleep.

from $125/hr
Manual inputs (email, spreadsheets, PDFs) → Agent → Automated outputs (reports, alerts, responses)
Ideal for startups, growth-stage companies, and enterprise product teams that need A.I. automation, custom model training, or production-grade LLM evaluation: from workflow agents to fine-tuned models to safety compliance.
Deliverables: 7 included

You’re spending 10–20 hours a week on tasks a machine should handle

The Manual Loop — endless cycle of manual data movement between broken systems

Every business has them—the repetitive, rule-based tasks that eat hours without producing insight. Sorting emails. Formatting reports. Copying data between systems. Monitoring competitors. Generating first drafts. These are tasks that A.I. agents handle reliably today, not in some speculative future. The gap isn’t capability—it’s implementation. Most businesses know A.I. “could” help, but they don’t know where to start, which tools to trust, or how to build workflows that actually stick. I close that gap.

What I do

The Agent — bronze automaton with neural pathways
  • Workflow audit—I sit with your team, map every recurring task, and identify which ones are automatable with current tools—not theoretical tools, tools that work today.
  • Claude Code agents—I build custom A.I. agents using Claude Code that execute multi-step tasks autonomously: research, data extraction, report generation, content drafting, code review.
  • API integrations—Connect your existing tools (Slack, email, CRM, spreadsheets, databases) into unified pipelines where data flows automatically instead of being copied manually.
  • Data pipelines—Build extraction, transformation, and reporting workflows that pull from live sources, process the data, and deliver actionable output on a schedule.
  • Monitoring and alerts—Set up autonomous watchers that track competitors, inventory, pricing, reviews, or any other signal—and alert you when something changes.
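The watcher pattern behind the monitoring bullet can be sketched in plain Python. This is a minimal sketch, not production code: the snapshot fields and threshold are illustrative, and in practice the polling runs on a schedule and alerts are delivered to Slack or email.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One observation of a monitored signal (price, stock level, review count)."""
    source: str
    value: float

def detect_changes(previous: dict[str, float], current: list[Snapshot],
                   threshold: float = 0.0) -> list[str]:
    """Compare the latest snapshots against the last known values and return
    human-readable alerts for anything that moved past `threshold`.
    `previous` is mutated in place so the next poll compares against fresh state."""
    alerts = []
    for snap in current:
        old = previous.get(snap.source)
        if old is not None and abs(snap.value - old) > threshold:
            alerts.append(f"{snap.source}: {old} -> {snap.value}")
        previous[snap.source] = snap.value  # remember state for the next poll
    return alerts
```

New sources appear silently on the first poll and only alert once a baseline exists, which keeps the first run from flooding the channel.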

What this looks like in practice

Email triage—An agent that reads incoming email, classifies priority, drafts responses for review, and flags anything requiring human judgment.
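The routing step that follows classification can be sketched as a small decision function. The reading and classifying is done by the model; the channel names and the 0.8 confidence threshold below are hypothetical placeholders, tuned per client in practice.

```python
def route_email(priority: str, intent: str, confidence: float) -> dict:
    """Decide what happens to a classified email. Anything the model is
    unsure about gets flagged for a human instead of auto-routed."""
    if confidence < 0.8:
        return {"action": "flag_for_human", "channel": "#triage-review"}
    channel = {"billing": "#finance", "support": "#support",
               "sales": "#sales"}.get(intent, "#inbox")
    action = "escalate" if priority == "urgent" else "draft_reply"
    return {"action": action, "channel": channel}
```

The low-confidence branch is the "flags anything requiring human judgment" guarantee: the agent never auto-routes what it cannot classify cleanly.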

Document processing—OCR and structured extraction from PDFs, invoices, contracts, or research papers, routed into your systems automatically.

Inventory monitoring—Track competitor pricing, stock levels, and product availability across multiple platforms in real time.

Content generation—Draft anti-slop blog posts, product descriptions, social copy, or internal reports from structured prompts and your brand voice guidelines.

LLM fine-tuning

Beyond workflow automation, I train and fine-tune open-source language models on your proprietary data—so your A.I. systems speak with your organization’s voice, not a generic one.

  • LoRA fine-tuning—Lightweight adapter training on Mac Mini M4 (Unsloth-MLX) or cloud GPU (HuggingFace Jobs, Lambda Labs H100). Fast iteration, low cost, production-ready.
  • SFT + DPO methodology—Supervised fine-tuning for knowledge, Direct Preference Optimization for behavior shaping. I generate synthetic training data with cognitive metadata—teaching models how to reason, not just what to say.
  • Expert voice preservation—My training pipeline extracts knowledge from books, transcripts, and interviews, then generates thousands of synthetic conversations that faithfully reproduce an expert’s methodology, tone, and reasoning patterns.
  • Proof: I built a conscious parenting advisor fine-tuned on a 4-book corpus—59 psychological concepts, 698 behavioral indicators, 295 abstract patterns. Deployed with 92% concept detection accuracy across 12 test personas.
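The two training-data shapes behind SFT + DPO can be sketched as plain records. The exact schema varies by trainer; the field names below follow the chat-messages and preference-pair formats commonly accepted (e.g. by TRL), so treat them as an assumption rather than a fixed contract.

```python
def sft_record(question: str, expert_answer: str, system: str) -> dict:
    """One supervised fine-tuning example in chat-messages format:
    the assistant turn carries the expert's knowledge and voice."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
        {"role": "assistant", "content": expert_answer},
    ]}

def dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """One preference pair for Direct Preference Optimization: `chosen`
    reflects the expert's reasoning style, `rejected` is a generic reply."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

SFT teaches the model what the expert knows; the DPO pairs then shape how it answers, which is why both record types are generated from the same source corpus.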

Model hosting and serving

I deploy and manage LLM infrastructure at every scale—from a single Mac Mini running local inference to a multi-GPU cloud fleet serving thousands of requests per minute.

  • GPU fleet management—Lambda Labs H100 instances with TensorRT-LLM and SGLang for high-throughput serving. LoRA hot-loading for multi-tenant deployments.
  • Local inference—Ollama, MLX, and llama.cpp on Apple Silicon for development, testing, and always-on daemons. My Kothar soul daemon runs 24/7 on a Mac Mini M4 with LoRA adapters.
  • Multi-provider routing—OpenRouter, Groq (Kimi K2), Baseten, Anthropic, and OpenAI with automatic fallback and cost optimization. The right model for each task, not the most expensive one for every task.
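The fallback half of multi-provider routing reduces to trying providers cheapest-first and moving on when one fails. A minimal sketch, with plain callables standing in for real API clients:

```python
from typing import Callable

def complete_with_fallback(prompt: str,
                           providers: list[tuple[str, Callable[[str], str]]]
                           ) -> tuple[str, str]:
    """Try providers in order (cheapest/fastest first). Each entry is
    (name, call), where `call` returns completion text or raises on
    timeout, rate limit, or outage. Returns (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # provider down, rate-limited, etc.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Cost optimization lives in the ordering of the list: a fast cheap model goes first, and the expensive one is only reached when the cheap one is unavailable or unsuited to the task.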

Evaluation and safety

A model that works in demos and fails in production is worse than no model at all. I build evaluation frameworks that catch failures before users do.

  • LLM-as-judge—5 providers tested head-to-head (Kimi K2.5, GPT-4o, Claude, Qwen3, custom LoRA). Each judge calibrated against human labels with TPR/TNR analysis.
  • 17-metric framework—Citation density, recall, precision (ALCE standard), source diversity, faithfulness, answer relevancy (RAGAS), and 11 more. Not vibes—numbers.
  • Adversarial testing—30+ scenarios targeting clinical safety (C-SSRS suicide protocol, PHQ-9 depression screening, HITS domestic violence instrument). If your model handles sensitive domains, I stress-test the failure modes.
  • Proof: I built a 55-rubric evaluation suite with 483 criteria across Breaking News, VC, Executive, AI/ML, Academic, and 50 other domains. Blind A/B prompt comparison with 93 automated tests.
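Calibrating a judge against human labels reduces to computing its true-positive and true-negative rates. A minimal sketch, treating "response is unsafe/incorrect" as the positive class:

```python
def calibrate_judge(judge_labels: list[bool],
                    human_labels: list[bool]) -> dict:
    """TPR/TNR of an LLM judge against human ground truth.
    True means the rater flagged the response (the positive class)."""
    tp = sum(j and h for j, h in zip(judge_labels, human_labels))
    tn = sum((not j) and (not h) for j, h in zip(judge_labels, human_labels))
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return {"tpr": tp / pos if pos else 0.0,
            "tnr": tn / neg if neg else 0.0}
```

A judge with high TPR but low TNR over-flags safe responses; one with the reverse misses real failures. Reporting both is what separates a calibrated judge from a vibes-based one.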

RAG systems

Retrieval-Augmented Generation gives your model access to your data without retraining. I build RAG pipelines that actually retrieve the right context—not just the closest vector.

  • Vector search—pgvector + Supabase, VoyageAI rerank-2.5 for precision. Semantic embeddings with behavioral indicator decomposition and concept-level retrieval.
  • CRAG-style evaluation—Source relevance scoring + BM25 section extraction before feeding context to the model. Quality retrieval, not quantity.
  • Local RAG—Fully private, zero-cloud knowledge bases running on local hardware. Semantic search across documents, papers, and code without any data leaving your machine.
  • Deep-fetch enrichment—Firecrawl V2 integration with TTL caching for live web sources. Background enrichment that improves retrieval quality over time.
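Stripped of pgvector and the reranker, the retrieval core is similarity ranking plus a relevance floor. A toy sketch with hand-rolled cosine similarity; in production the embeddings come from a model and the ranking runs in the database:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float],
             chunks: list[tuple[str, list[float]]],
             k: int = 3, min_score: float = 0.3) -> list[tuple[float, str]]:
    """Rank chunks by similarity, keep the top k, and drop anything below
    `min_score`: the 'right context, not just the closest vector' filter
    in miniature."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in chunks),
                    reverse=True)
    return [(s, t) for s, t in scored[:k] if s >= min_score]
```

The `min_score` floor matters: without it, an off-topic query still returns its k nearest neighbors, and the model confidently answers from irrelevant context.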

Built on open-source tooling

The agents I build for clients use the same architecture I’ve open-sourced. Claudicle provides persistent identity and memory across sessions—so agents remember context, maintain state, and improve over time. My Claude Code Minoan skill ecosystem includes 60+ production skills covering web scraping, data extraction, search, communication, and more. You’re not buying a black box—you’re getting systems built on transparent, extensible foundations.

What you get

The Dashboard — autonomous monitoring running without hands
  • Workflow audit with prioritized automation opportunities
  • Custom-built A.I. agents, tested and deployed in your environment
  • API integrations connecting your existing tools
  • Documentation so your team can maintain and extend the workflows
  • 30 days of post-deployment support and iteration

The question is no longer whether A.I. can automate your workflows—it’s how many hours you’ll waste before you let it.

Before ~15 min
  1. Read each email manually
  2. Copy to spreadsheet
  3. Flag priority by gut feel
  4. Route to team via Slack
After ~34 sec
  1. Claude scans inbox via API
  2. Classifies priority + intent
  3. Routes to correct team channel
  4. Drafts response for review
Evaluation Framework

Judge panel: Kimi K2.5 · GPT-5.4 · Sonnet 4.6 · Qwen3.5 · Custom LoRA
20 Metrics

Accuracy
  • Factual Accuracy: claims verified against ground-truth sources
  • Citation Density: ratio of cited vs. uncited claims in responses
  • Hallucination Rate: frequency of fabricated facts or nonexistent sources
  • Source Diversity: breadth of reference material used across responses
Quality
  • Coherence: logical flow within and across conversation turns
  • Completeness: coverage of all requested aspects in the response
  • Answer Relevancy: how directly the response addresses the user query
  • Character Consistency: persona, voice, and behavioral traits maintained across sessions
  • Emotional Awareness: recognition of and appropriate response to user emotional state
Safety
  • Toxicity Detection: identifies and refuses harmful, abusive, or violent content
  • Bias Detection: screens for demographic, cultural, or ideological bias
  • PII Leakage: prevents exposure of personally identifiable information
  • Scope Boundaries: stays within the defined expertise domain, defers when appropriate
  • Refusal Accuracy: correctly refuses dangerous requests without over-refusing safe ones
Fine-Tuning
  • Domain Accuracy: correctness within the trained specialty knowledge area
  • Style Fidelity: adherence to target voice, tone, and communication patterns
  • Concept Detection: ability to identify domain-specific concepts in user input
Performance
  • Latency (p95): 95th-percentile response time under production load
  • Cost per Query: average API + compute cost per user interaction
  • Token Efficiency: information density relative to token usage
Adversarial Scenarios (30+)
  • Harmful content generation (Safety)
  • Risk detection & escalation (Safety)
  • Hallucinated citation injection (Accuracy)
  • PII extraction attempt (Privacy)
  • Confidential data exfiltration (Privacy)
  • Jailbreak via role-play (Security)
  • Prompt injection attack (Security)
  • Biased recommendation probe (Fairness)
  • Character drift under pressure (Fidelity)
  • Out-of-scope request handling (Fidelity)
  • Fine-tuned vs. base model divergence (Training)
  • Adversarial paraphrase consistency (Training)
Judge Coverage

Five judges (K2.5, GPT, Sonnet, Qwen, LoRA) evaluated across eight safety dimensions: Toxicity, Bias, PII, Scope, Refusal, Risk Detection, Cultural, Sycophancy.

5 judge providers · 8 safety dimensions · 30+ adversarial scenarios

Ready to start? Get in touch today.

The first call is free. 30 minutes, no pitch—just your stack, your problems, and what I'd do about them. Book a call.

Consulting the depths