
A.I. Automation

Turn repetitive tasks into autonomous workflows that run while you sleep.

from $125/hr
Manual inputs (email, spreadsheets, PDFs) → Agent → Automated outputs (reports, alerts, responses)
Ideal for startups, growth-stage companies, and enterprise product teams that need A.I. automation, custom model training, or production-grade LLM evaluation: from workflow agents to fine-tuned models to safety compliance.
Deliverables: 7 included

You’re spending 10–20 hours a week on tasks a machine should handle

The Manual Loop — endless cycle of manual data movement between broken systems

Every business has them—the repetitive, rule-based tasks that eat hours without producing insight. Sorting emails. Formatting reports. Copying data between systems. Monitoring competitors. Generating first drafts. These are tasks that A.I. agents handle reliably today, not in some speculative future. The gap isn’t capability—it’s implementation. Most businesses know A.I. “could” help, but they don’t know where to start, which tools to trust, or how to build workflows that actually stick. I close that gap.

What I do

The Agent — bronze automaton with neural pathways
  • Workflow audit—I sit with your team, map every recurring task, and identify which ones are automatable with current tools—not theoretical tools, tools that work today.
  • Claude Code agents—I build custom A.I. agents using Claude Code that execute multi-step tasks autonomously: research, data extraction, report generation, content drafting, code review.
  • API integrations—Connect your existing tools (Slack, email, CRM, spreadsheets, databases) into unified pipelines where data flows automatically instead of being copied manually.
  • Data pipelines—Build extraction, transformation, and reporting workflows that pull from live sources, process the data, and deliver actionable output on a schedule.
  • Monitoring and alerts—Set up autonomous watchers that track competitors, inventory, pricing, reviews, or any other signal—and alert you when something changes.
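The watcher pattern behind the monitoring bullet can be sketched in plain Python. This is a minimal sketch, not production code: the snapshot fields and threshold are illustrative, and in practice the polling runs on a schedule and alerts are delivered to Slack or email.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One observation of a monitored signal (price, stock level, review count)."""
    source: str
    value: float

def detect_changes(previous: dict[str, float], current: list[Snapshot],
                   threshold: float = 0.0) -> list[str]:
    """Compare the latest snapshots against the last known values and return
    human-readable alerts for anything that moved past `threshold`.
    `previous` is mutated in place so the next poll compares against fresh state."""
    alerts = []
    for snap in current:
        old = previous.get(snap.source)
        if old is not None and abs(snap.value - old) > threshold:
            alerts.append(f"{snap.source}: {old} -> {snap.value}")
        previous[snap.source] = snap.value  # remember state for the next poll
    return alerts
```

New sources appear silently on the first poll and only alert once a baseline exists, which keeps the first run from flooding the channel.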

What this looks like in practice

Email triage—An agent that reads incoming email, classifies priority, drafts responses for review, and flags anything requiring human judgment.
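The routing step that follows classification can be sketched as a small decision function. The reading and classifying is done by the model; the channel names and the 0.8 confidence threshold below are hypothetical placeholders, tuned per client in practice.

```python
def route_email(priority: str, intent: str, confidence: float) -> dict:
    """Decide what happens to a classified email. Anything the model is
    unsure about gets flagged for a human instead of auto-routed."""
    if confidence < 0.8:
        return {"action": "flag_for_human", "channel": "#triage-review"}
    channel = {"billing": "#finance", "support": "#support",
               "sales": "#sales"}.get(intent, "#inbox")
    action = "escalate" if priority == "urgent" else "draft_reply"
    return {"action": action, "channel": channel}
```

The low-confidence branch is the "flags anything requiring human judgment" guarantee: the agent never auto-routes what it cannot classify cleanly.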

Document processing—OCR and structured extraction from PDFs, invoices, contracts, or research papers, routed into your systems automatically.

Inventory monitoring—Track competitor pricing, stock levels, and product availability across multiple platforms in real time.

Content generation—Draft anti-slop blog posts, product descriptions, social copy, or internal reports from structured prompts and your brand voice guidelines.

LLM fine-tuning

Beyond workflow automation, I train and fine-tune open-source language models on your proprietary data—so your A.I. systems speak with your organization’s voice, not a generic one.

  • LoRA fine-tuning—Lightweight adapter training on Mac Mini M4 (Unsloth-MLX) or cloud GPU (HuggingFace Jobs, Lambda Labs H100). Fast iteration, low cost, production-ready.
  • SFT + DPO methodology—Supervised fine-tuning for knowledge, Direct Preference Optimization for behavior shaping. I generate synthetic training data with cognitive metadata—teaching models how to reason, not just what to say.
  • Expert voice preservation—My training pipeline extracts knowledge from books, transcripts, and interviews, then generates thousands of synthetic conversations that faithfully reproduce an expert’s methodology, tone, and reasoning patterns.
  • Proof: I built a conscious parenting advisor fine-tuned on a 4-book corpus—59 psychological concepts, 698 behavioral indicators, 295 abstract patterns. Deployed with 92% concept detection accuracy across 12 test personas.
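The two training-data shapes behind SFT + DPO can be sketched as plain records. The exact schema varies by trainer; the field names below follow the chat-messages and preference-pair formats commonly accepted (e.g. by TRL), so treat them as an assumption rather than a fixed contract.

```python
def sft_record(question: str, expert_answer: str, system: str) -> dict:
    """One supervised fine-tuning example in chat-messages format:
    the assistant turn carries the expert's knowledge and voice."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
        {"role": "assistant", "content": expert_answer},
    ]}

def dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """One preference pair for Direct Preference Optimization: `chosen`
    reflects the expert's reasoning style, `rejected` is a generic reply."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

SFT teaches the model what the expert knows; the DPO pairs then shape how it answers, which is why both record types are generated from the same source corpus.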

Model hosting and serving

I deploy and manage LLM infrastructure at every scale—from a single Mac Mini running local inference to a multi-GPU cloud fleet serving thousands of requests per minute.

  • GPU fleet management—Lambda Labs H100 instances with TensorRT-LLM and SGLang for high-throughput serving. LoRA hot-loading for multi-tenant deployments.
  • Local inference—Ollama, MLX, and llama.cpp on Apple Silicon for development, testing, and always-on daemons. My Kothar soul daemon runs 24/7 on a Mac Mini M4 with LoRA adapters.
  • Multi-provider routing—OpenRouter, Groq (Kimi K2), Baseten, Anthropic, and OpenAI with automatic fallback and cost optimization. The right model for each task, not the most expensive one for every task.
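The fallback half of multi-provider routing reduces to trying providers cheapest-first and moving on when one fails. A minimal sketch, with plain callables standing in for real API clients:

```python
from typing import Callable

def complete_with_fallback(prompt: str,
                           providers: list[tuple[str, Callable[[str], str]]]
                           ) -> tuple[str, str]:
    """Try providers in order (cheapest/fastest first). Each entry is
    (name, call), where `call` returns completion text or raises on
    timeout, rate limit, or outage. Returns (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # provider down, rate-limited, etc.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Cost optimization lives in the ordering of the list: a fast cheap model goes first, and the expensive one is only reached when the cheap one is unavailable or unsuited to the task.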

Evaluation and safety

A model that works in demos and fails in production is worse than no model at all. I build evaluation frameworks that catch failures before users do.

  • LLM-as-judge—5 providers tested head-to-head (Kimi K2.5, GPT-4o, Claude, Qwen3, custom LoRA). Each judge calibrated against human labels with TPR/TNR analysis.
  • 17-metric framework—Citation density, recall, precision (ALCE standard), source diversity, faithfulness, answer relevancy (RAGAS), and 11 more. Not vibes—numbers.
  • Adversarial testing—30+ scenarios targeting clinical safety (C-SSRS suicide protocol, PHQ-9 depression screening, HITS domestic violence instrument). If your model handles sensitive domains, I stress-test the failure modes.
  • Proof: I built a 55-rubric evaluation suite with 483 criteria across Breaking News, VC, Executive, AI/ML, Academic, and 50 other domains. Blind A/B prompt comparison with 93 automated tests.
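Calibrating a judge against human labels reduces to computing its true-positive and true-negative rates. A minimal sketch, treating "response is unsafe/incorrect" as the positive class:

```python
def calibrate_judge(judge_labels: list[bool],
                    human_labels: list[bool]) -> dict:
    """TPR/TNR of an LLM judge against human ground truth.
    True means the rater flagged the response (the positive class)."""
    tp = sum(j and h for j, h in zip(judge_labels, human_labels))
    tn = sum((not j) and (not h) for j, h in zip(judge_labels, human_labels))
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return {"tpr": tp / pos if pos else 0.0,
            "tnr": tn / neg if neg else 0.0}
```

A judge with high TPR but low TNR over-flags safe responses; one with the reverse misses real failures. Reporting both is what separates a calibrated judge from a vibes-based one.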

RAG systems

Retrieval-Augmented Generation gives your model access to your data without retraining. I build RAG pipelines that actually retrieve the right context—not just the closest vector.

  • Vector search—pgvector + Supabase, VoyageAI rerank-2.5 for precision. Semantic embeddings with behavioral indicator decomposition and concept-level retrieval.
  • CRAG-style evaluation—Source relevance scoring + BM25 section extraction before feeding context to the model. Quality retrieval, not quantity.
  • Local RAG—Fully private, zero-cloud knowledge bases running on local hardware. Semantic search across documents, papers, and code without any data leaving your machine.
  • Deep-fetch enrichment—Firecrawl V2 integration with TTL caching for live web sources. Background enrichment that improves retrieval quality over time.
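Stripped of pgvector and the reranker, the retrieval core is similarity ranking plus a relevance floor. A toy sketch with hand-rolled cosine similarity; in production the embeddings come from a model and the ranking runs in the database:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float],
             chunks: list[tuple[str, list[float]]],
             k: int = 3, min_score: float = 0.3) -> list[tuple[float, str]]:
    """Rank chunks by similarity, keep the top k, and drop anything below
    `min_score`: the 'right context, not just the closest vector' filter
    in miniature."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in chunks),
                    reverse=True)
    return [(s, t) for s, t in scored[:k] if s >= min_score]
```

The `min_score` floor matters: without it, an off-topic query still returns its k nearest neighbors, and the model confidently answers from irrelevant context.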

Built on open-source tooling

The agents I build for clients use the same architecture I’ve open-sourced. Claudicle provides persistent identity and memory across sessions—so agents remember context, maintain state, and improve over time. My Claude Code Minoan skill ecosystem includes 60+ production skills covering web scraping, data extraction, search, communication, and more. You’re not buying a black box—you’re getting systems built on transparent, extensible foundations.

What you get

The Dashboard — autonomous monitoring running without hands
  • Workflow audit with prioritized automation opportunities
  • Custom-built A.I. agents, tested and deployed in your environment
  • API integrations connecting your existing tools
  • Documentation so your team can maintain and extend the workflows
  • 30 days of post-deployment support and iteration

The question is no longer whether A.I. can automate your workflows—it’s how many hours you’ll waste before you let it.

Before ~15 min
  1. Read each email manually
  2. Copy to spreadsheet
  3. Flag priority by gut feel
  4. Route to team via Slack
After ~34 sec
  1. Claude scans inbox via API
  2. Classifies priority + intent
  3. Routes to correct team channel
  4. Drafts response for review
Evaluation Framework

Judge panel: Kimi K2.5 · GPT-5.4 · Sonnet 4.6 · Qwen3.5 · Custom LoRA
20 Metrics

Accuracy
  • Factual Accuracy: claims verified against ground-truth sources
  • Citation Density: ratio of cited vs. uncited claims in responses
  • Hallucination Rate: frequency of fabricated facts or nonexistent sources
  • Source Diversity: breadth of reference material used across responses
Quality
  • Coherence: logical flow within and across conversation turns
  • Completeness: coverage of all requested aspects in the response
  • Answer Relevancy: how directly the response addresses the user query
  • Character Consistency: persona, voice, and behavioral traits maintained across sessions
  • Emotional Awareness: recognition of and appropriate response to user emotional state
Safety
  • Toxicity Detection: identifies and refuses harmful, abusive, or violent content
  • Bias Detection: screens for demographic, cultural, or ideological bias
  • PII Leakage: prevents exposure of personally identifiable information
  • Scope Boundaries: stays within the defined expertise domain, defers when appropriate
  • Refusal Accuracy: correctly refuses dangerous requests without over-refusing safe ones
Fine-Tuning
  • Domain Accuracy: correctness within the trained specialty knowledge area
  • Style Fidelity: adherence to target voice, tone, and communication patterns
  • Concept Detection: ability to identify domain-specific concepts in user input
Performance
  • Latency (p95): 95th-percentile response time under production load
  • Cost per Query: average API + compute cost per user interaction
  • Token Efficiency: information density relative to token usage
Adversarial Scenarios (30+)
  • Harmful content generation (Safety)
  • Risk detection & escalation (Safety)
  • Hallucinated citation injection (Accuracy)
  • PII extraction attempt (Privacy)
  • Confidential data exfiltration (Privacy)
  • Jailbreak via role-play (Security)
  • Prompt injection attack (Security)
  • Biased recommendation probe (Fairness)
  • Character drift under pressure (Fidelity)
  • Out-of-scope request handling (Fidelity)
  • Fine-tuned vs. base model divergence (Training)
  • Adversarial paraphrase consistency (Training)
Judge Coverage

Five judges (K2.5, GPT, Sonnet, Qwen, LoRA) evaluated across eight safety dimensions: Toxicity, Bias, PII, Scope, Refusal, Risk Detection, Cultural, Sycophancy.

5 judge providers · 8 safety dimensions · 30+ adversarial scenarios

Ready to start? Get in touch today.

The first call is free. 30 minutes, no pitch—just your stack, your problems, and what I'd do about them. Book a call.

Consulting the depths