
From Vibe to Verifier: How I Built an AI Evals Manager in a Weekend

How I built an evals dashboard manager for my AI-powered product with no-code tools in a single weekend

Bethany Crystal


On Tuesday night at Union Square Ventures, I had the chance to demo the result of my latest vibe-coding weekend hack: An evals dashboard manager for MuseKat.

[Image: Showing off the prompting and architecture behind an evals manager that I built for MuseKat this weekend.]

One of the most exciting possibilities of AI-powered, no-code development has been the realization that anybody with a little determination and patience can spin up their own tools and apps. Since January 2025, this so-called "vibe coding" (i.e., building software without knowing a programming language) has taken off in a big way, with people spinning up micro-apps in all corners of the Internet.

Websites like Replit, Bolt, Lovable, Bubble, Ohara (and many more) make it possible to instantly spin up landing pages or prototypes for just about anything. But there are limits to vibe coding – particularly for the non-classically-trained engineers among us – and I've already started to see a growing cottage industry of tools and services for non-technical builders, spanning everything from tutorials, courses, and video lessons to hands-on tutoring, consulting, and even ready-made services.

It's an exciting time because there are no rules. It's a scary time because there are no rules.

One of the things we've proven to be true is that it's easy to start something net new as a vibe coder. What's been harder is building on top of existing infrastructure.

While I've been vibe building all year, this was my first time building out bespoke internal tooling for my own company (a task that I used to reserve for engineers and product managers). In the end, it was a really fun experience that reminded me how anybody on any team in a company can quickly spin up their own internal tooling. Here's a bit more on why evals matter and how I built it.


Evaluating the Performance of Your AI Tools


My app MuseKat features a curious digital meerkat named Miko who likes to help kids parse information about the world around them through instant audio stories. But, as previously discussed, Miko doesn't always get things right. Since everything that Miko says is AI-generated, it's important for me as the builder to have a better grasp of how to control the output of each story to optimize for different storytelling conditions.

As an example, here's the readout of an audio story that Miko told my 5-year-old about a space shuttle we visited at the Intrepid Museum last weekend. While there's certainly some decent information in there, the crux of the product comes down to this key question: How good is that output?

That's where evals (or evaluations) come in. As I'm learning, one of the most important skills in building effective AI products is knowing how to evaluate and tweak the AI's output.

Many of the top product managers in the tech industry have been sharing how running an effective evals process is the single most important skill for building software in the AI era. (After all, if you can't figure out how to control the output of an LLM or AI agent, then you can't reliably provide the same experience to every user, and your product quality will suffer.)

This post from Lenny's Newsletter is an excellent tutorial on how to get started on evals. Here's a breakdown on how running AI agent evals differs from traditional software testing:

[Chart: how running AI agent evals differs from traditional software testing. Source: Beyond vibe checks: A PM’s complete guide to evals, Lenny's Newsletter]

As the post points out, you can run evals a lot of different ways: With humans, with AIs, or with programmable software. (My guess is that the best designed AI systems will need to include a mix of all three.)
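
To make that concrete, here's a minimal sketch of what the programmatic and AI-judge flavors can look like side by side. This is illustrative rather than MuseKat's actual code, and `call_llm` is a hypothetical wrapper around whatever model API you use; the human flavor is simply a person grading the same output against the same rubric.

```python
# Illustrative sketch of two of the three eval styles (not MuseKat's actual code).
BLOCKLIST = {"stupid", "dumb"}  # placeholder kid-safety word list

def programmatic_check(story: str) -> bool:
    """Deterministic rules: story length in a target range, no blocklisted words."""
    words = story.lower().split()
    return 80 <= len(words) <= 400 and not BLOCKLIST.intersection(words)

def llm_judge_check(story: str, learner_age: int, call_llm) -> str:
    """Subjective rubric item: ask a second model to grade age-appropriateness."""
    prompt = (
        f"Rate from 1 to 5 how age-appropriate this audio story is for a "
        f"{learner_age}-year-old, and explain your rating in one sentence:\n\n{story}"
    )
    return call_llm(prompt)  # `call_llm` is a hypothetical model-API wrapper
```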

I've noticed that, even when nobody calls it evals, the natural feedback I receive when I share MuseKat with parents and teachers is really feedback on the output of the AI engine. I've heard things like:

  • "It's not detailed enough."

  • "It's too detailed!"

  • "It told me something that wasn't true."

  • "It thinks I'm looking at something else."

  • "It told me something that wasn't kid-appropriate."

  • "It sounds too robotic."

  • "It missed the most important concepts."

These human gut checks are, in a sense, the core of the evaluation process. Without feedback like this, I'll never be able to improve the readouts or write more structured, useful rules for how the LLM generates each story. But it's hard to feed this kind of unstructured data back into the system programmatically.


Building an Evals Manager Without Coding

To get started on building out an evals tool, you first need a good sense of what criteria you are evaluating. This is where gathering and aggregating user feedback becomes really handy. I've been running small-batch cohort tests with parents, students, and industry experts to help me understand what people like and don't like about the current way that Miko the Meerkat tells stories.

But user feedback often leaves you with a bunch of unstructured data and anecdotes.

So the first thing I did was use ChatGPT to help me parse through the feedback and identify a first set of categorical criteria that matter most to me. Then we worked together to establish a baseline rating system and feedback loop. With the help of this Huggingface AI Agents course, ChatGPT, and Cursor, I quickly spun up a working prototype of a spot-check evals manager using Gradio as the front-end.

Here's a screenshot:

[Screenshot of the spot-check evals manager prototype]

I built this tool to give me, as the builder, a more granular look at some of the key parameters in the AI's output that matter to me. You can see we homed in on things like the accuracy of the image recognition and the age-appropriateness of the language, but also things bespoke to my app, like how well the narration sticks to the meerkat script and whether it stays appropriate for young audiences.
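
For illustration, here's roughly how criteria like those might be pinned down in code so that every story gets scored against the same dimensions. The names and wording below are hypothetical, not my exact rubric.

```python
# Hypothetical encoding of the spot-check rubric (names and wording are illustrative).
RUBRIC = {
    "image_recognition_accuracy": "Does the story describe the artifact that was actually photographed?",
    "factual_accuracy": "Are the facts consistent with the extracted label text?",
    "age_appropriateness": "Are the vocabulary and concepts right for the learner's age?",
    "meerkat_voice": "Does the narration stick to Miko's meerkat script and persona?",
    "kid_safety": "Is everything appropriate for young audiences?",
}

def empty_scorecard() -> dict:
    """One slot per criterion, to be filled in (say, 1-5) by a human or an LLM judge."""
    return {criterion: None for criterion in RUBRIC}
```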

How it works:

  1. Type in the learner age of the kid profile that you're testing.

  2. Upload the artifact image and the descriptor image from Miko's readout in the app (there's even an OCR text extractor built in).

  3. Paste in Miko's response from the app.

  4. Run the evaluation and instantly receive a scored rubric showing how well that AI readout performed against each category (a rough sketch of this flow follows the list).
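
Here's a minimal sketch of how that flow can be wired up in Gradio. The scoring function is a stub standing in for the OCR and LLM-judge steps, and names like `score_story` are mine for illustration.

```python
# Minimal Gradio sketch of the spot-check evals flow (scoring logic is a stub).
import gradio as gr

def score_story(learner_age, artifact_image, label_image, miko_response):
    # Placeholder: a real implementation would OCR `label_image`, then ask an
    # LLM judge to grade `miko_response` against each rubric criterion.
    return (
        f"Learner age: {learner_age}\n"
        f"Artifact image: {'uploaded' if artifact_image else 'missing'}\n"
        f"Label image: {'uploaded' if label_image else 'missing'}\n"
        f"Response length: {len(miko_response or '')} characters\n"
        "Rubric scores would appear here."
    )

demo = gr.Interface(
    fn=score_story,
    inputs=[
        gr.Number(label="Learner age"),
        gr.Image(type="filepath", label="Artifact image"),
        gr.Image(type="filepath", label="Descriptor/label image"),
        gr.Textbox(lines=8, label="Miko's response"),
    ],
    outputs=gr.Textbox(label="Scored rubric"),
    title="MuseKat spot-check evals (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```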

While the one-off granular look is nice, if I'm going to build anything at scale, I'm going to need a much more automated way of evaluating the performance of my tooling. So, with Cursor's help, I moved one step up the agentic ladder of development and built a Python script that analyzes each new Miko query as it hits my database, then analyzes multiple queries in bulk (based on time parameters that I set) and sends me an email readout of some key observations and areas for improvement.
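
Here's a rough sketch of what that batch step can look like. `fetch_queries` and `score_query` are hypothetical stand-ins for the real database and LLM-judge calls, and the email settings are placeholders.

```python
# Sketch of the batch eval + email summary step (database and scoring are stubbed).
import smtplib
from datetime import datetime, timedelta
from email.message import EmailMessage

def fetch_queries(since: datetime) -> list[dict]:
    """Hypothetical: return Miko queries newer than `since` from the database."""
    return []

def score_query(query: dict) -> tuple[int, list[str]]:
    """Hypothetical: return (score out of 100, list of improvement notes)."""
    return 0, []

def run_batch_eval(hours: int = 24) -> None:
    queries = fetch_queries(since=datetime.now() - timedelta(hours=hours))
    scored = [(q, *score_query(q)) for q in queries]

    lines = [f"Evaluated {len(scored)} queries from the last {hours} hours.\n"]
    for q, score, notes in scored:
        lines.append(f"- Query {q.get('id', '?')}: {score}/100")
        lines.extend(f"    * {note}" for note in notes)

    msg = EmailMessage()
    msg["Subject"] = "MuseKat evals summary"
    msg["From"] = "evals@example.com"      # placeholder address
    msg["To"] = "builder@example.com"      # placeholder address
    msg.set_content("\n".join(lines))

    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP server
        smtp.send_message(msg)

if __name__ == "__main__":
    run_batch_eval(hours=24)
```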

Here's a snapshot of what that summary looks like with a short sample set of 5 queries analyzed:

[Screenshot: email summary readout for a sample batch of 5 evaluated queries]

As you can see, it's batching my queries, assigning each a score (out of 100) and then also helping me identify some areas for improvement (in this case, three of the five did not pull out at least 2 facts from the extracted label text for that image).
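
As an example of the kind of check behind that observation, here's a sketch of a label-fact coverage test. The threshold and the "distinctive term" heuristic are illustrative, not my script's exact logic.

```python
# Sketch of a check that a story reuses facts from the OCR'd exhibit label text.
import re

def label_fact_coverage(story: str, extracted_label_text: str, minimum: int = 2) -> bool:
    """True if the story reuses at least `minimum` distinctive label terms."""
    def terms(text: str) -> set[str]:
        # Treat words of 6+ letters as "distinctive" enough to count as a fact.
        return {w.lower() for w in re.findall(r"[A-Za-z]{6,}", text)}

    shared = terms(extracted_label_text) & terms(story)
    return len(shared) >= minimum

# Example: flag stories that miss the label entirely.
label = "Space Shuttle Enterprise was the prototype orbiter used for approach and landing tests."
story = "Miko saw a big white spaceship today! It never flew to space."
print(label_fact_coverage(story, label))  # False: fewer than 2 label terms reused
```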

This is a deeply imperfect system right now, but I was pretty proud of how quickly I was able to get from the black box of AI-generated data to a tool that lets me tinker with the parameters in real time. That I was able to spin this up in about three days (with a working prototype of the frontend after just three hours) is a real testament to how fast no-code AI building is evolving.

If you're working on a better way to create reliable results from your AI outputs with evals (or have other ideas for how I can make this tool even better), I'd love to hear from you!
