How We're Using AI to Scale Tech Discovery
In which we become Claude Code pilled
Happy New Year!
First housekeeping note - Innovate Animal Ag is hiring again! Our open roles are here. Talent is our #1 determinant of success, so please consider whether there's anyone you know who could be a great fit for any of our roles 🙏.
Second housekeeping note - I turned on paid subscriptions because my understanding is that this is good for the algorithm. All posts here will continue to be free, but if you feel inspired to support our work, all subscription revenue goes directly to IAA, and it could help other readers find our work. In either case, thanks for reading!
Bulwark Biologics, one of Innovate Animal Ag’s flagship projects, started because someone on our team found a paper. It was over a year old, had almost no citations, and described a technology that could meaningfully reduce lameness in broiler chickens. The underlying science (electron-beam inactivated vaccines) had been demonstrated decades earlier in a different application, and the USDA had even held a patent on it that they let expire due to lack of commercial interest.
That discovery raised an uncomfortable question: how many other technologies like this are sitting in the literature right now, waiting for someone to notice them?
This question seemed difficult to answer definitively, at least until the entire internet became obsessed with Claude Code a few months ago. We, along with everyone else, started thinking about how to accelerate our work through software projects that previously might have been too difficult or costly. The result was a technology evaluation system that we’ve already used to look at 24,000 patents, and it’s surfacing promising technologies we’d never have found on our own.
Automating Our Own Judgment
When we evaluate a new technology, we ask two questions: 1) Could this have a strong business value proposition for the farmer or producer? And 2) Could this meaningfully improve animal health or welfare? These sound simple, but there's a lot of nuance behind them, informed by years of working within the industry and seeing which technologies and businesses succeed or fail, and why.
If you asked a current AI chatbot to evaluate a technology against these criteria, it would do a decent job, maybe 80%. But the last 20% is where most of the value is. There are plenty of technologies that look unremarkable on first pass but hold real promise with deeper consideration, and plenty that look exciting but face obstacles only an industry insider would recognize.
To get at this last 20%, we need a process that can actually capture the nuance of our manual evaluation. The problem is that nuance is hard to specify upfront. It’s difficult to write down a sufficiently detailed rubric to hand to the AI. What we can do is show it examples, let it try, tell it where it's wrong, and watch it get closer. That's the core idea: an iteratively calibrated agent that progressively learns to approximate our judgment. Here's how it works:
We start by writing out our evaluation framework as explicitly as possible and give it to a naive agent as a prompt. The agent scores each technology between 0 and 1 and outputs its reasoning — what arguments it found convincing, what sources it drew on.
We then develop a set of test cases around technologies we’ve already deeply evaluated where we have strong views about how promising they actually are. For each test case, we define an acceptable range of scores for the agent to output.
We then run the agent on these same cases, blind to the expected ranges, and compare its output to the desired scores. When the agent's score diverges from ours, we interrogate its reasoning. Occasionally, the agent weighs a factor we hadn't considered, and we update the expected range in the test case. More often, the agent misses things, or reasons differently than we would have. For example, it might not immediately understand how poultry integrators make purchasing decisions around issues affecting contract growers, or how to weigh the difference between a technology that merely detects a problem and one that treats it. In those cases, we critique the reasoning in natural language directly in a Claude Code session, and Claude updates the agent's prompt for us.
We repeat this process until all of the test cases pass, meaning that the agent is able to approximate our judgment, at least for a subset of cases.
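To make the loop concrete, here's a minimal sketch of the calibration harness in Python. Every name here is hypothetical, and `evaluate` is a toy stand-in for the real LLM agent call (which would send the rubric prompt plus the technology description to the model and parse back a score and its reasoning):

```python
# Sketch of the calibration loop: test cases with acceptable score
# ranges, run the agent blind, and report divergences for critique.
from dataclasses import dataclass

@dataclass
class TestCase:
    technology: str
    low: float   # bottom of the acceptable 0-1 score range
    high: float  # top of the acceptable 0-1 score range

def evaluate(technology: str, prompt: str) -> float:
    """Stand-in for the agent. The real system would call the model
    with `prompt` and return a 0-1 score plus reasoning; here a toy
    heuristic keeps the sketch runnable."""
    return 0.9 if "vaccine" in technology.lower() else 0.2

def failing_cases(cases: list[TestCase], prompt: str) -> list[tuple[str, float]]:
    """Run the agent blind on each case; report scores outside range."""
    failures = []
    for case in cases:
        score = evaluate(case.technology, prompt)
        if not (case.low <= score <= case.high):
            failures.append((case.technology, score))
    return failures

cases = [
    TestCase("electron-beam inactivated vaccine", 0.7, 1.0),
    TestCase("novel feed additive with no field data", 0.0, 0.4),
]
# When this list is non-empty, we critique the agent's reasoning,
# revise the prompt, and rerun until every case passes.
print(failing_cases(cases, prompt="v1 evaluation rubric"))
```

The key design choice is that the prompt, not the test cases, is the artifact being optimized: the test cases encode our judgment, and each calibration round edits the prompt until the agent's scores land inside every range.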
Once the agent is calibrated, we release it into the world. Our first task was to evaluate every US patent ever filed on an issue relevant to poultry health or welfare — over 24,000 in total. No human team could do a search that comprehensive, but once your judgment is measured in compute costs rather than payroll costs, that kind of depth becomes feasible. The first full run cost a few thousand dollars and took only a few hours.
Critically, every new technology the agent encounters is an opportunity for further calibration. The first patent evaluation run surfaced issues our test cases didn’t cover — for example, how should the agent weigh the existence of competition when evaluating a new technology? When the agent reaches a different conclusion than we would, that becomes a new test case, and we run the calibration process a few more times.
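A corpus-scale run like the patent evaluation could be sketched as below. The data, threshold, and function names are illustrative, and `evaluate` again stubs out the calibrated agent:

```python
# Sketch: score every patent with the calibrated agent, rank by score,
# and flag high scorers for human review. All names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def evaluate(patent: dict) -> float:
    """Stand-in for the calibrated agent, returning a 0-1 score."""
    return 0.85 if "ascites" in patent["title"].lower() else 0.1

def score_corpus(patents: list[dict], threshold: float = 0.7,
                 workers: int = 8) -> list[tuple[str, float]]:
    # Real agent calls are I/O-bound API requests, so a thread pool
    # is what makes 24,000 evaluations finish in hours, not weeks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate, patents))
    ranked = sorted(zip(patents, scores), key=lambda pair: -pair[1])
    return [(p["id"], s) for p, s in ranked if s >= threshold]

patents = [
    {"id": "US-111", "title": "CoQ10 supplementation to reduce ascites"},
    {"id": "US-222", "title": "Improved feeder design"},
]
print(score_corpus(patents))  # surfaces US-111 for human review
```

Anything above the threshold goes to a human; anything where the humans disagree with the score becomes a new test case for the next calibration round.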
Indeed, the system didn't work very well out of the box. There was often nuance we didn't think of until the agent explicitly missed it. But after a few rounds of iteration, the system started to work, and it has started to surface opportunities we might never have found. Its favorite technology in the existing patent literature is using CoQ10, a cheap compound frequently used in human health supplements, to mitigate ascites in broilers, especially on farms at high altitudes. Multiple patents with promising data have been filed, some over 25 years old. Yet no one has ever commercialized a product for poultry.
There may be good reasons no product exists. But if there are, they likely come down to non-public information the agent can't access, or considerations that aren't immediately obvious to us. That's why the process always bottoms out with humans. The purpose of the system is to bubble up opportunities that are worth investing human time into. Truly assessing a technology's promise then means talking with the people developing it and the customers actually facing the problem on the ground. If the CoQ10 project ends up panning out, I'll definitely write about it here.
Our AI Future
The process of iteratively calibrating an agent against pre-defined test cases is one I think applies far beyond this one context. Essentially, it's a way of automating and approximating complex judgment – like an SOP for cognitive labor. The same architecture could be used in any domain where deep expertise exists but the bandwidth to apply it doesn't.
For us, the more immediate question is: what else can we do with our calibrated agents? Papers and non-US patents are the obvious next steps, but there could be even more creative applications. One that I find particularly fun is a "lateral thinking agent" that generates novel business concepts from random word pairings, then runs them through the same calibrated evaluator. Most of the output will be garbage, but if you generate thousands, or tens of thousands, of ideas, maybe one of them will be good. Or you could design another agent that looks at broader technological advancements in biotech or AI, generates possible applications for animal agriculture, and feeds those back through the evaluator.
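The lateral-thinking idea reduces to a simple generate-then-filter pipeline. This sketch is purely illustrative: the word list is made up, and `evaluate` is a deterministic placeholder for the calibrated evaluator agent:

```python
# Sketch: generate many concepts from random word pairings, score each
# with the (stubbed) calibrated evaluator, and keep only the top few.
import random

WORDS = ["ultrasound", "probiotic", "drone", "lighting", "enzyme", "sensor"]

def generate_concepts(n: int, seed: int = 0) -> list[str]:
    """Pair random words into candidate concepts; most will be garbage."""
    rng = random.Random(seed)
    return [f"{rng.choice(WORDS)} + {rng.choice(WORDS)}" for _ in range(n)]

def evaluate(concept: str) -> float:
    """Stand-in for the calibrated evaluator agent (0-1 score).
    Seeding on the concept string makes the sketch reproducible."""
    return random.Random(concept).random()

def top_ideas(n: int = 10_000, k: int = 5) -> list[str]:
    """Generate n concepts and return the k highest-scoring ones."""
    return sorted(generate_concepts(n), key=evaluate, reverse=True)[:k]

print(top_ideas())
```

The economics only work because the evaluator is cheap per call: generating ten thousand mostly-bad ideas is fine when filtering them costs compute rather than expert hours.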
We’re early in figuring out what these tools can do, and every week we find new ways to use them. We’re looking for an AI engineer to help build out more of these tools, so if this work sounds interesting to you please reach out!

