Introduction
On April 29, Andrej Karpathy, a founding member of OpenAI and the former leader of Tesla’s Autopilot AI team, spoke at an event hosted by AI Sent. He examined the technological leaps of current AI agents and their profound impact on software and hardware ecosystems.
Karpathy introduced the concept of “agentic engineering” to differentiate it from last year’s “vibe coding”: the former is about preserving the quality standards of professional software development while dramatically accelerating it.
Key Concepts
In terms of productivity, which is a primary concern for the market, Karpathy distinguished between two core concepts: “vibe coding” and “agentic engineering.”
Karpathy hinted at the existence of numerous high-value, verifiable reinforcement learning environments that remain largely unaddressed by leading labs, presenting a vast blue ocean for startups to fine-tune and monetize.
The Conversation
Host: We are honored to welcome our first special guest. He has played a pivotal role in building modern AI and is dedicated to explaining it, sometimes even renaming it. He is a founding member of OpenAI, where he helped launch the company, and he was instrumental in making Tesla’s autonomous driving system operational. He possesses a rare talent for making complex technological changes sound straightforward and logical. Many are aware that he coined the term “vibe coding” last year. However, in recent months, he made a surprising statement: he has never felt more outdated as a programmer than he does now. Let’s start our conversation from here. Andrej, thank you for being here.
Andrej Karpathy: Hello, I’m glad to be here to kick things off.
Host: Just a few months ago, you mentioned feeling more outdated as a programmer than ever. Hearing this from you is quite surprising. Can you share your feelings behind this? Is it excitement or unease?
Andrej Karpathy: It’s both. Like many, I’ve been using various agent tools over the past year, like Claude Code. It performs well on code snippets; it sometimes makes mistakes that require manual fixes, but overall it’s quite helpful.
Last December marked a significant turning point for me. I was on vacation with more time to reflect, and I noticed that with the latest models, the code came back correct on the first try; I kept asking for more, and it stayed correct. I can hardly remember the last time I had to fix anything. I began to trust the system more and entered a state of “vibe coding.”
That was a very distinct shift. I tried to emphasize this on Twitter (now X), because many people’s interactions with AI last year were still at the level of using ChatGPT. But a reevaluation is needed, especially since December, when something fundamentally changed: agent workflows in particular became genuinely usable. Since then, I’ve dived deep into this rabbit hole, and my side-project folder has filled up with all kinds of oddities as I keep using AI to write code. That’s roughly what happened in December. Since then, I’ve been observing and thinking about its impacts.
The Evolution of Software
Host: You’ve discussed the idea that “LLMs are a new type of computer”—not just better software but a new computing paradigm. Software 1.0 had explicit rules, Software 2.0 involved learned weights, and Software 3.0 is where we are now. If this framework is correct, what different practices would a team adopt when they genuinely believe in this shift?
Andrej Karpathy: Yes, indeed. In the Software 1.0 phase, I was writing code; in Software 2.0, I was programming by building datasets and training neural networks, where programming became about organizing datasets, designing objective functions, and neural network architectures.
What happened next is that when you train these GPT-style large language models on enough tasks, effectively on the entire internet, they learn to complete an enormous variety of tasks, which makes them, in a sense, a programmable computer.
In the Software 3.0 phase, your “programming” shifts to “prompt engineering,” where the content in the context window acts as the lever for manipulating the interpreter—the LLM that interprets your context and executes computations in the digital information space. This is essentially the nature of this transformation.
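As a rough illustration of what “the context window is the program” means in practice, here is a minimal sketch in which the interpreter (one LLM) stays fixed and only the prompt changes. The call pattern is the standard OpenAI Python client, but the model name and example inputs are assumptions; substitute whatever model and data you actually use.

```python
from openai import OpenAI  # any chat-completion SDK would do; this one is just an example

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # assumed model name; use whatever you have access to

def run(program: str, data: str) -> str:
    """The LLM acts as the interpreter; the context window is the 'program'."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{program}\n\n{data}"}],
    )
    return response.choices[0].message.content

ticket = "My card was charged twice for the same order."
# Same interpreter, two different "programs":
print(run("Summarize this support ticket in one sentence:", ticket))
print(run("Reply with exactly one word, 'billing' or 'technical':", ticket))
```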
Several examples have deepened my understanding of this, and I think they are worth sharing.
When OpenClaw was released, you would typically expect to install it using a shell script. To accommodate various platforms and types of computers, though, such shell scripts often become extremely bulky and complex. OpenClaw’s installation method is instead to copy a segment of text to your agent, which then completes the installation. This is far more powerful because you are operating under the Software 3.0 paradigm and don’t need to specify every configuration detail precisely. The agent has its own intelligence; it understands the instructions, observes your operating environment, takes intelligent actions to get everything running, and autonomously debugs in a loop. This is immensely powerful.
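Under the hood, “copy a segment of text to your agent” boils down to a loop of observe, act, and verify. The sketch below is a deliberately simplified illustration; ask_agent() is a hypothetical call to whatever model drives the agent, and the instruction text is made up.

```python
import subprocess

INSTRUCTIONS = (
    "Install the tool on this machine, verify that `tool --version` prints a version, "
    "and fix any errors you encounter. Reply with one shell command at a time, or DONE."
)

def ask_agent(transcript: str) -> str:
    """Hypothetical LLM call that returns the next shell command (or 'DONE')."""
    raise NotImplementedError("wire this to your preferred model API")

transcript = INSTRUCTIONS
for _ in range(10):  # bounded observe -> act -> verify loop
    command = ask_agent(transcript)
    if command.strip() == "DONE":
        break
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    # Feed stdout/stderr back into the context so the agent can debug its own failures.
    transcript += f"\n$ {command}\n{result.stdout}{result.stderr}"
```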
Another, more extreme example comes from my experience building MenuGen. The idea behind MenuGen is that when you go to a restaurant, they hand you a menu, but it usually lacks pictures, so you have no idea what the dishes look like. I wanted to take a photo of the menu and get an approximate visual of each dish. So, using “vibe coding,” I built an application, deployed on Vercel, where you upload a photo of the menu; it runs OCR to extract each dish name, re-renders the menu as a list of dishes, and calls an image-generation model to produce a picture of each dish for the user.
Later, I saw the Software 3.0 version of this, which utterly shocked me: I just needed to hand the photo to Gemini’s Nano Banana and say, “overlay the dishes onto this menu.” Nano Banana directly returned an image, my photo of the menu, but with a picture of each listed dish rendered onto it at the pixel level. This astonished me because my entire MenuGen was actually redundant; it operated under an old paradigm, and that application shouldn’t even exist. The Software 3.0 paradigm is much more primitive: the neural networks do more of the work, with images as input and output, requiring no application layer in between.
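Reduced to a sketch, the contrast between the two versions looks roughly like this; the helper stubs stand in for real OCR, image-generation, and image-editing APIs and are not MenuGen’s actual code.

```python
# Hypothetical stubs standing in for real OCR / image-generation / image-editing services.
def ocr_extract_dish_names(menu_photo: bytes) -> list[str]: ...
def generate_dish_image(dish_name: str) -> bytes: ...
def edit_image(photo: bytes, instruction: str) -> bytes: ...

def menugen_classic(menu_photo: bytes) -> list[tuple[str, bytes]]:
    """Old paradigm: an application layer glues narrow model calls together."""
    dish_names = ocr_extract_dish_names(menu_photo)                     # read the menu
    return [(name, generate_dish_image(name)) for name in dish_names]  # one image per dish

def menugen_v3(menu_photo: bytes) -> bytes:
    """Software 3.0: one image-in, image-out model call, no application layer in between."""
    return edit_image(menu_photo, "Render a small picture of each dish next to its name on this menu.")
```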
Thus, I believe people need to reevaluate their thinking frameworks, not limit themselves to existing paradigms, and not merely view it as an accelerated version of current things. What is truly happening is that new possibilities are now available. Returning to your question about programming, I think this issue reflects an old way of thinking—because it’s not just about programming becoming faster; it’s about the broader sense of information processing now being automatable, which is not just about code.
In the past, code operated on structured data. But, for instance, my “LLM knowledge base” project essentially lets LLMs generate a wiki for your organization or personal use. This is not a program in the old sense; it’s something that couldn’t exist before, because no code could generate a knowledge base from a pile of facts. Now you can feed in these documents and recompile and reorder them in different ways to create new, valuable content: a reinterpretation of the data. These are all new things that were previously impossible. So I always want to return to this question: not just what can now be done faster, but what previously impossible opportunities are now available. I find the latter even more exciting.
Future Opportunities
Host: I love the evolutionary path of MenuGen you described, and the contrast. I believe many people have also followed your programming journey from last October to this February. If we continue to extrapolate, comparing historical moments like building websites in the ’90s, mobile applications in the 2010s, and SaaS in the cloud era, what are the things that are largely unbuilt today but will seem obvious in hindsight?
Andrej Karpathy: Continuing from the MenuGen example, much code shouldn’t exist; neural networks take on the bulk of the work. I genuinely feel this extrapolation curve will become very strange.
One can imagine that, in a sense, a fully neural computer is possible: a device that takes raw video and audio, feeds it into a system that is essentially one big neural network, and renders the interface through a diffusion model, tailored to that unique moment.
In the early days of computing, people were confused about what computers would ultimately look like: would they resemble calculators or neural networks? In the 50s and 60s, this wasn’t obvious. Of course, we took the calculator path and established a classical computational system, and neural networks today run as software on top of those classical computers. However, one can envision a future where this all flips: neural networks become the host process, and CPUs become co-processors. We’ve already seen the chart where neural-network workloads come to dominate the world’s floating-point operations.
So you can imagine a very strange, very alien future form: neural networks handling the bulk of the heavy lifting, with tool calls remaining merely as historical remnants for certain deterministic tasks. What truly dominates everything is a network of neural networks, somehow interconnected. This extrapolated endpoint may be extremely strange, but I think we are likely to arrive there step by step. How we traverse that path remains to be seen.
Verifiability and Automation
Host: I want to discuss the concept of “verifiability”—AI will automate tasks in verifiable domains faster and more easily. If this framework holds, what jobs will change at an unexpected speed? What professions do people think are safe but are actually highly verifiable?
Andrej Karpathy: I’ve spent some time thinking about verifiability. Traditional computers can easily automate things that can be explicitly described in code; this round of large language models can easily automate things that can be verified. The reason is that leading labs, while training these large language models, are constructing vast reinforcement learning environments where models are rewarded based on verifiable signals. It’s precisely because of this training method that these models end up with a “jagged” capability map: strong in verifiable areas like mathematics and code, but comparatively weak and rough in areas that are less verifiable.
I wrote about verifiability to understand why these models have such uneven capabilities. Part of this is due to how labs train models, but I think it also relates to the labs’ focus—what data they happen to include. Some things are more economically valuable, leading to more training environments because labs want the models to perform well in those scenarios. Code is a typical example. There may be numerous verifiable environments that could have been included in training, but due to their lower practical value, they didn’t make it into the dataset.
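Concretely, a “verifiable signal” is just an automatic check you can compute a reward from. For the code domain, a minimal sketch might look like the following; it assumes pytest is installed and that the task ships its own unit tests.

```python
import pathlib
import subprocess
import tempfile

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Binary reward: 1.0 if the model's code passes the task's unit tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "solution.py").write_text(candidate_code)
        pathlib.Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],  # the automated check that makes this domain verifiable
            cwd=tmp, capture_output=True, text=True, timeout=60,
        )
    return 1.0 if result.returncode == 0 else 0.0
```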
For me, a classic example that illustrates “jagged intelligence” used to be: “How many letter r’s are in the word strawberry?” The model was notorious for getting this wrong. The current models have corrected this issue, but new examples have emerged: I want to go to a car wash 50 meters away; should I drive or walk? The most advanced models today would tell you to walk because it’s too close. But the issue is, you’re going to a car wash.
How strange is that—the most advanced Claude Opus 4.7 can simultaneously refactor 100,000 lines of code or discover zero-day vulnerabilities yet tells me to walk to the car wash. This is truly unbelievable.
This jagged capability indicates that, first, there may be fundamental issues in some areas of the model; second, you still need to be involved, treating it as a tool while maintaining some control over its behavior. So all my writing on verifiability ultimately aims to understand why these models have jagged capabilities and whether there’s a pattern to it. I believe the answer lies in a combination of “verifiability” and “lab focus.”
Another anecdote that illustrates the point: from GPT-3.5 to GPT-4, people noticed a significant improvement in the model’s chess-playing ability. Many assumed this was just a natural evolution of capability, but the reality is—this is public information; I saw it online—a large amount of chess game data was added to the pre-training set. Just due to the change in data distribution, the model’s chess ability surged beyond normal progression. Someone at OpenAI decided to include this data, and thus this capability suddenly skyrocketed.
This is why I emphasize this dimension: we are somewhat influenced by lab decisions; what they happen to include in training is what you get. You receive something without a manual; it works well in some cases and poorly in others, and you need to explore it.
If your application happens to fall within the coverage of reinforcement learning training, you will thrive; if it falls outside the data distribution, you will struggle. You need to figure out where your application lies; if it’s not within the covered loop, you really need to consider fine-tuning and do some of your own work because expecting large language models to work out of the box is unrealistic.
Advice for Founders
Host: If you were a founder today, considering starting a business, and you found a problem you believe you could solve in a verifiable domain, but you observe that labs have already achieved escape velocity in the most obvious directions—math, code, etc.—what advice would you give to the founders here?
Andrej Karpathy: I think this ties back to the previous question. Verifiability makes something feasible under the current paradigm because you can inject a large amount of reinforcement learning into it. This can still hold true even if labs aren’t directly focusing on a particular area. If you are in a verifiable setting and can create reinforcement learning environments and data samples, this effectively opens up a path for you to fine-tune, and you might benefit from it.
This is a technically feasible path: if you have a large, diverse dataset of reinforcement learning environments, you can use your preferred fine-tuning framework, pull this lever, and achieve quite decent results. I don’t want to specify which examples, but I genuinely believe there are some highly valuable reinforcement learning environments that haven’t been included in training…
That said, I don’t want to intentionally tease on stage, but such examples do exist.
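For a sense of what “creating reinforcement learning environments and data samples” can look like, here is a minimal, Gym-flavored sketch; the class and method names are assumptions, and a real environment would swap in a domain-specific verifier instead of exact string matching.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str    # what the model is asked to do
    expected: str  # ground truth, used only by the verifier

class VerifiableEnv:
    """Hypothetical environment: reset() serves a task, step() scores an attempt."""

    def __init__(self, tasks: list[Task]):
        self.tasks = tasks
        self.i = -1

    def reset(self) -> str:
        self.i = (self.i + 1) % len(self.tasks)
        return self.tasks[self.i].prompt

    def step(self, model_output: str) -> float:
        # Replace with your domain's checker: unit tests, a simulator, a schema validator...
        return 1.0 if model_output.strip() == self.tasks[self.i].expected else 0.0

env = VerifiableEnv([Task("What is 7 * 6?", "42")])
prompt = env.reset()
reward = env.step("42")  # a fine-tuning loop would generate this with the model instead
```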
Host: Conversely, what things still seem like they could be automated but are actually far from realization?
Andrej Karpathy: I do believe that almost everything can ultimately be designed to be verifiable; some are just easier than others. Even tasks like writing could be envisioned with a set of LLM judges scoring them, likely yielding quite decent results. So it’s more about the difficulty than whether it can be done. I think, fundamentally, everything can be automated.
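One way to picture “a set of LLM judges scoring them”: average a few rubric-specific judge calls into a single reward. The judge() function below is hypothetical and would be wired to whatever model you prefer.

```python
def judge(text: str, rubric: str) -> float:
    """Hypothetical LLM call that returns a 0-10 score for `text` under `rubric`."""
    raise NotImplementedError("wire this to your preferred model API")

RUBRICS = [
    "Score 0-10 for clarity and structure.",
    "Score 0-10 for factual accuracy.",
    "Score 0-10 for how engaging the prose is.",
]

def writing_reward(draft: str) -> float:
    # Averaging several judges with different rubrics gives a noisy but optimizable
    # signal for a task that has no clean programmatic verifier.
    return sum(judge(draft, rubric) for rubric in RUBRICS) / len(RUBRICS)
```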
The Shift from Vibe Coding to Agentic Engineering
Host: Last year, you coined the term “vibe coding.” Today, we find ourselves in a more serious and rigorous engineering world. What do you think the difference is? How would you label the stage we are in now?
Andrej Karpathy: I believe vibe coding is about raising the lower limit of everyone’s capabilities in software—overall raising the floor, allowing anyone to do anything with vibe coding, which is remarkable.
“Agentic engineering,” on the other hand, is about maintaining the original quality standards of professional software on this foundation. You can’t introduce security vulnerabilities due to vibe coding; you still bear responsibility for your software as before. But can you do it faster? Spoiler: yes. But how can you achieve that?
When I refer to it as “agentic engineering,” it’s because I believe it truly is an engineering discipline. You have these agents: they are somewhat “jagged” in nature, sometimes unreliable, sometimes random, but extremely powerful. The question is how to coordinate them to speed things up without sacrificing quality standards. Doing this well is the domain of agentic engineering.
I see these two concepts as different: one is about raising the floor, while the other is about breaking through the ceiling. What I’m observing is that the ceiling on an agentic engineer’s capability is extremely high. Previously, people talked about “10x engineers,” but I believe the amplification now far exceeds that. Ten times understates the acceleration available; from what I’m seeing, the output of someone truly proficient in this field exceeds tenfold.
The Future of Programming
Host: I love this framework. Last year, Sam Altman said something memorable when he visited AI Sent: different generations use ChatGPT differently. People in their thirties see it as a replacement for Google search, while teenagers view ChatGPT as an entry point to the internet. In today’s programming landscape, what is the analogy? If we observe two people using OpenAI’s Codex or Anthropic’s Claude Code to write code—one is a typical user, and the other is a true AI-native programmer—how would you describe the differences between them?
Andrej Karpathy: I think the core lies in making the most of the available tools, utilizing all their features, and continuously investing in their workflows. Just as earlier engineers would maximize the use of VIM or VS Code, now it’s about maximizing Claude Code or Codex.
In this regard, a related thought is worth mentioning. If many teams are now hiring agentic engineers, I believe most recruitment processes haven’t adapted accordingly. If you’re still giving puzzles for candidates to solve, you’re still in the old paradigm. The new recruitment process should be: give me a big project and see if you can get it done—like building a Twitter clone, doing it well and securely, then letting agents simulate user activity on your deployed site, and if it gets breached, that’s a failure. I think that’s roughly what the future will look like—observing candidates’ performance in building large projects and integrating tools in such scenarios.
The Value of Human Skills
Host: As agents become capable of more tasks, which human skills do you think will become more valuable rather than less valuable?
Andrej Karpathy: Currently, agents are essentially at the “intern” level—they are capable but still unstable. So you still need to take responsibility for aesthetics, judgment, taste, and moderate supervision.
One of my favorite examples that illustrates the oddities of agents: in MenuGen, users register with a Google account but purchase credits with a Stripe account—each has its own email. As a result, my agent, when handling credit top-ups, attempted to match the Google email with the Stripe email because there was no persistent user ID; it tried to associate the two accounts using email. However, users can completely use different emails for Stripe and Google, making it impossible to link funds to accounts. This error is very strange—why use email for cross-system identity association? Emails can be arbitrary and different.
Such errors are precisely the kind agents still make: you need to take responsibility for specifications and overall planning. Speaking of “planning mode,” it’s undoubtedly useful, but I think there’s a more general principle: you design a very detailed specification together with the agent, perhaps in document form, and then let the agent write the code while you supervise and control the top-level architectural decisions, with the agent handling the implementation details.
For instance, regarding tensor operations in neural networks, there are numerous small differences between PyTorch, NumPy, and Pandas (keepdims or keepdim, dim or axis, reshape or permute or transpose) that I can no longer remember, because I don’t need to. These details can be delegated to the “interns,” since they have excellent memory. However, you still need to understand the essence: that underneath there is a tensor and views onto it, that you can operate on different views of the same memory, or make copies with separate storage, though the latter is less efficient. You still need to grasp these concepts so you don’t perform inefficient operations like unnecessary memory copies.
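To make that split concrete, the snippet below shows the per-library naming trivia that can be delegated versus the view-versus-copy concept you still need to own; it assumes NumPy and PyTorch are installed.

```python
import numpy as np
import torch

a = np.arange(6.0).reshape(2, 3)
t = torch.arange(6.0).reshape(2, 3)

# Naming details differ per library (the part you can delegate to the "intern"):
a.sum(axis=0, keepdims=True)  # NumPy spells it axis / keepdims
t.sum(dim=0, keepdim=True)    # PyTorch spells it dim / keepdim

# The concept you still need to own: views share storage, copies do not.
v = t.view(6)                 # a view: same underlying memory, different shape
v[0] = 99.0
assert t[0, 0] == 99.0        # mutating the view mutated the original tensor

c = t.t().contiguous()        # transpose, then materialize: this allocates new memory
assert c.data_ptr() != t.data_ptr()
```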
So you are responsible for taste, engineering design, and architecture: ensuring the overall direction is correct and the requirements are accurate, like “we need to use a unique user ID to associate all data.” These design decisions are yours to make; the agents are responsible for filling in the gaps. That’s where we are right now.
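A minimal sketch of the “unique user ID” design decision mentioned above; the class and field names are illustrative, not MenuGen’s actual schema.

```python
import uuid
from dataclasses import dataclass

@dataclass
class User:
    user_id: str             # the one stable key everything else references
    google_email: str        # login identity; can be any address
    stripe_customer_id: str  # billing identity; its email may differ, and that's fine

def create_user(google_email: str, stripe_customer_id: str) -> User:
    # Mint an internal ID once at signup; never join across systems by email.
    return User(str(uuid.uuid4()), google_email, stripe_customer_id)

credits_by_user: dict[str, int] = {}  # keyed by user_id, not by email

def add_credits(user: User, amount: int) -> None:
    credits_by_user[user.user_id] = credits_by_user.get(user.user_id, 0) + amount
```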
The Future of Taste and Judgment
Host: Do you think this taste and judgment will become less important over time, or will its upper limit continue to rise?
Andrej Karpathy: I genuinely hope this area improves. Right now it isn’t improving, and I think that’s because it hasn’t been incorporated into reinforcement learning: perhaps there are no corresponding aesthetic rewards, or the existing rewards are insufficient.
To be honest, when I look at code, I sometimes feel a bit horrified—not every output is particularly good; often it’s bloated, with a lot of copy-pasting and weak abstractions. While it runs, it’s truly ugly.
A particularly illustrative example is the nanoGPT project, where I’ve been trying to simplify the LLM training code to the extreme. The models perform very poorly on this task. I keep prompting the large language model to simplify further, but it just doesn’t work. You can feel that you’re completely outside the reinforcement learning loop; it’s clearly an uphill push rather than that flowing state.
Thus, I believe humans are still the dominant force in this area, but fundamentally, there’s no principled barrier preventing this from changing; it’s just that labs haven’t achieved this yet.
Jagged Intelligence and Its Implications
Host: I want to return to the topic of “jagged intelligence.” You wrote an insightful article discussing the comparison between “animals and ghosts”: we are not building animals but summoning ghosts. These ghosts are jagged agents shaped by data and reward functions, rather than driven by intrinsic motivation, curiosity, or empowerment, which are products of evolution. Why is this framework important? How does it change the way we build, deploy, evaluate, and even trust these systems?
Andrej Karpathy: I wrote this article because I wanted to clarify what these entities really are. If you have an accurate cognitive model of them, you can use them better. I’m not sure how practical this framework is; it may have some philosophical implications, but I think its core lies in accepting the fact that these entities are not animal intelligence. If you shout at them, they won’t perform better or worse; it has no impact. It’s all just a statistical simulation loop, grounded in pre-training statistics, with reinforcement learning layered on top.
Perhaps it’s just a mindset—what mindset do I bring to face them, what might work, what might not, and how to adjust it. I can’t say I’ve summarized “here are five clear conclusions to make your system better”; it’s more about maintaining a cautious attitude towards it and gradually exploring over time.
The Future of Intelligent Agents
Host: That’s a starting point. Now, you are deeply involved with agents that are not just chatbots—they have real permissions, local context, and can take actions on your behalf. When we all start living in such a world, what will it look like?
Andrej Karpathy: I think many here are excited about agent-native environments. Everything must be rewritten: today, everything is fundamentally designed for humans and needs to be migrated. The various frameworks and libraries I use now are still fundamentally written for people. This is my biggest complaint: why are there still instructions telling me what to do? I don’t want to do it myself. What I want to know is: what should I copy and paste to my agent? Every time I see an instruction like “please visit this URL,” it feels very awkward.
I think everyone is pondering this question: how to break down the workflows that need to be completed into perceptions of the world and executions in the world? How to make everything agent-friendly? Essentially, it’s about describing it to the agent first and building a lot of automation around highly readable data structures for LLMs.
I hope to see a lot of agent-friendly infrastructure. For example, in MenuGen, a significant part of the trouble isn’t writing the code itself but deployment—I have to deal with various services, configure DNS, and jump around in various settings menus, which is very tedious. What I hope for is: I give an LLM a prompt, and it builds MenuGen and automatically deploys it without me touching anything; it just runs online. This might be a good test standard to judge whether our infrastructure is becoming increasingly agent-friendly.
Ultimately, I believe we are moving toward a world where every person and organization has their own intelligent agent. My agent and your agent communicate, handling meeting details and similar tasks. I think that’s roughly the direction we’re heading, and everyone here feels excited about it, which is great.
Conclusion
Host: I really like the metaphor of “perceivers and executors”; this line of thought is genuinely interesting. Finally, I want to end with the topic of education because you are arguably one of the best at clarifying complex technical concepts and have thoughtfully considered how to design education around these topics. When AI becomes cheap in the next era, what will still be worth learning deeply?
Andrej Karpathy: Recently, a tweet deeply resonated with me, and I think about it almost every day. The essence is: you can outsource your thinking, but you cannot outsource your understanding.
Host: That’s beautifully said.
Andrej Karpathy: Yes, because I am still part of this system, and information still needs to enter my brain. I increasingly feel like I’ve become the bottleneck—just “knowing” has become a bottleneck: why are we building this? What’s the value? How do I direct my agent?
So I still believe that ultimately, there must be some force to guide thinking and processing, and that force is fundamentally constrained by “understanding.” This is also why I’m excited about the LLM knowledge base—because it’s a way to help me digest information. Every time I see different perspectives and angles on the same information, I feel I gain insights. Essentially, this is a form of generating synthetic data based on fixed data. I truly enjoy this process: reading an article, it enters my wiki, and then I ask various questions, exploring different angles.
These tools, in a sense, are tools for enhancing understanding, and understanding remains a bottleneck—because without understanding, you cannot be a good “director.” Large language models themselves are certainly not good at understanding; that remains your unique core capability. Therefore, I believe tools that enhance understanding are extremely interesting and exciting directions.