AI can do work. Can it do a job?
Why passing the bar doesn't make AI a lawyer
To explain her pivot into the beverage industry, journalist Arielle Pardes recently told The New Yorker, “Robots can’t taste wine.” In a similar vein, the popular economist Tyler Cowen, writing with Anthropic’s Avital Balwit, predicted AI would lead to “more coaches, mentors, and individuals who supply the services of inspiration.”
Many people’s mental model of AI-driven job loss goes something like this: Identify the qualities that seem ineluctably human, like “taste” or “inspiration”; figure out the jobs associated with those qualities; assume the AIs will do everything else.
This line of thinking allows for an upbeat twist on the job loss story: The rise of robots lets humans be more human. We spend time tasting wine and inspiring each other.
It’s a pleasant thought.
The trouble is that this common mental model skips ahead to the final, placid outcome of what is sure to be a long, roiling process — it says nothing about what happens between now and then. Who is at work in three years or seven? What are the warning signs to look for along the way?
An AI passing the bar hasn’t displaced lawyers, just as an AI reading a medical scan hasn’t displaced radiologists. Software engineers, though, do seem to be on a precipice, and they often acknowledge it.
“This won’t be the thing that replaces me as a programmer,” Paul Payne, a veteran programmer, wrote of his AI workflow. “But it convinces me that our timeline is months, not years.”
If you’re just following which tasks an AI can do, you see a wildly mixed picture across different industries.
To get a better picture of how the latest AI developments affect jobs, you need to think in terms of “messiness.” What can an AI do in a flexible, fluid environment — one where instructions are ambiguous and the steps are less important than the outcomes? An AI that leads to major job displacement doesn’t just act independently. It pursues goals.
Automation or augmentation?
One reason this all seems so murky is that an individual worker’s value can either increase or decrease as their tasks are overtaken by AI. You can already see this bifurcation in the economic data.
Recent studies showed a 16% decline in employment among 22- to 25-year-olds in the most “AI-exposed” occupations, like programming, relative to the least AI-exposed occupations. At the same time, other research showed that industries with high AI exposure have seen a one-percentage-point increase in the rate of job growth relative to industries with less exposure (across all age groups).
This is often referred to as the difference between automation and “augmentation.” If AI makes your work faster to do, an employer might choose to sell more of what you make rather than employ fewer people to make it.
When ATMs were invented, it became cheaper to offer the service, so banks increased the number of branches along with the overall number of bank tellers. But a second thing happened, too. Bank tellers started doing different work: less bill counting, more relationship building, more selling clients on new bank services.
This is augmentation in action: A productivity-enhancing machine expanded the business and upgraded the value of the frontline role. This doesn’t happen in every case, but it makes the simple story of automation a little more complex.
Another reason that looking narrowly at AI’s performance on certain tasks is a faulty heuristic: Deploying AI in actual business contexts is a slow, messy process.
This is in part because productivity gains create new bottlenecks elsewhere: If I were to write three times as many articles per week, our editors would also have to move three times as fast.
Rules and regulations also add stubborn frictions: Replacing a human scheduler in a doctor’s office with an AI won’t necessarily make it any faster to get the necessary HIPAA permissions to handle individual patient profiles; the barrier there isn’t technological.
A lot depends upon the pace at which institutions change how people relate to people, not just to technology.
The fact that software engineers are among the first to feel the effects of AI is perhaps evidence that organizational behavior is a big influence. When AI touches an industry that’s comfortable with deploying new systems (like tech itself), the technology moves faster than in more friction-heavy institutions (like health care).
This is why many people tend to predict job loss too quickly. They assume a simple process of elimination, in which, as soon as AIs surpass some complex human ability, we scratch off the skillset and hand it over to the robots.
No one is going to make perfect sequential predictions of who loses their job in what order, but there must be a better mental model than just assuming we should all be looking for jobs with that ineffable human touch.
Look for AIs that pursue goals, flexibly
Instead, I think it’s important to remember that, right now, AIs still complete tasks while humans pursue goals. When AIs can pursue goals in a particular field, that is a warning light for the people working in it.
Consider a lawyer ready to settle a trust case between siblings. Suddenly, the client decides they want to go to court instead of settling.
If that happens, the lawyer strategizes flexibly about the circumstance while keeping competing goals in mind. They are simultaneously doing customer service, legal strategy, and maximizing billable hours (subtly, of course). They are focused on the outcome. They know exactly what it means to deliver, not only as the steps change, but even as the whole direction does.
Compare this to Anthropic’s experiment in which it let Claude run its office vending machine as a small business. The system was allowed to set prices, answer customer inquiries (from Anthropic employees), search the internet, and order inventory.
This is remarkably complex, autonomous behavior — and at various stages, the system made egregious, often hilarious mistakes. At one point, employees tricked it into stocking tungsten cubes, which became a fixture on desks throughout Anthropic HQ.
It could also be cajoled into giving away freebies just because its customers claimed they had a discount code. And once that issue was solved, it passed out excessive refunds and store credits.
It was autonomous, but it didn’t understand the central goal: profit.
The vending business was in the red for months until Anthropic created an AI CEO character to oversee the AI shopkeeper, which eventually helped it start turning a profit. In essence, to get the system to make financial progress, Anthropic had to split the business into one AI instance responsible for the goals and another that did the tasks.
Even the CEO character — “Seymour Cash,” the dedicated goal tracker intended to, of course, “see more cash” — still let excessive refunds slip by. (It had one job.)
This distinction, between tasks and goals, should help us put new AI developments in context, and it should also help us sort out what to track along the way.
In the tech industry, boosters called 2025 the “year of the agent,” the term for systems like the AI shopkeeper. An agent is an AI that can use various tools on your computer to advance a project in the background without step-by-step supervision. A quick look at Google Trends shows that interest is still growing and, among the broader public, really just beginning.
What these systems do is genuinely incredible, and it justifies much of the hype from early adopters. You can read myriad articles on people building custom productivity apps for themselves or inventing video games to teach third graders how to read, all while typing in plain English and letting the computer do the rest.
However, it should also be clear from the vending machine example that autonomy is not quite the same thing as goal-pursuit.
In one recent measure, called APEX, the latest models completed most real-world tasks for law (78%), medicine (66%), consulting (64%), and finance (64%) when those tasks were static — sequential lists of steps that someone would take to solve a given problem. When the measure was recreated to test agents autonomously switching between multiple computer applications and complex files, the best models tended to succeed around 25% of the time. After multiple tries, they improved to 40%.
The most common failure? The models would lose the thread of the task as they changed their own environment. They could work autonomously, but they often forgot the goal.
An AI that succeeds only 25% of the time will need a human to double-check its work most of the time. It seems clear to me that this is more likely to push the human up the value chain than to displace them.
And this sort of autonomous behavior that nonetheless lacks focused, trustworthy goal-direction is going to limit C-suite willingness to push for deployment. An individual using AI to speed up their own work is one thing. An entire company turning over complex, collaborative business processes to AI is more costly — detailing those processes for the system requires serious upfront investment and, often, major organizational change. Companies are unlikely to move forward on this until AIs are reliably — but flexibly — focused on goals.
This doesn’t mean zero job loss will happen with current agents, just that there is a natural stopping mechanism, or at least a slowing mechanism, to how far it can go.
The mental model is “messiness”
The truth is that the AI research community is still figuring out how to capture the most relevant metrics as the technology progresses. The metrics that cause buzz in Silicon Valley are the ones that sit upstream of the news stories we all read.
If you ask anyone in Silicon Valley about job loss, they’re likely to cite a stat referred to as task duration. Essentially, researchers take a software task that an AI can do and measure how long it would take a human to complete that same task. The thinking is that if AIs can quickly accomplish work that takes humans several hours, their chances of capturing more complex work are better.
For six years now, that number has doubled every seven months. The latest model from Anthropic can complete a software task that would take a typical human about 12 hours, or a day and a half of standard working time. This can make progress toward mass job loss seem unstoppable, but what’s actually most relevant about this popular metric is buried below the headline finding.
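A quick back-of-the-envelope check (my arithmetic, not a figure from the researchers) shows how fast that compounds: six years is roughly ten seven-month doubling periods, and 2^10 is about 1,000, so today’s 12-hour task horizon implies a horizon of well under a minute six years ago. Run the trend forward instead, and fewer than two more doublings take it past a full 40-hour workweek, about a year from now.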
The researchers give each task a “messiness” score based on how ambiguous the instructions are and how unclear the desired output is. The messier the task, the more it resembles the real world.
When you separate the “messiest” tasks from the least messy ones, you see a persistent gap — no model exceeds 30% success on the messiest tasks. Even more telling, when human judges are brought in to assess the outputs holistically, they score the AI’s work lower than the simpler algorithmic assessments do.
After the “year of the agent,” 2025, the buzziest metrics are evolving toward something “messier” like this.1 These measures ask what the models can produce. They look at how much humans have to intervene. They track how models themselves recover from mistakes — if an AI can stick to an outcome even if it gets thrown off course, then we can infer it is somehow tracking the goal.2
If you’re trying to update your own mental model for how AI job loss is likely to play out, you don’t need to be watching all of this insider baseball play out. The important thing to recognize is that earlier metrics, which informed earlier news stories, were built upon AIs doing very impressive but self-contained things, like passing the bar or solving a math problem. Going forward, the important question is whether AIs can handle messiness.
Following AI news can be equal parts comforting and alarming. On one hand, an AI shopkeeper giving away its merchandise is funny and somewhat comforting. On the other hand, turning a profit after eight months is actually quite impressive. Looking a little closer makes job loss seem less imminent — looking a lot closer can make catastrophe seem almost immediate.
Everyone is working off of a rough guess, and even the experts have pretty wide error bars. What we do know is what to look for: not just AIs acing tests, but AIs that hold onto what they’re trying to do while the world shifts around them.
After all, that’s what humans do best.
More on AI-induced job displacement:
The Tinder-ization of the job market
Is AI going to take everyone’s jobs? A recent report went viral after claiming there were 55,000 AI-related layoffs in 2025.
We may miss the sweatshops
If automation genuinely devalues human labor, then the main ladder by which poor countries become richer would collapse at the exact same time that the main national interest rationale for immigration evaporates.
This evolution is clear in the latest publications from the top labs. Last year, a panic-inducing news story might cite Anthropic’s paper saying that “usage of AI extends more broadly across the economy, with ~36% of occupations using AI for at least a quarter of their associated tasks.” But to derive that figure, researchers anonymized user queries and just matched them to tasks associated with various jobs. This doesn’t tell us what users did offline after querying about some job task, only that they asked.
By contrast, Anthropic’s latest work on “Economic Primitives,” from January, focuses more on whether tasks actually succeed and shows a better grasp of when humans and AI are collaborating. The same evolution has happened in OpenAI’s research: Earlier work set the standard for how to consider a job “exposed” to AI by tracking which of its tasks AI sped up, while the company’s new GDPVal — published in September — has AI create real work outputs and asks a blind panel to compare them to human work products.
Five days ago, OpenAI released GPT-5.4, which the company pitched as the model that can do white-collar work. One reason is that it includes “native computer-use capabilities,” meaning it can click through computer screens the way a human does.
At the same time, one reviewer prompted it with, “I need to wash my car. The carwash is 100 meters away. Should I walk or drive?” and the model said to walk. Claude responded to the same question by pointing out that you need your car at the carwash.
So GPT-5.4 can use a computer uniquely well but fails an elementary question about how the world works. This should temper some of the hype. It performs better on many benchmarks, but it’s not a major leap on certain critical ones. If anything, the job loss hype around 5.4 is a proof point for my case: We need a more balanced heuristic to separate what is genuinely threatening to jobs from what is technologically novel but not a fundamental break.