What Jobs Are Made Of
Judgment, Agency, and the Limits of AI Benchmarks
Fifteen years ago, in the winter of 2010, I was in the final stretch of my PhD and starting to explore the world outside academia. I remember coming back from a job interview for an R&D position during a record-cold Paris winter. Snow everywhere, sitting in a cold regional train.
I felt disappointed and vaguely confused.
I knew most of the tools the company’s R&D team was using and was confident I could learn the remaining ones fairly easily. Still, it did not seem to be enough, and the interviewer kept telling me they were looking for someone “more experienced.”
At the time, I didn’t really grasp what that was supposed to mean. Valuing years of experience more than the concrete knowledge I could demonstrate felt deeply unfair to me. In my early 20s, “experience” mostly sounded like a fuzzy excuse to reject my application despite clear evidence of my capabilities and eagerness to learn.
That old feeling came back to haunt me recently.
Reading recent data about shrinking entry-level hiring, especially for software developers, I couldn’t help but put myself back in those shoes.
A Stanford analysis conducted in the summer of 2025 showed that workers aged 22–25 in the most AI-exposed occupations saw employment fall by roughly 6% between late 2022 and mid-2025. Over the same period, employment for older workers in those occupations increased by about 6–9%.1

The tipping point is hard to miss on this chart.
Correlation or causation2, fall 2022 marks the release of ChatGPT: the moment the public discovered what AI models could really do, and when the race for improved capabilities truly ignited, driven initially by OpenAI and Anthropic and soon joined at the frontier by Google and a growing number of companies such as xAI, Alibaba (Qwen), DeepSeek, and Mistral.
Over the past three years, progress on AI benchmarks has been mind-blowing. Models like Claude Opus 4.5 now solve ~75% of real-world coding tasks on SWE-bench3, while Gemini 3 and GPT-5 achieve gold-medal-level performance on science Olympiads4. Meanwhile, ChatGPT usage is approaching a billion weekly users5.
By many technical measures, both capabilities and adoption have grown at an exceptional pace, often suggesting parity with industry or human experts.

And yet, despite the medals and the drop in entry-level hiring, the macro picture looks far more muted.
At a global and industry level, the impact remains limited, with only small effects on GDP6. There have been recent claims that, beyond the announcements, many if not most generative AI pilots fail to produce sustained value in companies7. Moreover, on some real-world in-situ tests like the Remote Labor Index, which assesses AI agents on actual freelance projects and asks whether their output would be accepted as paid work, even the strongest current systems succeed only a small fraction of the time (around 2.5% for ManusAI, for instance).8
What models are able to demonstrate on benchmarks seems difficult to reconcile with what is happening inside organizations.
Several explanations are usually offered for this gap between theory and practice.
One is organizational inertia: large companies are slow, legacy systems are messy and deployment is hard9. Another possibility is that we simply haven’t crossed the right capability threshold yet. Perhaps scoring close to 60% on a recent attempt to define and quantify AGI in comparison to human intelligence is just not enough10.
All of these likely play a role. But they also tend to frame work primarily as a matter of task execution.
That framing feels incomplete to me. In practice, a job is rarely just a list of tasks to execute and a coworker is seldom reducible to a bundle of technical skills11.
As a startup founder, I’ve spent close to 50% of my time hiring people at various points of our journey, and this has probably been the part of my life with the deepest lessons. One of those lessons is that, across most applicants and roles, I tend to look for a combination of three qualities:
Execution or technical skills: the ability to do a task correctly and to master the relevant tools and methods.
Common sense, or judgment: understanding why tasks matter and how they fit into a broader goal, as well as the company’s values, culture, and direction.
Agency, or taste: anticipating what to do next, what to propose, what not to do, and when to change direction; sometimes recognizing that stopping entirely is the best decision.
Execution and technical knowledge are relatively easy to observe, test and measure on a benchmark. Once the task is given, it’s about solving it.
Judgment and agency are much harder to assess. They tend to become relevant outside of equilibrium and steady-state situations, when problems are less well defined, priorities shift, or the right move is to question the task itself. This is often where the best team members begin to shine, and it is also increasingly where companies find themselves operating.
And it was through this lens that I finally understood my 2010 interview.
My recruiters were not only evaluating whether I could use their tools and methods. They were also implicitly benchmarking how I would behave once the problem stopped being fully specified.
This definition of a worker sheds light on why entry-level jobs are affected first. Early-career roles are traditionally more execution-heavy. Over time, as people gain experience, their contribution tends to shift toward judgment and agency: defining problems, choosing what to work on, and navigating ambiguity.
AI systems are making faster progress on execution than on these other components. As a result, the execution layer becomes cheaper and thinner, disproportionately affecting entry-level hiring.
In the longer term, this is concerning. Judgment and agency are partly innate, but they are also often learned through experience with execution-heavy work. If the entry layer erodes too quickly, it will weaken the pipeline that produces future senior contributors.
This same framing helps make sense of the still-limited economic impact of AI and the challenges involved in automating broader, longer-horizon tasks.
The limiting factor to AI capabilities is often not the ability to generate text or code in isolation, but the difficulty of paying attention to the larger picture: adapting instructions to a company/team-wide context, interpreting fuzzy requirements, prioritizing, making common-sense trade-offs, and deciding what matters or even when to stop a task.
Execution clearly matters. It is just hardly ever the whole job or, as Cursor’s Ryo Lu wrote recently, execution isn’t the crucial part of the job we thought it was.
The challenge is that judgment and agency are far more difficult to measure. Often, they only make sense in a wider, non-static context, which explains why they have received less attention in benchmarks.12
Yet they are often central to how a worker actually creates value in an organization. If we want to really understand the economic potential of AI, we will eventually need evaluations that go beyond technical execution, reflect the cross-team and vertical nature of real work, and acknowledge that very few jobs consist of following a fixed set of predefined rules in a generally static environment.
And the AI era may end up placing even more weight on judgment, taste, and agency – the parts of work that are hardest to specify, hardest to benchmark, and hardest to replace.
In hindsight, the gap between AI benchmark performance and economic impact would have felt oddly familiar to my early-20s self.
Brynjolfsson, E., Chandar, B., & Chen, R. Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence. The paper reports a significant decline in employment for workers aged roughly 22–25 in highly AI-exposed occupations between late 2022 and mid-2025, while employment for older workers in the same occupations increased over the same period – https://digitaleconomy.stanford.edu/wp-content/uploads/2025/08/Canaries_BrynjolfssonChandarChen.pdf
Brynjolfsson et al. point quite convincingly to causation.
SWE-Bench and SWE-Bench Verified leaderboards show recent frontier and agentic systems solving a large fraction of real-world software engineering tasks drawn from actual repositories – about 75% at the end of 2025 for Claude Opus 4.5 – https://www.swebench.com
Competitive Programming with Large Reasoning Models – https://arxiv.org/abs/2502.06807v1
Gemini achieves gold-medal level at the International Collegiate Programming Contest World Finals – https://deepmind.google/blog/gemini-achieves-gold-medal-level-at-the-international-collegiate-programming-contest-world-finals/
See for instance https://openai.com/index/the-state-of-enterprise-ai-2025-report/
Artificial Intelligence and the Labor Market – https://www.nber.org/papers/w33509
Economic shifts in the age of AI – https://institute.bankofamerica.com/content/dam/economic-insights/ai-impact-on-economy.pdf
Miracle or myth: Assessing the macroeconomic productivity gains from artificial intelligence – https://cepr.org/voxeu/columns/miracle-or-myth-assessing-macroeconomic-productivity-gains-artificial-intelligence
MIT report: 95% of generative AI pilots at companies are failing – https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
Remote Labor Index: Measuring AI Automation of Remote Work – https://arxiv.org/abs/2510.26787
GenAI Divide – https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf
A Definition of AGI – https://arxiv.org/abs/2510.18212
Expertise – https://www.nber.org/papers/w33941
In both AI research and economics, common-sense judgment and agency evaluation are often overlooked, and I could barely find articles exploring these aspects of work in depth.