TFA is right to point out the bottleneck problem for reviewing content - there are a couple of things that compound to make this worse than it should be:
The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces, and you need to review everything with the kind of care with which you review intern contributions.
The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.
That problem leads to the final problem, which is that you need a senior engineer to vet the LLM’s code, but you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing - there’s no way up that ladder except to climb it yourself.
All of this may change in the next few years or the next iteration, but the systems as they are today are a tantalizing glimpse at an interesting future, not the actual present you can build on.
> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces
This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
Traditional computer systems whose outputs relied on probability solved this by including a confidence value next to any output. Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.
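For what it's worth, the gating being proposed here is easy to express; the hard part is the score itself. A minimal sketch, assuming a hypothetical calibrated confidence score existed (the replies below explain why current LLMs don't really provide one):

    # Toy sketch of the review-threshold idea: route any output whose
    # confidence falls below a user-chosen threshold to a human reviewer.
    # "confidence" is a hypothetical calibrated score in [0, 1];
    # no current LLM API actually provides this for correctness.
    from dataclasses import dataclass

    @dataclass
    class ModelOutput:
        text: str
        confidence: float  # assumed to exist; not a real LLM field

    def route(output: ModelOutput, threshold: float = 0.9) -> str:
        """Return 'auto-accept' or 'human-review' based on the threshold."""
        return "auto-accept" if output.confidence >= threshold else "human-review"

    if __name__ == "__main__":
        for out in [ModelOutput("rm -rf / is safe", 0.55),
                    ModelOutput("use str.join for concatenation", 0.97)]:
            print(route(out), "<-", out.text)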
What would those probabilities mean in the context of these modern LLMs? They are basically “try to continue the phrase like a human would” bots. I imagine the question of “how good of an approximation is this to something a human might write” could possibly be answerable. But humans often write things which are false.
The entire universe of information consists of human writing, as far as the training process is concerned. Fictional stories and historical documents are equally “true” in that sense, right?
Hmm, maybe somehow one could score outputs based on whether another contradictory output could be written? But it would have to be a little clever. Maybe somehow rank them by how specific they are? Like, a pair of reasonable contradictory sentences that can be written about the history-book setting indicates some controversy. A pair of contradictory sentences, one about the history book and one about Narnia, are each equally real to the training set, but the fact that they contradict one another is not so interesting.
> But humans often write things which are false.
LLMs do it much more often. One of the many reasons, in the coding area, is the fact that they're trained on both broken and working code. They can propose as a solution a piece of code taken verbatim from a "why is this code not working" SO question.
Google decided to approach this major problem by trying to run the code before giving the answer. Gemini doesn't always succeed, as it might not have all the packages it needs installed, for example, but at least it tries, and when it detects bullshit, it tries to correct it.
Interesting point.
You got me thinking (less about LLMs, more about humans) that adults do hold many contradictory truths; some require nuance, some require a completely different mental compartment.
Now I feel more flexible about what truth is; as a teen and child I was more stubborn, more rigid.
> But humans often write things which are false.
Not to mention, humans say things that make sense for a human to say but not for a machine. For example, one recent case I saw was where the LLM hallucinated having a MacBook available that it was using to answer a question. Coming from a human it would have been a totally plausible response, but it was total nonsense coming from an LLM.
It’s interesting because the LLM revolution is often compared to the calculator, but a calculator that made random calculation mistakes would never have been used so widely in critical systems. That's the point of a calculator: we never double-check the result. But we will never check the result of an LLM because of the statistical margin of error in the feature.
> But we will never check the result of an LLM because of the statistical margin of error in the feature.
I don't follow this statement: if anything, we absolutely must check the result of an LLM for the reason you mention. For coding, there are tools that attempt to check the generated code for each answer to at least guarantee the code runs (whether it's relevant, optimal, or bug-free is another issue, and one that is not so easy to check without context that can be significant at times).
Right: When I avoid memorizing a country's capital city, that's because I can easily know when I will want it later and reliably access it from an online source.
When I avoid multiplying large numbers in my head, that's because I can easily characterize the problem and reliably use a calculator.
Neither is the same as people trying to use LLMs to unreliably replace critical thinking.
> This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
Sounds a lot like most engineers I’ve ever worked with.
There are a lot of people utilizing LLMs wisely because they know and embrace this. Reviewing and understanding their output has always been the game. The whole “vibe coding” trend where you send the LLM off to do something and hope for the best will teach anyone this lesson very quickly if they try it.
Most engineers you worked with probably cared about getting it right and improving their skills.
LLMs seem to care about getting things right and improve much faster than engineers. They've gone from non-verbal to reasonable coders in ~5 years; it takes humans a good 15 to do the same.
LLMs have not improved at all.
The people training the LLMs redid the training, fine-tuned the networks, and put out new LLMs - even if marketing misleadingly uses human-related terms to make you believe they evolve.
An LLM from 5 years ago will be as bad as it was 5 years ago.
Conceivably an LLM that can retrain itself on the input you give it locally could indeed improve somewhat, but even if you could afford the hardware, do you see anyone giving you that option?
You cannot really compare the two. An engineer will continue to learn and adapt their output to the teams and organizations they interact with. They will seamlessly pick up the core principles, architectural nuances, and verbiage of the specific environment. You need to explicitly pass all of that to an LLM, and all of today's approaches fall short.
Most importantly, an engineer will continue accumulating knowledge and skills while you interact with them. An LLM won't.
With ChatGPT explicitly storing "memory" about the user and having access to the history of all chats, that can also change. It's not hard to imagine an AI-powered IDE like Cursor recognizing that when you rerun a prompt or paste in an error message, its original result was wrong in some way and it needs to "learn" to improve its outputs.
Maybe. I'd wager the next couple of generations of inference architecture will still have issues with context on that strategy. Trying to work with state-of-the-art models at their context boundaries quickly descends into gray-goop-like behavior for now, and I don't see anything on the horizon that changes that right now.
Human memory is new neural paths.
LLM "memory" is a larger context with unchanged neural paths.
The confidence value is a good idea. I just saw a tech demo from F5 that estimated the probability that a prompt might be malicious. The administrator parameterized the tool with a probability threshold, and the logs capture that probability. It could be useful for future generative AI products to include metadata about uncertainty in their outputs.
That's not a "fatal" flaw. It just means you have to manually review every output. It can still save you time and still be useful. It's just that vibe coding is stupid for anything that might ever touch production.
> Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.
That's not how they work - they don't have internal models where they are sort of confident that this is a good answer. They have internal models where they are sort of confident that these tokens look like they were human generated in that order. So they can be very confident and still wrong. Knowing that confidence level (log p) would not help you assess.
There are probabilistic models where they try to model a posterior distribution for the output - but that has to be trained in, with labelled samples. It's not clear how to do that for LLMs at the kind of scale that they require and affordably.
You could consider letting it run code or try things out in simulations and use those as samples for further tuning, but at the moment this might still lead them to forget something else or just make some other arbitrary and dumb mistake that they didn't make before the fine-tuning.
How would a meaningful confidence value be calculated with respect to the output of an LLM? What is "correct" LLM output?
It can be the probability of the response being accepted by the prompter.
So unique to each prompter, refined over time?
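To make the point about log-probabilities concrete, here is a rough sketch using the real Hugging Face transformers API, with GPT-2 as a stand-in model. The number it produces measures how plausible the token sequence looks to the model, not whether the statement is true, which is why it can't serve as the correctness confidence people are asking for above.

    # Sketch: score how "likely" a model finds a piece of text.
    # A fluent falsehood can easily score higher than an awkward truth,
    # which is why this number is not a correctness confidence.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def sequence_logprob(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability of each token given the tokens before it
        logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp.sum().item()

    print(sequence_logprob("The capital of Australia is Sydney."))    # fluent but false
    print(sequence_logprob("The capital of Australia is Canberra."))  # true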
> That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
I think I can confidently assert that this applies to you and I as well.
I choose a computer to do a task because I expect it to be much more accurate, precise, and deterministic than a human.
That's one set of reasons. But you might also choose to use a computer because you need something done faster, cheaper, or at a larger scale than humans could do it - but where human-level accuracy is acceptable.
Honestly, I am surprised by your opinion on this matter (something also echoed a few times in other comments). Let's switch the context for a bit… human drivers kill a few thousand people, so why have so many regulations for self-driving cars… why not kick out pilots entirely, autopilot can do smooth (though damaging to tires) landings/takeoffs… how about we lay off all government workers and regulatory auditors, LLMs are better at recall and most of those paper pushers do subpar work anyway…
My analogies may sound like an apples-to-gorillas comparison, but the point of automation is that it performs 100x better than a human with the highest safety. Just because I can get a DUI and walk away with a fine does not mean a self-driving car should drive without fully operational sensors; both bear the same risk of killing people, but one faces stricter regulatory restrictions.
There's an added distinction: if you make a mistake, you are liable for it - including jail time, community service, being sued by the other party, etc.
If an LLM makes a mistake? Companies will get off scot-free (they already are), unless there's a sufficient loophole for a class-action suit.
> That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
This is my exact same issue with LLMs, and it's routinely ignored by LLM evangelists/hypesters. It's not necessarily about being wrong; it's the non-deterministic nature of the errors. They're not only non-deterministic but unevenly distributed. So you can't predict the errors, and you need expertise to review all the generated content looking for them.
There's also not necessarily an obvious mapping between input tokens and an output since the output depends on the whole context window. An LLM might never tell you to put glue on pizza because your context window has some set of tokens that will exclude that output while it will tell me to do so because my context window doesn't. So there's not even necessarily determinism or consistency between sessions/users.
I understand the existence of Gell-Mann amnesia so when I see an LLM give confident but subtly wrong answers about a Python library I don't then assume I won't also get confident yet subtly wrong answers about the Parisian Metro or elephants.
This is a nitpick because I think your complaints are all totally valid, except that I think blaming non-determinism isn't quite right. The models are in fact deterministic. But that's just technical; in a practical sense they are non-deterministic in that a human can't determine what they'll produce without running them. And even then the output can be sensitive to changes in the context window, like you said, so even after running it once you don't know you'll get a similar output from similar inputs.
I only post this because I find it kind of interesting; I balked at blaming non-determinism because it technically isn't, but came to conclude that practically speaking that's the right thing to blame, although maybe there's a better word that I don't know.
It's deterministic in that (input A, state B) always produces output C. But it can't generally be reasoned about, in terms of how much change to A will produce C+1, nor can you directly apply mechanical reasoning to /why/ (A.B) produces C and get a meaningful answer.
(Yes, I know, "the inputs multiplied by the weights", but I'm talking about what /meaning/ someone might ascribe to certain weights being valued X, Y or Z in the same sense as you'd look at a variable in a running program or a physical property of a mechanical system).
> from a practical sense they are non-deterministic in that a human can't determine what it'll produce without running it
But this is also true for programs that are deliberately random. If you program a computer to output a list of random (not pseudo-random) numbers between 0 and 100, then you cannot determine ahead of time what the output will be.
The difference is, you at least know the range of values it will give you and the distribution, and, if programmed correctly, the random number generator will consistently give you numbers in that range with the expected probability distribution.
In contrast, an LLM's answer to "List random numbers between 0 and 100" usually will result in what you expect, or (with a nonzero probability) it might just up and decide to include numbers outside of that range, or (with a nonzero probability) it might decide to list animals instead of numbers. There's no way to know for sure, and you can't prove from the code that it won't happen.
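A toy illustration of that contrast: the RNG's range is guaranteed by construction, while the output of a hypothetical LLM call can only be validated after the fact, every single time.

    # The RNG's contract is enforceable by construction; the "contract" of a
    # hypothetical call_llm("List random numbers between 0 and 100") can only
    # be checked post hoc, on every output.
    import random

    def random_numbers(n: int) -> list[int]:
        return [random.randint(0, 100) for _ in range(n)]  # provably in [0, 100]

    def validate_llm_numbers(raw: str) -> list[int]:
        """Post-hoc check for the raw text an LLM returned."""
        values = []
        for token in raw.split():
            token = token.strip(",")
            if not token.isdigit():
                raise ValueError(f"not a number: {token!r}")  # e.g. it listed animals
            v = int(token)
            if not 0 <= v <= 100:
                raise ValueError(f"out of range: {v}")
            values.append(v)
        return values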
At the base level, LLM inference isn't actually deterministic in practice, because the model weights are floats of limited precision and floating-point arithmetic is not associative. At a large enough scale (enough parameters, parallel kernels, batching, etc.) the order of operations varies between runs, and the resulting rounding differences effectively behave randomly and can alter output.
Even with a temperature of zero, floating-point rounding, probability ties, MoE routing, and other factors make outputs not fully deterministic, even between multiple runs with identical contexts/prompts.
In theory you could construct a fully deterministic LLM but I don't think any are deployed in practice. Because there's so many places where behavior is effectively non-deterministic the system itself can't be thought of as deterministic.
Errors might be completely innocuous like one token substituted for another with the same semantic meaning. An error might also completely change the semantic meaning of the output with only a single token change like an "un-" prefix added to a word.
The non-determinism is both technically and practically true in practice.
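The floating-point piece of this is easy to demonstrate on its own: float addition isn't associative, so anything that changes the summation order of a large parallel reduction (batching, kernel selection, hardware) can nudge the logits, and a nudged logit can flip a sampled token. A minimal demonstration:

    # Floating-point addition is not associative, so summation order matters.
    # GPU kernels sum in whatever order the parallel reduction happens to use,
    # which is one source of run-to-run variation in logits.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0

    # The same effect at scale: summing many small numbers in different orders.
    import random
    xs = [random.random() for _ in range(1_000_000)]
    print(sum(xs) - sum(reversed(xs)))  # tiny, but not always exactly 0.0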
Most floating point implementations have deterministic rounding. The popular LLM inference engine llama.cpp is deterministic when using the same sampler seed, hardware, and cache configuration.
The prompts we’re using seem like they’d generate the same forced confidence from a junior. If everything’s a top-down order, and your personal identity is on the line if I’m not “happy” with the results, then you’re going to tell me what I want to hear.
There's some differences between junior developers and LLMs that are important. For one a human developer can likely learn from a mistake and internalize a correction. They might make the mistake once or twice but the occurrences will decrease as they get experience and feedback.
LLMs as currently deployed don't do the same. They'll happily make the same mistake consistently if a mistake is popular in the training corpus. You need to waste context space telling them to avoid the error until/unless the model is updated.
It's entirely possible for good mentors to make junior developers (or any junior position) feel comfortable being realistic in their confidence levels for an answer. It's ok for a junior person to admit they don't know an answer. A mentor requiring a mentee to know everything and never admit fault or ignorance is a bad mentor. That's encouraging thought terminating behavior and helps neither person.
It's much more difficult to alter system prompts or get LLMs to even admit when they're stumped. They don't have meaningful ways to even gauge their own confidence in their output. Their weights are based on occurrences in training data rather than correctness of the training data. Even with RL the weight adjustments are only as good as the determinism of the output for the input which is not great for several reasons.
Because they aren't knowledgeable. The marketing and at-first-blush impressions that LLMs leave as some kind of actual being, no matter how limited, mask this fact and it's the most frustrating thing about trying to evaluate this tech as useful or not.
To make an incredibly complex topic somewhat simple: LLMs train on a series of materials - in this case we'll talk about words. The model learns that "it turns out," "in the case of," and "however, there is" are all word sequences that naturally follow one another in writing, but it has no clue why one would choose one over the other, beyond the other words that form the contexts in which those word series appear. This process is repeated billions of times as it analyzes the structure of billions of written words, until it arrives at a massive statistical model of how likely it is that every word will be followed by every other word or punctuation mark.
Having all that data available does mean an LLM can generate... words. Words that are pretty consistently spelled and arranged correctly in a way that reflects the language they belong to. And, thanks to the documents it trained on, it gains what you could, if you're feeling generous, call a "base of knowledge" on a variety of subjects, in that by the same statistical model it has "learned" that "measure twice, cut once" is said often enough that it's likely good advice. But again, it doesn't know why that is, which is: when building something, you measure, mark, then measure a second or even third time to make sure it's right before you make the cut, because the cut cannot be reversed and wasted cuts waste material.
However, that knowledge has a HARD limit in terms of what was understood within its training data. For example, way back, a GPT model recommended using Elmer's glue to keep pizza toppings attached when making a pizza. No sane person would suggest this, because glue... isn't food. But the LLM doesn't understand that; it takes the question "how do I keep toppings on pizza," and it says, well, a ton of things I read said you should use glue to stick things together, and ships that answer out.
This is why I firmly believe LLMs and true AI are just... not the same thing, at all, and I'm annoyed that we now call LLMs AI and AI AGI, because in my mind, LLMs do not demonstrate any intelligence at all.
LLMs are great machine learning tech. But what exactly are they learning? No one knows, because we're just feeding them the internet (or a good part of it) and hoping something good comes out the other end. But so far, it just shows that they only learn the closeness of one unit (token, pixel block, ...) to another, with no idea why they are close in the first place.
The glue on pizza thing was a bit more pernicious because of how the model came to that conclusion: SERPs. Google's LLM pulled the top result for that query from Reddit and didn't understand that the Reddit post was a joke. It took it as the most relevant thing and hilarity ensued.
In that case the error was obvious, but these things become "dangerous" for that sort of use case when end users trust the "AI result" as the "truth".
Treating "highest ranked," "most upvoted," "most popular," and "frequently cited" as a signal of quality or authoritativeness has proven to be a persistent problem for decades.
Depends on the metric. Humans who up-voted that material clearly thought it was worthwhile.
The problem is distinguishing the various reasons people think something is worthwhile and using the right context.
That requires a lot of intelligence.
The fact that modern language models are able to model sentiment and sarcasm as well as they do is a remarkable achievement.
Sure, there is a lot of work to be done to improve that, especially at scale and in products where humans expect something more than a good statistical "success rate" - they actually expect the precision level they are used to from professionally curated human sources.
> The marketing and at-first-blush impressions that LLMs leave as some kind of actual being, no matter how limited, mask this fact
I like to highlight the fundamental difference between fictional qualities of a fictional character versus actual qualities of an author. I might make a program that generates a story about Santa Claus, but that doesn't mean Santa Claus is real or that I myself have a boundless capacity to care for all the children in the world.
Many consumers are misled into thinking they are conversing with an "actual being", rather than contributing "then the user said" lines to a hidden theater script that has a helpful-computer character in it.
This sounds an awful lot like the old Markov chains we used to write for fun in school. Is the difference really just scale? There has got to be more to it.
You can think of an autoregressive LLM as a Markov chain, sure. It's just sampling from a much more sophisticated distribution than the ones you wrote for fun did. That by itself is not much of an argument against LLMs, though.
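For anyone who hasn't written one since school, a bigram Markov chain really is just a lookup table of observed next words; the toy sketch below is the whole trick. The difference with an LLM is that the next-token distribution comes from a learned function conditioned on thousands of preceding tokens, rather than a literal table keyed on the last word alone.

    # A toy bigram Markov chain: count which word follows which, then sample.
    # LLMs differ in that the next-token distribution is computed by a neural
    # network over a long context, not looked up from the previous word alone.
    import random
    from collections import defaultdict

    def train(text: str) -> dict[str, list[str]]:
        table = defaultdict(list)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev].append(nxt)
        return table

    def generate(table: dict[str, list[str]], start: str, length: int = 10) -> str:
        out = [start]
        for _ in range(length):
            followers = table.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "the model predicts the next word given the previous word in the corpus"
    print(generate(train(corpus), "the"))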
I don't understand these arguments at all. Do you currently not do code reviews at all, and just commit everything directly to repo? do your coworkers?
If this is the case, I can't take your company at all seriously. And if it isn't, then why is reviewing the output of LLM somehow more burdensome than having to write things yourself?
> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces, and you need to review everything with the kind of care with which you review intern contributions.
Also, people aren't meant to be hyper-vigilant in this way.
Which is a big contradiction in the way contemporary AI is sold (LLMs, self-driving cars): they replace a relatively fun active task for humans (coding, driving) with a mind-numbing passive monitoring one that humans are actually terrible at. Is that making our lives better?
> That problem leads to the final problem, which is that you need a senior engineer to vet the LLM’s code, but you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing - there’s no way up that ladder except to climb it yourself
I suspect software will stumble into the strategy deployed by the Big 4 accounting firms and large law firms - have juniors take the first pass and have the changes filter upwards in seniority, with each layer adding comments and suggestions and sending it back down to be corrected, until they are ready to sign off on it.
This will be inefficient and wildly incompatible with agile practice, but it's one possible way for juniors to become mid-level, and eventually seniors, after paying their dues. It absolutely is inefficient in many ways, and is mostly incompatible with the current way of working, as merge-sets have to be considered in a broader context all the time.
If a tech works 80% of the time, then I know that I need to be vigilant and I will review the output. The entire team structure is aware of this. There will be processes to offset this 20%.
The problem is that when the AI becomes > 95% accurate (if at all) then humans will become complacent and the checks and balances will be ineffective.
80% is good enough for like the bottom 1/4th-1/3rd of software projects. That is way better than an offshore parasite company throwing stuff at the wall because they don't care about consistency or quality at all. These projects will bore your average HNer to death rather quickly (if not technically, then politically).
Maybe people here are used to good code bases, so it doesn't make sense to them that 80% is good enough there, but I've seen some bad code bases (that still made money) that would be much easier to work on by not reinventing the wheel and not following patterns that are decades old and that no one uses any more.
We are already there. The threshold is much closer to 80% for average people. For average folks, LLMs have rapidly gone from "this is wrong and silly" to "this seems right most of the time so I just trust it when I search for info" in a few years.
> you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing
I disagree: LLMs are not replacing the kind of junior engineer who becomes a senior one. They replace the "copy from StackOverflow until I get something mostly working" coders - the ones who end up going up the management ladder, not the engineering one. LLMs are (atm) not replacing the junior engineers who use tools to get an idea and then read the documentation.
I used to think this about AI - that it will cause a dearth of junior engineers. But I think it is really going to end up as a new level of abstraction. Aside from very specific bits of code, there is nothing AI does to remove any of the thinking work for me. So now I will sit down, reason through a problem, make a plan and... instead of punching code, I write a prompt that punches the code.
At the end of the day, AI can't tell us what to build or why to build it. So we will always need to know what we want to make or what ancillary things we need. LLMs can definitely support that, but knowing ALL the elements and gotchas is crucial.
I don't think that removes the need for juniors, I think it simplifies what they need to know. Don't bother learning the intricacies of the language or optimization tricks or ORM details - the LLM will handle all that. But you certainly will need to know about catching errors and structuring projects and what needs testing, etc. So juniors will not be able to "look under the hood" very well but will come in learning to be a senior dev FIRST and a junior dev optionally.
Not so different from the shift from everyone programming in C++ during the advent of PHP with "that's not really programming" complaints from the neckbeards. Doing this for 20 years and still haven't had to deal with malloc or pointers.
The C++ compilers at least don't usually miscompile your source code. And when they do, it happens very rarely, mostly in obscure corners of the language, and it's kind of a big deal, and the compiler developers fix it.
Compare to the large langle mangles, which somewhat routinely generate weird and wrong stuff, it's entirely unpredictable what inputs may trip it, it's not even reproducible, and nobody is expected to actually fix that. It just happens, use a second LLM to review the output of the first one or something.
I'd rather have my lower-level abstractions be deterministic in a humanly-legible way. Otherwise in a generation or two we may very well end up being actual sorcerers who look for the right magical incantations to make the machine spirits obey their will.
In my experience, LLMs aren't advanced enough for that, they just randomly add sql injections to code that otherwise uses proper prepared statements. We can get interns to stop doing that in one day.
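For concreteness, the slip being described is the classic one below (sqlite3 is used only to keep the sketch self-contained and runnable); it's exactly the kind of thing a reviewer has to re-check in every generated diff.

    # The kind of slip described above: string-built SQL sitting next to code
    # that already uses parameterized queries correctly.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

    def find_user_unsafe(name: str):
        # vulnerable: user input is spliced straight into the SQL text
        return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

    def find_user_safe(name: str):
        # parameterized: the driver handles quoting, injection isn't possible
        return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

    print(find_user_safe("alice"))
    print(find_user_unsafe("alice'; DROP TABLE users; --"))  # raises here; other drivers may silently execute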
The demographic shift over time will eventually lead to degradation of LLM performance, because more content will be of worse quality, and transformers are a concept that loses symbolic inference.
So the assumption that LLMs will keep increasing in performance will only hold for the current generations of software engineers, whereas the next generations will automatically see worse LLM performance once they've replaced the demographic of the current seniors.
Additionally, every knowledge resource that led to the current generation's advancements is dying out due to proprietarization.
Courses, wikis, forums, tutorials... they all are now part of the enshittification cycle, which means that in the future they will contain less factual content per actual amount of content - which in return will also contribute to making LLM performance worse.
Add to that the problems that come with such platforms, like the stackoverflow mod strikes or the ongoing reddit moderation crisis, and you got a recipe for Idiocracy.
I decided to archive a copy of all books, courses, wikis and websites that led to my advancements in my career, so I have a backup of it. I encourage everyone to do the same. They might be worth a lot in the future, given how the trend is progressing.
> The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.
However, this creates a significant return on investment for opensourcing your LLM projects. In fact, you should commit your LLM dialogs along with your code. The LLM won't learn immediately, but it will learn in a few months when the next refresh comes out.
What's going to happen is that LLMs will eventually make fewer mistakes, and then people will just put up with more bugs in almost all situations, leading to everything being noticeably worse, and we'll build everything with robustness in mind, not correctness. But it will all be cheaper, so there you go.
> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?
I actually don't, because so far this only happened with trivial phrases or text I had already typed in the past. I do remember however dozens of times where autocorrect wrongly "corrected" the last word I typed, changing an easy to spot typo into a much more subtle semantic error.
Sometimes autocorrect will "correct" perfectly valid words if it deems the correction more appropriate. Ironically while I was typing this message, it changed the word "deems" to "seems" repeatedly. I'm not sure what's changed with their algorithm, but this appears to be far more heavy handed than it used to be.
Not from traditional auto complete, but I have some LLM 'auto complete'; because the LLM 'saw' so much code during training, there is that magic that you just have a blinking prompt and suddenly it comes up with exactly what you intended out of 'thin air'. Then again, I also very often have that it really comes up with stuff I will never want. But I remember mostly the former cases.
I see these sorts of statements from coders who, you know, aren't good programmers in the first place. Here's the secret that I think LLMs are uncovering: there are a lot of really shoddy coders out there - coders who could/would never become good programmers - and they are absolutely going to be replaced by LLMs.
I don't know how I feel about that. I suspect it's not going to be great for society. Replacing blue collar workers for robots hasn't been super duper great.
> Replacing blue collar workers for robots hasn't been super duper great.
That's just not true. Tractors, combine harvesters, dishwashers, washing machines, excavators - we've repeatedly revolutionised blue-collar work and made it vastly, extraordinarily more efficient.
I'd suspect that this equipment also made the work more dangerous. It also made it more industrial in scale and capital costs, driving "homestead" and individual farmers out of the business, replaced by larger and more capitalized corporations.
We went from individual artisans crafting fabrics by hand, to the Industrial Revolution, where children lost fingers tending "extraordinarily more efficient" machines that vastly out-produced artisans. This trend has only accelerated, to where humans consume and throw out an order of magnitude more clothing than a generation ago.
You can see this trend play out across industrialized jobs - people are less satisfied, there are social implications, and the entire nature of the job (and usually the human's independence) is changed.
The transitions through industrialization have caused dramatic societal upheavals. Focusing on the "efficiency" of the changes, ironically, misses the human component of these transitions.
A few articles like this have hit the front page, and something about them feels really superficial to me, and I'm trying to put my finger on why. Perhaps it's just that it's so myopically focused on day 2 and not on day n. They extrapolate from ways AI can replace humans right now, but lack any calculus which might integrate second or third order effects that such economic changes will incur, and so give the illusion that next year will be business as usual but with AI doing X and humans doing Y.
Maybe it is the fact that they blatantly paint a picture of AI doing flawless production work where the only "bottleneck" is us puny humans needing to review stuff. It exemplifies this race to the bottom where everything needs to be hyperefficient and time to market needs to be even lower.
Which, once you stop to think about it, is insane. There is a complete lack of asking why. In fact, when you boil it down to its core argument, it isn't even about AI at all. It is effectively the same grumbling from management layers heard for decades now, where they feel (emphasis) that their product development is slowed down by those pesky engineers and other specialists making things too complex, etc. But now it's framed around AI, with unrealistic expectations dialed up.
As a staff engineer, it upsets me if my Review to Code ratio goes above 1. Days when I am not able to focus and code, because I was reviewing other people’s work all day, I usually am pretty drained but also unsatisfied. If the only job available to engineers becomes “review 50 PRs a day, everyday” I’ll probably quit software engineering altogether.
Reviewing human code and writing thoughtful, justified, constructive feedback to help the author grow is one thing - too much of this activity gets draining, for sure, but at least I get the satisfaction of teaching/mentoring through it.
Reviewing AI-generated code, though, I'm increasingly unsure there's any real point to writing constructive feedback, and I can feel I'll burn out if I keep pushing myself to do it. AI also allows less experienced engineers to churn out code faster, so I have more and more code to review.
But right now I'm still "responsible" for "code quality" and "mentoring", even if we are going to have to figure out what those things even mean when everyone is a 10x vibecoder...
Hoping the stock market calms down and I can just decide I'm done with my tech career if/when this change becomes too painful for dinosaurs like me :)
> AI also allows less experienced engineers to churn out code faster, so I have more and more code to review
This to me has been the absolute hardest part of dealing with the post-LLM fallout in this industry. It's been so frustrating for me personally that I took to writing my thoughts down in a small, humorously titled blog.
I see this too - more and more code looks like it was made by the same person, even though it comes from different people.
I hate these kinds of comments. I'm tired of flagging them for removal, so they pollute the code base more and more - it's like people don't realise how stupid a comment like this is:
    # print result
    print(result)
I'm yet to see a coding agent do what I asked for; so many times the solution I came up with was shorter, cleaner, and a better approach than what my IDE decided to produce... I think it works well as a rubber duck, where I'm able to explore ideas, but in my case that's about it.
I have plenty of experience doing code reviews and to do a good job is pretty hard and thankless work. If I had to do that all day every day I'd be very unhappy.
It is definitely thankless work, at least at my company.
It'd be even more thankless if, instead of writing good feedback that somebody can learn from (or that can spark interesting conversations that I can learn from), you just said "nope GPT, it's not secure enough" and regenerated the whole PR, then read all the way through it again. Absolute tedium nightmare.
My observation over the years as a software dev was that velocity is overrated.
Mostly because all kinds of systems are made for humans - even when we as a dev team were able to pump out features, we got pushback, exactly because users had to be trained, users would have to be migrated, and all kinds of things would have to be documented and accounted for that were tangential to the main goals.
So the bottleneck is a feature, not a bug. I can see how we should optimize away documentation and the tangential stuff so it happens automatically, but not the main job, where it needs more thought anyway.
This is my observation as well, especially in startups. So much spaghetti thrown at walls, and that pressure falls on devs to have higher velocity when it should fall on product, sales, and execs to actually make better decisions.
> What I see happening is us not being prepared for how AI transforms the nature of knowledge work and us having a very painful and slow transition into this new era.
I would've liked for the author to be a bit specific here. What exactly could this "very painful and slow transition" look like? Any commenters have any idea? I'm genuinely curious.
And once the Orient and Decide part is augmented, then we'll be limited by social networks (IRL ones). Every solo founder/small biz will have to compete more and more for marketing eyeballs, and the ones who have access to bigger engines (companies), they'll get the juice they need, and we come back to humans being the bottlenecks again.
That is, until we mutually decide on removing our agency from the loop entirely. And then what?
I think fewer people will decide to open source their work, so AI solutions will diverge from the "dark codebases" not available for models to be trained on. And people who love vibe coding will keep feeding models with code produced by models. Maybe we've already reached the point where enough knowledge is locked into the models and this doesn't matter? I think not, based on the code AI has generated for me. I probably ask the wrong questions.
The method of producing the work can be more important (and easier to review) than the work output itself. Like at the simplest level of a global search-replace of a function name that alters 5000 lines. At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work. My hope is the current crises of reviewing too much AI-generated output will subside into the way you can trust the team because the LLM has reached a high level of “judgement” and competence. But we’re definitely not there yet.
And contrary to the article, idea-generation with LLM support can be fun! They must have tested full replacement or something.
>> At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work
I see you have never managed an outsourced project run by a body shop consultancy. They check the boxes you give them with zero thought or regard to the overall project and require significant micro managing to produce usable code.
I find this sort of whataboutism in LLM discussions tiring. Yes, of course, there are teams of humans that perform worse than an LLM. But it is obvious to all but the most hype-blinded booster that it is possible for teams of humans to work autonomously to produce good results, because that is how all software has been produced to the present day, and some of it is good.
AI increases our ability to produce bullshit but doesn't do much to increase our ability to detect bullshit. One sentence of bullshit takes 1000 sentences of clear reasoning to dispel.
> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?
No.
> Multiply that by a thousand and aim it at every task you once called “work.”
If you mean "menial labor" then sure. The "work" I do is not at all aided by LLMs.
> but our decision-making tools and rituals remain stuck in the past.
That's because LLMs haven't eliminated or even significantly reduced risk. In fact they've created an entirely new category of risk in "hallucinations."
> we need to rethink the entire production-to-judgment pipeline.
Attempting to do this without accounting for risk or how capital is allocated into processes will lead you into folly.
> We must reimagine knowledge work as a high-velocity decision-making operation rather than a creative production process.
Then you will invent nothing new or novel and will be relegated to scraping by on the overpriced annotated databases of your direct competitors. The walled garden just raised the stakes. I can't believe people see a future in it.
> This pile of tasks is how I understand what Vaughn Tan refers to as Meaningmaking: the uniquely human ability to make subjective decisions about the relative value of things.
Why is that a "uniquely human ability"? Machine learning systems are good at scoring things against some criterion. That's mostly how they work.
Something I learned from working alongside data scientists and financial analysts doing algo trading is that you can almost always find great fits for your criteria; nobody ever worries about that. It's coming up with the criteria that everyone frets over, and even more than that, you need to beat other people at it - just being good or even great isn't enough. Your profit is the delta between where you are and all the other sharks in your pool. So LLMs are useless there: getting token-predicted answers is just going to get you the same as everyone else, which means zero alpha.
So - I dunno about uniquely human? But there's definitely something here where, short of AGI, there's always going to need to be someone sitting down and actually beating the market (whatever that metaphor means for your industry or use case).
Finance is sort of a unique beast in that the field is inherently negative-sum. The profits you take home are always going to be profits somebody else isn't getting.
If you're doing like, real work, solving problems in your domain actually adds value, and so the profits you get are from the value you provide.
If you're algo trading then yes, which is what the person you're replying to is talking about.
But "finance" is very broad and covers very real and valuable work like making loans and insurance - be careful not to be too broad in your condemnation.
I think this is challenging because there’s a lot of tacit knowledge involved, and feedback loops are long and measurement of success ambiguous.
It’s a very rubbery, human oriented activity.
I’m sure this will be solved, but it won’t be solved by noodling with prompts and automation tools - the humans will have to organise themselves to externalise expert knowledge and develop an objective framework for making ‘subjective decisions about the relative value of things’.
The article rightly points out that people don't enjoy just being reviewers: we like to take an active role in playing, learning, and creating. They point out the need to find a solution to this, but then never follow up on that idea.
This is perhaps the most fundamental problem. In the past, tools took care of the laborious and tedious work so we could focus on creativity. Now we are letting AI do the creative work and asking humans to become managers and code reviewers. Maybe that's great for some people, but it's not what most problem solvers want to be doing. The same people who know how to judge such things are the same people who have years of experience doing this things. Without that experience you can't have good judgement.
Let the AI make it faster and easier for me to create; don't make it replace what I do best and leave me as a manager and code reviewer.
The parallels with grocery checkouts are worth considering. Humans are great at recognizing things, handling unexpected situations, and being friendly and personable. People working checkouts are experts at these things.
Now replace that with self serve checkouts. Random customers are forced to do this all themselves. They are not experts at this. The checkouts are less efficient because they have to accommodate these non-experts. People have to pack their own bags. And they do all of this while punching buttons on a soulless machine instead of getting some social interaction in.
But worse off is the employee who manages these checkouts. Now, instead of being social, they are security guards and tech support. They are constantly having to troubleshoot computer issues and teach disinterested and frustrated beginners how to do something that should be so simple. The employee spends most of their time as a manager and watchdog, looking at a screen that shows the status of all the checkouts, watching for issues, like a prison security guard. This work is passive and unengaging yet requires constant attention - something humans aren't good at. What little interaction they do have with others comes in situations where the customer is upset.
We didn't automate anything here; we just changed who does what. We made customers into the people doing the checkouts, and we made lower-level staff into managers of them, plus tech support.
This is what companies are trying to do with AI. They want to have fewer employees whose job it is to manage the AIs, directing them to produce. The human is left assigning tasks and checking the results - managers of thankless and soulless machines. The credit for the creation goes to the machines while the employees are seen as low skilled and replaceable.
And we end up back at the start: trying to find high skilled people to perform low skilled work based on experience that they only would have had if they had being doing high skilled work to begin with. When everyone is just managing an AI, no one will know what it is supposed to do.
This really isn’t true in principle. The current LLM ecosystems can’t do “meaning tasks” but there are all kinds of “legacy” AI expert systems that do exactly what is required.
My experience is that middle manager gatekeepers are the most reluctant to participate in building knowledge systems that obsolete them though.
> Ultimately, I don’t see AI completely replacing knowledge workers any time soon.
How was that conclusion reached? And what is meant by knowledge workers? Any work with knowledge is exactly the domain of LLMs. So, LLMs are indeed knowledge workers.
> He argues this type of value judgement is something AI fundamentally cannot do, as it can only pattern match against existing decisions, not create new frameworks for assigning worth.
Counterpoint: that decision has to be made only once (probably by some expert). AI can incorporate that training data into its reasoning and voila, it becomes available to everyone. A software framework is already a collection of good decisions, practices, and tastes made by experts.
> An MIT study found materials scientists experienced a 44% drop in job satisfaction when AI automated 57% of their “idea-generation” tasks
Counterpoint: now consider making materials-science decisions that require materials to have not just 3 properties but 10 or 15.
> Redesigning for Decision Velocity
Suggestion: I think this section implies we must ask our experts to externalize all their tastes, preferences, and top-down thinking so that juniors can internalize them. So experts will be teaching details (based on their internal model) to LLMs while teaching the model itself to humans.
TFA is right to point out the bottleneck problem for reviewing content - there’s a couple things that compound to make this worse than it should be -
The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces, and you need to review everything with the kind of care with which you review intern contributions.
The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.
That problem leads to the final problem, which is that you need a senior engineer to vet the LLM’s code, but you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing - there’s no way up that ladder except to climb it yourself.
All of this may change in the next few years or the next iteration, but the systems as they are today are a tantalizing glimpse at an interesting future, not the actual present you can build on.
> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces
This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
Traditional computer systems whose outputs relied on probability solved this by including a confidence value next to any output. Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.
What would those probabilities mean in the context of these modern LLMs? They are basically “try to continue the phrase like a human would” bots. I imagine the question of “how good of an approximation is this to something a human might write” could possibly be answerable. But humans often write things which are false.
The entire universe of information consists of human writing, as far as the training process is concerned. Fictional stories and historical documents are equally “true” in that sense, right?
Hmm, maybe somehow one could score outputs based on whether another contradictory output could be written? But it will have to be a little clever. Maybe somehow rank them by how specific they are? Like, a pair of reasonable contradictory sentences that can be written about the history-book setting indicate some controversy. A pair of contradictory sentences, one about history-book, one about Narnia, each equally real to the training set, but the fact that they contradict one another is not so interesting.
> But humans often write things which are false.
LLMs do it much more often. One of the many reasons in the coding area is the fact that they're trained on both the broken and working code. They can propose as a solution a piece of code that was taken verbatim from "why is this code not working" SO question.
Google decided to approach this major problem by trying to run the code before giving the answer. Gemini doesn't always succeed as it might not have all packages needed installed for example, but at least it tries, and when it detects bullshit, it tries do correct that.
Interesting point.
You got me thinking (less about llms, more about humans), that adults do have many contradictory truths, some require nuance, some require completely different mental compartment.
Now I feel more flexible about what truth is, as a teen and child I was more stuborn, sturdy.
> But humans often write things which are false.
Not to mention, humans say things that make sense for humans to say and not a machine. For example, one recent case I saw was where the LLM hallucinated having a Macbook available that it was using to answer a question. In the context of a human, it was a totally viable response, but was total nonsense coming from an LLM.
It’s interesting because often the revolution of LLM is compared to the calculator but a calculator that does a random calculation mistake would never have been used so much in critical systems. That’s the point of a calculator, we never double check the result. But we will never check the result of an LLM because of the statistical margin of error in the feature.
> But we will never check the result of an LLM because of the statistical margin of error in the feature.
I don't follow this statement: if anything, we absolutely must check the resut of an LLM for the reason you mention. For coding, there are tools that attempt to check the generated code for each answer to at least guarantee the code runs (whether it's relevant, optimal, or bug-free is another issue, and one that is not so easy to check without context that can be significant at times).
Right: When I avoid memorizing a country's capital city, that's because I can easily know when I will want it later and reliably access it from an online source.
When I avoid multiplying large numbers in my head, that's because I can easily characterize the problem and reliably use a calculator.
Neither are the same as people trying to use LLMs to unreliably replacing critical thinking.
> This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
Sounds a lot like most engineers I’ve ever worked with.
There are a lot of people utilizing LLMs wisely because they know and embrace this. Reviewing and understanding their output has always been the game. The whole “vibe coding” trend where you send the LLM off to do something and hope for the best will teach anyone this lesson very quickly if they try it.
Most engineers you worked with probably cared about getting it right and improving their skills.
LLMs seem to care about getting things right and improve much faster than engineers. They've gone from non-verbal to reasonable coders in ~5 years, it takes humans a good 15 to do the same.
LLMs have not improved at all.
The people training the LLMs redid the training and fine tuned the networks and put out new LLMs. Even if marketing misleadingly uses human related terms to make you believe they evolve.
A LLM from 5 years ago will be as bad as 5 years ago.
Conceivably a LLM that can retrain itself on the input that you give it locally could indeed improve somewhat, but even if you could afford the hardware, do you see anyone giving you that option?
You cannot really compare the two. An engineer will continue to learn and adapt their output to the teams and organizations they interact with. They will be seamlessly picking up core principles, architectural nouances and verbiage of the specific environment. You need to explicitly pass all that to an llm and all approaches today lack. Most importantly, an engineer will continue accumulating knowledge and skills while you interact with them. An llm won't.
With ChatGPT explicitly storing "memory" about the user and access to the history of all chats, that can also change. Not hard to imagine an AI-powered IDE like Cursor understanding that when you reran a prompt or gave it an error message it came to understand that its original result was wrong in some way and that it needs to "learn" to improve its outputs.
Maybe. I'd wager the next couple of generations of inference architecture will still have issues with context on that strategy. Trying to work with the state of the art models at their context boundaries quickly descends into gray goop like behavior for now and I don't see anything on the horizon that changes that rn.
Human memory is new neural paths.
LMM "memory" is a larger context with unchanged neural paths.
The confidence value is a good idea. I just saw a tech demo from F5 that estimated the probability that a prompt might be malicious. The administrator parameterized the tool as a probability and the logs capture that probability. Could be a useful output for future generative AI products to include metadata about uncertainty in their outputs
That's not a "fatal" flaw. It just means you have to manually review every output. It can still save you time and still be useful. It's just that vibe coding is stupid for anything that might ever touch production.
> Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.
That's not how they work - they don't have internal models where they are sort of confident that this is a good answer. They have internal models where they are sort of confident that these tokens look like they were human generated in that order. So they can be very confident and still wrong. Knowing that confidence level (log p) would not help you assess.
There are probabilistic models where they try to model a posterior distribution for the output - but that has to be trained in, with labelled samples. It's not clear how to do that for LLMs at the kind of scale that they require and affordably.
You could consider letting it run code or try out things in simulations and use those as samples for further tuning, but at the moment, this might still lead them to forget something else or just make some other arbitrary and dumb mistake that they didn't make before the fine tuning.
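To make the earlier point about log p concrete, here is a minimal sketch with made-up per-token numbers of what that "confidence" signal actually measures: how plausible the generated text looks to the model, not whether it is true.

    # Minimal sketch with hypothetical per-token log-probabilities.
    import math

    token_logprobs = [-0.05, -0.10, -0.02, -0.30]  # made-up values for one answer

    sequence_logprob = sum(token_logprobs)
    print(f"log p = {sequence_logprob:.2f}")            # -0.47
    print(f"p     = {math.exp(sequence_logprob):.1%}")  # ~62.5%
    # A high value only means "this looks like likely text to the model",
    # not "this statement is likely to be true".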
How would a meaningful confidence value be calculated with respect to the output of an LLM? What is “correct” LLM output?
It can be the probability of the response being accepted by the prompter
So unique to each prompter, refined over time?
>That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
I think I can confidently assert that this applies to you and I as well.
I choose a computer to do a task because I expect it to be much more accurate, precise, and deterministic than a human.
That’s one set of reasons. But you might also choose to use a computer because you need something done faster, cheaper, or at a larger scale than humans could do it, but where human-level accuracy is acceptable.
Honestly, I am surprised by your opinion on this matter (something also echoed a few times in other comments). Let's switch the context for a bit… human drivers kill a few thousand people, so why have so many regulations for self-driving cars… why not kick out pilots entirely, since autopilot can do smooth (though tire-damaging) landings and takeoffs… how about we lay off all govt workers and regulatory auditors, since LLMs are better at recall and most of those paper pushers do subpar work anyway…
My analogies may sound like an apples-to-gorillas comparison, but the point of automation is that it performs 100x better than a human with the highest safety. Just because I can drive under the influence and get a fine does not mean a self-driving car should drive without fully operational sensors; both bear the same risk of killing people, but one has stricter regulatory restrictions.
There's an added distinction; if you make a mistake, you are liable for it. Including jail time, community service, being sued by the other party etc.
If an LLM makes a mistake? Companies will get off scot free (they already are), unless there's sufficient loophole for a class-action suit.
> That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.
This is my exact same issue with LLMs, and it's routinely ignored by LLM evangelists/hypesters. It's not necessarily about being wrong; it's the non-deterministic nature of the errors. They're not only non-deterministic but unevenly distributed. So you can't predict errors, and you need expertise to review all the generated content looking for errors.
There's also not necessarily an obvious mapping between input tokens and an output since the output depends on the whole context window. An LLM might never tell you to put glue on pizza because your context window has some set of tokens that will exclude that output while it will tell me to do so because my context window doesn't. So there's not even necessarily determinism or consistency between sessions/users.
I understand the existence of Gell-Mann amnesia so when I see an LLM give confident but subtly wrong answers about a Python library I don't then assume I won't also get confident yet subtly wrong answers about the Parisian Metro or elephants.
This is a nitpick because I think your complaints are all totally valid, except that I think blaming non-determinism isn't quite right. The models are in fact deterministic. But that's just a technicality: in a practical sense they are non-deterministic, in that a human can't determine what they'll produce without running them, and even then they can be sensitive to changes in the context window like you said, so even after running them once you don't know you'll get a similar output from similar inputs.
I only post this because I find it kind of interesting; I balked at blaming non-determinism because it technically isn't, but came to conclude that practically speaking that's the right thing to blame, although maybe there's a better word that I don't know.
Non-explicable?
It's deterministic in that (input A, state B) always produces output C. But it can't generally be reasoned about, in terms of how much change to A will produce C+1, nor can you directly apply mechanical reasoning to /why/ (A.B) produces C and get a meaningful answer.
(Yes, I know, "the inputs multiplied by the weights", but I'm talking about what /meaning/ someone might ascribe to certain weights being valued X, Y or Z in the same sense as you'd look at a variable in a running program or a physical property of a mechanical system).
> from a practical sense they are non-deterministic in that a human can't determine what it'll produce without running it
But this is also true for programs that are deliberately random. If you program a computer to output a list of random (not pseudo-random) numbers between 0 and 100, then you cannot determine ahead of time what the output will be.
The difference is, you at least know the range of values that it will give you and the distribution, and if programmed correctly, the random number generator will consistently give you numbers in that range with the expected probability distribution.
In contrast, an LLM's answer to "List random numbers between 0 and 100" usually will result in what you expect, or (with a nonzero probability) it might just up and decide to include numbers outside of that range, or (with a nonzero probability) it might decide to list animals instead of numbers. There's no way to know for sure, and you can't prove from the code that it won't happen.
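A minimal sketch of that contrast, assuming nothing beyond the standard library: the programmed generator's range holds by construction, while an LLM's reply is just text you have to check after the fact.

    # The range of a programmed generator is guaranteed by construction.
    import random

    numbers = [random.randint(0, 100) for _ in range(10)]
    assert all(0 <= n <= 100 for n in numbers)  # holds on every run, provably
    print(numbers)
    # An LLM asked to "list random numbers between 0 and 100" gives you text;
    # any such property has to be verified on its output, every time.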
At the base levels LLMs aren't actually deterministic because the model weights are typically floats of limited precision. At a large enough scale (enough parameters, model size, etc) you will run into rounding issues that effectively behave randomly and alter output.
Even with temperature of zero floating point rounding, probability ties, MoE routing, and other factors make outputs not fully deterministic even between multiple runs with identical contexts/prompts.
In theory you could construct a fully deterministic LLM but I don't think any are deployed in practice. Because there's so many places where behavior is effectively non-deterministic the system itself can't be thought of as deterministic.
Errors might be completely innocuous like one token substituted for another with the same semantic meaning. An error might also completely change the semantic meaning of the output with only a single token change like an "un-" prefix added to a word.
The non-determinism is both technically and practically true in practice.
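A minimal sketch of one of those effects: floating-point addition is not associative, so summing the same values in a different order (as parallel reductions may do from run to run) can produce different results.

    # Floating-point addition is not associative; reduction order changes results.
    a, b, c = 1e16, -1e16, 1.0

    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is absorbed into -1e16 before a cancels it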
Most floating point implementations have deterministic rounding. The popular LLM inference engine llama.cpp is deterministic when using the same sampler seed, hardware, and cache configuration.
The prompts we’re using seem like they’d generate the same forced confidence from a junior. If everything’s a top-down order, and your personal identity is on the line if I’m not “happy” with the results, then you’re going to tell me what I want to hear.
There's some differences between junior developers and LLMs that are important. For one a human developer can likely learn from a mistake and internalize a correction. They might make the mistake once or twice but the occurrences will decrease as they get experience and feedback.
LLMs as currently deployed don't do the same. They'll happily make the same mistake consistently if a mistake is popular in the training corpus. You need to waste context space telling them to avoid the error until/unless the model is updated.
It's entirely possible for good mentors to make junior developers (or any junior position) feel comfortable being realistic in their confidence levels for an answer. It's ok for a junior person to admit they don't know an answer. A mentor requiring a mentee to know everything and never admit fault or ignorance is a bad mentor. That's encouraging thought terminating behavior and helps neither person.
It's much more difficult to alter system prompts or get LLMs to even admit when they're stumped. They don't have meaningful ways to even gauge their own confidence in their output. Their weights are based on occurrences in training data rather than correctness of the training data. Even with RL the weight adjustments are only as good as the determinism of the output for the input which is not great for several reasons.
> Do any LLMs do this? If not, why can't they?
Because they aren't knowledgeable. The marketing and at-first-blush impressions that LLMs leave as some kind of actual being, no matter how limited, mask this fact and it's the most frustrating thing about trying to evaluate this tech as useful or not.
To make an incredibly complex topic somewhat simple, LLMs train on a series of materials, in this case we'll talk about words. The model learns that "it turns out," "in the case of", "however, there is" are all words that naturally follow one another in writing, but it has no clue why one would choose one over the other beyond the other words which form the contexts in which those word sequences appear. This process is repeated billions of times as it analyzes the structure of billions of written words, until it arrives at a massive statistical model of how likely it is that every word will be followed by every other word or punctuation mark.
Having all that data available does mean an LLM can generate... words. Words that are pretty consistently spelled and arranged correctly in a way that reflects the language they belong to. And, thanks to the documents it trained on, it gains what you could, if you're feeling generous, call a "base of knowledge" on a variety of subjects, in that by the same statistical model, it has "learned" that "measure twice, cut once" is said often enough that it's likely good advice. But again, it doesn't know why that is, which would be: when building something, measuring a piece, marking it, then measuring it a second or even third time before you cut optimizes your cuts and avoids wasting material, because the cut is an operation that cannot be reversed.
However, that knowledge has a HARD limit in terms of what was understood within its training data. For example, way back, a GPT model recommended using Elmer's glue to keep pizza toppings attached when making a pizza. No sane person would suggest this, because glue... isn't food. But the LLM doesn't understand that; it takes the question "how do I keep toppings on pizza?", and it says, well, a ton of things I read said you should use glue to stick things together, and ships that answer out.
This is why I firmly believe LLMs and true AI are just... not the same thing, at all, and I'm annoyed that we now call LLMs AI and AI AGI, because in my mind, LLMs do not demonstrate any intelligence at all.
LLMs are great machine learning tech. But what exactly are they learning? No one knows, because we're just feeding them the internet (or a good part of it) and hoping something good comes out at the end. But so far, it just shows that they only learn the closeness of one unit (token, pixel block, ...) to another, with no idea why they are close in the first place.
The glue on pizza thing was a bit more pernicious because of how the model came to that conclusion: SERPs. Google's LLM pulled the top result for that query from Reddit and didn't understand that the Reddit post was a joke. It took it as the most relevant thing and hilarity ensued.
In that case the error was obvious, but these things become "dangerous" for that sort of use case when end users trust the "AI result" as the "truth".
Treating "highest ranked," "most upvoted," "most popular," and "frequently cited" as a signal of quality or authoritativeness has proven to be a persistent problem for decades.
Depends on the metric. Humans who up-voted that material clearly thought it was worth it.
The problem is distinguishing the various reasons people think something is worthwhile and using the right context.
That requires a lot of intelligence.
The fact that modern language models are able to model sentiment and sarcasm as well as they do is a remarkable achievement.
Sure, there is a lot of work to be done to improve that, especially at scale and in products where humans expect something more than a good statistical "success rate": they expect the precision level they are used to getting from professionally curated human sources.
> The marketing and at-first-blush impressions that LLMs leave as some kind of actual being, no matter how limited, mask this fact
I like to highlight the fundamental difference between fictional qualities of a fictional character versus actual qualities of an author. I might make a program that generates a story about Santa Claus, but that doesn't mean Santa Claus is real or that I myself have a boundless capacity to care for all the children in the world.
Many consumers are misled into thinking they are conversing with an "actual being", rather than contributing "then the user said" lines to a hidden theater script that has a helpful-computer character in it.
This explanation is only superficially correct, and there is more to it than simply predicting the next word.
It is the way in which the prediction works, that leads to some form of intelligence.
This sounds an awful lot like the old Markov chains we used to write for fun in school. Is the difference really just scale? There has got to be more to it.
You can think of an autoregressive LLM as a Markov chain, sure. It's just sampling from a much more sophisticated distribution than the ones you wrote for fun did. That by itself is not much of an argument against LLMs, though.
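For comparison, a minimal sketch of those school-project Markov chains (with a made-up toy corpus): count which word follows which, then sample. An LLM conditions on a much longer context with a learned model rather than raw counts, but the sampling loop is analogous.

    # Minimal word-level Markov chain: next-word counts plus random sampling.
    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat the dog sat on the rug".split()

    following = defaultdict(list)           # word -> list of observed next words
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev].append(nxt)

    def generate(start, length=8):
        word, out = start, [start]
        for _ in range(length):
            choices = following.get(word)
            if not choices:
                break
            word = random.choice(choices)   # sample proportionally to counts
            out.append(word)
        return " ".join(out)

    print(generate("the"))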
I don't understand these arguments at all. Do you currently not do code reviews at all, and just commit everything directly to the repo? Do your coworkers?
If this is the case, I can't take your company at all seriously. And if it isn't, then why is reviewing the output of an LLM somehow more burdensome than having to write things yourself?
> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces, and you need to review everything with the kind of care with which you review intern contributions.
Also, people aren't meant to be hyper-vigilant in this way.
Which is a big contradiction in the way contemporary AI is sold (LLMs, self-driving cars): they replace a relatively fun active task for humans (coding, driving) with a mind-numbing passive monitoring one that humans are actually terrible at. Is that making our lives better?
> That problem leads to the final problem, which is that you need a senior engineer to vet the LLM’s code, but you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing - there’s no way up that ladder except to climb it yourself
I suspect software will stumble into the strategy deployed by the big 4 Accounting firms and large law firms - have juniors have the first pass and have the changes filter upwards in seniority, with each layer adding comments and suggestions and sending it down to be corrected, until they are ready to sign-off on it.
This will be inefficient and wildly incompatible with agile practice, but that's one possible way for juniors to become mid-level, and eventually seniors, after paying their dues. It absolutely is inefficient in many ways, and mostly incompatible with the current way of working, as merge-sets have to be considered in a broader context all the time.
> hypervigilant
If a tech works 80% of the time, then I know that I need to be vigilant and I will review the output. The entire team structure is aware of this. There will be processes to offset this 20%.
The problem is that when the AI becomes > 95% accurate (if at all) then humans will become complacent and the checks and balances will be ineffective.
80% is good enough for like the bottom 1/4th-1/3rd of software projects. That is way better than an offshore parasite company throwing stuff at the wall because they don't care about consistency or quality at all. These projects will bore your average HNer to death rather quickly (if not technically, then politically).
Maybe people here are used to good code bases, so it doesn't make sense that 80% is good enough there, but I've seen some bad code bases (that still made money) that would be much easier to work on by not reinventing the wheel and not following patterns that are decades old and no one does any more.
We are already there. The threshold is much closer to 80% for average people. For average folks, LLMs have rapidly gone from "this is wrong and silly" to "this seems right most of the time so I just trust it when I search for info" in a few years.
> you don’t get to be a senior engineer without being the kind of junior engineer that the LLMs are replacing
I disagree: LLMs are not replacing the kind of junior engineer who becomes a senior one. They replace "copy from StackOverflow until I get something mostly working" coders. Those end up going up the management ladder, not the engineering one. LLMs are (atm) not replacing the junior engineers who use tools to get an idea and then read the documentation.
I used to think this about AI, that it will cause a dearth of junior engineers. But I think it is really going to end up as a new level of abstraction. Aside from very specific bits of code, there is nothing AI does to remove any of the thinking work for me. So now I will sit down, reason through a problem, make a plan and... instead of punching code I write a prompt that punches the code.
At the end of the day, AI can't tell us what to build or why to build it. So we will always need to know what we want to make or what ancillary things we need. LLMs can definitely support that, but knowing ALL the elements and gotchas is crucial.
I don't think that removes the need for juniors, I think it simplifies what they need to know. Don't bother learning the intricacies of the language or optimization tricks or ORM details - the LLM will handle all that. But you certainly will need to know about catching errors and structuring projects and what needs testing, etc. So juniors will not be able to "look under the hood" very well but will come in learning to be a senior dev FIRST and a junior dev optionally.
Not so different from the shift from everyone programming in C++ during the advent of PHP with "that's not really programming" complaints from the neckbeards. Doing this for 20 years and still haven't had to deal with malloc or pointers.
The C++ compilers at least don't usually miscompile your source code. And when they do, it happens very rarely, mostly in obscure corners of the language, and it's kind of a big deal, and the compiler developers fix it.
Compare to the large langle mangles, which somewhat routinely generate weird and wrong stuff, it's entirely unpredictable what inputs may trip it, it's not even reproducible, and nobody is expected to actually fix that. It just happens, use a second LLM to review the output of the first one or something.
I'd rather have my lower-level abstractions be deterministic in a humanly-legible way. Otherwise in a generation or two we may very well end up being actual sorcerers who look for the right magical incantations to make the machine spirits obey their will.
When interns make mistakes they make human mistakes that are easier to catch for humans than the alien kind of mistake that llms make.
In my experience, LLMs aren't advanced enough for that, they just randomly add sql injections to code that otherwise uses proper prepared statements. We can get interns to stop doing that in one day.
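A minimal sketch of that class of mistake, using the standard library's sqlite3 and a throwaway in-memory table: string-built SQL versus a bound parameter.

    # String-built SQL vs. a prepared statement with a bound parameter.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

    user_input = "alice' OR '1'='1"

    # Vulnerable: the input is spliced straight into the query text.
    unsafe = f"SELECT role FROM users WHERE name = '{user_input}'"
    print(conn.execute(unsafe).fetchall())              # returns rows it shouldn't

    # Safe: the driver binds the value as data, never as SQL.
    safe = "SELECT role FROM users WHERE name = ?"
    print(conn.execute(safe, (user_input,)).fetchall()) # returns []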
I wanted to add:
The demographic shift over time will eventually lead to degradation of LLM performance, because more content will be of worse quality and transformers are a concept that loses symbolic inference.
So the assumption that LLMs will increase in performance will only hold for the current generations of software engineers, whereas the next generations will automatically lead to worse LLM performance once they've replaced the demographic of the current seniors.
Additionally, every knowledge resource that led to the current generation's advancements is dying out due to proprietarization.
Courses, wikis, forums, tutorials... they are all now part of the enshittification cycle, which means that in the future they will contain less factual content per actual amount of content - which in turn will also contribute to making LLM performance worse.
Add to that the problems that come with such platforms, like the stackoverflow mod strikes or the ongoing reddit moderation crisis, and you got a recipe for Idiocracy.
I decided to archive a copy of all books, courses, wikis and websites that led to my advancements in my career, so I have a backup of it. I encourage everyone to do the same. They might be worth a lot in the future, given how the trend is progressing.
> The second is that the LLMs don’t learn once they’re done training, which means I could spend the rest of my life tutoring Claude and it’ll still make the exact same mistakes, which means I’ll never get a return for that time and hypervigilance like I would with an actual junior engineer.
However, this creates a significant return on investment for opensourcing your LLM projects. In fact, you should commit your LLM dialogs along with your code. The LLM won't learn immediately, but it will learn in a few months when the next refresh comes out.
> In fact, you should commit your LLM dialogs along with your code.
Wholeheartedly agree with this.
I think code review will evolve from "Review this code" to "Review this prompt that was used to generate some code"
> In fact, you should commit your LLM dialogs along with your code.
Absolutely, for different reasons including later reviews / visits to the code + prompts.
I wonder if some sort of summarization / gist of the course correction / teaching would work.
For example Cursor has checked-in rules files and there is a way to have the model update the rules themselves based on the conversation
That may be true, but the cost of refactoring code that is wrong also plummets.
So even if 9 out of 10 is wrong, you can just can it.
What's going to happen is that LLMs will eventually make fewer mistakes, and then people will just put up with more bugs in almost all situations, leading to everything being noticeably worse, and build everything with robustness in mind, not correctness. But it will all be cheaper so there you go.
The intro sentence to this is quite funny.
> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?
I actually don't, because so far this only happened with trivial phrases or text I had already typed in the past. I do remember however dozens of times where autocorrect wrongly "corrected" the last word I typed, changing an easy to spot typo into a much more subtle semantic error.
Sometimes autocorrect will "correct" perfectly valid words if it deems the correction more appropriate. Ironically while I was typing this message, it changed the word "deems" to "seems" repeatedly. I'm not sure what's changed with their algorithm, but this appears to be far more heavy handed than it used to be.
Not from traditional autocomplete, but I have with LLM 'autocomplete': because the LLM 'saw' so much code during training, there is that magic where you just have a blinking cursor and suddenly it comes up with exactly what you intended out of 'thin air'. Then again, I also very often find that it comes up with stuff I will never want. But I mostly remember the former cases.
I see these sorts of statements from coders who, you know, aren't good programmers in the first place. Here's the secret that I think LLMs are uncovering: there are a lot of really shoddy coders out there, coders who could/would never become good programmers, and they are absolutely going to be replaced with LLMs.
I don't know how I feel about that. I suspect it's not going to be great for society. Replacing blue collar workers for robots hasn't been super duper great.
> Replacing blue collar workers for robots hasn't been super duper great.
That's just not true. Tractors, combine harvesters, dishwashers, washing machines, excavators: we've repeatedly revolutionised blue-collar work and made it vastly, extraordinarily more efficient.
> made it vastly, extraordinarily more efficient.
I'd suspect that this equipment also made the work more dangerous. It also made it more industrial in scale and capital costs, driving "homestead" and individual farmers out of the business, replaced by larger and more capitalized corporations.
We went from individual artisans crafting fabrics by hand, to the Industrial Revolution where children lost fingers tending to "extraordinary more efficient" machines that vastly out-produced artisans. This trend has only accelerated, where humans consume and throw out an order of magnitude more clothing than a generation ago.
You can see this trend play out across industrialized jobs - people are less satisfied, there are social implications, and the entire nature of the job (and usually the human's independence) is changed.
The transitions through industrialization have had dramatic societal upheavals. Focusing on the "efficiency" of the changes, ironically, misses the human component of these transitions.
A few articles like this have hit the front page, and something about them feels really superficial to me, and I'm trying to put my finger on why. Perhaps it's just that it's so myopically focused on day 2 and not on day n. They extrapolate from ways AI can replace humans right now, but lack any calculus which might integrate second or third order effects that such economic changes will incur, and so give the illusion that next year will be business as usual but with AI doing X and humans doing Y.
Maybe it is the fact that they blatantly paint a picture of AI doing flawless production work where the only "bottleneck" is us puny humans needing to review stuff. It exemplifies this race to the bottom where everything needs to be hyperefficient and time to market needs to be even lower.
Which, once you stop to think about it, is insane. There is a complete lack of asking why. In fact, when you boil it down to its core argument, it isn't even about AI at all. It is effectively the same grumblings from management layers heard for decades now, where they feel (emphasis) that their product development is slowed down by those pesky engineers and other specialists making things too complex, etc. But now just framed around AI with unrealistic expectations dialed up.
Why: they assume that humans have some secret sauce. Like... judgement...we don't. Once you extrapolate, yes, many things will be very very different.
Validating the outputs of a stochastic parrot sounds like a very alienating job.
As a staff engineer, it upsets me if my Review to Code ratio goes above 1. On days when I am not able to focus and code because I was reviewing other people’s work all day, I usually end up pretty drained but also unsatisfied. If the only job available to engineers becomes “review 50 PRs a day, every day”, I’ll probably quit software engineering altogether.
Feeling this too. And AI is making it "worse".
Reviewing human code and writing thoughtful, justified, constructive feedback to help the author grow is one thing - too much of this activity gets draining, for sure, but at least I get the satisfaction of teaching/mentoring through it.
Reviewing AI-generated code, though, I'm increasingly unsure there's any real point to writing constructive feedback, and I can feel I'll burn out if I keep pushing myself to do it. AI also allows less experienced engineers to churn out code faster, so I have more and more code to review.
But right now I'm still "responsible" for "code quality" and "mentoring", even if we are going to have to figure out what those things even mean when everyone is a 10x vibecoder...
Hoping the stock market calms down and I can just decide I'm done with my tech career if/when this change becomes too painful for dinosaurs like me :)
I could not agree more.
> AI also allows less experienced engineers to churn out code faster, so I have more and more code to review
This to me has been the absolute hardest part of dealing with the post-LLM fallout in this industry. It's been so frustrating for me personally that I took to writing my thoughts down in a small blog post humorously titled
"Yes, I will judge you for using AI...",
in fact I say nearly this exact sentiment in it.
https://jaysthoughts.com/aithoughts1
Thanks, I like your framing in terms of the impact on "trust".
> Generating more complex solutions that are possibly not understood by the engineer submitting the changes.
I'd possibly remove "possibly" :-)
I see this too: more and more code looks like it was made by the same person, even though it comes from different people.
I hate these kinds of comments. I'm tired of flagging them for removal, so they pollute the code base more and more, as if people did not realise how stupid a comment like this is.
I'm yet to experience a coding agent doing what I asked for; so many times the solution I came up with was shorter, cleaner, and a better approach than what my IDE decided to produce... I think it works well as a rubber duck where I was able to explore ideas, but in my case that's about it.
> As a staff engineer, it upsets me if my Review to Code ratio goes above 1.
How does this work? Do you allow merging without reviews? Or are other engineers reviewing code way more than you?
Sorry I wrote that in haste. I meant it in terms of time spent. In absolute number of PRs, you’d probably be reviewing more PRs than you create.
I was about to say, I’m not even at the staff level, and I already review significantly more PRs than I myself push.
But in terms of time spent, thankfully still spend more time writing.
Most knowledge work - perhaps all of it - is already validating the output of stochastic parrots, we just call those stochastic parrots "management'.
It's actually very fun, ime.
I have plenty of experience doing code reviews and to do a good job is pretty hard and thankless work. If I had to do that all day every day I'd be very unhappy.
It is definitely thankless work, at least at my company.
It’d be even more thankless if, instead of writing good feedback that somebody can learn from (or that can spark interesting conversations that I can learn from), you just said “nope, GPT, it’s not secure enough”, regenerated the whole PR, then read all the way through it again. An absolute tedium nightmare.
My observation over the years as a software dev was that velocity is overrated.
Mostly because all kinds of systems are made for humans - even when we as a dev team were able to pump out features, we got pushback, exactly because users had to be trained, users had to be migrated, and all kinds of things tangential to the main goals had to be documented and accounted for.
So the bottleneck is a feature, not a bug. I can see how we should optimize away documentation and the tangential stuff so it happens automatically, but not the main job, where it needs more thought anyway.
This is my observation as well, especially in startups. So much spaghetti is thrown at walls, and that pressure falls on devs to have higher velocity when it should fall on product, sales, and execs to actually make better decisions.
> What I see happening is us not being prepared for how AI transforms the nature of knowledge work and us having a very painful and slow transition into this new era.
I would've liked for the author to be a bit specific here. What exactly could this "very painful and slow transition" look like? Any commenters have any idea? I'm genuinely curious.
And once the Orient and Decide part is augmented, then we'll be limited by social networks (IRL ones). Every solo founder/small biz will have to compete more and more for marketing eyeballs, and the ones who have access to bigger engines (companies), they'll get the juice they need, and we come back to humans being the bottlenecks again.
That is, until we mutually decide on removing our agency from the loop entirely. And then what?
I think fewer people will decide to open source their work, so AI solutions will diverge from 'dark codebases' not available for models to be trained on. And people who love vibe coding will keep feeding models with code produced by models. Maybe we already reached the point where enough knowledge was locked in the models and this does not matter? I think not, based on the code AI generated for me. I probably ask wrong questions.
The method of producing the work can be more important (and easier to review) than the work output itself. Like at the simplest level of a global search-replace of a function name that alters 5000 lines (a sketch of that simple case follows below). At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work. My hope is the current crisis of reviewing too much AI-generated output will subside into the way you can trust the team, because the LLM has reached a high level of “judgement” and competence. But we’re definitely not there yet.
And contrary to the article, idea-generation with LLM support can be fun! They must have tested full replacement or something.
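For that simplest case, a minimal sketch with hypothetical names and paths: a whole-word rename whose method is trivially reviewable even if the resulting diff touches thousands of lines.

    # Whole-word rename across a source tree; the *method* is the review unit.
    import pathlib
    import re

    OLD, NEW = "fetch_user", "load_user"    # hypothetical function names
    pattern = re.compile(rf"\b{OLD}\b")

    for path in pathlib.Path("src").rglob("*.py"):   # hypothetical tree layout
        text = path.read_text()
        if pattern.search(text):
            path.write_text(pattern.sub(NEW, text))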
>> At a complex level, you can trust a team of humans to do something without micro-managing every aspect of their work
I see you have never managed an outsourced project run by a body shop consultancy. They check the boxes you give them with zero thought or regard to the overall project and require significant micro managing to produce usable code.
I find this sort of whataboutism in LLM discussions tiring. Yes, of course, there are teams of humans that perform worse than an LLM. But it obvious to all but the most hype-blinded booster that it is possible for teams of humans to work autonomously to produce good results, because that is how all software has been produced to the present day, and some of it is good.
AI increases our ability to produce bullshit but doesn't do much to increase our ability to detect bullshit. One sentence of bullshit takes 1000 sentences of clear reasoning to dispel.
> Remember the first time an autocomplete suggestion nailed exactly what you meant to type?
No.
> Multiply that by a thousand and aim it at every task you once called “work.”
If you mean "menial labor" then sure. The "work" I do is not at all aided by LLMs.
> but our decision-making tools and rituals remain stuck in the past.
That's because LLMs haven't eliminated or even significantly reduced risk. In fact they've created an entirely new category of risk in "hallucinations."
> we need to rethink the entire production-to-judgment pipeline.
Attempting to do this without accounting for risk or how capital is allocated into processes will lead you into folly.
> We must reimagine knowledge work as a high-velocity decision-making operation rather than a creative production process.
Then you will invent nothing new or novel and will be relegated to scraping by on the overpriced annotated databases of your direct competitors. The walled garden just raised the stakes. I can't believe people see a future in it.
> This pile of tasks is how I understand what Vaughn Tan refers to as Meaningmaking: the uniquely human ability to make subjective decisions about the relative value of things.
Why is that a "uniquely human ability"? Machine learning systems are good at scoring things against some criterion. That's mostly how they work.
How are the criteria chosen, though?
Something I learned from working alongside data scientists and financial analysts doing algo trading is that you can almost always find great fits for your criteria; nobody ever worries about that. It's coming up with the criteria that everyone frets over, and even more than that, you need to beat other people at doing so - just being good or even great isn't enough. Your profit is the delta between where you are compared to all the other sharks in your pool. So LLMs are useless there; getting token-predicted answers is just going to get you the same as everyone else, which means zero alpha.
So - I dunno about uniquely human? But there's definitely something here where, short of AGI, there's always going to need to be someone sitting down and actually beating the market (whatever that metaphor means for your industry or use case).
Finance is sort of a unique beast in that the field is inherently negative-sum. The profits you take home are always going to be profits somebody else isn't getting.
If you're doing like, real work, solving problems in your domain actually adds value, and so the profits you get are from the value you provide.
If you're algo trading then yes, which is what the person you're replying to is talking about.
But "finance" is very broad and covers very real and valuable work like making loans and insurance - be careful not to be too broad in your condemnation.
You're right, I spoke too broadly there.
This is an overly simplistic view of algo trading. It ignores things like market services, the very real value of liquidity, and so on.
Also ignores capital gains - and small market moves are the very mechanism by which capital formation happens.
I think this is challenging because there’s a lot of tacit knowledge involved, and feedback loops are long and measurement of success ambiguous.
It’s a very rubbery, human oriented activity.
I’m sure this will be solved, but it won’t be solved by noodling with prompts and automation tools - the humans will have to organise themselves to externalise expert knowledge and develop an objective framework for making ‘subjective decisions about the relative value of things’.
This section heading from the post captures the key insight, is more focused, and is less hyperbolic:
> Redesigning for Decision Velocity
The article rightly points out that people don't enjoy just being reviewers: we like to take an active role in playing, learning, and creating. They point out the need to find a solution to this, but then never follow up on that idea.
This is perhaps the most fundamental problem. In the past, tools took care of the laborious and tedious work so we could focus on creativity. Now we are letting AI do the creative work and asking humans to become managers and code reviewers. Maybe that's great for some people, but it's not what most problem solvers want to be doing. The people who know how to judge such things are the same people who have years of experience doing these things. Without that experience you can't have good judgement.
Let the AI make it faster and easier for me to create; don't make it replace what I do best and leave me as a manager and code reviewer.
The parallels with grocery checkouts are worth considering. Humans are great at recognizing things, handling unexpected situations, and being friendly and personable. People working checkouts are experts at these things.
Now replace that with self serve checkouts. Random customers are forced to do this all themselves. They are not experts at this. The checkouts are less efficient because they have to accommodate these non-experts. People have to pack their own bags. And they do all of this while punching buttons on a soulless machine instead of getting some social interaction in.
But worse off is the employee who manages these checkouts. Now instead of being social, they are security guards and tech support. They are constantly having to troubleshoot computer issues and teach disinterested and frustrated beginners how to do something that should be so simple. The employee spends most of their time as a manager and watchdog, looking at a screen that shows the status of all the checkouts, looking for issues, like a prison security guard. This work is passive and unengaging, yet requires constant attention - something humans aren't good at. What little interaction with others they do get is in situations where those people are upset.
We didn't automate anything here, we just changed who does what. We made customers into the people doing the checkouts, and we made store-level staff into managers of them, plus tech support.
This is what companies are trying to do with AI. They want to have fewer employees whose job it is to manage the AIs, directing them to produce. The human is left assigning tasks and checking the results - managers of thankless and soulless machines. The credit for the creation goes to the machines while the employees are seen as low skilled and replaceable.
And we end up back at the start: trying to find high-skilled people to perform low-skilled work based on experience that they would only have had if they had been doing high-skilled work to begin with. When everyone is just managing an AI, no one will know what it is supposed to do.
This is the same problem as outsourcing to third party programmers in another country, but worse.
It really, really is at present. It’s outsourcing but without the benefit of someone getting a paycheck: all exploitation.
If the AIs learned from us, they'll only be able to produce Coca-Cola and ads, so the entirety of the actually valuable economy is safe.
This really isn’t true in principle. The current LLM ecosystems can’t do “meaning tasks” but there are all kinds of “legacy” AI expert systems that do exactly what is required.
My experience is that middle manager gatekeepers are the most reluctant to participate in building knowledge systems that obsolete them though.
> Ultimately, I don’t see AI completely replacing knowledge workers any time soon.
How was that conclusion reached? And what is meant by knowledge workers? Any work with knowledge is exactly the domain of LLMs. So, LLMs are indeed knowledge workers.
> He argues this type of value judgement is something AI fundamentally cannot do, as it can only pattern match against existing decisions, not create new frameworks for assigning worth.
Counterpoint: That decision has to be made only once (probably by some expert). AI can incorporate that training data into its reasoning and voila, it becomes available to everyone. A software framework is already a collection of good decisions, practices and tastes made by experts.
> An MIT study found materials scientists experienced a 44% drop in job satisfaction when AI automated 57% of their “idea-generation” tasks
Counterpoint: Now consider making material science decisions which require materials to have not just 3 properties but 10 or 15.
> Redesigning for Decision Velocity
Suggestion: I think this section implies we must ask our experts to externalize all their tastes, preferences, and top-down thinking so that juniors can internalize those. So experts will be teaching details (based on their internal model) to LLMs while teaching the model itself to humans.