OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

chekk · Sep 12, 2024

I recently watched the Star Trek: TNG, S01E15 episode again where the Binars create a very lifelike woman in the holodeck for Riker.
Riker: "I mean, how real are you?"
Minuet: "As real as you need me to be."

Previously, those and other lines in this scene never bothered me. This time, they gave me chills. Using an AI chatbot is partially a self-reflection and therefore can feel pretty weird.

No solid conclusions or judgment here, just something that really struck me.

chekk · Sep 12, 2024

bigcheese said:
I think one of the biggest problems with anthropomorphizing LLMs is the fact that we judge them by things that are trivial to humans, but make them seem dumb (strawberry rs that have been toted to death). Clearly the value of a new tool is not in the stuff it cannot do, but in the things it can do. Instead of focusing on performing human tasks that require human intelligence, focus should be placed on the things that an LLM is already doing 10x better than humans. They don't offer human intelligence, nor might they ever, although they offer an orthogonal kind of intelligence that can complement ours.

That sort of comparison is something humans do all the time. Sometimes, it's apples vs oranges, but sometimes it's "if it gets this simple thing wrong, how am I supposed to trust anything else it says?"
Which isn't unreasonable.

Yes, I wouldn't expect a spell checker to do math, but the generalized nature of an LLM makes it very difficult for people to discriminate.

chekk · Sep 12, 2024

bigcheese said:
And that would be the core of the issue; it’s very hard for humans not to correlate language with general intelligence. In many ways an image generator is as impressive as an LLM from a technical perspective, yet no one would claim that there is intelligence hiding behind the output. It’s something about language that makes us blind, and unfortunately that might make us take the long road to finding what this tech is useful for.

The comparison with an image generator is insightful. Thanks for that.
There is indeed something different about language. I think that's why using an AI chatbot feels weird to me; it's like a talented BS artist with good diction and grammar, but doesn't actually know anything. Every time I encounter that in a chat session, it's rather disorienting.

chekk · Sep 12, 2024

Ozy said:
So, the way you distinguish that from your average internet poster is because the grammar is better?

I post here, not on the average internet.
(Pivot and sidestep ...)

chekk · Sep 12, 2024

The Lurker Beneath said:
Machines can't think... same as bears can't wrestle.

Who wants to go in the ring?

A machine told me that bears like false equivalencies.

chekk · Sep 12, 2024

Aurich said:
lol, in the "believe it or not" part, I definitely choose "not".

For people who don't click on tweets:

View attachment 90262

At the risk of over thinking this, I asked GPT-4o for a random word: Serendipity.
Lol.

chekk · Sep 12, 2024

Surtrus said:
There's the very best kind of evidence: people choosing to use it instead of doing the work themselves. Surely if, for example, it made their lives harder rather than easier, they would not, so very very consistently, continue making this choice.

Are they using it because it's quick and convenient or because the quality of the end result is better? We don't have that data.
Easier lines up with convenience more often than quality.

Now, doing things for convenience is perfectly fine, but then one needs to be clear about which dimension is being valued.

chekk · Sep 13, 2024

S4WRXTTCS said:
I'm confused because I specifically EXCLUDED production coding.

I completely agree with what you said which is why I excluded production level coding.

Perhaps you could expand on what you mean by production level coding.
For example, when you code for "non production", do you have a different standard for accuracy?

chekk · Sep 13, 2024

internetomancer said:
I like the framing, but what makes you say that humans have native system 2?

We certainly aren't born with carefully measured reasoning. We don't get there as children either.

Adults overwhelmingly still have flawed reasoning skills.

Furthermore the skills-- reasoning, critical thinking, logic, math-- are often learned. Taught in school or whatever. Taught using patterns, definitions, metaphors, and especially algorithms.

And then the way we actually think (like what literally goes through our heads) feels more like using System 1, applied in an iterative fashion. Tossing out words that feel right, and then repeatedly calculating whether they actually accurate or true or valid. And then trying again... Much like the concept of this LLM, which is also slow, and deliberate, and better at math than 99% of people.

If you haven't read the referenced "Thinking, Fast and Slow" by Daniel Kahneman, I highly recommend it. Your comments/questions are well addressed there.
Really.

chekk · Sep 13, 2024

S4WRXTTCS said:
I don't do production level coding so I don't really even speak in those terms. I thought I could simply exclude that and people would know that I'm only referring to hobby/prototyping/testing/etc.

Lots of people need to leverage coding to play around with some idea where the code doesn't have to match the same standards as production level coding.

As a non-SW developer I feel like different levels of accuracy/reliability are needed depending on what the code is for. Production level code still covers UI code, gaming code, and low level firmware level code which all has different levels of accuracy and reliability needed.

The coding I do is infrequent coding when I want to test out an idea or to trouble some hardware. I don't need any have the same kinds of bullet proofing or security concerns that production level code needs.

I'm certainly going to do sanity checks on the code with test cases, but the code doesn't have to be perfect. It's probably better if it has some glitch to prevent me from being overcomplacent. Haha.

So the code AI produces works generally well for my use case especially when there is a large representation of what I'm doing in the training data.

But, I do feel like there is a lot more they could do that I felt was low hanging fruit. Like it should run the python code itself. Most of the time I simply feed it back the error message, and it fixes itself.

So in terms of actually making money I wish they'd focus on improving the models for things they are good at, and to stop trying to make them general.

Now that being said I have been impressed with o1-preview in answering the dumb questions I sometimes ponder.

Hmm. There's a bit of conceptual barrier here (I am a SW dev), but let's see how we do.

Outside of medical applications which require extra rigor, UI code, gaming code, and low level firmware code are all basically the same level of accuracy and reliability. It all has to work.

The analogy that suddenly occurs to me is weekend "Learn to be a coder!" sessions. These are useful as an intro, but don't make anyone a "coder". Similarly, if hobby/prototyping/testing/etc equates to "I just need to slap a few lines of code together to process this CSV file (or whatever), then AI code generation works reasonably well.

Anything much beyond that ventures very close to "production" quality code that has to handle edge cases, corner cases and all the other fun stuff. To put it a bit differently, there's a whole lot of non-production code that still requires production features in order to be useful. In that context, close doesn't count, which I think is what the other poster was driving at.

chekk · Sep 13, 2024

S4WRXTTCS said:
Firmware has the highest level of rigor and testing because its either safety critical or there is a potential to brick the device if there are mistakes in it.

Infotainment systems in cars, and UI/Apps will have less rigor because generally you can update them later if bugs are found. My Rivian UI is so buggy that I think every owner gets their own specific bug. The one assigned to me is there is no switch to turn on/off ATMOS. It's not where its supposed to be. This is despite the fact that VW is pouring a bunch of money into Rivian to get access to their SW. I don't even want to experience VW software if they think Rivian is better.

Games are way to complicated for perfect. Starfield for example is a buggy unoptimized mess, but I still find it enjoyable despite that.

There are also two ways that AI can help.

The one I was advocating for is to remove barriers for non-coders to test game ideas, app ideas, etc. Where I felt like close was all they really needed.

The other that I didn't advocate because its way outside my area of expertise is helping with validation and verification of code. To stop companies from shipping buggy code.

I assure you that firmware does not get the highest level of rigor and testing. I can see why you would think that, but it's not true. Also, game developers have large testing/QA departments, but yet lots of bugs remain.
So why is that? The discrepancy in your experience is likely due to size of code base and functionality, your level of interaction with the software and how close that is to your professional expertise, experience and intuition. The closer it is, the more likely you are to notice bugs.

Having been doing software dev for 20 years, there is buggy code everywhere. It's more obvious in the case of games and your Rivian, but it's there even in more critical stuff. Intel's recent blunder with code requesting the wrong voltage levels is far from unique: reading errata on firmware and other foundational stuff can be pretty appalling. That's why I said it's basically the same level of accuracy and reliability. However, it mostly works in each domain and that seems to be good enough for the majority of organizations.

What's interesting and frustrating is that the problem isn't in the code, it's in the developers and organizations. It took me years to see this. AI isn't going to fix that.

chekk · Sep 13, 2024

mdrejhon said:
Occasionally, I do use an I to ask questions about refactoring and other tasks.

<snip for length>

I'm glad it worked for you.
Refactoring is often undertaken for code modernization (i.e. keeping current), but is also a great opportunity to evaluate current requirements. I've removed lots of code during refactoring because it turned out it mapped to a dead business requirement.

Search

Search

OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

chekk

Ars Centurion

More options

nproxy.org