OpenAI yesterday shipped GPT-4, its highly anticipated text-generating AI model, and it’s a curious piece of work.
GPT-4 improves on its predecessor, GPT-3, in significant ways, for example by making more factually correct statements and allowing developers to more easily prescribe its style and behavior. It’s also multimodal in the sense that it can understand images, allowing it to caption and even explain the contents of a photo in detail.
But GPT-4 has serious shortcomings. Like GPT-3, the model “hallucinates” facts and makes basic reasoning errors. In one example on OpenAI’s own blog, GPT-4 describes Elvis Presley as “the son of an actor.” (Neither of his parents was an actor.)
To get a better handle on GPT-4’s development cycle and its capabilities as well as its limitations, TechCrunch spoke with Greg Brockman, one of OpenAI’s co-founders and its president, via a video call on Tuesday.
Brockman had one word when asked to compare GPT-4 to GPT-3: Different.
“It’s just different,” he told TechCrunch. “There are still a lot of problems and mistakes [the model] makes … but you can actually see a jump in skill in things like calculus or law, where it went from being really bad in some domains to actually being quite good relative to humans.”
The test results support his case. On the AP Calculus BC exam, GPT-4 scores a 4 out of 5 while GPT-3 scores a 1. (GPT-3.5, the intermediate model between GPT-3 and GPT-4, also scores a 4.) And on a simulated bar exam, GPT-4 passed with a score around the top 10% of test takers; GPT-3.5’s score hovered around the bottom 10%.
Shifting gears

One of the more intriguing aspects of GPT-4 is the aforementioned multimodality. Unlike GPT-3 and GPT-3.5, which could only accept text prompts (e.g., “Write an essay about giraffes”), GPT-4 can take a prompt of both images and text to perform some action (e.g., an image of giraffes in the Serengeti with the prompt “How many giraffes are shown here?”).
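OpenAI hadn’t published a general-purpose image API at the time of our conversation, so any code is necessarily speculative. As an illustration only, a multimodal request of the kind described above might be assembled along these lines (the field names and payload shape here are assumptions in the style of a chat-completion request, not OpenAI’s documented format):

```python
# Hypothetical sketch: pairing an image reference with a text prompt in a
# single user message. The payload shape is assumed, not documented by OpenAI.

def build_multimodal_prompt(question: str, image_url: str) -> dict:
    """Combine a text question and an image in one user message."""
    return {
        "model": "gpt-4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_multimodal_prompt(
    "How many giraffes are shown here?",
    "https://example.com/serengeti-giraffes.jpg",  # placeholder URL
)
print(request["messages"][0]["content"][0]["text"])
```

No network call is made here; the sketch only shows how a text prompt and an image could travel together in one request.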
That’s because GPT-4 was trained on both image and text data, while its predecessors were trained only on text. OpenAI says the training data came from “a variety of licensed, created, and publicly available data sources, which may include publicly available personal information,” but Brockman demurred when I asked for specifics. (Training data has gotten OpenAI into legal trouble before.)
GPT-4’s image understanding capability is quite impressive. Given the prompt “What’s funny about this image? Describe it panel by panel” plus a three-panel image showing a fake VGA cable being plugged into an iPhone, GPT-4 gives a breakdown of each panel and correctly explains the joke (“The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port”).
At this time only one launch partner has access to GPT-4’s image analysis capabilities – an assistive app for the blind called Be My Eyes. Brockman says the wider rollout, whenever it happens, will be “slow and deliberate” as OpenAI evaluates the risks and benefits.
“There are policy issues like facial recognition and how to treat images of people that we need to address and work through,” Brockman said. “We need to figure out, like, where the danger zones are — where the red lines are — and then clarify over time.”
OpenAI tackled similar ethical dilemmas around its text-to-image system, DALL-E 2. After initially disabling the capability, OpenAI allowed customers to upload people’s faces for editing using an AI-powered image-generating system. At the time, OpenAI claimed that upgrades to its security system made the face-editing feature possible by “reducing the potential for harm” from deepfakes, as well as attempts to create sexual, political and violent content.
Another perennial challenge is keeping GPT-4 from being used in unintended ways that could cause harm, whether psychological, monetary or otherwise. Hours after the model was released, Israeli cybersecurity startup Adversa AI published a blog post demonstrating methods to bypass OpenAI’s content filters and get GPT-4 to generate phishing emails, offensive descriptions of gay people and other highly objectionable text.
This isn’t a new phenomenon in the language model domain. Meta’s BlenderBot and OpenAI’s ChatGPT have also been prompted to say wildly offensive things, and even reveal sensitive details about their inner workings. But many, this reporter included, had expected GPT-4 to deliver significant improvements on the moderation front.
When asked about GPT-4’s robustness, Brockman stressed that the model went through six months of safety training and that, in internal tests, it was 82% less likely to respond to requests for content disallowed by OpenAI’s usage policy and 40% more likely to produce “factual” responses than GPT-3.5.
“We spent a lot of time trying to understand what GPT-4 is capable of,” Brockman said. “Bringing this out into the world is how we learn. We’re constantly updating, incorporating a bunch of improvements, so the model is more scalable to any personality or mode you want it to be in.”
Early real-world results aren’t that promising, frankly. Beyond the Adversa AI tests, Bing Chat, Microsoft’s chatbot powered by GPT-4, has been shown to be susceptible to jailbreaking. Using carefully crafted inputs, users have been able to get the bot to profess love, threaten harm, defend the Holocaust and invent conspiracy theories.
Brockman didn’t deny that GPT-4 falls short here. But he touted the model’s new mitigation tools, including an API-level capability called “system” messages. System messages are essentially instructions that set the tone, and establish boundaries, for GPT-4’s interactions. For example, a system message might read: “You are a teacher who always gives feedback in the Socratic style. You never give the student answers, but always try to ask the right questions to help them learn to think for themselves.”
The idea is that the system messages act as guardrails to prevent GPT-4 from straying off course.
“Really finding the tone, style and substance of GPT-4 has been a big focus for us,” Brockman said. “I think we’re starting to understand a little bit more about how to do engineering, how to have a repeatable process that gets you to predictable results that are going to be really useful to people.”
Brockman pointed to Evals, OpenAI’s new open-source software framework for evaluating the performance of its AI models, as a sign of OpenAI’s commitment to “strengthening” its models. Evals lets users develop and run benchmarks for evaluating models such as GPT-4 while observing their performance – a kind of crowdsourced approach to model testing.
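The real Evals framework defines its own registry and sample formats, which aren’t detailed here. As a toy illustration of the underlying idea (running a model over a fixed set of graded prompts and reporting a score), not the actual Evals API:

```python
# Toy sketch of a crowdsourced-style benchmark: score a model against
# samples of the form {"input": ..., "ideal": ...}. Not the real Evals API.

from typing import Callable

def run_eval(model: Callable[[str], str], samples: list) -> float:
    """Return the fraction of samples the model answers exactly right."""
    correct = sum(1 for s in samples if model(s["input"]).strip() == s["ideal"])
    return correct / len(samples)

# A stand-in "model" so the sketch runs without any API access.
def toy_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

samples = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
]
print(run_eval(toy_model, samples))  # 0.5
```

The crowdsourced part is that anyone can contribute new sample sets, so regressions in a fresh model version show up as a drop in scores users actually care about.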
“With Evals, we can see the [use cases] that users care about in a systematic form that we’re able to test against,” Brockman said. “Part of why we [open-sourced it] is because we’re moving away from releasing a new model every three months, whatever it was before, to making constant improvements. You don’t make what you don’t measure, right? As we make new versions [of the model], we can at least be aware of what those changes are.”
I asked Brockman whether OpenAI would ever compensate people for testing its models with Evals. He didn’t commit to it, but did note that — for a limited time — OpenAI is giving select Evals users early access to the GPT-4 API.
My conversation with Brockman also touched on GPT-4’s context window, which refers to the amount of text the model can consider before generating additional text. OpenAI is testing a version of GPT-4 that can “remember” about 50 pages of content, or five times as much “memory” as vanilla GPT-4 and eight times as much as GPT-3.
Brockman believes that the expanded context window leads to new, previously unexplored applications, especially in the enterprise. He envisions an AI chatbot built for a company that leverages context and knowledge from a variety of sources, including employees across departments, to answer questions in a very informed yet conversational way.
This is not a new concept. But Brockman makes the case that GPT-4’s answers will be far more useful than today’s chatbots and search engines.
“Before, the model didn’t have any knowledge of who you are, what you’re interested in and so on,” Brockman said. “Having that kind of history [with the larger context window] is definitely going to make it more capable … It’ll turbocharge what people can do.”