The capabilities of GPT-4o that simulate real conversations

At its spring update event, OpenAI unveiled GPT-4o, a new AI model that can reason in real time across audio, video and text.

GPT-4o marks a significant advance in natural human-machine interaction. The model improves real-time dialogue, video analysis, translation and much more, confirming both expectations and some fears about the future of AI technology.

The capabilities of GPT-4o and their impact on everyday life include:

  1. Memory: Can learn from previous conversations with users.
  2. Real-time translation: Supports instant translation between 50 different languages.
  3. Math tutoring: Explains mathematical problems in an understandable way and solves them.
  4. Voice capabilities: Voice communication creates the feeling of talking to a real person; the model recognizes different tones of voice.
  5. Multimedia analysis: Analyzes images and text and establishes connections between textual and visual data.

These capabilities demonstrate the broad applicability of GPT-4o in interacting with users and performing various tasks. The model is improved through continuous learning from its experiences.
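As an illustration of the translation capability listed above, here is a minimal sketch of how a developer might request a translation from GPT-4o through a chat-completion request. The helper function and prompt wording are our own, not from OpenAI's documentation; only the payload structure follows the standard chat API format:

```python
# Minimal sketch (our own helper): building a chat-completion request
# that asks GPT-4o to translate the user's message.

def build_translation_request(text: str, target_language: str) -> dict:
    """Return a JSON payload for a GPT-4o translation request."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system",
             "content": f"Translate the user's message into {target_language}."},
            {"role": "user", "content": text},
        ],
    }

payload = build_translation_request("Guten Morgen", "English")
```

The payload could then be sent with the official SDK via `client.chat.completions.create(**payload)`; GPT-4o's low latency is what makes this usable for near-real-time translation.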
GPT-4o was presented yesterday by OpenAI in a live stream on YouTube, in which these capabilities were demonstrated.

GPT-4o will be free of charge for all ChatGPT users
but OpenAI has not yet given an exact date for when this will happen. CEO Sam Altman said only that "the new voice mode will be available to Plus users in the coming weeks". Further details can be found in the "Availability of the models" section at the end of this article.

What innovations are behind the capabilities of GPT-4o? Let's take a look at the technical details from OpenAI...

GPT-4o is at the intelligence level of GPT-4, but much faster.

GPT-4o accepts any combination of text, audio and video as input and can generate any combination of text, audio and video as output. It responds to voice input in as little as 232 milliseconds, close to human response times in conversation, creating a near-human dialogue experience.

Before GPT-4o, ChatGPT's voice mode had average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). GPT-4 could not directly perceive tone of voice, multiple speakers or background noise, nor could it produce laughter, singing or emotional expression, because it used a pipeline of three separate models: one to convert audio to text, one to generate the text response, and a third to convert the text back to audio. This process lost information at each step. GPT-4o instead uses a single end-to-end model for text, images and audio, meaning that all inputs and outputs are processed by the same neural network. "As GPT-4o is our first model that combines all of these modalities, we are only at the beginning of exploring its capabilities and limitations," says OpenAI.
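The difference between the old pipeline and the new end-to-end approach can be sketched schematically. This is illustrative code of our own, not OpenAI's actual implementation; the function names are placeholders:

```python
# Illustrative sketch of the pre-GPT-4o voice pipeline described above.
# Each stage is a separate model, so non-textual signal (tone, laughter,
# background noise, multiple speakers) is lost at the transcription step.

def legacy_voice_pipeline(audio, transcribe, chat, synthesize):
    text_in = transcribe(audio)    # audio -> text: prosody discarded here
    text_out = chat(text_in)       # text -> text: model never "hears" the user
    return synthesize(text_out)    # text -> audio: expressiveness limited

# GPT-4o collapses the three stages into one end-to-end network,
# so the model works on the raw audio signal directly:
def end_to_end_model(audio, model):
    return model(audio)
```

The design point is that no intermediate text representation exists in the end-to-end case, so nothing is forced through a lossy text bottleneck.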

The new model matches the performance of GPT-4 Turbo on English text and code, and offers a clear improvement on text in other languages. It is better at understanding images and audio, much faster, and 50 percent cheaper in the API.

Model ratings

Measured against conventional benchmarks, GPT-4o reaches the level of GPT-4 Turbo in text, reasoning and coding intelligence. It sets new standards in multilingual, audio and vision capabilities, taking AI technology to a new level.

Security and limitations of the model

OpenAI has integrated extensive safety measures: "GPT-4o has built-in safety mechanisms through various methods such as filtering training data and refining the model's behavior after training. We have developed new systems to provide guardrails on voice outputs. GPT-4o has been assessed against our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion and model autonomy show that GPT-4o does not score above medium risk in any of these categories. The evaluation included automated and human assessments throughout the training process. We tested both pre- and post-safety-mitigation versions of the model to better understand the model's capabilities. GPT-4o also underwent extensive external red teaming with over 70 experts in fields such as social psychology, bias, fairness and misinformation to identify risks introduced or amplified by the newly added modalities. These findings were used to improve the safety measures."
The statement continues:

"We have recognized that GPT-4o's audio modalities present a variety of new risks. Today we are releasing text and image inputs and text outputs. In the coming weeks and months, we will work on the technical infrastructure, usability via post-training, and safety required to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will comply with our existing safety policies. Further details covering the full range of GPT-4o's modalities will be shared in the forthcoming system card. In our testing, we have identified some limitations across all of the model's modalities, some of which are shown in the video accompanying OpenAI's announcement. We welcome feedback to help us identify tasks where GPT-4 Turbo still outperforms GPT-4o, so that we can continue to improve the model."

Availability of the models

"GPT-4o is our latest step in pushing the boundaries of deep learning and making it practical. We have been working hard over the last two years on efficiency improvements at every layer of the stack. As a first fruit of this research, we can make a GPT-4-level model much more widely available. GPT-4o's capabilities will be rolled out iteratively (with extended red team access starting today)."
"GPT-4o's text and image capabilities are starting to roll out in ChatGPT today. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. In the coming weeks, a new version of Voice Mode with GPT-4o in alpha will be introduced in ChatGPT Plus. Developers can also now access GPT-4o in the API as a text and image model. GPT-4o is 2x faster than GPT-4 Turbo, half the price, and has 5x higher rate limits. We plan to roll out GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks."
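For developers, the text-and-image access mentioned above goes through the standard chat completions endpoint. Below is a minimal sketch of how a combined text-and-image request is structured; the image URL and helper function are placeholders of our own, while the content-part message format follows the documented multimodal API structure:

```python
# Minimal sketch of a multimodal GPT-4o request: one user turn that
# combines a text question with an image, using the documented
# content-part message format. The URL is a placeholder.

def build_image_question(question: str, image_url: str) -> list:
    """Build a message list with text and an image in a single user turn."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

messages = build_image_question("What is shown in this picture?",
                                "https://example.com/photo.jpg")

# With the official SDK (pip install openai) and an API key set, the
# request would then be sent like this:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(model="gpt-4o", messages=messages)
#   print(response.choices[0].message.content)
```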

