OpenAI vs Perplexity vs Anthropic vs Gemini: Here come the AI agents!
Until now, we have asked AI models questions in the browser or inside an app, or let them generate text, images, and code – but now they are breaking out and preparing to take over our smartphones and computers. 2025 will be the year of AI agents, and the billion-dollar AI startups OpenAI, Perplexity, and Anthropic are wasting no time unleashing their AI agents on the world.
OpenAI: “Operator” can control computers and operate websites
OpenAI has released a first test version of its long-awaited AI agent, “Operator”. Operator performs tasks on the web using the Computer-Using Agent (CUA) model. What’s special is that CUA interacts with graphical user interfaces the way a human does – it uses the mouse and keyboard to click buttons, navigate menus, and fill in text fields. That makes it very flexible, as it doesn’t require any special APIs.
In practice, Operator can perform various web-based tasks, such as researching information, creating shopping lists, compiling playlists, or finding venues. It works in an iterative process: it analyzes screenshots of the screen, plans the next steps, and then performs actions such as clicking, scrolling, or typing. For sensitive actions such as logins or orders, it asks for confirmation for security reasons.
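The iterative perceive–plan–act loop described above can be sketched in a few lines. This is a hypothetical illustration only, not OpenAI’s actual implementation: `ScriptedModel`, `FakeBrowser`, and `run_agent` are invented stand-ins for the vision model, the browser controller, and the agent driver.

```python
# Hypothetical sketch of a screenshot-driven agent loop (not OpenAI's code).
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click", "scroll", "type", "confirm", "done"
    argument: str = ""

class ScriptedModel:
    """Stand-in model: returns one pre-planned action per screenshot."""
    def __init__(self, plan):
        self.plan = list(plan)
    def next_action(self, screenshot: bytes) -> Action:
        return self.plan.pop(0) if self.plan else Action("done")

class FakeBrowser:
    """Stand-in browser that records the actions it is asked to perform."""
    def __init__(self):
        self.log = []
    def screenshot(self) -> bytes:
        return b"<pixels>"
    def perform(self, action: Action):
        self.log.append((action.kind, action.argument))

SENSITIVE = {"confirm"}  # action kinds that require user sign-off

def run_agent(model, browser, approve=lambda a: True, max_steps=20):
    """Iterate: take a screenshot, ask the model for the next action, act."""
    for _ in range(max_steps):
        action = model.next_action(browser.screenshot())
        if action.kind == "done":
            return "finished"
        if action.kind in SENSITIVE and not approve(action):
            return "aborted"  # user declined a sensitive step
        browser.perform(action)
    return "step limit reached"

model = ScriptedModel([Action("click", "#search"),
                       Action("type", "best pizza"),
                       Action("confirm", "order")])
browser = FakeBrowser()
status = run_agent(model, browser)
```

The `approve` callback mirrors the confirmation step the article mentions: sensitive actions are gated on the user before the browser executes them.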
Operator’s performance is already impressive – according to OpenAI, it achieves success rates of 58–87% on web-based tasks. It works particularly well on repetitive UI interactions and when a task comes with detailed instructions. It still struggles with unfamiliar user interfaces and precise text editing.
Operator is currently available as a Research Preview for Pro users in the US ($200/month). Great emphasis has been placed on security: there are restrictions on sensitive tasks such as banking transactions, a block list for certain websites, and various verification mechanisms. In addition, the user must confirm important actions, and active monitoring is required for sensitive websites.
Perplexity launches assistant for Android
The AI startup Perplexity, previously best known for its answer engine on the web, has developed its technology into a mobile assistant for Android. The new assistant can answer general questions and perform various tasks, such as writing emails, setting reminders, or making restaurant reservations. A special feature is its multimodality – the assistant can both analyze on-screen content and use the smartphone camera to perceive its surroundings.
Once installed on the Android smartphone and given the appropriate access, the assistant can, for example, play podcasts, organize Uber rides, and even identify specific products such as Pokémon trading cards. Writing and sending text messages via the contact list also works without any problems.
However, there are still limitations: the assistant currently works only with selected apps such as Spotify, YouTube, and Uber, as well as with email, messaging, and clock apps. Applications such as Slack or Reddit are not yet supported. While the service is already available for Android users, an iOS version is still missing – it will follow as soon as Apple grants the necessary permissions. Overall, this is Perplexity’s attempt to move from the web into the mobile world and compete with Apple’s Siri and Google’s Gemini. It will be interesting to see how deeply Perplexity can integrate into the iPhone – Apple, which is introducing its own AI features in iOS, will still have a say in that.
Anthropic already has “Computer Use” in beta
A few months ago, Anthropic, one of OpenAI’s big rivals, introduced “Computer Use.” Anthropic’s top-of-the-line Claude 3.5 Sonnet can perform basic computer interactions – it can move the mouse cursor, click, and enter text using a virtual keyboard. The system does this by analyzing screenshots of the screen and executing actions based on them, similar to how a human would operate a computer.
The practical skills are currently limited and prone to errors. Claude can use simple software such as calculators and text editors, but it is slower than humans and makes more mistakes. More complex actions such as dragging, zooming, or quick reactions to short-lived screen changes are not yet possible.
A major technical breakthrough was the ability to count pixels accurately in order to perform precise mouse movements. The system can also detect its own errors and retry actions when something doesn’t work. Remarkably, after training on only a few simple programs, Claude was able to transfer these skills to other software.
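The self-correction behavior mentioned above – detect that a step failed, then repeat it – can be illustrated with a generic retry loop. This is purely illustrative and not Anthropic’s code; `flaky_click` and `act_with_retries` are invented names standing in for “execute one UI action and check the resulting screenshot for the expected change”.

```python
# Illustrative retry loop for a self-correcting agent step (not Anthropic's code).

def flaky_click(state={"tries": 0}):
    """Stand-in UI action that fails twice before succeeding.
    Returns True once the screen changed as expected."""
    state["tries"] += 1
    return state["tries"] >= 3

def act_with_retries(step, max_retries=5):
    """Execute a step, verify its effect, and repeat on failure."""
    for attempt in range(1, max_retries + 1):
        if step():
            return attempt  # number of attempts that were needed
    raise RuntimeError("step kept failing; escalate to the user")

attempts = act_with_retries(flaky_click)
```

The escalation on exhausted retries mirrors the broader pattern in these agents: when self-correction fails, control is handed back to the human.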
The technology is currently in a public beta phase. While Claude performs significantly better than other AI models in computer-use benchmarks, with a success rate of 14.9%, it is still far behind human performance of 70–75%. The developers are working to improve its capabilities while implementing safeguards against misuse.
Google’s Gemini can control apps
As reported, Samsung has given Google’s AI Gemini a central role in its new flagship Galaxy S25 series. Gemini can now perform multiple tasks across different apps with a single command. In concrete terms, users can search for a restaurant and forward the information directly to a friend, or have their favorite team’s match dates added to the calendar – all with just one instruction.
Multi-app support works with a range of Google apps as well as selected third-party apps such as WhatsApp and Spotify. For Samsung Galaxy S25 users, additional Samsung apps such as calendar, notes, reminders, and clock are also available. Before executing cross-app actions, Gemini asks for confirmation, just to be on the safe side.
The voice mode “Gemini Live” is also getting an update, initially available only for Galaxy S25/S24 and Pixel 9 smartphones. Users of these devices can share images, files, and YouTube videos in the chat interface and ask Gemini for feedback or further information. In the coming months, additional functions such as screen sharing and live video streaming will follow for Android devices. This will make Gemini an even more versatile assistant that can carry out complex, cross-app tasks via natural voice commands.