It's Elon Musk's AI's turn to integrate human-like vision, hearing, and speech capabilities

Grok, the AI chatbot developed by xAI – the company founded by Elon Musk – is in the process of integrating multimedia processing capabilities, allowing users to interact through both images and text.

Grok, the artificial intelligence (AI) chatbot from Elon Musk's company xAI, is expected to be upgraded soon with the ability to process multimedia input. The upgrade was revealed in developer documentation published by xAI.

In March 2024, Grok made significant strides with Grok-1.5, a version featuring notably improved reasoning abilities. Previously, in a blog post last month, xAI hinted that Grok-1.5V would offer “multimodal models in certain specific domains.”

The recent update to the developer documentation suggests that xAI is preparing to launch a new AI model, one that would let users upload images to Grok and receive text-based responses. Specifically, the documentation guides developers on how to use xAI’s software development kit (SDK) to generate responses based on both text and images. A sample Python script illustrates how to read image files, set up text prompts, and use the xAI SDK to generate responses.
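xAI's sample script is not reproduced here. As a rough illustration of the first two steps it is said to cover (reading an image file and pairing it with a text prompt), the sketch below uses only the Python standard library. The function name, the dictionary fields, and the model name are all hypothetical and are not taken from xAI's SDK. The step that would actually send the request through the SDK is deliberately left out, since its interface is not described in this article.

```python
import base64
from pathlib import Path


def build_image_prompt_request(image_path, prompt,
                               model="hypothetical-multimodal-model"):
    """Bundle an image file and a text prompt into one request payload.

    Hypothetical structure for illustration only; the field names and the
    default model name are assumptions, not xAI's actual request format.
    """
    # Read the raw bytes of the image from disk.
    image_bytes = Path(image_path).read_bytes()

    # Base64-encode the bytes so they can travel inside a text payload.
    encoded_image = base64.b64encode(image_bytes).decode("ascii")

    return {
        "model": model,                 # assumed field name
        "prompt": prompt,               # the text half of the input
        "image_base64": encoded_image,  # the image half of the input
    }
```

A caller would pass a path such as `photo.jpg` and a prompt such as `"Describe this image."`, then hand the resulting payload to whichever client library actually talks to the model.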

Launched in November 2023 and exclusively available to X Premium Plus subscribers, Grok is regarded as a “rookie” in the AI field compared to heavyweights like OpenAI’s ChatGPT. A unique aspect of Grok is its ability to access real-time information, including posts on the X platform. According to xAI, the Grok model was trained on “multiple sources of publicly available text data on the Internet up to Q3 2023 and datasets curated and selected by human reviewers.”

xAI’s blog post also confirmed that Grok-1 was not trained on X data (including public X posts). xAI acknowledged, however, that benchmarks for large language models are often criticized: a model can score well on a benchmark simply because the benchmark’s questions were included in its training data. This is akin to memorizing the answers to a test rather than truly understanding the material.

Nonetheless, according to a blog post by xAI, Grok-1.5 is gradually closing the gap with GPT-4 across a range of benchmarks, from elementary-level problems to high school competition questions. Multimodal chatbots are seen as the next frontier in the AI race. Industry giants have been announcing advances of their own: Google presented new developments at its Google I/O event, and OpenAI has unveiled GPT-4o. Until now, the lack of multimodal capabilities has left Grok lagging behind. With its ongoing upgrades, can Grok make a surprise impact in this challenging race?