Time to look beyond language models, argues the Stanford professor and “godmother of AI”

From TIME — Nov 20, 2024

By Fei-Fei Li, co-director of Stanford HAI and CEO of World Labs

Language is full of visual aphorisms. Seeing is believing. A picture is worth a thousand words. Out of sight, out of mind. The list goes on. This is because we humans draw so much meaning from our vision. But seeing was not always possible. Until about 540m years ago, all organisms lived below the surface of the water and none of them could see. Only with the emergence of trilobites could animals, for the first time, perceive the abundance of sunlight around them. What ensued was remarkable. Over the next 10m-15m years, the ability to see ushered in a period known as the Cambrian explosion, in which the ancestors of most modern animals appeared.

Today we are experiencing a modern-day Cambrian explosion in artificial intelligence (AI). It seems as though a new, mind-boggling tool becomes available every week. Initially, the generative-AI revolution was driven by large language models like ChatGPT, which imitate humans’ verbal intelligence. But I believe an intelligence based on vision—what I call spatial intelligence—is more fundamental. Language is important, but much of our human ability to understand and interact with the world is based on what we see.

A subfield of AI known as computer vision has long sought to teach computers spatial intelligence as good as, or better than, that of humans. The field has progressed rapidly over the past 15 years. And, guided by the core belief that AI needs to advance with human benefit at its centre, I have dedicated my career to it.

No one teaches a child how to see. Children make sense of the world through experiences and examples. Their eyes are like biological cameras, taking a “picture” five times a second. By the age of three, kids have seen hundreds of millions of such pictures.
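
A quick back-of-the-envelope check bears this out. Assuming roughly twelve waking hours a day (an assumption of mine, not a figure from the article), the arithmetic looks like this:

```python
# Rough check: how many "pictures" does a child's visual system take by age
# three, at five per second? Assumes ~12 waking hours a day (my assumption).
frames_per_second = 5
waking_hours_per_day = 12
pictures_per_year = frames_per_second * waking_hours_per_day * 3600 * 365
pictures_by_age_three = 3 * pictures_per_year
print(f"{pictures_by_age_three:,}")  # 236,520,000 -- hundreds of millions
```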

We need to move from large language models to large world models

We know from decades of research that a fundamental element of vision is object recognition, so we began by teaching computers this ability. It was not easy. There are infinite ways to render the three-dimensional (3D) shape of a cat, say, into a two-dimensional (2D) image, depending on viewing angle, posture, background and more. For a computer to identify a cat in a picture it needs to have a lot of information, like a child does.
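
To make that point concrete, here is a minimal sketch, entirely my own illustration rather than anything from the article, of how a single set of 3D points produces different 2D images depending on the camera's viewing angle, using a simple pinhole-style projection:

```python
import numpy as np

# The same 3D shape projects to very different 2D images depending on the
# viewing angle; a recogniser has to cope with all of them.
points_3d = np.array([[0.0, 0.0, 0.0],
                      [0.4, 0.0, 0.0],
                      [0.4, 0.2, 0.0],
                      [0.0, 0.2, 0.3]])  # arbitrary illustrative points (metres)

def project(points, yaw_degrees, distance=2.0, focal_length=1.0):
    """Rotate the points about the vertical axis, then apply a pinhole projection."""
    t = np.radians(yaw_degrees)
    rotation = np.array([[ np.cos(t), 0.0, np.sin(t)],
                         [ 0.0,       1.0, 0.0      ],
                         [-np.sin(t), 0.0, np.cos(t)]])
    rotated = points @ rotation.T
    depth = rotated[:, 2] + distance            # push the shape in front of the camera
    return focal_length * rotated[:, :2] / depth[:, None]  # perspective divide

for angle in (0, 45, 90):
    print(f"view at {angle} degrees:\n{project(points_3d, angle).round(3)}")
```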

This was not possible until three elements converged in the mid-2000s. At that point algorithms known as convolutional neural networks, which had existed for decades, met the power of modern-day graphics processing units (GPUs) and the availability of “big data”—billions of images from the internet, digital cameras and so forth.

My lab contributed the “big data” element to this convergence. In 2007, in a project called ImageNet, we created a database of 15m labelled images across 22,000 object categories. Then we and other researchers trained neural-network models using images and their corresponding textual labels, so that the models learned to describe a previously unseen photo using a simple sentence. Unexpectedly rapid progress in these image-recognition systems, created using the ImageNet database, helped spark the modern AI boom.
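
For readers curious what an ImageNet-style recogniser looks like in practice, here is a minimal sketch using a model pretrained on ImageNet from the torchvision library; the library, model and file name are my illustrative choices, not details from the article:

```python
import torch
from PIL import Image
from torchvision import models

# Load a network pretrained on ImageNet, along with its matching preprocessing.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

# "cat.jpg" is a placeholder path; substitute any photo.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    probabilities = model(image).softmax(dim=1)[0]

# Print the five most likely object categories for the photo.
top5 = probabilities.topk(5)
for prob, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][int(idx)]}: {prob.item():.1%}")
```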

As technology progressed, a new generation of models, based on techniques such as transformer architectures and diffusion, brought with them the dawn of generative AI tools. In the realm of language, this made possible chatbots like ChatGPT. When it comes to vision, modern systems do not merely recognise but can also generate images and videos in response to text prompts. The results are impressive, but still only in 2D.
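
As an illustration of text-prompted image generation, here is a minimal sketch using the open-source diffusers library; the library and the particular diffusion checkpoint are my illustrative choices:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available text-to-image diffusion model.
# The checkpoint name is an illustrative choice, not one named in the article.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # use .to("cpu") if no GPU is available (much slower)

# Generate a single 2D image from a text prompt.
image = pipe("a tabby cat sitting on a sunlit windowsill").images[0]
image.save("generated_cat.png")
```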

For computers to have the spatial intelligence of humans, they need to be able to model the world, reason about things and places, and interact in both time and 3D space. In short, we need to go from large language models to large world models.

We’re already seeing glimpses of this in labs across academia and industry. With the latest AI models, trained using text, images, video and spatial data from robotic sensors and actuators, we can control robots using text prompts—asking them to unplug a phone charger or make a simple sandwich, for example. Or, given a 2D image, the model can transform it into an infinite number of plausible 3D spaces for a user to explore.
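
There is no standard interface for such world models yet. Purely as a thought experiment, the sketch below imagines the shape one might take, mapping a language instruction plus spatial observations to actions; every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical sketch only: these types are invented to illustrate what a
# "large world model" interface might look like. No existing library is implied.

@dataclass
class Observation:
    rgb_image: bytes                # camera frame
    depth_map: bytes                # per-pixel distance estimates
    joint_angles: Sequence[float]   # proprioception from the robot's actuators

@dataclass
class Action:
    joint_velocities: Sequence[float]

class WorldModel:
    """Maps a language instruction plus spatial observations to a plan of actions."""

    def plan(self, instruction: str, observation: Observation) -> list[Action]:
        # A real system would reason over 3D space and time here; this stub
        # exists only to show the inputs and outputs involved.
        raise NotImplementedError

# Conceptual usage: model.plan("unplug the phone charger", current_observation)
```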

The applications are endless. Imagine robots that can navigate ordinary homes and look after old people; a tireless set of extra hands for a surgeon; or the uses in simulation, training and education. This is truly human-centred AI, and spatial intelligence is its next frontier. What took hundreds of millions of years to evolve in humans is taking just decades to emerge in computers. And we humans will be the beneficiaries.


https://www.youtube.com/watch?v=vIXfYFB7aBI