AI summit: Frontiers in Generative AI

At the “AI, Science and Society” conference organized by Institut Polytechnique de Paris on February 6 and 7, 2025 at École polytechnique, the “Frontiers in Generative AI” symposium explored advances in generative AI, of which ChatGPT is the emblematic example. In particular, Vicky Kalogeiton presented the research she conducts at the Computer Science Laboratory of École polytechnique (LIX*).

17 Feb. 2025

Research, AI and Data Science, LIX

Generative AI (GenAI) is the natural successor to machine learning. In machine learning, a computer predicts how a system will behave based on data it has already analyzed. It does this by learning to recognize patterns in existing data. The process does away with the need to manually sift through huge amounts of information, which saves a great deal of time.

As well as recognizing and predicting patterns, GenAI can also create new ones. For example, it can invent a story from a few lines of input or transform simple text prompts into ultrarealistic images and even extended video clips.

ChatGPT was not the beginning of the story

The best-known GenAI is ChatGPT (where GPT stands for generative pre-trained transformer). The idea behind this example of a Large Language Model (LLM) actually dates back to the 1950s, when the US mathematician Claude Shannon applied information theory – the branch of mathematics that deals with quantifying, storing and transmitting information – to human language. We now routinely use such statistical language modelling methods for a wide range of processing tasks, from spell-checking to translation.

Since ChatGPT was launched at the end of 2022, other GenAI systems have appeared, such as Gemini, Perplexity and Mistral. All these models are trained on a range of sources, including books, articles, Wikipedia entries and chat logs (which raises copyright concerns). For language applications, they work by assigning a probability to each possible next word in a sentence. Strictly speaking, they are therefore not real “intelligence” and can be more accurately described as “statistically informed guessers”.
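The idea of assigning a probability to each possible next word can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or any of the models named above is implemented: real LLMs use large neural networks and far more context. Here the "model" only counts which word follows which in a made-up sentence, and the function names and the sample corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_model(corpus)

# Probability of each word that followed "the" in the corpus.
probs = next_word_probabilities(model, "the")

# The "statistically informed guess" is simply the most probable next word.
best_guess = max(probs, key=probs.get)
```

In this toy corpus, "the" is followed by "cat" twice and by "mat" and "sofa" once each, so "cat" is the most probable continuation. An LLM does the same kind of guessing at a vastly larger scale.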

The “Frontiers in Generative AI” symposium, chaired by Karteek Alahari, Research Director at Inria, highlighted the cutting-edge mathematical research that is driving today’s GenAI. The technology has now reached a level that allows researchers to explore novel scenarios rather than simply process existing data. This could have implications for many areas of science and medicine, including the discovery of new materials for industrial and technological applications and the development of advanced physics and mathematics models. One well-known such model is “AlphaFold”, which predicts molecular structures, notably those of proteins, to accelerate drug discovery and help design new proteins. Such predictions previously relied on techniques like molecular dynamics, which are expensive and time-consuming and require supercomputers.

Inspired by filmmakers

One of the key speakers at the symposium, Vicky Kalogeiton, who is a professor at École Polytechnique, works on advanced dynamic text-to-image and speech-to-image GenAI models. The goal of her group’s research is to create videos from written news reports, stories, other text-based content and the spoken word, in a way that successfully conveys the emotions contained in the text or spoken narrative. To achieve this, the GenAI needs to go beyond simply generating images or sequences (as ordinary video processing systems such as those behind YouTube, Netflix and Amazon do). The processes it uses are inspired by how human filmmakers carefully edit scenes, select camera angles and adjust lighting to evoke specific emotions.

The task is challenging since most AI models today are “autoregressive”, meaning they rely on the previous still frame to build the next one, and so on. They therefore cannot “tell” a story. Being able to connect separate stills and even fill in missing frames is a paradigm shift in the field.

Vicky Kalogeiton’s work involves understanding interactions between the characters in movies, for example, by leveraging multimodal data. For humans, such data might be taken in as sights, smells and sounds, but for AI it means text, images, audio files or videos.

Such advanced multimodal techniques could extend beyond entertainment and storytelling and be applied to specialized domains such as medical imaging.

*LIX: a joint research unit (CNRS, École Polytechnique, Institut Polytechnique de Paris), 91120 Palaiseau, France
