Let the robots learn what they want / Pierre-Yves Oudeyer
Children, especially young infants, often decide for themselves what they are interested in. No one forces them to learn how to interact with their environment, nor are they constantly instructed on how to do things. They are not passive learners but actors in their own development. Most machine learning experiments so far are quite different: a closed set of example data is prepared, and the experimenter chooses the algorithm and carefully sets its parameters. For me, the questions are then: Is it possible to build a machine that keeps developing new skills over a long period of time, much as children do? Can a machine learn its own ecology of complex know-how, without a programmer pre-specifying it?
Work in developmental psychology suggests that natural intelligence resides in the dynamical interactions between the brain, the body and the environment. The structural properties of these three poles are such that some kinds of interactions are spontaneous and easy (we call these “affordant”), for example fitting the pieces of a puzzle together. Other interactions, like using a pen as a screwdriver, are awkward and difficult. This is true not only for physical relations between the environment and the body, but also for the brain: like any other kind of learning machinery, the brain has biases that make some activities easier to learn than others. A vast range of activities is available to children, but they are particularly good at choosing or inventing those that suit their bodies and the state of their cognitive structures at a particular point in time. So children do not merely explore any new situation they happen to encounter; they apply strong selection criteria. By engaging in these activities, they develop new skills, which in turn open up new areas of potentially affordant interactions. This might explain why they continuously master new know-how so well.

In this landscape, the crucial mechanism seems to be the motivational system that drives a child's active exploration. In other words, we need to understand how curiosity works and implement it in a robot. Over the past few years, my colleague Frédéric Kaplan and I have therefore been developing algorithms for curiosity-driven behaviour that differ radically from traditional approaches in machine learning. Instead of starting with a predefined problem and then exploring the space of brains and bodies that might solve it, we start with a predefined body and learning system, whatever they are, driven by a system of curiosity, and then let the robot choose and progressively build up its own problems and associated skills. So instead of setting our robots a particular task or problem, we let them “learn what they want”.

The artificial curiosity system we developed is composed of two modules: a classical prediction system and a “metapredictor”. The first module learns the perceptual consequences of actions performed in given sensorimotor contexts. This makes it possible to calculate an “error measure” by taking the difference between the predicted consequence of an action and its actual consequence. The second module learns to predict the errors made by the first module. In other words, this second system builds a model of the first and associates a level of difficulty with every sensorimotor situation.
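As a rough illustration, the two modules could be sketched in Python as follows. This is a minimal sketch under assumptions that are not part of the description above: the predictor is a single scalar weight per context trained with a simple delta rule, the metapredictor is an exponential moving average of past errors, and the class names, the `run_demo` helper and the two toy contexts (`"regular"` and `"noisy"`) are hypothetical.

```python
import random


class Predictor:
    """First module: predicts the consequence of an action in a context.

    Assumption: one scalar weight per context, trained online (delta rule).
    """

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}

    def predict(self, context, action):
        return self.weights.get(context, 0.0) * action

    def learn(self, context, action, actual):
        """Update the model and return the error measure |actual - predicted|."""
        signed_error = actual - self.predict(context, action)
        step = self.learning_rate * signed_error * action
        self.weights[context] = self.weights.get(context, 0.0) + step
        return abs(signed_error)


class MetaPredictor:
    """Second module: predicts the error the first module will make.

    Assumption: an exponential moving average of past errors per context.
    """

    def __init__(self, smoothing=0.05):
        self.smoothing = smoothing
        self.expected_error = {}

    def predict_error(self, context):
        return self.expected_error.get(context)  # None if never seen

    def learn(self, context, error):
        previous = self.expected_error.get(context, error)
        self.expected_error[context] = previous + self.smoothing * (error - previous)


def run_demo(steps=1000, seed=0):
    """Hypothetical toy world: 'regular' is learnable, 'noisy' is not."""
    rng = random.Random(seed)
    predictor, meta = Predictor(), MetaPredictor()
    for _ in range(steps):
        context = rng.choice(["regular", "noisy"])
        action = rng.uniform(-1.0, 1.0)
        if context == "regular":
            actual = 2.0 * action  # consequence depends on the action
        else:
            actual = rng.uniform(-1.0, 1.0)  # consequence is pure noise
        meta.learn(context, predictor.learn(context, action, actual))
    return meta


if __name__ == "__main__":
    meta = run_demo()
    print("expected error, regular:", meta.predict_error("regular"))
    print("expected error, noisy:  ", meta.predict_error("noisy"))
```

In this toy setting the metapredictor ends up assigning a low expected error to the learnable context and a clearly higher one to the noisy context, which is the "level of difficulty" idea in miniature.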

We have associated a value system with these predictors that pushes the robot away from situations that are either too familiar or too difficult to predict, so that it searches for situations where learning progress is maximal. To do this, the metapredictor computes the local derivative of the prediction error curve for each situation reachable from a given sensorimotor context. The robot then chooses the action that will lead it to the situation with the lowest derivative, that is, the situation in which its prediction error is decreasing fastest. In this computation, it compares prediction errors obtained in similar situations. To achieve this, the metaprediction system uses an algorithm that progressively splits the space of possible situations into groups of similar situations. For each group, it maintains a history of past prediction errors, which allows it to compute the associated derivative, which in turn defines the “interestingness” of that group of situations.
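The splitting and derivative computation might look roughly like the Python sketch below. Several details are assumptions made only for illustration, since the text does not specify them: situations are reduced to a single number in [0, 1), a group is split at the median situation once it holds `capacity` samples, and the derivative is estimated as the change in mean error between two consecutive windows. The names `Region`, `SituationSpace` and `run_demo` are hypothetical.

```python
from statistics import mean, median


class Region:
    """A group of similar situations with its own history of prediction errors."""

    def __init__(self, low, high, capacity=8, window=3):
        self.low, self.high = low, high
        self.capacity, self.window = capacity, window
        self.samples = []  # (situation, error) pairs, oldest first

    def contains(self, situation):
        return self.low <= situation < self.high

    def add(self, situation, error):
        self.samples.append((situation, error))

    def error_derivative(self):
        """Slope estimate: change in mean error between two consecutive windows."""
        errors = [error for _, error in self.samples]
        if len(errors) < 2 * self.window:
            return 0.0  # not enough history to judge
        older = mean(errors[-2 * self.window:-self.window])
        recent = mean(errors[-self.window:])
        return (recent - older) / self.window

    def split(self):
        """Cut at the median situation (assumed criterion).

        Returns the region itself, unsplit, if one half would be empty.
        """
        cut = median([situation for situation, _ in self.samples])
        left = Region(self.low, cut, self.capacity, self.window)
        right = Region(cut, self.high, self.capacity, self.window)
        for situation, error in self.samples:
            (left if left.contains(situation) else right).add(situation, error)
        return [left, right] if left.samples and right.samples else [self]


class SituationSpace:
    """Keeps the current partition of the situation space into regions."""

    def __init__(self, low=0.0, high=1.0, capacity=8, window=3):
        self.regions = [Region(low, high, capacity, window)]

    def record(self, situation, error):
        for index, region in enumerate(self.regions):
            if region.contains(situation):
                region.add(situation, error)
                if len(region.samples) >= region.capacity:
                    self.regions[index:index + 1] = region.split()
                return
        raise ValueError("situation outside the modelled space")

    def most_interesting(self):
        """Region whose error falls fastest, i.e. the lowest derivative."""
        return min(self.regions, key=lambda region: region.error_derivative())


def run_demo(steps=40):
    """Hypothetical data: errors shrink with practice below 0.5, stay flat above."""
    space = SituationSpace()
    for step in range(steps):
        situation = [0.1, 0.6, 0.3, 0.8][step % 4]
        error = 0.9 ** step if situation < 0.5 else 0.5
        space.record(situation, error)
    return space


if __name__ == "__main__":
    best = run_demo().most_interesting()
    print("most interesting region:", (best.low, best.high), best.error_derivative())
```

A flat error history gives a derivative of zero whether the error is high or low, so both the "too difficult" and the "too familiar" groups lose out to any group where the error is still falling.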

Let me illustrate the behaviour of this artificial curiosity system with an example. Imagine an environment that offers the robot four types of sensorimotor activity, which we call sensorimotor contexts (e.g. running and bumping into walls, shooting a ball, chasing a cat, sleeping). If one forced the robot to concentrate on each of these activities separately, one could measure the corresponding evolution of its prediction errors. In one situation (1), the error is always high and does not decrease, perhaps because this situation is simply too complicated for the robot's learning machinery. In another situation (4), the error is always low and does not change. In the two other cases (2 and 3), the error is high at the beginning and progressively decreases, but at different rates. In practice, the robot is placed in an environment in which these four activities are possible, but it knows nothing about the corresponding theoretical learning curves, and does not even know that there are four different sensorimotor contexts. In this case, one should initially observe a phase of random exploration that allows the robot to discover that there are different situations and to evaluate their relative interestingness in terms of learning progress. We should then observe the behaviour displayed in the lower curves of figure 2. The robot avoids situations 1 and 4 because they do not allow progress in learning. Nevertheless, it revisits them at random from time to time in order to verify that they remain uninteresting. Instead, it concentrates first on situation (3), for which its predictions improve quite fast in the beginning. After a while, situation (3) is mastered and predictable: the robot spontaneously shifts to the exploration of situation (2), which at this stage of its development provides more learning progress.
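The scenario can be mimicked with a small simulation. The sketch below is only an illustration under stated assumptions: the four error curves, the window size, the exploration rate `epsilon` and the function names are all hypothetical, and the initial random phase is simplified to a shuffled schedule that visits each context 20 times. The four contexts are also given to the program in advance, whereas the robot described above has to discover them.

```python
import math
import random
from collections import Counter

# Assumed error curves as a function of practice count n (not measured data).
ERROR_CURVES = {
    1: lambda n: 1.0,                  # always high: too complicated
    2: lambda n: math.exp(-n / 150),   # decreases slowly
    3: lambda n: math.exp(-n / 30),    # decreases quickly
    4: lambda n: 0.05,                 # always low: nothing to learn
}


def learning_progress(errors, window=5):
    """Drop in mean error between the previous window and the latest one."""
    if len(errors) < 2 * window:
        return 0.0
    older = sum(errors[-2 * window:-window]) / window
    recent = sum(errors[-window:]) / window
    return older - recent


def simulate(steps=600, epsilon=0.1, seed=0):
    """Return the context chosen at each step."""
    rng = random.Random(seed)
    contexts = sorted(ERROR_CURVES)
    history = {context: [] for context in contexts}
    warmup = contexts * 20  # initial exploration, in random order
    rng.shuffle(warmup)
    choices = []
    for step in range(steps):
        if step < len(warmup):
            context = warmup[step]
        elif rng.random() < epsilon:
            context = rng.choice(contexts)  # occasional random check
        else:
            context = max(contexts, key=lambda c: learning_progress(history[c]))
        practice = len(history[context])
        history[context].append(ERROR_CURVES[context](practice))
        choices.append(context)
    return choices


def favourite(choices, start, stop):
    """Most frequently chosen context in choices[start:stop]."""
    return Counter(choices[start:stop]).most_common(1)[0][0]


if __name__ == "__main__":
    choices = simulate()
    print("favourite just after exploration:", favourite(choices, 80, 110))
    print("favourite later on:", favourite(choices, 300, 600))
```

With these assumed curves, the simulated robot spends most of its time on context 3 right after the exploration phase and on context 2 later on, while contexts 1 and 4 are only visited during the occasional random checks.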

By setting up environments with objects providing potentially interesting affordances, we have already shown how a complex developmental trajectory can form, with a sequence of stages in which the robot learns skills of increasing complexity that were never specified beforehand by a human programmer. The environments in these initial experiments were constrained so that we could reproduce them many times. We are now extending them, both in duration and with new objects, so that we can see how far the approach scales up.

Pierre-Yves Oudeyer