Planning in Imagination
Piotr Januszewski
Gdańsk, Pomeranian Voivodeship
I try to make AI think more like humans do. Humans use mental world models to plan decisions out in their minds. The idea is to take a powerful search algorithm such as TD-search or MCTS and couple it with a learned environment model. The algorithm should then be able to pick the best decision among imagined plans in an arbitrary environment it has learned to model (e.g. an Atari game). This way of learning, called model-based reinforcement learning, is more sample-efficient than current state-of-the-art model-free methods such as DQN or PPO. This work is part of my Master's thesis at Gdańsk University of Technology. Thanks to Grzegorz Beringer and Mateusz Jabłoński for their collaboration!
Project status: Under Development
Intel Technologies
Intel Opt ML/DL Framework,
Intel Python,
MKL,
Movidius NCS,
Intel CPU
Overview / Usage
The aim of this work was to build on previous work on model learning in complex, high-dimensional decision-making problems and apply it to planning in complex tasks. Those methods have been shown to train accurate models, at least over short horizons, and should open a path for applying planning algorithms to problems without access to an accurate simulator, e.g. playing Atari 2600 games from pixels, a platform used to evaluate general competency in artificial intelligence. The goal was to improve data efficiency without a loss in performance compared to state-of-the-art model-free methods. This work focused on three benchmarks: Boxing, an arcade game with dense rewards; Freeway, a challenging environment with sparse rewards; and Sokoban, a complex puzzle game.
Building on state-of-the-art model learning and model-based RL techniques, three architectures are presented: “Original World Models” (OWM), “World Models and AlphaZero” (W+A) and “Discrete PlaNet” (DPN). Despite many difficulties, DPN finally reached performance equal to or higher than strong model-free and model-based baselines in a low-data regime of up to 1M interactions with the real Boxing environment.
The most challenging part of this work, underestimated at first by the author, was model learning. Current state-of-the-art model learning methods, although they report promising results, have not been tested for planning with search-based algorithms, let alone for interleaved planning and learning.
None of the architectures was able to learn to play Sokoban. The problem certainly lies in the model learning techniques. Sokoban dynamics, although based on simple rules of moving a character and pushing boxes, allow for an enormous number of possible states and level configurations, and the models were not able to generalize to that many possibilities.
This work is my master's thesis defended with honours.
Methodology / Approach
This work is based on or takes inspiration from:
- AlphaZero for planning and decision making (our implementation: https://github.com/piojanu/AlphaZero); a minimal search sketch in this spirit is included after this list.
The AlphaGo Lee version of this algorithm became the first AI to beat a human grandmaster of Go, in March 2016. It is a huge achievement: for a long time people thought that Go was the holy grail of artificial intelligence and that AI was still many years away from beating humans at this game! If you ask a chess player why they made some move, they will give you a clear strategy they follow. But if you ask a Go player why they took some action, they will often respond that it just felt right. The ancient game of Go is so complex that it requires strong intuition to master! To fully appreciate AlphaZero's achievement, you need to know that Go has more states than there are atoms in the known universe, and its branching factor is more than seven times bigger than that of chess.
In October 2017, DeepMind published a new version, AlphaGo Zero, which defeated AlphaGo 100–0. Incredibly, it did so by learning solely through self-play: a database of human expert games was no longer required to build a super-human AI. 48 days later they published the current state-of-the-art version, AlphaZero, which can also play other games such as chess and shogi. It beats the world-champion programs Stockfish (chess) and Elmo (shogi) after learning through self-play for only about 4 and 2 hours respectively!
- World Models for learning a compact representation and a latent dynamics model (our implementation: https://github.com/piojanu/World-Models).
We read in the paper: "Humans develop a mental model of the world based on what they are able to perceive with their limited senses. The decisions and actions we make are based on this internal model. Jay Wright Forrester, the father of system dynamics, described a mental model as: 'The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.'"
- PlaNet for latent model learning and planning (my fork: https://github.com/piojanu/planet); a latent-space planning sketch in its spirit follows this list.
Key contributions of the paper:
- Planning in latent spaces: "We solve a variety of tasks from the DeepMind control suite, shown in Figure 1, by learning a dynamics model and efficiently planning in its latent space. Our agent substantially outperforms the model-free A3C and in some cases D4PG algorithm in final performance, with on average 50× less environment interaction and similar computation time." ~ excerpt from the paper
- Recurrent state space model: "We design a latent dynamics model with both deterministic and stochastic components (Buesing et al., 2018; Chung et al., 2015). Our experiments indicate having both components to be crucial for high planning performance." ~ excerpt from the paper
- Latent overshooting: "We generalize the standard variational bound to include multi-step predictions. Using only terms in latent space results in a fast and effective regularizer that improves long-term predictions and is compatible with any latent sequence model." ~ excerpt from the paper
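To make the "planning in imagination" idea concrete, here is a minimal sketch (not the thesis code) of a PUCT-style tree search, the selection rule at the heart of AlphaZero, run inside a learned model instead of a real simulator. The `model.step` and `net` callables are hypothetical placeholders for a learned one-step dynamics model and a policy/value network.

```python
# Minimal PUCT-style tree search over a *learned* model -- an illustrative sketch.
# `model.step(state, action) -> (next_state, reward)` and
# `net(state) -> (priors, value)` are hypothetical stand-ins.
import math


class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) from the policy head
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # action -> Node

    @property
    def value(self):            # Q(s, a)
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node, c_puct=1.5):
    """Pick the action maximizing Q + U (the PUCT rule used by AlphaZero)."""
    total = sum(child.visits for child in node.children.values())

    def puct(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.value + u

    return max(node.children.items(), key=puct)


def mcts(root_state, model, net, n_actions, simulations=50, gamma=0.99):
    """Run `simulations` tree searches purely in imagination and return an action."""
    priors, _ = net(root_state)
    root = Node(prior=1.0)
    root.children = {a: Node(priors[a]) for a in range(n_actions)}

    for _ in range(simulations):
        node, state, path = root, root_state, []
        # Selection: walk down the tree with PUCT until a leaf is reached.
        while node.children:
            action, node = select_child(node)
            state, reward = model.step(state, action)   # imagined transition
            path.append((node, reward))
        # Expansion + evaluation with the policy/value network.
        priors, value = net(state)
        node.children = {a: Node(priors[a]) for a in range(n_actions)}
        # Backup: propagate the discounted value estimate up the visited path.
        for visited, reward in reversed(path):
            value = reward + gamma * value
            visited.visits += 1
            visited.value_sum += value

    # Act greedily w.r.t. visit counts, as AlphaZero does at evaluation time.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```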
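And here is the latent-space planning sketch mentioned next to the PlaNet item: a cross-entropy-method planner that evaluates candidate action sequences entirely in imagination. `encode` and `latent_step` are hypothetical stand-ins for a learned encoder and a latent dynamics/reward model; the categorical action distribution reflects the discrete-action (Atari-like) setting of this project rather than PlaNet's original continuous-control tasks.

```python
# Latent-space planning sketch in the spirit of PlaNet's CEM planner.
# `encode(obs) -> latent` and `latent_step(latent, action) -> (latent, reward)`
# are hypothetical placeholders for the learned model, NOT the paper's code.
import numpy as np


def plan_in_latent_space(obs, encode, latent_step, n_actions,
                         horizon=12, candidates=500, iterations=5, top_k=50):
    """Cross-entropy method over imagined action sequences (discrete actions)."""
    rng = np.random.default_rng(0)
    # Start from a uniform distribution over actions at every planning step.
    probs = np.full((horizon, n_actions), 1.0 / n_actions)

    for _ in range(iterations):
        # Sample candidate action sequences from the current distribution.
        seqs = np.stack([
            [rng.choice(n_actions, p=probs[t]) for t in range(horizon)]
            for _ in range(candidates)
        ])
        # Evaluate each candidate purely in imagination (latent rollouts).
        returns = np.zeros(candidates)
        for i, seq in enumerate(seqs):
            latent = encode(obs)
            for t, action in enumerate(seq):
                latent, reward = latent_step(latent, action)
                returns[i] += (0.99 ** t) * reward
        # Refit the distribution to the elite (highest-return) sequences.
        elite = seqs[np.argsort(returns)[-top_k:]]
        for t in range(horizon):
            counts = np.bincount(elite[:, t], minlength=n_actions) + 1e-3
            probs[t] = counts / counts.sum()

    # Execute only the first action; re-plan after seeing the next real frame (MPC-style).
    return int(np.argmax(probs[0]))
```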
Technologies Used
Hardware:
- This work has been partially supported by the Statutory Funds of the Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, which provided access to a DGX Station deep learning server.
Languages:
- Python 3
Frameworks:
- PyTorch,
- Keras (+ TensorFlow),
- TensorFlow Probability,
- HumbleRL (https://github.com/piojanu/humblerl), our own framework for RL research.
Libraries (used by the frameworks above):
- Intel MKL and MKL-DNN,
- CUDA and cuDNN.
Algorithms:
- Deep Neural Networks (trained on Nvidia GPUs, inference on Intel hardware):
- Variational Autoencoders,
- Mixture Density Networks as the head of an LSTM (a short sketch follows this list).
- AlphaZero (Monte-Carlo Tree Search on deep learning steroids),
- CMA Evolution Strategy (parallel implementation on Intel processors).
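As an illustration of the "Mixture Density Network as the head of an LSTM" item above, here is a minimal sketch in the spirit of the World Models MDN-RNN: the recurrent network predicts a Gaussian mixture over the next VAE latent given the current latent and action. It is written in PyTorch for brevity; all sizes and the usage snippet are made up for the example, and it is not the thesis implementation.

```python
# A minimal mixture-density head on top of an LSTM (World Models MDN-RNN spirit).
# Sizes below are illustrative only.
import torch
import torch.nn as nn


class MDNRNN(nn.Module):
    def __init__(self, latent_size=32, action_size=6, hidden_size=256, n_mixtures=5):
        super().__init__()
        self.latent_size = latent_size
        self.n_mixtures = n_mixtures
        self.lstm = nn.LSTM(latent_size + action_size, hidden_size, batch_first=True)
        # For every latent dimension: mixture weights, means and log-stddevs.
        self.head = nn.Linear(hidden_size, n_mixtures * 3 * latent_size)

    def forward(self, latents, actions, hidden=None):
        out, hidden = self.lstm(torch.cat([latents, actions], dim=-1), hidden)
        logits, mu, log_sigma = self.head(out).chunk(3, dim=-1)
        shape = out.shape[:-1] + (self.latent_size, self.n_mixtures)
        return logits.view(shape), mu.view(shape), log_sigma.view(shape), hidden

    def loss(self, logits, mu, log_sigma, target):
        """Negative log-likelihood of the next latent under the mixture."""
        target = target.unsqueeze(-1)                      # broadcast over mixtures
        normal = torch.distributions.Normal(mu, log_sigma.exp())
        log_prob = normal.log_prob(target) + torch.log_softmax(logits, dim=-1)
        return -torch.logsumexp(log_prob, dim=-1).mean()


# Usage sketch: predict the next VAE latent from the current one and an action.
model = MDNRNN()
z = torch.randn(8, 10, 32)          # batch of 8 rollouts, 10 steps, 32-dim latents
a = torch.zeros(8, 10, 6)           # one-hot actions (all "no-op" here)
logits, mu, log_sigma, _ = model(z, a)
print(model.loss(logits, mu, log_sigma, target=torch.randn(8, 10, 32)))
```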
Other links
- World Models implementation (https://github.com/piojanu/World-Models)
- PlaNet fork adapted to this project's needs (https://github.com/piojanu/planet)
- AlphaZero implementation (https://github.com/piojanu/AlphaZero)
- HumbleRL, a straightforward reinforcement learning Python framework (https://github.com/piojanu/humblerl)