Reinforcement learning

Reinforcement learning considers the scenario of an agent interacting with a dynamic environment; the data consist of the state-action-reward triples generated by this interaction.

The difference between reinforcement learning and supervised learning is that no optimal action for a given state is provided; instead, the learning algorithm must discover which actions maximize the expected reward over time.

Thus, the algorithm is not told which actions to take in a given situation, and the reward or punishment for its choices often arrives only after a delay.
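The interaction described above can be sketched as a loop that records state-action-reward triples. The environment below is a hypothetical toy (guessing coin flips) invented purely for illustration; the `reset`/`step` interface is one common convention, not the only possible one.

```python
import random

class CoinFlipEnv:
    """Toy stand-in for a dynamic environment: the agent guesses a coin
    flip and receives reward 1 for a correct guess, 0 otherwise."""
    def reset(self):
        self.steps = 0
        return "start"

    def step(self, action):
        coin = random.choice(["heads", "tails"])
        reward = 1 if action == coin else 0
        self.steps += 1
        done = self.steps >= 5          # episode lasts five guesses
        return coin, reward, done

def random_policy(state):
    """A policy maps the current state to an action; here it just guesses."""
    return random.choice(["heads", "tails"])

def collect_triples(env, policy):
    """Run one episode and record the (state, action, reward) triples
    that constitute the data in reinforcement learning."""
    triples = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        triples.append((state, action, reward))
        state = next_state
    return triples

triples = collect_triples(CoinFlipEnv(), random_policy)
```

Note that, unlike supervised learning, no "correct" action is ever shown to the policy; it only observes the reward that follows each choice.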

Example of Reinforcement Learning Problem: Playing Chess

Each board configuration, namely the position of the chess pieces on the board, is a state; the actions are the legal moves available in that configuration.

In this example, the reward for the algorithm is winning the game and the punishment is losing it. This reward or punishment is delayed, which is typical of reinforcement learning: feedback arrives only at the end of the game, not after each individual move.
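One standard way to handle such a delayed reward is to propagate it backwards through the episode as a discounted return, so that every earlier move receives some credit for the final outcome. The following is a minimal sketch of that idea (the discount factor `gamma` is an assumed illustrative choice, not something fixed by the chess example):

```python
def returns_from_delayed_reward(rewards, gamma=0.9):
    """Compute discounted returns G_t for an episode in which the only
    nonzero reward arrives at the end (e.g. +1 for a win, -1 for a loss).
    Working backwards: G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns

# Episode of four moves: no feedback until the final, winning move.
returns = returns_from_delayed_reward([0, 0, 0, 1])
```

Each move's return is the final reward discounted by how far it lies from the end of the game, so earlier moves receive smaller (but nonzero) credit for the win.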

Reward and Punishment

In order to maximize reward or minimize punishment, a learning algorithm must choose actions that have been tried in the past and found to be effective in producing reward. In other words, the algorithm must exploit its current knowledge.

But, on the other hand, to discover such actions in the first place, the algorithm has to choose actions it has not tried before, and thus explore the state space.

Since no optimal action is given for a state, one of the biggest challenges for a reinforcement learning algorithm is finding the right trade-off between exploration and exploitation.
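A simple and widely used way to manage this trade-off is the epsilon-greedy rule: with a small probability epsilon the algorithm explores by picking a random action, and otherwise it exploits the action with the highest estimated value. The sketch below applies it to a hypothetical two-armed bandit; the payoff probabilities and epsilon are assumed values chosen for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore (choose a random action);
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
payoff = [0.3, 0.7]        # assumed true success probabilities of the two arms
q = [0.0, 0.0]             # estimated value of each arm
counts = [0, 0]            # how often each arm was pulled

for _ in range(2000):
    a = epsilon_greedy(q, epsilon=0.1)
    r = 1 if random.random() < payoff[a] else 0
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]   # incremental running-average update
```

Because exploration occasionally samples the neglected arm, the value estimates converge toward the true payoffs, and exploitation then concentrates on the better arm.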