Lecture 5. Reinforcement Learning.

Reinforcement Learning

It allows models to learn by taking actions and observing their outcomes, instead of learning from pure data.

We know different types of learning:

  • supervised learning → x is data, y is label,
  • unsupervised learning → x is data, no labels,
  • reinforcement learning → state-action pairs, with the goal of maximizing future rewards.

Now we focus on reinforcement learning. The key scenarios where it is used are games and robotics.

The key concepts of reinforcement learning

There is an Agent, which performs actions in some Environment. The key concepts are:

  • Agent → the learner that decides which action to take.
  • Environment → the world the agent acts in.
  • Action → a move the agent can make in the environment.
  • State → an observation of the environment; we observe constantly to pick up state changes.
  • Reward → feedback on how good the outcome of an action was in the current state.

We pass the current state and the possible actions to the agent, which decides either to do nothing or to take an action that impacts the state; based on that, the reward is calculated.
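
To make the loop concrete, here is a minimal sketch of the agent-environment interaction. The toy environment (an agent walking on a line) and the random placeholder policy are hypothetical illustrations, not from the lecture.

```python
import random

class Environment:
    """Hypothetical toy environment: the agent walks on a line and is
    rewarded for reaching position +3 within 10 steps."""
    def reset(self):
        self.position, self.steps = 0, 0
        return self.position                 # initial state

    def step(self, action):
        # action: -1 (move left), 0 (stay), +1 (move right)
        self.position += action
        self.steps += 1
        done = self.position == 3 or self.steps >= 10
        reward = 1.0 if self.position == 3 else 0.0  # feedback on the outcome
        return self.position, reward, done   # new state, reward, termination

class Agent:
    def act(self, state):
        # placeholder: act randomly; a trained agent would use the state
        return random.choice([-1, 0, 1])

env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)               # agent picks an action
    state, reward, done = env.step(action)  # environment reacts with state + reward
```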

Two algorithms

Value learning

DQN → Deep Q-Network

We try to maximize the expected total future reward (the return). The network takes the current state and computes a value Q(s, a) for each possible action, estimating the total future reward of taking that action; the agent then picks the action with the highest value. So if we can move left, move right, or stay, the 3 possible actions are each scored and only the best one is picked. The downsides of this method are that it only works with a finite set of actions, and that there is no flexibility: the algorithm will always pick the highest value, so it never explores.
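
As a concrete sketch, the value side could look like the small PyTorch network below; the state size, hidden size, and the three actions are assumptions for illustration, not from the lecture.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 3  # hypothetical sizes: 4-d state; left/stay/right

# The Q-network maps a state to one value Q(s, a) per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)

state = torch.randn(1, STATE_DIM)       # a dummy observed state
q_values = q_net(state)                 # one score per action, e.g. [[0.1, -0.4, 0.3]]
action = torch.argmax(q_values, dim=1)  # always pick the highest-valued action
```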

Policy gradient

Here we use probabilities to learn the optimal policy directly. Instead of values, we calculate a probability for each action. In the previous example with 3 options, we now get 3 probabilities instead of one maximum value, and the agent samples an action from that distribution. This gives more flexibility: because the action is sampled rather than always the maximum, the algorithm can explore new ways of achieving the goal, and the approach also extends to continuous action spaces.
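
A matching sketch of the policy side, under the same assumed sizes: the network now ends in a softmax, and the action is sampled from the resulting distribution rather than taken as the maximum.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 3  # same hypothetical sizes as above

# The policy network maps a state to a probability distribution over actions.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
    nn.Softmax(dim=1),          # probabilities that sum to 1
)

state = torch.randn(1, STATE_DIM)
probs = policy_net(state)       # e.g. [[0.2, 0.5, 0.3]] for left/stay/right
dist = torch.distributions.Categorical(probs)
action = dist.sample()          # sampling (not argmax) lets the agent explore
```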

Training

  1. Initialize the agent.
  2. Run the policy until termination.
  3. Record all states, actions, and rewards.
  4. Adjust the action probabilities (see the sketch after this list):
    1. Decrease the probability of actions that resulted in low rewards (those that led the episode toward failure).
    2. Increase the probability of actions that resulted in high rewards (those that led the episode toward success).
  5. Repeat the process to refine the policy.
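
Below is a minimal REINFORCE-style sketch of these five steps, reusing a compact version of the toy line-walking environment from above; the network sizes, learning rate, and episode count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 1, 3  # state = position; actions = left/stay/right

class LineEnv:
    """Hypothetical toy environment, as in the earlier sketch."""
    def reset(self):
        self.position, self.steps = 0, 0
        return self.position

    def step(self, action):  # action in {-1, 0, +1}
        self.position += action
        self.steps += 1
        done = self.position == 3 or self.steps >= 10
        return self.position, (1.0 if self.position == 3 else 0.0), done

policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 32), nn.ReLU(),
    nn.Linear(32, NUM_ACTIONS), nn.Softmax(dim=1),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
env = LineEnv()

for episode in range(200):                  # 5. repeat to refine the policy
    state, done = env.reset(), False        # 1. initialize
    log_probs, rewards = [], []
    while not done:                         # 2. run the policy until termination
        s = torch.tensor([[float(state)]])
        dist = torch.distributions.Categorical(policy_net(s))
        a = dist.sample()
        state, reward, done = env.step(a.item() - 1)  # map {0,1,2} -> {-1,0,+1}
        log_probs.append(dist.log_prob(a))  # 3. record actions ...
        rewards.append(reward)              #    ... and rewards
    # 4. weight each action's log-probability by the reward that followed it:
    #    gradient descent then raises the probability of high-reward actions
    #    and lowers the probability of low-reward ones.
    returns = torch.tensor([sum(rewards[t:]) for t in range(len(rewards))])
    loss = -(torch.cat(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```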

There is also a special training method where we let some algorithm or ML model play against our agent so it can learn from the matches. That way the agent does not start from scratch; it starts out already with some knowledge.

Applications

The most popular application is games, for example AlphaZero, which learned to play several different board games. Another is self-driving cars, since such an algorithm can keep learning while driving and act in new environments.

Lecture: https://youtu.be/8JVRbHAVCws?si=rEv-9jqj9Tq9gX6U