- Intelligent Projects Using Python
- Santanu Pattanayak
Q-learning
We will now look at a popular reinforcement learning algorithm called Q-learning. Q-learning is used to determine an optimal action-selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, S; an action space, A; a set of immediate rewards, R; a probability of the next state, S(t+1), given the current state, S(t), and the current action, a(t), written as P(S(t+1)/S(t), a(t)); and a discount factor, γ. The following diagram illustrates a Markov decision process, where the next state depends on the current state and the action taken in the current state:
![](https://epubservercos.yuewen.com/759B01/19470382808830006/epubprivate/OEBPS/Images/83330458-c0b9-4660-a1e3-6cd4927244a6.png?sign=1739630423-RUBa1noLWswfeLVTm8wObP5tDT3DGjxc-0-9452316cda5466cb0b048875ad345318)
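To make the definition concrete, the following is a minimal sketch of a toy finite Markov decision process in Python; the two states, two actions, transition probabilities, rewards, and the discount factor are hypothetical values chosen purely for illustration:

```python
import numpy as np

# A toy finite MDP: 2 states and 2 actions (hypothetical values, for illustration only).
states = [0, 1]
actions = [0, 1]
gamma = 0.9  # discount factor

# Transition probabilities: P[s][a] is the distribution over next states.
P = {
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.7, 0.3], 1: [0.2, 0.8]},
}

# Immediate rewards: R[s][a] is the reward for taking action a in state s.
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 2.0, 1: 0.0},
}

def step(s, a, rng=np.random.default_rng()):
    """Sample the next state from P(s(t+1)/s(t), a(t)) and return it with the reward."""
    s_next = rng.choice(states, p=P[s][a])
    return int(s_next), R[s][a]
```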
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
$$s_1, a_1, r_1,\ s_2, a_2, r_2,\ \ldots,\ s_T, a_T, r_T$$
If we consider the long-term reward, R_t, at step t, it is equal to the sum of the immediate rewards at each step, from t until the final step, T, as follows:
$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_T$$
Now, a Markov decision process is a stochastic process, and it is not possible to get the same next state, S(t+1), from S(t) and a(t) every time; so, we apply a discount factor, γ, to future rewards. This means that the long-term reward can be better represented as follows:
$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T \quad (1)$$
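As a quick sanity check of the discounted return in (1), here is a short sketch that computes R_t from a list of immediate rewards; the reward values and the discount factor are made up for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma*r_(t+1) + gamma^2*r_(t+2) + ... over rewards[t:]."""
    R = 0.0
    # Accumulate backwards so that R = r_k + gamma * R at every step.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Hypothetical rewards from step t until the end of the episode.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```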
Since, at time step t, the immediate reward has already been realized, to maximize the long-term reward we need to maximize the long-term reward from time step t+1 onward (that is, R_{t+1}) by choosing an optimal action. The maximum long-term reward expected at a state S(t) by taking an action a(t) is represented by the following Q-function:
$$Q(s_t, a_t) = \max \, \mathbb{E}\big[R_t \mid s_t, a_t\big]$$
At each state, s ∈ S, the agent in Q-learning tries to take an action, a ∈ A, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:
$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\big)$$
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, Q(s(t), a(t)), of taking action a(t) in state s(t) depends on the immediate reward, r(t), and the maximum long-term reward that we can hope for in the new state, s(t+1). In a Markov decision process, the new state, s(t+1), is stochastically dependent on the current state, s(t), and the action taken, a(t), through a probability density/mass function of the form P(S(t+1)/S(t), a(t)).
The algorithm keeps updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward, based on the value of the learning rate, α.
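A minimal tabular sketch of this update rule, reusing the toy MDP from the earlier snippet; the learning rate, the exploration rate, and the number of iterations are assumed values, not taken from the book:

```python
n_states, n_actions = 2, 2           # sizes of the hypothetical toy MDP above
alpha = 0.1                          # learning rate (assumed value)
Q = np.zeros((n_states, n_actions))  # tabular Q-function, initialized to zero

def q_update(s, a, r, s_next):
    """One Q-learning step: weighted average of the old estimate and the new target."""
    target = r + gamma * np.max(Q[s_next])            # r(t) + gamma * max_a Q(s(t+1), a)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Simple training loop with epsilon-greedy exploration over the toy MDP.
rng = np.random.default_rng(0)
s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next, r = step(s, a, rng)      # 'step' is the toy transition sampler defined earlier
    q_update(s, a, r, s_next)
    s = s_next
```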
Once we have built the Q(s, a) function through this iterative algorithm, while playing the game in a given state, s, we can take the best action, a*, as the policy that maximizes the Q-function:
$$a^{*} = \arg\max_{a} Q(s, a)$$
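In code, extracting this greedy policy from the learned table is just an argmax over the actions available in the given state; a one-line sketch in the same toy setting:

```python
def best_action(s):
    """Greedy policy: return the action with the highest learned Q-value in state s."""
    return int(np.argmax(Q[s]))
```

In practice, this greedy policy is used after (or in between) training; during training, some exploration, such as the epsilon-greedy choice in the loop above, is still needed so that all state-action pairs keep getting updated.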