Policy Improvement

Environment definition

Variable values for the example:

states = ["North", "South", "East", "West"]
actions = ["clock", "anti-clock", "stay"]
gamma = 0.9  
horizon = 10  # denoted H in the slides

Possible states and actions

We consider a circle divided into 4 main directions:
North, East, South, West, with 3 possible actions:

clock (clockwise)
anti-clock (counterclockwise)
stay (stay in place)

Each action causes a deterministic transition to a new state, so $P(s' \mid s, a) = 1$ for that state and 0 for all others:

$P(s, a) \to s'$

$P = \begin{cases}
P(\text{North}, \text{clock}) = \text{East} \\
P(\text{North}, \text{anti-clock}) = \text{West} \\
P(\text{North}, \text{stay}) = \text{North} \\
P(\text{East}, \text{clock}) = \text{South} \\
P(\text{East}, \text{anti-clock}) = \text{North} \\
P(\text{East}, \text{stay}) = \text{East} \\
P(\text{South}, \text{clock}) = \text{West} \\
P(\text{South}, \text{anti-clock}) = \text{East} \\
P(\text{South}, \text{stay}) = \text{South} \\
P(\text{West}, \text{clock}) = \text{North} \\
P(\text{West}, \text{anti-clock}) = \text{South} \\
P(\text{West}, \text{stay}) = \text{West}
\end{cases}$

Associated rewards $R(s, a)$

$R = \begin{cases}
R(\text{North}, \text{clock}) = 2, \quad R(\text{North}, \text{anti-clock}) = 2, \quad R(\text{North}, \text{stay}) = -2 \\
R(\text{East}, \text{clock}) = 3, \quad R(\text{East}, \text{anti-clock}) = 10, \quad R(\text{East}, \text{stay}) = 1 \\
R(\text{South}, \text{clock}) = 1, \quad R(\text{South}, \text{anti-clock}) = 3, \quad R(\text{South}, \text{stay}) = 1 \\
R(\text{West}, \text{clock}) = 10, \quad R(\text{West}, \text{anti-clock}) = -1, \quad R(\text{West}, \text{stay}) = 1
\end{cases}$
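
The code snippets in the following steps index the model as `P[(s, a)][s2]` and `R[(s, a)]`. One possible way to encode the two tables above in that shape is sketched here; the `ring` and `step` helpers are illustrative names that do not come from the original notes.

```python
states = ["North", "South", "East", "West"]
actions = ["clock", "anti-clock", "stay"]

# Clockwise order around the circle (illustrative helper, not from the source).
ring = ["North", "East", "South", "West"]
# Index offset along the ring for each action (illustrative helper).
step = {"clock": 1, "anti-clock": -1, "stay": 0}

# P[(s, a)] maps each reachable next state to its probability.
# Transitions are deterministic, so each inner dict holds one entry with probability 1.
P = {
    (s, a): {ring[(ring.index(s) + step[a]) % 4]: 1.0}
    for s in states
    for a in actions
}

# R[(s, a)] is the immediate reward, copied from the reward table.
R = {
    ("North", "clock"): 2, ("North", "anti-clock"): 2, ("North", "stay"): -2,
    ("East", "clock"): 3, ("East", "anti-clock"): 10, ("East", "stay"): 1,
    ("South", "clock"): 1, ("South", "anti-clock"): 3, ("South", "stay"): 1,
    ("West", "clock"): 10, ("West", "anti-clock"): -1, ("West", "stay"): 1,
}

print(P[("North", "clock")])      # {'East': 1.0}
print(P[("West", "anti-clock")])  # {'South': 1.0}
```

Keeping a probability dict per `(s, a)` pair, rather than a single next state, means the same code would also work for a stochastic environment.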

Step 1: Initial policy

We start with an arbitrary (non-optimal) policy:

$\pi_{0} = \begin{cases}
\pi(\text{North}) = \text{stay} \\
\pi(\text{South}) = \text{clock} \\
\pi(\text{East}) = \text{clock} \\
\pi(\text{West}) = \text{anti-clock}
\end{cases}$

Step 2: Policy evaluation

We estimate the value of each state under the current policy:

$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V^{\pi}(s')$

The algorithm proceeds by successive sweeps (an approximate evaluation truncated at horizon $H = 10$):

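# Start from V(s) = 0 for every state.
V = {s: 0 for s in states}
for _ in range(horizon):
    new_V = V.copy()
    for s in states:
        a = policy[s]  # action prescribed by the current policy
        # Bellman backup; P[(s, a)] maps each next state s2 to its probability.
        new_V[s] = R[(s,a)] + gamma * sum(P[(s,a)][s2] * V[s2] for s2 in P[(s,a)])
    V = new_V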

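At the end of this loop, $V$ approximates $V^{\pi}$. Each sweep shrinks the error by a factor $\gamma$, so after $H = 10$ sweeps starting from $V = 0$ the remaining error is at most $\gamma^{10} \approx 0.35$ times $\max_s |V^{\pi}(s)|$; a tightly converged estimate needs more sweeps.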
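
To make the effect of the truncation concrete, the sketch below wraps the evaluation loop in a function and compares 10 sweeps with 500 sweeps for the initial policy. The function name `evaluate_policy`, the `ring`/`step` helpers, and the choice of 500 sweeps as a stand-in for convergence are assumptions made for illustration.

```python
gamma = 0.9
horizon = 10
states = ["North", "South", "East", "West"]

# Deterministic model in the P[(s, a)][s2] / R[(s, a)] shape used by the snippets.
ring = ["North", "East", "South", "West"]           # clockwise order (helper)
step = {"clock": 1, "anti-clock": -1, "stay": 0}    # ring offset per action (helper)
P = {(s, a): {ring[(ring.index(s) + d) % 4]: 1.0}
     for s in states for a, d in step.items()}
R = {
    ("North", "clock"): 2, ("North", "anti-clock"): 2, ("North", "stay"): -2,
    ("East", "clock"): 3, ("East", "anti-clock"): 10, ("East", "stay"): 1,
    ("South", "clock"): 1, ("South", "anti-clock"): 3, ("South", "stay"): 1,
    ("West", "clock"): 10, ("West", "anti-clock"): -1, ("West", "stay"): 1,
}

# Initial policy from Step 1.
policy = {"North": "stay", "South": "clock", "East": "clock", "West": "anti-clock"}


def evaluate_policy(policy, sweeps):
    """Run `sweeps` synchronous Bellman backups for `policy`, starting from V = 0."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The comprehension reads the previous V and builds the next one.
        V = {
            s: R[(s, policy[s])]
            + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
            for s in states
        }
    return V


V10 = evaluate_policy(policy, horizon)
V_long = evaluate_policy(policy, 500)
print(round(V10["North"], 2))     # -13.03
print(round(V_long["North"], 2))  # -20.0
```

For North, which stays in place with reward -2, the exact value is $-2 / (1 - \gamma) = -20$, while 10 sweeps give $-20 (1 - \gamma^{10}) \approx -13.03$.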

Step 3: Policy improvement

For each state $s$, we compute the action value $Q(s, a)$:

$Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^{\pi}(s')$

and we update the policy by choosing the action with the highest value:

$\pi'(s) = \arg\max_{a} Q(s, a)$

Corresponding code:

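for s in states:
    Q_values = {}
    for a in actions:
        # One-step lookahead using the value estimate V of the current policy.
        Q_values[a] = R[(s,a)] + gamma * sum(P[(s,a)][s2] * V[s2] for s2 in P[(s,a)])
    # Greedy choice; on ties, max keeps the first action in insertion order.
    policy[s] = max(Q_values, key=Q_values.get)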
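
As a worked instance of this greedy step, the sketch below applies it once to the value function of the initial policy. The values in `V` are the exact solutions of the Bellman equations for that policy, worked out by hand for this illustration; `q_value`, `improve_policy`, `ring`, and `step` are hypothetical helper names.

```python
gamma = 0.9
states = ["North", "South", "East", "West"]
actions = ["clock", "anti-clock", "stay"]

ring = ["North", "East", "South", "West"]           # clockwise order (helper)
step = {"clock": 1, "anti-clock": -1, "stay": 0}    # ring offset per action (helper)
P = {(s, a): {ring[(ring.index(s) + step[a]) % 4]: 1.0}
     for s in states for a in actions}
R = {
    ("North", "clock"): 2, ("North", "anti-clock"): 2, ("North", "stay"): -2,
    ("East", "clock"): 3, ("East", "anti-clock"): 10, ("East", "stay"): 1,
    ("South", "clock"): 1, ("South", "anti-clock"): 3, ("South", "stay"): 1,
    ("West", "clock"): 10, ("West", "anti-clock"): -1, ("West", "stay"): 1,
}

# Exact values of the initial policy (North=stay, South=clock, East=clock,
# West=anti-clock), solved by hand: an assumption of this example.
v_south = 0.1 / 0.19
V = {
    "North": -2 / (1 - gamma),
    "South": v_south,
    "East": 3 + gamma * v_south,
    "West": -1 + gamma * v_south,
}


def q_value(s, a, V):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


def improve_policy(V):
    """Return the greedy policy for V; ties go to the earliest action in `actions`."""
    return {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}


new_policy = improve_policy(V)
print(new_policy)
```

Returning a fresh dict instead of mutating `policy` in place makes it easy to compare the old and new policies in the stability check.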

Step 4: Stability check

If the policy no longer changes ($\pi' = \pi$), it is greedy with respect to its own value function, and we have reached an optimal policy:

$\pi^{*} = \pi'$

Otherwise, we return to the evaluation step and repeat until convergence.
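
The four steps can be combined into a single loop. The following is a minimal sketch under the assumptions already used in the snippets (truncated evaluation with `horizon` sweeps, ties broken by the order of `actions`); the function names, the `ring`/`step` helpers, and the `max_rounds` safety cap are illustrative additions.

```python
gamma = 0.9
horizon = 10
states = ["North", "South", "East", "West"]
actions = ["clock", "anti-clock", "stay"]

ring = ["North", "East", "South", "West"]           # clockwise order (helper)
step = {"clock": 1, "anti-clock": -1, "stay": 0}    # ring offset per action (helper)
P = {(s, a): {ring[(ring.index(s) + step[a]) % 4]: 1.0}
     for s in states for a in actions}
R = {
    ("North", "clock"): 2, ("North", "anti-clock"): 2, ("North", "stay"): -2,
    ("East", "clock"): 3, ("East", "anti-clock"): 10, ("East", "stay"): 1,
    ("South", "clock"): 1, ("South", "anti-clock"): 3, ("South", "stay"): 1,
    ("West", "clock"): 10, ("West", "anti-clock"): -1, ("West", "stay"): 1,
}


def q_value(s, a, V):
    """Immediate reward plus discounted expected value of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


def evaluate_policy(policy, sweeps=horizon):
    """Step 2: truncated iterative evaluation starting from V = 0."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in states}
    return V


def policy_iteration(policy, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = dict(policy)  # work on a copy of the caller's dict
    for _ in range(max_rounds):
        V = evaluate_policy(policy)
        # Step 3: greedy policy with respect to the current estimate.
        improved = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
        # Step 4: stop as soon as the improvement step changes nothing.
        if improved == policy:
            return policy, V
        policy = improved
    raise RuntimeError("policy did not stabilise within max_rounds")


initial_policy = {"North": "stay", "South": "clock", "East": "clock", "West": "anti-clock"}
final_policy, final_V = policy_iteration(initial_policy)
print(final_policy)
print({s: round(v, 2) for s, v in final_V.items()})
```

Because the evaluation is truncated, `final_V` is only an approximation of the value function of the returned policy; passing a larger `sweeps` value tightens it.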

Final result

After convergence:

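$\pi^{*} = \begin{cases}
\pi^{*}(\text{North}) = \text{clock} \\
\pi^{*}(\text{South}) = \text{anti-clock} \\
\pi^{*}(\text{East}) = \text{anti-clock} \\
\pi^{*}(\text{West}) = \text{clock}
\end{cases}$

With the rewards and transitions defined above, South moves anti-clock: East and West have the same value under this policy (each collects 10 and moves to North), and $R(\text{South}, \text{anti-clock}) = 3 > R(\text{South}, \text{clock}) = 1$. At North, clock and anti-clock tie, so either choice is optimal.

The associated state values $V^{*}$ are the largest expected discounted returns achievable from each state, and they are attained by following this policy.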
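
As a hand check on the North/East cycle (a sketch that uses only the rewards, transitions, and $\gamma = 0.9$ defined above): North moves clock to East and East moves anti-clock back to North, so the Bellman equations for these two states form a closed system:

$V^{*}(\text{North}) = 2 + \gamma V^{*}(\text{East}), \quad V^{*}(\text{East}) = 10 + \gamma V^{*}(\text{North})$

Substituting the second equation into the first gives

$V^{*}(\text{North}) = \frac{2 + 10\gamma}{1 - \gamma^{2}} = \frac{11}{0.19} \approx 57.89, \quad V^{*}(\text{East}) = 10 + 0.9 \times 57.89 \approx 62.11$

West also collects 10 and moves to North, so $V^{*}(\text{West}) = V^{*}(\text{East}) \approx 62.11$.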
