| Algorithm | Number of Models | Update Method | On-/Off-Policy | Learns a Value Function | Learns a Policy | Continuous Action Spaces | Discrete Action Spaces | Stochastic Policy |
|---|---|---|---|---|---|---|---|---|
| Deep Q-Learning (DQN) | 1 | Temporal Difference | Off-Policy | Yes | No | No | Yes | No |
| Double DQN V1 (Randomly Chosen Network) | 1 (2 sets of model parameters) | Temporal Difference | Off-Policy | Yes | No | No | Yes | No |
| Double DQN V2 (Target Network) | 1 (2 sets of model parameters) | Temporal Difference | Off-Policy | Yes | No | No | Yes | No |
| Deep SARSA (State-Action-Reward-State-Action) | 1 | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| Double Deep SARSA V1 (Randomly Chosen Network) | 1 (2 sets of model parameters) | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| Double Deep SARSA V2 (Target Network) | 1 (2 sets of model parameters) | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| Deep Expected SARSA | 1 | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| Double Deep Expected SARSA V1 (Randomly Chosen Network) | 1 (2 sets of model parameters) | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| Double Deep Expected SARSA V2 (Target Network) | 1 (2 sets of model parameters) | Temporal Difference | On-Policy | Yes | No | No | Yes | No |
| REINFORCE | 1 | Monte Carlo | On-Policy | No | Yes | Yes | Yes | Yes |
| Vanilla Policy Gradient | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
| Actor-Critic | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
| Advantage Actor-Critic (A2C) | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
| Asynchronous Advantage Actor-Critic (A3C) | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
| Proximal Policy Optimization (PPO) | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
| PPO with Clipped Objective (PPO-Clip) | 2 (Actor + Critic) | Both | On-Policy | Yes (Critic) | Yes (Actor) | Yes | Yes | Yes |
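The value-based rows above differ mainly in how the temporal-difference target is built, and the two Double DQN variants differ in how the second set of parameters is used. Below is a minimal PyTorch sketch of the three targets, assuming `q_net`, `target_net`, `net_a`, and `net_b` are Q-networks mapping a batch of states to `(batch, n_actions)` action values; all names, shapes, and hyperparameters are illustrative, not taken from the table.

```python
import random
import torch

def dqn_target(q_net, rewards, next_states, dones, gamma=0.99):
    # DQN (off-policy): bootstrap from the max Q-value of the online network.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_v1_target(net_a, net_b, rewards, next_states, dones, gamma=0.99):
    # Double DQN V1: one of two Q-networks is chosen at random to pick the
    # greedy action; the other evaluates it. The randomly chosen network is
    # then the one updated against this target.
    selector, evaluator = (net_a, net_b) if random.random() < 0.5 else (net_b, net_a)
    with torch.no_grad():
        best = selector(next_states).argmax(dim=1, keepdim=True)
        next_q = evaluator(next_states).gather(1, best).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_v2_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN V2: the online network selects the action, a periodically
    # synchronised target network evaluates it, decoupling selection from
    # evaluation to reduce the overestimation bias of the plain max.
    with torch.no_grad():
        best = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```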
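The SARSA rows are the on-policy counterparts: instead of maximising over next actions, the target uses the action the behaviour policy actually took (Deep SARSA) or the expectation of Q under an epsilon-greedy policy (Deep Expected SARSA). A sketch under the same assumed shapes; `epsilon` as the exploration rate is also an assumption.

```python
import torch

def deep_sarsa_target(q_net, rewards, next_states, next_actions, dones, gamma=0.99):
    # Deep SARSA (on-policy): bootstrap from the Q-value of the action the
    # behaviour policy actually chose in the next state.
    with torch.no_grad():
        next_q = q_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q

def deep_expected_sarsa_target(q_net, rewards, next_states, dones,
                               epsilon=0.1, gamma=0.99):
    # Deep Expected SARSA: replace the sampled next action with the
    # expectation of Q under the epsilon-greedy behaviour policy, which
    # lowers the variance of the target while staying on-policy.
    with torch.no_grad():
        next_q = q_net(next_states)                    # (batch, n_actions)
        n_actions = next_q.shape[1]
        probs = torch.full_like(next_q, epsilon / n_actions)
        greedy = next_q.argmax(dim=1, keepdim=True)
        probs.scatter_(1, greedy, 1.0 - epsilon + epsilon / n_actions)
        expected_q = (probs * next_q).sum(dim=1)
    return rewards + gamma * (1.0 - dones) * expected_q
```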
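REINFORCE is the one purely Monte Carlo row: no value function is learned, and the update needs complete episodes. A sketch of the loss for a discrete-action policy network; the return normalisation is a common variance-reduction trick, not something the table specifies.

```python
import torch

def reinforce_loss(policy_net, states, actions, rewards, gamma=0.99):
    # Discounted return G_t for every step of one finished episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    # Normalising returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Monte Carlo policy gradient: raise the log-probability of each action
    # in proportion to the return that followed it.
    dist = torch.distributions.Categorical(logits=policy_net(states))
    return -(dist.log_prob(actions) * returns).mean()
```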
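The actor-critic rows (Vanilla Policy Gradient through A3C) share the same two-model structure: a critic that learns a value function and an actor that learns a stochastic policy, trainable with Monte Carlo returns, TD targets, or a mix, which is why their update-method column reads "Both". A one-step TD sketch follows; A2C and A3C typically layer n-step returns and parallel workers on top of exactly these losses.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(actor, critic, states, actions, rewards,
                        next_states, dones, gamma=0.99):
    # Critic: regress V(s) onto a one-step TD target.
    values = critic(states).squeeze(1)
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * critic(next_states).squeeze(1)
    critic_loss = F.mse_loss(values, targets)
    # Actor: policy gradient weighted by the advantage (here the TD error).
    advantages = targets - values.detach()
    dist = torch.distributions.Categorical(logits=actor(states))
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    return actor_loss, critic_loss
```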
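PPO-Clip keeps the same actor-critic structure but constrains each policy update. A sketch of the clipped surrogate loss, assuming log-probabilities were stored when the data was collected (`old_log_probs`) and recomputed under the current policy (`new_log_probs`); `clip_eps=0.2` is a conventional default, not a value from the table.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the one that
    # collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipping removes the incentive to push the ratio outside
    # [1 - clip_eps, 1 + clip_eps]; taking the min keeps the bound pessimistic.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```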