Deep Learning Asteroids AI

Explore the methodologies, strategies, and key learnings from my reinforcement learning training process.

Hyperparameter Tuning

After testing over 23 configurations, the best-performing values for the key parameters were identified; the sections below walk through each one.
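For reference, the sketch below consolidates those choices in a hypothetical trainer configuration. The gamma and beta values are the ones discussed in the following sections; the buffer size is a placeholder, since the write-up only specifies that it was significantly increased from the default.

    # Hypothetical consolidation of the tuned hyperparameters (sketch only).
    ppo_config = {
        "buffer_size": 40960,  # placeholder; the point is "much larger than the default"
        "gamma": 0.99,         # medium discount factor (see Gamma below)
        "beta": 1e-4,          # reduced from the 1e-3 default (see Beta below)
    }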

Performance Graphs

Buffer Size
The graph below shows the phase reached with the default buffer size (gray) versus a significantly increased one (purple). Performance noticeably degrades with the small buffer because the agent does not keep a large memory of experiences to draw from: whenever there is a dip in performance, it no longer remembers its prior good behavior and cannot correct itself, so performance cascades downward.

Performance with default vs. increased buffer size
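To make the buffer argument concrete, here is a minimal sketch (not the project's training code) of a bounded experience buffer. With a small capacity, older successful transitions are evicted quickly, so after a dip the updates are dominated by recent poor episodes; a larger capacity keeps the earlier good behavior available to learn from.

    from collections import deque

    # Sketch only: bounded experience storage. Capacities are hypothetical.
    small_buffer = deque(maxlen=1_000)    # "default"-sized: old experiences drop out fast
    large_buffer = deque(maxlen=50_000)   # "significantly increased": good runs stay in memory

    def store(buffer, transition):
        # transition = (state, action, reward, next_state); oldest entry is evicted when full
        buffer.append(transition)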

Gamma
Consider the graph below, where blue is 0.99 (medium), pink is 0.9 (low), and yellow is 0.999 (high). Intuitively, a low gamma prioritizes short-term rewards, which suits turbulent environments where rewards are not guaranteed. This environment is turbulent, but rewards are effectively guaranteed as long as the agent survives, so the turbulence argues for a low gamma while the guaranteed rewards argue for a high one. In practice, the medium gamma learned the quickest and kept showing signs of improvement, the low gamma underperformed, and the high gamma learned far too slowly to be practical.

Performance across low, medium, and high gamma values
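One way to see the trade-off is the effective planning horizon implied by each gamma: a reward k steps ahead is weighted by gamma**k, so roughly 1 / (1 - gamma) future steps meaningfully influence each decision. The snippet below is an illustrative sketch, not project code.

    # Effective planning horizon implied by each discount factor.
    for gamma in (0.9, 0.99, 0.999):
        horizon = 1.0 / (1.0 - gamma)
        print(f"gamma={gamma}: roughly a {horizon:.0f}-step horizon")
    # 0.9   -> ~10 steps   (short-term focus)
    # 0.99  -> ~100 steps
    # 0.999 -> ~1000 steps (credit spread so thinly that learning crawls)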

Beta
The beta value controls the balance between exploitation and exploration. It was decreased from the default 1e-3 to 1e-4. A lower beta places less weight on exploration, so the agent favors sticking to and exploiting known strategies rather than trying new ones. Since the core gameplay loop is the same regardless of the phase, strategies don't change much beyond "don't get hit", so a low beta value was ideal. In the graph below, the light blue line corresponds to 1e-3 and the green line to 1e-4, with the other parameters fixed at the values chosen above. The lower beta yields a better rate of learning.

Performance with beta 1e-3 vs. 1e-4
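Assuming beta weights an entropy (exploration) bonus in the policy loss, as it does in many PPO-style trainers, the sketch below shows the effect: a larger beta rewards a more random policy, while a smaller one lets the agent commit to strategies it already knows. This is an illustration of the idea, not the trainer's actual loss.

    import numpy as np

    # Sketch: entropy-regularized policy loss (assumed interpretation of beta).
    def total_loss(policy_loss, action_probs, beta):
        entropy = -np.sum(action_probs * np.log(action_probs + 1e-8))
        # With a large beta, high-entropy (more exploratory) policies are rewarded;
        # with a small beta, the policy loss dominates and known strategies win out.
        return policy_loss - beta * entropy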

Rotation Observation
To reduce the size of the state space, training was initially done without including the ship's rotation in the agent's observations. Once the important hyperparameters had been identified, rotation was added to the observations, significantly improving performance.

Performance after rotation observation
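As an illustration of what changed, the sketch below builds a toy observation vector with and without rotation; the field names are hypothetical, since the project's exact observation layout isn't listed here. Encoding the angle as sine and cosine avoids the discontinuity at the 0/360 boundary.

    import math
    from collections import namedtuple

    Ship = namedtuple("Ship", "x y vx vy rotation_degrees")  # hypothetical fields

    def build_observation(ship, include_rotation=True):
        obs = [ship.x, ship.y, ship.vx, ship.vy]
        if include_rotation:
            angle = math.radians(ship.rotation_degrees)
            obs += [math.sin(angle), math.cos(angle)]  # lets the policy relate actions to facing
        return obs

    print(build_observation(Ship(0.0, 0.0, 1.0, 0.0, 90.0)))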

Rewards and Termination Logic

The reward system simplifies training by avoiding penalties, relying instead on episode termination to correct undesired behaviors.
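A minimal sketch of that scheme follows; the reward value, method names, and events are hypothetical rather than the project's actual API. Rewards are only ever added, and getting hit simply ends the episode, which is itself the corrective signal.

    # Sketch of a penalty-free reward scheme with termination as the correction.
    class AgentStub:
        def __init__(self):
            self.episode_reward = 0.0
            self.done = False

        def add_reward(self, value):
            self.episode_reward += value

        def end_episode(self):
            self.done = True  # no negative reward; the episode just stops

    def on_asteroid_destroyed(agent):
        agent.add_reward(1.0)  # positive feedback only

    def on_ship_hit(agent):
        agent.end_episode()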

State Representation

The agent uses four key sensors to adapt dynamically:
