Assignment #3: Reinforcement Learning

Due: November 7 by 12:05 am

This assignment is to be done individually.

Academic Integrity

The following is offered with apologies to the vast majority of students who do their work honestly and take their university learning seriously:

Your instructor takes academic integrity seriously and has no tolerance for plagiarism or any other form of academic misconduct. Failure to respect these guidelines will result in you receiving a grade of zero on this assignment.

Acceptable collaboration between students, provided it is acknowledged explicitly in your report and code, might include:

  1. discussing some aspect of the assignment specifications in attempting to understand a particular point
  2. discussing one of the functions of the Arcade Learning Environment
  3. discussing a problem you encountered while extracting the game score

Sharing of any computer code with other students, or re-using any code from a third party (e.g., open source), is acceptable, provided that you indicate this explicitly at the start of your report and (as comments) in the source code. In this case, only the portion of the work that you did yourself will be considered toward your grade.

Unacceptable collaboration and violations of academic integrity include, but are not limited to:

  1. including any code that was not your own and failing to indicate this
  2. copying part of another student's report

If you are uncertain about any of these guidelines, please discuss them with your instructor as soon as possible.

Introduction

In this assignment, you will apply reinforcement learning to improve the performance of an AI agent playing the game of qbert within the Atari emulator, Stella. The Arcade Learning Environment, despite its limited external documentation, will likely prove invaluable for accessing various elements of the game state. The qbert binary is available here.
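By way of illustration only, the sketch below shows a minimal interaction loop with the emulator through the ALE's Python bindings (ale-py here; older releases expose the same calls under ale_python_interface). The ROM path, seed, episode count, and placeholder action choice are all assumptions rather than part of the assignment.

    # Minimal sketch of an ALE interaction loop (assumes the ale-py bindings;
    # the ROM path and constants below are placeholders, not prescribed values).
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.setInt("random_seed", 123)   # change this value to run trials with different seeds
    ale.loadROM("qbert.bin")         # path to the qbert binary (placeholder)

    actions = ale.getMinimalActionSet()

    for episode in range(10):
        total_reward = 0
        while not ale.game_over():
            a = actions[0]               # replace with the action chosen by your RL agent
            total_reward += ale.act(a)   # act() returns the change in game score
            ram = ale.getRAM()           # 128-byte RAM snapshot, useful for building state features
        print("episode", episode, "score", total_reward)
        ale.reset_game()

Changing the random_seed value above is one simple way to produce the multiple-seed trials referred to in the marking scheme.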

Some tips to get started:

As potentially useful background reading, you may wish to consult previous research that investigated temporal difference learning methods in such game environments, two examples of which are:

It should be emphasized, however, that you are not expected to carry out the same degree of development and experimentation as described in such publications. These are only recommended for possible inspiration, as they go well beyond the scope of these assignment specifications.

Submitting your assignment

Your assignment must be submitted through Moodle to allow for peer- and self-assessment. The submission must contain:

Marking Scheme

(Subject to minor revision)

Each criterion is scored on a scale from Unsatisfactory through Bare minimum, Satisfactory, and Good to Above and beyond; the point value awarded at each level is listed below.

Description of approach to generalization (a sketch of one possible function approximation setup follows this table)
  0. none or incomprehensible
  5. provides an overview of function approximation (if following Sec. 21.4) or an expression for the distance metric between two states (if following Mahadevan's approach)
  7. provides a description of the function approximation parameters (Sec. 21.4), or an expression for the distance metric between two states and a description of the state representation used by the RL agent (Mahadevan)
  10. describes the degree to which function approximation specializes to characterize good state-action pairs (Sec. 21.4), or provides an expression for the distance metric between two states, describes the state representation used by the RL agent, and includes a rationale for the choice of components of the distance metric (Mahadevan)

Results of generalization
  0. no results provided
  5. graph or table showing improvement either in training time to reach some level of performance, or in performance achieved after a fixed number of cycles, vs. results without generalization
  7. results provided for at least 2 different generalization approaches (i.e., choices of components of the distance metric) and some discussion regarding consequences for the behaviour of the game agent
  10. results provided for at least 3 different generalization approaches (i.e., choices of components of the distance metric) and meaningful discussion regarding consequences for the behaviour of the game agent

Description of approach to exploration (a sketch of one possible exploration function follows this table)
  0. none or incomprehensible
  5. provides an expression for the optimistic prior for (state, action) pairs
  7. provides an expression for the optimistic prior for (state, action) pairs with a clear explanation of how the agent chose its action at each step
  10. provides an expression for the optimistic prior for (state, action) pairs with a clear explanation of how the agent chose its action at each step and a convincing rationale for the approach taken

Results of exploration
  0. no results provided
  5. graph or table showing improvement either in training time to reach some level of performance, or in performance achieved after a fixed number of cycles, vs. results without exploration
  7. results provided for at least 2 different exploration functions (i.e., weighting of N[s,a] in the optimistic prior calculation) and some discussion regarding consequences for the behaviour of the game agent
  10. results provided for at least 2 different exploration functions (i.e., weighting of N[s,a] in the optimistic prior calculation) and meaningful discussion regarding consequences for the behaviour of the game agent

Agent Performance
  0. agent not demonstrated at competition; no results provided
  10. agent was able to run during the competition, but no results provided
  15. graph or table showing agent performance, but inadequately annotated or explained to provide a clear understanding of the effects of learning
  20. graph or table showing agent performance, either as game score or as game time (before death of the game agent), as a function of RL time (e.g., number of games played)
  30. as above, including analysis of the effects of game events on agent behaviour and strategies (e.g., "enemy" avoidance) AND explanation of results over trials with multiple seeds to demonstrate generalization of learning

Relevant commenting and readability
  0. no README, minimal or no commenting of code
  5. README provided describing how to run the agent, and code contains basic commenting identifying how the state representation is formed
  10. as above, README explains how to modify the seed, and other major blocks of code are clearly commented

Ranked placement of agent in class tournament
  0 (didn't play), 3, 5, 7, or 10 points, according to the agent's ranking
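If you follow the function approximation route of Sec. 21.4 for the generalization criterion, one possible starting point is a linear approximation of the Q-function, sketched below. The feature function, feature count, learning rate, and discount factor are placeholders you would choose yourself; this is not a prescribed implementation.

    import numpy as np

    # Illustrative linear Q-function approximation, Q_theta(s, a) = theta . phi(s, a).
    # NUM_FEATURES, ALPHA, and GAMMA are placeholder values (assumptions).
    NUM_FEATURES = 8
    ALPHA = 0.01    # learning rate
    GAMMA = 0.99    # discount factor

    theta = np.zeros(NUM_FEATURES)

    def phi(state, action):
        """Map a (state, action) pair to a feature vector extracted from the
        game state (e.g., agent position, nearby enemies); left to you."""
        raise NotImplementedError

    def q_value(theta, state, action):
        return float(np.dot(theta, phi(state, action)))

    def td_update(theta, state, action, reward, next_state, next_actions):
        """One Q-learning step on the weight vector theta."""
        best_next = max(q_value(theta, next_state, a) for a in next_actions)
        delta = reward + GAMMA * best_next - q_value(theta, state, action)
        return theta + ALPHA * delta * phi(state, action)

The choice of features, and how strongly the resulting approximation specializes to good state-action pairs, is exactly what the generalization criterion asks you to describe.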
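For the exploration criterion, the rubric refers to an optimistic prior over (state, action) pairs based on the visit count N[s,a]. One minimal form of such an exploration function is sketched below; the constants R_PLUS and N_E are placeholder tuning values, not values prescribed by the assignment.

    # Minimal sketch of an optimistic exploration function; R_PLUS and N_E are
    # placeholder constants (assumptions), to be tuned for your agent.
    R_PLUS = 1000.0   # optimistic estimate of the best achievable return
    N_E = 5           # how many visits before trusting the learned estimate

    def f(u, n):
        """Optimistic value of a (state, action) pair whose current estimate is u
        and which has been tried n times (n = N[s,a])."""
        return R_PLUS if n < N_E else u

    # Action selection then picks the action with the highest optimistic value, e.g.
    #   best_action = max(actions, key=lambda a: f(Q[(s, a)], N[(s, a)]))

Different choices of R_PLUS, N_E, or the weighting of N[s,a] give the different exploration functions that the Results of exploration criterion asks you to compare.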

Last updated on 6 November 2017