As and exercise I implemented a reinforcement learning agent in a simple Gridworld with Python. Full code can be found on github.
The agent does not have any prior knowledge about the environment nor it's transitions. It learns by exploration to reach the goal, where it's given a reward. Then the world is reset. The algorithm learns the Q function of (state, action) pairs and then uses it to guide itself through a maze.
During the process, it stores all experience and during each step it replays it so the Q function converge to the true Q* quicker.
The movement of the agent is shown as an ASCII output, with 1 as walls, 9 as the goal and 2 as the current position of the agent.