Figuring out just what an AI is good at is one of the hardest things about understanding it. To help determine this, OpenAI has designed a set of games that can help researchers tell whether their machine learning agent is actually learning basic skills or, what is equally likely, has figured out how to rig the system in its favor.
It's one of those aspects of AI research that never fails to delight: the ways an agent will bend or break the rules in its endeavors to appear good at whatever the researchers are asking it to do. Cheating may be thinking outside the box, but it isn't always welcome, and one way to check is to change the rules a bit and see if the system breaks down.
What the agent actually learned can be determined by seeing if those "skills" can be applied when it's put into new circumstances where only some of its knowledge is relevant.
For instance, say you want to find out whether an AI has learned to play a Mario-like game where it travels right and jumps over obstacles. You could switch things around so it has to walk left; you could change the order of the obstacles; or you could change the game entirely and have monsters appear that the AI has to shoot while it travels right instead.
If the agent has really learned something about playing a game like this, it should be able to pick up the modified versions of the game much quicker than something entirely new. This is called "generalizing" — applying existing knowledge to a new set of circumstances — and humans do it constantly.
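To make the idea concrete, here is a minimal sketch of such a test in Python. Everything in it is a hypothetical toy rather than anything from OpenAI's code: the one-dimensional game, the `make_level` and `play` helpers, and the two agents are invented for illustration. One agent has memorized where to jump in a single training level; the other reacts to what is in front of it. Changing the order of the obstacles separates the two.

```python
import random

def make_level(seed, length=20, n_obstacles=5):
    """Return the obstacle positions of a tiny side-scroller (hypothetical toy game)."""
    rng = random.Random(seed)
    return set(rng.sample(range(1, length), n_obstacles))

def play(level, policy, length=20):
    """Walk right one tile at a time; the agent must jump on every obstacle tile.
    Returns True if the agent reaches the end of the level."""
    for pos in range(1, length):
        action = policy(pos, pos in level)
        if pos in level and action != "jump":
            return False
    return True

def memorizer(train_level):
    """Agent that ignores what it sees and replays the jumps from one known level."""
    return lambda pos, obstacle_here: "jump" if pos in train_level else "run"

def reactive(pos, obstacle_here):
    """Agent that learned the actual rule: jump when there is an obstacle."""
    return "jump" if obstacle_here else "run"

def success_rate(policy, seeds):
    """Fraction of levels (one per seed) that the policy completes."""
    seeds = list(seeds)
    return sum(play(make_level(s), policy) for s in seeds) / len(seeds)

train_seed = 0
test_seeds = range(1, 101)  # obstacle layouts the agents have never seen
rote = memorizer(make_level(train_seed))

print("memorizer, training level:", success_rate(rote, [train_seed]))
print("memorizer, new levels:    ", success_rate(rote, test_seeds))
print("reactive,  new levels:    ", success_rate(reactive, test_seeds))
```

Both agents look equally good on the level the memorizer was built from; only the unseen layouts reveal which one picked up a transferable skill.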
OpenAI researchers have encountered this kind of rule-bending many times in their research, and in order to test generalizable AI knowledge at a basic level, they've designed a sort of AI arcade where an agent has to prove its mettle in a variety of games with varying overlap of gameplay concepts.
The 16 game environments they designed are similar to games we know and love, like Pac-Man, Super Mario Bros., Asteroids, and so on. The difference is the environments have been built from the ground up for AI play, with simplified controls, rewards, and graphics.
Each taxes an AI's abilities in a different way. For instance, in one game there may be no penalty for sitting still and observing the game environment for a few seconds, while in others doing so may place the agent in danger. In some the AI must explore the environment, in others it may be focused on a single big boss spaceship. But they're all made to be unmistakably different games, not unlike (though obviously a bit different from) what you might find available for an Atari or NES console.
Here's the full list, as seen in the gif below from top to bottom, left to right:
Ninja: Climb a tower while avoiding bombs or destroying them with throwing stars.
Coinrun: Get the coin at the right side of the level while avoiding traps and monsters.
Plunder: Fire cannonballs from the bottom of the screen to hit enemy ships and avoid friendlies.
Caveflyer: Navigate caves using Asteroids-style controls, shooting enemies and avoiding obstacles.
Jumper: Open-world platformer with a double-jumping rabbit and compass pointing towards the goal.
Miner: Dig through dirt to collect diamonds while dealing with boulders that obey Atari-era gravity rules.
Maze: Navigate randomly generated mazes of various sizes.
Bigfish: Eat fish smaller than you to become the bigger fish, while avoiding a similar fate.
Chaser: Like Pac-Man, eat the dots and use power pellets strategically to eat enemies.
Starpilot: Gradius-like shmup focused on dodging and quick elimination of enemy ships.
Bossfight: 1 on 1 battle with a boss ship with randomly selected attacks and replenishing shields.
Heist: Navigate a maze with colored locks and corresponding keys.
Fruitbot: Ascend through levels while collecting fruit and avoiding non-fruit.
Dodgeball: Move around a room without touching walls, hitting others with balls and avoiding getting hit.
Climber: Climb a series of platforms collecting stars along the way and avoiding monsters.
Leaper: Frogger-type lane-crossing game with cars, logs, etc.
You can imagine that an AI might be created that excels at the grid-based ones like Heist, Maze, and Chaser, but loses its way in Jumper, Coinrun, and Bossfight. Just like a human, because there are different skills involved in each. But there are shared ones as well: understanding that contact between the player character and moving objects may have consequences, or that certain areas of the play area are inaccessible. An AI that can generalize and adapt quickly will learn to dominate all these games in a shorter time than one that doesn't generalize well.
The set of games and methods for observing and rating agent performance in them is called the Procgen Benchmark, since the environments and enemy placements in the games are procedurally generated. You can read more about them, or learn to build your own little AI arcade, at the project's GitHub page.
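For a rough sense of what "procedurally generated" means in practice, the sketch below builds a small grid level from nothing but an integer seed, then sets aside one range of seeds for training and a separate range for testing. It is an illustrative toy, not the project's code: `generate_level`, `split_levels`, the grid symbols, and the level counts are all assumptions made for this example.

```python
import random

def generate_level(seed, width=8, height=4, wall_prob=0.2):
    """Build a small grid level deterministically from an integer seed."""
    rng = random.Random(seed)
    grid = [["#" if rng.random() < wall_prob else "." for _ in range(width)]
            for _ in range(height)]
    grid[height - 1][0] = "A"                     # agent starts bottom-left
    grid[rng.randrange(height)][width - 1] = "C"  # coin somewhere on the right edge
    return ["".join(row) for row in grid]

def split_levels(n_train, n_test):
    """Disjoint seed ranges: the agent trains on one and is scored on the other."""
    train = range(0, n_train)
    test = range(n_train, n_train + n_test)
    return train, test

train_seeds, test_seeds = split_levels(200, 50)

# The same seed always rebuilds the same level, so results are reproducible.
for row in generate_level(train_seeds[0]):
    print(row)
```

Because every level comes from a seed, there is effectively no limit to how many fresh levels can be produced, which makes it hard for an agent to score well on the test range by memorizing layouts it saw in training.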