Agents are independent, but multiple agents can share the same Behavior
Agent Properties
Mean reward should increase during training
Training should be ended from the Python side (Ctrl-C keyboard interrupt) to trigger saving the .nn model file
TODO: add support for --resume to load checkpoints
To train behaviors we need to define 3 entities for a given environment: Observations, Actions, and Reward signals
Through training, an agent learns a Policy, which is an optimal mapping from observations to actions
Heuristic behaviors: behaviors defined as a hard-coded set of rules. Maybe a good place to put our navmesh-based control for "classical" AI (e.g. if the agent collides, turn 90 deg and continue forward)
Behaviors are like a function: f(observations) = actions
Goal of agent: discover a behavior (a Policy) that maximizes a reward
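A minimal Python sketch of this view, a behavior as a function from observations to actions, using the hard-coded collision rule above. The observation layout and action encoding are made up for illustration and are not an ML-Agents API:

```python
import numpy as np

# Hypothetical hard-coded ("heuristic") behavior: f(observations) = actions.
# Assumption: obs[0] is a collision flag; the action is (forward speed, turn).
def heuristic_behavior(observations: np.ndarray) -> np.ndarray:
    collided = observations[0] > 0.5
    turn = np.pi / 2 if collided else 0.0   # turn 90 degrees on collision
    forward_speed = 1.0                      # otherwise keep moving forward
    return np.array([forward_speed, turn], dtype=np.float32)
```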
RL algorithms
Imitation Learning: can be combined with RL to dramatically reduce the time the agent takes to solve the environment.
To summarize, there are 3 training methods: BC, GAIL and RL (PPO or SAC). These can be used independently or together.
Leveraging either BC or GAIL requires recorded demonstrations, which are provided as input to the training algorithms.
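Rough sketch of how BC and GAIL typically sit alongside the extrinsic RL reward in a behavior's trainer config; written as a Python dict rather than the actual YAML file, with placeholder values and demo path (see the parameters link below for the real field reference):

```python
# Sketch only: the real trainer config is YAML; values and the demo path here
# are placeholders, field names follow the ML-Agents configuration docs.
behavior_config = {
    "trainer": "ppo",
    "reward_signals": {
        "extrinsic": {"strength": 1.0, "gamma": 0.99},    # plain RL reward
        "gail": {"strength": 0.01, "gamma": 0.99,
                 "demo_path": "Demos/Expert.demo"},        # reward learned from demos
    },
    "behavioral_cloning": {
        "demo_path": "Demos/Expert.demo",                  # recorded demonstrations
        "strength": 0.5,
    },
}
```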
Training with curriculum
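A minimal sketch of the curriculum idea (not the ML-Agents curriculum file format): as a progress measure such as mean reward crosses thresholds, advance a lesson and make the task harder. The wall-height parameter and values are hypothetical:

```python
# Hypothetical curriculum: each time progress crosses a threshold, advance one
# lesson and raise the difficulty parameter handed to the environment.
thresholds = [0.1, 0.3, 0.5]           # progress values that advance the lesson
wall_heights = [1.5, 2.0, 2.5, 4.0]    # one parameter value per lesson

def wall_height_for(progress: float) -> float:
    lesson = sum(progress >= t for t in thresholds)
    return wall_heights[lesson]
```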
Environment Parameter Randomization
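A minimal sketch of the idea, with hypothetical parameter names: resample selected environment parameters from given ranges (e.g. at each reset) so the policy does not overfit to a single fixed setting:

```python
import random

# Hypothetical parameters and ranges; resample before each episode / reset.
param_ranges = {"gravity": (7.0, 12.0), "obstacle_scale": (0.5, 1.5)}

def sample_environment_parameters() -> dict:
    return {name: random.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
```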
Model types
Parameters: https://github.com/Unity-Technologies/ml-agents/blob/release_1/docs/Training-Configuration-File.md
Proximal Policy Optimization
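Rough sketch of the kind of PPO hyperparameters documented in the configuration file linked above, written as a Python dict with illustrative values only:

```python
# Field names follow the linked Training-Configuration-File.md; values are
# only illustrative, not tuned recommendations.
ppo_hyperparameters = {
    "batch_size": 1024,      # experiences per gradient update
    "buffer_size": 10240,    # experiences collected before each model update
    "learning_rate": 3.0e-4,
    "beta": 5.0e-3,          # entropy regularization strength
    "epsilon": 0.2,          # PPO policy-update clipping range
    "lambd": 0.95,           # GAE lambda
    "num_epoch": 3,          # passes over the buffer per update
    "time_horizon": 64,      # steps collected per agent before adding to the buffer
    "max_steps": 5.0e5,      # total training steps
}
```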