‘Policy’: maps states to actions
- Can have various shapes and forms: tables, decision trees, …
- Of special interest here: artificial neural networks (NN)
- Weights of the NN are adapted / updated during learning
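As a minimal sketch of the policy-as-network idea (pure Python; all weights, layer sizes, and function names here are arbitrary), a policy mapping a 4-dimensional state to probabilities over 2 actions could look like:

```python
# Minimal sketch: a policy as a tiny neural network.
# Maps a 4-dimensional state to a probability distribution over 2 actions.
import math
import random

random.seed(0)

def init_policy(n_in=4, n_hidden=8, n_out=2):
    """Random initial weights; in RL, these are what learning updates."""
    w1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, w2

def policy(state, params):
    w1, w2 = params
    # One hidden layer with tanh activation, then a softmax over action logits.
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]  # action probabilities

probs = policy([0.0, 0.1, -0.05, 0.0], init_policy())
```

A table-based policy would instead index a lookup structure by (discretized) state; the network form is what the weight updates below act on.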
Parameters of the learning algorithm
- E.g., discount factors, step sizes, initial values, …
Parameters of the NN
- E.g., number of units, number of layers, learning rates, …
Other parameters
- E.g., agent morphology, environmental parameters
Question: how do we set these parameters correctly?
Answer: tuning / optimization
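The three parameter groups could be collected in a single configuration that a tuner searches over; the names and values below are purely illustrative:

```python
# Sketch: the three parameter groups above in one (hypothetical) configuration
# dictionary. Values are placeholders, not recommendations.
config = {
    "algorithm": {          # parameters of the learning algorithm
        "discount_factor": 0.99,
        "step_size": 0.01,
        "initial_value": 0.0,
    },
    "network": {            # parameters of the NN
        "num_units": 64,
        "num_layers": 2,
        "learning_rate": 1e-3,
    },
    "other": {              # e.g., environmental parameters
        "episode_length": 200,
    },
}
```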
NN training can be costly
- Thousands or millions of weights to set
Evaluating the agent in the environment may also be costly
- E.g., observing a robot trying to solve a complex task
Costs can easily become prohibitive for tuning
Solution: surrogate model-based optimization
Surrogate model
- Learns the relation between parameters and performance
- Evaluate the surrogate model instead of the actual problem
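A toy sketch of the surrogate idea: a quadratic fit stands in for the Gaussian process here, and the “expensive” evaluation is simulated, so all numbers are illustrative only.

```python
# Sketch of surrogate model-based optimization (simplified: a quadratic
# surrogate instead of a Gaussian process; the costly evaluation is simulated).
import numpy as np

rng = np.random.default_rng(1)

def expensive_eval(lr):
    """Stand-in for training + evaluating an agent; best near lr = -2."""
    return -(lr + 2.0) ** 2 + rng.normal(0, 0.05)

# 1. Evaluate the real (costly) problem at a few parameter settings.
x = np.linspace(-4, 0, 5)
y = np.array([expensive_eval(v) for v in x])

# 2. Fit a cheap surrogate to the observed (parameter, performance) pairs.
surrogate = np.poly1d(np.polyfit(x, y, deg=2))

# 3. Search on the surrogate instead of the actual problem.
candidates = np.linspace(-4, 0, 401)
best = candidates[np.argmax(surrogate(candidates))]
```

Only step 1 touches the expensive problem; steps 2 and 3 are cheap and can be repeated to propose new points to evaluate.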
Cart balances pole
- OpenAI Gym: CartPole-v0
- States: 4, continuous (cart position & velocity, pole angle & velocity)
- Actions: 2, discrete (push cart left, push cart right)
- Reward: 1 for each time step
- Episode stops if: pole angle outside +/- 12 degrees, cart position outside +/- 2.4, or after 200 time steps
- Learning problem is solved if: average reward of at least 195 over 100 consecutive episodes
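The reward and stopping rules above can be written out directly; this is a hand-coded sketch, not gym's implementation:

```python
# Sketch of CartPole-v0's reward and termination rules as described above.
import math

ANGLE_LIMIT = 12 * math.pi / 180   # +/- 12 degrees, in radians
POSITION_LIMIT = 2.4               # +/- 2.4 (cart position)
MAX_STEPS = 200

def step_outcome(cart_pos, pole_angle, t):
    """Return (reward, done) for one time step."""
    done = (abs(pole_angle) > ANGLE_LIMIT
            or abs(cart_pos) > POSITION_LIMIT
            or t >= MAX_STEPS)
    return 1.0, done   # reward is 1 for every time step

def solved(episode_rewards):
    """The task counts as solved at an average reward of 195."""
    return sum(episode_rewards) / len(episode_rewards) >= 195.0
```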
- Actor maps states to actions
- Critic learns the ‘value’ of states (via rewards)
- Improve the actor so its actions beat the critic's estimate
Example by Apoorv Nandan, 2020/05/13, https://keras.io/examples/rl/actor_critic_cartpole/
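The ‘beat the critic’ step hinges on the temporal-difference error; a minimal sketch (the function name and defaults are illustrative, not from the linked example):

```python
# Sketch of the actor-critic interplay: the critic's value estimates yield a
# TD error, which signals whether the taken action beat the critic's estimate.
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s); positive => better than expected."""
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_s
```

A positive delta increases the probability of the action the actor just took; a negative delta decreases it.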
Tuned parameters
- num_hidden: NN units, 8, …, 128
- learning_rate of the NN (\(\log_{10}\) scale): -4, …, 0
- gamma: actor-critic discount factor, 0.5, …, 1.0
Surrogate model: Gaussian process
Runtime budget: 2 hours
5 replications for each parameter configuration
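The search space on this slide could be encoded and sampled as follows; the ranges are as stated above, while the helper itself is hypothetical:

```python
# Sketch: encoding the tuned parameter space and drawing one configuration.
import random

random.seed(42)

SEARCH_SPACE = {
    "num_hidden": (8, 128),      # NN units, integer
    "learning_rate": (-4, 0),    # log10 scale
    "gamma": (0.5, 1.0),         # actor-critic discount factor
}

def sample_config():
    lo, hi = SEARCH_SPACE["num_hidden"]
    num_hidden = random.randint(lo, hi)
    lo, hi = SEARCH_SPACE["learning_rate"]
    learning_rate = 10 ** random.uniform(lo, hi)  # decode from log10 scale
    lo, hi = SEARCH_SPACE["gamma"]
    gamma = random.uniform(lo, hi)
    return {"num_hidden": num_hidden,
            "learning_rate": learning_rate,
            "gamma": gamma}

cfg = sample_config()
```

Sampling the learning rate on a log10 scale ensures that, e.g., 1e-4 and 1e-3 are as likely to be explored as 1e-1 and 1.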
Use of resources
- How many replications do we need?
- How long should we let the agent interact with the environment?
- How do we balance the two (they interact)?
Multiple goals:
- Maximize reward
- Minimize runtime
- Minimize memory and CPU use (e.g., for edge devices)
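With several conflicting goals, configurations can be compared by Pareto dominance; a small sketch with made-up measurements:

```python
# Sketch: Pareto filtering for the goals above (maximize reward, minimize
# runtime and memory). The example measurements are invented for illustration.
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one.
    Each entry: (reward, runtime, memory); reward maximized, rest minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

configs = [(195, 60, 128), (180, 30, 64), (195, 90, 256), (150, 90, 256)]
front = pareto_front(configs)
```

Instead of one winner, the front keeps every configuration that represents a different reward/resource trade-off.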
Neural architecture search
- Representations (e.g., blocks / cells, chains, ‘unrestricted’)
- Search operators
- Measuring network similarities / kernels
- Conditional parameter spaces
- High number of hyperparameters
- Transfer learning