Search and Planning#
Author: Johannes Maucher
Last Update: 10.10.2025
Introduction#
As described in the previous section, a rational agent perceives at each time step a state of the environment and has to decide which action to take in this state, such that in the long term it maximizes its performance measure.
When the best action to take is not immediately obvious, the agent must plan ahead: for a given current state and a targeted goal state, the agent has to plan a sequence of actions which leads most efficiently from the current state to the goal state. This process is called planning and the corresponding agent is a planning agent.
Two variants of planning agents are goal-based agents and utility-based agents.
For planning agents a solution is a sequence of actions, which leads from the perceived start-state to a goal-state.
In this section, in addition to planning, the related task of optimization is also discussed. In contrast to planning, the solution of an optimization task is just a good or even the best state; the path to this state is not relevant.
Goal-based Agent#
For a goal-based agent one or more goals are given, and the agent's task is to plan a sequence of actions from the perceived current state to one of the goal states. For example, in navigation the start location is the current state, the target location is the goal state, and a path from the former to the latter must be calculated.

For goal-based agents, uninformed and informed algorithms are distinguished. In the uninformed case the agent knows in each state only
the state itself and
the actions available in this state.
In the informed case the agent can additionally estimate the proximity of a state to the goal-state. This estimate is also called a heuristic, and the corresponding algorithms are heuristic search algorithms.
Utility-based Agent#
For utility-based agents a function to measure the utility of states must be available. The benefits of such a measure are:
sometimes it is not possible to define a concrete goal state. If a utility-function is available, the search target can instead be to find a good state, i.e. a state with a high utility, or even the best state (the state with maximum utility).
sometimes different goals are possible. If a utility-function is available in this case, planning can be focused on finding the goal state with the highest utility.
In optimization problems the task is not to find the best sequence of actions from a current state to a goal state. Instead, one just has to determine a good state. The utility-function is used to measure the goodness of a state.

Applications#
Navigation: \(\Rightarrow\) Uniform-Cost- / A*-Algorithm
Initial State: City (location) to start from.
Task: Find the best route from start- to destination city (location)
General Planning Problems:
Image from the action film Die Hard 3
Initial State: Jugs have a volume of \((5,3)\) and an initial fill level of \((v_1,v_2)=(0,0)\)
Task: How to decant jugs, such that their fill level is \((v_1^*,v_2^*)=(4,0)\)?
8-Puzzle: \(\Rightarrow\) Breadth-First-Algorithm
Initial State: Puzzle with arbitrarily ordered plates on it.
Task: Move the plates such that their numbers are in increasing order
Boardgames (Chess) \(\Rightarrow\) MinMax-Algorithm
Initial State: Current configuration on the chess board
Task: Plan actions, such that you win.
Go: \(\Rightarrow\) Monte-Carlo-Tree-Search
Initial State: Current configuration on the Go board
Task: Plan actions, such that you win.
Constraint Satisfaction Problem:
Coloring Problem:
Task: Color regions with either red, green or blue, such that no pair of neighboring regions have the same color.
Sudoku: \(\Rightarrow\) Constraint-Satisfaction Problem (Backtracking)
Task: Fill the empty locations with integers between 1 and 9, such that no column, no row and no block contains any number more than once.
Optimize Configurations:
Task: Find a Printed Circuit Board design, such that the number of intersections is minimal
General Logistics and Scheduling Problems: \(\Rightarrow\) Constraint Satisfaction Problem / Genetic Algorithm
Task: Load the containers such that, in the end, the empty space is as small as possible and a bunch of constraints is fulfilled.
Requirements#
The algorithms, discussed in this notebook require an environment, which is
Fully observable: The current state of the environment is completely observable
Deterministic: Given a state \(s_t\) and a possible action \(a_t\), the successor state \(s_{t+1}\) is unique.
Static: Environment does not change during agent’s thinking.
Discrete: State-space, action-space and time are discrete
The algorithms described here are offline and open-loop. Offline means that the agent executes actions only after finding the complete solution. Open-loop means that all actions of the solution are executed blindly, without regarding any feedback that may become available during action execution.
Problem Specification#
For each problem, which shall be solved by an approach of this notebook, the following must be specified:
What is a state?
What is the initial state? What is the goal-state?
If the initial- and/or goal-state is hard to define, a utility-function may do the job
What is an action? Which actions are available in which states?
What are the costs of actions?
For heuristic algorithms: How to estimate the utility of an arbitrary state
Performance Metrics for Search- and Planning-Algorithms#
Completeness: Algorithm guarantees to find a solution.
Optimality: Algorithm guarantees to find the optimal solution.
Complexity: In terms of time and memory
Global: Solution is a path from initial- to goal-state
Local: Solution is a good state
Algorithm Categories#
uninformed
heuristic (informed)
global
local
goal-based
utility-based
Global, Uninformed Search#
All of the algorithms discussed below find a solution, i.e. a path from the start- to the target-node, by constructing a search-tree.
Breadth-First
Depth-First
Uniform Cost Search
A crucial factor which limits the application of tree-based search algorithms is complexity. Hence runtime- and memory-complexity should be estimated in advance. Both runtime- and memory-complexity scale with the number of nodes \(n\) that must be generated. In order to estimate \(n\) one estimates
the expected mean branching factor \(b\). This is the mean number of edges branching from a node.
the expected mean depth \(d\) of the goal
Given these two estimates, the estimated number of nodes that must be generated is
\(n = b + b^2 + b^3 + \cdots + b^d \in O(b^d)\)
If \(k_t\) is the time required to generate a single node, then the estimated time to find a solution is
\(t = k_t \cdot n \in O(b^d)\)
If \(k_m\) is the memory required to store a single node and if all generated nodes have to be stored until a solution is found, the estimated memory footprint is
\(m = k_m \cdot n \in O(b^d)\)
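These estimates can be evaluated for concrete values; the numbers for \(b\), \(d\), \(k_t\) and \(k_m\) below are assumed purely for illustration:

```python
# Complexity estimate for a search tree; b, d, k_t, k_m are assumed values.

def estimate_nodes(b, d):
    """Number of generated nodes: n = b + b^2 + ... + b^d (geometric sum)."""
    return sum(b**k for k in range(1, d + 1))

b, d = 10, 5                     # assumed mean branching factor and goal depth
n = estimate_nodes(b, d)
print(n)                         # 111110 generated nodes
print(n * 1e-6, "seconds")       # if generating one node takes k_t = 1 microsecond
print(n * 100 / 2**20, "MiB")    # if storing one node takes k_m = 100 bytes
```

Even for this modest branching factor the node count grows by a factor of \(b\) per additional level, which is why the depth of the goal dominates both time and memory.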
Example: Navigation: Determine the best path from node A to node K
Breadth-First#
Start from the initial node
Expand the node by applying all available actions of this state and obtain the corresponding successor states.
Select next state to expand, which is the topmost not yet expanded node.
If the selected node is the goal-state: terminate and return the path from initial- to goal-state. Else: continue with step 2.
For the Breadth-First algorithm both runtime- and memory-complexity are estimated as described above, i.e. \(t \in O(b^d)\) and \(m \in O(b^d)\).
In Breadth-First all generated nodes have to be stored until the solution is found.
Breadth-First is complete. Moreover, in the case where all edge-costs are the same, it is also optimal.
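The steps above can be sketched in a few lines of Python. The adjacency lists below are an invented toy graph (the node names only loosely echo the navigation example, the edges are assumed):

```python
from collections import deque

# Breadth-First search on an assumed toy graph (adjacency lists).
graph = {
    "A": ["B", "E"],
    "B": ["C", "G"],
    "E": ["G", "H"],
    "C": [], "G": ["K"], "H": ["K"], "K": [],
}

def breadth_first(start, goal):
    frontier = deque([[start]])          # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()        # topmost (shallowest) node first
        node = path[-1]
        if node == goal:
            return path                  # path from initial- to goal-state
        for succ in graph[node]:         # expand: generate all successors
            if succ not in visited:
                visited.add(succ)
                frontier.append(path + [succ])
    return None

print(breadth_first("A", "K"))           # ['A', 'B', 'G', 'K']
```

The FIFO queue is exactly what makes the search "breadth-first": all nodes of depth \(k\) are expanded before any node of depth \(k+1\).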
Depth-First#
Start from the initial node
Expand the node by applying all available actions of this state and obtain the corresponding successor states.
Select next state to expand, which is the deepest not yet expanded node.
If the selected node is the goal-state: terminate and return the path from initial- to goal-state. Else: continue with step 2.
For the Depth-First algorithm the runtime-complexity is estimated as described above, i.e. \(t \in O(b^d)\).
However, the main advantage of Depth-First is its decreased memory demand (compared to Breadth-First). This is because subtrees which have been traversed completely without finding a solution can be removed before new subtrees are generated. Memory complexity is \(m \approx k_m \cdot b \cdot d \in O(b \cdot d)\).
Depth-first is complete but not optimal.
Variants of Depth-First#
Backtracking: In contrast to the tree-based algorithms discussed so far, backtracking does not generate all successors of a node in the node-expansion phase. Instead, in each step only one successor is generated. This further reduces memory demand. Memory complexity is then \(m \approx k_m \cdot d \in O(d)\).
Limited Depth-First Search: In this variant a maximum depth \(L\) can be configured, beyond which the tree cannot be expanded. This helps to control complexity. However, this algorithm is obviously not complete.
Iterative Deepening Algorithm (IDA): This variant applies Limited Depth-First Search in many iterations. In each iteration the maximum depth \(L\) is incremented, either until a solution is found or until a pre-configured maximum value for \(L\) is reached. After each iteration in which no solution has been found, the entire tree is discarded, and in the next iteration the tree is generated from scratch. The drawback of this variant is that many nodes must be generated multiple times (once per iteration). However, the algorithm combines the advantages of Depth-First and Breadth-First search: it is memory-efficient, and it is optimal if all edge-costs are equal and the solution does not lie beyond a possibly configured maximum value of \(L\) (= number of iterations).
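Iterative Deepening can be sketched as a depth-limited recursive search restarted with a growing limit. The graph is again an invented toy example, and the sketch assumes an acyclic state-space (no visited-set, so cycles would loop):

```python
# Iterative Deepening sketch: depth-limited DFS restarted with growing limit.
# The adjacency lists are an assumed, acyclic toy graph.
graph = {
    "A": ["B", "E"],
    "B": ["C", "G"],
    "E": ["G", "H"],
    "C": [], "G": ["K"], "H": ["K"], "K": [],
}

def depth_limited(node, goal, limit, path):
    if node == goal:
        return path
    if limit == 0:                       # maximum depth L reached
        return None
    for succ in graph[node]:
        result = depth_limited(succ, goal, limit - 1, path + [succ])
        if result is not None:
            return result
    return None

def iterative_deepening(start, goal, max_depth=10):
    for limit in range(max_depth + 1):   # increment L until a solution is found
        result = depth_limited(start, goal, limit, [start])
        if result is not None:
            return result
    return None

print(iterative_deepening("A", "K"))     # ['A', 'B', 'G', 'K']
```

Shallow levels are re-generated in every iteration, but since the deepest level dominates the node count, the overhead is usually acceptable.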
Uniform Cost search (UCS)#
In the example:
Expand node A
Expand node E
Expand node B
Expand node G in level 1
Expand node G in level 2
Expand node H in level 2
Expand node C in level 2 …
In general:
Start from the initial node
Expand the node by applying all available actions of this state and obtain the corresponding successor states.
Select next state \(s\) to expand, which is the not yet expanded node, with the lowest accumulated costs \(g(s)\).
If the selected node is the goal-state: terminate and return the path from initial- to goal-state. Else: continue with step 2.
UCS guarantees to find the optimal solution, even if the edge-costs vary.
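A minimal UCS sketch uses a priority queue ordered by the accumulated costs \(g(s)\); the edge costs below are invented for illustration, not the ones from the figure:

```python
import heapq

# Uniform Cost Search sketch; graph edges and costs are assumed toy values.
graph = {
    "A": [("B", 2), ("E", 1)],
    "B": [("C", 4), ("G", 7)],
    "E": [("G", 3), ("H", 6)],
    "C": [], "G": [("K", 2)], "H": [("K", 1)], "K": [],
}

def uniform_cost(start, goal):
    frontier = [(0, start, [start])]     # priority queue ordered by g(s)
    explored = set()
    while frontier:
        g, node, path = heapq.heappop(frontier)  # lowest accumulated cost first
        if node == goal:
            return g, path
        if node in explored:
            continue
        explored.add(node)
        for succ, cost in graph[node]:
            heapq.heappush(frontier, (g + cost, succ, path + [succ]))
    return None

print(uniform_cost("A", "K"))            # (6, ['A', 'E', 'G', 'K'])
```

Because a node is only accepted as the goal when it is popped from the queue (not when it is generated), the first returned path is guaranteed to be the cheapest one.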
Heuristic Search#
A* algorithm
Find solution by construction of search-tree from current-state to goal-state
Which node shall be expanded next? The choice regards not only the previous costs, but also the estimated costs to the goal.
For navigation tasks the typical heuristic at a node k is the line-of-sight-distance between node k and the goal state.
In the example below (8-puzzle), the heuristic measures the number of plates, which are not at their correct (ordered) position.
A*-Algorithm#
In the example:
Expand node A
Expand node E
Expand node G
Expand node H
Expand node K (one of the two), but first check if it is the goal: K is the goal -> Return path from A to K
In general:
Start from the initial node
Expand the node by applying all available actions of this state and obtain the corresponding successor states.
Select next state \(s\) to expand, which is the not yet expanded node, with the lowest value of \(f(s)=g(s)+h(s)\), where
\(g(s)\) are the actual accumulated costs from the start-node to node \(s\).
\(h(s)\) are the estimated costs (heuristic) from node \(s\) to the goal-node.
If the selected node is the goal-state: terminate and return the path from initial- to goal-state. Else: continue with step 2.
Under- and Over-estimating heuristics:
The \(A^*\)-algorithm guarantees to find the optimal solution if the heuristic is admissible.
An admissible heuristic is a heuristic which is guaranteed never to overestimate the actual costs. For example the line-of-sight heuristic, which is typically applied for pathfinding, is admissible.
Heuristics which overestimate the actual costs have
the disadvantage that they cannot guarantee to find the optimal solution
the advantage that they often find a solution faster
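The \(A^*\) selection rule \(f(s)=g(s)+h(s)\) can be sketched like UCS with the heuristic added to the queue ordering. Graph edges and the heuristic values below are invented; the values are chosen so that \(h\) is admissible:

```python
import heapq

# A* sketch: expand the node with lowest f(s) = g(s) + h(s).
# Graph edges and heuristic values are assumed toy values (h is admissible).
graph = {
    "A": [("B", 2), ("E", 1)],
    "B": [("G", 7)],
    "E": [("G", 3), ("H", 6)],
    "G": [("K", 2)], "H": [("K", 1)], "K": [],
}
h = {"A": 5, "B": 6, "E": 4, "G": 2, "H": 1, "K": 0}   # never overestimates

def a_star(start, goal):
    frontier = [(h[start], 0, start, [start])]   # queue ordered by f = g + h
    explored = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in explored:
            continue
        explored.add(node)
        for succ, cost in graph[node]:
            g2 = g + cost                        # actual accumulated costs
            heapq.heappush(frontier, (g2 + h[succ], g2, succ, path + [succ]))
    return None

print(a_star("A", "K"))                          # (6, ['A', 'E', 'G', 'K'])
```

With \(h \equiv 0\) the sketch degenerates to Uniform Cost Search; a good heuristic steers the expansion towards the goal and reduces the number of generated nodes.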
2-Player Games#
The algorithm described in this section is applicable for all games with the following characteristics:
2 players
fully observable
deterministic
Zero-Sum
Examples: Checkers, Chess, Go, Reversi, Backgammon, …
MinMax-Algorithm#
Evaluation of states in level \(L\) (planning horizon)

Root node of the tree is the current state of the game. Player Max is in turn.
Max has two possible actions, the corresponding successor states are the states for which the other player, Min, must find a good action.
In each of the two states in Min's level, 2 actions are available. Again, all successor states can be generated for the next level, which belongs to Max.
…
In this way a tree can be calculated down to a level \(L\). This level \(L\) constitutes the planning horizon and depends on the affordable complexity.
Once all nodes in level \(L\) are generated a utility function is applied to all states in this level.
If level \(L-1\) belongs to Min, he will choose in each state the action which leads to a successor of minimum utility, since minimum utility for Max is maximum utility for Min. Therefore, each node in level \(L-1\) is assigned the minimum utility of its successor nodes.
In the same way, each node in level \(L-2\) is assigned the maximum utility of its successor nodes in level \(L-1\).
This process is repeated up to the root node. Player Max then knows his next action: the one which leads to a successor with maximum utility-value.
Player Min selects actions to nodes with minimum utility for Player Max:

Player Max selects actions to nodes with maximum utility:

Player Min selects actions to nodes with minimum utility for Player Max:

Player Max selects actions to nodes with maximum utility:

The Tic-Tac-Toe example below actually allows planning down to the terminal states. In this case the heuristic state-evaluation function is replaced by a function which assigns \(1\) to a win and \(-1\) to a loss.
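The alternating maximization and minimization described above can be sketched recursively. The game tree and the utilities at the planning horizon below are hand-made example values (from Max's point of view):

```python
# Minimax sketch on an assumed two-level game tree; leaf utilities are
# invented example values at the planning horizon L.
tree = {
    "root": ["m1", "m2"],     # Max's two possible actions
    "m1": ["l1", "l2"],       # Min's choices below m1
    "m2": ["l3", "l4"],       # Min's choices below m2
}
utility = {"l1": 3, "l2": 5, "l3": 2, "l4": 9}

def minimax(node, maximizing):
    if node in utility:                  # planning horizon reached: evaluate
        return utility[node]
    values = [minimax(child, not maximizing) for child in tree[node]]
    return max(values) if maximizing else min(values)

print(minimax("root", True))             # 3: Max picks m1 (Min answers with l1)
```

Here Min reduces m1 to \(\min(3,5)=3\) and m2 to \(\min(2,9)=2\); Max therefore chooses the action leading to m1 with value \(\max(3,2)=3\).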
Alpha-Beta Pruning#
Alpha-Beta Pruning is an improvement of the MinMax algorithm in the sense that it reduces the number of nodes that have to be generated and evaluated, by applying a simple logic. The process is as follows:
Create tree with limited depth search from left to right
As in MinMax-search evaluate nodes in the deepest level by applying a domain-specific evaluation function
For each max-node store the maximum value of the underlying subtree traversed so far in the variable \(\alpha\).
For each min-node store the minimum value of the underlying subtree traversed so far in the variable \(\beta\).
If at a minimum node \(k\) the current value \(\beta\) is smaller than \(\alpha\), the search under \(k\) can be terminated. Here \(\alpha\) is the largest value of a maximum node on the path from the root to \(k\).
If at a maximum node \(k\) the current value \(\alpha\) is larger than \(\beta\), the search under \(k\) can be terminated. Here \(\beta\) is the smallest value of a minimum node on the path from the root to \(k\).
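The pruning rules can be sketched on the same kind of hand-made game tree as in the MinMax example (tree and leaf utilities are assumed values):

```python
# Alpha-Beta sketch on an assumed game tree with invented leaf utilities.
tree = {
    "root": ["m1", "m2"],
    "m1": ["l1", "l2"],
    "m2": ["l3", "l4"],
}
utility = {"l1": 3, "l2": 5, "l3": 2, "l4": 9}

def alphabeta(node, alpha, beta, maximizing):
    if node in utility:
        return utility[node]
    if maximizing:
        value = float("-inf")
        for child in tree[node]:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)        # best value for Max so far
            if alpha >= beta:                # cutoff: Min will avoid this branch
                break
        return value
    else:
        value = float("inf")
        for child in tree[node]:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)          # best value for Min so far
            if beta <= alpha:                # cutoff: Max will avoid this branch
                break
        return value

print(alphabeta("root", float("-inf"), float("inf"), True))   # 3
```

In this tiny example the leaf l4 is never evaluated: after l3 yields 2 at the min-node m2, \(\beta = 2 \leq \alpha = 3\) holds and the subtree under m2 is cut off, without changing the result.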
Demo of MinMax- and AlphaBeta-Pruning-Algorithm
A nice simulation of MinMax- and AlphaBeta-Pruning can be found here: https://raphsilva.github.io/utilities/minimax_simulator/#
Some remarks on the complexity of real boardgames:
In chess:
the effective mean branching factor is \(b=35\)
the mean number of turns is 40, i.e. \(d=80\) halfturns
the number of leaf-nodes of the complete tree would then be \(35^{80} \approx 10^{123}\).
Typically, for chess AlphaBeta-pruning with a depth of 8 halfturns is applied. Of course, the quality of the algorithm strongly depends on the evaluation function. Good evaluation functions for chess can be found here: https://www.chessprogramming.org/Evaluation.
For the boardgame Go it's even worse:
the effective mean branching factor is \(b=250\)
the mean number of turns is 75, i.e. \(d=150\) halfturns
it’s even more challenging to find a good evaluation function.
Complexity-estimates for other popular boardgames can be found here: https://de.wikipedia.org/wiki/Spiel-Komplexität.
Monte-Carlo Tree Search (MCTS)#
The concept of a limited planning horizon, as applied in the MinMax-algorithm, is one way to handle complex problems which cannot be planned to the goal-state in one step.
Another concept is to construct a tree down to a final state in such a way that in each state only promising actions and corresponding successors are generated. This concept is applied by Monte-Carlo Tree Search.
Monte-Carlo Tree Search:
Selection: Starting from the root node, apply Tree Policy in order to select next action. Tree Policy exploits node-statistics
Expansion: In a leaf-node select an arbitrary action and generate the corresponding new child-node.
Simulation: Simulate the game, starting from the new child-node. Action-Selection during simulation according to Default Policy, e.g.:
Randomly (Pure Monte Carlo)
Favor actions with higher estimated chance of success, if corresponding heuristics are available.
At the end of each simulation: Game is won or lost.
Backpropagation: Adapt node-statistics of all nodes in tree, starting from the leaf-node of previous simulation.
When the computational budget is reached, MCTS returns the best action \(a\) for the given root node.
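A common choice for the Tree Policy in the selection phase is the UCB1 score, which balances exploitation (win rate) against exploration (rarely visited actions). The node-statistics below (wins, visits) are assumed example values:

```python
import math

# UCB1 score, a common Tree Policy for the MCTS selection phase.
# The node-statistics (wins, visits) below are assumed example values.
def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    if visits == 0:
        return float("inf")       # unvisited children are always tried first
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

children = {"a1": (7, 10), "a2": (3, 8), "a3": (0, 0)}   # action -> (wins, visits)
parent_visits = 18
best = max(children, key=lambda a: ucb1(*children[a], parent_visits))
print(best)                       # 'a3': the unvisited action is selected
```

After every simulation the backpropagation phase updates these (wins, visits) statistics along the path from the simulated leaf back to the root, so the scores and hence the selected branches change over time.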
Characteristics of MCTS:
Applicable even for games with very large branching factor
Basic Algorithm easy to implement
Configurable stop-time (Longer time yields better game play)
Doesn’t require domain knowledge
AlphaGo and AlphaZero#
AlphaGo is MCTS, which applies (Deep) Neural Networks in order to calculate good heuristics for
the next action to choose in the selection-phase
which leaf-node to extend in the expansion phase
which actions to choose in the simulation phase
In AlphaGo the networks for the heuristics in the selection- and the simulation-phase are trained from expert-moves
In AlphaZero no Domain-knowledge (database of expert moves) is required.
A nice summary of MCTS and AlphaGo can be found here: https://www.analyticsvidhya.com/blog/2019/01/monte-carlo-tree-search-introduction-algorithm-deepmind-alphago/
Constraint Satisfaction Problems (CSP)#
Problem Specification#
Set of variables: \(\mathcal{X} = \{X_1, X_2, \ldots, X_n\}\)
Domain \(D_i\) of a variable \(X_i\) is the set of possible values for \(X_i\).
Set of constraints: \(\mathcal{C} = \{C_1, C_2, \ldots, C_m\}\)
Each constraint refers to a set of variables from \(\mathcal{X}\)
Problem Specification (continued):
State is a concrete assignment of values to a set of variables from \(\mathcal{X}\)
An assignment of values, which does not violate any constraint is called consistent.
Complete Assignment: If values are assigned to all variables
Solution of CSP: If a complete assignment is consistent.
Sometimes solutions that maximize a utility function are required.
Backtracking Algorithm#
Backtracking is a variant of depth-first search which, in each level, assigns values to only one variable.
Values are assigned to variables such that the assignment stays consistent
Goal test: Check if the assignment is complete and consistent
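A backtracking sketch for the map-coloring CSP mentioned above; the neighbor relation below is an assumed toy map (a fragment of the classic Australia example), not the one from the figure:

```python
# Backtracking sketch for map coloring; the neighbor relation is an
# assumed toy map. Variables = regions, domain = three colors.
neighbors = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q"], "Q": ["NT", "SA"],
}
colors = ["red", "green", "blue"]

def consistent(region, color, assignment):
    """A value is consistent if no assigned neighbor has the same color."""
    return all(assignment.get(n) != color for n in neighbors[region])

def backtrack(assignment):
    if len(assignment) == len(neighbors):        # complete and consistent
        return assignment
    region = next(r for r in neighbors if r not in assignment)
    for color in colors:                         # assign one variable per level
        if consistent(region, color, assignment):
            result = backtrack({**assignment, region: color})
            if result is not None:
                return result
    return None                                  # dead end: backtrack

solution = backtrack({})
print(solution)
```

Each recursion level corresponds to one variable; inconsistent values are rejected immediately, so whole subtrees of the depth-first search are pruned.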
Local Search / Optimisation Algorithms#
Genetic Algorithm#
Realizes concept of Darwin’s Theory of Evolution
Population of individuals
Individuals are selected (randomly).
Selection probability depends on fitness of individuals
Selected individuals generate new individuals by crossover (randomly)
There may be random mutations in new individuals
Each individual has a fitness
Only the fittest survive
Problem Specification:#
What is an individual (a state)?
Define random selection
Define function for crossover
Define function for mutation
Define fitness-function
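The specification above can be illustrated on the toy "OneMax" task, where an individual is a bit-string and its fitness is the number of ones. All parameters (population size, mutation rate, etc.) are assumed example values:

```python
import random

# Minimal genetic algorithm sketch on the toy "OneMax" task; all
# parameters are assumed example values.
random.seed(0)

def fitness(ind):
    return sum(ind)                              # number of ones

def crossover(p1, p2):
    cut = random.randrange(1, len(p1))           # random one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):
    return [1 - b if random.random() < rate else b for b in ind]

def evolve(pop_size=20, length=16, generations=50):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # fitness-proportional (roulette-wheel) selection of parents
        parents = random.choices(pop, weights=[fitness(i) + 1 for i in pop],
                                 k=pop_size)
        pop = [mutate(crossover(parents[i], parents[-1 - i]))
               for i in range(pop_size)]
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

The sketch shows all five specification items: individuals (bit-strings), random fitness-proportional selection, a crossover function, a mutation function, and a fitness-function.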
Applications#
Genetic Algorithms are applied to a wide range of search- and optimisation problems.
Examples:
Best solution in the Daimler container packing project
Wireless Network optimisation
Neural network optimisation
Logistics and Scheduling problems of all types
Find optimal parameters for complex control systems (motor-control)
Gaming AI
Evolution of artificial agents
Procedural Content Generation
Exercises#
Demonstrate how \(A^*\)-algorithm finds the best path from Frankfurt to Ulm, given the Map and the Line-of-Sight distances below:
Solutions#
Solution Task 1: