Which of the following accurately describe the purpose and significance of Markov Decision Processes (MDPs) in reinforcement learning? (Select all that apply)
- A. To serve as the foundational mathematical model for reinforcement learning
- B. To formalize sequential decision making under uncertainty
- C. To provide a framework for defining optimal policies
- D. To directly implement a robotic control algorithm
- E. To enable modeling of state transitions and rewards
- F. To quantify the long-term value of actions and states
Show Answer
Correct Answers: A, B, C, E, F
Explanation:
MDPs serve multiple important purposes in reinforcement learning beyond just being a theoretical framework.
"MDPs can be thought of as a theoretical framework underlying RL." "MDPs are a mathematical formulation of the sequential decision making problem that capture all the essential elements of the RL problem." "The MDP framework enables us to quantify the value of different states and actions in terms of expected future rewards."
Which of the following are components of a standard MDP definition?
- A. A reward function
- B. A value function
- C. A discount factor
- D. A transition probability distribution
- E. A Q-function
Show Answer
Correct Answers: A, C, D
Explanation:
The MDP tuple includes states, actions, a reward distribution, a transition probability distribution, and a discount factor. The value function and Q-function are quantities derived from an MDP, not components of its definition.
"An MDP is defined as a tuple of five items. S... A... R is the reward distribution... T is the transition probability distribution... Gamma is a discount factor..."
The Markov property implies that the next state depends on the entire history of states and actions.
- A. True
- B. False
Show Answer
Correct Answers: B
Explanation:
The Markov property asserts dependence only on the current state and action, not the full history.
"The distribution of possible next states given state s, and action a. Does not depend on any of the previous states or actions..."
Which of the following accurately describe the type of information and feedback available to an agent in a typical MDP learning scenario? (Select all that apply)
- A. Samples from the reward and transition distributions
- B. Observations of states after taking actions
- C. Reward signals after actions are taken
- D. Full knowledge of the underlying transition probabilities
- E. Complete reward tables for all state-action pairs
- F. Information about all possible outcomes before choosing an action
Show Answer
Correct Answers: A, B, C
Explanation:
Agents observe samples of transitions and rewards, but typically do not have full knowledge of the underlying MDP dynamics.
"The transition distribution and the reward distribution are both not known. Instead, only samples from these distributions are observed by the agent..." "The agent observes states and rewards after taking actions, building up experience rather than being given complete information about the environment."
Which of the following accurately describe a deterministic or stochastic policy?
- A. A deterministic policy maps states to actions
- B. A stochastic policy maps actions to states
- C. A stochastic policy defines a probability distribution over actions
- D. A deterministic policy stores one action per state
Show Answer
Correct Answers: A, C, D
Explanation:
Deterministic and stochastic policies differ in how they assign actions to states: a single stored action per state versus a probability distribution over actions. (Option B reverses the mapping; policies map states to actions, not actions to states.)
"A deterministic policy is defined as a mapping from states to actions... A stochastic policy is defined as a probability distribution of actions given a state..."
Which of the following are influenced by or related to the discount factor \( \gamma \) in an MDP? (Select all that apply)
- A. How much weight is given to future rewards
- B. The effective planning horizon of the agent
- C. The stability of value function calculations
- D. The convergence rate of value-based algorithms
- E. The number of available actions per state
- F. Whether immediate rewards are valued more than distant ones
Show Answer
Correct Answers: A, B, C, F
Explanation:
The discount factor has several important effects on MDP behavior and solutions.
"The discount factor gamma lies between 0 and 1... implying that the rewards at earlier timestamps, are given more weight..." "A discount factor close to 0 makes the agent myopic (focused on immediate rewards), while a value close to 1 makes it consider the long-term future rewards." "The discount factor also ensures mathematical convergence of infinite sums in continuing tasks."
Lower values of the discount factor \( \gamma \) cause the agent to prioritize short-term rewards.
- A. True
- B. False
Show Answer
Correct Answers: A
Explanation:
Low gamma places more emphasis on near-term rewards. For example, with \( \gamma = 0.1 \), a reward of 10 arriving three steps in the future is worth only \( 10 \times 0.1^3 = 0.01 \) today, so the agent favors whatever pays off sooner.
"...a lower value of gamma prioritizes the lower-rewarding state at the right endpoint."
Which of the following are value-related functions defined in the lecture?
- A. State-value function (V-function)
- B. Reward matrix
- C. State-action value function (Q-function)
- D. Advantage function
Show Answer
Correct Answers: A, C
Explanation:
Both value functions are introduced explicitly to evaluate policies.
"A value function... is a prediction of discounted sum of future rewards."
"A state action value function or a Q-function... informs us of how good is taking a particular action at a state."
What happens when the reward at non-absorbing states is made more negative?
- A. The agent becomes more cautious and avoids all rewards
- B. The optimal policy becomes independent of the environment
- C. The agent tends to prefer shorter or riskier paths to reach absorbing states
- D. The discount factor becomes irrelevant
Show Answer
Correct Answers: C
Explanation:
As the constant per-step reward becomes more negative, the optimal policy shifts toward quicker (even riskier) paths to an absorbing state.
"...as this constant reward decreases to -0.4... the optimal policy... takes the riskier shorter path..."
"Further, decreasing this constant to -2... the optimal policy now prefers the -1 absorbing state..."
A Q-function provides an estimate of future rewards for a given action taken at a particular state.
- A. True
- B. False
Show Answer
Correct Answers: A
Explanation:
The Q-function gives the expected discounted future reward for a given state-action pair.
"The Q-function for a policy... is the expected sum of discounted rewards... after taking action a at state s."