Which of the following accurately describe the purpose and significance of Markov Decision Processes (MDPs) in reinforcement learning? (Select all that apply)
- A. To serve as the foundational mathematical model for reinforcement learning
- B. To formalize sequential decision making under uncertainty
- C. To provide a framework for defining optimal policies
- D. To directly implement a robotic control algorithm
- E. To enable modeling of state transitions and rewards
- F. To quantify the long-term value of actions and states
Show Answer
Correct Answers: A, B, C, E, F
Explanation:
MDPs serve multiple important purposes in reinforcement learning beyond just being a theoretical framework.
"MDPs can be thought of as a theoretical framework underlying RL." "MDPs are a mathematical formulation of the sequential decision making problem that capture all the essential elements of the RL problem." "The MDP framework enables us to quantify the value of different states and actions in terms of expected future rewards."
Which of the following are components of a standard MDP definition?
- A. A reward function
- B. A value function
- C. A discount factor
- D. A transition probability distribution
- E. A Q-function
Show Answer
Correct Answers: A, C, D
Explanation:
The MDP tuple includes states, actions, a reward distribution, a transition probability distribution, and a discount factor. The value function and Q-function are quantities derived from an MDP, not components of its definition.
"An MDP is defined as a tuple of five items. S... A... R is the reward distribution... T is the transition probability distribution... Gamma is a discount factor..."
The Markov property implies that the next state depends on the entire history of states and actions.
- A. True
- B. False
Show Answer
Correct Answers: B
Explanation:
The Markov property asserts dependence only on the current state and action, not the full history.
"The distribution of possible next states given state s, and action a. Does not depend on any of the previous states or actions..."
Which of the following accurately describe the type of information and feedback available to an agent in a typical MDP learning scenario? (Select all that apply)
- A. Samples from the reward and transition distributions
- B. Observations of states after taking actions
- C. Reward signals after actions are taken
- D. Full knowledge of the underlying transition probabilities
- E. Complete reward tables for all state-action pairs
- F. Information about all possible outcomes before choosing an action
Show Answer
Correct Answers: A, B, C
Explanation:
Agents observe samples of transitions and rewards, but typically do not have full knowledge of the underlying MDP dynamics.
"The transition distribution and the reward distribution are both not known. Instead, only samples from these distributions are observed by the agent..." "The agent observes states and rewards after taking actions, building up experience rather than being given complete information about the environment."
Which of the following accurately describe a deterministic or stochastic policy?
- A. A deterministic policy maps states to actions
- B. A stochastic policy maps actions to states
- C. A stochastic policy defines a probability distribution over actions
- D. A deterministic policy stores one action per state
Show Answer
Correct Answers: A, C, D
Explanation:
Deterministic and stochastic policies differ in how they assign actions to states: a single stored action per state versus a probability distribution over actions. (Option B reverses the mapping; policies map states to actions, not actions to states.)
"A deterministic policy is defined as a mapping from states to actions... A stochastic policy is defined as a probability distribution of actions given a state..."
Which of the following are influenced by or related to the discount factor \( \gamma \) in an MDP? (Select all that apply)
- A. How much weight is given to future rewards
- B. The effective planning horizon of the agent
- C. The stability of value function calculations
- D. The convergence rate of value-based algorithms
- E. The number of available actions per state
- F. Whether immediate rewards are valued more than distant ones
Show Answer
Correct Answers: A, B, C, F
Explanation:
The discount factor has several important effects on MDP behavior and solutions.
"The discount factor gamma lies between 0 and 1... implying that the rewards at earlier timestamps, are given more weight..." "A discount factor close to 0 makes the agent myopic (focused on immediate rewards), while a value close to 1 makes it consider the long-term future rewards." "The discount factor also ensures mathematical convergence of infinite sums in continuing tasks."
Lower values of the discount factor \( \gamma \) cause the agent to prioritize short-term rewards.
- A. True
- B. False
Show Answer
Correct Answers: A
Explanation:
Low gamma places more emphasis on near-term rewards. For example, with \( \gamma = 0.1 \), a reward of 10 arriving three steps in the future is worth only \( 10 \times 0.1^3 = 0.01 \) today, so the agent favors whatever pays off sooner.
"...a lower value of gamma prioritizes the lower-rewarding state at the right endpoint."
Which of the following are value-related functions defined in the lecture?
- A. State-value function (V-function)
- B. Reward matrix
- C. State-action value function (Q-function)
- D. Advantage function
Show Answer
Correct Answers: A, C
Explanation:
Both value functions are introduced explicitly to evaluate policies.
"A value function... is a prediction of discounted sum of future rewards."
"A state action value function or a Q-function... informs us of how good is taking a particular action at a state."
What happens when the reward at non-absorbing states is made more negative?
- A. The agent becomes more cautious and avoids all rewards
- B. The optimal policy becomes independent of the environment
- C. The agent tends to prefer shorter or riskier paths to reach absorbing states
- D. The discount factor becomes irrelevant
Show Answer
Correct Answers: C
Explanation:
As the constant per-step reward becomes more negative, the optimal policy shifts toward quicker (even riskier) paths to an absorbing state.
"...as this constant reward decreases to -0.4... the optimal policy... takes the riskier shorter path..."
"Further, decreasing this constant to -2... the optimal policy now prefers the -1 absorbing state..."
A Q-function provides an estimate of future rewards for a given action taken at a particular state.
- A. True
- B. False
Show Answer
Correct Answers: A
Explanation:
The Q-function gives the expected discounted future reward for a given state-action pair.
"The Q-function for a policy... is the expected sum of discounted rewards... after taking action a at state s."