Markov Decision Process: value iteration, how does it work?

2023-09-11 02:55:10 Author: 雨點淋湿思念


I've been reading a lot about Markov Decision Processes (using value iteration) lately, but I simply can't get my head around them. I've found a lot of resources on the Internet and in books, but they all use mathematical formulas that are way too complex for my competencies.


Since this is my first year at college, I've found that the explanations and formulas provided on the web use notions and terms that are way too complicated for me, and they assume that the reader knows certain things that I've simply never heard of.


I want to use it on a 2D grid filled with walls (impassable), coins (desirable) and moving enemies (which must be avoided at all costs). The whole goal is to collect all the coins without touching the enemies, and I want to create an AI for the main player using a Markov Decision Process (MDP). Here is what it partially looks like (note that the game-related aspect is not much of a concern here; I just really want to understand MDPs in general):


From what I understand, a crude simplification of MDPs is that they can create a grid which holds the direction we need to go (a kind of grid of "arrows" pointing where we need to go, starting from a certain position on the grid) to reach certain goals and avoid certain obstacles. Specific to my situation, that would mean that it lets the player know in which direction to go to collect the coins and avoid the enemies.


Now, in MDP terms, that would mean it creates a collection of states (the cells of the grid), which holds a certain policy (the action to take: up, down, left, right) for each state (a position on the grid). The policy is determined by the "utility" value of each state, which is itself calculated by evaluating how beneficial getting there would be in the short and long term.


Is this correct? Or am I completely on the wrong track?
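To make my mental picture concrete, here is a rough value-iteration sketch I put together. Everything in it is a simplifying assumption of mine: moves are deterministic, the enemy doesn't move, and the reward numbers (coin +1, enemy −1, small step cost) are made up.

```python
# Rough value-iteration sketch on a tiny grid. Deterministic moves,
# static enemy, and invented reward numbers -- all simplifying assumptions.

GRID = [          # '#' = wall, 'C' = coin, 'E' = enemy, '.' = empty
    "....",
    ".#.C",
    "..E.",
]
GAMMA = 0.9       # discount factor: weight of long-term vs. short-term reward
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS = len(GRID), len(GRID[0])

def reward(r, c):
    """Immediate reward for being in a cell."""
    return {"C": 1.0, "E": -1.0}.get(GRID[r][c], -0.04)

def successors(r, c):
    """(action, resulting state) pairs; bumping a wall or edge means staying put."""
    for action, (dr, dc) in ACTIONS.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS and GRID[nr][nc] != "#":
            yield action, (nr, nc)
        else:
            yield action, (r, c)

# Value iteration: repeatedly apply U(s) <- R(s) + gamma * max_a U(s')
U = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if GRID[r][c] != "#"}
for _ in range(100):
    U = {s: reward(*s) + GAMMA * max(U[s2] for _, s2 in successors(*s))
         for s in U}

# The "grid of arrows": in each state, point toward the best successor.
policy = {s: max(successors(*s), key=lambda pair: U[pair[1]])[0] for s in U}
print(policy[(0, 0)])   # direction to move from the top-left corner
```

If my understanding is right, the printed action at the start cell should steer the player along the top row toward the coin while staying away from the enemy cell.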


I'd at least like to know what the variables from the following equation represent in my situation:
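The form I keep running into is, I believe, the standard value-iteration (Bellman) update; writing it out here with my own guesses at what each symbol means for the grid game:

```latex
U_{i+1}(s) \;=\; R(s) \;+\; \gamma \, \max_{a} \sum_{s'} P(s' \mid s, a)\, U_i(s')
```

My assumed reading: $s$ is a state (a cell of the grid), $a$ an action (up, down, left, right), $R(s)$ the immediate reward for being in $s$ (positive for a coin, negative for an enemy), $P(s' \mid s, a)$ the probability of ending up in cell $s'$ after taking action $a$ in $s$, $\gamma$ the discount factor weighting future rewards against immediate ones, and $U_i(s)$ the utility of $s$ after $i$ iterations.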