ボール「Roller」が立方体「Target」の方向に転がることを学ぶ学習環境を用意しています

強化学習について(「9_強化学習.ipynb」を参照) 強化学習の要素 サンプル

観察 (VectorObservation)
- SpaceSize (今回のサイズは6)
  0. TargetのX座標
  1. TargetのZ座標
  2. RollerAgentのX座標
  3. RollerAgentのZ座標
  4. RollerAgentのX速度
  5. RollerAgentのZ速度
- Stacked Vectors
行動 (Actions)
- Continuous(連続。今回のサイズは2)
  0. RollerAgentのX方向に加える力
  1. RollerAgentのZ方向に加える力
- Discrete(離散)
報酬 RollerAgentがTargetの位置に到着時に+1.0(エピソード完了)
RollerAgentがTargetが床から落下時に±0.0(エピソード完了)
決定 10ステップ毎

追加するコンポーネントや手順

Agentオブジェクトに下記を追加していく

Behavior Parametersの追加
Agentクラスのスクリプトの追加
DecisionRequesterの追加

1.Behavior Parametersの追加

学習環境のエージェント(RollerAgent)にBehavior Parametersコンポーネントがアタッチされているはずです
これは、エージェントの「観察」と「行動」のデータ型を設定するコンポーネントです

BehaviorName
学習設定ファイルのセクションとしての利用
VectorObservation(観察)

「観察」は、特定のエージェントが利用できる環境の「状態」の部分情報です。
「VectorObservation」は「観察」のデータ型の１つで、配列で問題解決に必要な情報を格納します。今回は、観察の配列として、下記のように格納しています

        観察の配列[0] = TargetのX座標     
        観察の配列[1] = TargetのZ座標  
        観察の配列[2] = RollerAgentのX座標  
        観察の配列[3] = RollerAgentのZ座標  
        観察の配列[4] = RollerAgentのX速度  
        観察の配列[5] = RollerAgentのZ速度  

Actions(行動)

「行動」は、エージェントが実行するポリシーからの指示。「Continuous」は「行動」のデータ型の１つで、連続値(-1.0~+1.0)を格納します。今回は、行動の配列として、下記のように格納しています。

        行動の配列[0] = RollerAgentのX方向に加える力  
        行動の配列[1] = RollerAgentのZ方向に加える力

Model
利用するモデルファイル(拡張子はonnx)
BehaviorType
実行モード
- Default　　学習用Pythonスクリプトが実行中の時は「学習モード」それ以外でModelが設定済みの時は「推論モード」それ以外は「Heauristicモード」
- Heauristic Only　　
```
// ヒューリスティックモードの行動決定時に呼ばれる
public override void Heuristic(in ActionBuffers actionBuffers)
{
    var actionsOut = actionBuffers.ContinuousActions;
    actionsOut[0] = Input.GetAxis("Horizontal");
    actionsOut[1] = Input.GetAxis("Vertical");
}
```
今回のContinuousActionはサイズ２の配列
[0]にはInput.GetAxis("Horizontal")
[1]にはInput.GetAxis("Vertical")

UnityのPlayボタンを押して、方向キーで「RollerAgent」の操作ができる

(inについて)
- Inference Only 　「推論モード」

2.Agentクラスのスクリプトの追加

作成した学習環境のエージェント（RollerAgent）に、「Agentクラスのスクリプト」を追加します。

「Agentクラスのスクリプト」では、「エピソード開始時の初期化」「観察取得」「行動実行と報酬取得」などを実装します。

具体的には下記メソッドをオーバーライドします。

・ void Initialize()：ゲームオブジェクト生成時に呼ばれる

・ void OnEpisodeBegin()：エピソード開始時に呼ばれる

・ void CollectObservations()：観察取得時に呼ばれる

・ void OnActionReceived：行動決定時に呼ばれる

スクリプトを書いていく

①パッケージのインポート

using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Policies;

②RollerAgent

TransformとRigidBodyの参照を定義

public class RollerAgent : Agent
{
    public Transform target; // TargetのTransform
    Rigidbody rBody; // RollerAgentのRigidBody

    (Agentクラスのオーバーライドメソッドを追加していく)

③Initialize()

ゲームオブジェクト生成時の初期化を行います。

    // ゲームオブジェクト生成時に呼ばれる
    public override void Initialize()
    {
        // RollerAgentのRigidBodyの参照の取得
        this.rBody = GetComponent<Rigidbody>();
    }

④OnEpisodeBegin()

エピソード開始時の初期化を行います。
RollerAgentが床から落下している時は、RollerAgentの位置と速度をリセットし、Targetの位置は常にリセットします

    // エピソード開始時に呼ばれる
    public override void OnEpisodeBegin()
    {
        // RollerAgentが床から落下している時
        if (this.transform.localPosition.y < 0)
        {
            // RollerAgentの位置と速度をリセット
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            this.transform.localPosition = new Vector3(0.0f, 0.5f, 0.0f);
        }

        // Targetの位置のリセット
        target.localPosition = new Vector3(
            Random.value*8-4, 0.5f, Random.value*8-4);
    }

⑤CollectObservations()

エージェントに観察を渡します。
引数「VectorSensor」のAddObservation()で観察の値を渡します

    // 観察取得時に呼ばれる
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(target.localPosition.x); //TargetのX座標
        sensor.AddObservation(target.localPosition.z); //TargetのZ座標
        sensor.AddObservation(this.transform.localPosition.x); //RollerAgentのX座標
        sensor.AddObservation(this.transform.localPosition.z); //RollerAgentのZ座標
        sensor.AddObservation(rBody.velocity.x); // RollerAgentのX速度
        sensor.AddObservation(rBody.velocity.z); // RollerAgentのZ速度
    }

⑥OnActionReceived()

決定された行動に応じて行動実行を行い、その結果に応じて報酬取得とエピソード完了を行います

// 行動決定時に呼ばれる
    public override void OnActionReceived(ActionBuffers actionBuffers)
    {
        // RollerAgentに力を加える
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = actionBuffers.ContinuousActions[0];
        controlSignal.z = actionBuffers.ContinuousActions[1];
        rBody.AddForce(controlSignal * 10);

        // RollerAgentがTargetの位置にたどりついた時
        float distanceToTarget = Vector3.Distance(
            this.transform.localPosition, target.localPosition);
        if (distanceToTarget < 1.42f)
        {
            AddReward(1.0f);
            EndEpisode();
        }

        // RollerAgentが床から落下した時
        if (this.transform.localPosition.y < 0)
        {
            EndEpisode();
        }
    }

決定された行動の取得は引数「ActionBuffers」のContinuousActions/DiscreteActionsを使います。今回のContinuousActionsはサイズ2の浮動小数配列で、ContinuousActions[0]はX方向に加える力（-1.0 ~ +1.0）、ContinuousActions[1]はZ方向に加える力（-1.0 ~ +1.0）が渡されます。

報酬の加算は、AgentクラスのAddReward(float increment)、エピソード完了はEndEpisode()を使います。

「報酬」と「エピソード完了」については3章でも取り上げる予定。

ContinuousActionsについて

Agentクラスコンポーネントのインスペクターでの設定

Max Step :
エピソードの最大ステップ。エピソードのステップ数がこの値を超えると、エピソード完了となる。０の場合は無制限。
Target :
ドラッグ＆ドロップでTargetを設定

下記スクリプトは最終的な全体のスクリプトになります

using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Policies;

// RollerAgent
public class RollerAgent : Agent
{
    public Transform target; // TargetのTransform
    Rigidbody rBody; // RollerAgentのRigidBody

    // ゲームオブジェクト生成時に呼ばれる
    public override void Initialize()
    {
        // RollerAgentのRigidBodyの参照の取得
        this.rBody = GetComponent<Rigidbody>();
    }

    // エピソード開始時に呼ばれる
    public override void OnEpisodeBegin()
    {
        // RollerAgentが床から落下している時
        if (this.transform.localPosition.y < 0)
        {
            // RollerAgentの位置と速度をリセット
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            this.transform.localPosition = new Vector3(0.0f, 0.5f, 0.0f);
        }

        // Targetの位置のリセット
        target.localPosition = new Vector3(
            Random.value*8-4, 0.5f, Random.value*8-4);
    }

    // 観察取得時に呼ばれる
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(target.localPosition.x); //TargetのX座標
        sensor.AddObservation(target.localPosition.z); //TargetのZ座標
        sensor.AddObservation(this.transform.localPosition.x); //RollerAgentのX座標
        sensor.AddObservation(this.transform.localPosition.z); //RollerAgentのZ座標
        sensor.AddObservation(rBody.velocity.x); // RollerAgentのX速度
        sensor.AddObservation(rBody.velocity.z); // RollerAgentのZ速度
    }

    // 行動決定時に呼ばれる
    public override void OnActionReceived(ActionBuffers actionBuffers)
    {
        // RollerAgentに力を加える
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = actionBuffers.ContinuousActions[0];
        controlSignal.z = actionBuffers.ContinuousActions[1];
        rBody.AddForce(controlSignal * 10);

        // RollerAgentがTargetの位置にたどりついた時
        float distanceToTarget = Vector3.Distance(
            this.transform.localPosition, target.localPosition);
        if (distanceToTarget < 1.42f)
        {
            AddReward(1.0f);
            EndEpisode();
        }

        // RollerAgentが床から落下した時
        if (this.transform.localPosition.y < 0)
        {
            EndEpisode();
        }
    }

    // ヒューリスティックモードの行動決定時に呼ばれる
    public override void Heuristic(in ActionBuffers actionBuffers)
    {
        var actionsOut = actionBuffers.ContinuousActions;
        actionsOut[0] = Input.GetAxis("Horizontal");
        actionsOut[1] = Input.GetAxis("Vertical");
    }
}

3.DecisionRequesterの追加

「DecisionRequester」は「何ステップごとに１回決定を要求するか」を設定するコンポーネントです

DecisionPeriod　: 何ステップごとに１回決定を要求するか　

ここまででUnity側の設定は終わり、次は学習ファイルを設定し、pythonとUnityを走らせます

1.Behavior Parametersの追加​

2.Agentクラスのスクリプトの追加​

①パッケージのインポート​

②RollerAgent​

③Initialize()​

④OnEpisodeBegin()​

⑤CollectObservations()​

⑥OnActionReceived()​

Agentクラスコンポーネントのインスペクターでの設定​

3.DecisionRequesterの追加​

DecisionPeriod : 何ステップごとに１回決定を要求するか ​