ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning

1.Abstract

Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) provides a framework for policy updates using an approximation of the value function estimated via Contrastive Learning, achieving higher sample efficiency than conventional methods. However, because CRL treats visited states as pseudo-goals during learning, it can accurately estimate the value function only for a limited set of goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples to supplement states that are hard to visit during on-policy exploration, and 2) learning a consistent embedding space, which uses augmented states as auxiliary information to regularize the embedding space by reformulating its objective function based on Mutual Information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, enabling accurate value estimation.

2.Overview

We artificially augment visited states (augmented states) that are inherently reachable under the current policy but insufficiently visited in limited on-policy rollouts. In addition, instead of using augmented states merely as additional visited states, we leverage them as auxiliary information to regularize the embedding space, encouraging it to jointly calibrate relative distances to both rarely visited and originally visited states. As a result, this prevents overfitting to the limited visited states that were obtained from on-policy rollouts.

Fig.1 Overview of ViSA

3.Simulation experiments

3.1.Common experiment settings

To confirm that our method improves the learning performance of CRL across a variety of robotic tasks, we trained policies in simulation on several robotic tasks and compared task performance. As comparison methods, we used the previous CRL method and several goal-conditioned learning baselines. Each method was trained for three trials. On this page, we present demonstrations comparing the proposed method with the previous CRL method.

3.2.Demonstration

Fetch Push

A task in which a manipulator pushes a box to a target position on the table. The initial positions of the end-effector and the box, as well as the goal positions, were uniformly randomized.

Pick & Place

A task in which a manipulator grasps a box and transports it to a target position on the table or in the air. The initial positions of the end-effector and the box, as well as the goal positions, were uniformly randomized.

Robel Turn

A task in which a three-fingered robotic hand rotates a valve to an arbitrary target angle. The initial valve positions, robot fingertips, and goals were uniformly randomized.

Robel Screw

A task in which the robot rotates a valve by half a turn at a constant speed.

4.Real-world Experiment

To evaluate the effectiveness of our method for real-world robot learning, we conducted comparative experiments against the previous CRL method using the Robel robot. Each task was learned under the same settings as in the simulation experiment. In addition, for Robel Turn, which has a large exploration space, we initialized the policy with a network trained in simulation and fine-tuned it on the real robot. For Robel Screw, we trained the policy from scratch.

Robel Turn

Robel Screw

5.Analysis of Mutual Information Estimation Accuracy

5.1.Setup

Learning the embedding space in CRL can be viewed as a mutual information (MI) estimation problem, since the InfoNCE objective optimized by CRL is a lower bound on MI. To evaluate whether the factorized MI estimation using $I_{SaFE}$ in ViSA contributes to improving accuracy, we compared the errors between ground-truth and estimated MI values using synthetic datasets.
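As a minimal sketch of the conventional estimator used as the reference point in this analysis, the InfoNCE bound can be computed from a matrix of critic scores. The function name and the score-matrix layout (row i is an anchor, column j is a candidate, positive pairs on the diagonal) are assumptions made for illustration, and MI is measured in nats; the factorized estimator $I_{SaFE}$ is not reproduced here.

```python
import numpy as np

def infonce_estimate(scores: np.ndarray) -> float:
    """InfoNCE lower bound on MI (in nats) from an (N, N) score matrix.

    scores[i, j] is the critic value f(x_i, y_j); positives sit on the diagonal.
    """
    n = scores.shape[0]
    # Numerically stable log-sum-exp over each row (all candidates for anchor i).
    row_max = scores.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # mean_i [ f(x_i, y_i) - log sum_j exp f(x_i, y_j) ] + log N
    return float(np.mean(np.diag(scores) - log_sum_exp) + np.log(n))

# A critic that cannot tell pairs apart yields an estimate of zero.
uninformative = infonce_estimate(np.zeros((256, 256)))
# A critic that separates positives perfectly saturates at log N.
saturated = infonce_estimate(1e3 * np.eye(256))
```

Because the estimate can never exceed the logarithm of the batch size, a mini-batch of 256 caps it at log 256, roughly 5.55 nats. This ceiling is a general property of InfoNCE and is one reason large ground-truth MI values are hard to estimate with it.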

We simulated the sample bias in CRL using a key property of Gaussian distributions, where samples concentrate near the mean and few samples occur far from it. Specifically, three variables $x, y, z$ corresponding to the anchor, visited state, and augmented state are generated via linear transformations and added Gaussian noise $\mathcal{N}(0,I)$:

$$p(x)=\mathcal{N}(x;0,I),~p(y|x)=\mathcal{N}(y;x,\sigma_{xy}),~p(z|y)=\mathcal{N}(z;y,\sigma_{yz}).$$

Here, $I$ is the identity matrix. To preferentially augment hard-to-sample $y$ samples, we set $\sigma_{xy}<\sigma_{yz}$ and generated $z$ from a higher-variance Gaussian. The mini-batch size was set to 256, and $\sigma_{xy}$ was varied so that the ground-truth MI $I(x;y)$ takes values in {3.5, 4.5, 5.5}, in order to validate the generality of the estimator. For a fair comparison, the total number of $y$ and $z$ samples matched the number of $y$ samples in CRL. To validate the effect of augmented-state relevance on MI estimation, we varied $\sigma_{yz}$ so that $I(y;z)$ corresponds to {1/2, 1/32, 1/55} of $I(x;y)$, as shown in Fig.2 (b).
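The sampling chain and its closed-form MI can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the experiment: $\sigma_{xy}$ and $\sigma_{yz}$ are treated as standard deviations of isotropic noise, the dimension `dim = 20` is a hypothetical choice, and all function names are made up here. Under these assumptions $I(x;y) = \frac{d}{2}\log(1 + 1/\sigma_{xy}^2)$ nats, which can be inverted to find the noise level for a target MI.

```python
import numpy as np

def sample_xyz(n, dim, sigma_xy, sigma_yz, rng):
    """Draw anchor x, visited state y, and augmented state z along the chain x -> y -> z."""
    x = rng.standard_normal((n, dim))
    y = x + sigma_xy * rng.standard_normal((n, dim))
    z = y + sigma_yz * rng.standard_normal((n, dim))
    return x, y, z

def gaussian_mi_xy(dim, sigma_xy):
    """Closed-form I(x; y) in nats for y = x + sigma_xy * noise, x ~ N(0, I)."""
    return 0.5 * dim * np.log1p(1.0 / sigma_xy**2)

def gaussian_mi_yz(dim, sigma_xy, sigma_yz):
    """Closed-form I(y; z) in nats; each coordinate of y has variance 1 + sigma_xy^2."""
    return 0.5 * dim * np.log1p((1.0 + sigma_xy**2) / sigma_yz**2)

def sigma_for_mi(dim, target_mi):
    """Invert gaussian_mi_xy: noise scale that yields the requested I(x; y)."""
    return 1.0 / np.sqrt(np.expm1(2.0 * target_mi / dim))

rng = np.random.default_rng(0)
dim = 20                            # hypothetical dimensionality
sigma_xy = sigma_for_mi(dim, 4.5)   # one of the target MI values {3.5, 4.5, 5.5}
sigma_yz = 2.0 * sigma_xy           # any value larger than sigma_xy
x, y, z = sample_xyz(256, dim, sigma_xy, sigma_yz, rng)
```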

5.2.Results

The MI analysis results are shown in Fig.2 (a). Across different MI values, our method (orange line) estimates MI closer to the ground truth (black line) than the conventional estimator $I_{NCE}$, since augmented states supplement hard-to-sample data and regularize the embedding space. Consequently, when augmented states are properly related to visited states, MI estimation is more accurate.

Fig.2 Results of MI analysis experiment: (a) Experiment result, where solid lines show estimated MI values for each method and the ground-truth MI of the dataset and dashed black lines indicate the upper bound of MI estimated by InfoNCE loss. From left to right, these results correspond to settings where the MI between visited states and augmented states is increasingly higher. (b) Outline diagram illustrating the relevance between visited and augmented states.

6.Comparison with Naive Sample-Bias Mitigation Methods of Conventional CRL

6.1.Setup

One possible way to mitigate sample bias in conventional CRL is to broaden the range of positive samples collected from visited states without introducing augmented states. In conventional CRL, the discount factor $\gamma$ determines the sampling distribution of future visited states used as positive examples. Increasing $\gamma$ allows positive samples to be collected from a wider temporal range of the trajectory, thereby reducing the concentration of positive samples around states close to the anchor.
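The effect of $\gamma$ on the positive-sample distribution can be illustrated with a short sketch. It assumes that the future state $k$ steps ahead of the anchor is chosen with probability proportional to $\gamma^{k-1}$, truncated at the end of the trajectory; the function name and the horizon of 100 steps are hypothetical.

```python
import numpy as np

def future_offset_probs(gamma: float, horizon: int) -> np.ndarray:
    """Probability of picking the state k steps ahead (k = 1..horizon) as the positive."""
    k = np.arange(1, horizon + 1)
    weights = gamma ** (k - 1)   # geometric weighting, truncated at the trajectory end
    return weights / weights.sum()

# Ratio between the farthest and nearest offsets: closer to 1 means a flatter distribution.
flatness = {
    gamma: float(future_offset_probs(gamma, horizon=100)[-1]
                 / future_offset_probs(gamma, horizon=100)[0])
    for gamma in (0.99, 0.999, 0.9999)
}
```

As $\gamma$ approaches 1 the ratio approaches 1, so near and far future states are sampled almost equally often.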

To investigate whether the benefits of ViSA can be explained solely by collecting positive samples from a broader range of visited states, we compare ViSA with a conventional CRL approach that broadens the positive-sample range by adjusting the discount factor. Specifically, we evaluate CRL using three discount factors, $\gamma \in \{0.99, 0.999, 0.9999\}$. The visited state distribution for each compared method is shown in Fig.3 (a). The Robel Screw task is used for evaluation.

Fig.3 Visited state distributions and learning results for a naive sample diversification CRL method. (a) Visualization of visited state distributions. CRL is modified by adjusting the discount factor $\gamma$ to collect visited state samples $s_v$ more broadly from on-policy rollouts. For visualization, sampling probabilities are normalized so that the maximum value is 1. (b) Learning curves of success rates. Solid lines show mean success rates over three trials for the proposed method and baselines, and shaded regions indicate variance.

6.2.Results

The learning curves of success rates are shown in Fig.3 (b). Although increasing $\gamma$ broadens the range of visited states from which positive examples are sampled, the CRL variants with $\gamma=0.999$ (solid yellow line) and $\gamma=0.9999$ (solid green line) exhibit larger variance across trials and slightly worse performance than the original setting with $\gamma=0.99$ (solid blue line).

A possible explanation is that increasing $\gamma$ reduces the difference in sampling frequencies among future states at different temporal distances from the anchor, thereby weakening the relative value structure induced by the number of steps required to reach the goal. In contrast, ViSA (solid purple line) incorporates augmented states $s_a$ as auxiliary information while preserving the relative ordering of values associated with visited states in the embedding space. As a result, ViSA achieves faster convergence and lower variance across trials.

These results suggest that the performance gains of ViSA cannot be explained solely by collecting positive examples from a broader range of visited states. Instead, incorporating augmented states through the proposed objective plays an important role in mitigating sample bias while maintaining a consistent value structure in the learned representation.

7.Details of learning settings

The details of the learning settings used in the experiments of this study are presented in the table below. In the actor network, two types of activation functions are employed: 1) NormalArcTangent is used only in the final layer, while 2) ReLU function is primarily applied throughout the network. Furthermore, to enable parallel learning, the actor operates with four instances in parallel during exploration. However, for tasks using the Robel robot, parallel learning is not performed, and training is conducted using only a single instance.

| Learning Setting | Value |
| --- | --- |
| number of actors | 4; 1 (for the Robel Turn and Robel Screw tasks) |
| batch size | 128 |
| learning rate | 0.0003 |
| discount | 0.99 |
| gradient updates to perform per step | 32 |
| actor target entropy | 0 |
| hidden layer sizes | (256, 256) |
| initial random data collection | 10,000 transitions |
| replay buffer size | 1,000,000 transitions |
| samples per insert | 256 |
| train-collect interval | 16 |
| representation dimension ($\psi, \phi, \hat{\phi}$) | 64 |
| activation function (critic) | ReLU |
| activation function (actor) | ReLU (hidden layers); NormalArcTangent (final layer) |
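For readers who want to reproduce these settings in code, the table can be mirrored as a small configuration object. This is only a sketch: the class name, field names, and task identifiers are hypothetical and do not come from the authors' code base; the values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class LearningSettings:
    # Field names are illustrative; values follow the learning-settings table.
    num_actors: int = 4
    batch_size: int = 128
    learning_rate: float = 3e-4
    discount: float = 0.99
    gradient_updates_per_step: int = 32
    actor_target_entropy: float = 0.0
    hidden_layer_sizes: Tuple[int, ...] = (256, 256)
    initial_random_transitions: int = 10_000
    replay_buffer_size: int = 1_000_000
    samples_per_insert: int = 256
    train_collect_interval: int = 16
    representation_dim: int = 64

def settings_for_task(task: str) -> LearningSettings:
    """Robel tasks run a single actor; all other tasks use four parallel actors."""
    if task in ("robel_turn", "robel_screw"):   # hypothetical task identifiers
        return LearningSettings(num_actors=1)
    return LearningSettings()
```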

8.Details of task settings

8.1.Fetch Push

Overview of Fetch Push

The observation space is a 25-dimensional vector that includes the end-effector position, the box position, their velocities, and additional state information relevant to the task. The goal space is originally defined as a 6-dimensional vector composed of the end-effector position and object position. However, to map observations and goals into the same embedding space for contrastive learning, the remaining dimensions are padded with dummy values, resulting in a 25-dimensional representation matching the observation space. The action space is defined as a 3-dimensional vector representing the displacement of the end-effector position. The task achievement condition $\Phi_{TA}$ is defined below. Specifically, an episode is considered successful if the box position $s_T=[x_b,y_b,z_b]$ reaches the goal $g=[x_g,y_g,z_g]$ at least once during the episode, meaning that the Euclidean distance between them falls below a predefined threshold:

$$\Phi_{TA}=\mathbf{1}(\lVert s_{T}-g\rVert<0.05)=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{b}-x_{g}\\y_{b}-y_{g}\\z_{b}-z_{g} \end{array}\right\rVert<0.05\right).$$

The reward function $r$ is defined as the sparse reward shown below, where a reward is provided only when the task achievement condition is satisfied:

$$r = \mathbf{1}(\Phi_{TA}=1).$$
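The goal padding and the success check for Fetch Push can be sketched as below. The helper names are hypothetical, and two details are assumptions because the page does not specify them: the dummy value is taken to be 0, and the goal entries are placed in the leading positions of the padded vector.

```python
import numpy as np

def pad_goal(goal, obs_dim, pad_value=0.0):
    """Pad a low-dimensional goal to the observation dimension (assumed layout: goal first)."""
    goal = np.asarray(goal, dtype=float)
    padded = np.full(obs_dim, pad_value)
    padded[: goal.shape[0]] = goal
    return padded

def task_achieved(box_positions, goal_pos, threshold=0.05):
    """Phi_TA: 1.0 if the box is within `threshold` of the goal at any step of the episode."""
    box_positions = np.asarray(box_positions, dtype=float)
    goal_pos = np.asarray(goal_pos, dtype=float)
    dists = np.linalg.norm(box_positions - goal_pos, axis=1)
    return float((dists < threshold).any())

def sparse_reward(phi_ta):
    """r = 1(Phi_TA = 1)."""
    return float(phi_ta == 1.0)

padded_goal = pad_goal(np.ones(6), obs_dim=25)                # 6-dim goal -> 25-dim vector
phi = task_achieved([[0.0, 0.0, 0.0], [0.5, 0.5, 0.40]],      # box trajectory
                    goal_pos=[0.5, 0.5, 0.42])
reward = sparse_reward(phi)
```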

8.2.Fetch Push Hard

Overview of Fetch Push Hard

The training setup is identical to that of the Fetch Push task, except that a time-based metric $score_{\mathrm{reach\text{-}time}}$ is introduced to encourage faster object transport. Specifically, this metric is defined based on the number of timesteps required to reach the goal and is included in both the observation and goal spaces. As a result, the observation space becomes 26-dimensional and the goal space becomes 7-dimensional, which is then padded with dummy values to obtain a 26-dimensional representation. The task achievement condition $\Phi_{TA}$ is extended as shown below. Specifically, a task is considered successful if the box position $s_T$ reaches the goal $g$ and $score_{\mathrm{reach\text{-}time}}$ exceeds a predefined threshold $\epsilon$.

$$\begin{aligned}\Phi_{TA}&=\mathbf{1}(\lVert s_{T}-g\rVert<0.05)\cdot\mathbf{1}(score_{\mathrm{reach\text{-}time}}>\epsilon)\\&=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{b}-x_{g}\\y_{b}-y_{g}\\z_{b}-z_{g} \end{array}\right\rVert<0.05\right)\cdot\mathbf{1}(score_{\mathrm{reach\text{-}time}}>\epsilon).\end{aligned}$$

The reward $r$ is defined in the same manner as in the Fetch Push task, where a sparse reward is provided only when the task achievement condition is satisfied:

$$r = \mathbf{1}(\Phi_{TA}=1).$$

8.3.Pick & Place

Overview of Pick & Place

The observation space is a 27-dimensional vector that includes the end-effector position, the box position, their velocities, and additional state information relevant to the task. The action space is defined as a 4-dimensional vector representing the displacement of the end-effector position and gripper position. The goal space is originally defined as a 7-dimensional vector that includes the end-effector position, the box position, and additional elements. It is then padded with dummy values, resulting in a 27-dimensional goal representation. In addition, a subgoal-based goal representation is adopted in this task. Specifically, while the end-effector is away from the box, the box position is provided as the goal position of the end-effector. Once the end-effector reaches the box, the goal position of the end-effector is switched to the task goal position. In this manner, the goal of the end-effector is specified in two stages according to the progress of the task. The task achievement condition $\Phi_{TA}$ is defined below. Specifically, an episode is considered successful if the box position $s_T=[x_b,y_b,z_b]$ reaches the goal $g=[x_g,y_g,z_g]$ at least once during the episode, meaning that the Euclidean distance between them falls below a predefined threshold:

$$\Phi_{TA}=\mathbf{1}(\lVert s_{T}-g\rVert<0.05)=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{b}-x_{g}\\y_{b}-y_{g}\\z_{b}-z_{g} \end{array}\right\rVert<0.05\right).$$

The reward function $r$ is defined as the sparse reward, where a reward is provided only when the task achievement condition is satisfied:

$$r = \mathbf{1}(\Phi_{TA}=1).$$
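The two-stage subgoal for the end-effector described in this task can be sketched as a simple switch. The function name and the distance `reach_threshold` that decides when the end-effector has "reached" the box are hypothetical; the page does not state how that transition is detected.

```python
import numpy as np

def end_effector_goal(ee_pos, box_pos, task_goal_pos, reach_threshold=0.05):
    """Return the current goal position for the end-effector.

    Stage 1: while the end-effector is away from the box, aim at the box.
    Stage 2: once it has reached the box, aim at the task goal.
    """
    ee_pos = np.asarray(ee_pos, dtype=float)
    box_pos = np.asarray(box_pos, dtype=float)
    if np.linalg.norm(ee_pos - box_pos) > reach_threshold:
        return box_pos.copy()
    return np.asarray(task_goal_pos, dtype=float).copy()

far = end_effector_goal([0.0, 0.0, 0.5], [0.3, 0.1, 0.42], [0.6, 0.2, 0.6])
near = end_effector_goal([0.3, 0.1, 0.43], [0.3, 0.1, 0.42], [0.6, 0.2, 0.6])
```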

8.4.Pick & Place Hard

Overview of Pick & Place Hard

The training setup is identical to that of the Pick & Place task, except that a height constraint on the box is introduced. Specifically, the average box height over the past 5 steps, $\mathrm{avg}(z_{\mathrm{box}})$, is added to both the observation and goal representations. As a result, the observation space becomes 28-dimensional and the goal space becomes 8-dimensional, which is then padded with dummy values to obtain a 28-dimensional representation. The task achievement condition $\Phi_{TA}$ is extended as shown below. Specifically, a task is considered successful if the box position $s_T$ reaches the goal $g$ and the average box height $\mathrm{avg}(z_{\mathrm{box}})$ remains within a predefined tolerance range.

$$\begin{aligned}\Phi_{TA}&=\mathbf{1}(\lVert s_{T}-g\rVert<0.05)\cdot\mathbf{1}(\lVert \mathrm{avg}(z_{\mathrm{box}})-z_g\rVert<0.05)\\&=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{b}-x_{g}\\y_{b}-y_{g}\\z_{b}-z_{g} \end{array}\right\rVert<0.05\right)\cdot\mathbf{1}(\lVert \mathrm{avg}(z_{\mathrm{box}})-z_g\rVert<0.05).\end{aligned}$$

The reward $r$ is defined in the same manner as in the Pick & Place task, where a sparse reward is provided only when the task achievement condition is satisfied:

$$r = \mathbf{1}(\Phi_{TA}=1).$$
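The height constraint can be sketched with a fixed-length buffer for the running average. The class and function names are hypothetical, and averaging over fewer than 5 values at the start of an episode is an assumption; the page does not describe how the first steps are handled.

```python
from collections import deque
import numpy as np

class BoxHeightAverager:
    """Running mean of the box height over the most recent `window` steps."""

    def __init__(self, window=5):
        self._heights = deque(maxlen=window)   # old values drop out automatically

    def update(self, z_box):
        self._heights.append(float(z_box))
        return sum(self._heights) / len(self._heights)

def task_achieved_hard(box_pos, goal_pos, avg_z_box, threshold=0.05):
    """Phi_TA: the box is at the goal AND its averaged height is close to the goal height."""
    box_pos = np.asarray(box_pos, dtype=float)
    goal_pos = np.asarray(goal_pos, dtype=float)
    reached = np.linalg.norm(box_pos - goal_pos) < threshold
    height_ok = abs(avg_z_box - goal_pos[2]) < threshold
    return float(reached and height_ok)

averager = BoxHeightAverager(window=5)
for z in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0):
    avg_z = averager.update(z)   # after six updates, only the last five values remain
```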

8.5.Pick & Place Obstacle

Overview of Pick & Place Obstacle

The training setup is identical to that of the Pick & Place task. The observation space is a 27-dimensional vector, while the goal space is a 7-dimensional vector. To align the goal representation with the observation space, the goal vector is padded with dummy values, resulting in a 27-dimensional representation. The task achievement condition $\Phi_{TA}$ is extended as shown below. Specifically, a task is considered successful if the box position $s_T$ reaches the goal $g$ and the box is lifted to a height $z_{box}\geq h_{obstacle}$ sufficient to avoid the obstacle at least once during the episode.

$$\begin{aligned}\Phi_{TA}&=\mathbf{1}(\lVert s_{T}-g\rVert<0.05)\cdot\mathbf{1}(z_{box}\geq h_{obstacle})\\&=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{b}-x_{g}\\y_{b}-y_{g}\\z_{b}-z_{g} \end{array}\right\rVert<0.05\right)\cdot\mathbf{1}(z_{box}\geq h_{obstacle}).\end{aligned}$$

The reward $r$ is defined in the same manner as in the Pick & Place task, where a sparse reward is provided only when the task achievement condition is satisfied:

$$r = \mathbf{1}(\Phi_{TA}=1).$$
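In this task the two factors of $\Phi_{TA}$ will typically hold at different timesteps: the box clears the obstacle mid-episode and reaches the goal later. One possible reading, sketched below, evaluates each factor over the whole episode. The function name and the example obstacle height are hypothetical, and treating both conditions as "at least once during the episode" is an interpretation of the text rather than a confirmed detail.

```python
import numpy as np

def obstacle_task_achieved(box_positions, goal_pos, h_obstacle, threshold=0.05):
    """1.0 if the box reached the goal at some step and cleared h_obstacle at some step."""
    box_positions = np.asarray(box_positions, dtype=float)
    goal_pos = np.asarray(goal_pos, dtype=float)
    reached = np.linalg.norm(box_positions - goal_pos, axis=1) < threshold
    cleared = box_positions[:, 2] >= h_obstacle   # z-coordinate of the box
    return float(reached.any() and cleared.any())

trajectory = [[0.0, 0.0, 0.42], [0.3, 0.0, 0.70], [0.5, 0.0, 0.42]]
lifted_over = obstacle_task_achieved(trajectory, [0.5, 0.0, 0.42], h_obstacle=0.6)
not_high_enough = obstacle_task_achieved(trajectory, [0.5, 0.0, 0.42], h_obstacle=0.8)
```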

8.6.Robel Turn

Overview of Robel Turn
Task space of Robel Tasks

The observation space is a 13-dimensional vector consisting of the valve angle, the end-effector position, and additional state information. The action space is constrained such that each finger of the robotic hand moves along a 2-dimensional trajectory. For each finger, a 3-dimensional action vector corresponding to the displacement along the trajectory is provided.

The goal space is a 5-dimensional vector consisting of the valve angle and the number of goal-holding steps. To align the goal representation with the observation space, the goal vector is padded with dummy values, resulting in a 13-dimensional representation. In addition, a subgoal-based representation is adopted in this task, where the required number of goal-holding steps is adjusted in three stages according to the difference between the current valve angle and the target goal angle. The reward $r$ is defined below, where a sparse reward of 1 is given when the difference between the valve angle and the goal angle falls below a predefined threshold.

$$r=\mathbf{1}\left(\left\lVert\begin{array}{c} x_{v}-x_{g}\\y_{v}-y_{g} \end{array}\right\rVert\leq0.35\right)$$

Here, $(x_v, y_v)$ and $(x_g, y_g)$ denote the 2-dimensional embeddings of the valve angle and goal angle on the unit circle, respectively. The task achievement condition $\Phi_{TA}$ is defined as success if $r = 1$ holds at the final step of an episode.

Furthermore, in the Robel Turn task, training is performed with an early-termination mechanism, where the episode is terminated if the task achievement condition is satisfied for 10 consecutive steps.
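The reward and the early-termination rule for Robel Turn can be sketched as follows. The function names are hypothetical, angles are assumed to be in radians, and early termination is interpreted here as the reward being 1 for 10 consecutive steps. Embedding angles on the unit circle makes the distance wrap around correctly; a chord length of 0.35 corresponds to an angular difference of about 0.35 rad (roughly 20 degrees).

```python
import numpy as np

def angle_to_unit_circle(theta):
    """Embed an angle (radians) as a point (cos, sin) on the unit circle."""
    return np.array([np.cos(theta), np.sin(theta)])

def turn_reward(valve_angle, goal_angle, threshold=0.35):
    """Sparse reward: 1.0 when the embedded valve and goal angles are within `threshold`."""
    diff = angle_to_unit_circle(valve_angle) - angle_to_unit_circle(goal_angle)
    return float(np.linalg.norm(diff) <= threshold)

def should_terminate_early(rewards, required_consecutive=10):
    """True once the most recent `required_consecutive` rewards are all 1.0."""
    if len(rewards) < required_consecutive:
        return False
    return all(r == 1.0 for r in rewards[-required_consecutive:])

# Angles just either side of +/- pi are close on the circle even though their raw difference is large.
wrapped = turn_reward(np.pi - 0.1, -np.pi + 0.1)
```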

8.7.Robel Screw

Overview of Robel Screw

The observation space is a 13-dimensional vector consisting of the valve angle, the end-effector position, and additional state information. The action space is constrained such that each finger of the robotic hand moves along a 2-dimensional trajectory. For each finger, a 3-dimensional action vector corresponding to the displacement along the trajectory is provided. The goal space is a 5-dimensional vector consisting of the valve angle and the number of goal-holding steps. To align the goal representation with the observation space, the goal vector is padded with dummy values, resulting in a 13-dimensional representation. In addition, a subgoal-based representation is adopted in this task, where the target valve angle is updated at each step in order to rotate the valve at a constant speed. This design provides the agent with sequential angle targets that evolve over time. The task achievement condition $\Phi_{TA}$ is defined below. Specifically, an episode is considered successful if the number of timesteps during which the valve reaches the goal meets or exceeds a predefined threshold:

$$\Phi_{TA}=\mathbf{1}\left(\sum_{t=0}^{T}\Phi_t\geq144\right),\quad \Phi_{t}=\mathbf{1}\left(\left\lVert \begin{array}{c} x_{t}-x_{g}\\y_{t}-y_{g} \end{array} \right\rVert\leq0.35\right)$$

Here, $(x_t, y_t)$ and $(x_g, y_g)$ denote the 2-dimensional embeddings of the valve angle and goal angle on the unit circle at timestep $t$, respectively. The reward $r$ is sparsely provided when the task achievement condition is satisfied.
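The moving target and the hit-count condition for Robel Screw can be sketched as below. The function names, the episode length of 200 steps, and the linear schedule are hypothetical; only the half-turn rotation, the 0.35 threshold, and the requirement of at least 144 hits come from the text.

```python
import numpy as np

def goal_schedule(start_angle, total_rotation, num_steps):
    """Per-step target angles (radians) that advance at constant speed."""
    return start_angle + total_rotation * np.arange(1, num_steps + 1) / num_steps

def screw_task_achieved(valve_angles, goal_angles, threshold=0.35, min_hits=144):
    """Phi_TA: 1.0 if the valve is within `threshold` of its per-step goal on >= min_hits steps."""
    valve_angles = np.asarray(valve_angles, dtype=float)
    goal_angles = np.asarray(goal_angles, dtype=float)
    valve = np.stack([np.cos(valve_angles), np.sin(valve_angles)], axis=1)
    goal = np.stack([np.cos(goal_angles), np.sin(goal_angles)], axis=1)
    hits = np.linalg.norm(valve - goal, axis=1) <= threshold   # Phi_t for every timestep
    return float(hits.sum() >= min_hits)

goals = goal_schedule(start_angle=0.0, total_rotation=np.pi, num_steps=200)  # half a turn
perfect_tracking = screw_task_achieved(goals, goals)
stuck_valve = screw_task_achieved(np.zeros(200), goals)
```

A valve that never moves only matches the first few targets, so it falls well short of 144 hits, whereas perfect tracking satisfies the condition.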