Reinforcement Learning Octopus Arm Utilizing On-Line GPTD


Abstract
The octopus arm is known for its unique muscular hydrostat structure – it applies force with the sole use of muscles, without any rigid skeletal support. The biomechanical attributes of such an arm enable it to perform tasks no skeletal arm can perform. Hence, a robotic implementation of an octopus arm with a real-time learning control mechanism would yield a highly versatile application, with as-yet-unseen robotic capabilities. Due to the arm’s continuous nature and high level of complexity, no existing machine-learning algorithm had been shown to give practical results in terms of time and space consumption.

The problem
The goal of the project is to teach a 2-dimensional model of the octopus arm to reach a given point in space with reasonable time and space consumption and with a high success rate.

The solution
To solve this intricate problem, we applied a novel Reinforcement Learning approach, On-Line GPTD, to a 2-dimensional, multi-segmented dynamic model of the octopus arm.

The Octopus arm model
The model we used is a 2-dimensional segmentation of the arm, utilizing only masses and springs for its dynamic characteristics. The arm is divided into (N-1) rectangular segments, each defined by four vertices. For simplicity, the muscles are deprived of their mass, and the entire arm’s mass is concentrated in point masses. The point masses are located at the four vertices of each segment, giving a total of 2N masses. The idealized massless springs function as muscles and connect all adjacent pairs of point masses in the model. The 2N masses are arranged in N pairs, each consisting of one ventral and one dorsal mass. (N-1) ventral and (N-1) dorsal longitudinal muscles connect the N ventral and N dorsal masses, respectively. In addition, a transverse muscle connects each ventral-dorsal pair. Figure 1 shows the general structure of the modeled arm. The simulation supports various physical parameters such as gravitation, water specific weight, and arm specific weight. Within the simulation parameters, the user defines a set of activations (i.e., sets of muscle forces for each segment). These activations enable the complex movement of the octopus arm. This simulation was implemented in an earlier project by Chen Kojokaro and Keren Sasson, also supervised by Yaakov Engel.

Figure 1: The general structure of the modeled arm
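
To make this structure concrete, the following C++ sketch shows one possible way to lay out the model's data. The type names and fields (PointMass, Muscle, ArmModel) are our own illustrative choices, not the simulation's actual code. In such a mass-spring model, each simulation step sums the muscle (spring) and external forces acting on every mass and integrates its position and velocity.

#include <vector>

// One point mass of the arm: position and velocity in the 2D plane.
struct PointMass {
    double x, y;    // position
    double vx, vy;  // velocity
};

// One idealized massless spring acting as a muscle between two masses.
struct Muscle {
    int massA, massB;   // indices into ArmModel::masses
    double restLength;  // natural length of the spring
    double activation;  // commanded muscle force, set by the controller
};

// The 2D arm: N ventral-dorsal mass pairs, i.e. 2N masses and N-1 segments.
struct ArmModel {
    std::vector<PointMass> masses;    // 2N point masses (segment vertices)
    std::vector<Muscle> longitudinal; // 2(N-1) ventral and dorsal muscles
    std::vector<Muscle> transverse;   // N transverse (ventral-dorsal) muscles
};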

On-Line GPTD (Gaussian Processes for TD learning)
The algorithm estimates the value function of an MDP online. It works by imposing a Gaussian Process prior over the value function, encoding the prior knowledge E(V(x)) = 0 and E(V(x)V(x')) = k(x,x'), where k(x,x') is a kernel function reflecting our prior belief about how similar the values of two states are. Using the Bellman equation together with this prior, the relation between the observed rewards and the unknown values can be written in matrix form. Combining these matrices with an efficient sparsification method and the standard theorems on conditioning Gaussian variables yields the on-line GPTD algorithm.
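
To make this construction concrete, here is a sketch of the deterministic-transition case following the published GPTD formulation; the notation below is our reconstruction and is not quoted from the project text. Each observed reward is modeled as the temporal difference of the values plus noise:

\[
R(\mathbf{x}_i) = V(\mathbf{x}_i) - \gamma\, V(\mathbf{x}_{i+1}) + N_i,
\qquad i = 0, \dots, t-1,
\]
which stacks into the linear-Gaussian model
\[
\mathbf{R}_{t-1} = \mathbf{H}_t \mathbf{V}_t + \mathbf{N}_t,
\qquad
\mathbf{H}_t =
\begin{pmatrix}
1 & -\gamma & 0 & \cdots & 0\\
0 & 1 & -\gamma & \cdots & 0\\
\vdots & & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & 1 & -\gamma
\end{pmatrix},
\]
with the GP prior $\mathbf{V}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_t)$, $[\mathbf{K}_t]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Conditioning on the observed rewards gives a Gaussian posterior over the value function, whose mean and covariance the on-line algorithm updates one transition at a time, keeping only a sparse dictionary of representative states.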

The Learning System
We have implemented both the stochastic and deterministic on-line GPTD algorithms in C++, creating a general-purpose, standalone algorithmic module. Using this module as an on-line value function estimator, we implemented several policy iteration methods: OPI (Optimistic Policy Iteration, which includes ε-greedy and softmax), Interval Estimation, and Actor-Critic. In addition, the system enables various reward and goal settings. A general scheme of the learning algorithm is shown in Figure 2.

Figure 2: The learning algorithm
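
As a concrete illustration of the OPI scheme, the minimal C++ sketch below selects an action ε-greedily on top of a value estimator. The names (ValueFn, selectActionEpsGreedy, successorStates) are illustrative and do not correspond to the actual module's interface; the candidate successor states stand in for the states predicted by the arm simulation for each muscle activation. Softmax selection differs only in replacing the argmax with sampling proportionally to the exponentiated values.

#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

// Value estimate for a state, as produced by the on-line GPTD module
// (abstracted here as a plain function of the state vector).
using ValueFn = std::function<double(const std::vector<double>&)>;

// Epsilon-greedy action selection, as used in Optimistic Policy Iteration:
// with probability epsilon pick a random muscle activation, otherwise pick
// the activation whose predicted successor state has the highest value.
int selectActionEpsGreedy(const ValueFn& value,
                          const std::vector<std::vector<double>>& successorStates,
                          double epsilon)
{
    double u = static_cast<double>(std::rand()) / RAND_MAX;
    if (u < epsilon)
        return std::rand() % static_cast<int>(successorStates.size());

    int best = 0;
    for (std::size_t a = 1; a < successorStates.size(); ++a)
        if (value(successorStates[a]) > value(successorStates[best]))
            best = static_cast<int>(a);
    return best;
}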

Performance Evaluation
To evaluate the arm’s performance, we saved the GPTD matrices (i.e., the value function) during the learning process. For each saved snapshot, a greedy-policy simulation was run over a pool of randomly distributed initial arm states. For each initial state we recorded whether, under the greedy policy, the arm reached the goal, and if so, how long the trial lasted. This data was then rendered into learning curves, mean run time curves, and success rates. We expect the success rates to increase over time, and the mean run time to decrease accordingly.
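
A minimal sketch of how such a snapshot evaluation might be summarized is given below; the struct and function names (EvalResult, Summary, summarize) are hypothetical stand-ins for our actual evaluation scripts.

#include <cstddef>
#include <vector>

// Outcome of one greedy trial from a single initial arm state.
struct EvalResult {
    bool reachedGoal;   // did the arm reach the goal within the time limit?
    double timeToGoal;  // trial duration in simulation seconds (if successful)
};

// Summary statistics for one saved value-function snapshot: the success rate
// over the pool of initial states and the mean run time of successful trials.
struct Summary {
    double successRate;
    double meanRunTime;
};

Summary summarize(const std::vector<EvalResult>& results)
{
    Summary s{0.0, 0.0};
    std::size_t successes = 0;
    for (const EvalResult& r : results) {
        if (r.reachedGoal) {
            ++successes;
            s.meanRunTime += r.timeToGoal;
        }
    }
    if (successes > 0)
        s.meanRunTime /= static_cast<double>(successes);
    if (!results.empty())
        s.successRate = static_cast<double>(successes) / results.size();
    return s;
}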

Results
In our experiments, we used an octopus arm simulation with 10 segments, which yields reasonable complexity, yet is still flexible enough to manipulate. The state vector has 88 dimensions: 10 segments imply 11 ventral-dorsal mass pairs, i.e., 22 masses, and each mass has position x, y and velocity dx, dy. In general, a basic set of 5-10 muscle activations was used, the trial time limit was set to 4 simulation seconds, and each simulation step lasted 0.4 seconds.
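
The dimension count can be spelled out directly; the constant names below are purely illustrative.

// N-1 = 10 segments means N = 11 ventral-dorsal mass pairs, i.e. 2N = 22
// point masses; each mass contributes position (x, y) and velocity (dx, dy).
constexpr int kSegments  = 10;
constexpr int kMassPairs = kSegments + 1;   // N  = 11
constexpr int kMasses    = 2 * kMassPairs;  // 2N = 22
constexpr int kStateDim  = 4 * kMasses;     // 88-dimensional state vector
static_assert(kStateDim == 88, "state dimension matches the 10-segment arm");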

We present several examples of results. The rest can be found in the project’s book:

1) This is one of our basic learning tasks, in which the arm’s base was fixed (the two base masses cannot move), and the goal can be reached with any of the arm’s vertices. In addition, the gravitational acceleration was set to 9.8 m/s^2, as on Earth’s surface.


2) This is one of our advanced learning tasks. This time, the arm’s base can rotate in both directions, and the goal can be reached only with the two masses at the extremity of the arm. Here too, the gravitational acceleration was set to 9.8 m/s^2, as on Earth’s surface.


Conclusions
We believe our results speak for themselves. In each and every task presented to the learning system, the agent eventually succeeded in reaching the goal with striking rates of success, demonstrating the capabilities of GPTD in practice. All this was accomplished with a considerably small dictionary in comparison with the total number of different states visited by the agent. Finally, we believe that the GPTD-based octopus arm can handle even more complex and realistic learning tasks, namely those typical of octopuses. These would be reaching any point in space (not just one given point as shown above) and even chasing a moving goal. A further development would be enabling physical interaction between the goal and the arm, which would allow the arm to manipulate the goal’s location in space. The arm could pull the original goal towards a distinct location in space (a second goal), as a real octopus would drag food into its mouth.

Acknowledgment
We wish to extend our sincere gratitude to our instructor, Dr. Yaakov (Yaki) Engel, for his advice, support, and guidance. We would also like to thank the PSPL staff and chief engineer Johanan Erez for providing us with an impeccable working environment and allowing us to occupy computers for long simulations. We also wish to thank the Ollendorff Minerva Center, which supported this project.