## I Introduction

As autonomous robotic systems are deployed to support efforts in industry and exploration, it is vital that they can respond autonomously to new environments and off-nominal conditions. As in-space exploration continues to expand, with humanity returning to the Moon and journeying to Mars under NASA's current directive, the required infrastructure will grow in scale and complexity. Constructing and maintaining this infrastructure with an astronaut workforce creates a circular prerequisite: the infrastructure must already exist to support the astronauts who would build it. One solution is to deploy autonomous robotic systems that collaboratively assemble and maintain infrastructure such as living quarters and power acquisition facilities, similar to those in Fig. 1, prior to the arrival of astronauts [belvin_-space_2016]. Because of the cost of sending equipment into space and the time lost delivering it to a location such as Mars, it is not feasible to deliver pre-assembled facilities, so a variety of tasks will require autonomous robotic attention. To ensure that a single robot's failure does not halt construction, the types of jobs that different robots can complete must overlap. This overlap of ability between robotic units creates many feasible assembly schemes, and solving for schemes that are both valid and efficient will become more important as the number of assemblies increases. Beyond being solved before deployment, new schemes may also need to be determined after work has begun, if an off-nominal occurrence invalidates the original scheme.

This work seeks to provide a general problem formulation describing the in-space assembly task assignment problem, facilitating the application of different solution methods that seek a valid and optimal assembly scheme. Additionally, two solution formulations were developed and evaluated. The general formulation takes the form of a flexible job shop scheduling problem (FJSP), which was then used to investigate two solution methodologies: mixed integer programming (MIP) and reinforcement learning (RL). MIP was selected for its extensive use in solving job shop scheduling problems (JSP) [OZGUVEN2012846, ku_mixed_2016] and its ability to find provably optimal solutions given enough computation time. RL was chosen for its inherent ability to learn from elements not explicitly defined in the solution formulation, which allows increasing interactions and prerequisites between jobs without adding complexity to the state space [sutton_reinforcement_1998]. The rest of this paper is structured as follows. Section II discusses the general problem formulation as an FJSP. Section III defines the realistic scenario used in the solution evaluations. Section IV presents the MIP solution formulation, followed by Section V, which presents the RL solution formulation. Section VI describes the simulation details for the MIP and RL formulations, and Section VII discusses the results from both simulations. The final section, Section VIII, discusses the results and future research.

## II General Problem Formulation

This paper proposes framing the autonomous assembly task assignment problem as an FJSP, an extension of the JSP in which each operation can be processed on any machine. In this formulation, a given job, $j \in J$, has a set of operations, $O_{j,p}$, where $p \in P_j$ is a processing plan defining the processing strategy. This processing strategy defines how the operations are processed by the machines (robotic units), $m \in M$. Depending on the type of operation, different types of robots will have different completion efficiencies. If $m$ represents a type of robot working on a specific project (a set of jobs required to complete the desired facility or structure) and $q$ represents a type of operation in the project, then the completion efficiency for each pair can be represented as an efficiency matrix, $E$, with entries $E_{m,q}$. In many projects, work on a given job cannot begin until a different job in the project has already been completed. To represent this precedence constraint efficiently, a directed acyclic graph (DAG), $G = (V, A)$, is defined where the set of vertices, $V$, represents the jobs and the arcs, $A$, represent the directed precedence paths between jobs. Additionally, $A' \subseteq A$ represents the subset of these arcs that require machine-operation continuity. In an assembly scenario, it is also important to include the distances between jobs in the task assignment process. These distances can be represented as the edges, $D$, of a complete graph whose vertices are the jobs in the project.
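As a minimal sketch of how these objects might be encoded (all names and values below are hypothetical illustrations, not taken from the paper):

```python
# Illustrative encoding of the formulation's objects (hypothetical names
# and toy values, not the paper's data).

# Completion-efficiency matrix: rows index robot types, columns index
# operation types; a lower entry means the robot completes that
# operation type faster.
efficiency = {
    "Assembler": {"hold": 1,  "weld": 5,  "locomote": 100},
    "MARC":      {"hold": 5,  "weld": 1,  "locomote": 5},
    "LSMS":      {"hold": 10, "weld": 10, "locomote": 1},
}

# Precedence DAG: vertices are jobs, arcs are (predecessor, successor).
precedence_arcs = [("move_frame", "jig_frame"), ("jig_frame", "affix_frame")]

# Continuity arcs (a subset of the precedence arcs): the same machine must
# carry its operation across the arc, e.g. keep holding a part from
# jigging through affixing.
continuity_arcs = [("jig_frame", "affix_frame")]

def predecessors(job, arcs):
    """Jobs that must complete before `job` may start."""
    return [u for (u, v) in arcs if v == job]
```

A scheduler can then query `predecessors` for each job before assigning it a start time.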

## III Experimental Scenario

To illustrate the implementation of this formulation and provide a realistic experimental example for simulation, consider a solar farm assembly scenario in which robotic systems must autonomously assemble solar panels for an in-space application on the surface of the Moon. For such a mission, it is beneficial to launch the solar panels as base components to maximize volume usage in the delivery system. These components will then be deposited in a storage area on the edge of the project's workspace. As mentioned in Section I, robotic units with a range of abilities would be present to assemble these base components into solar panels. One such unit would be a large robot capable of unloading the basic components from a lander or moving them quickly across the workspace. NASA Langley Research Center (LaRC) has developed the Lightweight Surface Manipulation System (LSMS) [doggett_design_2008, dorsey_recent_2011] to accomplish this type of task. This assembly scenario will also require robots that are mobile and capable of utilizing grippers to: move assembly components, operate tools used for affixing components, and collaborate in carrying larger components. This type of robot will likely take a form akin to the rover and arm planned for the Mars 2020 mission [williford_chapter_2018], where a robotic arm is affixed to a rover chassis capable of traversing the terrain in the assembly workspace. The experimental robot fulfilling this role in the research presented here is the Mobile Assembly Robotic Collaborator (MARC) at the Field and Space Experimental Robotics (FASER) Laboratory [noauthor_virginia_nodate]. In autonomous assembly, precise manipulation is required to align components before they are permanently affixed into place. To accomplish this type of task, LaRC has developed the Assembler robot, a serialized parallel robot consisting of modularizable Stewart platforms [moser_reinforcement_2019]. Built units of these robots can be seen in Fig. 2; the robots used in the following simulations are modeled after them.

As a representation of the solar panel assembly, this work models the solar panel as three components: two frame components and one solar cell sheet, shown in Fig. 2(a). These three components will be attached via the connection points shown in the same figure.

To assemble this solar array, the components must be moved, jigged (aligned), and then affixed (welded or attached) at the connection points. These three types of jobs, denoted M, J, and A respectively, have operations consisting of combinations of four operation types that constitute the set $Q$. The operation types and the set of operations for each job are given in Tables I and II, respectively. The completion efficiency of each robot type for each operation type is also given in Table I. To successfully assemble the solar panel, both frame components must first be moved to the assembly location. Following this, each component of the frame must be aligned into position with respect to the adjoining piece (in this representation, only one component needs to be aligned to a stationary piece for each frame piece). Once aligned, it must be welded into place before the alignment can be released. When the frame is ready to receive the solar cell sheet, the sheet is moved into place. One end is affixed to the frame, the cell is unrolled, and then it is affixed to the final side of the frame, ending the assembly sequence. The DAG representing these precedence constraints for assembling a solar panel is given in Fig. 2(c); the holding constraints lie between the jigging jobs and their respective affixing jobs, since the components cannot be disturbed between these jobs. This representation of the solar farm assembly problem will be used in the following sections to demonstrate the solution formulations that solve for a valid assembly scheme.
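As an illustrative sketch, the precedence DAG can be encoded and checked for a valid assembly order (the job labels follow Table III; the arc list below is our reading of the assembly description, not taken verbatim from the paper):

```python
from collections import deque

# Hypothetical arc list for the solar panel DAG: move (M), jig (J), and
# affix (A) jobs, labeled as in Table III.
arcs = [
    ("M1", "Ja"), ("M2", "Ja"),    # both frame pieces moved before jigging
    ("M1", "Jb"), ("M2", "Jb"),
    ("Ja", "Aa"), ("Jb", "Ab"),    # jig before affix (holding constraints)
    ("Aa", "M3"), ("Ab", "M3"),    # frame finished before sheet is moved
    ("M3", "Jc"), ("Jc", "Ac"),    # first side jigged, then affixed
    ("Ac", "Md"), ("Md", "Jd"), ("Jd", "Ad"),  # unroll, jig, affix last side
]

def topological_order(arcs):
    """Kahn's algorithm: a valid assembly order, or [] if the graph is cyclic."""
    nodes = {n for arc in arcs for n in arc}
    indeg = {n: 0 for n in nodes}
    for _, v in arcs:
        indeg[v] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for a, b in arcs:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return order if len(order) == len(nodes) else []
```

Any schedule produced by a solver must order jobs consistently with some topological order of this DAG.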

q | Description | Assembler | MARC | LSMS
---|---|---|---|---
0 | Hold frame link | 1 | 5 | 10
1 | Hold solar cell sheet | 1 | 5 | 10
2 | Weld connection point | 5 | 1 | 10
3 | Locomote | 100 | 5 | 1

Jobs | Process plans |
---|---|

## IV Mixed Integer Programming Formulation

Mixed integer programming is a powerful tool for solving optimization problems and provides a way to compare solutions to calculated bounds on the optimal objective. In particular, commercial solvers use branch and bound at their core to intelligently explore the space of feasible solutions while producing tighter bounds on how good the optimal solution could be. As a result, the solver can report an optimality gap (a ratio measuring how far the best found solution may be from optimal). The formulation, originally inspired by [OZGUVEN2012846], is written to allow multiple process plans per job, allowing different types of operations to complete a job and hence the potential for faster job completion.
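For illustration, the optimality gap mentioned above can be sketched as follows (this mirrors the common solver convention for minimization; it is our sketch, not any solver's code):

```python
def optimality_gap(incumbent, bound):
    """Relative gap between the best found (incumbent) objective and the
    best proven bound from branch and bound, for a minimization problem:
    gap = |incumbent - bound| / |incumbent|. A gap of 0 proves optimality."""
    if incumbent == 0:
        return float("inf")  # conventionally undefined at zero objective
    return abs(incumbent - bound) / abs(incumbent)
```

For example, an incumbent makespan of 100 with a proven lower bound of 95 gives a 5% gap.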

The model is described in the following form. For shorthand, $j$, $p$, $o$, and $m$ mean $j \in J$, $p \in P_j$, $o \in O_{j,p}$, and $m \in M$, respectively. Furthermore, a specific process plan is referred to as $p_j$ and operations as $o_{j,p}$, to reference which job or plan they come from.

Sets

$A$: arc set for precedence of jobs
$A' \subseteq A$: operation continuity set

Parameters

$t_{m,o}$: operation time for machine $m$ to do operation $o$
$s_{m,j,j'}$: setup time for machine $m$ moving from job $j$ to job $j'$

Continuous Variables (non-negative)

$S_j$: the start time of job $j$
$C_j$: completion time of job $j$
$C_{\max}$: upper bound of completion times (makespan)
$C_o$: completion time of operation $o$
$C_{m,j}$: completion time of machine $m$ on job $j$

Binary Variables

$y_{j,p} = 1$ if plan $p$ is selected for job $j$
$x_{m,j} = 1$ if machine $m$ is selected for job $j$
$z_{m,j,j'} = 1$ if machine $m$ moves from job $j$ to job $j'$
$a_{m,o} = 1$ if machine $m$ is assigned to operation $o$

Mixed Integer Programming Model

Objective function:

$$\min \; C_{\max} \qquad \text{(1a)}$$

Completion and start time constraints: the total makespan is reached when all jobs are finished, and a job is finished when all of its operations are finished. Furthermore, from the DAG, a job cannot start until its preceding jobs are completed:

$$C_{\max} \ge C_j \quad \forall j \in J, \qquad C_j \ge C_o \quad \forall o \in O_{j,p} \qquad \text{(1b)}$$

$$S_j \ge C_{j'} \qquad \forall (j', j) \in A \qquad \text{(1c)}$$

If machine $m$ moves from job $j$ to job $j'$, then $j'$ must start after the completion of $m$ on $j$ plus the travel time from $j$ to $j'$ ($L$ is a sufficiently large number):

$$S_{j'} \ge C_{m,j} + s_{m,j,j'} - L(1 - z_{m,j,j'}) \qquad \forall m \in M, \; j, j' \in J \qquad \text{(1d)}$$

$$C_{m,j} \ge C_o - L(1 - a_{m,o}) \qquad \forall o \in O_{j,p} \qquad \text{(1e)}$$

Assignment constraints: exactly one process plan is chosen for each job $j$:

$$\sum_{p \in P_j} y_{j,p} = 1 \qquad \forall j \in J \qquad \text{(1f)}$$

If machine $m$ is assigned to operation $o \in O_{j,p}$, then process plan $p$ must be chosen for job $j$:

$$a_{m,o} \le y_{j,p} \qquad \forall m \in M, \; o \in O_{j,p} \qquad \text{(1g)}$$

If plan $p$ is chosen, then each operation $o$ in $O_{j,p}$ must be assigned exactly one machine:

$$\sum_{m \in M} a_{m,o} = y_{j,p} \qquad \forall o \in O_{j,p} \qquad \text{(1h)}$$

If $m$ is assigned to $o \in O_{j,p}$, then $m$ must be chosen for job $j$:

$$a_{m,o} \le x_{m,j} \qquad \forall m \in M, \; o \in O_{j,p} \qquad \text{(1i)}$$

If machine $m$ is chosen for job $j$, then its path must have exactly one edge entering and one edge leaving that job:

$$\sum_{j'} z_{m,j',j} = x_{m,j} \qquad \forall m \in M, \; j \in J \qquad \text{(1j)}$$

$$\sum_{j'} z_{m,j,j'} = x_{m,j} \qquad \forall m \in M, \; j \in J \qquad \text{(1k)}$$

Each machine $m$ must have a path from a (dummy) starting node, $0$, to a (dummy) ending node, $f$:

$$\sum_{j \in J} z_{m,0,j} = 1, \qquad \sum_{j \in J} z_{m,j,f} = 1 \qquad \forall m \in M \qquad \text{(1l)}$$

Operation continuity: for the specific arcs $(j, j') \in A'$, if machine $m$ does operation $o$ for $j$, then it must do the corresponding operation $o'$ for $j'$, and it must transition to job $j'$ next:

$$a_{m,o} = a_{m,o'} \qquad \forall (j, j') \in A', \; m \in M \qquad \text{(1m)}$$

$$z_{m,j,j'} \ge a_{m,o} \qquad \forall (j, j') \in A', \; m \in M \qquad \text{(1n)}$$
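As a toy illustration of the search space this model describes, the following sketch enumerates machine assignments for a three-job chain and picks the minimum makespan (the job names, operation times, and the omission of travel times are all simplifications of ours, not the paper's instance):

```python
from itertools import product

# Toy instance: a three-job precedence chain move -> jig -> affix.
jobs = ["move", "jig", "affix"]
machines = ["Assembler", "MARC", "LSMS"]

# Hypothetical operation times t[m][job] (lower is faster), loosely
# echoing the efficiency differences described in the scenario.
t = {
    "Assembler": {"move": 100, "jig": 1,  "affix": 5},
    "MARC":      {"move": 5,   "jig": 5,  "affix": 1},
    "LSMS":      {"move": 1,   "jig": 10, "affix": 10},
}

def makespan(assignment):
    """Jobs run strictly in precedence order, so the makespan is the sum
    of operation times (setup/travel times omitted for brevity)."""
    return sum(t[m][job] for job, m in zip(jobs, assignment))

# Exhaustive search over all machine-to-job assignments: this is the
# feasible space that the MIP explores far more cleverly via bounds.
best = min(product(machines, repeat=len(jobs)), key=makespan)
```

Under these toy times, the optimum assigns the LSMS to move, the Assembler to jig, and MARC to affix, for a makespan of 3.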

## V Reinforcement Learning Formulation

As stated previously, a reinforcement learning method was chosen in an attempt to implicitly learn some of the environment dynamics that are explicitly encoded in the mixed integer programming formulation. The algorithm chosen for this evaluation was Q-learning, an off-policy temporal difference algorithm. The Q-learning update is defined by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad \text{(2)}$$

This temporal difference method samples the environment, learning from experience rather than a dynamics model, giving it similarities to a Monte Carlo method, while updating estimates partially based on previously learned estimates, making it akin to a dynamic programming method [sutton_reinforcement_1998]. The method updates $Q$ (the action-value function) to approximate the optimal action-value function off-policy (i.e., independently of the current policy). The discount factor, $\gamma$, controls the weight of the future effect of the action chosen at time $t$. The learning rate, $\alpha$, controls the rate at which the current action-value entry is updated by the temporal difference term. To evaluate whether the RL agent was able to learn the precedence and holding constraints, the action space was made the same size as the state space, allowing the policy to choose any state as the next state. To minimize the total makespan, a small penalty was incurred at every time step, and a large penalty was given if the policy violated the precedence or holding constraints or tried to complete a job that had already been completed.

## VI Simulation

To evaluate the solution methodologies with the scenario described above, the completion efficiency values shown in Table I were chosen to reflect the general ability differences of the robots described above. The general spatial layout of the simulation environment is shown qualitatively in Fig. 4. In this workspace, the Assembler robot is used in its disassociated state; that is, each Stewart platform is spatially separated from the others while still being treated as a single robotic unit. This is represented by the four red diamonds. Additionally, the LSMS base and its end-effector are denoted by the red hexagons, while the MARC units are represented by the red triangles. The blue circles represent the qualitative spatial locations of the jobs. The spatial distances between jobs used in the simulations are shown in Table III. The time it takes a robot to traverse a distance is the distance value multiplied by the robot's locomotion efficiency, divided by a constant (the same constant for all robots aside from MARC 2, which used a different value).
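The travel-time rule just described can be sketched as follows (the function name is ours, and the robot-specific constants are not reproduced here):

```python
def travel_time(distance, locomote_efficiency, constant):
    """Travel-time rule from the simulation description: the distance
    multiplied by the robot's locomote efficiency (Table I, q = 3),
    divided by a robot-specific constant (values not reproduced here)."""
    return distance * locomote_efficiency / constant
```

For example, a robot with locomote efficiency 1 crossing a distance of 40 with a constant of 2 takes 20 time units.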

### VI-A Mixed Integer Programming

For the mixed integer programming simulation, the setup time was the amount of time it took a robot to traverse the distance between jobs. The current model formulation allows robots to begin the simulation at any location, which is easily adjusted by changing the fixed starting points. For this simulation, only one process plan was used, which prescribed one machine per task for a given job. The operation time was the completion efficiency, given in Table I, of the robot assigned to the operation. Using these parameters and the formulation described in Section IV, this simulation was evaluated using the Gurobi solver [noauthor_gurobi_nodate]. The results from this simulation are discussed in Section VII.

Jobs | M1 | M2 | Ja | Jb | Aa | Ab | M3 | Jc | Ac | Md | Jd | Ad
---|---|---|---|---|---|---|---|---|---|---|---|---
M1 | 0 | 0 | 43 | 50 | 38 | 55 | 0 | 36 | 35 | 0 | 56 | 55
M2 | 0 | 0 | 41 | 50 | 36 | 55 | 0 | 35 | 35 | 0 | 55 | 55
Ja | 43 | 41 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 40 | 0 | 0
Jb | 50 | 50 | 0 | 0 | 0 | 0 | 52 | 0 | 0 | 53 | 0 | 0
Aa | 38 | 36 | 0 | 0 | 0 | 0 | 35 | 0 | 0 | 35 | 0 | 0
Ab | 55 | 55 | 0 | 0 | 0 | 0 | 57 | 0 | 0 | 58 | 0 | 0
M3 | 0 | 0 | 40 | 52 | 35 | 57 | 0 | 35 | 35 | 0 | 55 | 55
Jc | 36 | 35 | 0 | 0 | 0 | 0 | 35 | 0 | 0 | 35 | 0 | 0
Ac | 35 | 35 | 0 | 0 | 0 | 0 | 35 | 0 | 0 | 35 | 0 | 0
Md | 0 | 0 | 40 | 53 | 35 | 58 | 0 | 35 | 35 | 0 | 55 | 55
Jd | 56 | 55 | 0 | 0 | 0 | 0 | 55 | 0 | 0 | 55 | 0 | 0
Ad | 55 | 55 | 0 | 0 | 0 | 0 | 55 | 0 | 0 | 55 | 0 | 0

### VI-B Reinforcement Learning

The reinforcement learning simulation used the same setup and completion efficiency data as the MIP. However, due to limitations stemming from the size of the state space, only a subset of five jobs was used. These five jobs were chosen because they reflect the same constraints seen in the simulation for the MIP formulation. The state space for this Q-learning consisted of the five jobs multiplied by the 24 possible permutations based on the process plan, yielding a state space of 120 entries, where each state is a job and process-plan assignment combination. This leads to a state-action $Q$-table of size $120 \times 120$, since the action space consists of the next job and process plan to proceed to from a given state. The learning parameters $\alpha$ and $\gamma$ were chosen after a coarse grid search. The penalty for violating a constraint or trying to complete an already-completed job was far larger than the penalty for each time unit spent. This RL simulation was programmed using OpenAI's Gym framework [noauthor_gym_nodate].
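A minimal tabular Q-learning loop in the spirit of this setup might look as follows (a toy three-job chain with hypothetical penalty values and parameters, not the paper's 120-state simulation):

```python
import random

# Toy environment: states count the jobs completed so far on a 3-job
# precedence chain; the action picks the next job. Rewards mirror the
# penalty scheme described above (values here are hypothetical).
JOBS = 3
STEP_PENALTY = -1.0         # small penalty per time step
VIOLATION_PENALTY = -100.0  # large penalty for breaking precedence

def step(state, action):
    """Advance only if `action` is the next job in the precedence chain."""
    if action == state:
        return state + 1, STEP_PENALTY
    return state, VIOLATION_PENALTY

def train(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning; returns the learned Q-table."""
    rng = random.Random(seed)
    Q = [[0.0] * JOBS for _ in range(JOBS + 1)]  # Q[state][action]
    for _ in range(episodes):
        state, steps = 0, 0
        while state < JOBS and steps < 50:
            if rng.random() < eps:
                action = rng.randrange(JOBS)
            else:
                action = max(range(JOBS), key=lambda a: Q[state][a])
            nxt, reward = step(state, action)
            future = max(Q[nxt]) if nxt < JOBS else 0.0
            # Q-learning update: Q += alpha * (TD target - Q)
            Q[state][action] += alpha * (reward + gamma * future - Q[state][action])
            state, steps = nxt, steps + 1
    return Q
```

After training, the greedy policy reads each row's argmax, recovering the precedence-respecting job order.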

## VII Results

### VII-A Mixed Integer Programming

The mixed integer program was solved using Gurobi version 9.0 and its Python API [noauthor_gurobi_nodate]. Computations were done on a 2014 MacBook Air laptop running macOS Mojave with a 1.7 GHz Intel Core i7 and 8 GB of 1600 MHz DDR3 memory. For the single solar panel example, the MIP finds an optimal solution within 2 seconds and proves optimality in 5 seconds. A larger example with two solar panels was also tested, where an optimal solution was found in 26 seconds; however, the solver had trouble proving optimality. The optimal schedules output by the MIP for both cases are shown in Figs. 5 and 6. Colored rectangles indicate intervals of time during which a job is being worked on, while black lines indicate a machine traveling to a different job. Some jobs are located in the same place and do not require travel time between them.

### VII-B Reinforcement Learning

The number of jobs had to be reduced in order for the RL to converge to a viable, though non-optimal, solution. This non-optimality is highlighted by the fact that the LSMS was not chosen to complete a job for which it was the better choice, as shown in Fig. 7. Fig. 7(a) shows that the model has converged and that, with additional training time, the chosen actions for the given states will not change. It is important to note that the RL did not always converge to a correct schedule. Fig. 8 shows the variance between state-action spaces. This is most likely because the state-action space is very large compared to the number of correct state-actions that must be learned for an optimal schedule.

## VIII Discussion & Conclusion

The work presented here describes a novel application of the FJSP to frame the in-space autonomous assembly problem, providing a general description that can then be utilized by different solution formulations. The proposed MIP solution efficiently solved the test instances to optimality. This approach solves the deterministic version of the problem from an offline (predetermined solution) planning perspective, and it is ideal in that it guarantees the most efficient way to complete the project. However, in this formulation, all of the precedence and holding constraints had to be directly encoded in the constraint equations; as autonomous assembly scenarios become more complex, the inter-job dynamics will become harder to define explicitly. In contrast to the MIP results, the RL approach did not successfully converge to an optimal schedule. While it did learn the inter-job dynamics, creating a policy of decisions based on interactions with the environment and thus allowing flexible scheduling under unforeseen circumstances, it was limited by the state space formulation. Future research will evaluate ways to combine the strengths of these two methods, along with handling stochastic elements of uncertainty, to give autonomous systems the ability to autonomously learn and solve schedules for in-space assembly projects, facilitating persistent space exploration.