In recent decades, reconfigurable hardware, and field programmable gate arrays (FPGAs) in particular, has received much attention because it can be rapidly reconfigured into any desired custom computing architecture. We can construct an entire hardware system on an FPGA chip or include an FPGA on a system-on-chip to provide hardware programmability. Traditionally, FPGAs are exploited using compile-time (static) reconfiguration, and the configuration remains the same throughout the running time of an application. To change the configuration, we have to stop the computation, reconfigure the chip by means of power-on resetting, and then start a new application. With the evolution of FPGA technology, dynamic reconfiguration (DR) has been developed, which provides more flexibility to reconfigure the FPGA by changing its predetermined functions at run-time. Through DR, one large application can be partitioned into smaller tasks; then, the tasks can be sequentially configured at run-time. In this process, the entire chip must be reconfigured for each task; thus, significant reconfiguration overhead is incurred for loading the configuration each time.
To reduce the reconfiguration overhead and improve performance, several techniques are employed in modern FPGA architectures, such as partially dynamic reconfiguration (PDR), module reuse, and configuration prefetching. PDR reconfigures part of the FPGA at run-time while the remaining areas of the FPGA continue to operate normally. By applying the PDR technique, different tasks can be executed and configured in parallel, and a portion of the configuration latency can be hidden by careful scheduling of the configurations and executions of tasks. Hereafter, an FPGA that supports PDR is referred to as a partially dynamically reconfigurable FPGA (PDR-FPGA).
To implement a large application composed of task modules on a PDR-FPGA, we must consider two problems: when the task modules should be configured and executed and where the task modules should be placed. The former is a scheduling problem, and the latter is a floorplanning problem. Unfortunately, both of them are NP-hard. In addition, to enable PDR, the reconfigurable resources on the FPGA are partitioned into several reconfigurable regions, which will be dynamically reconfigured to realize different tasks over time. Therefore, the number of partitioned reconfigurable regions and their sizes should be considered in this process.
I-A Related Work
Many studies have focused on partitioning, scheduling, and floorplanning for PDR. R. Cordone et al. proposed an integer linear programming (ILP) based method and a heuristic method for partitioning and scheduling task graphs on PDR-FPGAs, where configuration prefetching and module reuse are considered to minimize the reconfiguration overhead. A. Purgato et al. proposed a fast task scheduling heuristic that schedules the tasks in either hardware or software to minimize the overall execution time on partially reconfigurable systems. However, the proposed method only focuses on generating reconfigurable regions to satisfy the resource requirements, so the final result can easily fail to produce a valid floorplan. Y. Jiang et al. proposed a network flow-based multi-way task partitioning algorithm to minimize the total communication costs across temporal partitions. However, in this work, the partitioning is simplified without considering partial reconfiguration, and it is difficult to effectively estimate the communication costs without the floorplan information. All the aforementioned works mainly focus on partitioning/scheduling of the tasks without consideration of the floorplan, so the resulting schedule often cannot be floorplanned effectively, because the resource constraints on the FPGA chips are not considered.
E. A. Deiana et al. proposed a mixed-integer linear programming (MILP) based scheduler for mapping and scheduling applications on partially reconfigurable FPGAs; if the schedule cannot be successfully floorplanned, the scheduler is re-executed until a feasible floorplan is identified. However, the time-consuming MILP based method is impractical for large applications. In addition, scheduling and floorplanning are solved separately, which can cause large communication costs in the spatial domain. M. Vasilko proposed a temporal floorplanning method for solving the scheduling and floorplanning of dynamically reconfigurable systems. P. Yuh et al. modeled the tasks as three-dimensional (3D) boxes and proposed simulated annealing-based 3D floorplanners to solve the floorplanning and scheduling problems of the tasks. However, the task modules are assumed to be reconfigurable at any time and in any region, which may not match practical reconfigurable architectures. For example, in the Virtex 7 series FPGA chips from Xilinx, the reconfiguration partitions (dynamically reconfigurable regions) cannot overlap. Given scheduled task graphs, many works have focused on the floorplanning of partially reconfigurable designs [14, 15, 16, 17, 18, 19].
The design of reconfigurable systems with PDR generally involves partitioning, scheduling, and floorplanning of the tasks, which are interdependent considering communication costs and system performance. Therefore, these three problems have to be solved in an integrated optimization framework to effectively explore the design space. However, the aforementioned works either solve the three problems sequentially, where, at most, a simple iterative refinement between scheduling and floorplanning is included, or solve only two of the three problems in an integrated framework.
In this paper, we propose an integrated optimization framework for task partitioning, scheduling, and floorplanning on partially dynamically reconfigurable FPGAs. This paper expands our previous work. Theoretical analyses are provided for the feasibility of partitioned sequence triples (defined below). The main contributions of this paper are outlined as follows.
1). A partitioned sequence triple is proposed to represent the partitions, schedule, and floorplan of task modules, where , , and are the sequences of task modules. is regarded as a hybrid nested sequence pair () representing the floorplan with spatial and temporal partition, and is the partitioned dynamic configuration order of the tasks. The floorplan can be computed from the in time, and the schedule of tasks can be computed in time by solving a single-source longest-path problem on a reconfiguration constraint graph (RCG), which is constructed based on the partitioned sequence triple and the task precedence graph.
2). We elaborate a perturbation method to integrate the exploration of the schedule and floorplan design space into simulated annealing-based searching. In the perturbation, a randomly chosen task module is removed from the partitioned sequence triple and is then re-inserted at a proper position selected from all possible insertion points, which are efficiently evaluated in time based on an insertion point enumeration procedure.
3). We prove a necessary and sufficient condition for the feasibility of the partitioning of tasks and scheduling of task configurations, which is not included in , and derive conditions for the feasibility of the insertion points in a partitioned sequence triple.
The experimental results demonstrate the efficiency and effectiveness of the proposed optimization framework.
The remainder of the paper is organized as follows. Section II describes the target hardware architecture and the problem definition. Section III discusses the representation of a sequence triple. Section IV shows the optimization framework to explore the design space of partitioning, scheduling and floorplanning of task modules. Experimental results and conclusions are shown and discussed in Section V and Section VI, respectively.
II Problem Description
II-A Dynamically Reconfigurable Architecture
The dynamically reconfigurable system typically includes a host processor, an FPGA chip, an external memory, and the communication infrastructure among them. The host processor and communication infrastructure could be on-chip or off-chip. Pre-synthesized task modules are stored in off-chip external memory in the form of bitstreams. According to the scheduled sequence and floorplanned locations, the host processor deploys task modules on the FPGAs.
Modern FPGAs have evolved into complex heterogeneous and hierarchical devices. However, the basic logic cell still comprises configurable logic blocks (CLBs). In the target architecture, the CLB is the smallest reconfigurable element. Configuration bitstreams are transferred into FPGAs using one configuration port, which is either an external Joint Test Action Group (JTAG) port or an internal configuration access port (ICAP).
On the other hand, PDR is subject to a technology limitation: the configuration process of a task module must not disrupt the execution of other task modules. Thus, generally, dynamically reconfigurable regions (DRRs), where the task modules are dynamically reconfigured in a manner similar to that of a context (time layer) switching mode, are used for implementing partial reconfiguration. On an FPGA chip, we can have multiple DRRs, and one DRR can be dynamically reconfigured while the others continue to execute.
A DRR is a rectangular region on the FPGA because irregular-shaped reconfiguration regions (such as T or L shapes) can introduce routing restriction issues. A task can be implemented as a rectangular hardware module on the FPGA. The module area represents the occupied CLBs (the number of rows and columns on the FPGA).
II-B Problem Definition
The design is composed of pre-synthesized tasks whose resource usage and internal routing are predetermined. Let be a set of tasks. A task
, has a physical attribute vector,. The meanings are shown in Table I. is proportional to the area and is estimated by , where is the configuration time of a single CLB.
|and is a task module.|
|number of CLB rows and CLB columns required by .|
|configuration span (time) and execution span (time) of .|
|start configuration time /start execution time of .|
|and is a dynamically reconfigurable region.|
|the DRR where is located.|
|the -th time layer in .|
|the time layer where is located.|
|configuration span of time layer .|
|start configuration time of time layer .|
|configuration order of time layer .|
|start of lifetime of a time layer .|
|end of lifetime of a time layer .|
|=(, ), lifetime of a time layer .|
The data dependencies among these tasks are given as a task dependence graph, , where and , and must end before starts. denotes the transitive closure of .
The partitioning, scheduling, and floorplanning of PDR are formulated as follows:
In the spatial domain, the tasks are partitioned into DRRs. Let be the number of DRRs.
The DRRs are denoted as , where , . If , we denote the DRR of as .
In the temporal domain, the tasks are partitioned into different time layers to reuse the resources of DRRs. A time layer is configured as a whole. Thus, in the same DRR, a time layer can only be configured after the completion of all the tasks in the previous time layer. Let be the number of time layers in .
The time layers are denoted as , where , . If , we denote the time layer of as and the total number of time layers as .
For convenience, we define to be the configuration order of time layer , and stipulate that , .
The configuration span (time) of the time layers in a DRR is proportional to the area of the DRR; we use or (= ) to denote the configuration span of a time layer . To reduce the time complexity in the proposed integrated optimization framework, is also under-estimated by summing the configuration time of task modules:
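As a rough illustration of this estimate, the following minimal sketch assumes that the configuration time of a task module is its CLB count (rows times columns) multiplied by the per-CLB configuration time, and that the configuration time of a DRR is computed from its area in the same way. The function and parameter names (`task_config_time`, `t_clb`, and so on) are hypothetical and are not taken from the paper.

```python
def task_config_time(rows, cols, t_clb):
    """Configuration time of one task module: occupied CLBs times per-CLB time."""
    return rows * cols * t_clb


def layer_config_time_estimate(task_dims, t_clb):
    """Under-estimate for a time layer: sum over the task modules it contains."""
    return sum(task_config_time(r, c, t_clb) for r, c in task_dims)


def drr_config_time(drr_rows, drr_cols, t_clb):
    """Configuration time of a whole DRR, proportional to its area."""
    return drr_rows * drr_cols * t_clb


# Hypothetical time layer with two task modules placed in a 4 x 4 DRR.
layer = [(2, 3), (4, 1)]  # (CLB rows, CLB columns) per task module
estimate = layer_config_time_estimate(layer, t_clb=2)
actual = drr_config_time(4, 4, t_clb=2)
print(estimate, actual)  # 20 32
```

Because the task modules of one time layer are placed without overlap inside the DRR, the summed estimate never exceeds the area-based value, which is why it acts as an under-estimate.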
For the scheduling, we consider the following constraints:
The precedence constraints between tasks cannot be violated, that is, .
A task must be configured before execution, that is, .
Considering the technical limitation of only one configuration port, the configuration span of time layers must be non-overlapped.
In the same DRR, a time layer can only be configured after the execution of all the tasks in the previous time layer because they share the same hardware resources.
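The four scheduling constraints can be checked mechanically once start times are known. The sketch below is only an illustration under assumed data structures: the `Task` and `Layer` records, their field names, and the `check_schedule` function are hypothetical, and each time layer is assumed to carry its position (`index`) within its DRR.

```python
from dataclasses import dataclass


@dataclass
class Task:
    layer: str          # name of the time layer containing the task
    exec_start: float
    exec_span: float


@dataclass
class Layer:
    drr: str            # name of the DRR containing the time layer
    index: int          # position of the layer inside its DRR (0 = first)
    conf_start: float
    conf_span: float


def check_schedule(tasks, layers, deps):
    """Return a list of violated constraints (empty if the schedule is valid)."""
    violations = []
    # 1) Precedence: a predecessor must finish before its successor starts.
    for u, v in deps:
        if tasks[u].exec_start + tasks[u].exec_span > tasks[v].exec_start:
            violations.append(("precedence", u, v))
    # 2) A task executes only after its time layer has been configured.
    for name, t in tasks.items():
        lay = layers[t.layer]
        if lay.conf_start + lay.conf_span > t.exec_start:
            violations.append(("config-before-exec", name, t.layer))
    # 3) One configuration port: configuration spans must not overlap.
    ordered = sorted(layers.items(), key=lambda kv: kv[1].conf_start)
    for (a, la), (b, lb) in zip(ordered, ordered[1:]):
        if la.conf_start + la.conf_span > lb.conf_start:
            violations.append(("port-overlap", a, b))
    # 4) Same DRR: a layer is configured after all tasks of the previous layer.
    for name, lay in layers.items():
        for pname, prev in layers.items():
            if prev.drr == lay.drr and prev.index == lay.index - 1:
                for t in tasks.values():
                    if t.layer == pname and t.exec_start + t.exec_span > lay.conf_start:
                        violations.append(("drr-reuse", pname, name))
    return violations
```

For the port constraint, comparing only neighbours in start-time order is enough to detect whether any overlap exists, because an interval that overlaps a later one also overlaps its immediate successor.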
The constraints for the floorplanning process are as follows.
Each DRR occupies a rectangular region, and all the rectangular regions of the DRRs should be placed without overlapping each other and should be within the FPGA chip area, which is defined by the chip width and chip height (fixed-outline constraint).
The task modules in the same time layer must be non-overlapped and placed within their corresponding DRR.
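The floorplanning constraints can be verified in the same spirit. The following sketch assumes axis-aligned rectangles given as `(x, y, width, height)` tuples in CLB units with the chip origin at `(0, 0)`; the function names and the dictionary layout are hypothetical choices for illustration.

```python
def rects_overlap(a, b):
    """True if two (x, y, w, h) rectangles share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def rect_inside(inner, outer):
    """True if rectangle `inner` lies completely within rectangle `outer`."""
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def check_floorplan(drrs, modules, chip_w, chip_h):
    """drrs: name -> rect; modules: name -> (drr name, time layer, rect)."""
    errors = []
    chip = (0, 0, chip_w, chip_h)
    names = sorted(drrs)
    for i, a in enumerate(names):
        if not rect_inside(drrs[a], chip):          # fixed-outline constraint
            errors.append(("outside-chip", a))
        for b in names[i + 1:]:
            if rects_overlap(drrs[a], drrs[b]):     # DRRs must not overlap
                errors.append(("drr-overlap", a, b))
    mods = sorted(modules)
    for i, m in enumerate(mods):
        drr, layer, rect = modules[m]
        if not rect_inside(rect, drrs[drr]):        # module stays inside its DRR
            errors.append(("outside-drr", m))
        for n in mods[i + 1:]:
            drr2, layer2, rect2 = modules[n]
            # Only modules of the same time layer must be non-overlapping.
            if (drr, layer) == (drr2, layer2) and rects_overlap(rect, rect2):
                errors.append(("module-overlap", m, n))
    return errors
```

Modules from different time layers of the same DRR may occupy the same area, since they are never resident at the same time.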
Under the above constraints, we solve the partitioning problem to determine and , the scheduling problem to determine the start configuration time and start execution time of the tasks (time layers), and the floorplanning problem to determine the floorplan of DRRs and the floorplan of tasks inside the DRRs.
We define the schedule length to be the time from the beginning of the configuration process to the end of the executions of all tasks. The objective is to find a reasonable floorplan of tasks on a PDR-FPGA while minimizing the schedule length of designs as well as the communication costs among tasks.
III Partitioned Sequence Triple
In this paper, a partitioned sequence triple (-) is proposed to represent the partitioning, scheduling, and floorplanning of tasks for partially dynamically reconfigurable designs.
The partitioned sequence triple - is a 3-tuple of task sequences, , where forms a hybrid nested sequence pair () to represent the spatial partition (DRR), the temporal partition (time layer) and the floorplan of the task modules, and defines the configuration order of the time layers.
In a partitioned sequence triple, task partitioning is constrained as follows:
The task modules in the same time layer will consecutively appear in , , and .
The task modules in the same DRR will consecutively appear in both and .
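These two consecutiveness constraints are easy to test on concrete sequences. The sketch below assumes that the three sequences are given as Python lists of unique task identifiers and that the partitions are given as dictionaries of task sets; `seq1`, `seq2`, `seq3`, and the function names are hypothetical placeholders, and the DRR constraint is assumed to apply to the first two sequences only.

```python
def appears_consecutively(seq, group):
    """True if all members of `group` occupy one contiguous block of `seq`."""
    if not group:
        return True
    positions = [i for i, t in enumerate(seq) if t in group]
    return (len(positions) == len(group)
            and positions[-1] - positions[0] + 1 == len(positions))


def satisfies_partition_constraints(seq1, seq2, seq3, layers, drrs):
    """layers / drrs: name -> set of task identifiers."""
    # Tasks of one time layer must be contiguous in all three sequences.
    layers_ok = all(appears_consecutively(s, g)
                    for g in layers.values() for s in (seq1, seq2, seq3))
    # Tasks of one DRR must be contiguous in the first two sequences.
    drrs_ok = all(appears_consecutively(s, g)
                  for g in drrs.values() for s in (seq1, seq2))
    return layers_ok and drrs_ok


layers = {"A": {1, 2}, "B": {3, 4}, "C": {5}}
drrs = {"D1": {1, 2, 3, 4}, "D2": {5}}
print(satisfies_partition_constraints(
    [1, 2, 3, 4, 5], [2, 1, 4, 3, 5], [5, 3, 4, 1, 2], layers, drrs))  # True
```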
The structure of - is illustrated as follows:
, , .
In a -, denotes the sequence of tasks in the time layer and is the sequence of tasks in the DRR .
An imposes the position relationship between each pair of task modules as follows.
if or , then
is to the left of ;
is below .
Notice that the relationship between the task modules from different time layers in the same DRR is not defined, as there are no non-overlapping constraints involved. Without loss of generality, we require the task modules in the same time layer to occur consecutively in and for clarity in representing the partitions of time layers and the floorplan of time layers.
The configuration order of a time layer can be represented by a configuration sequence , which is defined as follows:
Given an sequence, (), the configuration constraints are defined as follows:
1) if , then and are configured simultaneously, along with the corresponding time layer;
2) if , then is configured before , and the configuration order relationship is .
In the , the ordering of task modules within a time layer is irrelevant because the time layer is configured as a whole.
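The configuration order of the time layers can be read off the configuration sequence by collapsing each run of tasks that belong to the same time layer. The following minimal sketch assumes the sequence is a list of task identifiers and that a hypothetical `layer_of` dictionary maps each task to its time layer.

```python
def layer_configuration_order(config_seq, layer_of):
    """Collapse a task sequence into the order in which time layers are configured."""
    order = []
    for task in config_seq:
        layer = layer_of[task]
        if order and order[-1] == layer:
            continue  # same time layer: configured together, inner order is irrelevant
        if layer in order:
            raise ValueError(f"tasks of time layer {layer!r} are not consecutive")
        order.append(layer)
    return order


layer_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
print(layer_configuration_order([3, 4, 1, 2, 5], layer_of))  # ['B', 'A', 'C']
```

Swapping two tasks inside one time layer, for example using `[4, 3, 2, 1, 5]`, yields the same layer order.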
For example, a task graph with ten task modules is shown in Fig. 1 and a - in this example is given as follows:
To simplify the notation, we use to represent the task module in the examples of partitioned sequence triples. From the partitioned sequence triple, we can obtain the corresponding configuration order and floorplan on the FPGA, as shown in Fig. 2.
According to Definition 5 and the given configuration sequence , Fig. 2(a) shows the configuration order of the time layers. First, the time layer from is configured and, second, the time layer from can be configured during the executions of and . The computation of the beginning configuration times of time layers will be discussed in Section III-C.
Considering the relationship between each pair of task modules defined in , Fig. 2(b) shows the corresponding floorplan of task modules, where is below because they are in the same time layer ( = = ), and is below because they are in different DRRs ( and ).
Knowing the dimensions of task modules, we can compute the floorplan from the in time by solving the longest weighted common subsequence of and hierarchically. We can compute the floorplan of task modules within every DRR to obtain the occupied resource arrays of DRRs, and then compute the floorplan of DRRs to determine the total resource usage by regarding each DRR as a whole. The computation of the schedule will be discussed in the following subsections.
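To make the idea of deriving coordinates from a sequence pair concrete, the sketch below packs a single, flat sequence pair with a simple quadratic-time longest-path pass. It is not the hierarchical longest-weighted-common-subsequence procedure described above, and it assumes the conventional sequence-pair semantics: a module that precedes another in both sequences lies to its left, and a module that follows another in the first sequence but precedes it in the second lies below it. All names are hypothetical.

```python
def pack_sequence_pair(seq_a, seq_b, size):
    """size: module -> (width, height). Returns (x, y, total width, total height)."""
    pos_a = {m: i for i, m in enumerate(seq_a)}
    x, y = {}, {}
    # Every left-of or below predecessor of a module precedes it in seq_b,
    # so one pass in seq_b order sees all predecessors before the module itself.
    for i, m in enumerate(seq_b):
        x[m], y[m] = 0, 0
        for p in seq_b[:i]:
            if pos_a[p] < pos_a[m]:
                x[m] = max(x[m], x[p] + size[p][0])   # p is to the left of m
            else:
                y[m] = max(y[m], y[p] + size[p][1])   # p is below m
    width = max(x[m] + size[m][0] for m in seq_b)
    height = max(y[m] + size[m][1] for m in seq_b)
    return x, y, width, height


size = {"a": (2, 1), "b": (3, 2), "c": (1, 1)}
x, y, w, h = pack_sequence_pair(["a", "b", "c"], ["a", "b", "c"], size)
print(x, w, h)  # {'a': 0, 'b': 2, 'c': 5} 6 2
```

In a hierarchical setting, the same routine could be applied first to the modules of each time layer and then to the DRRs treated as single blocks, but that composition is left out here.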
III-B Feasibility of Partition and Configuration Order
Owing to the dependencies between tasks, not all partitioned sequence triples are feasible. In this subsection, we prove a necessary and sufficient condition for the feasibility of partitions and configuration order of task modules.
III-B1 Lifetime of Time Layers
The task modules in a time layer can be executed only after the configuration of the time layer and will be destroyed while configuring the next time layer (if more time layers exist) in the same DRR. Consequently, we have the following definition.
Given the spatial partition , the temporal partition , and the configuration order of the time layers, we define the lifetime of a time layer , , as follows.
Note that the lifetime of a time layer is also the lifetime of the task modules in the time layer.
To discuss the feasibility of a configuration order, we define the dependencies between time layers based on the dependency graph of tasks given and . A dependence graph is constructed as follows.
; If there exist and respectively from and such that }. Note that is the edge set of the transitive closure of .
III-B2 Dependencies Between Time Layers
Given a configuration order, the dependencies between time layers fall into two groups: forward dependencies and backward dependencies.
A dependence is forward if , which indicates that the output of a task module in a time layer, , is the input to a task module from a future time layer, .
Forward dependencies are always feasible because even if the lifetime of a time layer ends, the computed data can be stored and used in the future.
A dependence is backward if , which indicates that the output of a task module in a time layer is the input to a task module from an earlier configured time layer, .
However, backward dependencies are infeasible if there is no overlapping between the lifetimes of the dependent time layers, and , that is, . In this situation, is destroyed (replaced by a new time layer) before the time layer is configured, so the input to a task module is generated after the task module has been destroyed.
Fig. 3 shows examples of lifetimes of time layers and the dependencies between time layers. The spatial partition and the temporal partition of the tasks are shown in Fig. 2, and the dependencies between tasks are shown in Fig. 1. The configuration order of the time layers is as follows: (also shown as the x-axis in Fig. 3).
The time layers and have a backward dependence because needs the data from , as shown in Fig. 1, and their lifetimes and do not overlap. That is, in is destroyed ( in the same DRR has occupied the hardware resource) before the execution of in . Consequently, will never receive the data from , so the configuration order of task modules shown in Fig. 3 is infeasible.
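The lifetime test for backward dependencies can be sketched in a few lines. Since the exact lifetime formula is not reproduced here, the code assumes that a lifetime is measured in configuration-order positions: it starts at the position at which the time layer is configured and ends at the position at which the next time layer of the same DRR is configured (or never, for the last layer of a DRR). The function names and data layout are hypothetical.

```python
import math


def lifetimes(config_order, drr_layers):
    """config_order: list of layers; drr_layers: DRR -> layers in temporal order."""
    rank = {layer: i for i, layer in enumerate(config_order)}
    life = {}
    for layers in drr_layers.values():
        for cur, nxt in zip(layers, layers[1:] + [None]):
            end = rank[nxt] if nxt is not None else math.inf
            life[cur] = (rank[cur], end)
    return life


def infeasible_backward_deps(config_order, drr_layers, layer_deps):
    """layer_deps: list of (producer layer, consumer layer) pairs."""
    life = lifetimes(config_order, drr_layers)
    bad = []
    for src, dst in layer_deps:
        backward = life[src][0] > life[dst][0]   # producer is configured later
        # The consumer is already replaced when the producer gets configured.
        if backward and life[dst][1] <= life[src][0]:
            bad.append((src, dst))
    return bad


drr_layers = {"D1": ["A", "C"], "D2": ["B"]}
deps = [("C", "A"), ("B", "A")]
print(infeasible_backward_deps(["A", "B", "C"], drr_layers, deps))  # [('C', 'A')]
```

In this hypothetical instance, layer `A` is replaced by `C` in the same DRR, so data produced in `C` can never reach `A`, whereas `B` is configured while `A` is still resident.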
III-B3 Condition of Feasibility
We say that a given spatial partition, temporal partition, and configuration order are feasible if a schedule of the executions and configurations of task modules can be computed without consideration of resource constraints. We have the following theorem:
Theorem 1. A given spatial partition, temporal partition, and configuration order are feasible if and only if there are no backward dependencies between time layers that have no lifetime overlap.
Proof. Given a partition, a configuration order, and the task dependency graph, we can construct a reconfiguration constraint graph (RCG) for scheduling the configurations of the time layers and the executions of the task modules, i.e., the computation of , , and defined in Section II-B.
is constructed by adding to the graph () the vertex set and three edge sets representing the scheduling constraints. , where represents time layers and is defined in Section III-B. and , and are defined as follows.
The set of edges represents the configuration order. .
The set of edges indicates that a task must be executed only after the configuration of the time layer where is located. and .
The set of edges indicates that, in a DRR, a time layer must be configured after the execution of all the tasks in the previous time layer because they share the same hardware resources. , , and is the time layer before in }.
A schedule can be computed only if the RCG is acyclic because a cycle produces a conflict in the constraints.
ONLY IF. Here we show that if there are backward dependencies between time layers that have no lifetime overlap, there will be a cycle in the RCG, and hence the given partition and configuration order are infeasible.
A pair of time layers, and with , have a backward dependence if there are two task modules, and , respectively from and and there is a direct or indirect data dependence between them (). Fig. (a) shows an illustration of this, where a dashed arrow represents an edge or a path and solid arrows represent edges. When there is no overlap between the lifetimes of and , the hardware resources occupied by the time layer must be reconfigured to be the next time layer in the same DRR, before the configuration of , and there must be an edge from to (shown as a bold dashed arrow in Fig. (a)) because a time layer can only be configured after the execution of all tasks in the earlier time layers in the same DRR. Accordingly, a cycle is formed, which indicates a conflict among the constraints.
IF. Here, we show that if the given partition and configuration order are infeasible, there must be backward dependencies between time layers that have no lifetime overlap.
If the given partition and configuration order is infeasible, there must be a cycle in . Notice that . The subgraph induced by , which includes the edge set representing the configuration order of the time layers, is acyclic. The subgraph induced by (exactly ), which includes the edge set representing the dependences between tasks, is also acyclic. Moreover, all the edges in are from to , which represents that a task must be configured before it is executed, and all the edges in are from to , which represents that a time layer can only be configured after the execution of all the tasks in the earlier time layers. Consequently, the cycle must include four parts: 1) a path (one or more edges) from , 2) a path (one or more edges) from , 3) an edge from , and 4) an edge from .
Without loss of generality, we assume that the cycle includes a path from to and a path from to , respectively, constructed by the edges from and . The cycle must also include two edges: and . Fig.(b)b shows an illustration of this. According to the definition of , is in because we have the edge . On the other hand, the edge indicates that is configured after is executed, which means that must be located in the previous time layer of in , . We can see that and have a backward data dependence and their lifetimes are non-overlapping, as the region occupied by has been reconfigured to be before is configured.
Note that if represents a topological ordering of , the partition and configuration order will always be feasible because there are no backward dependencies involved.
Given a partition, a configuration order, and the task dependency graph, the RCG is acyclic if there is always a lifetime overlap between time layers that have backward dependencies.
Fig. (a) shows the RCG of the feasible partitioned sequence triple in Formula (2), where . If is changed to the configuration order in Fig. 3, , the corresponding RCG is shown in Fig. (b), where a cycle is formed and no feasible schedule can be found.
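The construction of the RCG and the acyclicity test can be illustrated as follows. This is a sketch under assumptions: the configuration-order edges are taken to connect consecutive time layers in the configuration order, vertices are tagged tuples (`("T", task)` and `("L", layer)`), and the function names are hypothetical. Acyclicity is checked with Kahn's topological-sort algorithm.

```python
from collections import defaultdict, deque


def build_rcg(task_deps, layer_of, config_order, drr_layers):
    """Return the edge set of a reconfiguration constraint graph."""
    # Data dependencies between tasks.
    edges = {(("T", u), ("T", v)) for u, v in task_deps}
    # Configuration order between consecutive time layers.
    edges |= {(("L", a), ("L", b)) for a, b in zip(config_order, config_order[1:])}
    # A task executes only after its time layer is configured.
    edges |= {(("L", layer), ("T", t)) for t, layer in layer_of.items()}
    # In one DRR, a layer is configured after all tasks of the previous layer.
    for layers in drr_layers.values():
        for prev, nxt in zip(layers, layers[1:]):
            edges |= {(("T", t), ("L", nxt))
                      for t, layer in layer_of.items() if layer == prev}
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be removed."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes |= {u, v}
    queue = deque(n for n in nodes if indeg[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return removed == len(nodes)


layer_of = {"t1": "A", "t2": "B", "t3": "C"}
drr_layers = {"D1": ["A", "C"], "D2": ["B"]}
order = ["A", "B", "C"]
print(is_acyclic(build_rcg([("t1", "t3")], layer_of, order, drr_layers)))  # True
print(is_acyclic(build_rcg([("t3", "t1")], layer_of, order, drr_layers)))  # False
```

In the second call, `t1` needs data from `t3`, but layer `C` can only be configured after `t1` has finished, which closes a cycle through the vertices of `C`, `t3`, and `t1`.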
III-C Computation of the Schedule
The schedule can be computed by finding the longest paths on the RCG with edges weighted as follows.
Let be the vertex corresponding to the time layer that is configured first (having zero in-degree in ), and denote the vertex-weighted longest-path from to a vertex . The schedule ( and ) can be determined by computing and , respectively.
The schedule length of the partially dynamically reconfigurable system is the maximum of the paths, and can be calculated as follows:
Given a feasible partitioned sequence triple, we can construct the RCG and compute the schedule in time if the RCG is acyclic, where is the number of task modules.
IV Optimization Framework
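The longest-path computation on an acyclic RCG can be sketched with the standard-library topological sorter (Python 3.9 or later). The weighting used here is an assumption: each time-layer vertex carries its configuration span and each task vertex its execution span, and the start time of a vertex is the longest weighted path leading to it, excluding its own weight. The function name `compute_schedule` is hypothetical.

```python
from graphlib import TopologicalSorter


def compute_schedule(weights, edges):
    """weights: vertex -> span; edges: iterable of (u, v) meaning u precedes v."""
    preds = {v: set() for v in weights}
    for u, v in edges:
        preds[v].add(u)
    start = {}
    # static_order() yields predecessors first and raises CycleError on a cycle.
    for v in TopologicalSorter(preds).static_order():
        start[v] = max((start[u] + weights[u] for u in preds[v]), default=0)
    length = max(start[v] + weights[v] for v in weights)
    return start, length


# Hypothetical RCG: layers LA and LB, task t1 in LA, task t2 in LB, t1 -> t2.
weights = {"LA": 2, "LB": 2, "t1": 3, "t2": 1}
edges = [("LA", "LB"), ("LA", "t1"), ("LB", "t2"), ("t1", "t2")]
start, length = compute_schedule(weights, edges)
print(start["t2"], length)  # 5 6
```

Here the configuration of `LB` overlaps the execution of `t1`, so the configuration latency of the second layer is hidden and `t2` starts as soon as `t1` finishes.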
An insertion point in the partitioned sequence triple - is defined as a four-tuple, , where , , and are the positions immediately after the - task module in , the - task module in , and the - task module in , respectively, and is the -th time layer in . = 0 (or = 0 or = 0) indicates the position before the first task module of the sequence.
In Section IV-B2, we will discuss the feasibility and types of insertion points in detail.
IV-A Overall Design Flow
In this work, we modify an existing perturbation method, Insertion-after-Remove (IAR), to explore the design space of the schedule and floorplan in a simulated annealing-based search. With the IAR operation, we can perturb the partitioning, scheduling, and floorplanning of task modules simultaneously. The detailed steps are as follows:
1) Select and remove a task module randomly and then compute the floorplan and schedule of task modules without the removed task module ;
2) Select a fixed number of feasible candidate insertion points, , for by rough evaluations of all the feasible insertion points;
3) Choose the best insertion point from for the removed task module by accurate evaluations.
In step 2, the feasible insertion points are evaluated by the linear combination of resource costs, schedule length, and communication costs. In this step, the resource costs are calculated accurately. To reduce the time complexity, the communication cost is calculated roughly without updating the floorplan and schedule of task modules, and the schedule length is roughly evaluated by under-estimating the configuration spans of time layers using Formula 1. In step 3, all the insertion points in will be evaluated accurately based on the entire floorplan considering the communication costs, and the best one will be chosen as the candidate insertion point. The feasibility of insertion points will be discussed in Subsection IV-B.
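The two-stage evaluation (a cheap filter followed by an accurate ranking) can be sketched generically. The cost functions are passed in as callables because their exact form is not reproduced here; `rough_cost`, `accurate_cost`, and `choose_insertion_point` are hypothetical names, and the default candidate count of 15 follows the value reported for the experiments.

```python
import heapq


def choose_insertion_point(points, rough_cost, accurate_cost, k=15):
    """Keep the k cheapest points under the rough cost, then pick the best accurately."""
    candidates = heapq.nsmallest(k, points, key=rough_cost)
    if not candidates:
        return None  # no feasible insertion point was supplied
    return min(candidates, key=accurate_cost)


# Toy stand-ins for the cost evaluations, only to show the control flow.
points = list(range(10))
best = choose_insertion_point(points,
                              rough_cost=lambda p: abs(p - 5),
                              accurate_cost=lambda p: p,
                              k=3)
print(best)  # 4
```

The accurate evaluation runs on at most `k` points, so its higher cost per point is paid only a bounded number of times in each perturbation.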
In the experiments, we set the size of to 15. The objective function is defined as the linear combination of the area cost (), which depends on the dimensions of all occupied resources (), the schedule length (), and the communication costs ():
IV-B Feasible insertion points in partitioned sequence triples
Generally, given a partitioned sequence triple of task modules, there are a total of insertion points for inserting a task module. However, when considering Theorem 1 and the definition of the partitioned sequence triple, some insertion points are infeasible. Here we discuss the feasibility of insertion points in partitioned sequence triples.
IV-B1 Lifetime overlap constraint
First, inserting could introduce new backward dependencies between time layers. To ensure the lifetime overlap between the backward-dependent time layers, we have the following corollary from Theorem 1.
Corollary 2. The lifetime of a time layer, , where is inserted, must satisfy the following condition.
Second, the lifetime of a time layer is changed when a new time layer is inserted into an existing DRR. To ensure the lifetime overlap between the time layers that have backward dependences, we have the following corollary from Theorem 1.
Corollary 3. Given a partition and configuration order, the lifetime of a time layer must satisfy the following minimum lifetime constraint, denoted as .
The minimum lifetime constraints ensure a lifetime overlap between any two time layers that have a backward dependence. This constraint must not be violated after is inserted back into the partitioned sequence triple. In Section IV-B3, an example is provided.
IV-B2 Feasibility of insertion points
Let , , and , represent the - task in , , and , respectively, with removed. For each possible insertion point , there exist three possible types of optional partitions to re-insert depending on the time layer .
Type-1: Create a new time layer in a new DRR, , for . In this case, must be located within the boundary of task sequences corresponding to different DRRs in -, i.e.,
Without loss of generality, we assume that , and that and correspond to a virtual DRR and to virtual time layers, respectively. , , , and are dealt with similarly.
This type of insertion point will not change the lifetimes of any other time layers according to Definition 3. Consequently, if the constraint (6) in Corollary 2 is satisfied, then is feasible. Note that the newly generated time layer is configured between and .
Type-2: Create a new time layer, , in an existing DRR, , for . In this case, the insertion point must be located within the boundary of task sequences corresponding to different time layers, i.e., there is a combination such that , and
This type of insertion point will change the lifetime of the time layer that is immediately before in . An insertion point is feasible if the constraints in both Corollary 2 and Corollary 3 are satisfied. Note that the newly generated time layer is configured between and .
Type-3: Insert into an existing time layer, . In this case, an insertion point must satisfy the condition that there is a combination , such that .
IV-B3 An example
Given the task dependencies shown in Fig. 1, we have the following partitioned sequence triple with removed: