1 Introduction
Sankey diagram depicts flows among entities in a system, where the thickness of a flow represents the flow quantity. Although Sankey diagram is originally introduced in [6] to display energy flows of a steam engine, its excellence in emphasizing dominant flows in a network makes it popular for analyzing sequence data, such as a vendortocustomer network or pageviewing paths in Google Analytics ([5]). To better display data with increasing complexity, methods that automatically adjust the layout to enhance readability is in much desire. According to [1], a proper layout of the Sankey diagram should meet the following three criteria: minimum edge intersection, shortaspossible edge lengths and straight edges. In this paper, we focus on the first criteria: reducing the number of edge crossing.
The most common form of a Sankey diagram, called the parallel form, is a horizontallylayered diagram with entities assigned to different vertical layers and flows with various thicknesses connecting entities between layers (normally from left to right). By viewing entities as vertices and flows edges, we can transform the hierarchical structure of a Sankey diagram into a layered graph. In this way, layout optimization of a Sankey diagram can be generalized as the layered graph drawing problem formulated by [8], which is further decomposed into several subproblems in [7]. Among the subproblems our work focuses on the NPhard (proven in [2]) crossing minimization problem where an ordering of vertices for each layer is determined to achieve minimum edge crossing. However, our problem differs from the classic crossing minimization problem in that each edge crossing is weighted as the intersected edges bear different weights from the thicknesses of flows.
For the classic crossing minimization problem (without considering edge weight), the most famous heuristic method is the barycentric (BC) method proposed by [7]. In this method, each vertex is sorted among its layer in ascending order of the barycentre of vertices connecting to it. For crossing reduction with weighted edges, [1] gives an approach combining the BC method and linear programming. This combined method uses the ordering produced by the BC method as input of the linear programming method whose objective function is minimizing the weighted sum of distances between connected nodes. That is, the linear programming method improves the placement of nodes within each layer determined previously by the BC method. However, the resultant layout is nonoptimal since the heuristic BC method is used without considering edge weights. For exact solution, there is an integer linear programming (ILP) model with the objective being minimizing the sum of weighted crossing [2]. The attained optimal layout, in comparison to the layout obtained from the BC method, shows remarkable improvement on crossing reduction and readability improvement.
Another form of Sankey diagram with increasing popularity is the cycle form with flows travelling in opposite direction (right to left) representing reversed data flows such as the recycle of resources. In this case, we study the crossing reduction problem for a specific type of the cycle form where the reverse flows only exist between the last (rightmost) layer and the first (leftmost) layer. There is not yet an algorithm to the crossing reduction problem on this type of Sankey diagram.
In this work, we first propose a twostaged heuristic method to reduce weighted crossing in the parallel form of the Sankey diagram (Section 2). In the first stage, we design a Markov Chain Method where we formulate the process of obtaining an ordering as a Markov chain and solve it with the eigenvector corresponds to the second largest eigenvalue of the Markov transition matrix. The solution is sufficient while nonoptimal. For the second stage, we design a recursive Partition Refinement Method to further improve the ordering from the first stage. In this method, a vertex is given a range within the level and it gives different value within the range when used to calculate the barycentre of a connected vertex. We iterate the ordering in a backandforce manner among layers until the positions of vertices remain unchanged.
In the following Section 3, we show a modified version of the above method that is applicable for reducing weighted crossing on our specified cycle form of the Sankey diagram. In the beginning, we ignore the connection between the first and the last level such that the formulated graph becomes parallel again, allowing us to obtain a partially calculated barycentre ordering. Subsequently, the second stage of this amended method undergoes a circular iteration route to also include the connection between the first and last levels.
In the Experiment Section, we first show the effect of our method on the parallel form by comparing the resultant ordering with those of the exact ILP method, the heuristic BC method and the combined method. The comparison includes both the visual effects as well as the weighted and nonweighted number of crossed edges. We find that our method is able to produce much better ordering than the two heuristic methods. In the process we also compare the difference between orderings produced in the two stages to demonstrate the effectiveness of both stages. For the cycle form, as there is no other methods for comparison, we apply our modified method on an artificial dataset with known optimal ordering. The result shows that in this case we are able to achieve optimal ordering even in the first stage. We also select a nonoptimal output from stage 1 by using a different parameter set to verify the effectiveness of the second stage, which still produces the optimal ordering. Finally, we conduct a robustness test where we vary the complexity of the dataset and use the result from the ILP method as comparison. The test result verifies the stability of our method against the change of datasets.
2 Method on the Parallel Form
We start by formulating a Sankey diagram in parallel form with layers as a level layered graph . We regard each entity in the Sankey diagram as a vertex and let denote the set of all vertices in . We say and belong to the same level in the graph if the corresponding entities lie in the same layer in the diagram. Set of vertices in the th level is denoted as and form a partition of . Without loss of generality, we assume that in a Sankey diagram all links are formed between successive layers. For links connecting nodes belonging to nonsuccessive levels, we follow the practice in [2] and [1] where dummy entities are added to all crossed levels such that the ”long link” becomes the composition of several ”short links” of the same thickness as the ”long link” itself. Consequently, we say an undirected edge exists only if and belong to successive levels and are connected by a link in the diagram. We then have the edge set of as
With our assumption, can be partitioned into subsets where each subset is the set of edges connecting vertices between and . Weight of an edge , denoted as , follows from the thickness of the corresponding link in the Sankey diagram.
For the nlevel layered graph , its ordering is the set where denotes the ordering of vertices in . With the formulated graph , our aim is to find an ordered graph with reduced weighted crossing. We measure the weighted crossing of an ordered graph by , the sum of production between weights of the crossed edges. Its calculation is a variation from the method of obtaining crossing number in [8] and is described in the following. Given ordering , we first define for each with a weighted interconnection matrix of size where
(1) 
In particular, we use and to denote the
th row vector and transposed
th column vector of .To obtain the weighted crossing of for a particular pair of ordering nad , we need to reorder the rows and columns such that they comply with the given ordering. Therefore, we define for each ordering a transformation matrix with equals 1 if the th vertex in has index and 0 otherwise. Then the transformed matrix, denoted as , can be derived from the equation .
The weighted crossing of can therefore be calculated by the formula
(2) 
Subsequently, the total weighted crossing number of ordered graph with ordering is defined as
(3) 
It has been shown in [7] that barycentre ordering can effectively reduce crossing number . Here barycentre of a vertex is the weighted average of the position value of its connected vertices. A barycentre ordering places each vertex at its barycentre while avoiding the trivial case where all vertices share on barycentre value. To find such ordering, [7] also gives the heuristic Barycentre(BC) Method in which weights among connected vertices are equal, regardless the different weights of the corresponding edges. However, in our case, crossing involving edge with larger weight will contribute more to the total weighted crossing number and therefore has higher priority to be avoided when deciding the ordering. That is, the weight of a connected vertex is proportional to the weight of the corresponding edge. Moreover, to define the positions of vertices, we view each level as a vertical line of height equal 1 and each vertex point on the vertical line takes a position value within . We further define the position vector of the th level where is the position of .
For a vertex where , i.e. in one the the middle levels, all its connected vertices forms a neighboring vertex set . We further partition this set into the left neighboring vertex set and the right neighboring set containing vertices belonging to th level and the th level respectively. For in the first or last level, consists of one onesided neighboring set and an empty set for the other side. For each of the parted neighboring set, we have the vertex barycentre by the following equations
(4) 
(5) 
where is the norm of . Subsequently, we have the twosided barycentre of as the average of the two onesided barycentres:
(6) 
Given position vector , is the descending order of entries in with the first vertex in the ordering placed uppermost in the level. Subsequently, to find a barycentre ordering is to find the corresponding position vectors where each entry is the barycentre of the corresponding connected vertices. To this end, we design a twostage algorithm. In Stage 1, we introduce a Markov Chain Method which produces an ordering where most vertices satisfy the requirement for a barycentre ordering. In Stage 2, we give a Partition Refinement Method to refine the ordering from Stage 1 towards a complete barycentre ordering.
2.1 Stage 1: the Markov Chain Method
In Stage 1, we start with onesided barycentre and eventually we want each vertex to have equal left and right barycentres so as to achieve twosided barycentre. Given position vector where , we can update the the position vector such that for each . Rewrite this process in matrix form, we have
(7) 
where is the th left transition matrix of size transformed from
(8) 
However, we cannot determine an ordering if multiple entries in
have the same barycentre value. Consequently, we add a random matrix
of the same size as with normalized rows (row entries summing to 1) to equation 7 by a factor . Then equation 7 becomes(9) 
where we have the modified left transition matrix as .
Similarly, given position vector with , we update such that for . Here we define the modified right transition matrix where the original right transition matrix is again transformed from via
(10) 
Then we have the matrix form of the process as
(11) 
In addition, both modified transition matrices have the following properties: (1) all entries are nonnegative as the entries of are nonnegative weights of edges, (2) all row sums equal 1.
With the above, given position vector , with equation 7, we have , which means that vertices in are placed based on the positions of nodes in such that each node in is at its left barycentre. Then, with calculated, we can calculate and so forth. Finally we can obtain
(12) 
Therefore, by propagating the position vector of thorough the product of the position matrix from to , we are able to place all vertices in all levels in their left barycentres except for the first level based on .
On the other hand, given position vector , we can also get from equation 11 the following equation:
(13) 
Note that in this way all vertices in with are placed in their right barycentres based on .
From above, we can formally describe the Markov Chain Method. We set an initial position vector and use it to calculate from equation 12, with which we can reversely update by equation 13. We repeat the above process until it converges, i.e. both and remain unchanged in iteration. This indicates that each vertex in is placed on both its left and right barycentres, meaning that for any . From equation 6, we have for all defined above, showing that we are able to achieve a barycentre ordering with this process.
Let of size and of size , we simplify the above process as
(14) 
Here we define transition matrix . We show in the following that it is a
right stochastic matrix
, i.e. a nonnegative real square matrix with each row summing to 1 that are used to represent the probabilities in the transition of a Markov chain. Firstly,
is of size and therefore a square matrix. Moreover, by Property 2 of and , all entries in , and consequently , are obviously nonnegative real values. It remains to show that is a rownormalized matrix. To this end, we first prove the following lemma:Lemma 1.
Let , be two rownormalized matrices, then their product is also a rownormalized matrix.
Proof.
We denote and as
(15) 
Then we have
(16) 
and the sum of the th row of is
(17) 
∎
With lemma 1, it follows that the product of more than two rownormalized matrices is still rownormalized. Thus, we have that , and therefore are rownormalized matrices. This completes the proof that is a right stochastic matrix. As a result, we rewrite equation 14 in the form of a Markov chain
(18) 
where convergence is guaranteed.
To solve a Markov chain , we need to find the stationary distribution which satisfies , i.e. is invariant by the transition matrix . Normally, the stationary distribution is first right eigenvector corresponding to the largest eigenvalue of . However, in our case, all entries in are identical, which means that all vertices in should be placed at the same position. Consequently, vertices in other levels will also be placed at the same position as the barycentre remains unchanged. The resultant order is not a barycentre ordering by our previous definition.
On the other hand, the second largest signless eigenvalue of , , gives a heuristic solution for the Markov chain. One example of the usage of comes from [4], where the author aims to use Markov chain to solve the NPcomplete state clustering problem. In their case, the eigenvector associated with the largest eigenvalue yields a trivial solution having all vertices in one cluster. The second largest eigenvalue, however, gives a eigenvector that generates a satisfying approximation of the proper clustering. In our case, we find that is also a competent alternative to . The sign of an eigenvalue is insignificant here as it just inverts the resultant ordering upside down without changing the crossing of edges.
Given from solving the Markov chain, we obtain all other position vectors by and subsequently the ordering . This completes the Markov Chain Method.
The Markov Chain Method cannot yield a complete barycentre ordering for the following two reasons: (1) the random component in transition matrix from modified left/right transition matrix affects the final ordering; (2) the use of the second largest eigenvector is not optimal in nature. We propose a solution for the first problem. We solve the second problem in Stage 2 using the Partition Refinement Method.
Adding a random component in the left/right transition matrix avoids the problem of having multiple vertices with the same barycentre for which an ordering is impossible to determine. However, the calculated barycentres incorporate the randomness and then pass on to the subsequent levels. As a result, output of the Markov chain method is different each time. In this case, we repeat the Markov Chain Method for a predefined times and each time we calculate and record the output ordering and its weighted crossing . Then we choose the ordering with minimum weighted crossing as the bestin ordering of Stage 1. This is facilitated by the fact that there are various efficient methods for calculating the second largest eigenvalue and the corresponding eigenvector. While by the nature of randomness, the bestin ordering is still not a complete barycentre ordering, we have that the larger is, the smaller weighted crossing number the bestin ordering has.
2.2 Stage 2: Partition Refinement Method
In Stage 1, we view each vertex as a point in its level. As a result, it gives the position value of the corresponding point when calculating the barycentres of all its connected vertices. Stage 2, on the other hand, allows a vertex to give each connected vertex an individual position value within a certain range for the calculation of barycentre. That is, instead of using a point to represent a vertex, each vertex is given a range, called block, within the level while the end of each edge connecting this vertex is symbolized as a point in this given range. For , we use to denote the point of edge on the block of and for the position value of .
Apparently, should be dependent on the positions and therefore orders of and in their own levels. First of all, the th level is splitted into blocks of equal height and the th block from top to bottom is assigned to the th vertex in the ordering . Thus, the range of is attribute to the order of . Furthermore, the value of within the block follows from the position of , specifically the value of . For the th block in the th level, it has base and height . We obtain the initial position of from the order of in bestin ordering from Stage 1. For , , ordering of points and complies with that of and . Ordered points of vertices in are then distributed evenly within the block from top to bottom, dividing the block into equal segments. Points of vertices in follow the same. Let has order in and a connected vertex has order in , then
(19) 
If belongs to , Equation 19 uses instead of .
With from Stage 1, we can assign the vertices based on their order in the level. Subsequently, we obtain initial values for all points with the above equation. Then the iteration begins with calculating the barycentre for each of the middle levels from left to right. Starting with the second level, we obtain the barycentre of each vertex from both the first and the third levels. For the calculation of , rather than using the position vector as in Equation 4, we use a position vector unique for this where
And we have
Then the twosided barycentre can be derived from Equation 6. With all barycentres of vertices in the second layer calculated, we have a new ordering complying with the descending ordering of the barycentres. On the basis of the new , we reassign a block for each vertex. Asides from the change in the range of value, new position of a point is also under the affect of point ’s position. Instead of evenly distributing the points within the block as before, we let takes the value of after being scaled to the height of the block. Thereupon, let the order of in the new be , Equation 19 becomes
The calculation of the remaining middle levels follows. For each vertex in the last level, its barycentre is equal to its left barycentre. After updating the ordering and points for the last level, we go backwards and repeat the operation on the middle levels from right to left. Procedure on the first level is similar to that of the last. This completes one iteration.
This iteration converges to a complete barycentre ordering when the ordering remains unchanged for a predetermined times. However, as mentioned before, a complete barycentre ordering is not necessarily an optimal one. It is therefore possible that we encounter an ordering with less weighted crossing than the final complete barycentre ordering. In light of this possibility and our ultimate aim of reducing weighted crossing, we keep a record of the weighted crossing from each iteration if the resultant ordering is different from its predecessor. This allows us to choose the ordering with minimum weighted crossing produced in the process as the output ordering of Stage 2.
3 Modified Method on the cycle form
In this section, we modify the method introduced in previous section to reduce weighted crossing of the specified cycle form of Sankey diagram. We start by formulating the cycle form of the Sankey diagram as a circular layer graph. We refer the links connecting the last and the first level as the binding links. By ignoring the binding links, the cycle form becomes a parallel one with . On the other hand, for the binding links, we use to denote corresponding the edge set. Subsequently, we have the graph for the original cycle form by adding back , i.e. .
For Stage 1, we supply to the Markov Chain method. In this way, Stage 1 yields the bestin ordering without considering the binding links. Subsequently, we utilize Stage 2 to take the binding links back into consideration.
In Stage 2, the first modification we make to the Partition Refinement method is on the calculation of barycentre. In the previous section, the barycentre of vertex in is just the barycentre of its right neighboring set. Now each vertex in also has a left neighboring set consisting of connected vertices in the last level, allowing the Equation 6 to be applicable. Similarly, vertices in the last level also have their own right neighboring set for calculating the barycentres.
Given the circular nature of graph , the route of iteration also changes. Instead of adopting the ”back and forth” manner in the previous section, after obtaining a new , we proceed to apply the method on the first level. The modified Partition Refinement method is summarized as pseudo code in Algorithm 1.
4 Result
We prove the efficiency of our method on the parallel form by comparing our result with that of the stateofart heuristic method in [1]. Moreover, we show that our method produces nearoptimal result by demonstrating a low contrast between the our performance and that of the optimal ILP method in [2]. Besides from visual comparison, we also measure the performances of the above methods by both the weighted and nonweighted crossing of the output ordering. In particular, resultant orderings from the three comparing methods are obtained from the corresponding Sankey diagrams provided in their articles. To increase the sensitivity of our method towards edges with considerably small weights, we use logarithm (with base 10) for the edge weights in the above tests to reduce their difference. For our adapted method on the cycle form, we apply it on a graph with zero crossing to see if it can achieve the optimal result in this case. We also conduct a robust test to validate the consistency of our method against varying complexity of graph.
4.1 Test against StateofArt Heuristic Method
In [1], they applied their method on Canada’s energy usage data in 1978 and Figure 0(a) displays their resultant Sankey diagram. We apply our method on the same dataset. In Stage 1, we set and . In Stage 2, we set , and obtain the convergent result with only 10 iterations. Figure 0(b) and Figure 0(c) give the Sankey diagrams with orderings from Stage 1 and 2 of our algorithm respectively. Table 1 summarizes the weighted and nonweighted crossing of orderings from all three Sankey diagrams in Figure 1.
Ordering from:  Figure 0(a)  Figure 0(b)  Figure 0(c) 

Weighted Crossing  146.77  112.59  87.855 
NonWeighted Crossing  279  226  158.0 
Base on the comparison of both crossing measurements, we see that even without the refinement in Stage 2, the output from Stage 1 already surpass that of the heuristic method. And the improvement in both measurements from Stage 1 to Stage 2 validates the effectiveness of Stage 2. To be specific, we find that Stage 2 is able to resolve some unnecessary crossings in Stage 1 resulting from the additional random component. On the other hand, Stage 1 gives a satisfying semibarycentre ordering such that Stage 2 only takes a few iterations to converge.
4.2 Test against ILP Method and BC Method
This test examines the difference between output from our method and the optimal output from the ILP method in [2] to see if we have a nearoptimal result. Moreover, we compare our result against that of the BC method which shares our idea of finding a barycentre ordering such that we can can demonstrate our ability to produce a better barycentre ordering.
In this test, we use the same dataset as in [2]: the ”World Greenhouse Gas Emissions” data from the World Resource Institute [3]. From [2] we have results of both ILP method and the BC method on this dataset. For our method, we supply , to Stage 1 and , to Stage 2 which converges with only one iteration. For the output orderings of the two comparing methods and the two stages of our method, we calculate their weighted and nonweighted crossings and summarize them in Table 2. We also plot the four orderings as Sankey diagrams in 2 for visual comparison.
Comparing Figure 1(b) and Figure 1(d), we see that we are still at a relatively small distance from the optimal layout. Table 2 shows that although the output from Stage 2 has less nonweighted crossing number but still larger weighted crossing number.
On the other hand, our output excels from that of the BC method from both visual aspect and two crossing measurements. Also, BC method is an iterative method and therefore requires time to achieve a nearoptimal ordering. In contrast, the output from Stage 1 already suffices as a nearoptimal ordering. Moreover, iteration times to convergence in Stage 2 is also small.
Ordering from:  Figure 1(a)  Figure 1(b)  Figure 1(c)  Figure 1(d) 

Weighted Crossing  1220.07  156.64  300.89  278.68 
NonWeighted Crossing  322  125  126  121 
4.3 Test of Modified Method on the Cycle Form
In this test we apply the modified algorithm on an artificial dataset of cycle form. This dataset has the same number of level and same number of vertices in each level as the dataset in the previous test against the ILP method and the BC method. In particular, this artificial data has a known optimal layout which has zero crossing number.
In this test case, we find that Stage 1 alone is able to produce an optimal layout with , . To test the effectiveness of Stage 2, we reset in Stage 1 and obtain an ordering of and . Stage 2 then refines the BestinN ordering to the optimal ordering within 2 iterations with .
4.4 Robust Test
With this test, we aim to show the stability of our method towards cases of different complexity level. The complexity level of a graph is measured by the number of level (denoted as ) and the number of vertex in each level (denoted as ). Consequently, we vary both and
and for each pair we generate ten different random cases. The total edge number of a random case is considered an estimation of its complexity. All test case run with
, for Stage 1 and , for Stage 2. We record for each case the weighted crossing produced by both Stage 1 and Stage 2, as well as the optimal result from the ILP method for comparison. Let a result from our method be and the corresponding optimal result be , we measure their difference by a ratio(20) 
where is a very small number to avoid the denominator being 0 when . The reason we do not use the difference between and directly is because as the complexity of the graph increase, the resultant weighted crossing and therefore the difference are also increasing.
We summarize the result of the robust test in Figure 3. For starters, it shows that most of result from Stage 1 are less than twice of the ILP result. Stage 2 further improves the result to be no more than 1.5 times of the ILP result. We believe this demonstrates the consistency of our method’s performance on cases with various complexity.
5 Conclusion
In this paper, we investigate the NPhard weighted crossing reduction problem for the Sankey diagram. Other than the common parallel form of Sankey diagram, we also study a particular circular form where the first and the last layers are connected.
Our heuristic method, aiming to find a barycentre ordering, are composed of two stages. The first stage employs the Markov Chain method and the second stage serves to improve Stage 1’s output with the Partition Refinement Method. We also adapt this method to be applicable for reducing weighted crossing in the specified circular form of the diagram.
After experiments, we can conclude that our method performs nearly as well as the ILP method and surpasses the existing heuristic methods. In terms of the measurement of weighted crossing, in the ILP experiment, our method achieved 300.89 weighted crossings, very close to the 278.68 weighted crossings from the ILP method while the BC method has a weighted crossing number of 1220.07. Also, we obtained only 87.855 weighted crossings while the stageofart heuristic method attained 146.77. Visually speaking, we were able to obtain high readability even from complicated sevenlayer data. We also performed a robust test which verified the stability of our method against changing complexity of dataset.
References
 [1] (20160301) EnergyViz: an interactive system for visualization of energy systems. The Visual Computer 32 (3), pp. 403–413. External Links: ISSN 14322315, Document, Link Cited by: §1, §1, §2, 0(a), §4.1, §4.
 [2] (201804) Optimal sankey diagrams via integer programming. pp. 135–139. External Links: Document Cited by: §1, §1, §2, §4.2, §4.2, §4.
 [3] (2009) World greenhouse gas emissions in 2005. World Resources Institute. Cited by: §4.2.

[4]
(2011)
Markov chains and spectral clustering
. In Performance Evaluation of Computer and Communication Systems. Milestones and Future Challenges: IFIP WG 6.3/7.3 International Workshop, PERFORM 2010, in Honor of Gnter Haring on the Occasion of His Emeritus Celebration, Vienna, Austria, October 1416, 2010, Revised Selected Papers, K. A. Hummel, H. Hlavacs, and W. Gansterer (Eds.), pp. 87–98. External Links: ISBN 9783642255755, Document, Link Cited by: §2.1.  [5] (2016) What are health website visitors doing: insights from visualisations towards exploratory search. In Proceedings of the 28th Australian Conference on ComputerHuman Interaction, pp. 631–633. Cited by: §1.
 [6] (1898) Introductory note on the thermal efficiency of steamengines. In Minutes of Proceedings of the Institution of Civil Engineers, Vol. 134. Cited by: §1.
 [7] (198102) Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics 11 (2), pp. 109–125. External Links: Document, ISSN 00189472 Cited by: §1, §1, §2.
 [8] (1977) Crossing theory and hierarchy mapping. IEEE Transactions on Systems, Man, and Cybernetics 7 (7), pp. 505–523. Cited by: §1, §2.
Comments
There are no comments yet.