Sankey diagram depicts flows among entities in a system, where the thickness of a flow represents the flow quantity. Although Sankey diagram is originally introduced in  to display energy flows of a steam engine, its excellence in emphasizing dominant flows in a network makes it popular for analyzing sequence data, such as a vendor-to-customer network or page-viewing paths in Google Analytics (). To better display data with increasing complexity, methods that automatically adjust the layout to enhance readability is in much desire. According to , a proper layout of the Sankey diagram should meet the following three criteria: minimum edge intersection, short-as-possible edge lengths and straight edges. In this paper, we focus on the first criteria: reducing the number of edge crossing.
The most common form of a Sankey diagram, called the parallel form, is a horizontally-layered diagram with entities assigned to different vertical layers and flows with various thicknesses connecting entities between layers (normally from left to right). By viewing entities as vertices and flows edges, we can transform the hierarchical structure of a Sankey diagram into a layered graph. In this way, layout optimization of a Sankey diagram can be generalized as the layered graph drawing problem formulated by , which is further decomposed into several sub-problems in . Among the sub-problems our work focuses on the NP-hard (proven in ) crossing minimization problem where an ordering of vertices for each layer is determined to achieve minimum edge crossing. However, our problem differs from the classic crossing minimization problem in that each edge crossing is weighted as the intersected edges bear different weights from the thicknesses of flows.
For the classic crossing minimization problem (without considering edge weight), the most famous heuristic method is the barycentric (BC) method proposed by . In this method, each vertex is sorted among its layer in ascending order of the barycentre of vertices connecting to it. For crossing reduction with weighted edges,  gives an approach combining the BC method and linear programming. This combined method uses the ordering produced by the BC method as input of the linear programming method whose objective function is minimizing the weighted sum of distances between connected nodes. That is, the linear programming method improves the placement of nodes within each layer determined previously by the BC method. However, the resultant layout is non-optimal since the heuristic BC method is used without considering edge weights. For exact solution, there is an integer linear programming (ILP) model with the objective being minimizing the sum of weighted crossing . The attained optimal layout, in comparison to the layout obtained from the BC method, shows remarkable improvement on crossing reduction and readability improvement.
Another form of Sankey diagram with increasing popularity is the cycle form with flows travelling in opposite direction (right to left) representing reversed data flows such as the recycle of resources. In this case, we study the crossing reduction problem for a specific type of the cycle form where the reverse flows only exist between the last (rightmost) layer and the first (leftmost) layer. There is not yet an algorithm to the crossing reduction problem on this type of Sankey diagram.
In this work, we first propose a two-staged heuristic method to reduce weighted crossing in the parallel form of the Sankey diagram (Section 2). In the first stage, we design a Markov Chain Method where we formulate the process of obtaining an ordering as a Markov chain and solve it with the eigenvector corresponds to the second largest eigenvalue of the Markov transition matrix. The solution is sufficient while non-optimal. For the second stage, we design a recursive Partition Refinement Method to further improve the ordering from the first stage. In this method, a vertex is given a range within the level and it gives different value within the range when used to calculate the barycentre of a connected vertex. We iterate the ordering in a back-and-force manner among layers until the positions of vertices remain unchanged.
In the following Section 3, we show a modified version of the above method that is applicable for reducing weighted crossing on our specified cycle form of the Sankey diagram. In the beginning, we ignore the connection between the first and the last level such that the formulated graph becomes parallel again, allowing us to obtain a partially calculated barycentre ordering. Subsequently, the second stage of this amended method undergoes a circular iteration route to also include the connection between the first and last levels.
In the Experiment Section, we first show the effect of our method on the parallel form by comparing the resultant ordering with those of the exact ILP method, the heuristic BC method and the combined method. The comparison includes both the visual effects as well as the weighted and non-weighted number of crossed edges. We find that our method is able to produce much better ordering than the two heuristic methods. In the process we also compare the difference between orderings produced in the two stages to demonstrate the effectiveness of both stages. For the cycle form, as there is no other methods for comparison, we apply our modified method on an artificial dataset with known optimal ordering. The result shows that in this case we are able to achieve optimal ordering even in the first stage. We also select a non-optimal output from stage 1 by using a different parameter set to verify the effectiveness of the second stage, which still produces the optimal ordering. Finally, we conduct a robustness test where we vary the complexity of the dataset and use the result from the ILP method as comparison. The test result verifies the stability of our method against the change of datasets.
2 Method on the Parallel Form
We start by formulating a Sankey diagram in parallel form with layers as a -level layered graph . We regard each entity in the Sankey diagram as a vertex and let denote the set of all vertices in . We say and belong to the same level in the graph if the corresponding entities lie in the same layer in the diagram. Set of vertices in the -th level is denoted as and form a partition of . Without loss of generality, we assume that in a Sankey diagram all links are formed between successive layers. For links connecting nodes belonging to non-successive levels, we follow the practice in  and  where dummy entities are added to all crossed levels such that the ”long link” becomes the composition of several ”short links” of the same thickness as the ”long link” itself. Consequently, we say an undirected edge exists only if and belong to successive levels and are connected by a link in the diagram. We then have the edge set of as
With our assumption, can be partitioned into subsets where each subset is the set of edges connecting vertices between and . Weight of an edge , denoted as , follows from the thickness of the corresponding link in the Sankey diagram.
For the n-level layered graph , its ordering is the set where denotes the ordering of vertices in . With the formulated graph , our aim is to find an ordered graph with reduced weighted crossing. We measure the weighted crossing of an ordered graph by , the sum of production between weights of the crossed edges. Its calculation is a variation from the method of obtaining crossing number in  and is described in the following. Given ordering , we first define for each with a weighted interconnection matrix of size where
In particular, we use and to denote the
-th row vector and transposed-th column vector of .
To obtain the weighted crossing of for a particular pair of ordering nad , we need to reorder the rows and columns such that they comply with the given ordering. Therefore, we define for each ordering a transformation matrix with equals 1 if the -th vertex in has index and 0 otherwise. Then the transformed matrix, denoted as , can be derived from the equation .
The weighted crossing of can therefore be calculated by the formula
Subsequently, the total weighted crossing number of ordered graph with ordering is defined as
It has been shown in  that barycentre ordering can effectively reduce crossing number . Here barycentre of a vertex is the weighted average of the position value of its connected vertices. A barycentre ordering places each vertex at its barycentre while avoiding the trivial case where all vertices share on barycentre value. To find such ordering,  also gives the heuristic Barycentre(BC) Method in which weights among connected vertices are equal, regardless the different weights of the corresponding edges. However, in our case, crossing involving edge with larger weight will contribute more to the total weighted crossing number and therefore has higher priority to be avoided when deciding the ordering. That is, the weight of a connected vertex is proportional to the weight of the corresponding edge. Moreover, to define the positions of vertices, we view each level as a vertical line of height equal 1 and each vertex point on the vertical line takes a position value within . We further define the position vector of the -th level where is the position of .
For a vertex where , i.e. in one the the middle levels, all its connected vertices forms a neighboring vertex set . We further partition this set into the left neighboring vertex set and the right neighboring set containing vertices belonging to -th level and the -th level respectively. For in the first or last level, consists of one one-sided neighboring set and an empty set for the other side. For each of the parted neighboring set, we have the vertex barycentre by the following equations
where is the norm of . Subsequently, we have the two-sided barycentre of as the average of the two one-sided barycentres:
Given position vector , is the descending order of entries in with the first vertex in the ordering placed uppermost in the level. Subsequently, to find a barycentre ordering is to find the corresponding position vectors where each entry is the barycentre of the corresponding connected vertices. To this end, we design a two-stage algorithm. In Stage 1, we introduce a Markov Chain Method which produces an ordering where most vertices satisfy the requirement for a barycentre ordering. In Stage 2, we give a Partition Refinement Method to refine the ordering from Stage 1 towards a complete barycentre ordering.
2.1 Stage 1: the Markov Chain Method
In Stage 1, we start with one-sided barycentre and eventually we want each vertex to have equal left and right barycentres so as to achieve two-sided barycentre. Given position vector where , we can update the the position vector such that for each . Rewrite this process in matrix form, we have
where is the -th left transition matrix of size transformed from
However, we cannot determine an ordering if multiple entries in
have the same barycentre value. Consequently, we add a random matrixof the same size as with normalized rows (row entries summing to 1) to equation 7 by a factor . Then equation 7 becomes
where we have the modified left transition matrix as .
Similarly, given position vector with , we update such that for . Here we define the modified right transition matrix where the original right transition matrix is again transformed from via
Then we have the matrix form of the process as
In addition, both modified transition matrices have the following properties: (1) all entries are non-negative as the entries of are non-negative weights of edges, (2) all row sums equal 1.
With the above, given position vector , with equation 7, we have , which means that vertices in are placed based on the positions of nodes in such that each node in is at its left barycentre. Then, with calculated, we can calculate and so forth. Finally we can obtain
Therefore, by propagating the position vector of thorough the product of the position matrix from to , we are able to place all vertices in all levels in their left barycentres except for the first level based on .
On the other hand, given position vector , we can also get from equation 11 the following equation:
Note that in this way all vertices in with are placed in their right barycentres based on .
From above, we can formally describe the Markov Chain Method. We set an initial position vector and use it to calculate from equation 12, with which we can reversely update by equation 13. We repeat the above process until it converges, i.e. both and remain unchanged in iteration. This indicates that each vertex in is placed on both its left and right barycentres, meaning that for any . From equation 6, we have for all defined above, showing that we are able to achieve a barycentre ordering with this process.
Let of size and of size , we simplify the above process as
Here we define transition matrix . We show in the following that it is a right stochastic matrix
right stochastic matrix
, i.e. a non-negative real square matrix with each row summing to 1 that are used to represent the probabilities in the transition of a Markov chain. Firstly,is of size and therefore a square matrix. Moreover, by Property 2 of and , all entries in , and consequently , are obviously non-negative real values. It remains to show that is a row-normalized matrix. To this end, we first prove the following lemma:
Let , be two row-normalized matrices, then their product is also a row-normalized matrix.
We denote and as
Then we have
and the sum of the -th row of is
With lemma 1, it follows that the product of more than two row-normalized matrices is still row-normalized. Thus, we have that , and therefore are row-normalized matrices. This completes the proof that is a right stochastic matrix. As a result, we rewrite equation 14 in the form of a Markov chain
where convergence is guaranteed.
To solve a Markov chain , we need to find the stationary distribution which satisfies , i.e. is invariant by the transition matrix . Normally, the stationary distribution is first right eigenvector corresponding to the largest eigenvalue of . However, in our case, all entries in are identical, which means that all vertices in should be placed at the same position. Consequently, vertices in other levels will also be placed at the same position as the barycentre remains unchanged. The resultant order is not a barycentre ordering by our previous definition.
On the other hand, the second largest sign-less eigenvalue of , , gives a heuristic solution for the Markov chain. One example of the usage of comes from , where the author aims to use Markov chain to solve the NP-complete state clustering problem. In their case, the eigenvector associated with the largest eigenvalue yields a trivial solution having all vertices in one cluster. The second largest eigenvalue, however, gives a eigenvector that generates a satisfying approximation of the proper clustering. In our case, we find that is also a competent alternative to . The sign of an eigenvalue is insignificant here as it just inverts the resultant ordering upside down without changing the crossing of edges.
Given from solving the Markov chain, we obtain all other position vectors by and subsequently the ordering . This completes the Markov Chain Method.
The Markov Chain Method cannot yield a complete barycentre ordering for the following two reasons: (1) the random component in transition matrix from modified left/right transition matrix affects the final ordering; (2) the use of the second largest eigenvector is not optimal in nature. We propose a solution for the first problem. We solve the second problem in Stage 2 using the Partition Refinement Method.
Adding a random component in the left/right transition matrix avoids the problem of having multiple vertices with the same barycentre for which an ordering is impossible to determine. However, the calculated barycentres incorporate the randomness and then pass on to the subsequent levels. As a result, output of the Markov chain method is different each time. In this case, we repeat the Markov Chain Method for a predefined times and each time we calculate and record the output ordering and its weighted crossing . Then we choose the ordering with minimum weighted crossing as the best-in- ordering of Stage 1. This is facilitated by the fact that there are various efficient methods for calculating the second largest eigenvalue and the corresponding eigenvector. While by the nature of randomness, the best-in- ordering is still not a complete barycentre ordering, we have that the larger is, the smaller weighted crossing number the best-in- ordering has.
2.2 Stage 2: Partition Refinement Method
In Stage 1, we view each vertex as a point in its level. As a result, it gives the position value of the corresponding point when calculating the barycentres of all its connected vertices. Stage 2, on the other hand, allows a vertex to give each connected vertex an individual position value within a certain range for the calculation of barycentre. That is, instead of using a point to represent a vertex, each vertex is given a range, called block, within the level while the end of each edge connecting this vertex is symbolized as a point in this given range. For , we use to denote the point of edge on the block of and for the position value of .
Apparently, should be dependent on the positions and therefore orders of and in their own levels. First of all, the -th level is splitted into blocks of equal height and the -th block from top to bottom is assigned to the -th vertex in the ordering . Thus, the range of is attribute to the order of . Furthermore, the value of within the block follows from the position of , specifically the value of . For the -th block in the -th level, it has base and height . We obtain the initial position of from the order of in best-in- ordering from Stage 1. For , , ordering of points and complies with that of and . Ordered points of vertices in are then distributed evenly within the block from top to bottom, dividing the block into equal segments. Points of vertices in follow the same. Let has order in and a connected vertex has order in , then
If belongs to , Equation 19 uses instead of .
With from Stage 1, we can assign the vertices based on their order in the level. Subsequently, we obtain initial values for all points with the above equation. Then the iteration begins with calculating the barycentre for each of the middle levels from left to right. Starting with the second level, we obtain the barycentre of each vertex from both the first and the third levels. For the calculation of , rather than using the position vector as in Equation 4, we use a position vector unique for this where
And we have
Then the two-sided barycentre can be derived from Equation 6. With all barycentres of vertices in the second layer calculated, we have a new ordering complying with the descending ordering of the barycentres. On the basis of the new , we reassign a block for each vertex. Asides from the change in the range of value, new position of a point is also under the affect of point ’s position. Instead of evenly distributing the points within the block as before, we let takes the value of after being scaled to the height of the block. Thereupon, let the order of in the new be , Equation 19 becomes
The calculation of the remaining middle levels follows. For each vertex in the last level, its barycentre is equal to its left barycentre. After updating the ordering and points for the last level, we go backwards and repeat the operation on the middle levels from right to left. Procedure on the first level is similar to that of the last. This completes one iteration.
This iteration converges to a complete barycentre ordering when the ordering remains unchanged for a predetermined times. However, as mentioned before, a complete barycentre ordering is not necessarily an optimal one. It is therefore possible that we encounter an ordering with less weighted crossing than the final complete barycentre ordering. In light of this possibility and our ultimate aim of reducing weighted crossing, we keep a record of the weighted crossing from each iteration if the resultant ordering is different from its predecessor. This allows us to choose the ordering with minimum weighted crossing produced in the process as the output ordering of Stage 2.
3 Modified Method on the cycle form
In this section, we modify the method introduced in previous section to reduce weighted crossing of the specified cycle form of Sankey diagram. We start by formulating the cycle form of the Sankey diagram as a circular layer graph. We refer the links connecting the last and the first level as the binding links. By ignoring the binding links, the cycle form becomes a parallel one with . On the other hand, for the binding links, we use to denote corresponding the edge set. Subsequently, we have the graph for the original cycle form by adding back , i.e. .
For Stage 1, we supply to the Markov Chain method. In this way, Stage 1 yields the best-in- ordering without considering the binding links. Subsequently, we utilize Stage 2 to take the binding links back into consideration.
In Stage 2, the first modification we make to the Partition Refinement method is on the calculation of barycentre. In the previous section, the barycentre of vertex in is just the barycentre of its right neighboring set. Now each vertex in also has a left neighboring set consisting of connected vertices in the last level, allowing the Equation 6 to be applicable. Similarly, vertices in the last level also have their own right neighboring set for calculating the barycentres.
Given the circular nature of graph , the route of iteration also changes. Instead of adopting the ”back and forth” manner in the previous section, after obtaining a new , we proceed to apply the method on the first level. The modified Partition Refinement method is summarized as pseudo code in Algorithm 1.
We prove the efficiency of our method on the parallel form by comparing our result with that of the state-of-art heuristic method in . Moreover, we show that our method produces near-optimal result by demonstrating a low contrast between the our performance and that of the optimal ILP method in . Besides from visual comparison, we also measure the performances of the above methods by both the weighted and non-weighted crossing of the output ordering. In particular, resultant orderings from the three comparing methods are obtained from the corresponding Sankey diagrams provided in their articles. To increase the sensitivity of our method towards edges with considerably small weights, we use logarithm (with base 10) for the edge weights in the above tests to reduce their difference. For our adapted method on the cycle form, we apply it on a graph with zero crossing to see if it can achieve the optimal result in this case. We also conduct a robust test to validate the consistency of our method against varying complexity of graph.
4.1 Test against State-of-Art Heuristic Method
In , they applied their method on Canada’s energy usage data in 1978 and Figure 0(a) displays their resultant Sankey diagram. We apply our method on the same dataset. In Stage 1, we set and . In Stage 2, we set , and obtain the convergent result with only 10 iterations. Figure 0(b) and Figure 0(c) give the Sankey diagrams with orderings from Stage 1 and 2 of our algorithm respectively. Table 1 summarizes the weighted and non-weighted crossing of orderings from all three Sankey diagrams in Figure 1.
|Ordering from:||Figure 0(a)||Figure 0(b)||Figure 0(c)|
Base on the comparison of both crossing measurements, we see that even without the refinement in Stage 2, the output from Stage 1 already surpass that of the heuristic method. And the improvement in both measurements from Stage 1 to Stage 2 validates the effectiveness of Stage 2. To be specific, we find that Stage 2 is able to resolve some unnecessary crossings in Stage 1 resulting from the additional random component. On the other hand, Stage 1 gives a satisfying semi-barycentre ordering such that Stage 2 only takes a few iterations to converge.
4.2 Test against ILP Method and BC Method
This test examines the difference between output from our method and the optimal output from the ILP method in  to see if we have a near-optimal result. Moreover, we compare our result against that of the BC method which shares our idea of finding a barycentre ordering such that we can can demonstrate our ability to produce a better barycentre ordering.
In this test, we use the same dataset as in : the ”World Greenhouse Gas Emissions” data from the World Resource Institute . From  we have results of both ILP method and the BC method on this dataset. For our method, we supply , to Stage 1 and , to Stage 2 which converges with only one iteration. For the output orderings of the two comparing methods and the two stages of our method, we calculate their weighted and non-weighted crossings and summarize them in Table 2. We also plot the four orderings as Sankey diagrams in 2 for visual comparison.
Comparing Figure 1(b) and Figure 1(d), we see that we are still at a relatively small distance from the optimal layout. Table 2 shows that although the output from Stage 2 has less non-weighted crossing number but still larger weighted crossing number.
On the other hand, our output excels from that of the BC method from both visual aspect and two crossing measurements. Also, BC method is an iterative method and therefore requires time to achieve a near-optimal ordering. In contrast, the output from Stage 1 already suffices as a near-optimal ordering. Moreover, iteration times to convergence in Stage 2 is also small.
|Ordering from:||Figure 1(a)||Figure 1(b)||Figure 1(c)||Figure 1(d)|
4.3 Test of Modified Method on the Cycle Form
In this test we apply the modified algorithm on an artificial dataset of cycle form. This dataset has the same number of level and same number of vertices in each level as the dataset in the previous test against the ILP method and the BC method. In particular, this artificial data has a known optimal layout which has zero crossing number.
In this test case, we find that Stage 1 alone is able to produce an optimal layout with , . To test the effectiveness of Stage 2, we reset in Stage 1 and obtain an ordering of and . Stage 2 then refines the Best-in-N ordering to the optimal ordering within 2 iterations with .
4.4 Robust Test
With this test, we aim to show the stability of our method towards cases of different complexity level. The complexity level of a graph is measured by the number of level (denoted as ) and the number of vertex in each level (denoted as ). Consequently, we vary both and
and for each pair we generate ten different random cases. The total edge number of a random case is considered an estimation of its complexity. All test case run with, for Stage 1 and , for Stage 2. We record for each case the weighted crossing produced by both Stage 1 and Stage 2, as well as the optimal result from the ILP method for comparison. Let a result from our method be and the corresponding optimal result be , we measure their difference by a ratio
where is a very small number to avoid the denominator being 0 when . The reason we do not use the difference between and directly is because as the complexity of the graph increase, the resultant weighted crossing and therefore the difference are also increasing.
We summarize the result of the robust test in Figure 3. For starters, it shows that most of result from Stage 1 are less than twice of the ILP result. Stage 2 further improves the result to be no more than 1.5 times of the ILP result. We believe this demonstrates the consistency of our method’s performance on cases with various complexity.
In this paper, we investigate the NP-hard weighted crossing reduction problem for the Sankey diagram. Other than the common parallel form of Sankey diagram, we also study a particular circular form where the first and the last layers are connected.
Our heuristic method, aiming to find a barycentre ordering, are composed of two stages. The first stage employs the Markov Chain method and the second stage serves to improve Stage 1’s output with the Partition Refinement Method. We also adapt this method to be applicable for reducing weighted crossing in the specified circular form of the diagram.
After experiments, we can conclude that our method performs nearly as well as the ILP method and surpasses the existing heuristic methods. In terms of the measurement of weighted crossing, in the ILP experiment, our method achieved 300.89 weighted crossings, very close to the 278.68 weighted crossings from the ILP method while the BC method has a weighted crossing number of 1220.07. Also, we obtained only 87.855 weighted crossings while the stage-of-art heuristic method attained 146.77. Visually speaking, we were able to obtain high readability even from complicated seven-layer data. We also performed a robust test which verified the stability of our method against changing complexity of dataset.
-  (2016-03-01) EnergyViz: an interactive system for visualization of energy systems. The Visual Computer 32 (3), pp. 403–413. External Links: Cited by: §1, §1, §2, 0(a), §4.1, §4.
-  (2018-04) Optimal sankey diagrams via integer programming. pp. 135–139. External Links: Cited by: §1, §1, §2, §4.2, §4.2, §4.
-  (2009) World greenhouse gas emissions in 2005. World Resources Institute. Cited by: §4.2.
Markov chains and spectral clustering. In Performance Evaluation of Computer and Communication Systems. Milestones and Future Challenges: IFIP WG 6.3/7.3 International Workshop, PERFORM 2010, in Honor of Gnter Haring on the Occasion of His Emeritus Celebration, Vienna, Austria, October 14-16, 2010, Revised Selected Papers, K. A. Hummel, H. Hlavacs, and W. Gansterer (Eds.), pp. 87–98. External Links: Cited by: §2.1.
-  (2016) What are health website visitors doing: insights from visualisations towards exploratory search. In Proceedings of the 28th Australian Conference on Computer-Human Interaction, pp. 631–633. Cited by: §1.
-  (1898) Introductory note on the thermal efficiency of steam-engines. In Minutes of Proceedings of the Institution of Civil Engineers, Vol. 134. Cited by: §1.
-  (1981-02) Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics 11 (2), pp. 109–125. External Links: Cited by: §1, §1, §2.
-  (1977) Crossing theory and hierarchy mapping. IEEE Transactions on Systems, Man, and Cybernetics 7 (7), pp. 505–523. Cited by: §1, §2.