Two Dimensional Router: Design and Implementation

08/12/2019 ∙ by Shu Yang, et al.

Higher dimensional classification has attracted increasing attention as demand grows for more flexible services in the Internet. In this paper, we present the design and implementation of a two dimensional router (TwoD router) that makes forwarding decisions based on both destination and source addresses. This TwoD router is also a key element in our current effort towards two dimensional IP routing. With one more dimension, the forwarding table grows explosively under a straightforward implementation. As a result, it is impossible to fit the forwarding table into current TCAM, which is the de facto standard despite its limited capacity. To solve the explosion problem, we propose a forwarding table structure with a novel separation of TCAM and SRAM. In this way, we move the redundancies from expensive TCAM to cheaper SRAM, while the lookup speed remains comparable with conventional routers. We also design incremental update algorithms that minimize the number of accesses to memory. We evaluate our design with a real implementation on a commercial router, Bit-Engine 12004, with real data sets. Our design does not need new devices, which is favorable for adoption. The results also show that the performance of our TwoD router is promising.

1 Introduction

To provide reachability services to Internet users, conventional Internet routers classify packets based only on destination address. Although one dimensional routers are adequate for destination-based routing, there are increasing demands for higher dimensional routers [15], for security, traffic engineering, quality of service, etc. Among the higher dimensional routers, two dimensional routers (TwoD routers), which classify packets based on both destination and source addresses, have gained considerable attention [2][34][20], due to the important semantics of destination and source addresses [36]. For example, TwoD routers can easily express policies between host and host, or network and network.

China Education and Research Network 2 (CERNET2), the largest native IPv6 network in the world, is now deploying Two Dimensional-IP (TwoD-IP) routing [39]. More specifically, the routing decisions will be based not only on the destination address, but also on the source address. Such an extension provides room to solve problems of the past and to foster innovations in the future. The TwoD router is a key element in TwoD-IP routing.

There have been many research works on TwoD routers. Most of them focus on software-based solutions [37][35][1]; however, software-based solutions need many memory accesses and have non-deterministic lookup time. The problem gets worse with IPv6, where more bits must be matched. Hardware-based solutions, especially TCAM-based ones, are the de facto standard for core routers, due to their constant lookup time and high speed. Despite its high speed, TCAM is limited by its low capacity, large power consumption and high cost [23]. The largest TCAM chip currently available can accommodate only 1 million IPv4 prefixes [24].

TCAM resources are even more constrained in TwoD routers. Two dimensional classifiers widely adopt the traditional Cisco Access Control List (ACL) structure (we call it the ACL-like structure hereafter); e.g., CERNET2 uses this structure. In Table 1, we show a typical table with the ACL-like structure, where destination and source prefixes (4 bits each for brevity) are concatenated into one TCAM entry. For example, on receiving a packet with destination address 1011 and source address 1111, the router forwards the packet to 1.0.0.2, after matching destination prefix 101* and source prefix 11** according to the longest match first (LMF) rule. This ‘fat’ TCAM structure provides fast lookup; however, it greatly increases the TCAM resources needed in TwoD routers, because 1) it doubles the width of a TCAM entry, e.g., 288 bits (a typical TCAM width) are needed for IPv6; 2) in the worst case, the number of TCAM entries can be $O(N \times M)$, where $N$ and $M$ are the numbers of destination and source prefixes. The ACL-like structure works well with a few entries, but it becomes inefficient as the number of entries increases. If TwoD-IP routing is deployed, the number of entries will predictably increase more rapidly; e.g., CERNET2 wants to carry out policy routing between about 6,000 destination prefixes and 100 source prefixes, resulting in up to 600,000 entries in TCAM.

Destination prefix    Source prefix    Action
111*                  111*             Forward to 1.0.0.0
111*                  100*             Forward to 1.0.0.1
100*                  111*             Forward to 1.0.0.2
101*                  11**             Forward to 1.0.0.2
10**                  11**             Forward to 1.0.0.3

Table 1: Table with ACL-like structure in TCAM

In this paper, to resolve this tension, we put forward a new forwarding table structure called FIST (FIB Structure for TwoD-IP). The key idea of FIST is to store destination and source prefixes in two separate TCAM tables, and to store other information in SRAM, which is much cheaper and consumes less power than TCAM. For the rules in Table 1, we need to store destination prefixes 111*, 100*, 101* and 10** in one TCAM table, and source prefixes 111*, 100* and 11** in another TCAM table. By moving the redundancies from TCAM to SRAM, we reduce the TCAM storage space, because 1) the TCAM width is halved, e.g., 144 bits are enough for IPv6; 2) the number of TCAM entries is reduced, since each prefix appears only once. In the worst case, there are $O(N + M)$ TCAM entries for $N$ destination prefixes and $M$ source prefixes. A trivial FIST may increase the SRAM storage space, so we develop a set of techniques for better scalability. We show that the redundancies in SRAM can be largely removed, thanks to the flexibility of SRAM.
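To make the TCAM saving concrete, the following sketch compares the worst-case TCAM footprint of the two layouts using only the figures quoted in the text (6,000 destination prefixes, 100 source prefixes, 288-bit and 144-bit entries for IPv6). The function and constant names are illustrative assumptions, and the calculation ignores any per-entry overhead a real TCAM may add.

```python
# Worst-case TCAM footprint, using the CERNET2 figures quoted in the text.
N_DST = 6_000           # destination prefixes
N_SRC = 100             # source prefixes
ACL_WIDTH_BITS = 288    # concatenated (destination, source) entry for IPv6
FIST_WIDTH_BITS = 144   # single-prefix entry for IPv6


def acl_tcam_bits(n_dst: int, n_src: int) -> int:
    """ACL-like layout: one wide entry per (destination, source) pair."""
    return n_dst * n_src * ACL_WIDTH_BITS


def fist_tcam_bits(n_dst: int, n_src: int) -> int:
    """FIST layout: each prefix is stored once, in its own narrower table."""
    return (n_dst + n_src) * FIST_WIDTH_BITS


acl_bits = acl_tcam_bits(N_DST, N_SRC)
fist_bits = fist_tcam_bits(N_DST, N_SRC)
print(f"ACL-like: {acl_bits} bits, FIST: {fist_bits} bits")
```

Under these assumptions the ACL-like layout needs 600,000 wide entries while FIST needs 6,100 narrow ones; the multiplication has not disappeared, it has moved to the SRAM-resident table.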

Within FIST, each destination prefix points to a row, each source prefix points to a column, and they both together point to a two dimensional array element (cell) in SRAM, through which we can compute the action (or next hop) information. When a packet arrives, we can match its destination and source in parallel in TCAM, and find the next hop information in SRAM. The lookup process can be pipelined, and the lookup time is comparable with current Internet routers.

However, within FIST, conflicts may arise, i.e., a wrong prefix may be matched, once the binding between destination and source prefixes in TCAM is removed. For example, if a packet with destination address 1011 and source address 1111 arrives, destination prefix 101* and source prefix 111* will be matched by applying the LMF rule in each separate TCAM table; however, there is no entry with destination prefix 101* and source prefix 111*. To resolve such conflicts, we pre-compute the right actions for all conflicted cases. Such pre-computation guarantees correctness, but it becomes impractical when updates happen frequently, because each update requires re-computation for all conflicted cases and causes a large number of accesses to SRAM. To support incremental updates, we propose a new data structure called the colored tree, through which we can minimize the computation cost and the number of accesses to memory.

We implement FIST on a commercial router, Bit-Engine 12004. By redesigning the hardware logic, we avoid the need for new devices. We carry out comprehensive evaluations with the real implementation, using real topology, FIB, prefix and traffic data from CERNET2. The results show that a FIST-based TwoD router can achieve line-card speed, save TCAM and SRAM storage space, and incur an acceptable update burden.

2 Overview of TwoD Router Design

We want the performance (i.e., packet processing) of the TwoD router to be comparable with that of current Internet routers. We choose TCAM as our baseline design, as TCAM is the key factor behind the high speed of current routers.

The immediate change that TwoD-IP routing brings to the picture is the forwarding table size. More specifically, the Forwarding Information Base (FIB) will increase tremendously. At first thought, one might expect the routing table only to double. This is not true, because each destination address can be paired with many different source addresses. A straightforward implementation, i.e., the ACL-like structure, means the FIB table changes from {destination} → {action} to {(destination, source)} → {action}. This increases the FIB size by orders of magnitude, and a practical consequence is that TCAM cannot hold entries of such scale. Current TCAM storage is 1 million entries and the current number of destination prefixes is 400,000 [4]. If TwoD-IP is implemented by a straightforward approach, even with 100 source prefixes, the worst case of 400,000 × 100 = 40 million entries is already far beyond the TCAM storage.

We solve this problem by proposing a novel forwarding table structure, FIST (see Fig. 1). The key of FIST is a novel separation of TCAM and SRAM. TCAM contributes fast lookup and SRAM contributes a larger memory space. Overall, FIST consumes $O(N + M)$ TCAM storage space, where $N$ and $M$ are the numbers of destination and source prefixes.

Another difficulty is the update action. In principle, an update of a destination prefix in the TwoD router may incur an update for each source prefix associated with this destination prefix and vice versa. This indicates that the update of a single entry in TwoD router is, given a straightforward design, the same as updating a full table of the current router.

Suppose that there are 10,000 source prefixes, and 500 updates on destination prefixes per second. In the worst case, there are 5,000,000 updates on SRAM per second, which already reaches the limit of the hardware: in BitWay 12004, linecards work at 100 MHz and need 20 clock cycles for a read/write operation, i.e., at most 5,000,000 operations per second.
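The arithmetic behind this worst case can be checked with a few lines. This is only a back-of-the-envelope sketch using the numbers stated above; the variable names are illustrative, and it assumes that every destination-prefix update rewrites one cell per source prefix and that memory accesses are not overlapped.

```python
# Back-of-the-envelope check of the worst-case update load.
SOURCE_PREFIXES = 10_000          # cells touched per destination update
DST_UPDATES_PER_SEC = 500         # destination-prefix updates per second
CLOCK_HZ = 100_000_000            # linecard clock: 100 MHz
CYCLES_PER_ACCESS = 20            # clock cycles per SRAM read/write

# Each destination update may rewrite one cell per source prefix.
worst_case_writes = SOURCE_PREFIXES * DST_UPDATES_PER_SEC

# Upper bound on SRAM accesses the linecard can issue per second.
max_accesses = CLOCK_HZ // CYCLES_PER_ACCESS

utilisation = worst_case_writes / max_accesses
print(worst_case_writes, max_accesses, utilisation)
```

With these inputs the update traffic alone would occupy the entire access budget, leaving nothing for lookups, which is why the update path needs to be incremental.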

We therefore make every effort to reduce the update complexity. We formulate an optimal transformation problem in which we want to minimize the total number of reads/writes for each update. To solve this problem, we propose a colored tree to organize the entries, and we prove that it minimizes the computation complexity and the number of accesses to memory during update actions.

In the rest of the paper, Section 3 introduces the FIST structure, proves its correctness during packet forwarding, and presents the lookup process on FIST. We discuss the incremental update action in Section 5. In Section 6, we take some practical issues into consideration and improve the trivial FIST structure. Section 7 presents the implementation of FIST on a commercial router. Section 8 provides evaluation details and results. In Sections 9 and 10, we discuss the scalability of FIST and introduce related work. Finally, we present our conclusions in Section 11.

3 FIST Structure and Lookup

3.1 The TwoD matching rule

We first present the definition of the forwarding rule used in two dimensional routing. Let $d$ and $s$ denote the destination and source addresses, and let $p_d$ and $p_s$ denote the destination and source prefixes. Let $a$ denote an action, more specifically, the next hop. The storage structure should have entries of 3-tuple $(p_d, p_s, a)$.

Definition 1.

TwoD matching rule: Assume a packet with destination address $d$ and source address $s$ arrives at a router. The destination address $d$ should first match a destination prefix $p_d$ according to the LMF rule. The source address $s$ should then match a source prefix $p_s$ according to the LMF rule among all the 3-tuples $(p_d, p_s, a)$ given that $p_d$ is matched. The packet is then forwarded to next hop $a$.

Our rule is based on the following principles: 1) Avoiding conflicts: it has been shown [20] that if the source and destination addresses are matched with the same priority, the LMF rule cannot decide which rule takes precedence. Even using a first-matching-rule-in-table tie breaker may result in loops, and resolving the conflicts is NP-hard. 2) Compatibility: matching destination prefixes first emphasizes connectivity and is compatible with the existing destination-based architecture. More specifically, if no source prefix is involved, our rule naturally reduces to the traditional forwarding rule. Note that our router design applies symmetrically if the source prefix is matched first.
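The matching rule of Definition 1 can be sketched in software as a two-step longest-match search over the rules of Table 1. This is a reference model only, not how the TCAM performs the match; the helper names (`matches`, `plen`, `twod_lookup`) are illustrative, and default rules, which Table 1 does not list, are left out, so an unmatched packet simply yields `None`.

```python
from typing import Optional

# Rules from Table 1 as (destination prefix, source prefix, next hop);
# '*' marks wildcard bits in the 4-bit toy address space.
RULES = [
    ("111*", "111*", "1.0.0.0"),
    ("111*", "100*", "1.0.0.1"),
    ("100*", "111*", "1.0.0.2"),
    ("101*", "11**", "1.0.0.2"),
    ("10**", "11**", "1.0.0.3"),
]


def matches(prefix: str, addr: str) -> bool:
    """True if every non-wildcard bit of the prefix equals the address bit."""
    return all(p in ("*", a) for p, a in zip(prefix, addr))


def plen(prefix: str) -> int:
    """Prefix length, i.e. number of leading non-wildcard bits."""
    return len(prefix.rstrip("*"))


def twod_lookup(dst: str, src: str) -> Optional[str]:
    # Step 1: longest matching destination prefix.
    dst_candidates = {d for d, _, _ in RULES if matches(d, dst)}
    if not dst_candidates:
        return None
    best_dst = max(dst_candidates, key=plen)
    # Step 2: longest matching source prefix among rules with that destination.
    src_rules = [(s, a) for d, s, a in RULES if d == best_dst and matches(s, src)]
    if not src_rules:
        return None  # a default rule would apply here; none is modelled
    return max(src_rules, key=lambda rule: plen(rule[0]))[1]


print(twod_lookup("1011", "1111"))
```

The call reproduces the example from the introduction: destination 1011 selects 101*, and among the rules bound to 101* the source 1111 matches 11**, giving next hop 1.0.0.2.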

Figure 1: FIST: A forwarding table structure for TwoD-IP

3.2 FIST Design Details

3.2.1 FIST basics

The new structure FIST is made up of two tables stored in TCAM and two tables stored in SRAM (see Fig. 1). One table in TCAM stores the destination prefixes (we call it the destination table hereafter), and the other table in TCAM stores the source prefixes (we call it the source table hereafter). One table in SRAM is a two dimensional table that stores the indexed next hop of each rule in TwoD-IP (we call it the TD-table hereafter), and we call each cell in the array a TD-cell (or simply a cell if there is no ambiguity). Another table in SRAM stores the mapping between index values and next hops (we call it the mapping table hereafter).

For each rule $(p_d, p_s, a)$, $p_d$ is stored in the destination table and $p_s$ is stored in the source table. The cell $(p_d, p_s)$ in the TD-table stores an index value, and $a$ is stored at the position of the mapping table given by this index value. We store the index value rather than the next hop in the TD-table, because the next hop information is much longer.

As an example, in Fig. 1, the destination prefix of a rule is stored in the destination table and is associated with a row, and its source prefix is stored in the source table and is associated with a column. In the TD-table, the cell at that row and column holds index value 2, and the mapping table maps index value 2 to the next hop of the rule.

Theorem 1.

The TCAM storage space of FIST is bits. The SRAM storage space of FIST is bits, where is the size of the mapping table.

Proof.

Because the destination table has entries, and the source table has entries, TCAM space is bits. Mapping-table stores the mapping relations between the index and the corresponding next hop interface. They indeed have an upper on the size because each router has a bound on the number of next hop interfaces. Let denote the number of interfaces of a router and represent the size of the entries associated with the interfaces. Then the size of mapping-table is less than . The size of mapping-table can be treated as a constant compared with the SRAM storage space. Therefore, we mainly consider the TD-table size in calculating the SRAM storage space. TD-table dominates the space of SRAM, and has cells of bits. ∎

From Theorem 1, we can see that FIST moves the ‘multiplication’ to SRAM, rather than eliminating it. Such a move is worthwhile considering the following facts: 1) the capacity of TCAM is much smaller than that of SRAM; 2) TCAM is 10-100 times more expensive than SRAM; 3) TCAM consumes far more power than SRAM [19][8][31]. Besides, SRAM is more flexible than TCAM, so reducing redundancies there is easier.

3.2.2 TD-cell Saturation

For the example in Fig. 1, if a packet with destination address 1011 and source address 1111 arrives at the router, the rule with destination prefix 101* and source prefix 11** should be matched. This is because, according to the LMF rule, the destination prefix 101* should be matched first. There are two rules (including the default rule) associated with the destination prefix 101*. Consequently, source prefix 11** will be matched. With the new structure, destination prefix 101* will be matched and source prefix 111* will be matched. However, the cell (101*, 111*) in the TD-table does not have any index value. In general, consider a packet that should match the destination and source prefix pair $(p_d, p_s)$. If the source table contains a prefix $p'_s$ that is longer than $p_s$ and also matches the packet, cell $(p_d, p'_s)$ rather than $(p_d, p_s)$ will be matched.

To address the problem, we pre-compute and fill the conflicted cells, e.g., (101*, 111*), with the appropriate index value. The algorithm is as follows.

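1 begin
2        foreach destination prefix $p_d$ and source prefix $p_s$ do
3               if cell $(p_d, p_s)$ is empty then
4                      $S \leftarrow$ the set of all rules $(p_d, p'_s, a')$ with destination prefix $p_d$;
5                      $S' \leftarrow \{(p_d, p'_s, a') \in S : p'_s$ is a prefix of $p_s\}$;
6                      Find $(p_d, p''_s, a'') \in S'$ such that for every $(p_d, p'_s, a') \in S'$, $p'_s$ is a prefix of $p''_s$;
7                      Fill the cell $(p_d, p_s)$ with the index value of $a''$.
Algorithm 1 TD-Saturation()

We show the TD-table after filling up all the conflicted cells in Figure 9.

Theorem 2.

FIST (with TD-Saturation()) correctly handles the rule defined in Definition 1.

Proof.

When a packet arrives, suppose it matches $p_d$ and $p_s$ according to FIST. If a rule $(p_d, p_s, a)$ exists, then the cell $(p_d, p_s)$ stores the index value of $a$, which is the right one.

Otherwise, according to Algorithm TD-Saturation(), $S$ contains all rules given that $p_d$ is matched. In Line 5, $p'_s$ is a prefix of $p_s$, thus the packet also matches the rule $(p_d, p'_s, a')$. In Line 6, because there does not exist $(p_d, p'_s, a') \in S'$ where $p'_s$ is longer than $p''_s$, $(p_d, p''_s, a'')$ is the longest match among all the rules given that $p_d$ is matched. So the cell $(p_d, p_s)$ should be set to the index value of $a''$ according to Definition 1. ∎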
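The saturation step can be sketched as follows, using the rules of Table 1. This is a minimal illustration under stated assumptions: cells hold the next hop directly rather than an index value, default rules are not modelled (so some cells stay `None`), and the names `td_saturate`, `bits` and `is_prefix_of` are introduced here for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str, str]  # (destination prefix, source prefix, next hop)

RULES: List[Rule] = [
    ("111*", "111*", "1.0.0.0"),
    ("111*", "100*", "1.0.0.1"),
    ("100*", "111*", "1.0.0.2"),
    ("101*", "11**", "1.0.0.2"),
    ("10**", "11**", "1.0.0.3"),
]


def bits(prefix: str) -> str:
    """Significant bits of a prefix, without trailing wildcards."""
    return prefix.rstrip("*")


def is_prefix_of(short: str, long: str) -> bool:
    """True if `short` is a prefix of (or equal to) `long`."""
    return bits(long).startswith(bits(short))


def td_saturate(rules: List[Rule]) -> Dict[Tuple[str, str], Optional[str]]:
    dsts = sorted({d for d, _, _ in rules})
    srcs = sorted({s for _, s, _ in rules})
    table: Dict[Tuple[str, str], Optional[str]] = {(d, s): a for d, s, a in rules}
    for d in dsts:
        # S: all directly configured rules for this destination prefix.
        s_rules = [(s, a) for dd, s, a in rules if dd == d]
        for s in srcs:
            if (d, s) in table:
                continue  # cell set directly by a rule
            # S': rules whose source prefix is a prefix of this column's prefix.
            covering = [(s2, a) for s2, a in s_rules if is_prefix_of(s2, s)]
            # Fill with the action of the longest covering source prefix.
            table[(d, s)] = (
                max(covering, key=lambda r: len(bits(r[0])))[1] if covering else None
            )
    return table


table = td_saturate(RULES)
print(table[("101*", "111*")])
```

After saturation, the conflicted cell (101*, 111*) inherits the action of (101*, 11**), so matching the two TCAM tables independently still returns the next hop required by Definition 1.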

3.2.3 A Non-Homogeneous FIST Structure

We expect that in practice, many destination prefixes only have default next hops. It is thus wasteful to reserve a full row in the TD-table for each of them. To be more compatible with the current router structure and to further reduce the SRAM space, we divide the forwarding table into two parts. In the first part each prefix points to a row in the TD-table, and in the second part each prefix points directly to an index value. For example, in Fig. 1, destination prefix 11** does not need any specific source prefix, thus it is stored in the second part.

In our implementation, we logically divide the table into two parts by using an indicator bit to separate them. We illustrate more details in Section 7.
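One way to picture the two-part table is a destination entry that carries an indicator flag next to a single value field. The sketch below is an assumption about a possible encoding, not the layout used in the implementation of Section 7; `DstEntry`, `has_row` and `resolve` are hypothetical names, and the example values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DstEntry:
    has_row: bool  # hypothetical indicator bit: True means `value` is a row number
    value: int     # row number in the TD-table, or an index value directly


def resolve(entry: DstEntry, column: int,
            td_table: List[List[Optional[int]]]) -> Optional[int]:
    """Return the index value for a matched destination entry and column."""
    if entry.has_row:
        # First part: consult the TD-table cell.
        return td_table[entry.value][column]
    # Second part: the destination prefix has only a default next hop,
    # so the TD-table is bypassed and no row is allocated for it.
    return entry.value


# Toy data: one TD-table row, plus a prefix that only has a default next hop.
td_table = [[0, 1, None]]
with_row = DstEntry(has_row=True, value=0)
default_only = DstEntry(has_row=False, value=3)
print(resolve(with_row, 1, td_table), resolve(default_only, 1, td_table))
```

The design choice illustrated here is that prefixes without source-specific rules cost one TCAM entry and no SRAM row.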

3.3 FIST Lookup

Figure 2: Lookup action in FIST

The lookup action is shown in Fig. 2. When a packet arrives, the router first extracts the source address $s$ and the destination address $d$. Using the LMF rule, the router finds the matched source and destination prefixes in the source and destination tables that reside in TCAM. According to the matched entries, the source table outputs a column address and the destination table outputs a row address. Combining the row and column addresses, the router finds a cell in the TD-table, which returns an index value. Using the index value, the router looks up the mapping table, which returns the next hop information.
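The lookup path can be modelled in software as two independent longest-prefix matches followed by two table reads. The sketch below uses the rules of Table 1 with a hand-saturated TD-table; the row, column and index numbering is a hypothetical layout chosen for illustration and is not taken from Fig. 1, and the TCAM match is emulated by a linear scan.

```python
from typing import Dict, List, Optional

# Hypothetical layout for the rules of Table 1.
DST_TABLE: Dict[str, int] = {"111*": 0, "100*": 1, "101*": 2, "10**": 3}  # prefix -> row
SRC_TABLE: Dict[str, int] = {"111*": 0, "100*": 1, "11**": 2}             # prefix -> column
MAPPING: List[str] = ["1.0.0.0", "1.0.0.1", "1.0.0.2", "1.0.0.3"]         # index -> next hop

# Saturated TD-table of index values; None marks cells with no applicable rule.
TD_TABLE: List[List[Optional[int]]] = [
    [0, 1, None],     # 111*
    [2, None, None],  # 100*
    [2, None, 2],     # 101*: cell (101*, 111*) filled from (101*, 11**)
    [3, None, 3],     # 10**: cell (10**, 111*) filled from (10**, 11**)
]


def matches(prefix: str, addr: str) -> bool:
    return all(p in ("*", a) for p, a in zip(prefix, addr))


def lmf(table: Dict[str, int], addr: str) -> Optional[int]:
    """Longest-match-first over one prefix table (software stand-in for TCAM)."""
    best = None
    for prefix in table:
        if matches(prefix, addr) and (
            best is None or len(prefix.rstrip("*")) > len(best.rstrip("*"))
        ):
            best = prefix
    return None if best is None else table[best]


def fist_lookup(dst: str, src: str) -> Optional[str]:
    row = lmf(DST_TABLE, dst)  # the two matches are independent,
    col = lmf(SRC_TABLE, src)  # so hardware can run them in parallel
    if row is None or col is None:
        return None
    index = TD_TABLE[row][col]
    return None if index is None else MAPPING[index]


print(fist_lookup("1011", "1111"))
```

For destination 1011 and source 1111 the two matches select row 101* and column 111*, and the saturated cell yields the same next hop, 1.0.0.2, as the ACL-like table.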

Theorem 3.

The lookup time of FIST is one TCAM clock cycle plus three SRAM clock cycles.

Proof.

The source and destination tables can be accessed in parallel, so one TCAM clock cycle is enough. Getting the row and column addresses costs one SRAM clock cycle. The router then accesses the TD-table and the mapping table, each of which costs one SRAM clock cycle. ∎

As a comparison, conventional destination-based routing usually stores destination prefixes in one TCAM, and accesses both TCAM and SRAM once during a lookup. Note that the SRAM clock cycle is much shorter than the TCAM cycle [16], and the bottleneck of a router normally lies in delivering packets through the FIFO, thus two more accesses to SRAM will not have a significant impact on throughput.

To minimize the additional impact on throughput, we develop a pipelined lookup process (see the model in Fig. 3). When a packet arrives, the router first extracts the source and destination addresses, and hands them to the search engine. The router then looks up the source and destination addresses in parallel in the source and destination tables. Note that we can perform such parallel processing because we have saturated the TD-table. After the router obtains the SRAM addresses that point to the row and column values, the SRAM addresses are passed to a FIFO buffer, which absorbs the mismatch in clock rates between TCAM and SRAM. Using the SRAM addresses, the router looks up the SRAM that is used in conjunction with TCAM, to get the row and column. The router then uses the row and column values to look up the TD-table and obtain the index value. Finally, the router uses the index value to look up the mapping table and obtains the next hop information.

Figure 3: Lookup process pipeline of FIST

Pipelining itself is not new and almost all routers implement it today. Using the pipeline, the lookup speed of FIST can reach one packet per TCAM clock cycle.

Observation 1.

When both are implemented with pipelining, the lookup speed of FIST routers is the same as that of conventional routers.

4 Forwarding Table Compression

Let be the set of source prefixes, be the set of destination prefixes. Let (or ) be a mapping function that maps a destination (or source) prefix (or ) to the (or ) row (or column). Let denote the cell in the row and column. We use the 5-tuple to denote a forwarding table.

Definition 2.

is equivalent to , for any source address and destination address , matches in and in , in and in according to LMF rule, is satisfied.

For a given forwarding table, our objective is to find an equivalent forwarding table, which occupies minimum storage space, including both TCAM and SRAM.

4.1 Compression in TCAM Space

We first compress the storage space in TCAM, including destination and source tables. The size of destination and source tables can be measured by the number of destination and source prefixes in them.

Problem 1.

Optimal TCAM Compression: For , find an equivalent forwarding table such that the storage space in TCAM, i.e., is minimized.

We develop algorithm CompTCAM() to find the optimal TCAM compression. Our algorithm is based on the ORTC (Optimal Routing Table Constructor) algorithm [10], that computes the minimal one dimensional equivalent forwarding table in TCAM.

Intrinsically, our basic idea is to transform the two dimensional table into two conventional one dimensional tables, one is destination-based and the other is source-based. The action of each prefix in the destination-based (or source-based) table is the corresponding row (or column) vector in the TD-table. After transforming, the ORTC algorithm can be applied directly to compress these two tables.

Let be the row vector related with , be the column vector related with . Let be a mapping function that maps destination addresses (prefixes) to row vectors, be a mapping function that maps source addresses (prefixes) to column vectors. Let be the function by applying ORTC algorithm to the forwarding table, that has prefix set and action related with prefix is . The input of CompTCAM() is the original forwarding table and the output of CompTCAM() is the new forwarding table after compression.

Output : 
1 begin
2        ;
3        ;
4        ;
5        ;
6        ;
7        ;
8        ;
9       
10
Algorithm 2 CompTCAM()
Theorem 4.

Algorithm CompTCAM() computes the optimal compression, the complexity of CompTCAM() is .

Proof.

For the first part of the theorem, according to ORTC algorithm, for any destination address, for any source address, thus the new TwoD-IP forwarding table is equivalent to the original one. We next prove that CompTCAM() minimizes both destination and source tables by contradiction. We only give the proof for destination table minimization, proof for source table minimization is similar.

In CompTCAM(), according to ORTC algorithm in [10], if there exists another compression that produces and , and is smaller than the computed . Then there must be a destination address that matches , and . Thus there must be a source address , such that , will match a different index value in the new TD-table.

The complexity of ORTC algorithm is , where is the number of rules. In of CompCAM(), there are rules, the complexity of the basic comparison operation is . Thus, the complexity of CompTCAM() is . ∎

CompTCAM() needs a byte-by-byte comparison between rows and columns. To avoid these costly comparisons, we can use a fingerprint, i.e., a collision-resistant hash value computed over the rows/columns [41]. We use SHA-1 as the collision-resistant hash function, and the collision probability has been shown to be much smaller than the hardware error rate [30]. With fingerprints, the complexity of CompTCAM() can be reduced.
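The fingerprint idea can be sketched with Python's standard `hashlib`: each row is hashed once, and rows are then compared by their fixed-length digests instead of cell by cell. The serialization via `repr` and the function names are assumptions made for this illustration; a hardware or C implementation would hash the raw row bytes.

```python
import hashlib
from typing import Dict, List, Optional, Sequence


def fingerprint(row: Sequence[Optional[int]]) -> bytes:
    """SHA-1 digest of a row; equal rows always produce equal digests."""
    # repr() of a tuple is used as a simple, unambiguous serialization.
    return hashlib.sha1(repr(tuple(row)).encode("ascii")).digest()


def group_equal_rows(rows: List[Sequence[Optional[int]]]) -> List[List[int]]:
    """Group row numbers whose fingerprints coincide."""
    groups: Dict[bytes, List[int]] = {}
    for number, row in enumerate(rows):
        groups.setdefault(fingerprint(row), []).append(number)
    return list(groups.values())


rows = [[2, None, 2], [3, None, 3], [2, None, 2]]
print(group_equal_rows(rows))
```

Each row is read once to compute its digest, after which every comparison costs a constant 20 bytes regardless of the row width.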

Note that CompTCAM() minimizes the destination and source tables in TCAM. At the same time, it reduces the size of the TD-table in SRAM, i.e., the rows (or columns) corresponding to the eliminated destination and source prefixes are also eliminated. Next, we try to optimize the storage space in SRAM.

4.2 Compression in SRAM Space

Within FIST structure, TD-table and mapping table reside in SRAM storage. Compared to TD-table, mapping table commonly occupies fixed and small storage space. Thus, we take TD-table as the dominant factor of SRAM storage.

To be storage efficient, we try to minimize the TD-table. We formulate the problem as follows.

Problem 2.

Optimal TD-table Compression: For , find an equivalent forwarding table such that the storage space in TD-table, i.e., is minimized.

Theorem 5.

Finding the optimal compressed TD-table is NP-complete.

Proof.

It is obvious that the decision problem of validating a given TD-table is solvable in polynomial time. Therefore, the optimal TD-table compression problem is in NP class. To show this problem is NP-hard, we reduce the lossless data compression problem, which is known to be NP-complete [33], to it.

The lossless data compression problem is, given a string, find the minimal-length compressed form of the string. Here we extend the original problem to the two dimensional case (we call it two dimensional data compression problem), such that the input string can be a two dimensional string. Two dimensional data compression problem is also NP-complete, as one dimensional data compression is a special case of it. Note that two dimensional data compression problem is not equal to the optimal TD-table compression problem, as rows (columns) in TD-table can be reordered in TD-table.

Figure 4: NP-complete proof for the optimal compressed TD-table problem

Let be the storage size of the optimal compressed TD-table of TD-table . Given a two dimensional string. We construct a TD-table, as shown in Figure 4. The TD-table is composed of four sub TD-table, , , and . Each sub TD-table is independent from the others, i.e., each expressed by a separate symbol system. represents the two dimensional string. In , each column is independent from other columns, and is the only optimal compressed column, i.e., , where is a permutation matrix. In , each row is independent from other rows, and is the only optimal compressed row, i.e., , where is a permutation matrix. is an optimal compressed sub TD-table.

Then, we show that by finding an optimal compressed TD-table, we can find a lossless two dimensional compressed string. This is because if we permutate the sub TD-table , e.g., let the permutation matrix be , then one of and must also be permutated. Without loss of generality, let be permutated and the corresponding permutation matrix be . As . So permutation on will never lead to the optimal compressed TD-table. Thus if we find the optimal compressed TD-table, we can find the lossless compressed data by pick up the matrix in the top left corner of the optimal compressed TD-table.

4.2.1 Eliminating Duplicated Rows/Columns

Observation 2.

If (or ), we can merge rows (or columns) of and (or and ), by setting or (or or ).

The observation is true because FIST indirectly points to the index values through row (or column) numbers. If two rows (or columns) pointed by two destination (or source) prefixes are the same, we can eliminate one of them by making these two prefixes point to the same row (or column).

Based on this observation, we can eliminate the duplicated rows and columns within FIST structure. The complexity of this process is (because ). Within fingerprints, the complexity can be reduced to be . Actually, this process can be combined together with CompTCAM() to reduce computation time.
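The merging described in Observation 2 can be sketched as follows: identical rows are stored once and the affected prefixes are re-pointed to the surviving row, and the same procedure is applied to columns through a transpose. The function names and the toy prefixes are illustrative assumptions; a Python dictionary keyed on the row contents stands in for the fingerprint comparison.

```python
from typing import Dict, List, Tuple

Table = List[List[int]]


def dedup_rows(row_of: Dict[str, int], td_table: Table) -> Tuple[Dict[str, int], Table]:
    """Merge identical rows; `row_of` maps each prefix to its row number."""
    new_rows: Table = []
    seen: Dict[Tuple[int, ...], int] = {}
    new_row_of: Dict[str, int] = {}
    for prefix, r in row_of.items():
        key = tuple(td_table[r])
        if key not in seen:
            seen[key] = len(new_rows)
            new_rows.append(list(key))
        new_row_of[prefix] = seen[key]  # re-point the prefix to the kept row
    return new_row_of, new_rows


def transpose(table: Table) -> Table:
    return [list(col) for col in zip(*table)]


def dedup_columns(col_of: Dict[str, int], td_table: Table) -> Tuple[Dict[str, int], Table]:
    """Merge identical columns by deduplicating the rows of the transpose."""
    new_col_of, cols = dedup_rows(col_of, transpose(td_table))
    return new_col_of, transpose(cols)


# Toy table: rows 0 and 1 are equal, and columns 0 and 2 are equal.
td = [[1, 2, 1], [1, 2, 1], [3, 4, 3]]
row_of, td = dedup_rows({"a": 0, "b": 1, "c": 2}, td)
col_of, td = dedup_columns({"x": 0, "y": 1, "z": 2}, td)
print(row_of, col_of, td)
```

Lookups are unaffected because prefixes reach cells only through their row and column numbers, which is the indirection the observation relies on.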

Theorem 6.

Eliminating the duplicated rows and columns computes the optimal TD-table compression.

Proof.

We first prove the equivalence. Without loss of generality, we eliminate row first. Let be the table after eliminating duplicated rows. We have . Let be the table after eliminating duplicated columns. is a new TD-table. We have . Thus, .

Then we prove the resulted table is the minimum one by contradiction. Assume there exists that is equivalent and . Without loss of generality, suppose that . Because does not have duplicated rows, there must exist and such that and (pigeonhole principle). So there must exist , such that and . Thus the assumption is wrong. The function of this part is same as rank computation of matrix. ∎

4.2.2 Fixed Block Deduplication

After eliminating the duplicated rows/columns, there still exists duplicated data in the TD-table. For example, part of a row may be the same as part of another row. To further compress the TD-table, we apply fixed block deduplication, which is a common technique for data deduplication [26].

Fixed block deduplication was originally used to eliminate redundancies in data storage (e.g., file systems). It breaks a file into chunks of fixed length, identifies redundant chunks, eliminates all but one copy, and creates logical pointers to these chunks so that users can access them as needed [12].

Figure 5: Storage structure and deduplication process for fixed block deduplication
Figure 6: Example of fixed block deduplication for TD-table

Our basic idea is to cut the rows in the TD-table into fixed-width chunks, called narrow rows, i.e., rows that are shorter than the original rows in the TD-table. We then eliminate all duplicated narrow rows, so that only one copy of each narrow row is preserved.

As shown in Figure 5, after deduplication, we store the narrow rows in a narrow TD-table. Each entry of the narrow TD-table is an indexed next hop, the same as a cell in the TD-table. We also set up a catalog table, each entry of which points to a row number in the narrow TD-table. The catalog table maps the narrow rows in the TD-table to the narrow TD-table. For example, in Figure 6, the TD-table derived from the example in Figure 1 is deduplicated into a narrow TD-table combined with a catalog table. We can see that in Figure 6, the narrow row in the solid circle is stored as one row of the narrow TD-table, and the narrow row in the dashed circle is stored as another row of the narrow TD-table.

We also show the deduplication process in Figure 5. We scan the TD-table, and extract all narrow rows from it. For each narrow row, we first compute its fingerprint using the SHA-1 function. With a Bloom filter [6], we can judge whether the narrow row is a duplicate. If it is not, we insert the narrow row into the narrow TD-table, and fill in the entry in the catalog table. If it is, we search a data structure called the narrow row index, which organizes all detected ⟨fingerprint, narrow row number⟩ pairs. Using the search result, we fill the narrow row number into the corresponding position in the catalog table.

Figure 7: Partial lookup process within the deduplicated storage structure

With the deduplicated storage structure, the lookup process has to be updated. Each narrow row has a fixed width (number of cells). We show the relevant part of the new lookup process in Figure 7, which replaces the TD-table lookup step in Figure 2 to form the whole picture of the new lookup process. After obtaining the row address and column address, the router finds an entry in the catalog table, which returns a new row address in the narrow TD-table. Using the new row address and the original column address, the router looks up the narrow TD-table, which returns an index value. Because of the random access property of the fixed block deduplication method, the new lookup process still achieves constant lookup time. Although it adds one more lookup in SRAM, the influence on the lookup speed is small, especially within the pipelined lookup model.
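The catalog table and narrow TD-table can be sketched together with the modified lookup. This is a simplified model under stated assumptions: a Python dictionary replaces the SHA-1 fingerprint, Bloom filter and narrow row index; the chunk width `width` and the function names are illustrative; and the column offset inside a narrow row is taken as the original column modulo the width.

```python
from typing import Dict, List, Tuple

Table = List[List[int]]


def dedup_blocks(td_table: Table, width: int) -> Tuple[List[List[int]], List[Tuple[int, ...]]]:
    """Cut each row into `width`-cell narrow rows and keep one copy of each."""
    narrow: List[Tuple[int, ...]] = []          # narrow TD-table
    index: Dict[Tuple[int, ...], int] = {}      # stand-in for the narrow row index
    catalog: List[List[int]] = []               # catalog table
    for row in td_table:
        catalog_row = []
        for start in range(0, len(row), width):
            chunk = tuple(row[start:start + width])
            if chunk not in index:              # first occurrence: store it
                index[chunk] = len(narrow)
                narrow.append(chunk)
            catalog_row.append(index[chunk])    # always record where it lives
        catalog.append(catalog_row)
    return catalog, narrow


def lookup(catalog: List[List[int]], narrow: List[Tuple[int, ...]],
           row: int, col: int, width: int) -> int:
    """One catalog read plus one narrow TD-table read: constant time."""
    narrow_row = catalog[row][col // width]
    return narrow[narrow_row][col % width]


td = [[1, 2, 3, 4], [3, 4, 1, 2], [1, 2, 1, 2]]
catalog, narrow = dedup_blocks(td, width=2)
print(catalog, narrow)
```

In this toy table six narrow rows collapse into two stored ones, and every original cell is still reachable with a fixed number of memory reads.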

5 FIST Update

Although TD-Saturation() guarantees the correctness of FIST, it needs to re-compute all conflicted cells, and to rewrite them in SRAM, whenever an update happens. Note that although the update is necessary, not all cells need to be cleared and rewritten in this update process. In this section, our objective is to minimize the number of cell updates. We model the TD-table as a function that maps each cell to its index value.

Problem 3.

Optimal transformation: Given a TD-table and an update, find a new TD-table such that the number of cells whose index values differ between the old and the new table is minimized.

To achieve this, we first build a data structure called the colored tree to organize the cells. With this colored tree, we develop algorithms for insertion and deletion in which only part of the cells are updated. We then prove that our algorithms indeed minimize the computation cost and the number of cell rewrites.

5.1 A Color Tree Structure

Each destination node has a colored tree. The tree is constructed using all source prefixes in the source table. There are black nodes and white nodes. Intrinsically, each black node represents a cell that is directly set, and each white node represents a conflicted cell, i.e., a cell that is not directly set but is filled in by an algorithm, e.g., TD-Saturation().

Let be the colored tree for , let be the set of black nodes, let be the set of white nodes. For example, in Figure 9, we show , the colored tree for destination prefixes 101*. In it, and .

To compute the optimal transformation upon an update, we first define the domain of a black node in colored trees. Formally,

Definition 3.

In a colored tree , the domain of a black node is , where satisfies: 1) is a prefix of ; 2) , where is a prefix of and is a prefix of .

For example, in Figure 9, the domain of **** . Intuitively, the domain of a black node is the largest sub-tree that roots at itself and does not contain any other black nodes.
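To make the definition concrete, the following Python sketch computes the domain of a black node. It is only an illustration under stated assumptions: prefixes are written as bit strings without the trailing wildcard (so the empty string stands for the full wildcard), and the tree used in the example is hypothetical rather than the one in Figure 9.

```python
def domain(black, nodes, p):
    """Nodes under black node p that are not covered by another black node."""
    result = set()
    for q in nodes:
        if not q.startswith(p):
            continue  # condition 1: p must be a prefix of q
        # condition 2: no other black node b with p a prefix of b and b a prefix of q
        blocked = any(b != p and b.startswith(p) and q.startswith(b) for b in black)
        if not blocked:
            result.add(q)
    return result

# Hypothetical colored tree: the root and 10* are black, the rest are white.
nodes = {"", "1", "10", "100", "101", "11"}
black = {"", "10"}
print(sorted(domain(black, nodes, "")))    # ['', '1', '11']
print(sorted(domain(black, nodes, "10")))  # ['10', '100', '101']
```

The second result shows the intuition stated above: the domain of 10* is the whole sub-tree rooted at 10*, because that sub-tree contains no other black node.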

Theorem 7.

When updating rule , the cell set is minimum cell set that should be changed, and all cells in it should be set to be the index value of .

Proof.

We prove the theorem by contradiction. Assume there exists another cell set that is smaller than the above cell set, indicating that the index value of one cell , where , is not set to be the index value of . Then if a packet matches and within FIST, obviously, it should match the rule. Thus the cell is set with a wrong index value, which is a contradiction. ∎

From Theorem 7, we can see that, through computing the domain, we can compute the optimal transformation when an update arrives. In the next subsection, we show two update algorithms that compute the optimal transformation.

Figure 8: TwoD array after setting all conflicted cells
Figure 9: Colored tree for Figure 1

5.2 FIST Update Algorithms

Here, we define insertion action Insert() and deletion action Delete(). Update action Update(, , ) can be seen as a deletion followed by an insertion.

Before illustrating these algorithms, we introduce a lemma that simplifies the updating process.

Lemma 1.

If is the parent of in , and , then cells and have the same index value.

Proof.

If , then and belong to the domain of the same black node. If , then belong to the domain of . Thus according to Theorem 7, the lemma is proved. ∎

Algorithm Insert() inserts a rule given the TD-table. If (or ) is not in the destination (or source) table, the router should assign an unused row (or column). When a column is assigned, we should first find , which is the parent of in the source tree . According to Lemma 1, and have the same index value for all . Thus, we copy the column corresponding to to the column corresponding to . After this initialization, we compute the domain to find the cells that should be changed. Finally, (or ) should be inserted into the destination (or source) table if it does not exist.

Algorithm Delete() deletes the rule related to and given the TD-table. At first, a black node in is set to be white, thus the nodes in its domain now belong to a new domain. For example, in Fig. 10, after deleting rule , node is set to be white in colored tree . Nodes and , which belonged to the domain of before deletion, now belong to the domain of in . Thus cells and should be set to 1, which is the index value of . After deletion, if there no longer exists any rule related to (or ), we should delete it from the destination (or source) table and then reclaim the row (or column) resources.

Figure 10: Example of Deletion:
begin
    if  does not exist in destination table then
        Assign a row in TD-table;
        Copy index value of to cells in the row;
    if  does not exist in source table then
        Assign a column in TD-table;
        parent of in ;
        copy the column of to the column of ;
    index value of , ;
    if  () is not in source (destination) table then
        Insert () into source (destination) table;
Algorithm 3 Insert()
begin
    parent of in ;
    index value of , ;
    if  then
        Delete from destination table;
        Reclaim the row of ;
    if  then
        Delete from source table, Reclaim the column of ;
Algorithm 4 Delete()
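The core of the two update actions can be sketched for a single destination row as follows. This is a minimal Python sketch, not the authors' code: `insert_rule`, `delete_rule`, and `domain` are hypothetical names, prefixes are bit strings with the empty string as the full wildcard, the source prefix is assumed to already own a column, and the root is assumed to be black so that a deleted node always has a black ancestor. Row and column allocation and reclamation are left out.

```python
def domain(black, nodes, p):
    """Nodes under black node p that are not covered by another black node."""
    return {q for q in nodes
            if q.startswith(p)
            and not any(b != p and b.startswith(p) and q.startswith(b) for b in black)}

def insert_rule(row, black, p, value):
    """Mark p black and rewrite only the cells in its domain."""
    black.add(p)
    changed = domain(black, row, p)
    for q in changed:
        row[q] = value
    return changed

def delete_rule(row, black, p):
    """Mark p white; its old domain inherits the nearest black ancestor's value."""
    changed = domain(black, row, p)        # cells governed by p before deletion
    black.discard(p)
    parent = max((b for b in black if p.startswith(b)), key=len)
    for q in changed:
        row[q] = row[parent]
    return changed

# Hypothetical row: every cell initially holds the default index value 1.
row = {"": 1, "1": 1, "10": 1, "100": 1, "101": 1, "11": 1}
black = {""}
insert_rule(row, black, "10", 2)   # rewrites 10*, 100*, 101*
insert_rule(row, black, "101", 3)  # rewrites 101* only
delete_rule(row, black, "10")      # rewrites 10*, 100*; 101* keeps value 3
print(row)
```

In each call the returned set is exactly the domain of the updated node, so no cell outside it is touched.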
Theorem 8.

Insert() and Delete() compute the optimal transformation.

Proof.

The theorem is an immediate result of Theorem 7. ∎

Thus, the update action incurs minimum computation cost and the fewest accesses to the TD-table in SRAM. Besides, with the prevalence of dual-port SRAM, by reading through one port and writing through the other (current dual-port SRAM can resolve the read-write collision, i.e., a read during a write operation at the same cell [9]), an update of the TD-table does not have to lock the lookup process. We can also prove that the update action of FIST is consistent, i.e., for each rule insertion or deletion, a packet can only match the rule that would be matched before or after the insertion or deletion [38]. Due to the page limit, we omit the proof here.

6 Practical Considerations

6.1 Reducing Update Burden on TD-table

Although TD-table update will not influence the lookup process, a single rule insertion/deletion may still cause write operations at SRAM. An update process, in the worst case, will cause updating on all cells in a row of TD-table. For example, if we update in Figure 1 with index value 1, then all cells in the row should be updated with index value 1.

When is very large, the updates may exceed what SRAM can handle. For example, suppose there are 10,000 source prefixes and the network produces over 500 updates per second on the default next hops of different destination prefixes. In the worst case, there will be over 5 million write operations per second into the TD-table, which exceeds the maximum clock rate of SRAM.
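The worst-case figure above follows from a one-line calculation, shown here as a small Python sketch. The function name is hypothetical, and the sketch assumes that each default next hop update rewrites one full row, i.e., one cell per source prefix.

```python
def worst_case_writes_per_second(num_source_prefixes, updates_per_second):
    """Each update rewrites a whole TD-table row in the worst case."""
    return num_source_prefixes * updates_per_second

print(worst_case_writes_per_second(10_000, 500))  # 5000000
```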

The main reason for the large number of update operations on the TD-table is that the default next hop of every destination prefix is stored as the full wildcard in the source table. First, the full wildcard resides at the root node of the source tree, so once it is updated, it causes many subsequent updates. Second, the default next hop changes frequently, because it has to change whenever the connectivity information of the corresponding destination prefix changes.

Thus, we propose to isolate the default next hop from the source table, i.e., it is not stored in the source table. Rather than being matched when the full wildcard is hit in the source table, the default next hop is matched when no entry in the source table is matched. In Section 7, we will illustrate this in detail.

After removing the full wildcard from the source table, we believe the update frequency of TD-table will be low, with the following two facts: 1) the update of non-connectivity rule will be slow, i.e., it does not have to respond instantly to the changes of network topology; 2) most prefixes in the current forwarding tables are near leaf nodes in prefix trees [3], indicating that we only need to update a few cells during most rule updates.

After removing the full wildcard, the source tree may be divided into a source forest, which is defined in the same way as the source tree except for its forest structure. For example, in Fig. 11, we show the source forest after removing the full wildcard. We can also define the colored forest (denoted by ) and re-define the domain in a similar way.

However, the colored forest differs from the colored tree. In a colored tree, every node has at least one black ancestor because the root node is black. In a colored forest, which has no black root node, a node may have no black ancestor. Thus, in a colored forest, a white node may not belong to the domain of any black node. For example, in Fig 12, the shaded white nodes 100* and 101* do not belong to the domain of any black node. For a white node that does not belong to the domain of any black node in , cell is invalid, i.e., the cell does not have any index value and should not be matched. Fig. 12 shows the TD-table after removing the full wildcard from the source table.

After that, we can revise update action, including insertion and deletion, by replacing “tree” with “forest”.

Figure 11: Source forest after removing the full wildcard
Figure 12: TwoD array after removing the full wildcard in source table

7 Implementation

As a proof-of-concept, we implement the FIST forwarding table structure on a commercial router, Bit-Engine 12004, which supports four linecards. In each linecard, there are a CPU board (BitWay CPU8240 that works at 100MHz), two TCAM chips (IDT 74K62100, can accommodate 512K IPv4 entries at most), an FPGA chip (Altera EP1S25-780), and several cascaded SRAM chips (IDT 71T75602) associated with the TCAM chips. Inside the FPGA chip, there exists internal SRAM memory.

Our implementation is based on existing hardware and does not need any new hardware. We re-design the hardware by rewriting about 1500 lines of VHDL code (not including C code) of the original destination-based version.

7.1 Router Framework

In Fig. 13, we show the framework of our router design. The major changes are in the data plane. In the data plane, the FPGA receives the packets from the interface module, extracts the packet header, and queries the TCAM module. Due to resource limits, we place the destination and source tables in different blocks of one TCAM, and the FPGA queries the TCAM module twice (first in the destination table, then in the source table) to access these two tables. Although this increases the delay per lookup in our implementation, many processors (e.g., NetLogic NL10K) now support two lookups in parallel, thus this will not become the bottleneck of the lookup process in the future.

The TCAM module will output the matched prefix, and through the TCAM associated SRAM, FPGA will get the matched result, e.g., row or column address of matched prefix. Then the FPGA will compute the address of the cell in TD-table, which resides in a block of an internal SRAM of the FPGA. After getting the index, FPGA accesses the mapping table, which resides in another block of the SRAM of FPGA. Then FPGA gets the next hop information, and delivers the packets to the next processing module, switch co-process module, which will switch the packet to the right interface.

We also design the control interfaces for control plane to access and update the forwarding table. In the control plane, we store destination prefixes in a patricia trie [17], source prefixes in another patricia trie. We store the row and column addresses in the nodes of each patricia trie, and also store each rule in a two dimensional array.

Figure 13: The framework of router design
Figure 14: Implementation of FIST on the linecard of Bit-Engine 12004

7.2 A Scalable FIST Design

We implemented the FIST structure, as shown in Figure 1, on the linecard. Besides, for better scalability, we incorporated the improvements mentioned in Section 3.2.3 and 6.1, such that FIST can accommodate more destination/source prefixes and allow more frequent updates. Within the improvements, the format of the SRAM units pointed by source table remains the same, i.e., storing only the column address. However, the format of the SRAM units pointed by destination table changes: 1) it has an indicator bit, which is set only if there is a row in TD-table for the corresponding destination prefix, such that we can reduce the SRAM space of TD-table (see Section 3.2.3); 2) it stores the index value of the default next hop for the corresponding destination prefix, such that the burden on TD-table caused by updates can be reduced (see Section 6.1).

(a) Format of SRAM units pointed by source table
(b) Format of SRAM units pointed by destination table
Figure 15: The format of the SRAM units pointed by TCAM entries

Within the new structure, the lookup process also changes. After TCAM matching in the destination and source tables, the router obtains the SRAM units corresponding to the matched prefixes and checks the indicator bit. If the indicator bit is unset, the router takes the index value of the default next hop directly. Otherwise, if no source prefix is matched, the router also takes the index value of the default next hop. Otherwise, a source prefix is matched, and the router accesses the cell (assume are the matched destination and source prefixes) in the TD-table. If the cell is invalid, the router takes the index value of the default next hop; otherwise it takes the index value of cell . Using the obtained index value, the router looks up the mapping table and gets the next hop information. We show the new lookup process in Figure 16. Note that compared to the original lookup process in Figure 2, all new steps are processed in the CPU, indicating that there are no additional accesses to TCAM or SRAM.
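The decision chain of this lookup can be written down compactly. The Python sketch below is an illustration only: the dictionary fields `has_row`, `row`, and `default_index`, the `INVALID` marker, and the function name `fist_lookup` are assumptions introduced for the example and do not reflect the actual SRAM unit layout in Figure 15.

```python
INVALID = None  # marker for a TD-table cell without an index value (assumption)

def fist_lookup(dst_unit, src_col, td_table, mapping_table):
    """Resolve the next hop from the matched destination unit and source column.

    dst_unit: SRAM unit of the matched destination prefix.
    src_col:  column address of the matched source prefix, or None if no
              source prefix matched.
    """
    if not dst_unit["has_row"]:
        # Indicator bit unset: this destination has no row in the TD-table.
        index = dst_unit["default_index"]
    elif src_col is None:
        # No source prefix matched: fall back to the default next hop.
        index = dst_unit["default_index"]
    else:
        cell = td_table[dst_unit["row"]][src_col]
        index = dst_unit["default_index"] if cell is INVALID else cell
    return mapping_table[index]

mapping_table = {0: "port-0", 1: "port-1", 2: "port-2"}
td_table = [[INVALID, 2]]
unit = {"has_row": True, "row": 0, "default_index": 1}
print(fist_lookup(unit, 1, td_table, mapping_table))  # port-2
print(fist_lookup(unit, 0, td_table, mapping_table))  # port-1 (invalid cell)
```

Only the last branch reads the TD-table, which matches the observation that the extra checks add no memory accesses.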

Figure 16: Lookup process of our implementation

7.3 Fixed Block Deduplication

We use a Bloom filter to accelerate the deduplication process. Within the Bloom filter, there exists a summary vector [41], which is a vector of bits.
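For readers unfamiliar with the structure, a Bloom filter with its summary vector can be sketched as below. This is a generic textbook construction in Python, not the filter used on the linecard: the class name, the vector size, the number of hash functions, and the way positions are derived from SHA-1 are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Summary vector; one byte per bit is used here for readability.
        self.bits = bytearray(num_bits)

    def _positions(self, item: bytes):
        # Derive several positions by salting SHA-1 with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"narrow-row-fingerprint")
print(bf.might_contain(b"narrow-row-fingerprint"))  # True
```

A negative answer lets the deduplication step skip the narrow row index entirely, which is where the speed-up comes from.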

8 Evaluation

8.1 Evaluation Setup

Figure 17: Evaluation environment

In Figure 17, we show the connection diagram of the evaluation environment. There are three components: a PC host (in this paper, the CPU of the PC host is an Intel Core2 Duo T6570) that acts as the control plane of a router, a 4GE linecard that has been equipped with both the ACL-like and the FIST forwarding table structures, and a traffic generator (IXIA 1600). The linecard is connected to the traffic generator with optical fibers and to the PC host with a serial cable. The traffic generator sends packets of the minimum size of 64 bytes (including 18 bytes of Ethernet header) at full speed, i.e., 4Gbps. The linecard receives the packets, looks them up in the forwarding table, and sends them back to the traffic generator. The traffic generator can summarize the sending and receiving rates.

We control the forwarding table from the PC host through the serial cable. We update the forwarding table using Algorithms 3 and 4 through the pre-defined interfaces to access hardware on the PC host. We test updates at different frequencies, i.e., 100, 1,000, and 10,000 updates per second. The TCAM memory is organized according to the L-algorithm [32], i.e., prefixes of the same length are clustered together and there exists free space between different clusters, to guarantee fast updates in TCAM. We initially pre-allocate 1000 positions for the prefix cluster of each length.

8.2 Data Sets

To evaluate our FIST structure, we consider two scenarios, and generate forwarding table data sets and update sequence data sets within these scenarios.

Figure 18: CERNET2 topology

8.2.1 Policy Routing in CERNET2

CERNET2 has two international exchange centers connecting to the Internet, Beijing (CNGI-6IX) and Shanghai (CNGI-SHIX). However, during operation, we found that CNGI-6IX is very congested, with an average throughput of 1.18Gbps in February 2011, while CNGI-SHIX is far less loaded, with a maximal throughput of 8.3Mbps at the same time. We want to move the outgoing international traffic of three universities, i.e., THU (in Beijing, with 38 prefixes), HUST (in Wuhan, with 18 prefixes) and SCUT (in Guangzhou, with 28 prefixes) to CNGI-SHIX (Shanghai portal).

In this scenario, we collect the prefix and FIB information from CERNET2. There are 6973 prefixes in the FIB of CERNET2, and among them there are 6406 foreign prefixes. We construct three policy forwarding tables on three routers, i.e., Beijing, Wuhan and Guangzhou (we call the forwarding tables PR-BJ, PR-WH and PR-GZ, respectively).

To obtain the update sequence, we set the initial two dimensional rule set to be empty, and add all rules into the forwarding table at some time point. We generate the update sequence on the router of Wuhan in this way, to simulate a common scenario where ISPs decide to carry out a policy at some time point. We show the number of rules in each forwarding table and the number of updates in each update sequence in Table 2.

Forwarding table    # of Rules
PR-BJ               250366
PR-GZ               186306
PR-WH               365674
LB-MO               7118
LB-AF               7342
LB-NI               7410
(a) Number of rules in each forwarding table

Update sequence     # of Updates
PR                  365674
LB                  475773
(b) Number of updates in each update sequence

Table 2: Data sets overview

8.2.2 Load Balancing in CERNET2

To further balance the load between CNGI-6IX and CNGI-SHIX, we need a more dynamic load balancing mechanism in the future. We collect about one terabyte of traffic data during one month (January 2012) from three routers (i.e., Beijing, Shanghai and Wuhan) by NetFlow. In Figure 19 (the Y-axis has been anonymized), we also show the bandwidth utilization of both CNGI-6IX and CNGI-SHIX during the month. We can see that CNGI-6IX is much more congested than CNGI-SHIX.

Figure 19: Bandwidth utilization of CNGI-6IX and CNGI-SHIX in CERNET2

We first process the traffic data, such that each outgoing international micro-flow, which is identified by its source and destination addresses, is aggregated into a macro-flow, which is identified by a source and a destination prefix (here, we use the LMF rule for aggregation). Then we try to redistribute the macro-flows to different exchange centers, such that the load is optimally balanced. The problem can be reduced to the Multi-Processor Scheduling problem [5] and is NP-hard. To solve the problem, we use the greedy first-fit algorithm, which assigns each macro-flow to the exchange center with the least utilization, and achieves an approximation factor of 2.
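The greedy assignment can be illustrated with a short Python sketch. It is a simplification under stated assumptions: utilization is modeled as the sum of the volumes already assigned to a center, existing background traffic and link capacities are ignored, and the function name, flow names, and volumes are hypothetical.

```python
def greedy_assign(flows, centers):
    """Assign each macro-flow to the currently least-loaded exchange center.

    flows:   list of (flow_name, volume) pairs, processed in the given order.
    centers: list of exchange center names.
    """
    load = {center: 0.0 for center in centers}
    assignment = {}
    for name, volume in flows:
        target = min(centers, key=lambda c: load[c])  # ties go to the first center
        assignment[name] = target
        load[target] += volume
    return assignment, load

flows = [("f1", 5), ("f2", 3), ("f3", 2), ("f4", 4)]
assignment, load = greedy_assign(flows, ["CNGI-6IX", "CNGI-SHIX"])
print(assignment)
print(load)  # {'CNGI-6IX': 9.0, 'CNGI-SHIX': 5.0}
```

In this small example the optimum is a 7/7 split, so the greedy result of 9 stays within the factor-2 bound mentioned above.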

We construct three load balancing forwarding tables, each at a different time point, i.e., 6:00 in the morning, 2:00 in the afternoon and 10:00 in the evening on Jan 15, 2012, on the router of Wuhan (we call the forwarding tables LB-MO, LB-AF and LB-EV, respectively). We also show the number of rules in each forwarding table in Table 2(a), in which we can see that LB-EV is the largest one, because 10:00 at night is the peak traffic point of the day and more traffic has to be moved to CNGI-SHIX. We also generate the update sequence by computing a new load balancing scheme every hour.

8.3 Evaluation Results

8.3.1 Forwarding Table Size

We evaluate the storage space that FIST consumes for all forwarding tables, and the storage space after compression and after adopting non-homogeneous structure. As a comparison, we also set the ACL-like structure as a benchmark. In Figure 20, we show the size of each forwarding table, which can be separated into TCAM and SRAM storage, within different storage structures.

Trivial FIST and ACL-like Structure: In Figure 20(a), we can see that for all forwarding tables, FIST consumes only half of the TCAM space that the ACL-like structure consumes. For the data sets within the policy routing scenario, FIST costs much less TCAM storage, e.g., within PR-WH, FIST consumes about 1Mb, while the ACL-like structure consumes more than 72Mb of TCAM storage space. This is because in our policy routing scenario, the forwarding table is very dense, i.e., many rules share the same destination or source prefix. FIST stores each destination or source prefix only once, while the ACL-like structure may store the same destination (or source) prefix multiple times if it is associated with multiple source (or destination) prefixes.

In Figure 20(e), we can see that, within PR-BJ, PR-GZ and PR-WH, FIST consumes less SRAM space than the ACL-like structure. However, within LB-MO, LB-AF and LB-EV, FIST consumes more SRAM space than the ACL-like structure. This is because in the policy routing scenario, the forwarding table is much denser: the rules are concentrated in a few source and destination prefixes, i.e., the prefixes of THU, HUST and SCUT. However, in the load balancing scenario, rules span many destination and source prefixes, thus the FIST structure consumes much more SRAM space.

Compression: In Figures 20(b) and 20(f), we show the consumed TCAM and SRAM storage space within FIST after compression, first by Compress-DS() and then by Compress-TD(). We also compress the forwarding tables within the ACL-like structure, i.e., minimize the number of rules. Note that within the ACL-like structure, we cannot further reduce SRAM storage space after minimizing the TCAM storage space. In Figure 20(b), we can see that about 20%-30% of TCAM storage space can be saved through compression. After Compress-DS(), the TCAM storage space consumed by FIST is still much smaller than that consumed by the ACL-like structure. Compress-TD() has no effect on TCAM storage, as it only modifies the row (or column) number of destination (or source) prefixes. In Figure 20(f), we show the consumed SRAM storage space after compression. We can see that the percentage of SRAM saved by Compress-DS() and by compressing the ACL-like forwarding table is similar to the percentage of TCAM saved, i.e., 20%-30%. However, Compress-TD() has a considerable effect on the SRAM storage space within FIST, because there are high redundancies in the TD-table. For example, on PR-WH, we carry out the same policy on all source prefixes in the source table, thus their corresponding columns in the TD-table can be merged.

Non-Homogeneous Structure: In Figure 20(c), we show the consumed TCAM storage space within the non-homogeneous structure. The non-homogeneous FIST structure does not save TCAM storage space, because the non-homogeneous structure only separates the destination table into two parts. The non-homogeneous ACL-like structure does save TCAM storage space, especially for the load balancing scenario. This is because the width of a TCAM entry can be reduced after storing destination-only rules separately. However, because the width of a TCAM entry is fixed, we can only physically (instead of logically) divide the table into two parts within the ACL-like structure. In contrast, within FIST, we can flexibly divide the table into two parts logically.

In Figure 20(g), we show the consumed SRAM storage space within the non-homogeneous structure. Within the non-homogeneous FIST structure, SRAM storage space can be saved. Within PR-BJ, PR-GZ and PR-WH, about 7% of SRAM space can be saved after adopting the non-homogeneous structure, because about 7% of destination prefixes are not foreign prefixes, and their traffic does not have to be moved. However, for LB-MO, LB-AF and LB-EV, the SRAM space can be reduced to 3% of the SRAM space consumed by the homogeneous structure. This is because in the load balancing scenario, only the traffic of a small number of destination prefixes has to be diverted to another path. For example, within LB-EV, only traffic towards 59 destination prefixes has to be diverted. Within the non-homogeneous ACL-like structure, SRAM storage space is not saved. After adopting the non-homogeneous structure, FIST costs less SRAM storage than the ACL-like structure.

Combining Non-Homogeneous Structure with Compression: In Figures 20(d) and 20(h), we apply both the non-homogeneous structure and the compression techniques to all forwarding tables. The resulting tables are smaller than all previous tables. Here, we focus on SRAM storage space because the non-homogeneous structure has no effect on TCAM storage space. We can see that the improvement is small compared to compression only, because the TD-table is already very small and negligible compared to the other consumed SRAM storage space. However, the improvement is quite large compared to the non-homogeneous structure only, because high redundancies still exist after adopting the non-homogeneous structure.

(a) TCAM storage space for FIST and ACL-like structure
(b) TCAM storage space after compression
(c) TCAM storage space with non-homogeneous structure
(d) TCAM storage space after compression, with non-homogeneous structure
(e) SRAM storage space for FIST and ACL-like structure
(f) SRAM storage space after compression
(g) SRAM storage space with non-homogeneous structure
(h) SRAM storage space after compression, with non-homogeneous structure
Figure 20: Size of each forwarding table

8.3.2 Lookup Speed and Update

Lookup Speed: In Figure 21, we show the lookup speed without update. We can see that without update, both sending and receiving rates reach line speed (note that each Ethernet frame is accompanied by 8 bytes of preamble and 12 bytes of gap, thus the maximum sending rate is 4 × 64/(64 + 8 + 12) ≈ 3.05 Gbps). We also look into the data traces, and find that there is no packet loss.
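The line-rate ceiling for minimum-size frames can be checked with a short calculation. The Python sketch below assumes the figures given in the evaluation setup (a 4 Gbps link and 64-byte frames) together with the 8-byte preamble and 12-byte gap; the function name is hypothetical.

```python
def max_frame_rate_gbps(link_gbps, frame_bytes, preamble_bytes=8, gap_bytes=12):
    """Share of the link rate that is left for the frames themselves."""
    on_wire = frame_bytes + preamble_bytes + gap_bytes
    return link_gbps * frame_bytes / on_wire

rate = max_frame_rate_gbps(4, 64)
print(round(rate, 2))  # 3.05
```

With 64-byte frames, 20 of every 84 bytes on the wire are overhead, so the measured rate can never reach the nominal 4 Gbps.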

Figure 21: Lookup speed without update

Number of Accesses to TCAM During Update:

(a) Number of accesses to TCAM per 100 updates
(b) Lookup speed with updates for policy routing
(c) Lookup speed with updates for load balancing
Figure 22: Lookup speeds with updates

To evaluate the update burden, i.e., the influence of updates on lookup speed for each update sequence, we first evaluate the number of accesses to TCAM, as TCAM accesses dominate the interruption period during updates. We also compare FIST with the ACL-like structure. In Figure 22(a), we show the number of accesses to TCAM of FIST per 100 updates. We can see that PR brings only a few accesses to TCAM. This is because in our policy routing case, all destination prefixes already exist in TCAM, and carrying out the policy routing only needs to assign rows to the destination prefixes in the destination table, and initially insert source prefixes into the source table. After 243,500 updates, PR does not need any access to TCAM, because all destination (or source) prefixes already exist in the destination (or source) table. LB also introduces only a few accesses to TCAM, because many destination and source prefixes overlap across time points, thus they do not have to be updated in the destination and source tables each time. In contrast with FIST, the ACL-like structure introduces many more accesses to TCAM, e.g., up to 15,596 accesses to TCAM during 100 updates. This is because 1) there are more rules in the forwarding table within the ACL-like structure; 2) within FIST, we only have to guarantee the order of destination/source prefixes with the same length in their respective destination/source table, whereas within the ACL-like structure, we have to guarantee the order of (destination, source) prefix pairs with the same length (for both destination and source prefixes) in a common table.

In Figures 22(b) and 22(c), we show the lookup speed, i.e., the receiving rate on the traffic generator, of FIST under different update frequencies during 5 minutes (when the update frequency is 5,000 or 50,000 updates/sec, the update process terminates earlier). In Figure 22(b), we can see that within the FIST structure, at every frequency, updates have almost no influence on lookup in the policy routing scenario. This is because PR causes only a few accesses to TCAM with each update, and brings little interruption time during lookup. In Figure 22(b), we also compare the results of the FIST and ACL-like structures. We can see that within the ACL-like structure, updates have greater influence on lookup, e.g., the receiving rate is degraded by up to about 7% when there are 50,000 updates per second.

In Figure 22(c), we can see that within the FIST structure, when the update frequency is low, i.e., 500 updates per second, updates have almost no influence on lookup. However, when the update frequency is high, e.g., 50,000 updates per second, the receiving rate is degraded by about 2%. This is because in our load balancing scenario, each update causes more accesses to TCAM. Even when the update frequency is 5,000 updates per second, there still exist some time points when the lookup speed is degraded. In Figure 22(c), we can also see that within the ACL-like structure, even at the lowest frequency, i.e., 500 updates per second, the performance is still degraded by about 0.1%.

We conclude that our FIST structure will not introduce a high update burden on lookup speed. In the policy routing scenario, although there may be millions of updates when ISP operators decide to carry out some policies, the updates can be completed in a short time, e.g., less than 20 seconds for 1 million updates, without influencing lookup. Besides, in most cases, policy routing does not have to be implemented in real time. In the load balancing scenario, we perform updates every hour, and we show the number of updates needed per hour in Figure 23. The trend of updates per hour in Figure 23 is similar to the trend of traffic in CNGI-6IX in Figure 19, because we need to move more traffic to CNGI-SHIX when CNGI-6IX is more congested. We can see that the maximum number of updates per hour is about 1,300, which can be completed within one second without influencing lookup.

Figure 23: Number of updates for load balancing

Number of Accesses to SRAM During Update: In Figure 24(a), we show the number of accesses to SRAM within incremental update and TD-Saturation(). We can see that for both the policy routing and load balancing scenarios, incremental update causes far fewer accesses to SRAM. This is because during each update, TD-Saturation() has to reset all conflicted cells, while incremental update only has to reset the dependent cells that must be changed, which are a subset of all conflicted cells. For example, in the load balancing scenario, incremental update causes at most 600 accesses to SRAM per 100 updates, while TD-Saturation() causes at most 10,814 accesses to SRAM per 100 updates. In the policy routing scenario, incremental update causes only 100 accesses to SRAM per 100 updates. This is because in the forwarding table of the policy routing scenario, the source prefixes are composed of prefixes from two universities, i.e., THU and HUST. Prefixes from THU are totally disjoint, i.e., no source prefix is a prefix of another, and prefixes from HUST are disjoint except for two prefixes (240c::/28 and 240c:3::/32). Thus updating a cell in the TD-table brings almost no conflicted cells.

In Figure 24(b), we also show the computation time per 100 updates for both incremental update and TD-Saturation(). The result is similar to Figure 24(a), because more accesses to SRAM indicate more cells that have to be computed. Thus incremental update costs much less time per update than TD-Saturation().

In Figure 25, we show the number of accesses to SRAM with and without isolating the default next hop. We only consider the load balancing scenario, because policy routing is a special case where all nodes in the colored tree of any destination prefix are black, thus isolating the default next hop has no effect. In the load balancing scenario, we randomly insert 100 updates on the default next hops of destination prefixes after each hour when load balancing is carried out. In Figure 25, we can see that with isolation, an update on a default next hop brings no access to SRAM, because we only have to update the TCAM. However, without isolation, each 100 updates bring about 10,000 accesses to SRAM, because we also have to update the dependent cells in the TD-table.

Figure 24: Comparison between incremental updates and TD-Saturation(). (a) Number of accesses to SRAM per 100 updates. (b) Computation time per 100 updates.
Figure 25: Comparison between isolation and non-isolation of default next hop

9 Discussion about Scalability

We admit that the trivial FIST will bring scalability issues in SRAM if both the destination and source tables are very large. The largest SRAM chip currently on the market is 144Mb (288Mb SRAM is on the roadmap of major vendors) [13]. Other memory products such as RLDRAM can provide similar performance (16-byte reads with a random access time of 15 ns, at a density of 576 Mbit per chip) [14]. Suppose multiple chips are used (linecards of Bit-Engine 12004 support four SRAM chips) and 576Mb of storage space is available for the TD-table. If there are 10,000 destination prefixes, then the TD-table can accommodate at most 7550 source prefixes. This is obviously impractical with the current 400,000 destination prefixes.
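
The capacity figure above can be reproduced with a short calculation. This is a sketch under two assumptions that this section does not state explicitly: each TD-table cell occupies 8 bits, and "Mb" denotes 2^20 bits. They are chosen here because together they reproduce the 7550 figure. The function name max_source_prefixes is hypothetical.

```python
MBIT = 2 ** 20  # assumption: "Mb" means 2**20 bits

def max_source_prefixes(capacity_mbit, dest_prefixes, cell_bits=8):
    """Number of TD-table columns that fit, given the number of rows.

    cell_bits=8 is an assumed cell size, not a value stated in the text.
    """
    total_cells = capacity_mbit * MBIT // cell_bits
    return total_cells / dest_prefixes

# 576 Mb of SRAM and 10,000 destination prefixes, as in the text
print(round(max_source_prefixes(576, 10_000)))   # 7550

# the same budget with 400,000 destination prefixes
print(int(max_source_prefixes(576, 400_000)))    # 188
```

Under the same assumptions, a full table of 400,000 destination prefixes would leave room for fewer than 200 source prefixes, which illustrates why the trivial layout does not scale.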

However, the situation can be improved because 1) using a non-homogeneous structure can exclude most destination prefixes from the destination table; 2) in the real world, different prefixes usually share the same policy, e.g., prefixes belonging to the same university in CERNET2 should be treated equally, so they can be compressed to a coarser granularity than individual prefixes; 3) we can enforce restrictions when adding a row or column to the TD-table. Besides, we are making continuous efforts to eliminate the redundancies in the TD-table.
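
Point 2) can be pictured as mapping many source prefixes onto one shared TD-table column per policy group. The sketch below only illustrates that idea and is not the compression scheme of this paper: the function compress_columns, the owner labels, and the sample prefixes (taken from the 2001:db8::/32 documentation range) are all assumptions.

```python
def compress_columns(prefix_owner):
    """Assign one column per owner instead of one column per prefix.

    prefix_owner maps a source prefix to a policy group (for example a
    university). Returns (column index per prefix, number of columns).
    Hypothetical helper for illustration only.
    """
    column_of_owner = {}
    column_of_prefix = {}
    for prefix, owner in prefix_owner.items():
        # first-seen owners get the next free column index
        col = column_of_owner.setdefault(owner, len(column_of_owner))
        column_of_prefix[prefix] = col
    return column_of_prefix, len(column_of_owner)

# assumed sample data: four prefixes owned by two groups
sample = {
    "2001:db8:1::/48": "THU",
    "2001:db8:2::/48": "THU",
    "2001:db8:3::/48": "HUST",
    "2001:db8:4::/48": "HUST",
}
columns, width = compress_columns(sample)
print(columns, width)
```

In this toy case four per-prefix columns shrink to two per-group columns; the actual saving depends on how many prefixes really share a policy.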

10 Related Work

Packet classification has been an important topic throughout the history of the Internet. With increasing demands from users and ISPs for better and more flexible services, more research focuses on higher dimensional classification [7][29]. In layer 4, multi-dimensional classification is a familiar topic due to security and other reasons [15][18]. In layer 3, more and more routing schemes make routing decisions based on both source and destination addresses, such as NIRA [40] and customer-specific routing [11]. In this paper, our focus is on two dimensional classification in layer 3, i.e., designing a TwoD router.

Hardware-based solutions, especially TCAM-based ones, are the de facto standard for Internet routers [23]. TCAM-based solutions are limited by the capacity of TCAM [24], despite their constant lookup time. To reduce TCAM storage space, various compression schemes have been studied [19][22]. In [34], optimal two dimensional routing table compression is studied. Most work on hardware-based multi-dimensional classifiers builds on the traditional Cisco ACL structure, which is ‘fat’ in TCAM and ‘thin’ in SRAM.

In [25], a novel TCAM structure is proposed for firewalls; it moves the majority of the information from expensive TCAM to cheaper SRAM. However, it needs multiple sequential lookups in TCAM and requires extending the width of TCAM entries, while TCAM chips storing forwarding tables have limited spare bits (in CERNET2, the TCAM width is set to 144 bits, and only 16 bits are spare). It is therefore not a good fit for our TwoD router design.

TCAM-based solutions need multiple accesses to memory during an update [21]. Nowadays, the update frequency can reach tens of thousands per second [28], which seriously impedes lookup speed. To solve this problem, [38] and [27] propose to keep the classification table lock-free, i.e., lookups are not interrupted by updates. In this paper, we borrow their ideas in designing the update scheme.

11 Conclusion

In this paper, we put forward a new forwarding table structure for TwoD routers, called FIST, where forwarding decisions are based on both destination and source addresses. Our focus is to accommodate the increasing number of rules in TwoD routers, which is also a practical concern of CERNET2 in deploying TwoD-IP routing. By making a novel separation between TCAM and SRAM, FIST significantly reduces the use of scarce TCAM storage space while keeping lookups fast.

FIST stores destination and source prefixes in two separate TCAM tables. Combining the matching results from both tables, we can find the next hop information for an arriving packet. Through pre-computation, we resolve potential conflicts. By proposing a new data structure called the colored tree, we designed an incremental update algorithm that minimizes the computation complexity and the number of accesses to memory.

We implemented the TwoD router with FIST on the linecard of a commercial router. Our design is incremental and does not need any new devices. We also carried out comprehensive evaluations with the real design and data sets from CERNET2. The results show that FIST can greatly reduce TCAM storage space and does not increase SRAM storage space in our scenarios.

References

  • [1] F. Baboescu and G. Varghese (2005) Scalable packet classification. IEEE/ACM Trans. Netw. 13 (1), pp. 2–14. Cited by: §1.
  • [2] F. Baboescu, P. Warkhede, S. Suri, and G. Varghese (2006) Fast packet classification for two-dimensional conflict-free filters. Comput. Netw. 50 (11), pp. 1831–1842. Cited by: §1.
  • [3] A. Basu and G. Narlikar (2005) Fast incremental updates for pipelined forwarding engines. IEEE/ACM Trans. Netw. 13, pp. 690–703. Cited by: §6.1.
  • [4] BGP routing table analysis reports. Note: http://bgp.potaroo.net Cited by: §2.
  • [5] J. Błażewicz, M. Drabowski, and J. Weglarz (1986) Scheduling multiprocessor tasks to minimize schedule length. IEEE Trans. Comput. 35 (5), pp. 389–393. Cited by: §8.2.2.
  • [6] A. Broder and M. Mitzenmacher (2003) Network applications of bloom filters: a survey. Internet Mathematics, pp. 636–646. Cited by: §4.2.2.
  • [7] Y. Chang (2009) Efficient multidimensional packet classification with fast updates. Computers, IEEE Transactions on 58 (4), pp. 463–479. Cited by: §10.
  • [8] Y. Chiba, Y. Shinohara, and H. Shimonishi (2010-09) Source flow: handling millions of flows on flow-based nodes. In Proc. ACM SIGCOMM’10, New Delhi, India. Cited by: §3.2.1.
  • [9] Cyclone handbook. Note: www.altera.com/literature/hb/cyc/cyc_c51007.pdf Cited by: footnote 1.
  • [10] R. P. Draves, C. King, S. Venkatachary, and B. N. Zill (1999-03) Constructing optimal ip routing tables. In Proc. IEEE INFOCOM’99, New York, NY. Cited by: §4.1, §4.1.
  • [11] J. Fu and J. Rexford (2008-12) Efficient ip-address lookup with a shared forwarding table for multiple virtual routers. In Proc. ACM CoNEXT’08, Madrid, Spain. Cited by: §10.
  • [12] D. Geer (2008) Reducing the storage burden via data deduplication. Computer 41 (12), pp. 15–17. Cited by: §4.2.2.
  • [13] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli, and S. Bunse (2009) Towards 100g packet processing: challenges and technologies. Bell Lab. Tech. J. 14 (2), pp. 57–79. Cited by: §9.
  • [14] K. Fall, G. Iannaccone, S. Ratnasamy, and P. Godfrey (2006) Routing tables: is smaller really much better?. BT Technology Journal 24, pp. 119–129. Cited by: §9.
  • [15] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee (2008-12) Internet traffic classification demystified: myths, caveats, and the best practices. In Proc. ACM CoNEXT’08, Madrid, Spain. Cited by: §1, §10.
  • [16] J. Kim, M. Ko, H. Kang, and J. Kim (2009-06) A hybrid ip forwarding engine with high performance and low power. In Proc. ICCSA’09, Seoul, Korea. Cited by: §3.3.
  • [17] D. E. Knuth (1998) The art of computer programming, volume 3: (2nd ed.) sorting and searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA. External Links: ISBN 0-201-89685-0 Cited by: §7.1.
  • [18] S. Lee, H. Kim, D. Barman, S. Lee, C. Kim, T. Kwon, and Y. Choi (2011) NeTraMark: a network traffic classification benchmark. SIGCOMM Comput. Commun. Rev. 41 (1), pp. 22–30. Cited by: §10.
  • [19] A.X. Liu, C.R. Meiners, and E. Torng (2010) TCAM razor: a systematic approach towards minimizing packet classifiers in tcams. Networking, IEEE/ACM Transactions on 18 (2), pp. 490–500. Cited by: §10, §3.2.1.
  • [20] H. Lu and S. Sahni (2005) Conflict detection and resolution in two-dimensional prefix router tables. IEEE/ACM Trans. Netw. 13 (6), pp. 1353–1363. Cited by: §1, §3.1.
  • [21] L. Luo, G. Xie, Y. Xie, L. Mathy, and K. Salamatian (2012-03) A hybrid ip lookup architecture with fast updates. In Proc. IEEE Infocom’12, Orlando, FL. Cited by: §10.
  • [22] C.R. Meiners, A.X. Liu, and E. Torng (2009-10) Bit weaving: a non-prefix approach to compressing packet classifiers in tcams. In Proc. IEEE ICNP’09, Orlando, Florida. Cited by: §10.
  • [23] C.R. Meiners, A.X. Liu, and E. Torng (2010) Hardware based packet classification for high speed internet routers. Springer. Cited by: §1, §10.
  • [24] C. R. Meiners, A. X. Liu, E. Torng, and J. Patel (2011-10) Split: optimizing space, power, and throughput for tcam-based classification. In Proc. ACM/IEEE ANCS’11, Brooklyn, NY. Cited by: §1, §10.
  • [25] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu (2010-08) Fast regular expression matching using small tcams for network intrusion detection and prevention systems. In Proc. USENIX Security’10, Washington, DC. Cited by: §10.
  • [26] D. T. Meyer and W. J. Bolosky (2012) A study of practical deduplication. Trans. Storage 7 (4), pp. 14:1–14:20. Cited by: §4.2.2.
  • [27] T. Mishra, S. Sahni, and G. Seetharaman (2011-07) PC-duos: fast tcam lookup and update for packet classifiers. In Proc. IEEE ISCC’11, Kerkyra, Greece. Cited by: §10.
  • [28] T. Mishra and S. Sahni (2010-06) DUOS - simple dual tcam architecture for routing tables with incremental update. In Proc. IEEE ISCC’10, Riccione, Italy. Cited by: §10.
  • [29] Y. Qi, L. Xu, B. Yang, Y. Xue, and J. Li (2009-04) Packet classification algorithms: from theory to practice. In Proc. IEEE Infocom’09, Rio de Janeiro, Brazil. Cited by: §10.
  • [30] S. Quinlan and S. Dorward (2002-01) Venti: a new approach to archival data storage. In Proc. USENIX FAST’02, Monterey, CA. Cited by: §4.1.
  • [31] Router fib technology. Note: http://www.firstpr.com.au/ip/sram-ip-forwarding/router-fib/ Cited by: §3.2.1.
  • [32] D. Shah and P. Gupta (2001) Fast updating algorithms for tcams. IEEE Micro 21 (1), pp. 36–47. Cited by: §8.1.
  • [33] J. A. Storer and T. G. Szymanski (1982) Data compression via textual substitution. Journal of the ACM 29 (4), pp. 928–951. Cited by: §4.2.
  • [34] S. Suri, T. Sandholm, and P. Warkhede (2003) Compressing two-dimensional routing tables. Algorithmica 35, pp. 287–300. Cited by: §1, §10.
  • [35] B. Vamanan, G. Voskuilen, and T. N. Vijaykumar (2010-08) EffiCuts: optimizing packet classification for memory and throughput. In Proc. ACM SIGCOMM’10, New Delhi, India. Cited by: §1.
  • [36] G. Varghese (2005) Network algorithmics: an interdisciplinary approach to designing fast networked devices. Morgan Kaufmann, Waltham, MA. Cited by: §1.
  • [37] P. Wang (2009) Scalable packet classification with controlled cross-producting. Computer Networks 53 (6), pp. 821–834. Cited by: §1.
  • [38] Z. Wang, H. Che, M. Kumar, and S.K. Das (2004) CoPTUA: consistent policy table update algorithm for tcam without locking. Computers, IEEE Transactions on 53 (12), pp. 1602–1614. Cited by: §10, §5.2.
  • [39] M. Xu, J. Wu, S. Yang, and D. Wang (2012-03) Two dimensional ip routing architecture. Note: Internet Draft draft-xu-rtgwg-twod-ip-routing-00.txt Cited by: §1.
  • [40] X. Yang, D. Clark, and A. W. Berger (2007) NIRA: a new inter-domain routing architecture. IEEE/ACM TRANSACTIONS ON NETWORKING. Cited by: §10.
  • [41] B. Zhu, K. Li, and H. Patterson (2008-02) Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. USENIX FAST’08, San Jose, California. Cited by: §4.1, §7.3.