MapReduce is the most popular programming model for large data processing with parallel and distributed environment . Since the large clusters usually provide services to different users all over the Internet, it is hard to preserve the privacy of user’s data with traditional network structure .
There are some existing works focus on the privacy problem in MapReduce . These works usually bring some additional overhead to influence the performance of MapReduce. Adopting software defined network (SDN) or the OpenFlow protocol is another way to preserve privacy between users in the MapReduce network, which can distinguish network links between each users .
Privacy leakage might take place in the Layer 2 switches, including the virtual switch in cloud environment, by the sniffer and ARP poisoning-based attack . SDN based network virtualization is an efficient method to isolated network to resist the potential data leakage in the data forwarding path. While virtualized the whole network is complex and resource-intensive, it is much appropriate that isolating the network forwarding paths in the specific network level.
SDN is an approach of computer networking, which allows the network administrator to manage network services through abstraction of hardware level functionality . All hardware devices are managed by a centralized controller, and the network applications are deployed on this controller to implement different network functions. In the privacy preserving, the network is divided to distinguished virtual networks for each user .
In this paper, we propose a framework named Software Defined Privacy Preserving MapReduce Network (SDPMN), in which the entire network are divided to distinguished groups and each groups are isolated in the Layer 2 switches by SDN technology. In this framework, all packets are forwarded by SDN switches directly with different rules. We design this framework with existing SDN technology and Hadoop structure.
SDN enabled switches usually equip ternary content addressable memory (TCAM) for forwarding network packets . Since it is power hungry, expensive and takes up quite a bit of silicon space, the storage space of TCAM in each device is limited. Ordinarily, as it is hard to place all rules in the switches, many rules are stored in the controller with the performance drawback .
In SDPMN, since forwarding paths are implemented by rules in the related SDN switches directly, the space for storing these rules is not enough with existing SDN hardware. Therefore, with the design of the SDPMN framework, we state rule placement optimization problem to maximize the application number with limited rule space. After that, we propose a heuristic algorithm to solve this problem. To evaluate the efficiency of our algorithm, we take some simulation and analyze the result to compare the existing method.
The main contributions of this paper are summarized as follows.
We propose a privacy preserving scenario in the leaf switches to forward application packets by the SDN forwarding rules directly instead of the traditional ARP protocols.
We model the privacy preserving MapReduce network scenario and state the problem to find an optimized rule placement that maximum application groups for preserve user privacy are stored in the switches with minimum performance loss.
We design a heuristic algorithm and find the solution for the placement problem.
The rest of this paper is summarized as follows. We discuss the motivation of the work in Section III. In section IV, we introduce the framework design and the rule placement problem statement. We also introduce our heuristic algorithm to solve the placement problem in this section. In section V, we evaluate the demonstration of our design to verify our methodology. In the last section, we conclude our work and discuss the future work.
Ii Related Works
In this section, we introduce some existing works focusing on the isolation and security of MapReduce network. First, we discuss the works on the network virtualization. Then, we introduce some works on the privacy preserving in MapReduce networks.
Network virtualization is a technology to isolate the network to different virtual networks for different users or applications.
Bavier et al.  propose the virtual network infrastructure, which allows network researchers to evaluate their protocols and services in a realistic environment with a high degree network control. From their work, the virtualized network performs good efficiency to support experiments and other applications. This work only focuses on the IP level without any consideration on the Level 2 switches.
ShadowNet  is another network evaluation infrastructure based on the network virtualization. Similarly, it also uses the router virtualization and designs a network controller to manage the virtual networks. Meanwhile, the network virtualization can provide safe environment for users. However, they did not describe the details how to ensure testing safely.
Webb et al.  introduces a network that individual application can control the topology switching for deciding best path to route data among their nodes. In their paper, it is mentioned that a isolation problem to isolate networks by limited hardware capacity. They use an allocation to find a topology that connects the task’s nodes that is minimally shared with other tasks.
Drutskoy et al.  design an architecture named FlowN, which can give each application or task the illusion of its own address space, topology, and controller. Actually, the FlowN architecture is a network virtualization without the SDN context. Even it mentioned that they improve the network scalability, they did not describe the security and rule placement for many applications.
The network virtualization is very important to provide isolated virtual network for different application. The problem is, they focus on how to create virtual networks, which needs substantial resource like the space of TCAM or others.
Privacy is a major topic of concern whenever large amounts of sensitive data are used in MapReduce. Even though there are many privacy preserving works to protect the data privacy, a few of works focus on the security issue of MapReduce.
Airavat  is privacy preserving system that provides strong security and privacy guarantees for distributed computations. By integration of mandatory access control and differential privacy, Airavat makes data providers control their security policy to protect their privacy. Since it is an efficient work to resist the leakage of data privacy, there is no consideration about the potential leakage in the network.
Wei et al.  proposed a secure scheme named SecureMR which focuses at protecting the computation integration issue of MapReduce. SecureMR detects malicious behavior of mappers by sending same tasks to multiple mappers, and compares the result consistency. This work needs the master node and reduce be secure, it seems a little hard if some malicious users exist in the MapReduce cluster.
Huang et al.  proposes a trusted services by verifying the MapReduce result. They focus on detecting cheating services under the MapReduce environment based on watermark injection and random sampling methods. While the privacy leakage is hard to influence the computation result, we do not consider result verification an efficient method to preserve privacy.
In this section, we first describe the threat model to introduce the privacy preserving problem in the multiple user MapReduce networks. Then, with this threat model, we discuss the privacy preserving MapReduce network.
Iii-a Threat Model
For better understanding the privacy problem in MapReduce networks, we describe an example to introduce the threat model. This example shows that a MapReduce network will leak data privacy between different users. As shown in Fig. 1, there are two users in a MapReduce network: one has some privacy data and the other one is a potential malicious user. Both of them submit their applications to this MapReduce data center. In the traditional network, the MapReduce nodes, including Map nodes and Reduce nodes, are connected by many Layer 2 access switches (leaf switches) . These access switches are connected by some interconnection switches or more complex network.
First, the MapReduce system such as Hadoop assigns some nodes as working nodes for the normal user and the malicious user. After assignment, the normal user sends the private data to the working nodes. After that, these data are translated to the switches (, and ) in MapReduce procedures. However, it is possible that the working nodes of the malicious user also connect to some of these switches (, and ). Since the leaf switches are usually Layer 2 switches, which are connected by Ethernet or other cheap solutions, the working nodes for malicious user, who uses sniffer or other attacks like Man-in-the-middle attack, can get the packets transferred in these switches. Even though the main implementations of MapReduce like Hadoop adopt secure connections between working nodes, it is possible to get the privacy data by some existing attack methods .
Iii-B Privacy Preserving with SDN
|Port 1||Node 1|
|Port 2||Node 2|
|Port 3||Node 3|
|Port 4||Node 4|
|IF IP = 10.0.0.1 THEN MAC 00:00:00:01, PORT 1|
|IF IP = 10.0.0.2 THEN MAC 00:00:00:02, PORT 2|
|IF IP = 10.0.0.3 THEN MAC 00:00:00:03, PORT 3|
|IF IP = 10.0.0.4 THEN MAC 00:00:00:04, PORT 4|
SDN is a network structure that all switches are controlled by the controller. As a result, if we deploy SDN switches as all switches in the network, the path of each packet can be defined by the applications rather than the basic network protocols. In the MapReduce network, with SDN network, we can isolate the network by the forwarding rules .
To study the SDN network for privacy preserving in MapReduce network, we use an example to illustrate isolation by forwarding rules as shown in Fig. 2. There are node 1, 2, 3 and 4, connect to an SDN switch. The connected port of each node is described in the table shown in TABLE I. To preserve the privacy, we use four rules to limit the potential sniffer or other attacker in the node beside the destination. With the rules illustrated in the table shown in TABLE II, each packet is only forwarded to the port connected to the destination IP address. As an example, each packet sent to node 2 with the destination IP address 10.0.0.2 will be forwarded to port 2 after setting the correct MAC address.
Unlike the traditional network virtualization, since the potential threats are not relevant to the upper network level such as IP packet forwarding, we only use the SDN rules to isolate network in the Layer 2 switches. Considering the nodes connected to each leaf switch are fixed with the number of available ports, it is acceptable to place the rules for network isolation between applications.
Iv Framework Design and Problem statement
In this section, we first introduce the design of SDPMN framework. With this framework, we state this problem to optimize the maximum groups with limited rule space in each switch.
Iv-a Framework Design
From the main structure shown in Fig. 3, the framework consists of two parts: group management and the SDN application.
For each application, group management gets the information about all mapper, reducer and other related nodes in the network. After that, group management divides these nodes to groups. The nodes for one application are organized to one group. When the user submits an application to the Job Tracker, Job Tracker informs group management the application information. The name node also connects group management to inform the information about mappers and reducers.
With the information of applications and their mappers and reducers, group management maintains a table to store the correspondences between applications and servers. When the nodes of application are changed by the Job Track, the new node information will be transferred to group management to update the group table dynamically.
The SDN application generates rules to isolate the network in Layer 2 switches with the group table from group management and the network topology. In this SDN application, one module is used for the management of network topology and the other modules generate rules and manages these rules.
Since the network topology will be changed frequently in some virtual machine based networks, the topology management will update the relationship between switches, nodes and groups dynamically. When the topology is updated, rule management generates the newest rules. Since the switch space is limited to store all needed rules, the SDN application selects parts of rules for placement to switches. For the groups whose rules are not placed in the switches, it will bring additional overhead to forward the packets in these groups to guarantee the privacy. Since it is relevant to the controller implementation and the network design , we do not discuss this overhead in the placement. At last, the SDN application updates the subset of rules to the corresponding switches. We introduce the rule placement problem in the final part of this section.
Iv-B Problem Statement
|Node set for each MapReduce application|
|Node in set|
|Switch set for each MapReduce application|
|Switch in set|
|Capacity of each switch|
|Rule space cost for each rule|
|Number of application groups supported by switches|
Now we discuss the rule placement problem in SDPMN framework. Since existing data centers are not dedicated for MapReduce usage, it is no need to control forwarding of each packet. For each MapReduce application, the nodes for Mapper and Reducer are different. Therefore, for these different node sets, the network should be reconfigured with different rules to define the forwarding path of each packet. We propose a MapReduce driving privacy preserving network shown in Fig. 4.
In this network, we first define , , …, to denote the node set for each MapReduce application. For each node set, , we use to denote each node included in . In Fig. 4, the node set of Application 1 is and the node set of Application 2 is . With these node sets, we define , , … , to denote the set of leaf switches for each application. For example, in Fig. 4, the switch set of Application 1 is and the switch set of Application 2 is . For each switch set , we also define to denote each switch included in . We use value to denote the connections between nodes and switches as (1).
For privacy preserving, we consider the number of rules for each node in each application is the same and use to denote this rule count. Since the hardware space of each SDN switch is usually too expansive to support enough rules, as shown in (2), we define the number to denote the capacity of each switch and the capacity of each switch cannot support rule placement of all applications.
Therefore, in the rule placement, we can only place parts of rules of the switches with the best efficiency. We define set to denote the node sets whose rules are placed in switches. With this set, the rule number of each switch is less than its capacity as (3).
From the model of existing works, we consider that each group has the same security priority to simplify the placement problem . Therefore, with the capacity limitation, we state the rule placement problem in the SDPMN that, with given node sets of each application, find a set in which rules of each node are placed in switches and maximize the size of .
Iv-C Algorithm Design
To solve the placement problem, we propose a heuristic algorithm with greedy strategy shown in Algorithm 1. Since each group has the same privacy priority, the main procedure of this algorithm is to place the rules from small group to large group.
First, we sort all node sets by the size of their corresponding switch sets in line 1. After sorting, we define the empty set in line 2 as the result of the algorithm. From line 3 to line 26, we use a for loop to place rules of each node set iteratively. In this loop, we select the who has the smallest size of the corresponding switch set in unallocated node sets. For each node in , we traverse each switch in the corresponding switch set from line 7 to 16. In line 9, we check whether the switch has enough space for rule placement of . If there is not enough space for placing rules of , it means the rule placement for is failed. With this failure, the algorithm breaks the loop and places rules of the next node set. If success, the capacity of the switch is decreased by . From line 23 to line 25, if the rules of nodes in are all placed, is put to the result set .
From this algorithm, we can easily get the worst time complexity is . For most of the applications, since the number of nodes is much smaller than the scale of the nodes in the entire data center, the complexity can be simplified to .
In this section, we evaluate the performance of the greedy algorithm discussed in section IV-C by simulation. To simulate large scale network topology, we implement some python scripts with networkx library. We defined two different network in the simulation: one network is for simulating the networks in normal MapReduce data center and the other network is for simulating the cloud environment. We also use two types of node scale to simulate the light workload and heavy workload.
First, we introduce the simulation setting of these different networks and node scale. Usually, in the cloud environment, there are more nodes connected to each leaf switch than the normal data center. For the normal network, we set 20 nodes connected to each leaf switch. For the cloud environment, we set 400 nodes connected to each leaf switch. Meanwhile, the number of nodes of the normal network is set to 1000 while the number of nodes in the cloud environment is set to 20000. Both of these networks have the same number of leaf switches which is set to 50. We define two types of node scale for light workload and heavy workload. For the light workload, the number of working nodes is set from 10 to 100 in average distribution. For the heavy workload which needs more working nodes, the number is set from 100 to 200 in the average distribution. To simplify the simulation, we set the cost of each rule to 1 and the maximum space of each switch is set to 1000. This simplification is from the existing SDN switches. Usually, to implement a forwarding rule, it needs about several entries in OpenFlow protocol and the entries supported by normal switch are a few of thousands .
With the simulation settings, we evaluate the performance of our algorithm in these networks with different node scale. We first test the performance of the algorithm with the normal data center setting. Since there is no related algorithm for this problem, we use a random placement as the comparison. The performance is shown in Fig. 5. With light workload, the greedy placement shows much better than random placement with different rule space shown in Fig. 5(a). The maximum number of supported groups is near 2000 with the greedy placement when the rule space of each switch is set to 1000. However, when the workload becomes heavy and needs more working nodes, the supported groups are much less as shown in Fig. 5(b). Although the greedy placement performs better than the random placement, the maximum supported groups is less than 160, which is only 8% of the result with light workload by the same rule space.
With more nodes increased in the cloud data center network condition, the performance of rule placement is almost the same with the normal network condition shown in Fig. 6. With light workload setting, the maximum supported number of application groups by the greedy placement are near the 6 times than the number by the random placement shown in Fig. 6(a). When the workload increased, the supported groups by the greedy algorithm placement are no more than 140 with 1000 rule space in each switch shown in Fig. 6(b). From the result of the simulation, with the light workload of small scale of working nodes in each group, the greedy placement has good efficiency in both normal data center and cloud environment network. For the heavy workload, it seems better to use some task placements rather than simple rule placement optimization.
Unlike the random placement, in our greedy algorithm, the size of each group is considered. Therefore, more groups with small size can be placed firstly. For the small size groups, since the probability of overlapping between groups is small, the number of supported groups is near to the ideal placement. However, with the average size of groups increased and the frequently overlapping between groups, the number of groups supported by the network is greatly decreased. This is the reason why the groups supported in the network by the greedy placement are only two times than that of random placement.
Vi Conclusion and Future Work
In this paper, we propose a framework named SDPMN. In this framework, the network forwarding is guaranteed by SDN rules to avoid the potential threats that leak privacy data to malicious users in the same network. Since the rule space of each switch is limited, we state the problem to maximize the number of application groups whose rules are placed in the SDN switches. We propose a heuristic algorithm with the greedy strategy and evaluate this algorithm by simulation. From the simulation results, with the greedy placement algorithm, the given network can support more application groups than random placement. Since the application groups and users are varied in different periods, we will discuss the dynamic placement in the future work. Meanwhile, if the application needs more nodes from the data center network, the rule space of each switch is hard to support many application groups. Considering there are many repeating and redundant rules between application groups, we will take some optimization in the future work.
This work is supported by National 973 Basic Research Program of China under grant No. 2014CB340600.
-  J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Communication of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
-  M. Zhou, R. Zhang, W. Xie, W. Qian, and A. Zhou, “Security and privacy in cloud computing: A survey,” in Proceedings of Sixth International Conference on Semantics Knowledge and Grid (SKG ’10), Nov 2010, pp. 105–112.
-  F. D. McSherry, “Privacy integrated queries: An extensible platform for privacy-preserving data analysis,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD ’09), New York, NY, USA: ACM, 2009, pp. 19–30.
-  I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, “Airavat: Security and privacy for mapreduce,” in Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI’10), Berkeley, CA, USA: USENIX Association, 2010, pp. 20–20.
-  J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee, “Devoflow: Cost-effective flow management for high performance enterprise networks,” in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks (Hotnets-IX.), New York, NY, USA: ACM, 2010, pp. 1:1–1:6.
-  S. Y. Nam, D. Kim, and J. Kim, “Enhanced arp: preventing arp poisoning-based man-in-the-middle attacks,” IEEE Communications Letters, vol. 14, no. 2, pp. 187–189, February 2010.
-  N. McKeown, “Software-defined networking,” INFOCOM keynote talk, Apr, 2009.
-  N. Chowdhury and R. Boutaba, “Network virtualization: state of the art and research challenges,” IEEE Communications Magazine, vol. 47, no. 7, pp. 20–26, July 2009.
-  N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: Enabling innovation in campus networks,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008.
-  Z. A. Qazi, C.-C. Tu, L. Chiang, R. Miao, V. Sekar, and M. Yu, “Simple-fying middlebox policy enforcement using sdn,” in Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM ’13), New York, NY, USA: ACM, 2013, pp. 27–38.
-  A. Bavier, N. Feamster, M. Huang, L. Peterson, and J. Rexford, “In vini veritas: Realistic and controlled network experimentation,” in Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM ’06.), New York, NY, USA: ACM, 2006, pp. 3–14.
-  X. Chen, Z. M. Mao, and J. Van Der Merwe, “Shadownet: A platform for rapid and safe network evolution,” in Proceedings of the 2009 Conference on USENIX Annual Technical Conference (USENIX ’09), Berkeley, CA, USA: USENIX Association, 2009, pp. 3–3.
-  K. C. Webb, A. C. Snoeren, and K. Yocum, “Topology switching for data center networks,” in Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (Hot-ICE ’11), Berkeley, CA, USA: USENIX Association, 2011, pp. 14–14.
-  D. Drutskoy, E. Keller, and J. Rexford, “Scalable network virtualization in software-defined networks,” IEEE Internet Computing, vol. 17, no. 2, pp. 20–27, March 2013.
-  W. Wei, J. Du, T. Yu, and X. Gu, “Securemr: A service integrity assurance framework for mapreduce,” in Proceedings of Annual Computer Security Applications Conference, 2009. (ACSAC ’09), Dec 2009, pp. 73–82.
-  C. Huang, S. Zhu, and D. Wu, “Towards trusted services: Result verification schemes for mapreduce,” in Proceedings of 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’12), May 2012, pp. 41–48.
-  M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08), New York, NY, USA: ACM, 2008, pp. 63–74.
-  S. Khan and K. Hamlen, “Hatman: Intra-cloud trust management for hadoop,” in Proceedings of IEEE 5th International Conference on Cloud Computing (CLOUD ’12), June 2012, pp. 494–501.
-  S. Shin, P. A. Porras, V. Yegneswaran, M. W. Fong, G. Gu, and M. Tyson, “Fresco: Modular composable security services for software-defined networks.” in Proceedings of 20th Network & Distributed System Security Symposium (NDSS ’13), 2013.
-  M. Yu, J. Rexford, M. J. Freedman, and J. Wang, “Scalable flow-based networking with difane,” in Proceedings of the ACM SIGCOMM 2010 Conference (SIGCOMM ’10), New York, NY, USA: ACM, 2010, pp. 351–362.
-  M. Moshref, M. Yu, A. Sharma, and R. Govindan, “vcrib: Virtualized rule management in the cloud,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Ccomputing (HotCloud ’12), Berkeley, CA, USA: USENIX Association, 2012, pp. 23–23.
-  A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee, “Devoflow: Scaling flow management for high-performance networks,” in Proceedings of the ACM SIGCOMM 2011 Conference (SIGCOMM ’11), New York, NY, USA: ACM, 2011, pp. 254–265.