Recognizing the Tractability in Big Data Computing

10/03/2019 ∙ by Jianzhong Li, et al. ∙ Harbin Institute of Technology

Due to the limited computational power of existing computers, polynomial time no longer works for identifying the tractable problems in big data computing. This paper adopts sublinear time as the new tractability standard for big data computing, and the random-access Turing machine (RATM) is used as the computational model to characterize the problems that are tractable on big data. First, two pure-tractable classes are proposed. One is the class PL, consisting of the problems that can be solved in polylogarithmic time by a RATM. The other is the class ST, including all the problems that can be solved in sublinear time by a RATM. The structure of the two pure-tractable classes is investigated in depth, and it is proved that PL^i ⊊ PL^{i+1} and PL ⊊ ST. Then, two pseudo-tractable classes, PTR and PTE, are proposed. PTR consists of all the problems that can be solved by a RATM in sublinear time after a PTIME preprocessing that reduces the size of the input dataset. PTE includes all the problems that can be solved by a RATM in sublinear time after a PTIME preprocessing that extends the size of the input dataset. The relations among the two pseudo-tractable classes and other complexity classes are investigated, and it is proved that PT ⊆ P, ΠT^0_Q ⊊ PTR^0_Q and PT_P = P.

1 Introduction

Due to the limited computational power of existing computers, the challenges brought by big data suggest that tractability should be reconsidered for big data computing. Traditionally, a problem is tractable if there is an algorithm that solves it in time bounded by a polynomial (PTIME) in the size of the input. In practice, PTIME no longer works for identifying the tractable problems in big data computing.

Example 1

Sorting is a fundamental operation in computer science, and many efficient algorithms have been developed. Recently, some algorithms have been proposed for sorting big data, such as Samplesort and Terasort. However, these algorithms are not powerful enough for big data since their time complexity is still O(n log n) in nature. We ran the Samplesort and Terasort algorithms on a dataset of size 1 petabyte (1 PB for short). The computing platform is a cluster of 33 computation nodes, each of which has 2 Intel Xeon CPUs, interconnected by 1000 Mbps Ethernet. The Samplesort algorithm took more than 35 days, and the Terasort algorithm took more than 90 days.

Example 2

Even using the fastest solid state drives on the current market, whose I/O bandwidth is smaller than 8 GB per second [1], a linear scan of a dataset of size 1 PB will take at least 34.7 hours. The time of a linear scan is a lower bound for many data processing problems.
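The figure in Example 2 can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes); the function name `scan_hours` is illustrative and not taken from the paper.

```python
def scan_hours(dataset_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Hours needed to read a dataset once at a fixed sequential bandwidth."""
    seconds = dataset_bytes / bandwidth_bytes_per_s
    return seconds / 3600.0


PB = 10**15  # decimal petabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)

# 1 PB read at the 8 GB/s upper bound quoted in Example 2.
hours = scan_hours(1 * PB, 8 * GB)
print(f"{hours:.1f} hours")
```

Since the quoted bandwidth is an upper bound, the computed value is a lower bound on the scan time.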

Example 1 shows that PTIME is no longer a good yardstick for tractability in big data computing. Example 2 indicates that even linear time is still unacceptable in big data computing.

This paper suggests that sublinear time should be the new tractability standard in big data computing. Besides the problems that can be solved in sublinear time directly, many problems can also be solved in sublinear time after a one-time preprocessing. For example, searching for a specific item in a dataset can be done in O(log n) time after sorting the dataset first, an O(n log n) time preprocessing, where n is the size of the dataset.
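The search example can be sketched as follows: a one-time sort plays the role of the preprocessing, and each later query is a binary search. The function names `preprocess` and `contains` are illustrative choices, and Python's standard `bisect` module stands in for the search step.

```python
import bisect


def preprocess(dataset):
    """One-time O(n log n) preprocessing: return a sorted copy."""
    return sorted(dataset)


def contains(sorted_data, target):
    """O(log n) membership test on preprocessed (sorted) data."""
    i = bisect.bisect_left(sorted_data, target)
    return i < len(sorted_data) and sorted_data[i] == target


data = [42, 7, 19, 3, 88, 61]
index = preprocess(data)          # paid once
print(contains(index, 19), contains(index, 20))  # each query is logarithmic
```

The preprocessing cost is amortized over all later queries, which is the situation the pseudo-tractable classes are meant to capture.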

To date, some effort has been devoted to providing a new tractability standard for big data. In 2013, Fan et al. made the first attempt to formally characterize query classes that are feasible on big data. They defined the concept of Π-tractability: a query is Π-tractable if it can be processed in parallel polylogarithmic time after a PTIME preprocessing [9]. In effect, they gave a new standard of tractability in big data computing, that is, a problem is tractable if it can be solved in parallel polylogarithmic time after a PTIME preprocessing. They showed that several feasible problems on big data conform to their definition, and they also gave some interesting results. This work is impressive but still needs to be improved for the following reasons. (1) Unlike traditional complexity theory, this work only focuses on Boolean query processing. (2) The work is only concerned with the problems that can be solved in parallel polylogarithmic time after a PTIME preprocessing. In fact, many problems can be solved in parallel polylogarithmic time without PTIME preprocessing. (3) The work is based on parallel computational models or parallel computing platforms and does not consider general computational models. (4) The work takes polylogarithmic time as the only standard for tractability. Polylogarithmic time is a special case of sublinear time, and it is not sufficient for characterizing tractability in big data computing.

Similar to the Π-tractability theory [9], Yang et al. placed a logarithmic-size restriction on the preprocessing result, relaxed the query execution time to PTIME, and introduced a corresponding notion of tractability [13]. They stated that a short query is tractable if it can be evaluated in PTIME after a one-time preprocessing with logarithmic-size output. This work follows Fan et al.'s methodology and does not improve on Fan et al.'s work [9]. Besides, the logarithmic restriction on the output size of preprocessing is too strict to cover all query classes that are tractable on big data.

In addition, the computation model is fundamental in the theory of computational complexity. The deterministic Turing machine (DTM) is not suitable for characterizing sublinear time algorithms, since its sequential operating mode means that only the front part of the input can be read in sublinear time. For instance, searching in an ordered list is a classical problem that can be solved in logarithmic time. However, if a DTM is used to describe this computation, the running time becomes polynomial. To describe sublinear time computation accurately and make every part of the input accessible in sublinear time, random access is essential.
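The gap between sequential and random access can be illustrated by counting two costs for the same binary search. This is a simplified sketch, not the paper's machine model: `probes` stands for the cost when any cell can be reached in one step, and `head_moves` for the cells a single sequential head would have to cross.

```python
def binary_search_costs(sorted_data, target):
    """Return (found, probes, head_moves) for a binary search.

    probes     -- cells inspected, the cost under random access
    head_moves -- cells a single sequential head must cross to reach each
                  probed cell, starting from cell 0
    """
    lo, hi = 0, len(sorted_data) - 1
    head = 0
    probes = head_moves = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        head_moves += abs(mid - head)  # a sequential head walks cell by cell
        head = mid
        probes += 1
        if sorted_data[mid] == target:
            return True, probes, head_moves
        if sorted_data[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes, head_moves


data = list(range(1_000_000))
found, probes, head_moves = binary_search_costs(data, 999_999)
print(found, probes, head_moves)
```

The number of probes stays logarithmic in the list length, while the head movement is already at least half the list after the first probe.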

This paper aims to further recognize tractability in big data computing. A general computational model, the random-access Turing machine, is used to characterize the problems that are tractable on big data, not only query problems. Sublinear time, rather than polylogarithmic time, is adopted as the tractability standard for big data computing. Two classes of tractable problems on big data are identified. The first is called the pure-tractable class, including the problems that can be solved in sublinear time without preprocessing. The second is called the pseudo-tractable class, consisting of the problems that can be solved in sublinear time after a PTIME preprocessing. The structure of the two classes is investigated, and the relations among the two classes and other existing complexity classes are also studied. The main contributions of this paper are as follows.

(1) To describe sublinear time computation more accurately, the random-access Turing machine (RATM) is formally defined. The RATM is used throughout this paper. It is proved that the RATM is equivalent to the deterministic Turing machine in polynomial time. Based on the RATM, an efficient universal random-access Turing machine U is devised. The input and output of U are (x, c(M)) and M(x) respectively, where c(M) is the encoding of a RATM M, x is an input of M, and M(x) is the output of M on input x. Moreover, if M halts on input x within T steps, then U halts within cT log T steps, where c is a constant.

(2) Using the RATM and taking sublinear time as the tractability standard, the classes of tractable problems in big data computing are defined. First, two pure-tractable complexity classes are defined. One is the polylogarithmic time class PL, which is the set of problems that can be solved by a RATM in polylogarithmic time. The other is the sublinear time class ST, which is the set of all decision problems that can be solved by a RATM in sublinear time. Then, two pseudo-tractable classes, PTR and PTE, are defined for the first time. PTR is the set of all problems that can be solved in sublinear time after a PTIME preprocessing that reduces the size of the input dataset. PTE is the set of all problems that can be solved in sublinear time after a PTIME preprocessing that extends the size of the input dataset.

(3) The structure of the pure-tractable classes is investigated in depth. It is first proved that PL^i ⊊ PL^{i+1}, where PL^i is the class of the problems that can be solved by a RATM in O(log^i n) time and n is the size of the input. Thus, a polylogarithmic time hierarchy is obtained. It is proved that PL ⊊ ST. This result shows that there is a gap between the polylogarithmic time class and the linear time class. It is also proved that PL and ST are closed under DLOGTIME reduction [4]. The first PL-complete problem and the first ST-complete problem are also given.

(4) The relations among the pseudo-tractable classes, the query classes introduced in [9] and [13], and P are studied. It is proved that ΠT^0_Q ⊊ PTR^0_Q and PT_P = P. Finally, it is concluded that all problems in P can be made pseudo-tractable.

The remainder of this paper is organized as follows. Section 2 formally defines the computational model RATM, proves that the RATM is equivalent to the DTM and that there is an efficient URATM, and defines the notion of a problem in big data computing. Section 3 defines the pure-tractable classes and investigates their structure. Section 4 defines the pseudo-tractable classes and studies their relations to the class of [13] and to P. Finally, Section 5 concludes the paper.

2 Preliminaries

To define sublinear time complexity classes precisely, a suitable computation model should be chosen, since sublinear time algorithms may read only a minuscule fraction of their input and thus random access is very important. The random-access Turing machine is chosen as the foundation of the work in this paper. This section gives the formal definition of the random-access Turing machine. It is proved that the RATM is equivalent to the deterministic Turing machine (DTM) in polynomial time and that there is a universal random-access Turing machine. Finally, a problem in big data computing is defined.

2.1 Random-access Turing Machine

A random-access Turing machine (RATM) M is a multi-tape Turing machine with an input tape and an output tape that is additionally equipped with binary index tapes that are write-only. One of the binary index tapes is for M's read-only input tape and the others are for M's work tapes. The formal definition of a RATM is as follows.

Definition 1

A RATM M is an 8-tuple M = (Q, Σ, Γ, δ, q_0, B, q_f, q_a), where

Q: The finite set of states.

Σ: The finite set of input symbols.

Γ: The finite set of tape symbols, and Σ ⊆ Γ.

δ: The transition function.

q_0: The start state of M.

B: The blank symbol.

q_f: The accepting state.

q_a: The random access state. If M enters state q_a, M will move the heads of all non-index tapes to the cells described by the respective index tapes automatically.

Assume that the first tape of a RATM M is the input tape. If M is in some state q and reads a tuple of symbols on its non-index tapes, the matching transition tells M to replace the symbols under the heads of all non-index tapes except the read-only input tape with new symbols, write a symbol on each corresponding index tape, move the heads Left, Right, or Stay in place as prescribed, and enter the new state q'.
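To make the role of the random access state concrete, here is a deliberately reduced toy simulator. It is a sketch under strong simplifying assumptions, not the machine of Definition 1: it has a single read-only input tape and one append-only binary index tape, no work or output tapes, and the names `run_ratm` and `delta` as well as the dictionary layout of the transition table are hypothetical.

```python
BLANK = "B"


def run_ratm(delta, tape, start="q0", accept="q_f", access="q_a", max_steps=1000):
    """Run a toy RATM with one input tape; return (accepted, steps)."""
    state, head, index = start, 0, []
    for step in range(1, max_steps + 1):
        symbol = tape[head] if 0 <= head < len(tape) else BLANK
        if (state, symbol) not in delta:
            return False, step - 1            # no applicable move: halt and reject
        state, bit, move = delta[(state, symbol)]
        if bit is not None:
            index.append(bit)                 # the index tape is write-only
        head += {"L": -1, "S": 0, "R": 1}[move]
        if state == access:
            # Random access: jump to the address written on the index tape.
            head = int("".join(index), 2) if index else 0
        if state == accept:
            return True, step
    return False, max_steps


# Toy program: write the address 101 (= 5), jump there, accept iff the cell holds '1'.
delta = {}
for s in "01" + BLANK:
    delta[("q0", s)] = ("q1", "1", "S")
    delta[("q1", s)] = ("q2", "0", "S")
    delta[("q2", s)] = ("q_a", "1", "S")
delta[("q_a", "1")] = ("q_f", None, "S")

print(run_ratm(delta, "0000010000"))
print(run_ratm(delta, "0" * 5 + "1" + "0" * 10_000))
```

The step count does not depend on the length of the input, which is the behavior a sequential head cannot reproduce for addresses far from the start of the tape.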

The following lemmas state that RATM is equivalent to the deterministic Turing machine (DTM).

Lemma 1

For a Boolean function f and a time-constructible function T [2],

(1) if f is computable by a DTM within time T(n), then it is computable by a RATM within time T(n), and

(2) if f is computable by a RATM within time T(n), then it is computable by a DTM within time polynomial in T(n).

Proof

(1) is straightforward, since a RATM can simulate a DTM step by step by never using its random access ability. To prove (2), we construct a DTM M' that simulates a RATM M. M' uses one group of tapes to simulate the index tapes of M and another group to simulate the non-index tapes of M. Each symbol on a non-index tape of M is stored on the corresponding tape of M' together with its address, so the tape of M' holds a sequence of (address, symbol) pairs. Since M stops within its time bound, the number of non-blank symbols on each tape of M is at most the number of steps M takes, which in turn bounds the length of the corresponding tape of M'. M' simulates M as follows.

(1) If M does not enter the random access state q_a, M' just acts like M;

(2) If M enters q_a, then M' first moves the heads of its non-index tapes to the leftmost cell, and then scans from left to right to find the pair whose address is equal to the address on the corresponding index tape of M.

Since the maximum length of the tapes of M' is bounded in terms of the running time of M, each simulated step of M costs M' at most one scan of its tapes, and the running time of M' is at most polynomial in the running time of M. ∎

Corollary 1

If a Boolean function is computable by a RATM within time , then it is computable by a DTM within time .

Proof

simulates in the same way as in the proof of lemma 1. Since the maximum length of ’s tapes is when the runtime of is and the running time of is , the running time of is at most . ∎

2.2 The Universal Random-access Turing Machine

Just like a DTM, a RATM can be encoded by a binary string. The code of a RATM M is denoted by c(M). The encoding method of a RATM is the same as that of a DTM [8]. The encoding of RATMs makes it possible to devise a universal random-access Turing machine (URATM) that takes (x, c(M)) as input and outputs M(x), where x is the input of a RATM M, c(M) is the code of M, and M(x) is the output of M on x. Before the formal introduction of the URATM, we first present two lemmas.

Lemma 2

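For every function f, if f is computable within some time bound by a RATM M using an arbitrary finite alphabet, then f is computable by a RATM M' using a fixed binary alphabet with only a constant-factor slowdown, where the constant depends on the size of the alphabet of M.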

Proof

Let M be a RATM that computes f within the given time bound. We define a RATM M' as follows to compute f with a constant-factor slowdown.

The non-index tapes and index tapes of M and of M' are numbered consecutively. Every symbol of the alphabet of M is encoded by a binary code of fixed length, where the length is the least number of bits sufficient to distinguish all symbols of that alphabet.

Thus, each non-index tape of M' simulates the corresponding non-index tape of M using the binary codes above; that is, every cell on a non-index tape of M is represented by a block of consecutive cells, one per code bit, on the corresponding non-index tape of M'.
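The fixed-width encoding can be sketched as follows. The helper names `make_codec`, `encode_tape` and `read_cell` and the example alphabet are hypothetical; the sketch only illustrates how one cell of the original tape maps to a block of bits, not the full simulation.

```python
def make_codec(alphabet):
    """Fixed-width binary codes for a finite tape alphabet."""
    width = max(1, (len(alphabet) - 1).bit_length())  # least number of bits
    encode = {s: format(i, f"0{width}b") for i, s in enumerate(alphabet)}
    decode = {code: s for s, code in encode.items()}
    return width, encode, decode


def encode_tape(tape, encode):
    """Replace every symbol by its binary code."""
    return "".join(encode[s] for s in tape)


def read_cell(binary_tape, cell, width, decode):
    """Read simulated cell `cell` by reading `width` consecutive bits."""
    start = cell * width
    return decode[binary_tape[start:start + width]]


alphabet = ["B", "a", "b", "c", "#"]
width, enc, dec = make_codec(alphabet)
binary_tape = encode_tape("ab#c", enc)
print(width, binary_tape, read_cell(binary_tape, 2, width, dec))
```

Because the code width depends only on the alphabet size, reading or writing one simulated cell costs a constant number of steps, which matches the constant-factor slowdown discussed in the proof.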

utilizes a non-index tape and an index tape to simulate an index tape of . More precisely, ’s th non-index tape is used to make a calculation on the contents of ’s th index tape. And ’s th index tape works like the th index tape of . The concrete operations will be introduced later. If the contents of ’s th index tape of is , the contents of ’s th non-index tape and th index tape are . And let , , and .

The initial configuration of is as follows.

(1) Input , which is encoded, is stored in input tape, and the state of is .

(2) is stored in ’s last non-index tapes and first index tapes.

To simulate one step of from state , starts from state and acts as follows.

(1) uses steps to read from its first non-index tapes. After that, the state of becomes , where denotes the code of .

Assume that there is a transition function of is .

(2) uses steps to write on its first non-index tapes except the input tape since the input tape is read-only.

(3) uses steps to write on its index tapes. More precisely, if or , moves the head of the th non-index tape locations to the right and checks whether there is . If so, it writes , and moves locations to the left and writes . If not, it moves locations to the left and writes . If , moves the head of the th non-index tape locations to the right and checks whether there is . If so, it moves one location to the left and writes , then it moves locations to the left and writes .

(4) uses steps to move the heads of non-index tapes according to and uses one step to move the heads of index tapes according to .

(5) Then enters .

Note that M' uses a constant number of steps to simulate one step of M. Thus, the total number of steps of M' is at most a constant multiple of the number of steps of M, where the constant depends on the size of the alphabet. ∎

Lemma 3

For every function f, if f is computable in time T(n) by a RATM M using k tapes, then it is computable in time O(T(n) log T(n)) by a 5-tape RATM M'.

Proof

If M is a k-tape RATM that computes f in T(n) time, we describe a 5-tape RATM M' computing f in time O(T(n) log T(n)).

M' uses its input tape and output tape in the same way as M does. The first work tape of M', named the main work tape, is used to simulate the contents of M's work tapes. The second work tape of M', named the main index tape, is used to simulate the contents of the index tapes of M's work tapes. The last work tape of M', named the usual movement tape, is used to store the positions of the heads of M's work tapes. Note that each of the three work tapes has multiple tracks, each of which simulates one work tape of M.

The symbol in one cell of ’s main work tape is in , each corresponding to a symbol on a work tape of . The symbol in one cell of ’s main index tape and usual movement tape is in , each corresponding to a symbol on an index tape of . Each track of the main index tape has a symbol in to indicate the head position of corresponding index tape. Each track of the usual movement tape records the head’s location of the corresponding work tape. Hence, we have , , , , and .

The initial configuration of is as follows.

(1) Input is stored in the input tape, and the state of is

(2) Main work tape, main index tape and output tape are empty.

(3) is stored in the usual movement tape since the heads of ’s work tapes are at the first blank symbol whose index is .

For a computational step of in state , starts from the state to simulate as follows.

(1) Read and from input and output tapes respectively.

(2) uses times reads to simulate the parallel reads of to gather the symbols on work tapes of :

(2.1) Read the head positions on the th track from the usual movement tape and stores them into the index tape of the main work tape;

(2.2) Enter the random access state to read the corresponding symbol of the th track from the main work tape for ;

After that, is in state .

If there is a transition function of , , then continues its work as follows.

(3) For the input tape, moves the head according to , writes on the index tape and moves the heads of index tape according to . For the output tape, writes and moves the head according to , writes on the index tape and moves the heads of index tape according to .

(4) uses times writes to write to the main work tape and modify usual movement tape:

(4.1) Read th track from the usual movement tape into the index tape of the main work tape,

(4.2) Enter the random access state and then write the symbol to the cell, and

(4.3) If is , reduce from the number on the th track of the usual movement tape. If is , add to the number on the th track of the usual movement tape. If is , do nothing.

(5) uses times writes to write to the main index tape:

(5.1) Scan main index tape from left to right to find the head position on the th track, then writes the on it.

(5.2) If is , replace the symbol before with . If is , replace the symbol behind with . If is , replace the symbol at with .

(6) Change state to . Then, if is , copy the main index tape to the usual movement tape.

Now we analyze the number of steps that M' uses to simulate one step of M. Since the running time of M is T, we can assume without loss of generality that the maximum length of the work tapes used in the computation is T. Thus, the maximum length of the indexes written on the index tapes by M is log T. M' uses the main work tape to simulate the work tapes of M, and uses the main index tape and the usual movement tape to simulate the index tapes of M and the head positions of M's work tapes. Thus, the length of the main work tape is at most T, and the lengths of the main index tape and the usual movement tape are at most log T. Recalling the operations described above, reading a track on the usual movement tape and writing it to the index tape takes at most O(log T) steps. A random access takes only one step. The modification of the usual movement tape takes at most O(log T) steps, and the copy from the main index tape to the usual movement tape takes at most O(log T) steps. It follows that M' uses O(k log T) steps to simulate one step of M. Since M takes T steps in total, the total number of steps taken by M' is at most cT log T, where c is a constant depending on the number of tapes. ∎

Theorem 2.1

There exists a universal random-access Turing machine U whose input is (x, c(M)) and whose output is M(x), where x is an input of M, c(M) is the code of M, and M(x) is the output of M on x. Moreover, if M halts on input x in T steps, then U halts on input (x, c(M)) in cT log T steps, where c is a constant depending on M.

Proof

From Lemma 2 and Lemma 3, we only need to design a URATM U that simulates any 10-tape RATM. Let U be a RATM whose input is (x, c(M)), where x is an input of a 10-tape RATM M over a fixed alphabet and c(M) is the code of M.

U uses its input tape, output tape and the first eight work tapes in the same way M does. In addition, the transition functions of M are stored in the first extra work tape of U. The current state of M and the symbols read by M are stored in the second extra work tape of U.

U simulates one computational step of M on input x as follows.

(1) U stores the state of M and the symbols read from the input tape, the output tape and the first eight work tapes in the second extra work tape.

(2) U scans the table of M's transition function stored in the first extra work tape to find the matching transition.

(3) U replaces the state stored in the second extra work tape with the new state, writes symbols and moves heads. If the new state is the random access state, then U enters its own random access state.

Now we analyze the number of steps that U uses to simulate one step of a 10-tape RATM M. In step (1), U takes one step to read symbols from its input tape, output tape and the first eight work tapes. In step (2), U makes a linear scan to find the matching transition, which takes a number of steps proportional to the size of the transition function of M. In step (3), U uses two steps to write symbols to its input, output and first eight work tapes and to move their heads. Since M halts on input x within T steps, U halts within a constant multiple of T steps, where the constant depends on M.

Since any k-tape RATM that stops in T steps can be simulated by a 10-tape RATM over a fixed alphabet in O(T log T) steps, any k-tape RATM M can be simulated by U in cT log T steps, where c depends on M. ∎

Theorem 2.1 is an encouraging result which can help us to investigate the structure of sublinear time complexity classes.

2.3 Problems in Big Data Computing

To reflect the characteristics in big data computing, a problem in big data computing is defined as follows.

INPUT: big data D, and a function F on D.

OUTPUT: F(D).

Unlike the traditional definition of a problem, the input of a big data computing problem must be big data, where big data usually has a size greater than or equal to 1 PB. The problem defined above is called a big data computing problem. The big data set in the input of a big data computing problem may consist of multiple big data sets. The problems discussed in the rest of the paper are big data computing problems, and we will simply call them problems.

3 Pure-tractable Classes

In this section, we first give the formal definitions of the pure-tractable classes PL and ST, and then investigate the structure of the pure-tractable classes.

3.1 Polylogarithmic-tractable Class

As mentioned in [12], the class DLOGTIME consists of all problems that can be solved by a RATM in O(log n) time, where n is the length of the input. DLOGTIME was underestimated before [3] [4]. However, it is very important in big data computing [9], and there are indeed many interesting problems in this class [3]. In this section, we propose the complexity class PL by extending DLOGTIME to characterize problems that are tractable on big data and, inspired by existing complexity hierarchies, we use PL^i to reinterpret PL as a hierarchy.

Definition 2

The class PL consists of decision problems that can be solved by a RATM in polylogarithmic time.

Definition 3

For each i ≥ 1, the class PL^i consists of decision problems that can be solved by a RATM in O(log^i n) time, where n is the length of the input.

According to the definition, PL^1 is equivalent to DLOGTIME. It is clear that PL^1 ⊆ PL^2 ⊆ ... ⊆ PL^i ⊆ ... ⊆ PL, which forms the PL hierarchy. The following Theorem 3.1 shows that PL^i ⊊ PL^{i+1} for every i ≥ 1.

Lemma 4

[7] There is a logarithmic time RATM that takes x as input and generates the output n encoded in binary such that n = |x|.

Theorem 3.1

For any i ≥ 1, PL^i ⊊ PL^{i+1}.

Proof

We prove this theorem by constructing a RATM M* whose language L(M*) is in PL^{i+1} but not in PL^i.

According to Lemma 4, there exists a RATM M_1 that takes x as input and outputs the binary form of n = |x| in O(log n) time.

Since log^{i+1} n is a polynomial time constructible function for any i, there exists a DTM M_2 that takes the binary form of n, whose length is O(log n), as input and outputs the binary form of log^{i+1} n in time polynomial in log n.

By combining M_1 and M_2, we can construct a RATM M_3 that works as follows. Given an input x, M_3 first simulates M_1 on x and outputs the binary form of n. Then, M_3 simulates M_2 on the binary form of n. The total running time of M_3 is the sum of the running times of M_1 and M_2.

Now we are ready to construct M*. On any input x, M* works as follows:

(1) M* simulates the computation of the URATM U on input (x, x) and the computation of M_3 on input x simultaneously.

(2) As soon as either U or M_3 halts, M* halts, and the state entered by M* is determined as follows:

(a) If U halts first and enters the accept state, then M* halts and enters the reject state.

(b) If U halts first and enters the reject state, then M* halts and enters the accept state.

(c) If M_3 halts first and enters some state q, then M* halts and enters q.

The running time of M* is at most O(log^{i+1} n), so L(M*) ∈ PL^{i+1}.

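Assume that there is an O(log^i n) time RATM N such that L(N) = L(M*). Since the time U needs to simulate N grows more slowly than log^{i+1} n, there must be a number n_0 such that, on every input of length n ≥ n_0, U finishes simulating N before the time budget of M* runs out. Let x be a string representing the machine N whose length is at least n_0. Such a string exists since an arbitrarily long string can be appended to the encoding of a RATM. On this input, U halts before M_3, so by (a) and (b) M* answers the opposite of N on x, which contradicts L(N) = L(M*). ∎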
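The contradiction relies on the simulation overhead growing more slowly than one extra logarithmic factor. The following numeric sketch assumes that the overhead has the form c·T·log T with T = log^i n; both the form of the bound and the constants i = 3 and c = 10 are assumptions made only for illustration. Under that assumption the ratio of the overhead to log^{i+1} n equals c·i·log(log n)/log n.

```python
from math import log2


def overhead_ratio(log_n: float, i: int, c: float) -> float:
    """(c * T * log T) / log^{i+1} n for T = log^i n, given log_n = log2(n)."""
    log_T = i * log2(log_n)      # log T = i * log log n
    return c * log_T / log_n     # T / log^{i+1} n simplifies to 1 / log n


# The argument is log2(n) itself, so n ranges from 2**16 up to 2**(2**32).
for log_n in (2**4, 2**8, 2**16, 2**32):
    print(log_n, overhead_ratio(log_n, i=3, c=10.0))
```

The ratio drops below 1 once log n is large enough, which corresponds to the threshold input length used in the proof.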

3.2 Sublinear-tractable Class

To capture all problems that can be solved in sublinear time, the complexity class ST is proposed in this subsection, and the relation between PL and ST is investigated. We first give the formal definition of ST.

Definition 4

The class ST consists of the decision problems that can be solved by a RATM in o(n) time, where n is the size of the input.

There are indeed many problems that can be solved in o(n) time, such as searching in a sorted list, point location in two-dimensional Delaunay triangulations, and checking whether two convex polygons intersect, as mentioned in [5] [6] [11].

To understand the structure of the pure-tractable classes, we study the relation between PL and ST. Theorem 3.2 shows that ST properly contains PL. This result indicates that there is a gap between the polylogarithmic time class and the linear time class.

Theorem 3.2

PL ⊊ ST.

Proof

First, we define RATIME() to be the class of problems that can be solved by a RATM in time . It is obviously that RATIME(). Hence, RATIME() = RATIME(). Similar to the proof of Theorem 2, we show that there is a RATM such that RATIME() .

We first construct a RATM M_1 that halts in a sublinear number of steps on an input x of size n. M_1 first computes the binary form of n according to Lemma 4. Then M_1 enumerates binary numbers starting from 1. In the i-th enumeration step, the binary number i is first enumerated. Then M_1 computes i^2 and compares it with n. M_1 halts if and only if i^2 ≥ n. The maximum number enumerated by M_1 is therefore about √n, and each enumeration, multiplication and comparison works on numbers of O(log n) bits and takes time polylogarithmic in n. So the running time of M_1 is √n times a polylogarithmic factor, which is sublinear.
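The enumeration performed by the clock machine can be sketched directly. The function name `sqrt_clock` is illustrative, and the sketch counts only enumeration rounds; it does not model the bit-level cost of each multiplication or comparison.

```python
def sqrt_clock(n: int):
    """Return (i, rounds): the smallest i with i*i >= n, found by enumerating
    i = 1, 2, 3, ..., together with the number of enumeration rounds used."""
    i = 0
    rounds = 0
    while True:
        i += 1
        rounds += 1
        if i * i >= n:      # halting condition of the clock
            return i, rounds


for n in (10, 10_000, 1_000_000):
    i, rounds = sqrt_clock(n)
    print(n, i, rounds)
```

The number of rounds grows like the square root of n, so the clock runs for longer than any polylogarithmic bound while staying sublinear.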

We construct M* as follows. On an input x,

(1) M* simulates the computation of the URATM U on input (x, x) and the computation of M_1 on input x simultaneously.

(2) As soon as either U or M_1 halts, M* halts, and the halting state of M* is determined as follows:

(a) If U halts first and enters the accept state, then M* halts and enters the reject state.

(b) If U halts first and enters the reject state, then M* halts and enters the accept state.

(c) If M_1 halts first and enters some state q, then M* halts and enters state q.

The running time of M* is bounded by the running time of M_1, so L(M*) is decided in sublinear time.

Assume that for some k there is an O(log^k n) time RATM N such that L(N) = L(M*). Since the time U needs to simulate N is polylogarithmic in n and therefore grows more slowly than the running time of M_1, there must be a number n_0 such that U finishes simulating N before M_1 halts on every input of length n ≥ n_0. Then, let x be a string representing the machine N whose length is at least n_0. Such a string exists since an arbitrarily long string can be appended to the encoding of a RATM. On this input, U halts before M_1, so by (a) and (b) M* answers the opposite of N on x, which contradicts L(N) = L(M*). ∎

3.3 Reduction and Complete Problems

In this subsection, we first give the definition of DLOGTIME reduction [4]. Then, it is proved that PL and ST are closed under DLOGTIME reduction. Moreover, PL-completeness and ST-completeness are defined under DLOGTIME reduction. Finally, a PL-complete problem and an ST-complete problem are given.

Definition 5

[4] A polynomial reduction f from a problem A to a problem B is a DLOGTIME reduction if the language {(x, i, c) : the i-th bit of f(x) is c} is in DLOGTIME.

The definition of DLOGTIME reduction is different from the reductions we used before. It requires that the time for checking a specific location is logarithmic, while the total time of the reduction is bounded by a polynomial. The following two theorems show that PL and ST are closed under DLOGTIME reduction.

Theorem 3.3

If A ∈ PL^i and there is a DLOGTIME reduction from B to A, then B ∈ PL^{i+1}.

Proof

Let M_A be the RATM that solves A within its time bound, and let f be the DLOGTIME reduction from B to A. We construct a RATM M_B that solves B. On an input x, M_B simulates M_A on f(x) as follows.

(1) For moves of M_A that do not read the input tape, M_B directly simulates M_A.

(2) For moves of M_A that read the input tape, assuming the i-th symbol of the input is being read, M_B checks whether the i-th symbol of f(x) is c for each symbol c in the input tape symbol set of M_A. When it finds the correct symbol, it continues the simulation of M_A.

Assume that M_A stops on input f(x) after some number of steps, part of which do not read the input tape and the rest of which do.

To simulate the steps that do not read the input tape, M_B needs the same number of steps.

Assume that the j-th step of M_A reads the input tape. To simulate the j-th step, M_B needs to obtain the input symbol in f(x) by scanning the input tape symbol set of M_A, which needs O(|Σ| log |x|) steps by the definition of DLOGTIME reduction, where |Σ| is the size of the input tape symbol set of M_A. Thus, each simulated read costs only a logarithmic factor, since |Σ| is a constant and |f(x)| is polynomial in |x|.

In summary, M_B needs at most a logarithmic factor more steps than M_A, which is again polylogarithmic in |x|. ∎
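The simulation idea, presenting f(x) to the simulated machine one symbol at a time without ever materializing f(x), can be sketched as follows. Everything here is a toy under stated assumptions: the reduction (flip every bit), the two deciders, and the names `LazyInput`, `complement_bit_is`, `decide_A` and `decide_B` are hypothetical and chosen only to make the mechanism visible.

```python
class LazyInput:
    """Presents f(x) to a simulated machine without ever building f(x)."""

    def __init__(self, x, bit_is, alphabet, out_len):
        self.x, self.bit_is = x, bit_is
        self.alphabet, self.out_len = alphabet, out_len
        self.queries = 0            # number of position checks performed

    def __len__(self):
        return self.out_len

    def __getitem__(self, i):
        # Try every candidate symbol; each test is one call to the
        # reduction's position checker.
        for c in self.alphabet:
            self.queries += 1
            if self.bit_is(self.x, i, c):
                return c
        raise IndexError(i)


def complement_bit_is(x, i, c):
    """Hypothetical reduction f: flip every bit of x. Is f(x)[i] == c?"""
    return x[i] == ("1" if c == "0" else "0")


def decide_A(w):
    """Toy decider for A: accept iff the middle symbol is '1' (one read)."""
    return w[len(w) // 2] == "1"


def decide_B(x):
    """Decide B = {x : middle symbol is '0'} by running decide_A on f(x)."""
    lazy = LazyInput(x, complement_bit_is, "01", len(x))
    return decide_A(lazy), lazy.queries


print(decide_B("1101011"))
print(decide_B("1100011"))
```

Each read of the simulated input costs at most one position check per alphabet symbol, which is the source of the extra factor in the analysis.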

From Theorem 3.3, we can directly derive the following Corollary 2.

Corollary 2

PL is closed under DLOGTIME reduction.

Theorem 3.4

ST is closed under DLOGTIME reduction.

Proof

Let M_A be the RATM that solves A in sublinear time, and let f be the DLOGTIME reduction from B to A. We construct a RATM M_B that solves B in sublinear time. On an input x, M_B simulates M_A on f(x) as follows.

(1) For moves of M_A that do not read the input tape, M_B directly simulates M_A.

(2) For moves of M_A that read the input tape, assuming the i-th symbol of the input is being read, M_B checks whether the i-th symbol of f(x) is c for each symbol c in the input tape symbol set of M_A. When it finds the correct symbol, it continues the simulation of M_A.

Assume that M_A stops on input f(x) after some number of steps, part of which do not read the input tape and the rest of which do.

To simulate the steps that do not read the input tape, M_B needs the same number of steps.

Assume that the j-th step of M_A reads the input tape. To simulate the j-th step, M_B needs to get the input symbol in f(x) by scanning the input tape symbol set of M_A, which needs O(|Σ| log |x|) steps by the definition of DLOGTIME reduction, where |Σ| is the size of the input tape symbol set of M_A. Thus, each simulated read costs only a logarithmic factor, since |Σ| is a constant.

In summary, M_B needs at most the steps of M_A plus the cost of the position checks, which is sublinear in |x|. ∎

The definitions of PL-completeness and ST-completeness under DLOGTIME reduction are given in the following.

Definition 6

A problem A is PL-hard under DLOGTIME reduction if there is a DLOGTIME reduction from B to A for all B in PL. A problem A is PL-complete under DLOGTIME reduction if A ∈ PL and A is PL-hard.

Definition 7

A problem A is ST-hard under DLOGTIME reduction if there is a DLOGTIME reduction from B to A for all B in ST. A problem A is ST-complete under DLOGTIME reduction if A ∈ ST and A is ST-hard.

The Bounded Halting Problem (BHP) is NP-complete. We present a sublinear version of BHP and prove that it is PL-complete and ST-complete.

Sublinear Bounded Halting Problem (SBHP):

INPUT: the code c(M) of a RATM M, M's input x, and a string 1^t encoding a step bound t.

OUTPUT: Does machine M accept x within t moves?

Theorem 3.5

SBHP is PL-complete.

Proof

First, we show that SBHP is in PL. The URATM U can be used to simulate t steps of M on input x; if M enters the accept state, U accepts, otherwise U rejects. If the running time of M on input x is O(log^k n) for some k, the running time of U is polylogarithmic according to Theorem 2.1. Thus SBHP is in PL.

Next, we prove that there is a DLOGTIME reduction from L to SBHP for all L ∈ PL. Since L ∈ PL, there exist a RATM M and an integer k such that M accepts x in O(log^k |x|) time if and only if x ∈ L. It is simple to transform any x into an instance of SBHP by letting f(x) = (c(M), x, 1^t), where c(M) is the code of M and t is the time bound of M on x.

It remains to prove that there is a RATM N that can decide whether the i-th bit of f(x) is a given symbol in O(log |x|) time. N works as follows.

(1) N computes the lengths of c(M) and x and stores them in binary.

(2) N calculates t.

(3) N determines the i-th symbol of f(x) as follows.

If i ≤ |c(M)|, N outputs the i-th symbol of c(M).

If |c(M)| < i ≤ |c(M)| + |x|, N outputs the (i − |c(M)|)-th symbol of x.

If i > |c(M)| + |x|, N outputs 1.

Since all numbers are encoded in binary, N needs at most O(log |x|) steps.

It is clear that f is computable in polynomial time. ∎
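The position check in step (3) amounts to locating an index within a concatenation of three parts without building the concatenation. The sketch below assumes the instance is laid out as the machine code, then the input, then a unary block of ones, with no separators; that layout and the function name `symbol_at` are assumptions for illustration.

```python
def symbol_at(code: str, x: str, t: int, i: int) -> str:
    """i-th symbol (1-based) of code + x + '1' * t, without concatenating."""
    if not 1 <= i <= len(code) + len(x) + t:
        raise IndexError(i)
    if i <= len(code):
        return code[i - 1]                 # inside the machine code
    if i <= len(code) + len(x):
        return x[i - len(code) - 1]        # inside the original input
    return "1"                             # inside the unary step bound


code, x, t = "0110", "abc", 5
print("".join(symbol_at(code, x, t, i) for i in range(1, len(code) + len(x) + t + 1)))
```

Only a constant number of comparisons and one subtraction on binary-encoded lengths are needed per query, which is why the check fits in logarithmic time on a machine with random access.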

By changing the step bound t of SBHP from a polylogarithmic bound to a sublinear bound, where t bounds the running time of M on input x, we derive the following theorem.

Theorem 3.6

SBHP is ST-complete.

Proof

First, we prove that SBHP is in ST. The URATM U can be used to simulate t steps of M on input x; if M enters the accept state, U accepts, otherwise U rejects. If the running time of M on input x is t(n) for some function t(n) in o(n), the running time of U is sublinear according to Theorem 2.1. Thus, SBHP is in ST.

Next, we prove that there is a DLOGTIME reduction from L to SBHP for all L ∈ ST. Since L ∈ ST, there exist a RATM M and a function t(n) in o(n) such that M accepts x in t(|x|) time if and only if x ∈ L. It is simple to transform any x into an instance of SBHP by letting f(x) = (c(M), x, 1^t), where c(M) is the code of M.

It remains to prove that there is a RATM N that can decide whether the i-th bit of f(x) is a given symbol in O(log |x|) time. N works as follows.

(1) N computes the lengths of c(M) and x and stores them in binary.

(2) N calculates t.

(3) N determines the i-th symbol of f(x) as follows.

If i ≤ |c(M)|, N outputs the i-th symbol of c(M).

If |c(M)| < i ≤ |c(M)| + |x|, N outputs the (i − |c(M)|)-th symbol of x.

If i > |c(M)| + |x|, N outputs 1.

Since all numbers are encoded in binary, N needs at most O(log |x|) steps.

It is clear that f is computable in polynomial time. ∎

4 Pseudo-tractable Classes

In this section, we study the big data computing problems that can be solved in sublinear time after a PTIME preprocessing. We propose two complexity classes, PTR and PTE, and investigate the relations among PTR, PTE and other complexity classes. For ease of understanding, we first review the definition of a problem in big data computing, that is,

INPUT: big data D, and a function F on D.

OUTPUT: F(D).

4.1 Pseudo-tractable Class by Reducing

We will use PTR to denote the pseudo-tractable class obtained by reducing the size of the input data, which is defined as follows.

Definition 8

A problem F is in the complexity class PTR if there exists a PTIME preprocessing function Π such that for big data D,

(1) |Π(D)| < |D| and F(Π(D)) = F(D).

(2) F(Π(D)) can be solved by a RATM in o(|D|) time.

Data D is preprocessed by a preprocessing function Π in polynomial time to generate another structure Π(D). Besides the PTIME restriction on Π, it is required that the size of Π(D) is smaller than the size of D. This allows many existing polynomial time algorithms to be used. For example, if Π(D) is sufficiently small relative to D, F(Π(D)) can be solved by a quadratic polynomial time algorithm whose running time is still sublinear in |D|. To make Π(D) smaller than D, Π can be data compression, sampling, etc.
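A small sketch of preprocessing by size reduction: the data is reduced once to its distinct values, and a quadratic algorithm is then run on the reduced data only. The problem (are there two distinct values summing to a target?), the dataset and the function names are all hypothetical, and the approach only pays off when the reduced data is much smaller than the original, which is assumed here.

```python
def preprocess(data):
    """PTIME preprocessing: keep one sorted copy of each distinct value."""
    return sorted(set(data))


def has_pair_sum(values, target):
    """Quadratic check for two distinct values that sum to target.

    Returns (answer, ops), where ops counts the pairs examined.
    """
    ops = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            ops += 1
            if values[i] + values[j] == target:
                return True, ops
    return False, ops


data = [v % 50 for v in range(1_000_000)]   # 10**6 records, 50 distinct values
reduced = preprocess(data)                  # one-time reduction
answer, ops = has_pair_sum(reduced, 97)     # quadratic only in the reduced size
print(len(data), len(reduced), answer, ops)
```

The answer on the reduced data equals the answer on the original data for this problem, and the number of examined pairs is far below the number of records, even though the algorithm itself is quadratic.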

The following simple propositions show the time complexity of the problem after preprocessing.

Proposition 1

If there is a preprocessing function Π and a constant c > 1 such that |Π(D)| = |D|^{1/c} for any D, and there is an algorithm to solve F(Π(D)) in time polynomial of degree d, where d < c, then F(D) can be solved in O(|D|^{d/c}) time, and thus F is in PTR.

Proposition 2

If there is a preprocessing function and a constant such that for any , and there is a PTIME algorithm to solve , then