Sentence Compression via DC Programming Approach

02/13/2019 ∙ by Yi-Shuai Niu, et al. ∙ Shanghai Jiao Tong University

Sentence compression is an important problem in natural language processing. In this paper, we first establish a new sentence compression model based on a probability model and a parse tree model. Our sentence compression model is equivalent to an integer linear program (ILP) which can both guarantee the syntactic correctness of the compression and preserve its main meaning. We propose a DC (Difference of Convex) programming approach (DCA) for finding a local optimal solution of our model. Combining DCA with a parallel branch-and-bound framework, we can find a global optimal solution. Numerical results demonstrate the good quality of our sentence compression model and the excellent performance of our proposed solution algorithm.




1 Introduction

Recent years have seen the rapid evolution of artificial intelligence (AI) technologies, and sentence compression has attracted the attention of researchers due to the necessity of dealing with a huge amount of natural language information within a very short response time. The general idea of sentence compression is to produce a summary with shorter sentences containing the most important information while respecting grammatical rules. Nowadays, various technologies involve sentence compression, such as text summarization, search engines, and question answering. Sentence compression will be a key technology in future human-AI interaction systems.

There are various models proposed for sentence compression. The paper of Jing [3] could be one of the first works addressing this topic, with many rewriting operations such as deletion, reordering, substitution, and insertion. This approach relies on multiple knowledge resources (such as WordNet and parallel corpora) to find the parts that cannot be removed because they are detected to be grammatically necessary by some simple rules. Later, Knight and Marcu investigated discriminative models [4]. They proposed a decision-tree model to find the intended words through a tree rewriting process, and a noisy-channel model to construct a compressed sentence from some scrambled words based on the probability of mistakes. McDonald [12] presented a sentence compression model using a discriminative large-margin algorithm. He ranks each candidate compression using a scoring function based on the Ziff-Davis corpus and a Viterbi-like algorithm. The model has a rich feature set defined over compression bigrams, including parts of speech, parse trees, and dependency information, without using a synchronous grammar. Clarke and Lapata [1] reformulated McDonald's model in the context of integer linear programming (ILP) and extended it with constraints ensuring that the compressed output is grammatically and semantically well formed. The corresponding ILP model is solved using the branch-and-bound algorithm.

In this paper, we propose a new sentence compression model that both guarantees the grammatical rules and preserves the main meaning. The main contributions of this work are: (1) Taking advantage of the parse tree model and the probability model, we hybridize them to build a new model that can be formulated as an ILP. Using the parse tree model, we can extract the sentence trunk, then fix the corresponding integer variables in the probability model to derive a simplified ILP with improved quality of the compressed result. (2) We propose to use a DC programming approach called PDCABB (a hybrid algorithm combining DCA with a parallel branch-and-bound framework) developed by Niu in [17] for solving our sentence compression model. This approach can often provide a high-quality optimal solution in a very short time.

The paper is organized as follows: Section 2 is dedicated to establishing the hybrid sentence compression model. In Section 3, we present the DC programming approach for solving the ILP. Numerical simulations and the experimental setup are reported in Section 4. Conclusions and future works are discussed in the last section.

2 Hybrid Sentence Compression Model

Our sentence compression model is based on an Integer Linear Programming (ILP) probability model [1] and a parse tree model. In this section, we give a brief introduction of the two models and propose our new hybrid model.

2.1 ILP Probability Model

Let $x = (x_1, \dots, x_n)$ be a sentence with $n$ words (punctuation is also deemed a word). We add $x_0 = $ 'start' as the start token and $x_{n+1} = $ 'end' as the end token.

Sentence compression consists in choosing a subset of words in $x$ that maximizes its probability to be a sentence, under some restrictions on the allowable trigram combinations. This probability model can be described as an ILP as follows:

Decision variables: We introduce a binary decision variable $\delta_i$, $i \in \llbracket 1,n \rrbracket$ (where $\llbracket a,b \rrbracket$ stands for the set of integers between $a$ and $b$), for each word $x_i$: $\delta_i = 1$ if $x_i$ is in a compression and $0$ otherwise. In order to take context information into consideration, we introduce the context variables: for $i \in \llbracket 1,n \rrbracket$, we set $\alpha_i = 1$ if $x_i$ starts a compression and $0$ otherwise; for $0 \le i < j \le n$, we set $\beta_{ij} = 1$ if the sequence $x_i, x_j$ ends a compression and $0$ otherwise; and for $0 \le i < j < k \le n$, we set $\gamma_{ijk} = 1$ if the sequence $x_i, x_j, x_k$ is in a compression and $0$ otherwise. There are $O(n^3)$ binary variables in total.

Objective function: The objective is to maximize the probability of the compression, computed (in log form) by:

$\max \; \sum_{i=1}^{n} \alpha_i \log P(x_i \mid \text{start}) + \sum_{0 \le i<j<k \le n} \gamma_{ijk} \log P(x_k \mid x_i, x_j) + \sum_{0 \le i<j \le n} \beta_{ij} \log P(\text{end} \mid x_i, x_j)$

where $P(x_i \mid \text{start})$ stands for the probability of a sentence starting with $x_i$, $P(x_k \mid x_i, x_j)$ denotes the probability that $x_i, x_j, x_k$ successively occur in a sentence, and $P(\text{end} \mid x_i, x_j)$ means the probability that $x_i, x_j$ ends a sentence. The probability $P(x_i \mid \text{start})$ is computed by a bigram model, and the others are computed by a trigram model based on some corpora.
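As an illustration, the score a compression receives under this objective can be computed directly from the bigram and trigram tables. The toy probabilities below are made up for the example; in practice they are estimated from corpora with smoothing:

```python
import math

# Toy probability tables, purely illustrative (real ones come from a corpus,
# e.g. with Kneser-Ney smoothing).
P_START = {"the": 0.4}                                # P(w | 'start'), bigram
P_TRI = {("start", "the", "man"): 0.2,                # P(w3 | w1, w2), trigram
         ("the", "man", "saw"): 0.3,
         ("man", "saw", "end"): 0.1}

def compression_log_prob(words, floor=1e-12):
    """Log-probability of a compressed word sequence: a bigram term for the
    first word plus one trigram term per following word (including 'end')."""
    seq = ["start"] + list(words) + ["end"]
    score = math.log(P_START.get(seq[1], floor))
    for i in range(2, len(seq)):
        score += math.log(P_TRI.get((seq[i - 2], seq[i - 1], seq[i]), floor))
    return score

# compression_log_prob(["the", "man", "saw"]) == log(0.4 * 0.2 * 0.3 * 0.1)
```

The ILP objective simply selects, through $\alpha$, $\gamma$, and $\beta$, exactly the log-terms that this direct computation sums.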

Constraints: The following sequential constraints will be introduced to restrict the possible trigram combinations:

Constraint 1 Exactly one word can begin a sentence:

$\sum_{i=1}^{n} \alpha_i = 1.$
Constraint 2 If a word is included in a compression, it must either start the sentence, or be preceded by two other words, or be preceded by the 'start' token and one other word:

$\delta_k - \alpha_k - \sum_{i=0}^{k-2} \sum_{j=i+1}^{k-1} \gamma_{ijk} = 0, \quad \forall k \in \llbracket 1,n \rrbracket.$
Constraint 3 If a word is included in a compression, it must either be preceded by one word and followed by another, or be preceded by one word and end the sentence:

$\delta_j - \sum_{i=0}^{j-1} \sum_{k=j+1}^{n} \gamma_{ijk} - \sum_{i=0}^{j-1} \beta_{ij} = 0, \quad \forall j \in \llbracket 1,n \rrbracket.$
Constraint 4 If a word is in a compression, it must either be followed by two words, or be followed by one word and end the sentence:

$\delta_i - \sum_{j=i+1}^{n} \sum_{k=j+1}^{n} \gamma_{ijk} - \sum_{j=i+1}^{n} \beta_{ij} = 0, \quad \forall i \in \llbracket 1,n \rrbracket.$
Constraint 5 Exactly one word pair can end the sentence:

$\sum_{i=0}^{n-1} \sum_{j=i+1}^{n} \beta_{ij} = 1.$
Constraint 6 The length of a compression should be bounded:

$\underline{l} \le \sum_{i=1}^{n} \delta_i \le \bar{l},$

with given lower and upper bounds $\underline{l}$ and $\bar{l}$ of the compression length.

Constraint 7 The introducing term for a prepositional phrase (PP) or subordinate clause (SBAR) must be included in the compression if any word of the phrase is included; otherwise, the phrase should be entirely removed. Let us denote by $I_s$ the index set of the words included in a PP/SBAR led by the introducing term $x_s$; then

$\delta_s \le \sum_{j \in I_s} \delta_j \le |I_s| \, \delta_s.$
ILP probability model: The optimization model for sentence compression is summarized as the binary linear program

$\max \; \text{(objective above)} \quad \text{s.t. Constraints 1--7}, \quad (\delta, \alpha, \beta, \gamma) \text{ binary}, \qquad (8)$

with $O(n^3)$ binary variables and $O(n)$ linear constraints.

The advantage of this model is that its solution provides a compression with maximal probability based on the trigram model. However, it carries no information about the syntactic structure of the target sentence, so it may generate ungrammatical sentences. To overcome this disadvantage, we propose to combine it with the parse tree model presented below.
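For intuition, the following sketch (not the paper's implementation) solves a tiny instance of this model by brute force: for a short sentence, enumerating all subsequences and scoring them with the trigram objective yields exactly the optimum the ILP would return. The probability tables are invented for the example.

```python
import itertools
import math

# Made-up tables favoring the compression "the man saw the dog"; any trigram
# not listed falls back to a tiny floor probability.
P_START = {"the": 0.5}
P_TRI = {("start", "the", "man"): 0.3, ("the", "man", "saw"): 0.4,
         ("man", "saw", "the"): 0.3, ("saw", "the", "dog"): 0.4,
         ("the", "dog", "end"): 0.2}
FLOOR = 1e-9

def log_prob(words):
    seq = ["start"] + words + ["end"]
    s = math.log(P_START.get(seq[1], FLOOR))
    for i in range(2, len(seq)):
        s += math.log(P_TRI.get((seq[i - 2], seq[i - 1], seq[i]), FLOOR))
    return s

def best_compression(words):
    """Exhaustive counterpart of the ILP: try every non-empty subsequence."""
    best, best_score = None, -math.inf
    for r in range(1, len(words) + 1):
        for idx in itertools.combinations(range(len(words)), r):
            cand = [words[i] for i in idx]   # a subsequence keeps word order
            s = log_prob(cand)
            if s > best_score:
                best, best_score = cand, s
    return best

sentence = "the man saw the dog with the telescope".split()
# best_compression(sentence) -> ['the', 'man', 'saw', 'the', 'dog']
```

The exponential enumeration is only viable for toy sentences; the ILP formulation is what makes the search tractable in general.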

2.2 Parse Tree Model

A parse tree is an ordered, rooted tree which reflects the syntax of the input language based on some grammar rules (e.g., a context-free grammar, CFG). For constructing a parse tree in practice, we can use the natural language processing toolkit NLTK [18] in Python. Based on NLTK, we have developed a CFG grammar generator which automatically generates a CFG grammar for a target sentence. A recursive descent parser can then build a parse tree.

For example, the sentence “The man saw the dog with the telescope.” can be parsed as in Figure 1. It is observed that a higher-level node in the parse tree indicates more important sentence components (e.g., the sentence S consists of a noun phrase NP, a verb phrase VP, and a symbol SYM), whereas a lower node tends to carry more semantic content (e.g., the prepositional phrase PP consists of the preposition ‘with’ and the noun phrase ‘the telescope’). Therefore, a parse tree presents the structure of a sentence in a clear and logical way.

Figure 1: Parse tree example

Sentence compression can also be considered as finding a subtree which remains grammatically correct and contains the main meaning of the original sentence. Therefore, we can propose a procedure to delete some nodes in the parse tree. For instance, the sentence above can be compressed as “The man saw the dog.” by deleting the node PP.
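This deletion procedure is easy to sketch with a toy tree structure, a minimal stand-in for the NLTK trees used in our implementation: each node is a (label, children) pair, and pruning every subtree with a given label yields the compressed sentence.

```python
# Toy parse tree for "The man saw the dog with the telescope .":
# nodes are (label, children) tuples, leaves are plain strings.
SENT = ("S", [
    ("NP", ["The", "man"]),
    ("VP", [("V", ["saw"]),
            ("NP", ["the", "dog"]),
            ("PP", [("P", ["with"]),
                    ("NP", ["the", "telescope"])])]),
    ("SYM", ["."]),
])

def prune(tree, drop):
    """Return a copy of `tree` with every subtree labeled `drop` removed."""
    if isinstance(tree, str):                  # leaf: a word
        return tree
    label, children = tree
    if label == drop:
        return None
    kept = [c for c in (prune(ch, drop) for ch in children) if c is not None]
    return (label, kept)

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

# Deleting the PP node compresses the sentence:
# " ".join(leaves(prune(SENT, "PP"))) -> "The man saw the dog ."
```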

2.3 New Hybrid Model: ILP-Parse Tree Model

Our proposed model for sentence compression, called ILP-Parse Tree Model (ILP-PT), is based on the combination of the two models described above. The ILP model will provide some candidates for compression with maximal probability, while the parse tree model helps to guarantee the grammar rules and keep the main meaning of the sentence. This combination is described as follows:

Step 1 (Build ILP probability model): Building the ILP model as in formulation (8) for the target sentence.

Step 2 (Parse Sentence): Building a parse tree as described in subsection 2.2.

Step 3 (Fix variables for sentence trunk): Identifying the sentence trunk in the parse tree and fixing the corresponding integer variables to 1 in the ILP model. This step extracts the sentence trunk, keeping the main meaning of the original sentence while reducing the number of binary decision variables.

More precisely, we introduce for each node of the parse tree a label taking one of three values: delete (the node is removed), reserve (the node is kept), or free (the node can either be deleted or reserved). We set these labels as compression rules for each CFG grammar so as to support any sentence type of any language.

For each word $x_i$, we go through all its parent nodes up to the root S. If the traversal path contains a node labeled delete, then $\delta_i$ is fixed to 0; else if the traversal path contains only nodes labeled reserve, then $\delta_i$ is fixed to 1; otherwise $\delta_i$ will be determined by solving the ILP model. The sentence trunk is composed of the words whose $\delta_i$ are fixed to 1. Using this method, we can extract the sentence trunk and reduce the number of binary variables in the ILP model.
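A minimal sketch of this fixing rule (the label names are ours, not the package's API): given, for each word, the list of labels on its path to the root, the corresponding variable is fixed to 0, fixed to 1, or left free for the ILP.

```python
def fix_variables(paths):
    """paths[i]: labels on the path from the root S down to word i.
    Returns delta[i] = 0 (deleted), 1 (trunk) or None (decided by the ILP)."""
    delta = []
    for path in paths:
        if "delete" in path:
            delta.append(0)        # an ancestor is marked for deletion
        elif all(tag == "reserve" for tag in path):
            delta.append(1)        # fully reserved path: word is in the trunk
        else:
            delta.append(None)     # path contains a free node: left to the ILP
    return delta

# "The man saw the dog with the telescope ." with the PP subtree deletable:
paths = [["reserve", "reserve"], ["reserve", "reserve"],   # The man
         ["reserve", "reserve"],                           # saw
         ["reserve", "free"], ["reserve", "free"],         # the dog
         ["delete"], ["delete"], ["delete"],               # with the telescope
         ["reserve"]]                                      # .
# fix_variables(paths) -> [1, 1, 1, None, None, 0, 0, 0, 1]
```

Only the `None` positions remain as decision variables in the simplified ILP of Step 4.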

Step 4 (Solve ILP): Applying an ILP solution algorithm to solve the simplified ILP model derived in Step 3 and generate a compression. In the next section, we will introduce a DC programming approach for solving ILP.

3 DC Programming approach for solving ILP

Solving an ILP is NP-hard in general. A classical and most frequently used method is the branch-and-bound algorithm, as in [1]. Gurobi [2] is currently one of the best ILP solvers; it is an efficient implementation of branch-and-bound combined with various techniques such as presolve, cutting planes, heuristics, and parallelism.

In this section, we present a Difference of Convex (DC) programming approach, called DCA-Branch-and-Bound (DCABB), for solving this model. DCABB was initially designed for solving mixed-integer linear programs (MILP) in [13], and extended to mixed-integer nonlinear programs [14, 15] with various applications including scheduling [8], network optimization [20], cryptography [10], and finance [9, 19]. This algorithm is based on continuous representation techniques for the integer set, the exact penalty theorem, DCA, and the branch-and-bound algorithm. Recently, the author developed a parallel branch-and-bound framework (called PDCABB) [17] in order to use the power of multiple CPUs and GPUs to improve the performance of DCABB.

The ILP model can be stated in standard matrix form as:

$(P) \quad \min \{ c^\top x : x \in S \}, \qquad S = \{ x \in \{0,1\}^n : Ax \le b \},$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^n$. Let us denote by $K$ the linear relaxation of $S$ defined by $K = \{ x \in [0,1]^n : Ax \le b \}$. Thus, we have the relationship between $S$ and $K$ as

$S = K \cap \{0,1\}^n.$

The linear relaxation of $(P)$, denoted by $(R)$, is

$(R) \quad \min \{ c^\top x : x \in K \},$

whose optimal value, denoted by $l(P)$, is a lower bound of $(P)$.

The continuous representation technique for the integer set consists of finding a continuous DC function $p$ (a function is called DC if there exist two convex functions $g$ and $h$, called DC components, such that $p = g - h$) such that

$p(x) \ge 0, \; \forall x \in K \quad \text{and} \quad \{ x \in K : p(x) = 0 \} = S.$

We often use the following functions for $p$ with their DC components:

function type | expression of $p$ | DC components of $p$
piecewise linear | $p(x) = \sum_{i=1}^n \min(x_i, 1-x_i)$ | $g(x) = \sum_{i=1}^n x_i$, $h(x) = \sum_{i=1}^n \max(0, 2x_i - 1)$
quadratic | $p(x) = \sum_{i=1}^n x_i(1-x_i)$ | $g(x) = \sum_{i=1}^n x_i$, $h(x) = \sum_{i=1}^n x_i^2$
trigonometric | $p(x) = \sum_{i=1}^n \sin^2(\pi x_i)$ | $g(x) = \pi^2 \sum_{i=1}^n x_i^2$, $h(x) = \sum_{i=1}^n (\pi^2 x_i^2 - \sin^2(\pi x_i))$

Based on the exact penalty theorem [6, 11], there exists a large enough penalty parameter $t \ge 0$ such that the problem $(P)$ is equivalent to the penalized problem $(P_t)$:

$(P_t) \quad \min \{ F_t(x) := c^\top x + t\, p(x) : x \in K \}.$

The objective function $F_t$ in $(P_t)$ is also DC, with DC components $G$ and $H$ defined as $F_t = G - H$, where $G(x) = c^\top x + t\, g(x)$ and $H(x) = t\, h(x)$, and $g$ and $h$ are DC components of $p$. Thus the problem $(P_t)$ is a DC program which can be solved by DCA, described in Algorithm 1.

Input: initial point $x^0 \in K$; large enough penalty parameter $t$; tolerance $\varepsilon$.
Output: optimal solution $x^*$ and optimal value $f^*$.
Initialization: set $k \leftarrow 0$.
Step 1: compute $y^k \in \partial H(x^k)$;
Step 2: solve $x^{k+1} \in \operatorname{argmin} \{ G(x) - \langle y^k, x \rangle : x \in K \}$;
Step 3 (stopping check):
  if $\|x^{k+1} - x^k\| \le \varepsilon$ or $|F_t(x^{k+1}) - F_t(x^k)| \le \varepsilon$ then
    $x^* \leftarrow x^{k+1}$; $f^* \leftarrow F_t(x^{k+1})$; return;
  else
    $k \leftarrow k + 1$; goto Step 1.
  end
Algorithm 1: DCA for $(P_t)$

The symbol $\partial H(x)$ denotes the subdifferential of $H$ at $x$, which is fundamental in convex analysis. The subdifferential generalizes the derivative in the sense that $H$ is differentiable at $x$ if and only if $\partial H(x)$ reduces to the singleton $\{\nabla H(x)\}$.

Concerning the choice of the penalty parameter $t$, we suggest the following two methods: the first is to take an arbitrarily large value for $t$; the second is to increase $t$ along the iterations of DCA (e.g., [14, 19]). Note that a smaller parameter $t$ yields a better DC decomposition [16].
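To make the scheme concrete, here is an illustrative-only DCA iteration for the quadratic penalty $p(x) = \sum_i x_i(1-x_i)$ on the box $K = [0,1]^n$ (in the paper, $K$ also carries the linear constraints $Ax \le b$, so Step 2 is a linear program; a plain box keeps the sketch solver-free). With $G(x) = c^\top x + t \sum_i x_i$ and $H(x) = t \sum_i x_i^2$, Step 1 is $y^k = \nabla H(x^k) = 2t\,x^k$, and Step 2 minimizes a linear function over the box, which is solved coordinate-wise.

```python
def dca_box(c, t=10.0, x0=None, max_iter=100, tol=1e-8):
    """DCA for min c.x + t*sum x_i(1-x_i) over [0,1]^n (illustration only)."""
    n = len(c)
    x = list(x0) if x0 is not None else [0.5] * n
    for _ in range(max_iter):
        y = [2.0 * t * xi for xi in x]                 # Step 1: y = grad H(x)
        # Step 2: argmin over [0,1]^n of sum_i (c_i + t - y_i) * x_i,
        # solved coordinate-wise: pick the endpoint with the smaller value.
        x_new = [1.0 if c[i] + t - y[i] < 0.0 else 0.0 for i in range(n)]
        if max(abs(a - b) for a, b in zip(x_new, x)) <= tol:
            return x_new                               # converged (Step 3)
        x = x_new
    return x

# From x0 = (0.5,...,0.5) the first step gives y_i = t, so x_i jumps to 1
# exactly when c_i < 0; the iterate is binary after one step, illustrating how
# DCA tends to land on integer points for these penalties.
# dca_box([-1.0, 2.0, -3.0]) -> [1.0, 0.0, 1.0]
```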

Concerning the numerical results given by DCA, it is often observed that DCA provides an integer solution, which is then an upper bound solution for the problem $(P)$. Therefore, DCA is often used as an upper bound algorithm in nonconvex optimization. More details about DCA and its convergence theorem can be found in [7, 5]. Combining DCA with the parallel branch-and-bound algorithm (PDCABB) proposed in [17], we can globally solve the ILP. The PDCABB algorithm is described in Algorithm 2. More details about this algorithm, such as the convergence theorem, branching strategies, and parallel node selection strategies, will be discussed in a full-length paper.

Input: problem $(P)$; number of parallel workers $s$; tolerance $\varepsilon$.
Output: optimal solution $x^*$ and optimal value $f^*$.
Initialization: $UB \leftarrow +\infty$; node list $L \leftarrow \emptyset$.
Step 1 (Root Operations):
  Solve the relaxation $(R)$ to obtain its optimal solution $x_R$ and set the lower bound to $l(P)$;
  if $(R)$ is infeasible then
    return;
  else if $x_R \in \{0,1\}^n$ then
    $x^* \leftarrow x_R$; $f^* \leftarrow c^\top x_R$; return;
  end
  Run DCA for $(P_t)$ from $x_R$ to get $x_D$;
  if $x_D \in S$ then
    $UB \leftarrow c^\top x_D$; $x^* \leftarrow x_D$;
  end
  $L \leftarrow \{(P)\}$.
Step 2 (Node Operations, parallel B&B):
  while $L \ne \emptyset$ do
    Select a sublist $L_s$ of $L$ with at most $s$ nodes; update $L \leftarrow L \setminus L_s$;
    parallel for each node $(P_i) \in L_s$ do
      Solve the relaxation $(R_i)$ to get its solution $x_i$ and lower bound $l_i$;
      if $(R_i)$ is feasible and $l_i < UB$ then
        if $x_i \in \{0,1\}^n$ then
          $UB \leftarrow c^\top x_i$; $x^* \leftarrow x_i$;
        else
          if the DCA restart condition holds then
            Run DCA for $(P_t)$ from $x_i$ to get $x_i^D$; if $x_i^D \in S$ and $c^\top x_i^D < UB$ then $UB \leftarrow c^\top x_i^D$; $x^* \leftarrow x_i^D$;
          end
          Branch $(P_i)$ into two new subproblems; update $L$;
        end
      end
    end
  end
  $f^* \leftarrow UB$.
Algorithm 2: PDCABB

4 Experimental Results

In this section, we present our experimental results for assessing the performance of the sentence compression model described above.

Our sentence compression model is implemented in Python as a natural language processing package, called ‘NLPTOOL’ (currently supporting multi-language tokenization, tagging, parsing, automatic CFG grammar generation, and sentence compression), which uses NLTK 3.2.5 [18] for creating parse trees and Gurobi 7.5.2 [2] for solving the linear relaxation problems and the convex optimization subproblems in Step 2 of DCA. The PDCABB algorithm is implemented in C++ and invoked from Python. The parallel computing part of PDCABB is realized with OpenMP.

4.1 F-score evaluation

We use a statistical measure called F-score to evaluate the similarity between the compression computed by our algorithm and a standard compression provided by a human. The F-score is defined by:

$F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R},$

where $P$ and $R$ represent the precision rate and recall rate:

$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn},$

in which $tp$ denotes the number of words both in the compressed result and the standard result; $fn$ is the number of words in the standard result but not in the compressed result; and $fp$ counts the number of words in the compressed result but not in the standard result. The parameter $\beta$, called the preference parameter, stands for the preference between precision rate and recall rate for evaluating the quality of the results: $\beta \mapsto F_\beta$ is strictly monotonic on $(0, +\infty)$ with $\lim_{\beta \to 0^+} F_\beta = P$ and $\lim_{\beta \to +\infty} F_\beta = R$. In our tests, we use $F_1 = 2PR/(P+R)$ as F-score. Clearly, a bigger F-score indicates a better compression.
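A direct implementation of this measure follows (words are compared as multisets here; whether tokens or types are matched is an implementation choice not fixed by the text):

```python
from collections import Counter

def f_score(compressed, gold, beta=1.0):
    """F_beta between a compressed result and a human (gold) compression."""
    c, g = Counter(compressed), Counter(gold)
    tp = sum((c & g).values())          # words in both results
    fp = sum((c - g).values())          # in the compression only
    fn = sum((g - c).values())          # in the gold standard only
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

gold = "the man saw the dog with the telescope".split()
out = "the man saw the dog".split()
# f_score(out, gold): precision 1.0, recall 5/8, so F_1 = 10/13 ≈ 0.769
```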

4.2 Numerical Results

Table 1 illustrates the compression results obtained by two ILP compression models: our new hybrid model (H) vs. the probability model (P). The Penn Treebank corpus (Treebank) provided in NLTK and the CLwritten corpus (Clarke) provided in [1] are used for sentence compression. We applied Kneser-Ney smoothing for computing the trigram probabilities. Three compression rates are considered (the compression rate is computed as the length of the compression over the length of the original sentence). We compare the average solution time and the average F-score for these models solved by Gurobi and PDCABB. The experiments are performed on a multi-core laptop with an Intel CPU.

Corpus+Model  Solver   rate 1            rate 2            rate 3
                       F (%)    Time (s) F (%)    Time (s) F (%)    Time (s)
Treebank+P    Gurobi   56.5     0.099    72.1     0.099    79.4     0.081
              PDCABB   59.1     0.194    76.2     0.152    80.0     0.122
Treebank+H    Gurobi   79.0     0.064    82.6     0.070    81.3     0.065
              PDCABB   79.9     0.096    82.7     0.171    82.1     0.121
Clarke+P      Gurobi   70.6     0.087    80.2     0.087    80.0     0.071
              PDCABB   81.4     0.132    80.0     0.128    81.2     0.087
Clarke+H      Gurobi   77.8     0.046    85.5     0.052    82.4     0.041
              PDCABB   79.9     0.081    85.2     0.116    82.3     0.082
Table 1: Compression results (three compression rates; F-score in %, time in seconds)

It can be observed that our hybrid model often provides better F-scores on average for all compression rates, while the computing times of both Gurobi and PDCABB are very short (less than 0.2 seconds). We can also see that Gurobi and PDCABB provide different solutions, since their F-scores differ. This is due to the fact that the branch-and-bound algorithm finds only approximate global solutions once the gap between the upper and lower bounds is small enough. Even when both solvers provide global optimal solutions, these solutions can still differ, since the global optimal solution of an ILP may not be unique. However, the reliability of our judgment is still guaranteed, since the two algorithms provide very similar F-score results.

The box-plots given in Figure 2 demonstrate the variation of F-scores for the different models and corpora. We observe that our hybrid model (Treebank+H and Clarke+H) provides better F-scores on average and is more stable, while the quality of the compressions given by the probability model is worse and varies a lot. Moreover, the choice of corpora affects the compression quality, since the trigram probabilities depend on the corpora. Therefore, in order to provide more reliable compressions, we have to choose the most closely related corpora to compute the trigram probabilities.

Figure 2: Box-plots for different models v.s. F-scores

5 Conclusion and Perspectives

We have proposed a hybrid sentence compression model, ILP-PT, based on the probability model and the parse tree model, which guarantees the syntactic correctness of the compressed sentence and preserves its main meaning. We use the DC programming approach PDCABB to solve our sentence compression model. Experimental results show that our new model and the solution algorithm can produce high-quality compressions within a short compression time.

Concerning future work, we are interested in designing a suitable recurrent neural network for sentence compression. With deep learning methods, it is possible to classify automatically the sentence types and fundamental structures; it is also possible to identify fixed collocations in a sentence and force the corresponding variables to be kept or deleted together. Research in these directions will be reported subsequently.


  • [1] Clarke J, Lapata M.: Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research 31:399–429 (2008).
  • [2] Gurobi 7.5.2.
  • [3] Jing H.: Sentence reduction for automatic text summarization. In Proceedings of the 6th Applied Natural Language Processing Conference, pp. 310–315 (2000).
  • [4] Knight K., Marcu D.: Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139:91–107 (2002).
  • [5] Le Thi H.A.:
  • [6] Le Thi H.A., Pham D.T., and Muu L.D.: Exact penalty in dc programming. Vietnam J. Math. 27(2) (1999).
  • [7] Le Thi H.A., Pham D.T.: The dc (difference of convex functions) programming and dca revisited with dc models of real world nonconvex optimization problems. Ann. Oper. Res. 133: 23–46 (2005).
  • [8] Le Thi H.A., Nguyen Q.T., Nguyen H.T., Pham D.T.: Solving the earliness tardiness scheduling problem by DC programming and DCA. Math. Balk. 23(3–4), 271–288 (2009)
  • [9] Le Thi H.A., Moeini M., Pham D.T.: Portfolio selection under downside risk measures and cardinality constraints based on DC programming and DCA. Comput. Manag. Sci. 6(4), 459–475 (2009)
  • [10] Le Thi H.A., Le H.M., Pham D.T., Bouvry P.: Solving the perceptron problem by deterministic optimization approach based on DC programming and DCA. Proceedings of INDIN 2009, Cardiff. IEEE (2009).
  • [11] Le Thi H.A., Pham D.T., and Huynh V.N.: Exact penalty and error bounds in dc programming. J. Global Optim 52(3) (2012).
  • [12] McDonald R.: Discriminative sentence compression with soft syntactic evidence. In Proceedings of EACL, pp. 297–304 (2006).
  • [13] Niu Y.S, Pham D.T.: A DC Programming Approach for Mixed-Integer Linear Programs. Modelling, Computation and Optimization in Information Systems and Management Sciences, Communications in Computer and Information Science. 14:244–253 (2008).
  • [14] Niu Y.S: Programmation DC & DCA en Optimisation Combinatoire et Optimisation Polynomiale via les Techniques de SDP – Codes et Simulations Numériques. Ph.D. thesis, INSA-Rouen, France (2010).
  • [15] Niu Y.S., Pham D.T.: Efficient DC programming approaches for mixed-integer quadratic convex programs. Proceedings of the International Conference on Industrial Engineering and Systems Management (IESM2011). pp. 222–231 (2011).
  • [16] Niu Y.S.: On Difference-of-SOS and Difference-of-Convex-SOS Decompositions for Polynomials. (2018) arXiv:1803.09900.
  • [17] Niu Y.S.: A Parallel Branch and Bound with DC Algorithm for Mixed Integer Optimization, The 23rd International Symposium in Mathematical Programming (ISMP2018), Bordeaux, France. (2018).
  • [18] NLTK 3.2.5: The Natural Language Toolkit.
  • [19] Pham D.T, Hoai An L.T, Pham V.N, Niu Y.S.: DC programming approaches for discrete portfolio optimization under concave transaction costs. Optimization letters 10(2):261–282 (2016).
  • [20] Schleich J., Le Thi H.A., Bouvry P.: Solving the minimum m-dominating set problem by a continuous optimization approach based on DC programming and DCA. J. Comb. Optim. 24(4), 397–412 (2012)