Deep learning methods, very large neural networks inspired by the human brain, have recently dominated research in many domains, improving results and making them more useful to people. Machine translation, self-driving cars, robotics, digital marketing, customer service, and recommendation systems are some applications of deep learning. In recent years, deep learning has had a significant positive impact on image recognition in particular, allowing much more flexibility. In this research, we utilize image/video captioning [18, 24] methods and natural language processing systems to generate a sentence that serves as a title for a long video, which could be useful in many ways. Using an automated system instead of watching many videos to produce titles saves time. It can also be used in the cinema industry, search engines, and surveillance cameras, to name a few. We present an example of the overall process in Figure 1.
Image and video captioning with deep learning addresses the difficult task of recognizing the objects and actions in an image or video and creating a succinct, meaningful sentence based on the contents found. Text summarization is the task of generating a concise and fluent summary of one or more documents while preserving key information. This paper proposes an architecture that combines an image/video captioning system with text summarization methods to produce a title and an abstract for a long video. To construct a story about a video, we extract the key-frames that carry the most information and feed them to the captioning system, which generates a caption, and collectively a document, for them. For the captioning system, different methods exist, such as encoder-decoder models and generative adversarial networks, which propose different object detection methods. For text summarization, we use extractive and abstractive methods to generate the title and the abstract, respectively. We provide more details in the next sections.
The main contribution of this research is to explore the possibility of producing a title and a concise abstract for a long video by utilizing deep learning technologies, saving time through automation in many application domains. In the rest of the article, we describe the components of our proposed architecture, image/video captioning and text summarization, and provide a literature review for each. Then, we explain the methodology of the proposed architecture and how it works. We present a proof of concept through experiments using publicly available datasets. The article concludes with a discussion of the results and our future work.
2 Definition and Related Work
The architecture proposed in this paper consists of two main components, namely image/video captioning and text summarization, each with several parts. In this section, we dissect each component and review previous work on image/video captioning and text summarization that supports parts of our proposed architecture.
2.1 Image/Video Captioning:
Describing a short video in natural language is a trivial task for most people but a challenging one for machines. From a methodological perspective, categorizing the models or algorithms is difficult because it is hard to separate the contributions of the visual features and of the adopted language model to the final description. Automatically generating natural language sentences describing a video clip generally involves two components: extracting the visual information (the encoder) and describing it in a grammatically correct natural language sentence (the decoder). A convolutional neural network extracts the objects and features from the video frames, and a recurrent neural network then generates a natural sentence based on the available information; an image captioning method can thus be utilized for captioning the frames.
In the field of image captioning, Aneja et al. developed a convolutional image captioning technique, compared it with existing LSTM techniques, and analyzed the differences between RNN-based learning and their method. Their technique contains three main components: the first and last are input and output word embeddings, respectively, while the middle component uses masked convolutions in place of the LSTM or GRU units of the RNN case. This component is feed-forward, without any recurrent function. They also experimented with an attention mechanism whose parameters are computed from the conv-layer activations; their CNN with attention (Attn) achieved performance comparable to, and in the CNN+Attn configuration better than, the LSTM baseline. For better performance on MSCOCO they used ResNet features, which boosted their results; the scores with ResNet101 and ResNet152 were impressive.
In video captioning, Krishna et al. presented dense-captioning, which detects multiple events that occur in a video by jointly localizing temporal proposals of interest and then describing each with natural language. This model introduced a new captioning module that uses contextual information from past and future events to jointly describe all events, and was implemented on the ActivityNet Captions dataset. The captions produced on ActivityNet shift sentence descriptions from being object-centric in images to action-centric in videos. Ding et al. proposed novel techniques for long video segmentation, which can effectively shorten retrieval time. Redundant video frame detection based on spatio-temporal interest points (STIPs) and a novel super-frame segmentation are combined to improve the effectiveness of video segmentation. After that, super-frame segmentation of the filtered long video is performed to find interesting clips. Key-frames from the most impactful segments are captioned using saliency detection and an LSTM-variant network. Finally, an attention mechanism selects the information most crucial to the traditional LSTM. Generative adversarial networks give these methods more flexibility; accordingly, Sung Park et al. applied adversarial networks in their framework. They propose to apply adversarial techniques during inference, designing a discriminator that encourages multi-sentence video description. They decouple the discriminator to evaluate visual relevance to the video, language diversity and fluency, and coherence across sentences on the ActivityNet Captions dataset.
Sequence models such as recurrent neural networks (RNNs) have been widely utilized in speech recognition, natural language processing, and other areas. Sequence models can address supervised learning problems like machine translation, named entity recognition, DNA sequence analysis, video activity recognition, and sentiment classification. LSTM, as a special RNN structure, has proven stable and powerful for modeling long-range dependencies in various studies, and can be adopted as a building block for more complex structures. The complex unit in long short-term memory is called a memory cell; each memory cell is built around a central linear unit with a fixed self-connection. LSTM has historically proven more powerful and effective than a regular RNN since it has three gates (forget, update, and output). Long short-term memory recurrent neural networks can be used to generate complex sequences with long-range structure [14, 26].
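To make the gate structure concrete, the following is a minimal sketch of a single LSTM step in plain Python for scalar inputs and states; the weights are illustrative placeholders, not trained values, and real implementations operate on vectors through a deep learning library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalar input/state, showing the three gates.
    w maps each gate name to an (input-weight, recurrent-weight, bias) triple."""
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)
    f = gate("forget", sigmoid)       # forget gate: how much old cell state to keep
    i = gate("input", sigmoid)        # input (update) gate: how much new candidate to add
    o = gate("output", sigmoid)       # output gate: how much cell state to expose
    g = gate("candidate", math.tanh)  # candidate cell value
    c = f * c_prev + i * g            # new cell state
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Illustrative placeholder weights, identical for every gate
w = {k: (0.5, 0.5, 0.0) for k in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=w)
```

The self-connected cell state `c` is what carries information across long ranges; the gates only modulate how it is updated and read.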
2.2 Text Summarization:
Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. Extractive and abstractive methods are the two main categories of summarization algorithms.
Extractive summarization systems form summaries by copying parts of the input: they identify the important sections of the text, then process and combine them to form a meaningful summary. Abstractive summarization systems generate new phrases, possibly rephrasing or using words that were not in the original text; such summaries are produced by interpreting the raw text and expressing the same information in a different, more concise form using complex neural network-based architectures such as RNNs and LSTMs. Paulus et al. proposed a neural network model with a novel intra-attention that attends over the input and the continuously generated output separately, together with a new training method that combines standard supervised word prediction and reinforcement learning (RL). Raffel et al. explored the landscape of transfer learning techniques for NLP with a unified framework that converts every language problem into a text-to-text format.
Text summarization can be further divided into single-document and multi-document summarization. In single-document summarization, the text is summarized from one document, whereas multi-document summarization systems can generate reports that are rich in important information and present varying views spanning multiple documents.
The proposed architecture consists of two complementary processes: video captioning and text summarization. In the first process, the system takes a video as input and generates a story for it. The generated story is then fed to the second process as a document, which is summarized into a title sentence and an abstract. Figure 2 shows the complete process of the suggested architecture; below, we explain the details of each part.
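The data flow between the two processes can be summarized as the following pipeline sketch; every helper here is a trivial, hypothetical stand-in for the corresponding component described in the text, included only to make the flow explicit.

```python
# Trivial stand-ins so the sketch runs; real systems replace each of these.
def extract_keyframes(path):
    return ["frame0", "frame1"]

def caption_frame(frame):
    return f"a caption for {frame}."

def extractive_summary(doc):
    return doc.split(".")[0] + "."   # placeholder: first sentence as "title"

def abstractive_summary(doc):
    return doc[:40] + "..."          # placeholder: truncation as "abstract"

def video_to_title(video_path):
    """End-to-end sketch: key-frames -> captions -> document -> title/abstract."""
    frames = extract_keyframes(video_path)          # key-frame selection
    captions = [caption_frame(f) for f in frames]   # image/video captioning
    document = " ".join(captions)                   # captions form the "story"
    title = extractive_summary(document)            # extractive summarization
    abstract = abstractive_summary(document)        # abstractive summarization
    return title, abstract

title, abstract = video_to_title("example.mp4")
```

The key design point is that the captioning stage and the summarization stage only communicate through the generated document, so either component can be swapped out independently.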
3.1 Video to Document Process:
Image/video description is the automatic generation of meaningful sentences that describe the events in an image or in video frames. A video consists of many frames, each representing an image. Some frames carry substantial information while others essentially repeat a scene. Therefore, we select key-frames that carry the most information; the in-between frames repeat them with only subtle changes. A sequence of key-frames defines which movement the viewer will see, so the order of the key-frames in the video or animation defines the timing of the movement.
One of our contributions in this research is experimenting with different key-frame selections to build a story for a long video and checking whether we extract the same information. Thus, one task is to obtain the key-frames and process them for captioning, instead of using all frames of the video, saving time and resources while getting the same result. See Figure 3 for an illustration of the frames, key-frames, and in-between frames.
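As a minimal illustration of key-frame selection (one simple heuristic among many; our experiments are not tied to this rule), the sketch below treats frames as flat grayscale pixel lists and keeps a frame only when its mean absolute difference from the last kept frame exceeds an assumed threshold.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=10.0):
    """Keep a frame as a key-frame when it differs from the last key-frame by
    more than `threshold`; in-between frames with subtle changes are skipped."""
    if not frames:
        return []
    keyframes = [0]                     # the first frame is always a key-frame
    for idx in range(1, len(frames)):
        if mean_abs_diff(frames[idx], frames[keyframes[-1]]) > threshold:
            keyframes.append(idx)
    return keyframes

# Three nearly identical frames followed by a scene change
frames = [[100] * 16, [101] * 16, [102] * 16, [200] * 16]
print(select_keyframes(frames))  # → [0, 3]
```

The threshold trades off story coverage against the number of captions generated; in practice it would be tuned per dataset.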
The captioning part consists of two phases: encoder and decoder. The encoder extracts the image information using convolutional neural networks, applying object detection methods to extract the objects and actions and encode them in a vector; ResNet, DenseNet, the R-CNN series, and YOLO, among others, can be used as object detection backbones. The vector then enters the decoder phase, where RNN methods generate a meaningful caption for the image. These two phases can work simultaneously. Figure 4 illustrates the captioning process. Captions are evaluated using BLEU, METEOR, CIDEr, and other metrics [7, 25, 27]; these metrics are common for comparing different image and video captioning models and agree with human judgment to varying degrees.
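To illustrate how such metrics compare a generated caption against a reference, the function below computes clipped unigram precision, the core ingredient of BLEU-1; full BLEU additionally combines higher-order n-gram precisions with a brevity penalty.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: matched candidate words / candidate length,
    where each word counts at most as often as it appears in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(n, ref[w]) for w, n in cand.items())
    return matched / sum(cand.values())

score = unigram_precision("a man is riding a horse",
                          "a man rides a horse on a beach")
print(round(score, 2))  # → 0.67  (4 of 6 candidate words matched)
```

Clipping prevents a degenerate caption such as "a a a a" from scoring perfectly just because "a" appears in the reference.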
3.2 Document to Title Process:
To generate and assign a title to the video clip, we use an extractive text summarization technique. To keep it simple, we use an unsupervised learning approach to compute sentence similarity and rank the sentences. Given the produced document as input, the system splits the whole document into sentences, removes stop words, builds a similarity matrix, ranks the sentences based on the matrix, and finally picks the top sentences for a descriptive title. Figure 5 shows an example of the extractive text summarization system. To produce an abstract, we implemented and used an abstractive text summarization method for the video. Abstractive summarization methods interpret and examine the text using advanced natural language techniques in order to generate a new, shorter text that conveys the most important information from the original.
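A minimal sketch of this extractive pipeline follows; the tiny hard-coded stop-word list and the set-based cosine similarity are simplifications standing in for the full preprocessing, and sentences are ranked by their total similarity to all the others.

```python
import math
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "and", "of", "to"}

def tokenize(sentence):
    """Lowercase word set with stop words removed."""
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOP_WORDS

def cosine_similarity(a, b):
    """Cosine similarity between two sentences as binary bags of words."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def top_sentence(document):
    """Split into sentences, build a similarity matrix, score each sentence by
    its total similarity to the others, and return the top-ranked one."""
    sentences = [s.strip() for s in re.split(r"[.!?]", document) if s.strip()]
    tokens = [tokenize(s) for s in sentences]
    scores = [sum(cosine_similarity(tokens[i], tokens[j])
                  for j in range(len(sentences)) if j != i)
              for i in range(len(sentences))]
    return sentences[scores.index(max(scores))]

doc = ("A man is riding a horse. A man rides a horse on the beach. "
       "A dog runs in the park.")
print(top_sentence(doc))  # → A man is riding a horse
```

The off-topic sentence about the dog scores lowest, so a sentence about the dominant topic is selected as the title.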
The main goal of our experiments is to evaluate the utility of the proposed architecture as a proof of concept. To implement our idea, we first need to obtain a story, as a document, from the image/video captioning model. For a given video, we explore the frames and feed the selected key-frames to the system to get descriptions. For this part, captions were generated by the method of Luo et al., whose encoder was trained on the COCO dataset; in effect, we utilized an image captioning system here. Some of the videos were selected from the YouTube-8M dataset, composed of almost 8 million videos totaling 500K hours of video, and some from the COCO dataset. The captioning method was trained and evaluated on the COCO dataset, which includes 113,287 images for training, 5,000 images for validation, and another 5,000 held out for testing; each image is associated with five human captions.
For the image encoder in the retrieval and FC captioning model, ResNet-101 is used. For each image, the global average pooling of the final convolutional layer output is used, resulting in a vector of dimension 2048. The spatial features are extracted from the output of a Faster R-CNN with ResNet-101, trained with the object and attribute annotations from Visual Genome. Both the FC features and the spatial features are pre-extracted, and no fine-tuning is applied to the image encoders. For the captioning models, the dimensions of the LSTM hidden state, image feature embedding, and word embedding are all set to 512. The retrieval model uses a GRU-RNN to encode text; the word embedding has 300 dimensions, and the GRU hidden state size and joint embedding size are 1024. The captions generated with this model describe valuable information about the frames; however, richer and more diverse sources of training signal may further improve the training of caption generators.
The text summarization method used in the first experiment is extractive, single-document summarization. First, we read the generated document from the previous process. Then, we build a similarity matrix across the sentences, rank the sentences according to the matrix, and finally sort the ranks and pick the top sentence. Figure 6 shows some of the experiments that were performed. The videos are selected from the YouTube-8M dataset and some from the COCO dataset. The reader can find all the guidance and code for replicating each part of the process at https://github.com/sohamirian/VideoTitle.
The purpose of this research is to propose an architecture that generates an appropriate title and a concise abstract for a video by utilizing image/video captioning systems and text summarization methods, to help in domains such as search engines, surveillance cameras, and the cinema industry. We utilized deep learning captioning methods to generate a document describing a video, then used extractive text summarization to assign a title and abstractive text summarization to create a concise abstract for the video. We explained the components of the proposed framework and conducted experiments using videos from different datasets. The results show that the concept is viable; however, they could be improved further by applying more advanced image/video captioning and text summarization frameworks.
In our future work, we plan to explore more recent image/video captioning techniques to generate a more natural story describing the video clips, so that the text summarization system can generate a better title with extractive algorithms and a better abstract with abstractive algorithms.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
-  Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
-  Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., Kochut, K.: Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268 (2017)
-  Amirian, S., Rasheed, K., Taha, T.R., Arabnia, H.R.: Image captioning with generative adversarial network. In: 2019 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 272–275 (2019)
-  Amirian, S., Rasheed, K., Taha, T.R., Arabnia, H.R.: A short review on image caption generation with deep learning. In: The 23rd International Conference on Image Processing, Computer Vision and Pattern Recognition (IPCV’19), World Congress in Computer Science, Computer Engineering and Applied Computing (CSCE’19), pp. 10–18. IEEE (2019)
-  Amirian, S., Wang, Z., Taha, T.R., Arabnia, H.R.: Dissection of deep learning with applications in image recognition. In: 2018 International Conference on Computational Science and Computational Intelligence, “Artificial Intelligence” (CSCI-ISAI), pp. 1132–1138. IEEE (2018)
-  Aneja, J., Deshpande, A., Schwing, A.G.: Convolutional image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570 (2018)
-  Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
-  Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
-  Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
-  Ding, S., Qu, S., Xi, Y., Wan, S.: A long video caption generation algorithm for big video data retrieval. Future Generation Computer Systems 93, 583–595 (2019)
-  Dubey, P.: Text Summarization. https://github.com/edubey/text-summarizer. Accessed 2020-04-04
-  Goutham, R.: Simple abstractive text summarization with pretrained T5- Text-To-Text Transfer Transformer. https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426. Accessed 2020-06-06
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
-  Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
-  Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715 (2017)
-  Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision, pp. 740–755. Springer (2014)
-  Luo, R., Price, B., Cohen, S., Shakhnarovich, G.: Discriminability objective for training descriptive captions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6964–6974 (2018)
-  Paulus, R., Xiong, C., Socher, R.: A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304 (2017)
-  Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
-  Roul, R.K., Mehrotra, S., Pungaliya, Y., Sahoo, J.K.: A new automatic multi-document text summarization using topic modeling. In: International conference on distributed computing and internet technology, pp. 212–221. Springer (2019)
-  Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. In: European Conference on Computer Vision, pp. 510–526. Springer (2016)
-  Soans, N., Asali, E., Hong, Y., Doshi, P.: Sa-net: Robust state-action recognition for learning from observations. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2153–2159 (2020)
-  Sung Park, J., Rohrbach, M., Darrell, T., Rohrbach, A.: Adversarial inference for multi-sentence video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0 (2019)
-  Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575 (2015)
-  Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
-  Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp. 2048–2057 (2015)