Large Scale Audio-Visual Video Analytics Platform for Forensic Investigations of Terroristic Attacks

11/28/2018 · Alexander Schindler, et al. · AIT Austrian Institute of Technology GmbH · LIquA

The forensic investigation of a terrorist attack poses a huge challenge to the investigative authorities, as several thousand hours of video footage may need to be screened. To assist law enforcement agencies (LEAs) in identifying suspects and securing evidence, we present a platform which fuses information from surveillance cameras and video uploads from eyewitnesses. The platform integrates analytical modules for different input modalities on a scalable architecture. Videos are analyzed according to their acoustic and visual content. Specifically, Audio Event Detection is applied to index the content according to attack-specific acoustic concepts. Audio similarity search is utilized to identify similar video sequences recorded from different perspectives. Visual object detection and tracking are used to index the content according to relevant visual concepts. The heterogeneous results of the analytical modules are fused into a distributed index of visual and acoustic concepts to facilitate a rapid start of investigations, the following of leads, and the investigation of witness reports.


1 Introduction

The presented platform is a result of the project Flexible, semi-automatic Analysis System for the Evaluation of Mass Video Data (FLORIDA) and is further developed in the project VICTORIA. The aim of these projects is to facilitate the work of investigators after a terrorist attack. In such events, video data is a major resource for spotting suspects and for following hints provided by civilian witnesses. From past attacks it is known that confiscated and publicly provided video content can sum up to thousands of hours (e.g. more than 5,000 hours in the case of the Boston Marathon bombing). Being able to promptly analyze mass video data with regard to its content is increasingly important for complex investigative procedures, especially those dealing with crime scenes. Currently, this data is analyzed manually, which requires hundreds or thousands of hours of investigative work. As a result, extracting first clues from videos after an attack takes a long time. Additionally, law enforcement agencies (LEAs) may not be able to process all the videos, leaving important evidence and clues unnoticed. This effort increases further when evidence videos from civilian witnesses are uploaded multiple times. The prompt analysis of video data, however, is fundamental – especially in the event of terrorist attacks – to prevent immediate, subsequent attacks. The goal of this platform is to provide legally compliant tools for LEAs that increase their effectiveness in analyzing mass video data and speed up investigative work. These tools include modules for acoustic and visual analysis of the video content, where especially the audio analysis tools provide a fast entry point to an investigation, because most terrorist attacks emit characteristic sound events. An investigator can start viewing videos at such events and then progress forward or backward to identify suspects and evidence.

The remainder of this work is structured as follows: Section 2 provides an overview of related work, Section 3 details the audio analysis, Section 4 the video analysis module and Section 5 the scalable platform. Section 6 summarizes the accompanying ethical research before we provide conclusions and an outlook to future work in Section 7.

2 Related Work


  • Audio Analysis: The audio analysis methods of the presented platform include modules for Audio Event Detection and Audio Similarity Retrieval. Audio Event Detection (AED) systems combine detection and classification of acoustic concepts. Developments in this field have recently been driven by the annual international evaluation campaign Detection and Classification of Acoustic Scenes and Events and its associated workshop [1]. Most recent AED approaches are based on deep convolutional neural networks [2] or recurrent convolutional neural networks [3], which can also be efficiently trained on weakly labeled data [4]. Such an approach is also taken for the AED module described in Section 3. Audio similarity has been extensively studied, especially in the research field of Music Information Retrieval (MIR) [5]. Similarity estimations are generally based on extracting audio features from the audio signal and calculating feature variations using a metric function [6]. A similar approach is followed in Section 3. Recent attempts to learn audio embeddings and similarity functions with neural networks have shown promising results [7].

  • Video Analysis: Video analytics software makes surveillance systems more efficient by reducing the workload on security and management authorities. Computer vision problems such as image classification, object detection and object tracking have traditionally been approached using hand-engineered features and machine learning algorithms, both of which were designed largely independently of each other [8]. In recent years, Deep Learning methods have been shown to outperform previous state-of-the-art machine learning techniques, with computer vision being one of the most prominent cases [9]. In contrast to previous approaches, deep neural networks (DNNs) automatically learn the features required for tasks such as object detection and tracking. Among the various network architectures developed and employed for computer vision tasks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) were found to be the most suitable for object classification, detection and tracking [8, 9].

  • Large Scale Workflow Management and Information Fusion: A comprehensive overview of early workflow management systems (WfMS) and business process management (BPM) systems is provided by [10]. BPM is generally concerned with describing and controlling the flow of inter-dependent tasks, whereas WfMSs aim at facilitating fully automated, data-flow oriented workflows which can be described through a Directed Acyclic Graph (DAG). State-of-the-art implementations of scientific DAG workflow systems, designed to run computationally intensive tasks on large, complex and heterogeneous data, are Taverna (https://taverna.incubator.apache.org/), Triana (http://www.trianacode.org/) and Kepler (https://kepler-project.org/). Popular systems are Pegasus (https://pegasus.isi.edu/), due to its direct relation to Grid and Cloud Computing, Kepler, due to its Hadoop integration, as well as Hadoop itself. Recent developments in this area are domain specific languages, such as the functional language Cuneiform [11], which offers deep integration with Apache Hadoop and high flexibility in connecting with external environments. Further frameworks (languages and execution engines) derived from a Big Data context are Pig Latin [12] (part of the Hadoop ecosystem) and Apache Spark [13] (an in-memory processing framework). They are widely used in scientific and commercial contexts to create workflows for processing large data sets, but they are general purpose data analysis frameworks rather than systems specifically built to model workflows. Table 1 provides an overview of selected open source DAG frameworks, classified by high level requirements that are of major importance specifically for audio-visual content management and archiving [14]. The fusion of heterogeneous sensor data and multi-modal analytical results is still underrepresented in the literature. A system combining results from various visual-analytical components for combined visualization is presented in [15].

|                                 | Airflow       | Mistral  | Score        | Spiff             | Oozie  | Pinball             | Azkaban                         | Luigi               |
| Workflow description language   | Python, Jinja | YAML DSL | YAML         | XML, JSON, Python | XML    | Python              | Built-in job types, custom jobs | Python              |
| Flow control and conditionals   | NO            | YES      | YES          | YES               | NO     | Minimum             | NO                              | NO                  |
| Distributed task execution      | YES           | YES      | YES          | NO                | NO     | NO                  | NO                              | NO                  |
| Reliability and fault tolerance | YES           | YES      | YES          | NO                | YES    | YES                 | YES                             | YES                 |
| Hadoop integration              | NO            | NO       | NO           | NO                | YES    | YES                 | YES                             | YES                 |
| Extensibility and integration   | Utilities     | Python   | Python, Java | NO                | NO     | Pluggable templates | Plugins                         | CLI integration     |
| Planning and scheduling         | YES           | YES      | NO           | NO                | YES    | YES                 | YES                             | NO                  |
| Monitoring and visualization    | Web-UI        | NO       | NO           | NO                | Web-UI | Web-UI              | Web-UI                          | WF graph visualizer |

Table 1: Overview of open source DAG work-flow systems. Project pages: Airflow (https://github.com/airbnb/airflow), Mistral (https://wiki.openstack.org/wiki/Mistral), Score (https://github.com/CloudSlang/score), Spiff (https://github.com/knipknap/SpiffWorkflow/wiki), Oozie (http://oozie.apache.org), Pinball (https://github.com/pinterest/pinball), Azkaban (https://azkaban.github.io), Luigi (https://github.com/spotify/luigi).

3 Audio Analysis

Audio analysis is one of the key components of this platform. Due to its destructive nature, a terrorist attack usually emits one or more loud acoustic events, which are captured by microphones regardless of the direction from which the sound originates. Besides this wider perceptive field of acoustic information, many relevant events are non-visual or happen too fast to be captured by standard cameras (e.g. alarms, screams, gunshots). Thus, we apply audio analysis to index the video content according to audible events and to provide an entry point for the investigations.

Figure 1: Audio Event Detection example result (bombing at the Boston Marathon 2013). Top chart: Log-scaled Mel-spectrogram of the audio signal. Middle chart: Probabilities for different acoustic events. Bottom chart: Explosion of the bomb on the left side, arrival of emergency vehicles from the center to the right of the chart.


  • Audio Event Detection: The Audio Event Detection (AED) and recognition module is intended to be one of the primary entry points for investigations. Reports by civilian witnesses often refer to acoustic events (e.g. 'there was a loud noise and then something happened'). By indexing loud noises such as explosions, investigators can immediately pre-select videos in which explosions are detected. This can be extended to the type of weapon used in the attack, such as gunshots emitted by firearms or horns by trucks. The developed audio event detection and recognition method is based on deep neural networks. More specifically, the approach is a combination of the models we have developed and successfully evaluated in the Detection and Classification of Acoustic Scenes and Events (DCASE) [16] international evaluation campaign [2, 17, 18] and the approach presented in [19]. The applied model uses recurrent convolutional neural networks with an attention layer. In a first step, the audio signal is extracted from the video containers, decoded and re-sampled to 44,100 Hz single-channel audio. 437,588 samples (9.92 seconds) are used as input, which are transformed to log-scaled Mel-spectrograms using 80 Mel bands and a Short-Term Fourier Transform (STFT) window size of 2048 samples with a hop length of 1024 samples. This preprocessing is performed directly on the GPU using the Kapre signal processing layer [20]. The normalized, decibel-transformed input is processed by a rectified linear convolution layer with 240 filter kernels of shape 30x1. Using global average pooling on the feature maps, audio embeddings with 240 dimensions are learned. This transformation is applied sequentially along the temporal axis of the Mel-spectrogram, resulting in 428 audio embeddings (one for each STFT window). This 428x240 embedding space is used as input for a stack of three bi-directional Gated Recurrent Units (GRU) [21], followed by a rectified linear fully connected layer as well as a sigmoid fully connected layer with the number of units corresponding to the number of classes to be predicted. A sigmoid-scaled attention layer is further applied to each input frame; the frame-wise predictions are weighted by these attention values before being aggregated into the final prediction of the model. The final output of the model is a probability for the presence of each of nine predefined sound events, including Gunshot, Explosion, Speech, Emergency vehicle and Fire alarm (see Figure 2a). The model was trained on a preprocessed subset of the Audioset dataset [22]. Preprocessing included flattening of ontological hierarchies, resolving semantic overlaps, removing out-of-context classes (e.g. Music), re-grouping of classes and a final selection of task-relevant classes. A schematic code sketch of this architecture is provided after this list.

  • Audio Similarity Search: Indexing videos according to predefined categories provides a fast way to start an investigation, but it is limited by the type and number of classes defined; undefined events such as a passing train cannot be detected. To overcome this obstacle and to facilitate the search for arbitrary acoustic patterns, an acoustic similarity function is added to search for videos with similar audio content. The approach to estimating audio similarity is based on [23], where audio features are extracted, including Statistical Spectrum Descriptors and Rhythm Patterns [24], and distances are calculated between all extracted features, using late fusion to merge the results. For this system, these features are extracted for every 6 seconds of audio content of every video file and the distances are calculated between all of these features, facilitating a sub-segment similarity search. Further differences to [23] include omitting normalization by grouping features by their unit, as well as using the correlation distance for the Rhythm Patterns feature set, which showed better performance in preceding experiments. The audio similarity search serves several goals. First, if a suspect cannot be identified in a certain video, this function can be applied to identify video segments with similar acoustic signatures, such as an emergency vehicle passing by; any other sequence of sounds could be significant as well. Further, the recorded audio signal can be used for instant localization: similar sound patterns have been recorded in close proximity to the emitting sources, and thus the result of a similarity search provides video results for a referred location (see Figure 2a-c). A sketch of the segment-wise similarity search is shown after Figure 2.
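The following is a minimal Keras/TensorFlow sketch of the AED architecture described above, assuming a precomputed log-Mel input of 428 frames x 80 bands (the Kapre on-GPU preprocessing is omitted), a frame-wise 240-filter convolution with kernels of length 30 followed by global average pooling, three bi-directional GRUs and an attention-weighted sigmoid output over the nine event classes. The GRU and dense layer sizes and the exact attention wiring are illustrative assumptions, not the platform's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FRAMES, N_MELS, N_CLASSES = 428, 80, 9

# Precomputed, normalized log-Mel spectrogram (Kapre preprocessing omitted here).
spec = layers.Input(shape=(N_FRAMES, N_MELS, 1), name="log_mel")

# Frame-wise convolution along the frequency axis (240 kernels of length 30),
# then global average pooling per frame -> one 240-dimensional embedding per frame.
x = layers.TimeDistributed(layers.Conv1D(240, 30, activation="relu"))(spec)
x = layers.TimeDistributed(layers.GlobalAveragePooling1D())(x)          # (428, 240)

# Stack of three bi-directional GRUs over the temporal axis (unit count assumed).
for _ in range(3):
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)

x = layers.TimeDistributed(layers.Dense(128, activation="relu"))(x)

# Frame-wise class probabilities and sigmoid attention weights per frame and class.
frame_prob = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="sigmoid"))(x)
attention = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="sigmoid"))(x)

# Attention-weighted aggregation over time yields clip-level event probabilities.
weighted = layers.Multiply()([frame_prob, attention])
clip_prob = layers.Lambda(
    lambda t: tf.reduce_sum(t[0], axis=1) / (tf.reduce_sum(t[1], axis=1) + 1e-7)
)([weighted, attention])

model = Model(spec, clip_prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```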

Figure 2: Audio-visual analysis results example: a) reference video with detected audio events Gunshots (green), visual objects Person and Car (red bounding boxes) and the segment selected for similarity search (orange); b) video containing the most similar audio sequence (orange); c) second most similar sounding video segment (orange); d) further relevant videos ranked by audio similarity.
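As an illustration of the sub-segment similarity search with late fusion, the following is a minimal sketch. It assumes one feature matrix per feature set over all 6-second segments (random arrays stand in for Statistical Spectrum Descriptors and Rhythm Patterns, with example dimensionalities) and a simple late-fusion rule that min-max scales the per-feature-set distances before summing them; the actual fusion used in [23] and in the platform may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def late_fusion_similarity(features_by_set, query_idx, metrics):
    """Rank all 6-second segments by similarity to the query segment.

    features_by_set: dict mapping feature-set name -> (n_segments, dim) array.
    metrics:         dict mapping feature-set name -> scipy distance metric.
    """
    fused = np.zeros(next(iter(features_by_set.values())).shape[0])
    for name, feats in features_by_set.items():
        # Distance of the query segment to all segments for this feature set.
        d = cdist(feats[query_idx:query_idx + 1], feats, metric=metrics[name])[0]
        # Scale each feature set's distances to [0, 1] before late fusion,
        # so that no single feature set dominates the fused result (assumed rule).
        d = (d - d.min()) / (d.max() - d.min() + 1e-12)
        fused += d
    return np.argsort(fused)   # segment indices, most similar first

# Usage sketch with random data standing in for extracted audio features.
rng = np.random.default_rng(0)
features = {"ssd": rng.normal(size=(1000, 168)),            # example dimensionality
            "rhythm_patterns": rng.normal(size=(1000, 1440))}
ranking = late_fusion_similarity(features, query_idx=42,
                                 metrics={"ssd": "euclidean",
                                          "rhythm_patterns": "correlation"})
print(ranking[:10])   # ten most similar 6-second segments (including the query itself)
```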

4 Video Analysis


  • Generic Object Detection and Classification: Object detection and classification identifies semantic concepts in video frames, including segmentation of the identified regions with bounding boxes and their labeling with the classified category, such as car or person. This enables fast search queries that help identify specific scene content and therefore reduces the workload on law enforcement authorities. In recent years, Deep Neural Networks (DNNs) have shown outstanding performance on image detection and classification tasks, replacing disparate processing steps such as feature extraction by learning semantic representations and classifiers directly from input data; they are capable of learning more complex models and more powerful object representations than traditional approaches, without the need to hand-design features [9, 25]. The YOLO (You Only Look Once) detector [25] is one of the most popular CNN-based detection algorithms; it is trained on over 9000 different object categories and is capable of real-time performance [25]. Evaluating different DNN-based object detectors, we concluded that YOLO provides the best trade-off between accuracy and runtime behavior. The object detection module developed for the scalable forensic platform is based on the YOLO detector. It has been optimized to fit into the distributed environment and to store the results in the distributed database index. Figure 2 provides example outputs of this module.

  • Multi-class Multi-target Tracking: Visual tracking is a challenging task in computer vision due to target deformations, illumination variations, scale changes, fast and abrupt motion, partial occlusions, motion blur and background clutter [26]. The task of multi-target tracking consists of simultaneously detecting multiple targets at each time frame and matching their identities across frames, yielding a set of target trajectories over time. Given a new frame, the tracker associates the already tracked targets with the newly detected objects (the 'tracking-by-detection' paradigm). Multi-target tracking is more challenging than the single-target case, as interactions between targets and mutual occlusions might cause identity switches between similar targets. There has been only little work on deep learning based multi-target tracking, presumably due to the following difficulties: First, deep models require huge amounts of training data, which is not yet available in the case of multi-target tracking. Second, both the data and the desired solution can be quite variable. One is faced with both discrete (target labels) and continuous (position and scale) variables, unknown sizes of input and output, and variable lengths of video sequences [27]. Finally, we note that no neural network based trackers have yet been published which handle the general multi-class multi-target case.

    DNN based multi-target trackers are trained either on appearance features [26, 28, 27] or on some combination of appearance, motion and interaction features [29]. In [28], the appearance-based association between targets and new objects is learned by CNNs based on single frames. More robustly, appearance features are first learned on single frames using CNNs, and then the long-term temporal dependencies in the sequence of observations are learned by RNNs, more precisely by Long Short-Term Memory (LSTM) networks [26, 29]. The jointly trained neural networks then compute the association probability for each tracked target and newly detected object. Finally, an optimization algorithm is used to find the optimal matching between targets and new objects [26, 29]. The developed system is an approach to a real-time multi-class multi-target tracking method, trained and optimized on the specific object categories needed in forensic crime-scene and post-attack investigations. Currently we employ an appearance-based tracker as in [28]. Work in progress aims to add further features such as target motion and mutual interaction [29], as well as learning temporal dependencies as in [26, 29]. Due to the various difficulties mentioned earlier in this section, we currently do not attempt to build and train one network for multi-class multi-target tracking. Rather, for each class, a multi-target tracker as in [28] is trained separately. During runtime, one tracker instance per class is activated, with all trackers running in parallel (a code sketch of this per-class dispatch follows this list). Figure 3 shows exemplary detection/tracking results for the person and car classes in various typical criminal/terror scenes.

  • Integration within the Connected Vision framework: The generic object detection and multi-class multi-target tracking methods are integrated in a novel framework developed by AIT, denoted Connected Vision [30], which provides a modular, service-oriented and scalable approach that allows computer vision tasks to be processed in a distributed manner. The objective of Connected Vision is to create a video computation toolbox for the rapid development of computer vision applications. To solve a complex computer vision task, two or more modules are combined into a module chain. In our case, the modules are (i) video import, (ii) the generic object detector, and (iii) multiple instances of a multi-target tracker, one module for each class of objects. Each Connected Vision module is an autonomous web service that communicates via a Representational State Transfer (REST) interface, collects data from multiple sources (e.g. real-world physical sensors or other modules' outputs), processes the data according to its configuration, stores the results for later retrieval and provides them to multiple consumers. The communication protocol is designed to support live data (e.g. from a network camera) as well as archived data (e.g. a video file).
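As a rough illustration of the tracking-by-detection dispatch with one tracker instance per class, the following Python sketch associates per-frame detections (as would be produced by the YOLO-based module) with tracks using a simple greedy nearest-center rule. The greedy tracker stands in for the appearance-based association of [28]; it is not the platform's actual tracker implementation, and the input detections are fabricated for the usage example.

```python
from collections import defaultdict
from itertools import count
import math

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

class GreedyTracker:
    """Minimal single-class multi-target tracker (tracking-by-detection):
    detections are greedily matched to the nearest existing track center.
    This is a placeholder for the appearance-based association of [28]."""
    def __init__(self, max_dist=75.0):
        self.tracks = {}              # track_id -> last bounding box
        self.ids = count()
        self.max_dist = max_dist

    def update(self, boxes):
        assigned = {}
        unmatched = dict(self.tracks)
        for box in boxes:
            cx, cy = box_center(box)
            best_id, best_d = None, self.max_dist
            for tid, tbox in unmatched.items():
                tx, ty = box_center(tbox)
                d = math.hypot(cx - tx, cy - ty)
                if d < best_d:
                    best_id, best_d = tid, d
            if best_id is None:
                best_id = next(self.ids)          # start a new track
            else:
                unmatched.pop(best_id)
            assigned[best_id] = box
        self.tracks = assigned
        return assigned                            # track_id -> bounding box

def track_detections(per_frame_detections, classes=("person", "car")):
    """per_frame_detections: list (per frame) of (class_label, (x1, y1, x2, y2)) tuples."""
    trackers = {cls: GreedyTracker() for cls in classes}   # one tracker instance per class
    timeline = []
    for frame_idx, detections in enumerate(per_frame_detections):
        per_class = defaultdict(list)
        for label, box in detections:
            if label in trackers:
                per_class[label].append(box)
        # Detections of each class are associated only with tracks of the same class.
        for label, tracker in trackers.items():
            timeline.append((frame_idx, label, tracker.update(per_class[label])))
    return timeline

# Usage sketch with fabricated detections.
frames = [[("person", (10, 10, 50, 100)), ("car", (200, 80, 320, 160))],
          [("person", (14, 12, 54, 102)), ("car", (205, 82, 325, 162))]]
for frame_idx, label, tracks in track_detections(frames):
    print(frame_idx, label, tracks)
```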

Figure 3: Exemplary visual detection/tracking results for the person and car classes in various typical criminal/terror scenes.

5 Scalable Analysis Platform

Figure 4: Data Flow Model of the FLORIDA Scalable Analytics Platform

Two of the main goals of the developed platform are that a) in the case of an attack scenario, it should be possible to probe the provided video material (mass data) as quickly as possible, and b) Law Enforcement Agency (LEA) investigators should gain better insight by screening the most essential material from different perspectives and by focusing on specific events, with the help of an integrated Scalable Analysis Platform (SAP). The FLORIDA SAP integrates the developed advanced analysis modules and performs the tasks of video data ingestion, data preparation and preprocessing, feature extraction and model fitting. The fitted analysis models are then applied to the preprocessed video content. This is implemented on an Apache Hadoop (http://hadoop.apache.org/) platform. The analysis results are stored in an Apache HBase (https://hbase.apache.org/) database with a GraphQL (https://graphql.org/) layer on top to provide dynamic access for the clients and the visualization of the calculated results. The hardware cluster consists of seven compute nodes and a name node, which acts as the orchestrator of the Cloudera Hadoop platform. The underlying commodity hardware consists of Dell R320 rack servers with a Xeon E5-2430 CPU at 2.2 GHz (6 cores / 12 threads) each, about 63 GB of available RAM and a storage capacity of 11 TB of disk space (HDD) per data node. Server administration, cluster configuration, update management as well as software distribution on the nodes are automated via Ansible (https://www.ansible.com/). Tasks for rolling out the ZooKeeper, Hive or HBase configuration, distributing the /etc/hosts files as well as installing the audio and video feature extraction tools and their software dependencies (such as Python scripts, Linux packages like ffmpeg and libraries like TensorFlow) are specified in YAML syntax.

ToMaR [31] is a generic MapReduce wrapper for third-party command-line and Java applications that were not originally designed for usage in an HDFS environment. It supports tools that read their input from local file pointers or stdin/stdout streams and uses control files to specify individual jobs. The wrapper integrates the applications in such a way that the files required for execution on the respective worker nodes (nodes 0..5) are copied locally from HDFS into the Hadoop cache; the application can thus be executed in parallel across the active nodes, and the results are then written back to HDFS. The individual components (ToMaR jobs, shell scripts for importing JSON data into and generating HBase tables via Hive, etc.) are combined into master workflows using the Apache Oozie Workflow Scheduler for Hadoop (http://oozie.apache.org/) and event triggers (see Figure 4). This requires the setup of ZooKeeper for handling failover and the orchestration of the components. In order to execute Hive and HBase in a distributed environment, Hadoop must be run in fully distributed mode, which requires the setup of a so-called metastore holding state information for Hive and Oozie. ZooKeeper is set up in the SAP as a ZooKeeper ensemble on node2, node5 and the master name node.
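To illustrate how a client application could access the fused result index through the GraphQL layer, the following Python sketch posts a query for detected audio events of one video. The endpoint URL, schema and field names (video, audioEvents, label, startTime, probability) are hypothetical assumptions for illustration only, not the platform's actual API.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "http://sap-cluster.example.org/graphql"   # hypothetical endpoint

# Hypothetical query: audio events above a probability threshold for one video.
QUERY = """
query AudioEvents($videoId: ID!, $minProb: Float!) {
  video(id: $videoId) {
    fileName
    audioEvents(minProbability: $minProb) {
      label
      startTime
      probability
    }
  }
}
"""

def fetch_audio_events(video_id, min_prob=0.8):
    payload = json.dumps({"query": QUERY,
                          "variables": {"videoId": video_id, "minProb": min_prob}}).encode()
    request = urllib.request.Request(GRAPHQL_ENDPOINT, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    # Example call; requires a running GraphQL service exposing the assumed schema.
    events = fetch_audio_events("video-0042", min_prob=0.9)
    print(json.dumps(events, indent=2))
```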

6 Accompanying Ethical Research

The FLORIDA project is part of KIRAS, an Austrian research promotion program managed by the Austrian Research Promotion Agency (FFG), the national funding agency for industrial research and development in Austria. In KIRAS projects, an integrative approach is mandatory, which is based not only on technological solutions but also on approaches from the social sciences and humanities. For this reason, accompanying ethical research is conducted in FLORIDA.


  • Ethics of security and ethics of technology: Against the background of fundamental ethical principles, the analyses in FLORIDA primarily focus on the ethics of security and the ethics of technology, but also include other ethical subdivisions, e.g. data, information, Internet and media ethics. In the context of security ethics, questions about the 'right' or 'wrong' use of a certain security technology are reflected upon, for example by asking about the concept of security used, by linking security actions back to the question of 'good living', and by pointing out alternative options for action that include comprehensive social and societal responsibility [32]. By taking the ethics of technology into account, the focus is directed to the ethical reflection of the conditions, purposes, means and consequences of technology and scientific-technical progress [33]. This leads to questions relating to fundamental ethical principles such as security or freedom, but also to the choice, responsibility and compatibility of the technology used: 'Is the chosen technology good?', 'Is the chosen technology safe?' or 'What are the consequences of using this technology?'

  • Ethical criteria and questions for security systems: Based on these theoretical reflections and on knowledge from past civil security research projects such as THEBEN (Terahertz Detection Systems: Ethical Monitoring, Evaluation and Determination of Standards; a research project, 10-2007 to 12-2010, within the framework of the program for civil security research in Germany, https://www.uni-tuebingen.de/en/11265), MuViT (Mustererkennung und Video-Tracking; a research project, 10-2007 to 12-2010, within the framework of the program for civil security research in Germany, http://www.uni-tuebingen.de/de/49647) or PARIS (PrivAcy pReserving Infrastructure for Surveillance; a research project, 01-2013 to 02-2016, with partners from France, Belgium and Austria, funded by the 7th Framework Program for Research and Technological Development, https://www.paris-project.org/), ethical criteria for security systems were defined and an ethical catalog with around 80 questions was compiled, which can be used in the funding, development and usage of security systems. During FLORIDA's accompanying research, the ethical criteria served as a framework for evaluating the planned use cases and the various development stages of the technological prototype. Some of the questions were the basis for qualitative interviews with potential users at Austrian police organizations and intelligence agencies on the one hand and the technical developers of FLORIDA on the other. Here, for example, questions were asked about the ethical pre-understanding, ethical risks in development and ethical responsibility, but also about specific technological challenges such as the prevention of scenario creep, i.e. the step-by-step extension of a system or parts of a system to new scenarios. For FLORIDA, this would mean preventing individual functions that were developed only for a post-attack scenario from being expanded step by step later and finally being used for an observation scenario. On a technical level, this seems difficult or even impossible to prevent in general. Strategies that can be applied here focus rather on how organizations or companies handle security-related systems, for example adequate authorization systems for the use of functions or the implementation of a security certification system that contains clear guidelines for possible functional extensions. FLORIDA, however, also includes some technological details that make a scenario creep at least more difficult. To give an example: in the audio analysis, only predefined audio events that can occur during attacks (e.g. shots, sirens or detonations) are classified, and the system is trained on them. This function of FLORIDA is therefore useful for a specific scenario like a terror attack, while it remains largely useless in other scenarios like an observation. Moreover, subsequent adaptations are difficult to carry out because complex and time-consuming learning processes are what make the functionality of the audio analysis possible here.

7 Conclusions and Future Work

The described platform integrates audio-visual analysis modules on a scalable architecture and enables a fast start and rapid progress of investigations, with the audio event detection module serving as an entry point. From these events, investigators can navigate through the video content by either using the visual tracking modules on identified persons or objects, or using the audio similarity search to find related video content, increasing the chances of identification.

As part of future work, we intend to focus on audio synchronization, because the time information provided with video metadata cannot be considered accurate if the capturing equipment is not synchronized with a unified time server. Personal cameras, for example, commonly reset their internal clock to a hard-coded timestamp after a complete battery drain. To align video content with unreliable time information, audio synchronization is applied. Audio features sensitive to peaking audio events are used to extract patterns which are significant for a recorded acoustic scene. These patterns are then matched by minimizing the difference of their feature values over sliding windows. To find clusters of mutually synchronous videos, audio similarity retrieval is combined with audio synchronization. Finally, mutual offsets are calculated between the videos of a cluster, which are used to schedule synchronous playback of the videos.
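As a rough illustration of the sliding-window matching described above, the following sketch estimates the mutual offset between two recordings by minimizing the mean absolute difference of their onset-strength envelopes over candidate lags. Using librosa's onset strength as the peak-sensitive feature and the mean absolute difference as the matching cost are assumptions for illustration, not necessarily the features and cost used in the platform.

```python
import numpy as np
import librosa

def onset_envelope(path, sr=22050, hop_length=512):
    """Peak-sensitive feature sequence: normalized onset strength per frame."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    env = env / (env.max() + 1e-9)        # normalize so recordings are comparable
    return env, sr / hop_length           # envelope and its frame rate (frames/s)

def estimate_offset(env_a, env_b, frame_rate, max_offset_s=120.0):
    """Slide env_b against env_a and return the offset (in seconds) that minimizes
    the mean absolute difference of the overlapping feature values."""
    max_lag = int(max_offset_s * frame_rate)
    best_lag, best_cost = 0, np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = env_a[lag:], env_b
        else:
            a, b = env_a, env_b[-lag:]
        n = min(len(a), len(b))
        if n < int(5 * frame_rate):       # require at least 5 s of overlap
            continue
        cost = np.mean(np.abs(a[:n] - b[:n]))
        if cost < best_cost:
            best_lag, best_cost = lag, cost
    return best_lag / frame_rate          # positive: recording B starts later than A

# Usage sketch (file names are placeholders):
# env_a, fps = onset_envelope("witness_video_a.mp4")
# env_b, _   = onset_envelope("witness_video_b.mp4")
# print(estimate_offset(env_a, env_b, fps))
```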

Acknowledgements: This article has been made possible partly by funding received from the European Union's Horizon 2020 research and innovation program in the context of the VICTORIA project under grant agreement no. SEC-740754 and by the project FLORIDA, FFG Kooperative F&E Projekte 2015, project no. 854768.

References

  • [1] Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange, and Mark D Plumbley. Detection and classification of acoustic scenes and events: An IEEE AASP challenge. In Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on, pages 1–4. IEEE, 2013.
  • [2] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 60–64, September 2016.
  • [3] Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, and Tuomas Virtanen. Sound event detection in multichannel audio using spatial and harmonic features. Technical report, DCASE2016 Challenge, September 2016.
  • [4] Ivan Kukanov, Ville Hautamäki, and Kong Aik Lee. Recurrent neural network and maximal figure of merit for acoustic event detection. Technical report, DCASE2017 Challenge, 2017.
  • [5] Peter Knees and Markus Schedl. Music similarity and retrieval: an introduction to audio-and web-based strategies, volume 36. Springer, 2016.
  • [6] Elias Pampalk, Arthur Flexer, Gerhard Widmer, et al. Improvements of audio-based music similarity and genre classification. In ISMIR, volume 5, pages 634–637. London, UK, 2005.
  • [7] Jaehun Kim, Julián Urbano, Cynthia Liem, and Alan Hanjalic. One deep music representation to rule them all?: A comparative analysis of different representation learning strategies. arXiv preprint arXiv:1802.04051, 2018.
  • [8] Suraj Srinivas, Ravi Kiran Sarvadevabhatla, and Konda Reddy Mopuri. A taxonomy of deep convolutional neural networks for computer vision, 2016.
  • [9] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep Learning for Computer Vision: A Brief Review. In Hindawi Computational Intelligence and Neuroscience, volume 2018, pages 1–13. Article ID 7068349.
  • [10] L. D. Xu. Enterprise systems: State-of-the-art and future trends. IEEE Transactions on Industrial Informatics, 7(4):630–640, Nov 2011.
  • [11] Jörgen Brandt, Marc Bux, and Ulf Leser. Cuneiform: a functional language for large scale scientific data analysis. In EDBT/ICDT Workshops, 2015.
  • [12] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proc of int. conf. on Management of data (SIGMOD ’08), pages 1099–1110. ACM, 2008.
  • [13] Matei Zaharia, N. M. Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley, May 2010.
  • [14] Gayathri Nadarajan, Y.-H. Chen-Burger, and James Malone. Semantic-based workflow composition for video processing in the grid. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI’06), pages 161–165, 2006.
  • [15] Ching Tang Fan, Yuan Kai Wang, and Cai Ren Huang. Heterogeneous information fusion and visualization for a large-scale intelligent video surveillance system. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(4):593–604, 2017.
  • [16] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th Europ. Signal Proc Conf (EUSIPCO), 2016.
  • [17] Alexander Schindler, Thomas Lidy, and Andreas Rauber. Comparing shallow versus deep neural network architectures for automatic music genre classification. In 9th Forum Media Technology (FMT2016), volume 1734, pages 17–21. CEUR, 2016.
  • [18] Alexander Schindler, Thomas Lidy, and Andreas Rauber. Multi-temporal resolution convolutional neural networks for acoustic scene classification. In Detect. and Classific. of Acoustic Scenes and Events Workshop (DCASE2017), Munich, Germany, 2017.
  • [19] Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, and Mark D Plumbley. Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging. arXiv preprint arXiv:1703.06052, 2017.
  • [20] Keunwoo Choi, Deokjin Joo, and Juho Kim. Kapre: On-GPU audio preprocessing layers for a quick implementation of deep neural network models with Keras. In Machine Learning for Music Discovery Workshop at the 34th Int. Conf. on Machine Learning. ICML, 2017.
  • [21] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [22] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 776–780. IEEE, 2017.
  • [23] Alexander Schindler, Sergiu Gordea, and Harry van Biessum. The europeana sounds music information retrieval pilot. In Euro-Mediterranean Conference, pages 109–117, 2016.
  • [24] Thomas Lidy, Andreas Rauber, Antonio Pertusa, and José Manuel Iñesta Quereda. Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In Proc. Int. Conf. Music Information Retrieval, 2007.
  • [25] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [26] Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. Spatially supervised recurrent convolutional neural networks for visual object tracking, 2016.
  • [27] Anton Milan, S. Hamid Rezatofighi, Anthony Dick, Ian Reid, and Konrad Schindler. Online multi-target tracking using recurrent neural networks, 2016.
  • [28] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric, 2017.
  • [29] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
  • [30] Martin Boyer and Stephan Veigl. A distributed system for secure, modular computer vision. In Future Security 2014, 9th Security Research Conference, Berlin, September 16-18, 2014, Proceedings, pages 696–699, Berlin, 2014.
  • [31] R. Schmidt, M. Rella, and S. Schlarb. ToMaR – a data generator for large volumes of content. In 14th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing, pages 937–942, 2014.
  • [32] Benjamin Rampp. Zum Konzept der Sicherheit. In Regina Ammicht Quinn, editor, Sicherheitsethik, pages 51 – 61. Springer Fachmedien, Wiesbaden, 2014.
  • [33] Armin Grunwald. Einleitung und Überblick. In Armin Grunwald, editor, Handbuch Technikethik, pages 1 – 11. J.B. Metzler, Stuttgart, 2013.