Repository for Reusing Artifacts of Artificial Neural Networks

Artificial Neural Networks (ANNs) have replaced conventional software systems in various domains such as machine translation, natural language processing, and image processing. So why do we need a repository for artificial neural networks? These systems are developed with labeled data, and there are strong dependencies between a network and the data used for training and testing it. Further challenges are data quality and reusability. We try to apply reuse concepts from classic software engineering not only to the model but also to data and code, which most other projects have largely neglected. The first question that comes to mind might be why we do not use GitHub, a well-known and widespread tool for reuse. The reason is that GitHub, although very good in its class, was not developed for machine learning applications and focuses on software reuse. In addition, GitHub does not allow code to be executed directly on the platform, which would be very convenient for collaborative work on a project.


1. Introduction

The application of Artificial Neural Networks (ANNs) is expanding dramatically (Basheer and Hajmeer, 2000). ANNs are considered as replacements for software systems in many domains such as image, language, and text processing. The fast growth of Machine Learning (ML) based systems in futuristic projects such as the industrial internet of things, self-driving cars, and medicine shows the importance of these techniques and their impact on the future of research and development. Unlike conventional software systems, which are built on extensive models, code, and test cases, ANN based systems are developed using labeled data. Although recent improvements in technologies such as GPU clusters and Big Data analytic methods make it feasible to develop reliable systems using ANNs, these methods still depend heavily on the data that are provided for training and testing such systems. Providing high-quality labeled data and tagged information that satisfy the requirements of developing ML based systems is challenging (Amershi et al., 2019; Sculley et al., 2015). We tackle this problem by utilizing the well-known and established concept of ‘reuse’ from the development of conventional software systems (Krueger, 1992; Clements and Northrop, 2001). Software reuse promises to reduce costs while improving time-to-market and increasing the quality of software products. Model based software development based on reuse is well studied; however, less attention has been paid to applying the concept of reuse in the development of ANN based systems. Reuse of artifacts of ANN based solutions is mostly limited to models, while data (test and training data) and code are considered less often (Ghofrani et al., 2019). On a prior submission of this paper we received a lot of feedback: suggestions for additional features, issues others saw with the idea, points that needed to be explained in more detail, and questions about what makes our idea different from that of others.
In this paper, we introduce an online tool for reusing the data, structure, and models that are generated and utilized during the development of ANN systems, and we address the feedback we have received. Furthermore, we extend the concept of sharing and reusing the artifacts of ANN based software systems by providing the advantages of reusability. With this concept it is possible to reduce how often a solution to the same problem has to be created from scratch, which often consumes time and resources. Users can share, create, and modify the artifacts formed in their projects. Additionally, they can reuse the projects of other users by copying and extending them in their own workspace. Our proposed online tool enables software engineering experts to have a closer look at various aspects of reusing ANNs. Furthermore, the functionalities of our tool enable ANN developers to reduce the effort needed to provide artifacts from their projects.

Section 2 presents our tool and explains its main functionalities based on four use cases. In Section 3, we review the current state of practice regarding reuse in the development of ANN based systems. Section 4 discusses the issues around our tool and the feedback we received. Section 5 concludes the paper and outlines possible future work.

2. Repository for Reusing Artifacts of ANN

In this section, we introduce the Repository for Reusing Artifacts of Artificial Neural Networks, RAN2 for short. The source code of the tool is publicly available on GitHub (https://github.com/ghofrani85/RAN2.git). Users can register and receive an identity that authorizes them to manage projects within the tool. After successful registration and login, users are redirected to their home page. On the home page, users can create a new project or view and edit their existing projects listed on the page (see Figure 1). Users can also add labels/tags to their projects so that other users can find them using the search functionality.

Figure 1. A view of user’s Dashboard in RAN2

Each project in RAN2 consists of four primary directories, which are added to the project by default at creation time. These folders correspond to the four main categories of artifacts needed for developing ANN based solutions (see Figure 2). The checkboxes under each folder allow users to select items for a customized package, and the Download button delivers the selected folders of the project as a zip file. We decided to keep the data for training and testing separate for two reasons. The first and strongest reason is comprehensibility: if another user can reproduce the model exactly as intended, they can better understand what the original creator wanted to do. Secondly, artificial data is often used for training and real data for testing the outcome, so a distinction between the two is needed.

Figure 2. A view of the four main folders under each project for storing data related to Test, Train, Model, and Code in RAN2. The Download button is on the right side. Selecting a checkbox includes the contents of the corresponding folder in the zip file generated by the Download button
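
The selective download described above can be illustrated with a minimal Python sketch. It is not the RAN2 implementation (whose backend is written in Java); the folder names, the in-memory project layout, and the functions `new_project` and `download_package` are assumptions made only for this illustration.

```python
import io
import zipfile

# Hypothetical folder names; the paper only states that every project
# holds Test, Train, Model, and Code artifacts by default.
DEFAULT_FOLDERS = ("test", "train", "model", "code")


def new_project():
    """Return an empty project: one {filename: bytes} mapping per default folder."""
    return {folder: {} for folder in DEFAULT_FOLDERS}


def download_package(project, selected):
    """Build an in-memory zip archive that contains only the selected folders."""
    unknown = set(selected) - set(project)
    if unknown:
        raise KeyError(f"unknown folders: {sorted(unknown)}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder in selected:
            for name, content in project[folder].items():
                archive.writestr(f"{folder}/{name}", content)
    return buffer.getvalue()


project = new_project()
project["train"]["apple_001.png"] = b"placeholder image bytes"
project["code"]["dnn.py"] = b"print('hello')\n"

# Equivalent of ticking the Train and Code checkboxes and pressing Download.
package = download_package(project, ["train", "code"])
```

Keeping training and test data in separate folders means that a user can request, for example, only the test set of a project in order to evaluate a model of their own.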

Furthermore, users can preview the content of these folders by clicking on them. The folders may contain sub-folders or artifacts. Figure 3 illustrates a view of the training data within a project. In this view, sub-folders can be created, deleted, or renamed. New artifacts can be added from the underlying repository of reusable assets in RAN2. This repository contains all assets that users have provided for their projects and is responsible for sharing and reusing them. Within each folder view, users can add new artifacts by uploading an asset to the repository and selecting the whole asset or a part of it, or by directly reusing an existing asset from the repository. An important feature is labeling each asset with tags, which help users find and reuse the assets. RAN2 supports users with embedded tools for modifying pictures, parts of video or audio files, and text files.

Figure 3. Example of the contents of the training data folder in one of the projects in RAN2. The list of assigned artifacts is located in the upper part of the page, while the sub-folders are listed at the bottom of the page. The tracking window helps the user to follow the latest changes made by users to the content of this folder
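
To make the role of tags concrete, the following Python sketch models a shared asset repository with a tag index. It is only an illustration under our own assumptions: the class names `Asset` and `AssetRepository`, the case-insensitive tag matching, and the idea that a project folder stores references to asset identifiers are hypothetical and not taken from the RAN2 code base.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    asset_id: int
    name: str
    tags: frozenset


class AssetRepository:
    """Shared pool of assets; folders reference assets instead of duplicating them."""

    def __init__(self):
        self._assets = {}
        self._by_tag = defaultdict(set)  # tag -> ids of assets carrying that tag

    def upload(self, name, tags):
        """Store a new asset and index it under each of its (lower-cased) tags."""
        asset = Asset(len(self._assets) + 1, name, frozenset(t.lower() for t in tags))
        self._assets[asset.asset_id] = asset
        for tag in asset.tags:
            self._by_tag[tag].add(asset.asset_id)
        return asset

    def find(self, tag):
        """Return all assets labeled with the given tag, ordered by id."""
        ids = self._by_tag.get(tag.lower(), set())
        return [self._assets[i] for i in sorted(ids)]


repo = AssetRepository()
repo.upload("apple_001.png", ["Apple", "fruit"])
repo.upload("orange_001.png", ["orange", "fruit"])

# Reuse existing assets in the training folder of another project.
train_folder = [asset.asset_id for asset in repo.find("apple")]
```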

Users can import all or parts of other projects into their own project and customize them for their needs. A copy button is available in every project that does not belong to the current user. After copying a project, users see a rate-up and a rate-down button on the original project, which allow them to give feedback about it (see Figure 4). The feedback collected from users is shown beside the project in the repository view of projects in RAN2 (see Figure 5).

Figure 4. The projects of other users are available for the current user to copy. If the user copies an existing project from another user, rate-up and rate-down buttons appear to let the user give feedback about the usefulness of the project. In this example, the user javad.ghofrani@gmail.com has already copied a project from the user with the identity test@test.com and can therefore rate this project up or down.
Figure 5. In the repository of all projects, the rating values for each project will be shown beside it
Figure 6. Component Diagram of RAN2
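
The copy-and-rate workflow described above can be sketched in a few lines of Python. This is a hedged illustration rather than the tool's actual logic: the `Project` class, the rule of one vote per user, and the rating computed as the sum of votes are assumptions introduced here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Project:
    owner: str
    name: str
    folders: dict = field(default_factory=dict)
    copied_by: set = field(default_factory=set)
    votes: dict = field(default_factory=dict)  # user -> +1 (rate-up) or -1 (rate-down)

    def copy_for(self, user):
        """Create an independent copy in the workspace of another user."""
        if user == self.owner:
            raise ValueError("the copy button is only offered on other users' projects")
        self.copied_by.add(user)
        return Project(owner=user, name=self.name, folders=copy.deepcopy(self.folders))

    def rate(self, user, up):
        """Record a vote; only users who copied the project may rate it."""
        if user not in self.copied_by:
            raise PermissionError("copy the project before rating it")
        self.votes[user] = 1 if up else -1  # assumption: one vote per user

    @property
    def rating(self):
        return sum(self.votes.values())


original = Project("alice@example.org", "fruit-classifier", {"train": ["apple_001.png"]})
mine = original.copy_for("bob@example.org")
mine.folders["train"].append("apple_002.png")  # extending the copy leaves the original untouched
original.rate("bob@example.org", up=True)
```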

2.1. Example Usage Scenarios

The following example scenarios illustrate the functionalities of RAN2. The backend of the tool is implemented in Java, the frontend uses Bootstrap, and all data is stored in a PostgreSQL database.

Example 1 (Figure 7): User A signs up and creates a new project. She adds a description of her network, which helps other users find out what the network is intended to do. She selects a network matrix and uploads her test and training data, which include apples, oranges, and pears. She then uploads a Python file into the code category. This file contains a DNN developed using Python and Keras.

Figure 7. Sequence diagram example 1

Example 2 (Figure 8): A user needs a network that can classify apple images in a stream of images that he receives from a camera. He has developed a DNN in C++ to perform this task, but he does not have enough input data (apple images) to train it. He creates a new project and wants to add some training data. In this step, he searches the existing projects of our repository using the keyword ‘Apple’ and finds the assets in the project created by user A. He selects the checkboxes of the assets and copies the apple images (images with the apple label) into the training data category of his new project.

Figure 8. Sequence diagram example 2

Example 3 (Figure 9): A user needs a project for the classification of apples, oranges, and carrots. She searches and finds another user's project that was created with the same objectives. However, she is not sure how good the quality and accuracy of the trained model and network in that project are. Therefore, before copying the project, she downloads it and performs some tests with images from her own use case. This way, she can see whether the accuracy of that DNN satisfies the needs of her project.

Figure 9. Sequence diagram example 3
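
The check in Example 3 amounts to measuring accuracy on one's own labeled samples before deciding to copy a project. The Python sketch below shows this under stated assumptions: `downloaded_model` is a hypothetical stand-in for the downloaded network, the feature `redness` is invented for the illustration, and the acceptance threshold of 0.9 is arbitrary.

```python
def accuracy(predict, labeled_samples):
    """Fraction of samples for which predict(sample) equals the expected label."""
    if not labeled_samples:
        raise ValueError("need at least one labeled sample")
    correct = sum(1 for sample, label in labeled_samples if predict(sample) == label)
    return correct / len(labeled_samples)


def worth_copying(predict, labeled_samples, required=0.9):
    """Decide whether the downloaded model is accurate enough for the own use case."""
    return accuracy(predict, labeled_samples) >= required


# Hypothetical stand-in for the downloaded model: it only knows apples and oranges.
def downloaded_model(image):
    return "apple" if image["redness"] > 0.5 else "orange"


own_samples = [
    ({"redness": 0.9}, "apple"),
    ({"redness": 0.2}, "orange"),
    ({"redness": 0.4}, "carrot"),  # a class the stand-in model cannot predict
    ({"redness": 0.8}, "apple"),
]

score = accuracy(downloaded_model, own_samples)
```

In this toy setting the model misses the carrot sample, so the user would either look for another project or extend this one with additional training data.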

Example 4 (Figure 10): A user needs a neural network that classifies oranges and apples in images. Using the search functionality of RAN2, he finds a similar project from another user. He downloads a copy of this project and extends it by training the network with an additional set of data. This way, he improves the quality of the network with less training effort and time. Although this approach, known as transfer learning, is common among ANN developers, finding a suitable network remains challenging.

Figure 10. Sequence diagram example 4

3. Related Work

This manuscript takes some of its inspiration from ProductLinRE (http://www.productlinre.com), introduced in our previous paper (Ghofrani and Fehlhaber, 2018). ProductLinRE is an online platform that enables cooperative work on artifacts of Requirements Engineering (RE) in the development of conventional software systems. Using ProductLinRE, users can share and reuse artifacts such as text, images, video, and audio files that are involved in RE processes to reduce the effort and cost of generating new ones. However, this functionality is tailored to conventional software development and does not consider ANN based solutions.

Transfer learning (Torrey and Shavlik, 2010) is an established method for reusing the network structure and trained models of ANNs. In this method, an existing pre-trained network is reused by extending its structure or retraining it with a smaller set of data to customize its functionality. This saves time and computational resources in comparison to ANNs that are developed and trained from scratch. Common datasets, such as ImageNet (Deng et al., 2009), are used to provide pre-trained networks. The artifacts usually reused in transfer learning are the network structure and the model (the weights of the trained network). Tools and frameworks such as the TensorFlow framework (https://www.tensorflow.org) and Keras, a deep learning library for Python, support this kind of reuse in the development of ANN based solutions. However, reusing the training samples is not supported by tools for transfer learning. Another disadvantage of transfer learning is the limited number of well-known datasets with pre-trained networks for visual data such as images.
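
The idea of reusing a pre-trained network and retraining only part of it can be shown with a small, dependency-free Python sketch. It is a toy stand-in and does not use the TensorFlow or Keras APIs: the "pre-trained" weights in `FROZEN_WEIGHTS`, the tiny dataset, and all function names are assumptions made for illustration. The hidden layer stays frozen, and only a new logistic output layer is fitted on a small dataset.

```python
import math

# Pretend these weights come from a pre-trained network; they are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]


def frozen_features(x):
    """Reused hidden layer: linear map followed by ReLU, kept fixed during training."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, epochs=200, lr=0.5):
    """Fit only the new output layer (logistic regression) on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            h = frozen_features(x)
            error = sigmoid(sum(w * hi for w, hi in zip(weights, h)) + bias) - y
            weights = [w - lr * error * hi for w, hi in zip(weights, h)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    h = frozen_features(x)
    return 1 if sigmoid(sum(w * hi for w, hi in zip(weights, h)) + bias) >= 0.5 else 0


# A small task-specific dataset is enough because the feature layer is reused.
small_dataset = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
head_weights, head_bias = train_head(small_dataset)
```

Only the two head weights and the bias are learned here. Note that the training samples themselves are not part of what is transferred, which is the gap pointed out above.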

Various online tools aim at facilitating the development of ANN based solutions without revealing their development effort. These tools are provided by open source foundations or commercial parties. We classify them by functionality into two main categories. The first category includes tools that provide visualization to support workflow definition and to specify the input and output data for training an ANN. Examples of this category are RapidMiner (https://rapidminer.com) and the Orange Data Mining Toolbox (https://orange.biolab.si). The second category provides computing power as processing services or data-storage services to handle the complexity of training ANNs. Microsoft Azure Machine Learning Studio (https://studio.azureml.net) and IBM SPSS Modeler (https://www.ibm.com/products/spss-modeler) belong to this second category. The work proposed by Pahl and Loipfinger (2018) follows the same strategy by providing encapsulated ML techniques as services, which makes it possible to reuse them in service-oriented architectures. The contribution of RAN2 compared to ProductLinRE lies in the domain of usage: ProductLinRE was developed to facilitate reusability in RE, while RAN2 covers the entire development process of ANN based solutions. Note that sharing the resources and artifacts in the RAN2 repository with the users of ProductLinRE, and vice versa, is a possibility.

To end this section, we compare RAN2 with other existing tools in this area. The tools RAN2 competes with, and on which we focus here, are Google Colab, But4Reuse, and OpenML.

Google Colab offers a web application that allows very direct collaboration on one model. The data can be shared selectively with other users. While sharing with others is possible, it cannot be extended to all users of the platform, so there is no way to build a large community that shares and modifies its work together. Google Colab therefore does not provide the functionality that RAN2 is trying to achieve.

The next tool is But4Reuse. It is one of the few tools that can run the code itself. This is valuable for truly collaborative work because everyone can access the project and see the results. In addition, it is very well documented, so users can get started easily. Its drawbacks are that it supports only a limited number of programming languages and that an application has to be installed; there is no web application for easier access.

OpenML is the most advanced tool in this comparison. It offers direct execution of flows and a well-developed tagging feature, and it has a big community of users who take part in the project. However, OpenML can be criticized for not addressing data privacy, because everything is open to anybody on the platform and there is no function to make a project private. Additionally, getting started is very hard and usage is complicated.

Comparing RAN2 with the other tools shows that, although it is still in an early stage of development and does not offer direct execution of machine learning processes, it offers a specific focus on artificial intelligence in a web application that is easy to start with and easy to use.

To conclude this section, no existing tool covers all problems in the field. Each tool addresses different issues and focuses on certain functionality, so each also faces its own set of problems. This is the opportunity for RAN2 to address these problems and to become a tool that solves them.

4. Discussion

RAN2 is still in the early stages of development. Some potential deficiencies and improper functionalities may degrade the quality of the tool and the user experience. However, the core features are functioning, and the tool allows reusability in the development of ANN based solutions in its current state. ANN based solutions often require numerous samples for training. An inevitable technical issue is therefore providing large storage capacity, but we do not address this concern in the paper. As training sets are valuable assets in the development of ANN based solutions, experts tend to protect and personalize them through copyrights. In RAN2, users can upload images, videos, or similar material from other sources, or even reuse material from other users. Therefore, copyright agreements have to be introduced to protect the rights of the users.

Following the prior submission of this paper, we categorized the feedback we received. Four main points can be identified: additional features, distinguishing features, explanatory issues of the paper, and issues with the concept in general. The explanatory issues are dealt with in the respective sections; in this section we focus on the other three points.

4.1. Distinguishing features

Figure 11. Distinguishing features

Figure 11 shows the features that set RAN2 apart from other tools in this area. This mainly concerns which tools exist and what advantages RAN2 provides compared to them, which was dealt with in the Related Work section of this paper. Furthermore, the question arose whether it is useful at all to reuse rather than to start from scratch. We can answer this with a strong yes, as we are confident that developers of machine learning projects can save a lot of time, especially when they are looking for fitting data to train their models with. The last question was whether it is better to reuse than to just clone. The beauty of RAN2 is that both are possible: developers can simply clone a project and modify it to their own needs, or they can reuse it partly or completely as required.

4.2. Additional features

Figure 12. Additional features

As mentioned before, users have certain needs and wishes for a project like RAN2. The most requested ones are collected in Figure 12. All of them are good ideas, and we should consider how to incorporate them into RAN2.

4.3. Issues in general

Figure 13. Issues

General issues others saw with the RAN2 project are displayed in Figure 13. These mainly include that the project is still in an early stage of development. In addition, the tool will only become useful once a critical number of users engage with it and participate in the reuse of models and data. This is an important point, and how it plays out will become apparent once users begin to use the platform. The last main point concerned the confidentiality of data. For this issue, we will build a functionality that lets users decide how their data can be shared and reused; they will even be able to make their project private. The mentioned issues therefore do not threaten RAN2, but adjustments need to be made in further development.

5. Conclusion and Future work

In this paper, we introduced a web based tool for reusing the artifacts generated in the development of ANN based solutions. In order to offer a clear overview of its functionalities, we provided a few usage examples that show how RAN2 can reduce costs (time) and effort in the development of ANN based solutions. RAN2 is still in the early stages of development, and future functionalities include (1) version control for creating branches and merging, (2) improving the ranking system with textual reviews or comments, (3) enabling users to add additional information (metadata, description, and documentation) to improve flexibility, transparency, and reusability, and (4) providing a machine-to-machine interface for automated cooperation between systems, so that a solution can be searched for, found, and reused without any need for human intervention. Further necessary future work is to determine (5) how moderation of the feedback feature should be realized and (6) what should be considered when the search feature is implemented. Other open issues are (7) what happens when the origin of a cloned project is updated and (8) a further analysis of the needs of the end user.

References

  • Amershi et al. (2019) Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: a case study. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice. IEEE Press, 291–300.
  • Basheer and Hajmeer (2000) Imad A Basheer and Maha Hajmeer. 2000. Artificial neural networks: fundamentals, computing, design, and application. Journal of microbiological methods 43, 1 (2000), 3–31.
  • Clements and Northrop (2001) Paul Clements and Linda Northrop. 2001. Software Product Lines: Practices and Patterns. Addison-Wesley Professional.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
  • Ghofrani and Fehlhaber (2018) Javad Ghofrani and Anna Lena Fehlhaber. 2018. ProductlinRE: online management tool for requirements engineering of software product lines. In Proceedings of the 22nd International Conference on Systems and Software Product Line-Volume 2. ACM, 17–22.
  • Ghofrani et al. (2019) Javad Ghofrani, Ehsan Kozegar, Arezoo Bozorgmehr, and Mohammad Divband Soorati. 2019. Reusability in artificial neural networks: an empirical study. In Proceedings of the 23rd International Systems and Software Product Line Conference-Volume B. ACM, 77.
  • Krueger (1992) Charles W Krueger. 1992. Software reuse. ACM Computing Surveys (CSUR) 24, 2 (1992), 131–183.
  • Pahl and Loipfinger (2018) Marc-Oliver Pahl and Markus Loipfinger. 2018. Machine learning as a reusable microservice. In NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium. IEEE, 1–7.
  • Sculley et al. (2015) David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In Advances in neural information processing systems. 2503–2511.
  • Torrey and Shavlik (2010) Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI Global, 242–264.