The ATLAS Physics and Committees Office (PO) is one of the ATLAS Collaboration’s [PERF-2007-01] executive committees. It is made up of physicists and engineers who provide continuous support to committees and groups including the ATLAS Management, the Physics Coordinators, the Publication Committee, analysis group conveners, the Authorship Committee, the Speakers Committee, and many others. The PO also assists any member of the ATLAS Collaboration, for example by facilitating membership, authorship, and paper submission to the arXiv and journals, and by reviewing talks and posters for national and regional meetings.
The PO supports the development of several tools including those used to manage physics analyses, prepare and submit papers, distribute detector performance documents, and track conference proceedings. It uses web-based systems to implement the metadata connected with analyses, version control for editing documents, and author lists. PO members are available to guide users in understanding the tools. The PO also assists with other daily tasks to lower the load on each member of the collaboration.
The ATLAS Collaboration has a dedicated organisational structure for work on detector maintenance and operation, data analysis, and scientific publication and outreach. Collaborative tools are needed to provide efficient communication among collaborators and straightforward interaction with the journals, the institutions, and the funding agencies.
This report focuses on the infrastructure for managing analyses and papers, especially its most recent developments, which were launched in Fall 2017. Due to the phasing out of the SVN system [svn], a new system was built using the FENCE [Bruno] framework, described in Section 3. This is now used to handle any analysis or document type, for internal use or for a large publication, as is described in Sections 2 and 4. The framework is used not only for ATLAS document handling but also for the organization of information about other entities including members, institutes, appointments, equipment, talks, and conferences. It is also used by the ALICE experiment to organize information on members, appointments, funding agencies, institutes, the author list, and shift bookings. The LHCb experiment uses the framework for the management of members, appointments, and institutes. ATLAS has very specific needs for each task, requiring integration with a single database. For this reason a flexible custom solution had to be developed.
The new system is based on Git and the associated CERN GitLab code repository hosting platform. Development of a special FENCE–GitLab integration has been necessary, as is detailed in Section 5. The ATLAS GitLab area for editing the documents and submitting the papers to the journals, PO-Gitlab, is described in Section 6. A description of the main tools used to support the collaboration author list and the acknowledgements of funding agencies and foundations is given in Section 7. A more general description of the way the metadata are managed is presented in Section 8.
2 ATLAS publication process strategy
The ATLAS experiment supports a wide physics programme to explore the fundamental nature of matter. To do so, it makes use of the Large Hadron Collider (LHC), which collides protons at almost the speed of light and a centre-of-mass energy of . To carry out such a physics program, physicists need software and graphical tools to analyse the data and compare them to theoretical models.
ATLAS is organised into several Physics (PHY) and Combined Performance (CP) working groups and subgroups. These groups are coordinated by conveners appointed by the collaboration for typically two years. Example names of PHY and CP groups include Top Quark (TOPQ), Standard Model (STDM), B-physics (BPHY), Higgs (HIGG), Electron/Gamma (EGAM), and Jet and EtMiss (JETM). Studies of system detectors (SYS) and activities such as software (SOFT) and data preparation (DAPR) are also organised hierarchically with subgroups and conveners.
Once an analysis is finished or an aspect of the detector performance has been studied in detail, some members of the analysis team prepare a publication. These writers are known as the editors. In fundamental research, as is the case with the research conducted at CERN, the publication of the results is a duty and is the usual way to show the results publicly and to report outcomes to the taxpayers and funding institutions.
ATLAS produces six different types of documents:
PAPER: general publications in refereed journals, based on collision data analyses and detector projects;
PUB notes: public documents classified as notes; they sometimes use only simulated data;
PROC and CONF notes: conference proceedings and notes containing preliminary results, respectively, which are shown at conferences;
INT: internal notes or technical documents;
PLOT: plots that can be used along with the above-mentioned documents.
All ATLAS analyses are discussed and presented in the relevant working groups which have the responsibility, together with the subgroups, to provide guidance, help, and/or resources to the analyses in both the early stages of an analysis and during its development. The working groups should also develop a coherent and realistic plan for the release of the results for a conference and/or journal publication. This is a necessary step before any paper draft can be planned or circulated. This procedure and the related steps (or phases) are described in an ATLAS internal document and are summarized below.
For an INT, CONF, or PUB note, the procedure has two phases: Phase 0 and Phase 1. For a paper that is sent to a peer-reviewed journal, the procedure has four phases, which include, in addition to the two above, Phase 2 and Submission.
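The phase sequences just described can be summarised in a small lookup table. The following is a minimal Python sketch for illustration only: the dictionary layout and the helper `next_phase` are assumptions made here and are not part of the ATLAS tools.

```python
# Phase sequences per document type, as stated in the text above.
# The data layout and function name are illustrative assumptions.
PHASES_BY_TYPE = {
    "INT": ["Phase 0", "Phase 1"],
    "CONF": ["Phase 0", "Phase 1"],
    "PUB": ["Phase 0", "Phase 1"],
    "PAPER": ["Phase 0", "Phase 1", "Phase 2", "Submission"],
}


def next_phase(doc_type, current):
    """Return the phase that follows `current`, or None if it is the last one."""
    phases = PHASES_BY_TYPE[doc_type]
    index = phases.index(current)
    return phases[index + 1] if index + 1 < len(phases) else None
```

For example, `next_phase("PAPER", "Phase 1")` gives `"Phase 2"`, whereas a CONF note has no phase after Phase 1.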
An analysis or a document is started at Phase 0 of the Analysis FENCE interface (also called the Analysis FENCE page). The Analysis Team (AT) starts their analysis and begins writing drafts and supporting documents. The type of document could be PAPER, CONF, or PUB. Some important settings are established at the start of an Analysis, including the constitution of the AT, the appointments of the group and sub-group conveners in charge of oversight of the analysis, and the constitution of an Editorial Board (EdBoard). From the start of Phase 0, the AT is assigned a dedicated GitLab space, a repository, in which to edit its documents. A GitLab repository with a skeleton INT note is created by default at Phase 0.
The EdBoard reviews the complete analysis and ensures that any documentation or paper drafts are prepared according to ATLAS policies. Once it is satisfied, it signs off on the draft PAPER or CONF note before its distribution to the ATLAS collaboration for review. The EdBoard should verify that the analysis is worth publishing in the proposed form and consult with the Publication Committee (PubComm) chair if there are doubts. It should also establish with the editors and conveners whether the paper should be a letter or an article, and propose a journal. These steps and the validation workflow are performed during Phase 1 and Phase 2, related respectively to the first and second circulation of the draft document to the collaboration for their comments. During the circulation periods, authors can read and comment on the paper draft.
The PubComm chair has the responsibility to assess the quality of the paper and ensure that the ATLAS guidelines and policies are followed. A Physics Approval Meeting is held after the first circulation, followed by a Physics Closure Meeting after the second circulation. After a sign-off of the revised draft following second circulation by the EdBoard, the draft goes to the Chair of the PubComm for a final sign-off.
The ATLAS Spokesperson (SP) is ultimately responsible for the scientific quality of the results from the ATLAS Collaboration and makes a final review of each paper before the Submission. The final draft is signed off by the SP or his/her delegate.
When the SP has signed off, the validation workflow at Phase 2 is finished. A message to the Physics Office Publications team (PO-Pub) is generated to inform them of a new document to submit. PO-Pub officers then proceed with the submission to the arXiv and the peer-reviewed journal. They are responsible for communication with the journal during all the steps (referee reports and proofs) through a dedicated Submission workflow. The submission is completed once the document is published online. The journal references are entered at the last step of the workflow, which closes the procedure and makes the references of the publication available on the arXiv, public web pages, and the INSPIRE-HEP database [inspire].
CONF and PUB notes use only the Phase 1 workflow. The steps and validation workflow are implemented in systems developed using the FENCE framework, which is described in Section 3. The related web-based systems encompass all of the Phase 0, 1, 2, and Submission steps and are described in Section 4 with a focus on Phase 0. If necessary, at Phase 0, editors may request the creation of dedicated GitLab repositories appropriately configured through the FENCE–GitLab integration, as is described in Section 5. The metadata entered in any of the phases are exported to web sites to display the necessary information, including the Public Results pages that are explained in Section 8. Some of the metadata are also used internally by the collaboration to monitor the journal submission process or related activities. The Continuous Integration (CI) tools, which are explained in Section 6, allow validation of the document drafts and preparation of the appropriate ready-to-go tarball, a compressed set of files, containing the full LaTeX [latex] resources and files for the submission to the peer-reviewed journals.
For the category PAPER, a longer process is carried out by the PO-Pub officers. They check the author list and the acknowledgements. Author lists and acknowledgements are both handled and generated through the FENCE framework, described in Section 3, and their production is described in detail in Section 7.1. Before the final publication, and after the refereed review and acceptance by the journal, proofs are sent to the collaboration for a last check. While the editors proofread the content of the paper within a short period of time, usually two days, the PO-Pub officers check whether the authors and their affiliations have been appropriately handled by the journal, through comparison to the original files sent to them. This check is performed automatically using a tool called the Proof Checker, which is described in Section 7.3.
3 The FENCE framework
FENCE is an object-oriented PHP [php] framework designed for the development of web applications. It encompasses the concepts of encapsulation, data abstraction, polymorphism, and inheritance. FENCE uses an ORACLE database (DB) to store the data fetched and displayed in its interfaces. Although ORACLE is the default DB management system used, with some development effort, one can use instead other relational database services such as MySQL and Microsoft SQL Server.
A class can be defined as a template that describes the behaviour that the object of its type supports. FENCE assembles classes to build applications by making extensive use of configuration files, which are loaded into the engine at each request. It then generates the HTML response on the user’s browser. The classes can be inherited by the systems that make use of the framework, and therefore, the code can be reused, with similar features implemented from the predefined software components. As a consequence, the development process is accelerated and the maintenance cost is reduced.
The FENCE software development process encompasses software engineering methods such as requirements analysis, architecture, design, testing, deployment, and maintenance in order to guarantee the quality of the software. Requirements are gathered and documented prior to the solution design and, in this way, developers are able to propose broader solutions that can benefit the whole project. After any implementation, tests are performed to assure software correctness, robustness, extensibility, and re-usability.
Figure 1 presents the fifteen ATLAS web-based systems currently in production. These were developed using the FENCE framework, which facilitates their maintenance and enhancement. They can be divided into three categories: people, publications, and equipment. The people-related ones have features for managing personal information of the ATLAS members, including their contracts, appointments, affiliations, nominations, conferences, theses, and research activities. The systems related to publications automate the process of producing papers, conference and public notes, and weekly performance plots from collision data, for review. Those related to the equipment handle information about system detectors’ design and interconnection.
3.1 FENCE main classes
The FENCE framework is composed of a library of helper classes that are extensible program-code templates for creating objects. Any new class can be coded and added to the framework, widening its scope, and can then be reused in different systems. One example is the Search class that provides methods to create search interfaces that allow data filtering through predefined search attributes. The SuperSearch class offers an advanced search interface, where the user can build logic queries with AND and OR operators. The inputs that are entered into a form can easily be added using classes such as TextInput, DateInput, and MemberInput, which provides a selection box with the list of all members of an experiment. The most important classes, developed to support the ATLAS publication process, are described in the following sections.
3.1.1 Workflow class
The FENCE Workflow represents any process involving states and actions triggered by a change from one state to another. It was used to implement the web system that supports the ATLAS publication process, which is organised in phases. Each phase is divided into several steps separated by actions. Each step can activate a number of tasks including the recording of metadata into the ATLAS database, triggering of an E-group creation, activating an update on GitLab [gitlab], and sending automatic emails.
The Workflow class was developed based on the concept of Directed Cyclic Graphs (DCG) that encompasses the relation between objects. Objects are called nodes and the relations between them are called edges, implying a directional flow. To represent this concept, some classes were created. The abstract Graph, whose corresponding code can be found in Section A.1, has methods that allow the addition and deletion of nodes and edges. The class that implements Graph is called MapperGraph. It stores nodes and edges inside a PHP data structure called SplObjectStorage that, for this implementation, can better manage objects than associative arrays. The use of this data structure allowed the development of very simple methods to retrieve neighbour nodes or edges given an origin and target node, which means retrieving a directional edge.
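To make the graph concept concrete, the following is a minimal Python sketch of a directed graph with node and edge objects. It is not the FENCE code from Section A.1, which is written in PHP and relies on SplObjectStorage; the class name `MapperGraphSketch` and all method names are assumptions chosen for illustration.

```python
class Node:
    """A workflow state; `data` can hold arbitrary payload such as actions."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Edge:
    """A directed relation from `origin` to `target`."""

    def __init__(self, origin, target):
        self.origin = origin
        self.target = target
        self.data = {}


class MapperGraphSketch:
    """Directed graph keyed by node objects (hypothetical stand-in)."""

    def __init__(self):
        self._nodes = set()
        self._edges = {}  # (origin, target) -> Edge

    def add_node(self, node):
        self._nodes.add(node)

    def add_edge(self, origin, target):
        self.add_node(origin)
        self.add_node(target)
        edge = Edge(origin, target)
        self._edges[(origin, target)] = edge
        return edge

    def remove_node(self, node):
        # Removing a node also removes every edge that touches it.
        self._nodes.discard(node)
        self._edges = {k: e for k, e in self._edges.items() if node not in k}

    def neighbours(self, origin):
        """Nodes reachable from `origin` by following one directed edge."""
        return [t for (o, t) in self._edges if o is origin]

    def edge(self, origin, target):
        """The directed edge origin -> target, or None if there is none."""
        return self._edges.get((origin, target))


draft, review, signoff = Node("draft"), Node("review"), Node("sign-off")
graph = MapperGraphSketch()
graph.add_edge(draft, review)
graph.add_edge(review, signoff)
graph.add_edge(review, draft)  # a cycle: the review can send the draft back
```

Because edges are stored under an ordered (origin, target) key, looking up `graph.edge(draft, review)` succeeds while `graph.edge(signoff, review)` returns nothing, which reflects the directional flow described above.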
The Node class defines methods to set and get data related to one node. The Edge defines similar methods, but related to an edge. An example of data that can be added to a node is an instance of the Action, having methods to set and get function callbacks, defining its arguments, and being able to access its outputs. More details about the Action class implementation can be found in Section A.2.
The behaviour of the Workflow is controlled by a JSON file, following the FENCE pattern described in Section 3.3. This file defines a workflow’s steps, their order, and the actions that can be triggered at a given step. The Workflow class uses the MapperGraph, Node, Edge, and Action classes to build a graph and its elements.
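As an illustration of a configuration-driven workflow, the sketch below loads a JSON description of steps and actions and runs the actions attached to the step being entered. The JSON keys (`steps`, `actions`), the step names, and the action names are hypothetical; the actual FENCE schema is not given in this report.

```python
import json

# Hypothetical workflow description; the real FENCE JSON schema may differ.
WORKFLOW_JSON = """
{
  "steps": ["analysis_started", "edboard_request", "edboard_formed"],
  "actions": {
    "edboard_request": ["save_metadata"],
    "edboard_formed": ["save_metadata", "create_egroup", "send_email"]
  }
}
"""


def make_action(label):
    """Build a callback that records its own execution in `log`."""
    def action(step, log):
        log.append(f"{label}:{step}")
    return action


def proceed(config, current, registry, log):
    """Move from `current` to the next step and run that step's actions."""
    steps = config["steps"]
    next_step = steps[steps.index(current) + 1]
    for name in config["actions"].get(next_step, []):
        registry[name](next_step, log)
    return next_step


config = json.loads(WORKFLOW_JSON)
registry = {name: make_action(name)
            for name in ("save_metadata", "create_egroup", "send_email")}
log = []
step = proceed(config, "analysis_started", registry, log)
step = proceed(config, step, registry, log)
```

Keeping the step order and the action names in the JSON file means that a workflow can be changed without touching the code that executes it, which is the point of the configuration pattern described in Section 3.3.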
3.1.2 Messenger class
The Messenger class is used by the Workflow to send automatic emails and to allow users to edit email templates. The JSON file used by the Workflow defines email template names to be triggered by an action. These templates and their variables are stored in the database in two JSON files. The first one contains all the templates with variables to be substituted, and the second contains the variables’ identifiers and the methods used to substitute them into the templates before sending the email. Using another class called DBJReader, the Messenger can read these JSON files from the database. It can then either get the templates and show them in the interface, so the users can edit them, or parse the variables and send an email. In the first case, the changes applied in the templates are saved in the database, but this time using the DBJWriter. In the second case, Messenger will substitute all the variables in the template and use the Mailer, designed to send automatic emails and to trigger the email to the correct recipient. A summary of this infrastructure is illustrated in Figure 2.
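The two-document scheme (templates in one JSON document, variable resolvers in another) can be sketched in Python as follows. The template name, the variable names, the resolver method names, and the class `PublicationContext` are all hypothetical, and the database access performed by DBJReader is replaced here by in-memory JSON strings.

```python
import json
from string import Template

# First document: templates containing variables to substitute (hypothetical).
TEMPLATES = json.loads("""
{
  "edboard_appointed": {
    "subject": "Editorial Board for $ref_code",
    "body": "Dear $recipient, you were appointed to the Editorial Board of $ref_code."
  }
}
""")

# Second document: variable identifier -> name of the method that resolves it.
VARIABLES = json.loads("""
{"ref_code": "get_ref_code", "recipient": "get_recipient_name"}
""")


class PublicationContext:
    """Hypothetical object exposing the resolver methods named in VARIABLES."""

    def __init__(self, ref_code, recipient):
        self._ref_code = ref_code
        self._recipient = recipient

    def get_ref_code(self):
        return self._ref_code

    def get_recipient_name(self):
        return self._recipient


def render(template_name, context):
    """Substitute every known variable into the subject and body."""
    values = {var: getattr(context, method)()
              for var, method in VARIABLES.items()}
    template = TEMPLATES[template_name]
    return {part: Template(text).substitute(values)
            for part, text in template.items()}


message = render("edboard_appointed",
                 PublicationContext("ANA-TEST-0001", "A. Physicist"))
```

Separating the templates from the resolvers lets users edit the wording of an email without being able to alter how the variables are computed.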
3.1.3 EgroupManager class
The EgroupManager class is similar to the Messenger class, since it also gets a template from a JSON file and substitutes variables. The difference is that the templates are not related to emails, but to E-group configurations. It does not allow users to edit the templates from the interface since they contain many technical details.
The EgroupManager class uses another FENCE class, called JReader, to get the templates from the JSON file. This class was designed to parse JSON files and store them in an object. After getting the JSON templates, the EgroupManager parses them, substituting all the variables.
With the template parsed, the EgroupManager uses the FENCE EgroupSOAPHandler to communicate with the E-groups API. To do so, it first makes an authentication. Using the methods available in the SOAP WebServices, it can create, update, and delete E-groups.
3.1.4 User class
The User class supports access control of the interfaces.
The main purpose of the User class is to define an object that stores information concerning the connected ATLAS member (connected to the main CERN authentication server), including the CERN CCID (CERN Computing ID), first and last name, E-groups, and other attributes. It also defines specific methods to facilitate access control within the interface.
Section A.3 gives two examples of the above-mentioned methods. These are used to check user authorisation: is_expert() checks if the user is a member of the E-group of the FENCE developer team. The method Permission($permission) accepts a permission to be checked as an argument and verifies if it is in the user permissions inventory.
When an extension of User is created, extra methods are appended to User to provide specific Utils for a context, Utils being a FENCE class that contains useful public methods used by many other classes of the framework. Every system has its own User class extending the FENCE core User class. Systems may therefore have specific methods that are used to grant edit permissions and control user access.
Configuration files, described in Section 3.3, provide multiple properties that set access control and edit permissions. This is mainly achieved in two ways. General access control is set using CERN E-groups, or FENCE user groups, including experts, administrators, and many others. In this case, User verifies the clearance by comparing the user's actual E-groups and user groups to the required ones. The other way concerns edit permissions and uses specific roles. These roles are keys mapped to methods in User that check if the member is supposed to have edit permission on that specific field. An example is shown in listing lst_User.
Taking the GROUP_CONVENER role as an example, a dedicated method in User grants permission to edit the public short title field (see listing lst_Conv).
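The role-to-method mapping can be illustrated with the short Python sketch below. It is not the FENCE listing: the class `UserSketch`, the method names, the E-group name, and the shape of the field configuration are assumptions made for this example.

```python
class UserSketch:
    """Hypothetical user object with role keys mapped to check methods."""

    ROLE_CHECKS = {
        "GROUP_CONVENER": "is_group_convener",
        "EXPERT": "is_expert",
    }

    def __init__(self, egroups, convened_groups=()):
        self.egroups = set(egroups)
        self.convened_groups = set(convened_groups)

    def is_expert(self, publication=None):
        # "fence-developers" is a made-up E-group name used for illustration.
        return "fence-developers" in self.egroups

    def is_group_convener(self, publication):
        return publication["group"] in self.convened_groups

    def can_edit(self, field_config, publication):
        """True if any role listed for the field grants edit permission."""
        return any(
            getattr(self, self.ROLE_CHECKS[role])(publication)
            for role in field_config["edit_roles"]
        )


field = {"name": "public_short_title", "edit_roles": ["GROUP_CONVENER"]}
publication = {"group": "HIGG"}
higgs_convener = UserSketch(egroups=[], convened_groups=["HIGG"])
top_convener = UserSketch(egroups=[], convened_groups=["TOPQ"])
```

Here `higgs_convener.can_edit(field, publication)` is true and `top_convener.can_edit(field, publication)` is false, because only the convener of the publication's own group satisfies the role listed for the field.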
3.2 MBF (Models, Builders, and Factories) infrastructure
Models, Builders, and Factories are all heavily used software design patterns. Their combined use is a particular feature of the FENCE framework. The main goal of these development standards is to create a wrapper to store complex objects and facilitate their construction in different contexts, working as an SQL query builder. For instance, it would be possible to pass an actual SQL query every time information from the database is needed. It is, however, much more convenient to just call a class that handles the queries and presents to the user the needed object. The desired behaviour described here is exactly how the MBF infrastructure works. In FENCE classes, objects are constructed simply by instantiating specific factory classes. For instance, a member is constructed by instantiating the MemberFactory (see listing lst_Mbf).
In this example, MemberFactory extends the core Factory, which handles the whole process that connects to the database and assembles objects. An object containing the order of properties to be built is passed as an argument to the instantiated factory. In the example of listing lst_Mbf, the member factory provides the first and last name as well as the email address of a member.
When a specific new object needs to be created, a group of three files is needed: the Factory, the Builder, and the Model files. The first one stores the inventory of factories a specific Factory connects to and sets which Builder it uses. The next stores the relation between the database structure and the Model, assigning table columns to its Setters. Finally, the Model is the class that is populated by the Builder and stores the information in structured objects that can be accessed through Getters.
Seen from the opposite direction, Models are classes that serve as object-oriented representations of the information. They define several set and get methods that handle specific properties of the object. These models are used in Builders, where the actual query is set and database columns are associated with a model set method. Finally, a Factory calls its corresponding Builder and contains an inventory, which may be empty, of other Factories that are related to this object.
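The division of labour between the three kinds of classes can be sketched as follows. This is an illustrative Python example, not the FENCE implementation: an in-memory SQLite database stands in for the Oracle DB, and the table name, column names, and method names are assumptions.

```python
import sqlite3


class MemberModel:
    """Model: structured object filled through setters, read through getters."""

    def __init__(self):
        self._first_name = self._last_name = self._email = None

    def set_first_name(self, value): self._first_name = value
    def set_last_name(self, value): self._last_name = value
    def set_email(self, value): self._email = value
    def get_first_name(self): return self._first_name
    def get_last_name(self): return self._last_name
    def get_email(self): return self._email


class MemberBuilder:
    """Builder: owns the query and maps table columns to Model setters."""

    TABLE = "member"  # hypothetical table name
    COLUMN_TO_SETTER = {
        "first_name": "set_first_name",
        "last_name": "set_last_name",
        "email": "set_email",
    }

    def build(self, conn, properties, member_id):
        # Only whitelisted column names ever reach the SQL text.
        columns = [c for c in properties if c in self.COLUMN_TO_SETTER]
        query = f"SELECT {', '.join(columns)} FROM {self.TABLE} WHERE id = ?"
        row = conn.execute(query, (member_id,)).fetchone()
        model = MemberModel()
        for column, value in zip(columns, row):
            getattr(model, self.COLUMN_TO_SETTER[column])(value)
        return model


class MemberFactory:
    """Factory: names its Builder and its (here empty) related factories."""

    builder = MemberBuilder()
    related_factories = {}

    def __init__(self, conn, properties):
        self._conn = conn
        self._properties = properties

    def create(self, member_id):
        return self.builder.build(self._conn, self._properties, member_id)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (id INTEGER PRIMARY KEY, first_name TEXT,"
             " last_name TEXT, email TEXT, phone TEXT)")
conn.execute("INSERT INTO member VALUES"
             " (1, 'Ada', 'Example', 'ada@example.org', '000')")
member = MemberFactory(conn, ["first_name", "last_name", "email"]).create(1)
```

The caller only lists the properties it needs and never writes SQL, which is the convenience the MBF infrastructure is meant to provide.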
3.3 Configuration files in FENCE
The FENCE framework is based on configuration files that provide the necessary parameters and properties to build interfaces. The main goal of this infrastructure is to simplify many aspects of web system requirements. The configuration files are in JSON, a lightweight format for storing and transporting data, and since these files can be transformed into structured objects, developers can easily define a group of properties within specific contexts. For instance, it is possible to set up which groups of users can have access to a certain interface. Another benefit of using configuration files is that major classes that have several arguments and environment parameters can be instantiated in a cleaner way, with just a configuration file path as argument. This encourages developers to write more generic and robust features, since they can easily be reused in the future.
Along with the configuration file concept, additional utilities were developed to guarantee the feasibility of this idea. One of these tools is the class JReader, which provides functionality for template variable substitution and JSON schema validation. Another one is the FENCE Content, which gets some default information from configuration files to handle common interface needs, such as access control, constants, and rendering outline formats.
Most of the time, when a new interface is created using FENCE, the class that generates the particular content of this page inherits from Content. At the same time, as is shown in the Unified Modeling Language (UML) diagram in Figure 3, Content takes a configuration file path as argument. This configuration file path is passed to an instance of JReader constructed within Content. The JReader method parsecontents makes the corresponding configuration file content available to Content.
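The relationship between a page class, Content, and JReader can be illustrated with a short Python sketch. The class names carry a `Sketch` suffix to stress that they are stand-ins; the configuration keys `title` and `access_groups` and the method names are assumptions, not the FENCE schema.

```python
import json
import os
import tempfile


class JReaderSketch:
    """Stand-in for JReader: parses a JSON file into a Python object."""

    def __init__(self, path):
        self._path = path

    def parse_contents(self):
        with open(self._path, encoding="utf-8") as handle:
            return json.load(handle)


class ContentSketch:
    """Stand-in for Content: built from nothing but a configuration path."""

    def __init__(self, config_path):
        self.config = JReaderSketch(config_path).parse_contents()

    def can_access(self, user_groups):
        # Access is granted if the user belongs to any configured group.
        return bool(set(self.config["access_groups"]) & set(user_groups))


class PublicationPage(ContentSketch):
    """A page class inheriting the common behaviour from the content base."""

    def title(self):
        return self.config["title"]


# Write a hypothetical configuration file, then build the page from its path.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump({"title": "Publication details",
               "access_groups": ["experts", "administrators"]}, handle)
    config_path = handle.name

page = PublicationPage(config_path)
os.remove(config_path)
```

The page class receives a single argument, the configuration path, and inherits access control from its base class, which mirrors the "cleaner instantiation" benefit discussed above.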
4 Analysis web-based systems with a focus on Phase 0
To automate the process of conception, evolution, review, and approval of publications described in Section 2, five web-based systems were developed using the FENCE framework: PAPERs, CONF notes, PUB notes, PLOTs and Phase 0, see Figure 1. Together they are called the Analysis Web systems. The first four are described here briefly, while the last one is presented in detail.
The relationships among the five Analysis web-based systems are represented in Figure 4. The Phase 0 system has been implemented to support the publication process. The evolution of the process from the creation of a Phase 0 to the other publication systems (PAPER, CONF, and PUB) is described in Section 2. For the review and approval process of a publication set of plots, there is also the PLOTs system, which can be used during all phases whenever a new plot is sent for circulation to and review by the collaboration.
The PAPER system provides functionalities for inserting, retrieving, editing, and deleting the properties of a paper in a database, and manages the activity flow of its three phases: Phase 1, Phase 2, and Submission.
The CONF notes system incorporates notes that should be presented at a conference. The PUB notes system incorporates public notes that should be presented to the scientific community without being submitted to a journal or presented at a conference. The PLOTs system handles the plots that are used to present results in all other types of publications mentioned so far. Those three systems present functionalities for inserting, retrieving, editing, and deleting the properties of their entities in a database and also manage the workflow of each system’s Phase 1.
The PAPER, CONF notes, PUB notes, and PLOTs systems are quite similar, differing only in the number of phases and the workflow steps involved in each one. They also share the same kinds of actions attached to each phase’s steps, which can be: saving data in the database, sending automatic emails, or creating or updating E-groups.
The need for the Phase 0 system arose in 2017, when the ATLAS IT department started to phase out the Apache Subversion (SVN) [svn] version control system and encouraged its members and authors to use Git [git] because of its decentralised nature, which is better adapted to the situation of the collaborators. The experiment started to use the repository platform GitLab [gitlab] because of its continuous integration functionality, the possibility of storing repositories in private servers, and the provision of an API with many services.
The transition period has triggered the need for a tool that can communicate with the GitLab API and create automatically-configured Git repositories with each publication’s unique metadata. To formalise the creation of repositories at the beginning of the publication writing process, the concept of Phase 0 emerged. It was recognized that this could also include the flow of tasks during the preliminary stage of the editorial process, when it is not yet known whether scientific content will materialise into a paper, conference note, or public note. So, in March 2017, Phase 0 web-based system development was launched.
The system provides functionalities to support and formalise the initial stages that may lead to a publication, before entering the PAPER, CONF, and PUB notes production process. Phase 0 can trigger different types of processes, including an Analysis workflow towards a PAPER or a CONF note, which gathers all the physics and combined performance analysis activities (PHY, CP). One can also skip the Analysis workflow and proceed directly towards a CONF note, a PAPER, or a PUB note. This is allowed for a PUB note, which usually covers simulation work or an instrumental description. It is also allowed for a PAPER or CONF note intended as an instrumental description, or for a physics CONF note that should proceed as quickly as possible through internal review so that it may be used at a conference.
Phase 0 is the common stage for PAPER, CONF, and PUB note workflows, before Phase 1. It stores some metadata divided into steps, e.g. meeting dates, comments, links, groups of people such as Analysis Contacts, target dates for analysis finalisation, editorial board members and meetings, and approval sign-off dates. As is described in Section 2, each of those metadata should be filled in a specific order by users with the appropriate permissions and should trigger automatic emails or E-group updates all along the process.
4.1 Phase 0 repository
The first step of the Phase 0 system implementation was the data modelling, which identified the system’s entities with their attributes and relationships. A simplified version of this model is presented below.
The main entity of the system is a Publication, which has attributes such as title, reference code, and creation date. A publication is always related to a Group and, most of the time, a Subgroup, whose attributes are name and description.
Members of the ATLAS experiment are related to a publication by one or more Roles such as Analysis Team or Editorial Board member. A Member has attributes such as his/her first name, last name, and primary email address. The attributes of a Role are its name, type, start date, and end date.
A publication contains Phases (in this system, only Phase 0), whose attributes are the start date and its status. During Phase 0 steps, many Meetings take place, and their attributes are title, date, and comments.
Some external Contents are associated with Phase 0, such as notes containing supporting documentation for the publication and meeting minutes that are stored on the CERN document server. This entity has as its attributes the name of the content, its type, and its web address.
Phase 0 is also related to Deadlines by which people finish their activities. A Deadline has as attributes its type and its date.
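The simplified data model described in this section can be written down as a set of Python dataclasses. This is only a sketch of the entities and relationships named in the text: the field names are assumptions (for example `kind` is used for the "type" attribute), and it does not reflect the actual database schema.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Member:
    first_name: str
    last_name: str
    email: str


@dataclass
class Role:
    name: str                      # e.g. "Analysis Team", "Editorial Board"
    kind: str
    member: Member
    start_date: dt.date
    end_date: Optional[dt.date] = None


@dataclass
class Meeting:
    title: str
    date: dt.date
    comments: str = ""


@dataclass
class Content:
    name: str
    kind: str
    url: str


@dataclass
class Deadline:
    kind: str
    date: dt.date


@dataclass
class Phase:
    start_date: dt.date
    status: str
    meetings: List[Meeting] = field(default_factory=list)
    contents: List[Content] = field(default_factory=list)
    deadlines: List[Deadline] = field(default_factory=list)


@dataclass
class Publication:
    title: str
    ref_code: str
    creation_date: dt.date
    group: str
    subgroup: Optional[str] = None
    roles: List[Role] = field(default_factory=list)
    phases: List[Phase] = field(default_factory=list)


contact = Member("Ada", "Example", "ada@example.org")
publication = Publication(
    title="Example analysis", ref_code="ANA-TEST-0001",
    creation_date=dt.date(2017, 10, 1), group="HIGG",
    roles=[Role("Analysis Team", "contact", contact, dt.date(2017, 10, 1))],
    phases=[Phase(dt.date(2017, 10, 1), "open")],
)
```

The optional `subgroup` reflects the statement that a publication is related to a subgroup only most of the time, and a member can appear in several `Role` entries for the same publication.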
4.2 Phase 0 main functionalities
The Phase 0 system has three main functions. The first refers to the insertion of a new publication, when the members of the ATLAS experiment decide to publish the results of their work and need to define the principal data of the article or public note in order to start writing. The interface presents a web form that contains several fields that define the main information of the new publication. These include its title, reference code, groups, subgroups, and keywords. The second interface presents the search functionality. With this a user can search for publications by setting filters, and can generate reports from the results table. The third interface allows editing of the information about a publication, facilitates the monitoring and evolution of Phase 0, and enables the automatic creation of Git repositories.
The functionality to submit a new analysis can be seen in Figure 5. Through this, a member fills out a form in steps. The mandatory fields in each step are indicated by asterisks (*). Information on how to fill in each field is available through the ‘i’ icon next to the field name. At the end of all steps, there is a confirmation step where the user can verify whether all fields have been filled in correctly. If so, the form information can be stored in the database, which now gathers the information that defines an analysis such as its title and reference code.
The advanced search functionality of the Phase 0 system, shown in Figure 6, allows a user to define criteria through three fields. The first defines a publication attribute, the second selects an operator, and the third allows a value to be entered. One or more search criteria can be selected and arranged by forming logical expressions using the AND and OR operators. Users can also configure the search results by setting the ordering of the records in ascending or descending order, grouping them by attributes, selecting the visible attributes, and saving those configurations for use in a future search. Search result reports can also be exported in CSV file format.
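The attribute, operator, and value triplets and their combination with AND and OR can be sketched in a few lines of Python. The operator set, the function names, and the sample records are assumptions made for illustration; the actual SuperSearch class is implemented in PHP and queries the database.

```python
import csv
import io
import operator

# Hypothetical operator table for the second field of a criterion.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "contains": lambda text, part: part in text,
}


def criterion(attribute, op, value):
    """One search criterion: attribute, operator, value."""
    return lambda record: OPERATORS[op](record[attribute], value)


def all_of(*predicates):
    """Logical AND of several criteria."""
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates):
    """Logical OR of several criteria."""
    return lambda record: any(p(record) for p in predicates)


def search(records, predicate, order_by, descending=False):
    """Filter the records and sort the matches by one attribute."""
    matches = (r for r in records if predicate(r))
    return sorted(matches, key=lambda r: r[order_by], reverse=descending)


def to_csv(rows, columns):
    """Export only the visible columns of the result rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = [
    {"ref_code": "EX-001", "group": "HIGG", "type": "PAPER"},
    {"ref_code": "EX-002", "group": "TOPQ", "type": "CONF"},
    {"ref_code": "EX-003", "group": "HIGG", "type": "PUB"},
]
query = all_of(
    criterion("group", "=", "HIGG"),
    any_of(criterion("type", "=", "PAPER"), criterion("type", "=", "PUB")),
)
results = search(records, query, order_by="ref_code", descending=True)
report = to_csv(results, ["ref_code", "type"])
```

The query above reads "group is HIGG AND (type is PAPER OR type is PUB)", sorted by reference code in descending order, with only two visible columns kept in the exported report.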
Finally, the publication details interface, shown in Figure 7, is the main interface of the system. It presents the metadata and allows them to be edited. The interface also controls the workflow of Phase 0 activities, providing an overview of all its stages and highlighting the previous, current, and upcoming ones. A transition between Phase 0 steps triggers actions, the most common being the storage of data in the database. If allowed, a user can either save the data and stay at the same step by pressing the ‘Save’ button, or save the data and go to the next step by pressing the ‘Proceed’ button. When the workflow moves forward, the system sends automatic messages that alert and instruct the person responsible for the next step.
An example of a Phase 0 step that is part of a workflow is the Editorial Board “request meeting and formation data” step, which is illustrated in Figure 8. The group convener is responsible for adding the Editorial Board “request meeting” title, date, comments, and links. The Publication Committee Chair is responsible for appointing the Editorial Board members and filling in the date on which they are appointed. Once all this information is in the system, the Publication Committee Chair can proceed to the next Analysis workflow step. Subsequently the Editorial Board E-group is automatically created, including information for all its members, and an email is sent, informing them that they were appointed and should proceed to the next step of the Analysis workflow.
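The ‘Save’ and ‘Proceed’ behaviour can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the FENCE Workflow class: the class layout, the step names, and the message text are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Toy model of a step-based workflow with 'Save' and 'Proceed'."""
    steps: list
    current: int = 0
    saved: dict = field(default_factory=dict)    # data stored per step
    outbox: list = field(default_factory=list)   # (recipient, message) pairs

    def save(self, data):
        """'Save': store the data and stay at the same step."""
        self.saved.setdefault(self.steps[self.current], {}).update(data)

    def proceed(self, data, notify):
        """'Proceed': store the data, advance, and alert the next person."""
        if self.current + 1 >= len(self.steps):
            raise ValueError("already at the last step")
        self.save(data)
        self.current += 1
        message = f"Action required: {self.steps[self.current]}"
        self.outbox.append((notify, message))

    def overview(self):
        """Previous, current, and upcoming steps, as shown to the user."""
        return {
            "previous": self.steps[:self.current],
            "current": self.steps[self.current],
            "upcoming": self.steps[self.current + 1:],
        }
```

For example, `Workflow(steps=["Phase 0 start", "EB request", "EB formed"])` starts at the first step; calling `proceed` stores the data of that step, moves to "EB request", and queues one message for the person named in `notify`.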
The Workflow, Messenger, EgroupManager, and User FENCE classes (mentioned in Section 3) and the MBF infrastructure made possible the development of the Phase 0 system workflow. They do not, however, include the GitLab Integration, a key feature of the system, which is explained in detail in the next sections.
5 FENCE and GitLab integration
As mentioned in Section 4, the FENCE Phase 0 system was designed and implemented to create Git repositories automatically, which simplifies the analysis and the editing of any type of draft that supports it. Some Phase 0 features trigger GitLab commands. The integration of the software framework and the collaborative repository platform is described below.
5.1 GitLab structure to organise analysis groups and repositories
Whenever a Phase 0 entry is created, Git repositories are created in GitLab under the atlas-physics-office group. Each leading Physics or Combined Performance group or System Detector/Activity effort is labelled as a four-letter category in the FENCE systems that handle analysis and document creation. The full list of Physics and Combined Performance groups is shown in Table 1.
| Category | Group |
| FTAG | Flavour tag CP |
| HDBS | Higgs & Diboson Searches WG |
| HION | Heavy Ions WG |
| IDTR | Inner Detector Tracking CP |
| PMGR | Physics Modelling Group |
| STDM | Standard Model WG |
For example, the leading Top Quark physics group is TOPQ while the Electron/Gamma Combined Performance group is EGAM. The identifier (ID) of a Phase 0 FENCE entry has the form ANA-GROUP-YEAR-NN, where GROUP is the four-letter category (for example TOPQ, HIGG, or EGAM), YEAR is the year in which the entry was created, and NN is a two-digit counter. For instance, ANA-SUSY-2019-04 is the fourth analysis entry created in FENCE for the SUSY group in 2019.
An analysis group may evolve into a PAPER, a CONF note, or a PUB note. The identifiers (IDs) of those documents are therefore GROUP-YEAR-NN, CONF-GROUP-YEAR-NN, or PUB-GROUP-YEAR-NN, respectively. This naming convention preserves backward compatibility with the different entries used for each type of document before Phase 0 creation.
In PO-GitLab, an effort has been made to make the document IDs more logical. They are labelled:
ANA-GROUP-YEAR-NN-INTn for internal notes,
ANA-GROUP-YEAR-NN-PAPER for a paper,
ANA-GROUP-YEAR-NN-CONF for a CONF note, and
ANA-GROUP-YEAR-NN-PUB for a PUB note.
For example, in the Higgs category, the subgroup for a given Phase 0 analysis entry ANA-HIGG-2017-08 will host ANA-HIGG-2017-08-INT1, ..., ANA-HIGG-2017-08-INTn, ANA-HIGG-2017-08-PAPER, ANA-HIGG-2017-08-CONF, and ANA-HIGG-2017-08-PUB. Each repository is connected to the appropriate FENCE interface. This is illustrated in Figure 9, which shows the interface for the atlas-physics-office subgroups and repositories. ANA-HIGG-2017-08, a subgroup of HIGG, contains for example one paper repository and one internal note repository, ANA-HIGG-2017-08-PAPER and ANA-HIGG-2017-08-INT1, respectively.
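The naming convention above lends itself to a short Python sketch. The function names are hypothetical, and the slash-separated path returned by `repository_path` is an assumption about how the group hierarchy maps onto a GitLab path; only the ID patterns themselves come from the text.

```python
import re

# ANA-GROUP-YEAR-NN: four-letter category, four-digit year, two-digit counter.
ANALYSIS_ID = re.compile(
    r"^ANA-(?P<group>[A-Z]{4})-(?P<year>\d{4})-(?P<counter>\d{2})$"
)


def parse_analysis_id(analysis_id):
    """Split a Phase 0 ID into (group, year, counter)."""
    match = ANALYSIS_ID.match(analysis_id)
    if match is None:
        raise ValueError(f"not a Phase 0 analysis ID: {analysis_id!r}")
    return match["group"], int(match["year"]), int(match["counter"])


def repository_name(analysis_id, doc_type, note_number=None):
    """Repository name for an internal note, paper, CONF note, or PUB note."""
    parse_analysis_id(analysis_id)  # validates the ID
    if doc_type == "INT":
        if note_number is None:
            raise ValueError("internal notes need a note number")
        return f"{analysis_id}-INT{note_number}"
    if doc_type in ("PAPER", "CONF", "PUB"):
        return f"{analysis_id}-{doc_type}"
    raise ValueError(f"unknown document type: {doc_type!r}")


def repository_path(analysis_id, doc_type, note_number=None):
    """Assumed path: root group / category / analysis subgroup / repository."""
    group, _, _ = parse_analysis_id(analysis_id)
    name = repository_name(analysis_id, doc_type, note_number)
    return f"atlas-physics-office/{group}/{analysis_id}/{name}"
```

With these helpers, `repository_name("ANA-HIGG-2017-08", "INT", 1)` gives `ANA-HIGG-2017-08-INT1`, matching the example in the text.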
5.2 Middleware between FENCE and the GitLab REST API
A set of classes was created with the original aim of making the API easier to use across the FENCE systems. In practice, it is mostly used by the Analysis systems within the Analysis integration. Through the main class, called Gitlab, it is possible to handle all the basic operations offered by the API: create, get, and customise settings for projects, groups, and branches, handle commits, and carry out many other actions defined and explained in the REST API documentation [rest_api].
Each API endpoint can be accessed by one of the following HTTP methods: GET, POST, DELETE, and PUT. The FENCE-GitLab class uses them through methods detailed in Section B.1. Each of those methods makes a call to execMethod (see Section B.2), which configures the endpoint using the PHP cURL functions [php_curl] and executes one of the HTTP methods, returning the REST API answer. This can be a JSON document with metadata, a plain success message, or an error message.
The metadata returned by the execMethod are then used to populate the attributes of many classes representing elements, including Branch, File, Commit, Project, Group, Label, and Member. These can then be manipulated by any FENCE system.
An example is the creation of a paper repository. The createProject method (see Section B.3) is called with the project name (or an instance of the Project class) as the first argument and the project parameters (such as path, namespace, default branch, and description) as the second argument. The method calls the POST method mentioned above and stores the new repository metadata in a FENCE Project object, which can be used for further manipulations.
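The request behind such a call can be sketched in Python. This is not the FENCE middleware, which is written in PHP; the function names and the host in `API_ROOT` are placeholders. The `POST /projects` endpoint, its `name`, `path`, `default_branch`, and `description` attributes, and the `PRIVATE-TOKEN` header belong to the GitLab REST API (version 4). The sketch builds the request without sending it.

```python
import json
import urllib.request

API_ROOT = "https://gitlab.example.org/api/v4"  # placeholder host


def build_request(http_method, endpoint, token, params=None):
    """Prepare one REST call; roughly the role execMethod plays in the text."""
    if http_method not in {"GET", "POST", "PUT", "DELETE"}:
        raise ValueError(f"unsupported HTTP method: {http_method}")
    body = json.dumps(params).encode("utf-8") if params is not None else None
    request = urllib.request.Request(
        f"{API_ROOT}/{endpoint.lstrip('/')}", data=body, method=http_method
    )
    request.add_header("PRIVATE-TOKEN", token)
    if body is not None:
        request.add_header("Content-Type", "application/json")
    return request


def execute(request):
    """Send the request and decode the JSON answer (needs a real server)."""
    with urllib.request.urlopen(request) as response:
        text = response.read().decode("utf-8")
    return json.loads(text) if text else None


def create_project_request(name, token, **settings):
    """Request that would create a repository with the given settings."""
    return build_request("POST", "projects", token, {"name": name, **settings})
```

A call such as `create_project_request("ANA-HIGG-2017-08-PAPER", token, default_branch="master")` returns a prepared request; passing it to `execute` against a real GitLab instance would return the project metadata as a dictionary, which a wrapper could then copy into an object comparable to the FENCE Project class.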
5.3 FENCE-GitLab integration
The first interaction between FENCE and GitLab happens when a Phase 0 entry is created. A GitLab group named after the entry's reference code is automatically created, containing the first internal note repository. The content of this repository’s first commit is obtained from a source repository, the atlaslatex package, which contains the file templates. FENCE substitutes all the necessary variables into the file templates according to the metadata inserted when the entry was created in the system. After the commit, FENCE automatically de-protects the master branch, creates the protected PO-ready branch, and creates the PO-Publication label. The last step is to grant the developer permission to the Analysis Team E-group using LDAP synchronisation.
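The variable substitution step can be pictured as follows. This is only a sketch: the `<<NAME>>` placeholder syntax, the placeholder names, and the function names are hypothetical, and the actual atlaslatex templates may mark their variables differently.

```python
import re

# Hypothetical placeholder syntax: <<NAME>>.
PLACEHOLDER = re.compile(r"<<(\w+)>>")


def fill_template(text, metadata):
    """Replace every placeholder by the matching metadata value."""
    def substitute(match):
        key = match.group(1)
        if key not in metadata:
            raise KeyError(f"no metadata for placeholder {key!r}")
        return metadata[key]
    return PLACEHOLDER.sub(substitute, text)


def fill_templates(files, metadata):
    """Apply the substitution to a {path: content} mapping of template files."""
    return {path: fill_template(content, metadata)
            for path, content in files.items()}
```

Failing loudly on an unknown placeholder, rather than leaving it in the text, ensures that a half-filled template never reaches the first commit.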
Another FENCE and GitLab integration process is executed when Phase 0 is finished or skipped and the analysis proceeds to Phase 1 of a PAPER, CONF note, or PUB note. FENCE automatically creates an internal note repository, setting all the configuration elements that are needed. Additional internal note repositories can be appended at any time. The repositories holding the document are created and configured without any input from the editors, allowing for a streamlined process.
FENCE and GitLab also interact while handling the author list of a publication. Creating the author list at first circulation triggers a request through the GitLab API to check that the repository associated with the publication exists. Clicking the button labelled "Create and push to Gitlab" (see Figure 10) creates the author list according to its reference date in all the formats (including xml and tex). It then starts a dialogue between the two platforms, FENCE and GitLab, to push the files through the API. On first circulation, the files are added to GitLab, while on subsequent circulations, as they already exist, they are simply updated.
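The add-or-update decision can be sketched in Python. The function name, the commit message, and the way existing files are tracked are hypothetical; the endpoint `projects/:id/repository/files/:file_path`, the use of POST to create and PUT to update a file, and the `branch`, `content`, and `commit_message` attributes follow the GitLab repository files API.

```python
from urllib.parse import quote


def author_list_requests(project_id, branch, files, existing_paths, circulation):
    """Return (method, endpoint, payload) triples for pushing author list files.

    files          -- {path: content} of the freshly generated files
    existing_paths -- paths already present in the repository
    """
    requests = []
    for path, content in files.items():
        # POST creates a new file; PUT updates one that already exists.
        method = "PUT" if path in existing_paths else "POST"
        # The file path is part of the URL and must be URL-encoded.
        endpoint = f"projects/{project_id}/repository/files/{quote(path, safe='')}"
        payload = {
            "branch": branch,
            "content": content,
            "commit_message": f"Author list for circulation {circulation}",
        }
        requests.append((method, endpoint, payload))
    return requests
```

On first circulation `existing_paths` is empty, so every file is created with POST; on later circulations the same call yields PUT requests for the files that are already in the repository.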
6 PO-GitLab and CI tools
The ATLAS Physics Office GitLab tools (PO-GitLab) simplify the publication process of ATLAS documents by using the features provided by the CERN GitLab platform.
The previous publication workflow involved a heavy email exchange between ATLAS editors and the Physics Office to ensure that ATLAS rules were followed up to the submission of the paper to the arXiv or the journal. With this approach, modifications were usually made by different parties (officers and editors); they were sometimes not properly implemented, and the publication process slowed down. Because the tasks required to submit a publication are uniform and repetitive, an automated tool was favoured.
Three main tasks are handled by PO-GitLab up to the final submission: the automatic creation of GitLab repositories (Git repositories centralised in the remote platform), the real-time verification of technical rules by the GitLab Continuous Integration (CI) tools, and the automatic processing of the document itself. These tasks are described in this section.
6.1 Automatic document creation
A centralised area controlled by the ATLAS Physics Office needed to be designed first. This control is essential, as it allows the Physics Office to maintain the quality of the documents accepted for publication.
A basic structure is set up in GitLab to store the groups related to an analysis. The main GitLab group is called atlas-physics-office, and this represents the root of the group hierarchy tree. Each of its subgroups belongs to a leading group, for example HIGG, EXOT, SUSY, etc., as mentioned in Section 5. In the case of the publication shown in Figure 7, a subgroup of GENR called ANA-GENR-2018-01 would be created. Inside ANA-GENR-2018-01 there would be a specific repository for each type of document, designated ANA-GENR-2018-01-INT1, ANA-GENR-2018-01-PAPER, ANA-GENR-2018-01-PUB, and/or ANA-GENR-2018-01-CONF.
With this structure defined, it is possible to create documents automatically through FENCE, via the communication link between the framework and the GitLab API. This is explained in more detail in Section 5. This automation relies on file templates whose variables are substituted according to the requirements of the related publication. In this way, all created repositories contain the default documents, correctly formatted to start writing a PAPER, CONF note, PUB note, or internal note. The repository is also configured with a new protected branch named PO-ready, which means that only members with the Maintainer role are allowed to push and merge. This special branch is used to run the final submission pipeline when the document is ready and has been reviewed by the relevant parties. The master branch is used as the main work branch; it is unprotected at the time of repository creation, allowing all editors to push new commits and interact with the repository.
6.2 The real-time check with GitLab’s CI
GitLab CI tools are designed to automatically execute a set of tasks every time a new modification is introduced into the document (i.e. a new commit is pushed to the document repository). The Physics Office approach was to develop a Python package that runs different jobs on a given document, each verifying a distinct aspect. Given the modularity of the system, new and more complex tasks can be added, ensuring scalability.
GitLab’s CI is organised using pipelines. A pipeline is a set of jobs grouped in stages. All the jobs in the same stage are executed in parallel, while each stage is only executed after the previous one has completed. The dependencies among the jobs’ executions can be configured in different ways according to the status. For example, it is possible in some cases to start the jobs of the next stage only if the previous ones have finished successfully, and in other cases only execute them if the previous stage failed. Each time a new commit is pushed to the repository, a pipeline is triggered.
Different sets of checks are performed in each step of the publication process. For editors, all work done before the paper submission (detailed in Section 6.3) is monitored by the edit-pipelines, as shown in Figure 11. These pipelines are triggered by any push made to a branch whose name does not start with PO-. The special branches with the PO- prefix are tracked by the submit-pipelines, which run when a paper is considered ready for submission to the arXiv and the peer-reviewed journal.
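The branch rule and the stage ordering can be summarised in a small Python model. It is an illustration only: in practice the behaviour is configured in GitLab CI rather than coded like this, the function names are hypothetical, and jobs within a stage run one after another here whereas GitLab runs them in parallel. The model shows the default policy, in which a stage starts only if the previous one succeeded.

```python
def pipeline_for(ref_name):
    """Choose the pipeline type from the branch (or tag) name."""
    return "submit-pipeline" if ref_name.startswith("PO-") else "edit-pipeline"


def run_pipeline(stages):
    """Run stages in order and stop after the first stage with a failed job.

    stages -- list of (stage_name, {job_name: callable returning True/False})
    """
    report = {}
    for stage_name, jobs in stages:
        results = {name: bool(job()) for name, job in jobs.items()}
        report[stage_name] = results
        if not all(results.values()):
            break  # later stages are not started
    return report
```

For instance, if a job in a "Technical checks" stage returns `False`, the report contains that stage and the ones before it, while a following "Build" stage is never started.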
Figure 11 presents an example of an edit-pipeline that consists of the following set of stages:
Preparation: This consists of only one job that checks the current version of the package.
Technical checks: This stage includes checks related to LaTeX:
Figures exist: checks if all figures used in the document are present in the repository.
Files exist: checks if all the tex files included in the document are present.
Repeated commands: checks for repeated user-defined commands. It is not wise to use the same command for different purposes. This can present a problem when captions for figures and tables are being generated for the ATLAS public pages.
Repeated labels: checks for duplicate labels in all tex files.
Undefined references: checks for undefined references.
Unused labels: warns if a LaTeX label has been defined but not used. Although this is not a problem, it might point to an improper reference.
ATLAS checks: These are checks related to ATLAS rules and style:
Bibliography: checks that the bibliography files are included.
Cover logo: checks that the proper logo is being used in the ATLAS template.
Figure labels: checks the ATLAS labels (e.g. ‘ATLAS Internal’) in the legends of figures, depending on the type of document. Table 2 shows which labels are allowed in each document type.
Oversized figures: checks for figures larger than .
Preprint ID: checks that the preprint ID is included in the document.
Template version: checks that the version of the ATLAS LaTeX template is the latest one available.
Title and Abstract: checks that no user-defined commands (i.e. non-LaTeX commands) are being used in the title and in the abstract.
Build: this stage builds the document itself. As these pipelines run on every commit, the pdf file of the document is not stored as an artifact by default. Instead, a manual job, indicated by a gear icon, can be triggered by editors by clicking the play button in the interface; it produces the pdf file and saves it as an artifact for a user to download.
| Document type | Preliminary label | Internal label |
| PAPER | Not allowed | Not allowed |
| BOOK | Not allowed | Not allowed |
6.3 Paper submission
The CI also produces the required files for paper submission, using dedicated pipelines similar to the editing ones. These are called submit-pipelines. A protected Git branch, named PO-ready, is created by default when the paper repository is set up. When a paper is ready for submission, an editor creates a Merge Request from the master branch to the PO-ready branch. When this request is accepted by a Physics Office officer, the paper submission pipelines are triggered. In addition, any branch or tag created following the pattern PO-* triggers the paper submission pipelines. These pipelines run the previously described tests and then, at the build stage, flatten the LaTeX document through the following actions:
all the source files are merged into a single LaTeX source file;
all the comments in the LaTeX source file are removed;
all the figures are renamed following the convention required by the journals;
any directory structure is removed.
The various actions are shown in Figure 12.
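The flattening actions can be illustrated with a simplified Python sketch that works on an in-memory mapping of file names to contents. It is not the production pipeline: only `\input` is expanded, the `fig_NN` naming is a hypothetical stand-in for the journal convention, and comments are cut back to a bare `%` rather than deleted outright, which is a deliberate choice here to preserve LaTeX line-end spacing.

```python
import posixpath
import re

INPUT = re.compile(r"\\input\{([^}]+)\}")
GRAPHICS = re.compile(r"(\\includegraphics(?:\[[^\]]*\])?\{)([^}]+)(\})")


def strip_comments(tex):
    """Drop comment text; the % itself is kept to preserve line-end spacing."""
    return re.sub(r"(?<!\\)%.*", "%", tex)


def merge_inputs(name, sources):
    """Recursively replace \\input{file} by the content of that file."""
    def expand(match):
        child = match.group(1)
        if not child.endswith(".tex"):
            child += ".tex"
        return merge_inputs(child, sources)
    # Comments are stripped first so that commented-out inputs are ignored.
    return INPUT.sub(expand, strip_comments(sources[name]))


def rename_figures(tex):
    """Give figures flat, numbered names; returns (new_tex, old->new mapping)."""
    mapping = {}

    def rename(match):
        old = match.group(2)
        if old not in mapping:
            extension = posixpath.splitext(old)[1]
            mapping[old] = f"fig_{len(mapping) + 1:02d}{extension}"
        return match.group(1) + mapping[old] + match.group(3)

    return GRAPHICS.sub(rename, tex), mapping


def flatten(main_name, sources):
    """Single source file, no comment text, no directory structure."""
    return rename_figures(merge_inputs(main_name, sources))
```

The returned mapping records which original figure file corresponds to each new name, so the figure files themselves can be copied under the new names when the tarball is assembled.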
Tarballs suitable for submission to the arXiv and to journals are created using TeX Live 2016 and TeX Live 2017, respectively. Two versions are needed because of differences in bibliography handling and to avoid incompatibilities: the arXiv favours TeX Live 2016, while some APS journals, for example, require TeX Live 2017. The tarballs also contain files with plots and tables for the public web page. They are created as GitLab artifacts and can be downloaded by the corresponding editors and members of the Physics Office. The auxiliary material (figures and tables not for submission) is not included in the submission tarballs.
7 Author lists, acknowledgements, and the proof checker
7.1 Author lists and acknowledgements files
The author list, often written authorlist for convenience, is the inventory of qualified authors at a given date, which is called the reference date. Every paper has a related list of qualified authors with a reference date that corresponds to the creation date of that list at the PAPER Phase 1, just before the first circulation of the draft document to the collaboration. Qualified authors are active physicists contributing to the maintenance and operation of the experiment. Some of them are retired people applying their pre-data credits (obtained before the data-taking era); they are called signing-only authors. Between FENCE Phase 1 and Phase 2, some people may receive exceptional authorship because of their involvement in the analysis or the paper, even if they are not yet qualified as authors through the usual process. The author list is therefore updated to include these “exceptional” authors. The special cases are studied by the Authorship Committee and proposed for approval to the Spokesperson, who accepts or rejects each exception after reviewing the proposal from the Authorship Committee.
This information is stored in the ATLAS database and managed by FENCE. Figure 13 shows the full list of members (active and retired), their affiliations, and the related metadata that are needed to generate the full report of members and institutes.
The acknowledgements form a legal paragraph that the collaboration agrees to include in each paper to thank funding agencies for their financial support. They do not change very often, but a funding agency or a foundation may be added or removed at a given date. Therefore, similarly to the author list, the acknowledgement file is built for each paper at the reference date.
Both files, the author list and the acknowledgements, are built using the FENCE framework (see Figure 14) and are automatically pushed to the appropriate GitLab repository using the FENCE-GitLab integration (Section 5.3). Their integration into the paper is straightforward at the time of submission to a journal. FENCE provides an elegant way to retrieve the required information from the database (see Section 3.2) and build all the files.
The author list is built by the FENCE framework into an xml file. This is composed of three main blocks:
The xml file serves as the reference, since it contains all the information needed to build the other files. It is the first one to be generated. A backup version of the first release of the author list is stored.
The acknowledgement tex file is built using a standard template and is filled using the FENCE framework to retrieve the required information about the ATLAS funding agencies.
7.2 Main functionalities of the FENCE author list user interface
The FENCE author list interface (Figure 14) shows the complete set of author lists created for every ATLAS paper that is being submitted or has been published since 2009. They are easily filtered using the SEARCH box. All the columns are self-explanatory; in the last column the drop-down menu gives access to the author list location, which can be distinguished by the icon. A download icon means the files are stored in AFS and can be downloaded. A GitLab icon means the paper and the files are located in a repository. The author lists can be downloaded or displayed in the following file formats:
tex: used by the editors to include the author list into the draft publication;
xml: a structured file containing all the author list information. It is used by both the arXiv and the journal as the main database of the paper;
csv: a comma-separated values file used to export authorlist metadata;
pdf: a view of the author list;
cds: a simple text file with the author list information in the format author: institute.
7.3 Proof checker functionalities
Once the author list has been sent to the journal with the publication, a check is made to determine whether the publisher has correctly used the information provided at the paper production step. This check compares the journal pdf file, sent back to the ATLAS Collaboration for proof review, with the original xml/tex file. The process used to be done by hand, requiring the officer to verify that each of the roughly 3000 authors and 200 institutes was correctly reported and matched.
The proof checker is the tool provided for ATLAS to compare the publication (pdf file) of author lists and acknowledgements provided by the journal with the ATLAS data (xml) file. A report of this comparison, one for every version of the proof, is available to ATLAS PO-Pub officers who check the results. The proof checker follows this process:
retrieve the information from the xml file, containing the authors and their affiliations;
extract the text from the journal’s pdf file;
parse the text from the pdf file, creating the target reference;
compare the official reference obtained from the xml file with the target reference;
create a report with the differences found between the original and the target reference;
link the report to the main report page, see Figure 15.
The main difficulty in this process lies in extracting the content from the pdf file; the text is not easily retrieved, for a variety of reasons. One is that many elements have to be identified and ignored, such as row numbers, watermarks, footers, and headings. Another is that words extracted from a pdf file do not follow a specific coding convention; the file can contain non-ASCII characters that can be output in many different ways. The pdf file can specify a predefined encoding standard to use, or provide a lookup table of differences between a predefined and a built-in encoding standard; for fonts with uncommon Latin characters, which are routine in this kind of publication, special encoding is used. It is necessary to provide a ToUnicodeTable where semantic information about the characters is preserved. The proof checker also has to pass through the whole publication text and recognise where the author list starts and ends, and where the institute list starts and ends. All this is made more difficult by the fact that different publishers have different layouts and create different versions of pdf files, so the above problems are often not generic but specific to a particular publisher.
After the target reference is created, the comparison looks for:
authors that seem to be missing from the pdf file. Here, false positives are often due to character encoding and spaces;
authors with inconsistent punctuation. This section points out differences in the punctuation of authors' first-name initials between the original and target references; the initials can follow the rules X. or X.Y. or X.-Y. or X-Y., with or without a space;
institutes that seem to be missing from the pdf file. Here false positives are often due to non-standard characters that break the entry;
institutes with close matches. All the entries that look like the original but have some inconsistencies land in this group. Some publishers replace USA with United States of America (or vice versa). Sometimes there is a new character that does not break the institute entry, but makes it so that the match is not perfect, for example, “Università” and “Universit‘ a”;
mismatched authors. All the authors collaborate through one or more institutes. It is checked that the link between the author and the institute is consistent. This sometimes results in a false positive, because it is not always easy to extract from the pdf file the index number of an institute, mainly because the text coming from the pdf file also includes other elements such as line numbers of the document. For this reason an author originally assigned to institute number X can end up matched with target institute YX, because in the text extracted from the pdf the number X might be preceded by a Y line number; institute YX may not exist;
deceased authors. In some cases, ATLAS has tagged authors as deceased but the publisher did not mark them as such, or vice versa;
missing funding agencies, or those wrongly added by the publisher.
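The distinction between a perfect match, a close match, and a missing entry can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the proof checker's algorithm: entries are compared after folding away accents, punctuation, and spacing, which only works for Latin-script text.

```python
import re
import unicodedata


def fold(text):
    """Drop accents, punctuation, and spacing; keep ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^0-9a-z]", "", base.casefold())


def classify(original, extracted):
    """Compare one original entry with one entry extracted from the pdf."""
    if original == extracted:
        return "match"
    if fold(original) == fold(extracted):
        return "close match"  # differs only by accents, punctuation, or spaces
    return "missing"
```

With this folding, X.-Y. Noname and X-Y. Noname are classified as a close match, as are “Università” and “Universit‘ a”, whereas a name whose accents were extracted as separate letters still counts as missing and has to be handled by other means.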
In early 2019, due to changes in CERN systems, the component written in PROLOG that ran the comparison went out of service, creating an urgent need for a new tool for this task. PROLOG follows the logic programming paradigm: a problem is expressed as a set of logical statements rather than as an explicit algorithm for solving it. The language is difficult to maintain, both because of this paradigm and because few developers work with it. Python was chosen as its replacement.
A way to obtain the best match among all the items of an array of institutes and authors was sought, because one cannot rely on finding an author or institute in the same position of the sequences in the xml and pdf files. For this purpose the concept of Levenshtein distance (Section D.1) was applied, so that a weighted index of similarity can be obtained to decide what is matched with what, and to then effectively check for anomalies.
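A minimal Python sketch of this matching strategy follows. The Levenshtein distance is the standard dynamic-programming algorithm; the normalisation of the distance into a similarity between 0 and 1 and the threshold of 0.8 are assumptions made for the illustration and are not taken from the proof checker.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(entry, candidates, threshold=0.8):
    """Most similar candidate, or None if nothing reaches the threshold."""
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda c: similarity(entry, c))
    score = similarity(entry, best)
    return (best, score) if score >= threshold else (None, score)
```

Because every original entry is compared with all extracted entries, the result does not depend on the order in which authors or institutes appear in the xml and pdf files.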
A feature was developed to help the script treat as perfect matches some entries that would not otherwise appear to be such. A list of synonyms (Section 7.3.1) is created for every entry, author or institute, to teach the proof checker to validate similar strings when the differences are due only to problems in decoding the text from the pdf file. So, for instance, if author X. Nonamečič is not found in the target reference, but an author with name X. Nonamež ciž c was extracted from the pdf entries, then, as it has been previously verified that the name appears as expected in the pdf file, the proof checker considers it a perfect match and skips the problem. A very long list of false positives can be found in the report page as “skipped items”. The list of synonyms is updated manually, but a tool, the Synonym web page (Section 7.3.2), has been created to allow users to update this list themselves.
7.3.1 Proof checker synonyms
As introduced in Section 7.3, the comparison between the pdf file and the xml file can generate false positives. To minimize the list of false positives in the report page, the new version of the proof checker includes a synonyms list that allows the comparison script to understand if the difference is a real error or another correct way to display the same information.
An example of a working synonym is:
|Institute as stored into ATLAS DB & xml file|
|Physics Department, SUNY Albany, Albany NY, United States of America|
|Institute as written on the journal’s author list|
|Physics Department, SUNY Albany, Albany, New York, USA|
These differences are acceptable, since the main information is correctly displayed and no real errors are found.
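How a synonym list turns a false positive into a skipped item can be sketched as follows. The data structure (a mapping from the original entry to a set of accepted spellings) and the function name are hypothetical; the institute strings are the ones from the example above.

```python
SYNONYMS = {
    "Physics Department, SUNY Albany, Albany NY, United States of America": {
        "Physics Department, SUNY Albany, Albany, New York, USA",
    },
}


def check_entry(original, extracted_entries, synonyms):
    """Classify one original entry against the entries extracted from the pdf."""
    extracted = set(extracted_entries)
    if original in extracted:
        return "match"
    if synonyms.get(original, set()) & extracted:
        return "skipped"  # known, previously verified variant
    return "missing"
```

An entry classified as "skipped" would be listed under the skipped items of the report rather than among the real differences.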
7.3.2 Synonym web page
To manage the list of proof checker synonyms, ATLAS provides a web page that allows users to search for an existing entry and manage the recorded synonyms. Searching for an institute or author will display the list of records that match the search criteria, see Figure 16. This allows users to edit the synonyms for the record. Clicking the edit icon shows a new page section where users can insert their own known synonym for the record. After confirmation, this is added to the list of synonyms and is taken into account by the next run of the proof checker.
7.3.3 Report page
The proof checker provides a report after its run, one for each paper and draft version. This report is provided and stored in a JSON file and must be parsed to show the report results in a human-readable way. This is done by the proof_report web page, see Figure 17. The report contains all the paper information plus the comparison results sorted by topic (see Section D.4). The JSON file contains more information than that which is displayed; this is done to allow the web page to optimize the display of the huge amount of information and to retain data for future improvements. The web page contains some hidden sections that are produced by the proof checker via the known synonyms. These can be displayed by clicking on ‘Skipped +’. Here the page will show all the false positive results that the proof checker found on its comparison, but that are ignored after association with the synonyms.
The proof checker helps the Physics Office staff in a tedious task, but it is far from being a perfect tool. It needs to be continuously maintained and updated for new cases, changes in publication layouts, and new conventions in the author lists and their format. Further improvements are planned, with the goal of reducing the number of cases to be checked manually by the user to just a couple of dozen.
8 Handling the metadata
The ATLAS database stores data of various kinds that are displayed in different ways via web pages. The FENCE framework provides an API to retrieve this information. A call to the API, allowed after a user authentication, provides the results in a JSON format. This kind of information is easily parsed by most common programming languages and is standard for API results.
There are three main ways in which ATLAS provides web pages:
standard HTML pages;
include files for TWiki pages;
FENCE web pages.
The first two options run on an ATLAS PO Virtual Machine, which provides scripts, cron-jobs, or HTML pages to the users. This Virtual Machine is directly connected to the FENCE framework to use its API and retrieve data, parse it, and store it in the EOS ATLAS file system.
8.2 ATLAS data in public pages
FENCE also allows people who are not members of the ATLAS Collaboration to access some of the information stored in its database. It provides various ways to retrieve and show the information, for example through a cron-job that runs on the ATLAS PO Virtual Machine, extracts the data, parses it, and shows it to the user.
An example in which the data are retrieved using the FENCE API with a cron-job on the ATLAS PO Virtual Machine is the ATLAS map web page, Figure 18, where users can see a map of all active member institutes of the ATLAS collaboration.
This dynamic web page filters the results to use only the active institutes. This process is done on the ATLAS PO Virtual Machine by a Python script, which makes a request to the API, parses the results, and builds its own JSON file. This file contains all the institute information (name, country, links, coordinates, etc.) and the layers to build the map. Once the Python script builds the JSON file, the output is interpreted by the web page, which takes care of displaying the layers and the markers for the institutes.
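The filtering and JSON-building step can be sketched in Python. The record fields (`active`, `name`, `country`, `latitude`, `longitude`) and the layout of the output are hypothetical, since the actual API response and map file formats are not described here; the API request itself is left out.

```python
import json


def build_map_data(api_records):
    """Keep only active institutes and build the JSON read by the map page."""
    markers = [
        {
            "name": record["name"],
            "country": record["country"],
            "lat": record["latitude"],
            "lon": record["longitude"],
        }
        for record in api_records
        if record.get("active")
    ]
    return json.dumps({"markers": markers}, ensure_ascii=False, indent=2)


# Invented sample records standing in for the parsed API answer.
sample_records = [
    {"name": "Institute A", "country": "Country A", "active": True,
     "latitude": 46.2, "longitude": 6.1},
    {"name": "Institute B", "country": "Country B", "active": False,
     "latitude": 40.7, "longitude": -74.0},
]
map_json = build_map_data(sample_records)
```

Writing the result to a file from the cron-job lets the web page load a small, pre-filtered document instead of querying the API on every visit.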
8.3 Data on TWiki pages
The public results page, Figure 19, is an example of an include: an HTML page retrieves data using the FENCE API and displays it to the user in a TWiki page. This page shows the full list of papers, CONF notes, and PUB notes stored in the ATLAS database and managed by the FENCE framework. It also allows users to filter results using the buttons at the top of the page. The page loads 1300 records with all the related information, and retrieving all of them at once takes a long time. To avoid this, the page initially loads only the first 10 records of each section and becomes available for user interaction, then loads the other records in the background. This makes the page usable sooner and spares users from waiting for the complete data to load.
8.4 Data on FENCE public pages
Although FENCE web pages are normally restricted based on users’ roles, the FENCE framework also allows publicly accessible web pages to be generated. This solution allows the developers to use all the powerful FENCE functionalities (MBF, for example) and to simplify the data retrieval process. In addition, it allows the information to be loaded on demand, without cron-jobs and without passing through the API, which would increase the web page loading time.
An example of a public web page completely built using the FENCE framework is the ATLAS Conference and Talks page, shown in Figure 20. This retrieves all the talks registered within ATLAS, grouped by conference, and displays a summary of the information for each talk and conference, including speaker, institute, conference name, date, and location. All the table’s columns have search fields so that users can easily find the talk they are looking for without scanning all the records displayed on the page. There is an option to filter the results as future, past, or all (the default option). The page also contains some internal links that point to FENCE web pages (such as the link to the speaker profile). Such links are marked as internal because they require authentication to access non-public data.
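The future, past, or all filter can be sketched as below. This is only an illustration of the logic under stated assumptions: the `date` field, the function name, and the choice to count a talk given today as "future" are not taken from the FENCE implementation.

```python
from datetime import date


def filter_talks(talks, mode="all", today=None):
    """Return talks that are in the future, in the past, or all of them.

    `mode` defaults to "all", mirroring the default option on the page.
    """
    today = today or date.today()
    if mode == "future":
        return [t for t in talks if t["date"] >= today]
    if mode == "past":
        return [t for t in talks if t["date"] < today]
    if mode == "all":
        return list(talks)
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up sample data.
talks = [
    {"title": "Talk 1", "date": date(2019, 5, 1)},
    {"title": "Talk 2", "date": date(2021, 9, 15)},
]

reference_day = date(2020, 1, 1)
upcoming = filter_talks(talks, "future", today=reference_day)
earlier = filter_talks(talks, "past", today=reference_day)
```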
This page was built using the MBF infrastructure described in Section 3.2. Before MBF was available, developers had to create public web pages on TWiki or from scratch and retrieve all the data by accessing the database directly.
This article summarises the tools that have been set up to support the publication of documents by the ATLAS Collaboration. While the emphasis is on papers published in refereed journals, the technology created also supports internal documents and other public documents such as Conference and Public notes.
The FENCE framework is used as the backbone of the whole setup and is also used to interface the web-based tracking of the status of an analysis with the documentation in . Extensive use is made of the Continuous Integration tools available in to ensure that documents can easily be submitted to the arXiv and journals as soon as they have been approved by the collaboration.
The software solutions described in this document are now used to accompany the whole of a physics analysis, from the expressions of interest by research groups to the final journal publication. They also cover the generation of the appropriate author list and the proof-reading process.
The tools are used by the whole collaboration and minimise the amount of manual work required for repetitive procedures, easing the workload of editors, editorial boards, Management, and the Physics Office. At the same time, all documents connected to an analysis can now be accessed from a central tool where the experiment’s rules and knowledge are codified and made available in an intuitive way.
The authors are indebted to the ATLAS Collaboration for the support provided to achieve the results described in this paper. We are grateful to ATLAS collaborators who provided invaluable comments and input to the paper and the framework it presents. Special acknowledgements go to Marzio Nessi for helping initiate the Glance project in ATLAS and for supporting its development, and to Kathy Pommes for supervising the Glance team at CERN. Special thanks to Giordon Stark for thoroughly reviewing this paper.
Appendix A Classes for analysis and paper phases
A.1 Graph class
A.2 Action class
A.3 User authorisation class
Appendix B FENCE and integration classes
B.2 execMethod function
B.3 createProject function
Appendix C Author list files
C.1 Author list XML file header
C.2 Author list XML file institutes
C.3 Author list XML file authors
Appendix D Proofs checks
D.1 Levenshtein distance
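Mathematically, the Levenshtein distance between two strings $a$, $b$ of length $|a|$ and $|b|$ respectively is given by $\operatorname{lev}_{a,b}(|a|,|b|)$, where

\[
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1 \\
\operatorname{lev}_{a,b}(i,j-1) + 1 \\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise.}
\end{cases}
\]

Here $1_{(a_i \neq b_j)}$ is equal to 0 when $a_i = b_j$ and equal to 1 otherwise, and $\operatorname{lev}_{a,b}(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.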
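As an illustration, the Levenshtein distance defined above can be computed with the standard row-by-row dynamic programming scheme. The sketch below is a generic textbook implementation, not the code used for the proofs checks; the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b.

    Only two rows of the distance table are kept in memory:
    `previous` holds the distances for the first i-1 characters of a,
    `current` those for the first i characters.
    """
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1  # the indicator term
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A small distance between a sentence in the submitted text and the corresponding sentence in the proofs indicates a minor edit, while a large distance flags a passage worth checking by hand.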