Technologies and digitalisation are rapidly emerging in our society. People rely significantly on software applications and computing devices in their daily lives, consequently leaving traces of their digital activities, contributions and communications on those devices and across the Internet (Statista2021b). Even when people are not using software, data about their everyday activities may still be collected by software applications through ubiquitous IoT and GPS devices, surveillance cameras, face recognition apps and so on. Thus, our privacy is under constant threat in the current digital age. In fact, privacy invasions and attacks have increased significantly in recent years (Statista; OAIC2021). For example, a cyber crime in the U.S. in 2018 exposed 471 million personal records, and a breach of a national ID database in India leaked over 1.1 billion records, including biometric information (e.g., iris and fingerprint scans) (Statista2021). These incidents and threats raise an urgent need for privacy to be deeply integrated into the development, testing and maintenance of software applications.
Although security and privacy are often discussed together, they are not the same (Bambauer2013). Security often refers to protection against unauthorised access to software applications and the data they collect and store. Privacy, on the other hand, relates to protection of individuals' rights to their personally identifiable information in terms of how those personal data are collected, used, protected, transferred, altered, disclosed and destroyed (Scholz2015; ISO/IEC2011; Norton; DataPrivacyManager2021; SNIA2021). For example, security controls are put in place to ensure that only people with credentials have access to a software application in a hospital. However, if anyone with valid credentials can see patient health records using this software, then privacy is not protected. This example demonstrates that security can be achieved without privacy; however, security is an essential component of privacy protection.
Cyberattacks, either in the form of security or privacy attacks, are often formed by exploiting vulnerabilities or weaknesses (the two terms are often interchangeable; hereafter, we use vulnerabilities to refer to both of them) found in software systems. For instance, the infamous WannaCry ransomware attack exploited a vulnerability in Microsoft Windows systems, while the Heartbleed vulnerability in OpenSSL has made millions of websites and online platforms across the world vulnerable to cyberattacks. To prevent similar attacks, efforts have been put into understanding and publicly disclosing vulnerabilities so that developers can identify and fix them in their software applications. These efforts have resulted in the widely-known Common Weakness Enumeration (CWE), and Common Vulnerabilities and Exposures (CVE) systems (CWE; CVE).
However, there has been very little work (e.g., Antn2004; Yang2013; Ma2013) on identifying privacy vulnerabilities. A system which specifically records common privacy vulnerabilities does not exist yet. Thus, software developers often rely on the CWE and CVE systems to learn about privacy-related weaknesses and vulnerabilities. However, it is not clear to what extent privacy concerns are covered in those systems, and whether privacy receives the attention it deserves. To answer these questions, we have collected all 922 weaknesses recorded in CWE and 156,537 records registered in CVE to date, filtered out non-privacy-related records and further analysed the shortlisted records that are privacy-related. We have found only 41 and 157 privacy vulnerabilities in the CWE and CVE systems respectively. The coverage of privacy-related vulnerabilities in both systems is very limited, only 4.45% in CWE and 0.1% in CVE. (Contribution 1)
The next questions we aimed to explore are what privacy threats are covered by those privacy-related vulnerabilities in the CWE/CVE systems and whether they adequately cover the privacy threats raised in both research and practice. To answer these questions, we have conducted an explanatory study on the privacy engineering literature, privacy standards and frameworks (e.g., ISO/IEC 29100), regulations in different jurisdictions, including the European Union General Data Protection Regulation (EU GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Gramm-Leach-Bliley Act (GLBA), the U.S. Privacy Act (USPA) and the Australian Privacy Act (APA), and reputable industry sources (e.g., OWASP (OWASP2020) and Norton (Nortona)). This explanatory study informed our development of a taxonomy of common privacy threats that have been raised in research and practice. The taxonomy builds upon an existing well-known privacy threat taxonomy (Stallings2019). Multiple raters/coders then examined all 41 and 157 privacy vulnerabilities in the CWE and CVE systems, and mapped them to this taxonomy. Cohen’s Kappa coefficient, used to measure the inter-rater agreement, was 0.874 and 0.875 for the CWEs and CVEs respectively, an almost perfect agreement, suggesting strong reliability of the classification. We found that the existing privacy weaknesses and vulnerabilities reported in the CWE and CVE systems cover only 13 out of 24 common privacy vulnerabilities raised in research and practice. Many important types of privacy weaknesses and vulnerabilities are not covered, such as improper personal data collection, use and transfer, allowing unauthorised actors to modify personal data, processing personal data at third parties, and improper handling of user privacy preferences and consent. (Contribution 2)
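To make the agreement measure concrete, the following is a minimal sketch of how Cohen's Kappa can be computed for two raters. It assumes exactly two raters labelling the same records (the text mentions multiple raters without giving the exact set-up), and the threat labels and ratings below are hypothetical illustrations rather than data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of the marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical threat labels assigned by two raters to six records.
rater_1 = ["exposure", "exposure", "consent", "collection", "exposure", "consent"]
rater_2 = ["exposure", "exposure", "consent", "exposure", "exposure", "consent"]
kappa = cohen_kappa(rater_1, rater_2)
```

In this toy example the raters agree on five of six records and the chance agreement is 16/36, giving a kappa of about 0.7; values above 0.8, such as those reported above, are conventionally read as almost perfect agreement.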
These actionable insights led to our proposal of 11 new common privacy weaknesses to CWE. We chose CWE instead of CVE since CVEs specify unique vulnerabilities detected in specific software systems and applications, while CWEs are at a more abstract, generic level. These new CWE entries cover the areas of privacy threats that have been raised in research and practice but do not exist in CWE yet. To further confirm the relevance and validity of our proposal, we extracted real code examples from software repositories that match the new CWEs. Our contribution follows the CWE’s true spirit of a community-developed list, and will enhance the CWE system to serve as a common language and baseline for identifying, mitigating and preventing not only security but also privacy weaknesses and vulnerabilities. (Contribution 3)
The remainder of the paper is structured as follows. Section 2 discusses related existing work on security and privacy vulnerabilities in software applications. The identification of privacy-related vulnerabilities in CWE and CVE is presented in Section 3. Section 4 discusses the taxonomy of privacy threats and how the privacy-related vulnerabilities in the CWE and CVE systems cover those privacy threats. Section 5 presents a new common privacy weakness proposal. Threats to validity are discussed in Section 6. Finally, we conclude and discuss future work in Section 7.
2 Related Work
Several systems have been established to standardise the reporting process and structure of common vulnerabilities (e.g., CWE (CWE), CVE (CVE) and OWASP (OWASP2020)). However, identifying the root causes of the reported vulnerabilities is still a time-consuming process that requires expertise (Li2017; Gonzalez2019; Liu2020a). Moreover, privacy vulnerabilities were not addressed in that previous work.
Substantial research has been done on detecting vulnerabilities in software systems (Wang2010; Liu2012b; Chernis; Lin2020; Chakraborty2021; Hanif2021). Early vulnerability discovery approaches evolved from static analysis, fuzzing, penetration testing and Vulnerability Discovery Models (VDMs) (Wang2010; Liu2012b). Later approaches applied more advanced techniques such as machine learning, deep learning and neural networks to improve the accuracy of vulnerability detection (Chernis; Lin2020; Chakraborty2021; HoaTSE21; Hanif2021). These approaches focus heavily on detecting security vulnerabilities in software systems, while the detection of privacy vulnerabilities has been overlooked.
Recent work has also studied and used the CWE and CVE systems. For example, the work in (Bhandari2021) collected CVE records with their associated CWEs and code commits. The collected information was then analysed to produce insightful metadata such as the programming language concerned and code-related metrics. This work can be applied in multiple applications related to software maintenance such as automated vulnerability detection and classification, analysis of vulnerability-fixing patches and program repair. Galhardo2020 proposed a formulation to calculate the most dangerous software errors in CWE. They used this formulation to identify the top 20 most significant CWE records in 2019. Again, this prior work focuses only on security vulnerabilities.
Limited work has been done on identifying privacy vulnerabilities in software systems (Antn2004; Ma2013; Yang2013). Antn2004 proposed a taxonomy of privacy goals based on Internet privacy policies. The study employed content analysis through a goal mining process to extract privacy goals from 25 privacy policies in e-commerce industries. The process consists of three steps: goal identification, classification and refinement. The identified privacy goals were classified into privacy protection and privacy vulnerability goals. The privacy vulnerability goals address a set of information processing activities that may violate consumer privacy (e.g., information monitoring, collection and transfer).
Yang2013 introduced a framework to detect privacy leakage in mobile applications. This study identified several common privacy vulnerabilities in Android such as unintended sensitive data transmission and local logging. Ma2013 discussed a privacy vulnerability in mobile sensing networks which collect mobility traces of people and vehicles (e.g., traffic monitoring). Although these networks receive anonymous data, the study showed that these data can identify victims. These studies have confirmed the occurrence of privacy vulnerabilities in multiple types of software systems (e.g., web/mobile applications and sensing networks). However, most of the existing studies focused only on security concerns when investigating software vulnerabilities, thus overlooking privacy-related concerns in many contexts.
A large number of existing studies have investigated methods to preserve and protect privacy in software development. Several studies have proposed approaches to derive privacy requirements from organisational goals (Kalloniatis2008), data protection and privacy regulations (Breaux2008; Mihaylov2016a; Ayala-Rivera2018; Guo2020) or privacy policies (Omoronyia2012; Massey2013) to ensure that software systems comply with those restrictions and/or constraints. Tschersich and Yang2016b presented frameworks for designing privacy-preserving architecture in software development. Recent work has examined how organisations or service providers have complied with individual rights in data protection regulations. For example, Kroger2020 studied how iOS and Android app vendors respond when users request access to their personal data. This request is a mechanism to exercise the right of access, one of the individual rights covered in GDPR and other privacy regulations. The study found that only half of the app vendors responded to the user access requests. This violates the right of access, as users must be able to request access to their personal data and receive a response.
A number of studies have proposed taxonomies of privacy threats (e.g., Solove2006a; Kotz2011; Alsamani2018). Kotz2011 developed a taxonomy of privacy threats for mhealth systems (mobile computing for personal health monitoring). It emphasises identity, access and disclosure threats to patients and their personal health information. The threats in this taxonomy can be caused by patients, internal parties and external parties. Alsamani2018 proposed a taxonomy of security and privacy threats for IoT. The taxonomy identifies the security and privacy threats posed by IoT objects (e.g., sensors and cameras). It adopted four groups of threats developed by Solove2006a to analyse privacy concerns of IoT objects. These two studies cover limited areas of existing software applications. One of the most well-known privacy threat taxonomies was proposed by Solove2006a. This taxonomy covers the harmful activities that can violate the privacy of individuals. However, the taxonomy does not focus on privacy threats in software systems. Thus, we have adapted and extended this taxonomy to address privacy threats in software engineering.
3 Identifying privacy-related vulnerabilities in CWE and CVE
Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) are two well-known systems for publicly known weaknesses and vulnerabilities in software and hardware. A vulnerability is defined as “a weakness in an information system, system security procedures, internal controls, or implementation that could be exploited or triggered by a threat source” (Guitierrez2013).
The CWE system identifies common categories of flaws, bugs and other errors found in software and hardware implementation, code, design or architecture that could be vulnerable to attacks (CWE). The CWE system has three views: i) by research concepts, ii) by software development and iii) by hardware design. Our study focuses on the research concepts and software development views as they are related to software applications. A CWE record consists of a description, relationships to other CWE records, likelihood of exploit, demonstrative examples, observed examples, common consequences, mitigations and other relevant attributes. Interested parties can use this information to identify a weakness in their software systems and applications. For example, CWE-359 (https://cwe.mitre.org/data/definitions/359.html) describes a weakness that exposes private personal information to an unauthorised actor (see Figure 1). This CWE record also provides demonstrative examples, one of which is a code fragment that exposes a user’s location. The CWE record also specifies potential mitigations such as identifying and consulting all relevant privacy regulations regarding the processing of users’ location.
CVE is a catalogue of cybersecurity vulnerabilities that may exist in software products, applications and open libraries (e.g. Skype, Mozilla Firefox and Android). These vulnerabilities and their details are reported by organisations that have partnered with the CVE program. Once a vulnerability is reported, the CVE Numbering Authorities (CNAs) are responsible for determining each vulnerability record and assigning a unique identifier (i.e. CVE ID) to that record. Each record describes the details of a vulnerability and specifies the affected versions of the software. The CVE records are then published and made accessible to interested parties (e.g. software developers, organisations and researchers). These records specify unique vulnerabilities detected in specific software systems and applications, thus they are more specific than CWE records. For example, CVE-2000-1243 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-1243) refers to a privacy leak in version 3.04 of Dansie Shopping Cart in which sensitive information such as user credentials was sent to an e-mail address controlled by the product developers.
In the absence of a system which specifically records common privacy vulnerabilities, software engineers and other interested parties often rely on the CWE and CVE systems for privacy weaknesses and vulnerabilities (like the one in Figure 1). However, both CWE and CVE target cybersecurity, and although security and privacy are often discussed together, they are not the same. Security vulnerabilities are often exploited through unauthorised access to perform malicious actions in software applications. By contrast, privacy vulnerabilities may lead to violations of individuals' rights to their personally identifiable information in terms of how those personal data are collected, used, protected, transferred, altered, disclosed and destroyed. Hence, we have explored to what extent privacy concerns are covered in the CWE and CVE systems, and whether privacy receives the attention it deserves.
The privacy vulnerability identification process (see Figure 2) consists of the following steps: (i) obtaining the CWE and CVE lists, (ii) determining a list of keywords and performing a keyword search, (iii) identifying privacy-related criteria and annotating the CWE and CVE records, and (iv) performing privacy vulnerability analysis.
To identify the privacy vulnerabilities in the CWE and CVE lists, we first downloaded all CWE records in the research concepts and software development views (the CWE data is available at https://cwe.mitre.org/data/downloads.html) and all CVE records (the CVE data is available at https://cve.mitre.org/data/downloads/index.html) from their websites. In the CWE list, the research concepts view contains 922 weaknesses. The software development view contains 418 weaknesses; however, all of them are also covered in the research concepts view. Thus, we examined all 922 weaknesses in the CWE to date. The attributes of each CWE record include CWE-ID, name, weakness abstraction, status, description, extended description, related weaknesses, weakness ordinalities, applicable platforms, alternate terms, mode of introduction, exploitation factors, likelihood of exploit, common consequences, detection methods, potential mitigations, observed examples, functional areas, affected resources and taxonomy mappings, related attack patterns and notes. The CVE list contains 156,537 records to date, all of which were examined in our study. We examined all of the following attributes in each CVE record: name (i.e. CVE-ID), status, description, references, phases, votes and comments.
After obtaining both lists, we performed keyword searches to filter out the CWE and CVE records that do not contain privacy-related keywords. We used the search function in Microsoft Excel to look for the keywords in those records. The keyword set consists of 37 keywords categorised into 4 groups as follows:
Group 1: general terms that relate to weaknesses and vulnerabilities in privacy. The terms include privacy, violation, leak and leakage (4 keywords). The term privacy generally appears in the CWE and CVE records that report privacy weaknesses and vulnerabilities. On top of that, we also include the terms violation, leak and leakage as they are alternatively used to express concerns in the context of privacy (see CWE-359, https://cwe.mitre.org/data/definitions/359.html, for more details).
Group 2: terms used to refer to personal data or personally identifiable data. They could be sometimes used interchangeably. The keywords in this group include personal information, personal data, sensitive information, sensitive data, private information, private personal information, personally identifiable information, PII, protected health information, PHI, health information and health data (12 keywords). These terms are used in regulations (e.g., GDPR and HIPAA), standards (e.g., ISO/IEC 29100) and industry sources (e.g., Norton, CWE and CVE).
Group 3: terms that relate to relevant privacy and data protection regulations/standards/frameworks. In this group, we select a set of specific well-known and widely-adopted data protection regulations and privacy frameworks, which include General Data Protection Regulation, GDPR, California Consumer Privacy Act, CCPA, Health Insurance Portability and Accountability Act, HIPAA, Gramm-Leach Bliley Act, GLBA, Safe Harbor Privacy Framework, ISO/IEC 29100 (10 keywords). In addition, we also include the general terms to ensure that we cover unseen regulations and frameworks. The terms consist of regulation, data protection, privacy act, privacy framework, privacy standard (5 keywords).
Group 4: terms that are frequently seen in the privacy policies and literature when discussing personal data protection and user privacy (e.g., Massey2013, GDPR, CCPA and ISO/IEC 29100). These include right(s), consent, opt in/opt-in, opt out/opt-out, preference and breach (6 keywords).
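The keyword filtering step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the study used the search function in Microsoft Excel, the keyword lists below are an abbreviated subset of the 37 keywords, whole-word and case-insensitive matching is our assumption, and the three records are hypothetical.

```python
import re

# Abbreviated subset of the four keyword groups listed above (not all 37 keywords).
KEYWORDS = {
    "group1": ["privacy", "violation", "leak", "leakage"],
    "group2": ["personal information", "personal data", "sensitive data", "PII"],
    "group3": ["GDPR", "CCPA", "HIPAA", "data protection"],
    "group4": ["consent", "opt-in", "opt-out", "preference", "breach"],
}


def compile_keywords(groups):
    """Build one case-insensitive, whole-word pattern per keyword (an assumption)."""
    return {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kws in groups.values()
        for kw in kws
    }


def matched_keywords(text, patterns):
    """Return the sorted list of keywords that occur in the given text."""
    return sorted(kw for kw, pattern in patterns.items() if pattern.search(text))


def shortlist(records, patterns):
    """Keep only records whose description mentions at least one keyword."""
    return [r for r in records if matched_keywords(r["description"], patterns)]


PATTERNS = compile_keywords(KEYWORDS)

# Hypothetical records; the identifiers and descriptions are made up for illustration.
records = [
    {"id": "REC-1", "description": "Application stores personal data without encryption."},
    {"id": "REC-2", "description": "Buffer overflow allows remote code execution."},
    {"id": "REC-3", "description": "Memory leak causes denial of service."},
]
shortlisted_ids = [r["id"] for r in shortlist(records, PATTERNS)]
```

Here REC-3 is shortlisted because it contains the word leak even though it is not about personal data, which illustrates why a keyword search can only produce candidates that still require manual examination.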
|Rights||Description||Relevant data protection regulations/privacy acts|
|Right to be informed||Individuals must be informed about the collection, use and processing of their personal data. For example, the software applications must inform the purposes of personal data collection and processing. In case the personal data is processed by other parties, the individuals must be informed to whom their personal data is disclosed/transferred.||GDPR, CCPA, HIPAA, GLBA, APA|
|Right of access||Individuals must be able to request to access and receive a copy of their personal data.||GDPR, HIPAA, USPA, APA|
|Right to rectification||Individuals must be able to request to correct, complete and/or make changes to their personal data.||GDPR, HIPAA, USPA, APA|
|Right to erasure||Individuals must be able to request to erase their personal data.||GDPR, CCPA|
|Right to restriction of processing||Individuals must be able to request the restriction of personal data processing.||GDPR, HIPAA|
|Right to data portability||Individuals must be able to request to obtain, reuse, move, copy or transfer their personal data across different services for their own purposes.||GDPR|
|Right to object||Individuals must be able to object to the processing of their personal data in certain circumstances (e.g. direct marketing).||GDPR, APA|
|Rights in relation to automated decision making and profiling||Organisations must obtain consent from individuals when the software may process their personal data based on automated decision making and profiling.||GDPR|
|Right to opt-out of sale||Individuals must be able to request to opt-out if their personal data is sold by organisations.||CCPA|
|Right to non-discrimination||Individuals, who exercise their individual rights with respect to particular regulations, must be provided with the same quality and price of goods and services by businesses.||CCPA|
|Right to request confidential communications||Individuals must be able to request to change means or location for receiving communication of health information.||HIPAA|
|Right to be protected against unwarranted invasion||Personal data must be protected from invasion resulting from the collection, use, disclosure, maintenance and other processing.||GDPR, USPA|
|Right to make a complaint||Individuals must be able to lodge a complaint to relevant organisations or supervisory authority if their personal data is mishandled.||GDPR, APA|
|Right to opt-out of sharing||Individuals must be able to opt-out from sharing their personal data.||GLBA|
|Right to not identify yourself||Individuals must have the option of not identifying themselves in certain circumstances (e.g. using a pseudonym).||APA|
We acquired 185 CWE and 1,088 CVE records that contain at least one of the specified keywords. Next, we manually examined each of those records to identify privacy vulnerabilities. A vulnerability is considered as privacy-related if it satisfies one of the following criteria:
A weakness or vulnerability involves any processing of personal data including, but not limited to, collection, use, storage, transfer, alteration, erasure and disclosure. As personal data can be used on its own or in conjunction with other information to identify or trace an individual (e.g. name, identification number, social security number, user credentials, health records, email addresses), the processing of personal data may affect information and user privacy (Scholz2015; ISO/IEC2011; DataPrivacyManager2021).
A weakness or vulnerability may lead to violations of the individual rights (see Table 1). To extract the individual rights, we first selected a range of well-established data protection and privacy regulations in different domains such as governments, businesses, healthcare and finance (e.g., EU GDPR, California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Gramm-Leach-Bliley Act (GLBA), the U.S. Privacy Act (USPA) and the Australian Privacy Act (APA)). These regulations have been enacted at country and regional levels, hence they are well respected by organisations worldwide. The regulations consist of articles or sections that explain provisions and details of legal constraints. In each regulation, we went through each article to look for the individual rights of data subjects/patients/consumers (i.e., the rights that data subjects must have in order to control the processing of their personal data). Once we found an individual right, we summarised its description and added it to our list. Several individual rights are fundamental for personal data and privacy protection (e.g., right to be informed and right of access), hence they have been found in several regulations.
We went through the shortlisted 185 CWE and 1,088 CVE records to determine the vulnerabilities that meet the above criteria. In addition, we found that the National Vulnerability Database (NVD) provides a partial mapping between CVEs and CWEs. Hence, once we had identified privacy-related CVEs, we used this mapping and applied the above criteria to identify additional privacy-related CWE records.
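The use of the CVE-to-CWE mapping can be illustrated with a small sketch. It assumes the mapping has already been loaded into a dictionary from CVE ID to a list of CWE IDs; the function name and all identifiers below are hypothetical placeholders, not real NVD entries. The output is only a list of candidates, which would still need to be checked manually against the criteria above.

```python
def candidate_cwes(privacy_cves, cve_to_cwes, known_privacy_cwes):
    """Collect CWE IDs linked to privacy-related CVEs that are not yet shortlisted."""
    candidates = set()
    for cve_id in privacy_cves:
        # CVEs without a mapped CWE simply contribute nothing.
        candidates.update(cve_to_cwes.get(cve_id, ()))
    return sorted(candidates - set(known_privacy_cwes))


# Hypothetical placeholder identifiers, for illustration only.
cve_to_cwes = {
    "CVE-A": ["CWE-X1", "CWE-X2"],
    "CVE-B": ["CWE-X2"],
    "CVE-C": ["CWE-X3"],
}
privacy_cves = ["CVE-A", "CVE-B", "CVE-D"]   # CVE-D has no mapping
already_identified = ["CWE-X1"]              # found via keyword search

to_review = candidate_cwes(privacy_cves, cve_to_cwes, already_identified)
```

In this toy input only CWE-X2 remains for manual review, since CWE-X1 was already identified and CWE-X3 is linked only to a CVE that was not classified as privacy-related.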
We also note that the CWE weaknesses are organised hierarchically where each CWE weakness can have other CWE weaknesses as its parents or children. We investigated whether privacy properties are inherited through this structure. We discuss here CWE-200 (https://cwe.mitre.org/data/definitions/200.html) as an example to demonstrate this case. CWE-200 describes a weakness that exposes sensitive information to an unauthorised actor. This weakness has one parent and eleven children. CWE-200 itself is classified as privacy-related; however, its parent and seven of its children are not privacy-related. Similarly, CWE-201, one of the CWE-200 children, is privacy-related, but its child CWE-598 is not privacy-related. We also further explored the rest of the CWE-200 children, and found that privacy properties are not inherited through the hierarchical structure. Thus, we did not take the hierarchical structure into account when examining whether a weakness is privacy-related.
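The inheritance check can be sketched as a search for parent-child pairs whose privacy labels differ. The sketch below encodes only the part of the CWE-200 example that is named in the text (CWE-200, its child CWE-201, and CWE-201's child CWE-598); the other children and the parent of CWE-200 are omitted, and the function name is our own.

```python
# Parent -> children edges and privacy labels taken from the CWE-200 example above.
children = {"CWE-200": ["CWE-201"], "CWE-201": ["CWE-598"]}
privacy_related = {"CWE-200": True, "CWE-201": True, "CWE-598": False}


def label_breaks(children, labels):
    """Return (parent, child) pairs whose privacy labels differ."""
    return [
        (parent, child)
        for parent, kids in children.items()
        for child in kids
        if labels[parent] != labels[child]
    ]


breaks = label_breaks(children, privacy_related)
```

Any pair returned by `label_breaks` is a counterexample to inheritance; here the pair (CWE-201, CWE-598) is enough to show that a child cannot be assumed to share its parent's privacy label.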
We identified 41 and 157 privacy vulnerabilities in the CWE and CVE records respectively. The first 28 privacy-related CWE records were found after the keyword search and manual examination steps. The additional 13 privacy-related CWE records were later added after being identified by the privacy-related CVE records. They cover a wide range of privacy concerns in software applications such as missing personal data protection, improper access control, insufficient credentials protection, personal data exposures, unintentional errors made by software developers and personal data attacks from external attackers. These weaknesses and vulnerabilities are not only related to security, but also affect privacy of individuals.
We discuss here a few examples of CWE and CVE records that were classified as privacy vulnerabilities or weaknesses and refer the reader to (rep-pkg-privul) for a full list of them. CWE-359 (https://cwe.mitre.org/data/definitions/359.html) refers to exposure of private personal information to an unauthorized actor. Private personal information here includes social security numbers, geographical location, financial data and health records. In addition, this CWE also mentions relevant data protection regulations and privacy acts such as GDPR and CCPA.
Another example is CVE-2020-13702 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-13702), which refers to the rolling proximity identifier used in the Apple/Google exposure notification API beta through 2020-05-29. This vulnerability enables attackers to circumvent Bluetooth Smart Privacy because of a secondary temporary UID, allowing them to track individual device movement using a Bluetooth LE discovery mechanism. This is a privacy vulnerability since it concerns the user’s location and relates to the processing of that location. CVE-2021-21301 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21301) describes a vulnerability in Wire for iOS (iPhone and iPad) before version 3.75. The application enables the camera for video capturing even though the users have disabled this service. This vulnerability seriously violates user privacy as the users are not aware of their camera being enabled, and the camera may capture and expose the users and their environment.
CVE-2019-16522 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-16522) refers to a cross-site scripting (XSS) vulnerability in WordPress: the eu-cookie-law plugin through 3.0.6 for WordPress is vulnerable to XSS attacks. This vulnerability affects the cookie consent message, which leads to unclear information being provided to the users, including details regarding personal data processing. CVE-2011-1717 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-1717) describes a vulnerability in Skype for Android. The application stores sensitive user data (e.g. user IDs, phone number and date of birth) without encryption in sqlite3 databases. This CVE concerns user data that can be used to identify an individual, such as phone number and date of birth. It also involves a lack of protection for the personal data stored in the database.
There are many cases where the reported vulnerabilities are not specifically privacy-related. For example, CWE-78 (https://cwe.mitre.org/data/definitions/78.html) describes a weakness that enables an attacker to execute arbitrary commands on the operating system, leading to unauthorised access to the operating system. However, this is not specifically a privacy vulnerability since it does not involve personal data or any processing of personal data. Similarly, CVE-2021-21466 (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21466) describes a vulnerability that allows an attacker to inject malicious code into certain versions of SAP Business Warehouse and SAP BW/4HANA. The attacker could create a malicious ABAP report which could be used to access data and disrupt the system functionalities. This vulnerability can lead to Denial of Service. This is again a security vulnerability rather than a privacy one since it is not directly related to violations of individuals' rights to their personally identifiable information.
The coverage of privacy-related vulnerabilities in both CWE and CVE records is quite limited: 4.45% in the CWE system and 0.1% in the CVE system.
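The coverage figures follow directly from the counts reported in this section, as the short check below shows. The helper function name is ours; the four counts are taken from the text.

```python
def coverage_percent(privacy_related, total):
    """Share of privacy-related records among all records, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * privacy_related / total


# Counts reported in this section.
cwe_coverage = coverage_percent(41, 922)        # privacy-related CWEs / all CWEs
cve_coverage = coverage_percent(157, 156537)    # privacy-related CVEs / all CVEs

print(f"CWE: {cwe_coverage:.2f}%  CVE: {cve_coverage:.2f}%")
```

Rounded to two decimal places, these ratios give 4.45% for CWE and 0.10% for CVE, matching the figures stated above.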
4 Common privacy threats in software applications
This section investigates how those privacy-related vulnerabilities identified in CWE and CVE (see Section 3) address common privacy threats in software applications. We first discuss a taxonomy of common privacy threats that we have developed based on an explanatory study of the literature. We then report the threats that have not been covered by the existing privacy-related vulnerabilities in CWE/CVE.
4.1 Explanatory study
To identify common privacy threats in practice, we performed an explanatory study on the literature of the following three groups: existing privacy software engineering research, well-established data protection regulations and privacy frameworks, and additional reputable industry resources. The details of this process are described below.
4.1.1 Privacy engineering research
Privacy engineering has become an emerging area of research in software engineering (Gurses2016). Our study in this area followed the systematic literature review process proposed by Kitchenham2007a to retrieve relevant papers and conduct a literature survey. This process consists of three phases: planning, paper selection, and extracting & reporting (see Figure 3). In the planning phase, we defined our study goal, which is to identify privacy threats addressed in privacy engineering research. Our research question is “What are privacy threats caused by software developers/data processors/service providers/external parties that were addressed in privacy engineering research?”. To achieve the study goal, we developed a review protocol to determine the scope of our study. The protocol consists of four tasks: (i) determining a search keyword and time scope, (ii) selecting software engineering publication venues, (iii) determining exclusion and inclusion criteria, and (iv) determining a set of questions to identify the privacy threats in the papers. In the paper selection phase, we conducted a search process and applied the inclusion and exclusion criteria to the retrieved papers. Finally, we analysed and identified privacy threats in the included papers and reported the results.
Search keyword and time scope.
We used the single search keyword “privacy” to search the title, abstract and keywords fields of papers in the selected publication venues. Only one keyword was used because the search was already restricted to specific software engineering publication venues, and we aimed to retrieve all papers in those venues that address privacy. In addition, the papers from these venues are peer-reviewed by experienced researchers in privacy and software engineering, so their contributions and quality are well received by the research community. We searched for papers published in the past 20 years (2001 - 2020). This task was done automatically by the search functions of the academic databases (i.e., IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink and Scopus). We found 1,434 papers related to privacy during the period of 2001 to 2020 (see Figure 4).
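The search itself was run by the databases' own search functions, but the matching rule described above can be sketched in a few lines. The record layout (a dictionary with `title`, `abstract`, `keywords` and `year`) and the function name `matches_search` are assumptions made for illustration; they are not the export format of any of the databases.

```python
def matches_search(record, keyword="privacy", first_year=2001, last_year=2020):
    """Return True if the record is inside the time scope and mentions the keyword
    in its title, abstract or keywords field (case-insensitive)."""
    if not (first_year <= record["year"] <= last_year):
        return False
    searchable = [
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ]
    return any(keyword in text.lower() for text in searchable)


# Hypothetical records, for illustration only.
records = [
    {"title": "Privacy by design in practice", "year": 2015, "abstract": "", "keywords": []},
    {"title": "Test flakiness", "year": 2019, "abstract": "No mention.", "keywords": ["testing"]},
    {"title": "Access control", "year": 1999, "abstract": "privacy", "keywords": []},
]
hits = [r["title"] for r in records if matches_search(r)]
print(hits)
```

The third record mentions the keyword but falls outside the 2001 - 2020 window, so only the time scope excludes it.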
Software engineering publication venues.
As privacy has also been widely addressed in other fields of study (e.g., law and social sciences), we narrowed the search to obtain a reasonable number of papers in software engineering. We selected seventeen highly recognised software engineering publication venues consisting of 7 conferences, 6 journals, 2 symposiums and 2 workshops. These venues focus on various disciplines in software engineering, which makes them proper candidates for representing privacy in multiple areas. Many of these venues were also used in existing systematic literature reviews in software engineering research (e.g., Ebrahimi2019; Perera2020a; Bertolino2018a). The full list of venues, along with the number of papers found in each, is shown in Table 2.
Exclusion and inclusion criteria.
We manually performed an inclusion and exclusion task to ensure that the papers retrieved from the automated search process satisfy our study scope. We initially determined a set of inclusion and exclusion criteria to filter out the papers that are irrelevant to our goal. The exclusion criteria (EC) are as follows:
EC1: Papers that contain insufficient, incomplete or irrelevant information. As we automatically exported the records of studies from their databases, we found a number of records that contain journal/conference/workshop introduction, information from program chairs, guest editorials, summaries of keynotes and prefaces, which do not satisfy our study goal. Hence, these records are excluded from our study.
EC2: Papers that are duplicates. We compare the papers by title. If several papers have exactly the same title, all versions except the most recent one are excluded.
EC3: Papers that are secondary or tertiary studies (e.g., existing systematic literature review papers and systematic mapping studies) and posters are excluded.
The following inclusion criteria (IC) are applied in the screening step:
IC1: The primary contribution of the papers is related to privacy.
IC2: The research contribution of the papers is related to software development.
After applying the exclusion criteria, we excluded 209 out of 1,434 papers. 1,225 papers were passed to the next step. We then applied the inclusion criteria to the abstracts of the papers (see Figure 5). If the papers satisfy both inclusion criteria, they are included in our study. If the papers do not satisfy one or both criteria, we exclude them from our study. Finally, 417 papers were included in our study. [Footnote 16: The bibliographic data of those papers are available in (rep-pkg-privul).]
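The two screening steps can be read as a small filtering pipeline. The sketch below is only an illustration of that logic: the `Paper` fields, the `kind` labels and the two boolean judgement flags are hypothetical stand-ins for decisions that were made manually by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    kind: str                  # hypothetical label, e.g. "research", "preface", "secondary"
    about_privacy: bool        # manual judgement for IC1
    about_software_dev: bool   # manual judgement for IC2


# Record kinds removed by EC1 (non-research records) and EC3 (secondary studies, posters).
EXCLUDED_KINDS = {"introduction", "editorial", "keynote", "preface", "secondary", "poster"}


def apply_exclusion(papers):
    """Apply EC1 and EC3 by record kind, and EC2 by keeping the newest paper per exact title."""
    newest = {}
    for paper in papers:
        if paper.kind in EXCLUDED_KINDS:
            continue
        kept = newest.get(paper.title)
        if kept is None or paper.year > kept.year:
            newest[paper.title] = paper
    return list(newest.values())


def apply_inclusion(papers):
    """Keep only papers that satisfy both IC1 and IC2."""
    return [p for p in papers if p.about_privacy and p.about_software_dev]


# Counts reported in the text: 1,434 retrieved, 209 excluded, 1,225 screened.
assert 1434 - 209 == 1225
```

Keeping exclusion and inclusion as separate functions mirrors the order in the text: exclusion runs on the exported records, inclusion on the abstracts of what remains.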
Privacy threats identification.
To identify privacy threats, we examined the research studies reported in those papers, and shortlisted the ones that focus specifically on privacy vulnerabilities and attacks. The papers were analysed by asking the following questions:
What is the cause of a privacy-related problem in software that has been raised in the paper? This cause must involve personal data and be harmful to data subjects. This question aims to list all the privacy threats addressed in the paper.
Is the identified cause attributable to software developers, data controllers/processors, organisations or external parties? This question helps us focus on the privacy threats that are not caused by users. There are papers that investigate user perceptions towards different privacy perspectives (e.g., user perceptions of online behavioural advertising/smart home devices and user confidence in using smartphones). We do not include the privacy threats that are caused by users, as they are out of the scope of this study.
|Source title||Acronym||Type||Count||Included||Threats found|
|ACM Transactions on Software Engineering and Methodology||TOSEM||Journal||4||1||0|
|Empirical Software Engineering||EMSE||Journal||6||3||0|
|IEEE Symposium on Security and Privacy||SP||Symposium||330||64||9|
|IEEE Symposium on Security and Privacy Workshops||SPW||Workshop||134||49||5|
|IEEE Transactions on Software Engineering||TSE||Journal||22||7||1|
|Information and Software Technology||IST||Journal||16||5||0|
|International Conference on Automated Software Engineering||ASE||Conference||20||8||3|
|International Conference on Availability, Reliability and Security||ARES||Conference||186||113||93|
|International Conference on Foundations of Software Engineering||FSE||Conference||13||5||0|
|International Conference on Mining Software Repositories||MSR||Conference||8||2||0|
|International Conference on Software Engineering||ICSE||Conference||90||22||4|
|International Requirements Engineering Conference||RE||Conference||68||25||2|
|International Symposium on Empirical Software Engineering and Measurement||ESEM||Conference||5||1||0|
|International Workshop on Evolving Security and Privacy Requirements Engineering||ESPRE||Workshop||32||7||0|
|Symposium on Usable Privacy and Security||SOUPS||Symposium||104||77||47|
|Journal of Systems and Software||JSS||Journal||377||18||13|
|Breach of confidentiality||3|
|Not a vulnerability/an attack||238|
4.1.2 Privacy regulations and frameworks
We have studied 7 well-established data protection and privacy regulations and frameworks (i.e., GDPR (OfficeJournaloftheEuropeanUnion;2016), CCPA (CCPA), HIPAA (HIPAA), GLBA (GLBA), USPA (US1974), APA (APA) and ISO/IEC 29100 (ISO/IEC2011)). We have found that these regulations and frameworks focus mainly on the privacy threats to the rights that allow individuals to control the processing of their personal data. They also raise a number of privacy threats relating to individuals being informed about the processing (e.g., collection, use and transfer) of their personal data in software applications.
4.1.3 Additional industry sources
We have also included a range of reputable industry sources (e.g., OWASP and Norton) on this topic. These sources (e.g., OWASP2020; Norton) cover recent trends in privacy risks and attacks, and are developed by well-known communities and companies. For example, OWASP identified 20 privacy risks in web applications (OWASPsurvey), such as leaking personal data or vulnerabilities that involve stealing personal data through common cyberattacks such as cross-site scripting and broken session management. Norton also identified a number of social engineering cyberattacks that are related to privacy, such as phishing and keystroke logging attacks (Nortona).
In the next section, we discuss in detail the common privacy threats that we have found in our explanatory study of the literature.
4.2 A taxonomy of common privacy threats
We have built a taxonomy of common privacy threats upon the well-established privacy threats taxonomy described in Stallings2019. This taxonomy was originally proposed by Solove2006a, and covers the privacy of individuals from a legal perspective. It covers cases which are mainly caused by physical activities that violate the privacy of individuals (e.g., a newspaper reports the name of a rape victim, or a company sells its members’ personal information despite promising not to do so) (Solove2006a). It has been considered one of the most comprehensive taxonomies of privacy threats. Later, Stallings2019 adapted Solove’s taxonomy to the context of information systems. However, Stallings’s taxonomy covers privacy threats at a generic and rather abstract level. Thus, we tailored this taxonomy and refined it into a more concrete version that addresses privacy threats in software engineering. The taxonomy consists of four categories of privacy threats: information collection, information processing, information dissemination and invasions (see Figure 6). Each of these categories contains different subcategories covering relevant harmful privacy threats. The yellow boxes in Figure 6 represent the categories and subcategories included in the original taxonomy.
After identifying the privacy threats in the explanatory study, we classified those threats into two groups: vulnerabilities and compliance. Privacy vulnerabilities refer to technical issues, flaws or errors that lead to privacy exploits in software applications and platforms. Compliance addresses the privacy threats which are related to individual rights and the governance of personal data. We then expanded Stallings’s taxonomy by mapping the privacy threats in the vulnerabilities group into their relevant subcategories in the original taxonomy (i.e., blue boxes in Figure 6). However, the privacy threats in the compliance group are not addressed by any group in the original taxonomy. Thus, we propose the compliance group as an extension to the original taxonomy (see Figure 7). This group is discussed in detail in Section 4.2.5. The full taxonomy is available at (rep-pkg-privul).
In our taxonomy, we classify twenty-four privacy vulnerabilities into seven subcategories. In the classification process, we mapped each privacy threat into the most relevant subcategory based on the description of each subcategory given in Stallings2019. We describe the categories, subcategories and their relevant privacy threats below. We also include samples of the sources where the privacy threats were raised or discussed.
4.2.1 Information collection.
(Sources: Antn2004; Deng2011; Jana2013; Hasan2020; Lebeck2018; De2016; Drosatos2014a; OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011; HIPAA; GLBA; US1974; APA; OWASPsurvey) This category concerns privacy threats that occur when collecting personal data from individuals. The surveillance subgroup covers vulnerabilities existing in the way software collects personal data such as watching individuals through cameras or CCTVs, listening to individuals, or recording individuals’ activities. The privacy threat related to this subgroup is caused when personal data is collected without consent or permissions from individuals (e.g., via mobile sensors).
4.2.2 Information processing.
(Sources: Antn2004; Deng2011; Zhang2020a; Calandrino2011; Jana2013; Hasan2020; Omoronyia2012; Some2019; Figueiredo2017; Venkatadri2018; Horbe2015a; De2016; Fisk2015; Barman2015; Lin2016b; Iqbal2009; Sicari2012a; Yang2016b; ErolaArnau2011; BilogrevicIgor2011; Anh2014; Siewe2016a; Calciati; Drosatos2014a; Zhang2005a; Castiglione2010a; Tschersich; Rafiq2017; Iyilade2014; OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011; HIPAA; GLBA; APA; US1974; OWASPsurvey) This category covers vulnerabilities related to the use, storage, and manipulation of the collected personal data. It contains four subcategories: identification, insecurity, secondary use and exclusion. The identification subcategory addresses the privacy threat in which personal data is aggregated from various sources and used to identify individuals.
The vulnerabilities in insecurity subcategory are caused by improper protection and handling of personal data. There are multiple forms of this vulnerability type such as lacking mechanisms to protect personal data, allowing unauthorised actors (e.g., internal/external staff and attackers) to access or modify personal data, or track individual users (e.g., user’s locations, visiting history, etc.). Transferring personal data between software applications without protection is also another form of this vulnerability. Personal data is sometimes required to be processed at third parties. This poses a privacy vulnerability since the third parties may apply lower levels of personal data protection than the source does. In addition, this subcategory refers to the vulnerabilities where mechanisms to protect personal data are in place but they are not appropriate. Different types of personal data require different methods/techniques and levels of protection. Thus, appropriate protection mechanisms should be used to protect personal data against potential risks. This vulnerability type is often refined into improper techniques/methods and insufficient levels of protection (e.g., weak encryption).
The secondary use subcategory refers to the use of personal data for other purposes without consent or without following user privacy preferences. Personal data that is used or transferred without user permissions, or to an unintended destination, is a privacy vulnerability. Personal data used without following user privacy preferences can also cause a privacy vulnerability. [Footnote 17: For example, CVE-2005-2512 reports a vulnerability which could result in a privacy leak in mail.app in Mac OS 10.4.2 and earlier in which remote images are loaded against the user’s preferences when an HTML message is printed or forwarded.]
The exclusion subcategory refers to the failure to provide individuals with notice and input for managing their personal data. These vulnerabilities relate to consent, which allows users to express their agreement on the use of their personal data in software applications. We note that consent handling may seem to be a part of compliance; however, this subcategory focuses on the malfunctions that cause vulnerabilities in consent handling. User consent is required when the processing of personal data is not required by law. Users should also be notified when the conditions of the consent are changed. In addition, users should be allowed to modify or withdraw their consent. One example of this privacy vulnerability is in mobile applications, where users grant a specific permission to one app but the permission is overridden in other apps without their consent (Calciati; Zhang2020a). Apart from user consent, user privacy preferences are also important. Privacy preferences enable users to personalise how their personal data is managed. This privacy vulnerability type takes two forms: privacy preferences not being provided to users, and users not being able to modify their privacy preferences.
4.2.3 Information dissemination.
(Sources: Zhang2020a; Calciati; Jana2013; Lucia2012; OWASPsurvey) This category refers to the privacy threats that lead to the revelation of personal data to the public. The breach of confidentiality subcategory covers the vulnerabilities that cause personal data leakage by those who are responsible for personal data processing (e.g., software developers, data analysts, etc.).
4.2.4 Invasions.
(Sources: Deng2011; Zeng2019; Reinheimer2020; OWASPsurvey; Nortona) This category addresses attacks that directly affect individuals. The intrusions subcategory covers vulnerabilities exploited by common privacy attacks in software applications. There are typically four types of attack: attacks on web applications, phishing, keystroke logging and attacks on smart home devices. The attacks on web applications include, but are not limited to: injection, broken authentication/authorisation/session management, security misconfigurations, cross-site request forgery (CSRF), insecure direct object reference (IDOR), using components with known vulnerabilities, and unvalidated redirects and forwards. Phishing attacks can be used to steal personal data by sending malicious links to users through software applications, text messages and emails. These links bring the users to fake pages or programs that ask for their confidential data. The users are not aware of these malicious activities and end up giving away personal data that can identify them (e.g., address, personal identifiers and financial data) to the attackers. Keystroke logging allows attackers to track and record the keys that users press on their keyboards. The attackers can then capture and steal personal data entered by users. Smart home devices are also vulnerable to external attacks. Attackers can intercept smart home systems to gain access to personal data.
4.2.5 Compliance.
We used a bottom-up approach to construct a taxonomy for the compliance category. The privacy threats that address the same concerns are grouped into the same subcategory. We classify privacy compliance into the following three subcategories. The first is not complying with individual rights (sources: OfficeJournaloftheEuropeanUnion;2016; CCPA; HIPAA; GLBA; US1974; APA; ISO/IEC2011; Deng2011; Omoronyia2013a; Bhatia2018; Bhatia2016a; Omoronyia2012; De2016; Yu2021; Mihaylov2016a; OWASPsurvey). There are 16 individual rights identified in our study (see Table 1).
The second is not providing contact details of a responsible person (sources: OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011). This subcategory covers cases where software applications do not provide the contact details of a responsible person or a representative who controls the processing of users’ personal data. This is a privacy threat as the users do not know whom to contact regarding the processing of their personal data. The third is improper personal data breach response (sources: OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011; HIPAA; OWASPsurvey). When a breach occurs, a responsible person must notify two key stakeholders: the concerned individuals and a supervisory authority. A privacy threat arises if the responsible person does not communicate the breach incident to the concerned users, whose personal data has been leaked to the public, and to the supervisory authority, which monitors the personal data processing under the individual rights.
Our taxonomy covers existing vulnerabilities that occur in different activities in software systems. These privacy vulnerabilities have been raised in real-world software applications and software development processes. The taxonomy was also developed based on an existing comprehensive privacy threats taxonomy, and validated against the common vulnerabilities reported in CWE and CVE, as we discuss in detail in the next section.
4.3 Privacy threats covered in CWE/CVE
The next step in our study was to investigate how the 41 privacy vulnerabilities in CWE and the 157 in CVE (see Section 3) address the taxonomy of common privacy threats in Section 4.2. This process consists of the following steps. The first two co-authors (hereafter referred to as the coders) independently analysed each of the 41 privacy vulnerabilities in CWE, and classified it into the most relevant privacy threat in the taxonomy. Of the 157 privacy vulnerabilities in CVE, 112 were assigned to a specific CWE by the NVD. These 112 CVEs were automatically classified into the same privacy threat as their associated CWE. The remaining 45 CVEs, which were not assigned a specific CWE [Footnote 18: NVD used two special placeholder names for these: NVD-CWE-noinfo and NVD-CWE-other.], were manually classified by our coders. To facilitate the classification step, each coder was provided with a Google Sheet form pre-filled with the privacy vulnerabilities in CWE and CVE and the privacy threats in the taxonomy. The privacy threats were presented as a drop-down list. The coders then selected the privacy threat that, in their view, was most relevant to and best described each CWE and CVE.
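The split between inherited and manually coded CVEs can be sketched as follows. This is a minimal illustration of the rule described above, not the tooling used in the study; the function name and the identifiers in the example (`CVE-A`, `CWE-X`, and so on) are hypothetical placeholders, and only the two NVD placeholder names come from the text.

```python
# Placeholder names NVD uses when no specific CWE is assigned.
MANUAL_PLACEHOLDERS = {"NVD-CWE-noinfo", "NVD-CWE-other"}


def classify_cves(cve_to_cwe, cwe_to_threat):
    """Split CVEs into those that inherit a threat from their CWE and those needing manual coding.

    cve_to_cwe:    mapping of CVE id -> CWE id (or an NVD placeholder name)
    cwe_to_threat: mapping of CWE id -> privacy threat chosen by the coders
    """
    inherited, manual = {}, []
    for cve_id, cwe_id in cve_to_cwe.items():
        if cwe_id in MANUAL_PLACEHOLDERS or cwe_id not in cwe_to_threat:
            manual.append(cve_id)
        else:
            inherited[cve_id] = cwe_to_threat[cwe_id]
    return inherited, manual


# Hypothetical identifiers, for illustration only.
cwe_to_threat = {"CWE-X": "exposing personal data to an unauthorised actor"}
cve_to_cwe = {"CVE-A": "CWE-X", "CVE-B": "NVD-CWE-noinfo", "CVE-C": "NVD-CWE-other"}
inherited, manual = classify_cves(cve_to_cwe, cwe_to_threat)

# Counts reported in the text: 112 inherited + 45 manual = 157 CVEs.
assert 112 + 45 == 157
```

CVEs whose CWE is not among the classified privacy-related CWEs are also routed to manual coding here; that fallback is an assumption of the sketch rather than something the text specifies.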
We also employed Inter-Rater Reliability (IRR) assessment and disagreement resolution processes to ensure the reliability of the classifications. Cohen’s Kappa coefficient was used to measure the inter-rater agreement, as it is well suited to a multi-class classification problem with two coders (Hallgren). Once the coders had completed the classifications, the inter-rater agreement was computed. The Kappa agreement values between the coders are 0.874 and 0.875 in the CWE and CVE classifications respectively, both of which indicate an almost perfect level of agreement (Viera2005). A disagreement resolution session was conducted to resolve the small number of classification conflicts (4 CWEs and 4 CVEs). The coders met, went through the conflicts together, discussed them and reclassified those vulnerabilities. Thus, the final classification reached full agreement between the coders.
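For readers unfamiliar with the measure, Cohen’s Kappa compares the observed agreement p_o between two coders with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label lists; the function name and the toy labels are illustrative assumptions, and the study’s actual labels are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two coders who each assign one nominal label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, for illustration only.
coder_1 = ["exposure", "exposure", "attack", "attack"]
coder_2 = ["exposure", "exposure", "attack", "exposure"]
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy case the coders agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so the coefficient is 0.5, well below the values of 0.874 and 0.875 reported above.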
We have found that the 41 CWEs and 157 CVEs together cover 13 vulnerabilities in the taxonomy. They are annotated with the corresponding CWEs and CVEs in Figure 6. For brevity, we do not include the CVE numbers in Figure 6, but they are provided in full in our replication package (rep-pkg-privul). Exposing personal data to an unauthorised actor, insufficient levels of protection, and personal data attacks are the top three most addressed vulnerabilities in both CWE and CVE. Personal data protection seems to attract considerable attention in CWE/CVE, accounting for more than half of the CWEs (56.1%) and CVEs (59.87%); most of these (36.59% of the CWEs and 50.32% of the CVEs) relate to exposing personal data to an unauthorised actor. Personal data attacks account for 19.51% of the CWEs and 15.92% of the CVEs.
There are four types of privacy vulnerabilities that are covered by both CWE and CVE: exposing personal data to an unauthorised actor, insufficient levels of personal data protection, improper methods/techniques for personal data protection, and personal data attacks. There are several types of privacy vulnerabilities that have been reported in CVE, but not in CWE, such as allowing unauthorised actors to track individuals, not following user privacy preferences, and not asking for user consent. 8.28% of the CVEs refer to those types of privacy vulnerabilities, suggesting that those types of vulnerabilities need to be added into the CWE system.
There are a number of areas that are not covered by the existing privacy vulnerabilities in CWE and CVE (as highlighted with a red outline in Figure 6). For example, exclusion involves a range of privacy vulnerabilities in which software fails to give users notice about their consent and input into their privacy preferences. User consent and privacy preferences are two essential mechanisms that enable users to control the processing of their personal data. Several sources (Antn2004; ISO/IEC2011; HIPAA) confirm that user privacy is vulnerable if users are not presented with options to specify their privacy preferences or cannot modify them. Similarly, user privacy may be compromised if users cannot modify or withdraw their consent, or are not notified about changes to the consent (OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011; HIPAA; APA; OWASPsurvey). However, none of these vulnerabilities (except not asking for user consent) is covered in the existing CVEs.
Other areas that are not well covered in existing CWEs and CVEs are privacy vulnerabilities due to insecurity and secondary use. Processing personal data at a third party is a risk since the third party may apply lower levels of personal data protection, particularly in mobile (Zhang2020a) and web applications (OWASPsurvey). There are cases in which user privacy is violated when personal data is used for unspecified purposes, or used or transferred without permissions (Antn2004; Deng2011; Zhang2020a; Jana2013; Hasan2020; Rafiq2017; Lebeck2018; Iyilade2014; De2016; Fisk2015; OfficeJournaloftheEuropeanUnion;2016; ISO/IEC2011; HIPAA; GLBA; US1974; APA; OWASPsurvey). In addition, although allowing unauthorised actors to modify personal data and collecting personal data without user consent/permissions are serious threats, none of the existing privacy vulnerabilities in CWE and CVE covers them.
The existing privacy weaknesses and vulnerabilities reported in the CWE and CVE systems cover only 13 out of 24 common privacy vulnerabilities raised in research and practice.
5 New common privacy weaknesses
Our study has shown the gaps in the existing CWE and CVE systems in terms of covering privacy vulnerabilities. To fill these gaps, we propose 11 new common privacy weaknesses to be added to the CWE system. We focus on CWE instead of CVE since CVE specifies unique vulnerabilities existing in specific software systems and applications, while CWEs are at the generic level, similar to our taxonomy of privacy threats. We followed the CWE schema (CWEschema2021) to define the new CWE entries, which include attributes such as name, description, mode of introduction, common consequence, detection method, potential mitigation and demonstrative example. These attributes provide an overview of a privacy weakness in terms of its causes, consequences and mitigation methods.
The new common privacy weaknesses that we propose address four groups of threats: surveillance, insecurity, secondary use and exclusion. The weaknesses in the insecurity group can be detected and resolved by implementing security mechanisms to better protect personal data and user privacy. On the other hand, the other groups of weaknesses require software development teams to pay more attention to the privacy constraints involved, and to design relevant functions that respond to those constraints. The software development team needs to determine the relevant functions, take privacy constraints into consideration and regularly review existing functions to ensure that they do not violate user privacy. For example, a user may want to change his/her preferences when a new feature is launched, or a system must notify its users when there is a change to the policy that the users have given consent to.
Due to space limits, we present here two examples (see Table 4 and Table 5) of the new CWE privacy weaknesses and refer the readers to our replication package (rep-pkg-privul) for the remaining newly proposed CWEs. Table 4 shows a new common privacy weakness for not allowing a user to withdraw his/her consent. This weakness belongs to the exclusion category. Following the CWE schema, a short summary of the weakness is provided in its name, while a detailed description is provided in the description section. The mode of introduction briefly discusses how and when the weakness is introduced, which in this case is the architecture and design phase.
|Class: Not allowing a user to withdraw his/her consent|
|Name: Missing consent withdrawal|
|Description: The software forces users to give consent before providing its services. These services may include personal data processing. However, the software does not provide a function for users to withdraw their consent. The users can only give consent, but they cannot withdraw it when they wish to. This vulnerability seriously violates user privacy.|
|Mode of introduction: Phase: Architecture and Design. This weakness is caused by a missing privacy consideration about consent management and its related processes, which leads to the missing consent withdrawal function.|
|Common consequence: The software violates user privacy by not allowing users to express their agreement on the use of their personal data.|
|Detection method: Method: Manual analysis. Description: A consent management page or window in the software does not show an icon or option to withdraw consent.|
|Potential mitigations: Phase: Architecture and Design. Strategy: User consent withdrawal consideration. Description: The software development team should consider at which points the software should give a user the ability to withdraw consent.|
|Demonstrative example: Issue MDL-62309 in Moodle reports that the users cannot withdraw consent and cannot enter the site without giving consent. This issue violates user privacy as the users should be able to freely withdraw consent.|
The common consequence section identifies a privacy property that is violated and an effect that is caused by the weakness. The detection method section describes different methods by which the weakness can be detected in software. We also propose methods to mitigate the weakness. It is noted that different phases in software development may pose different privacy concerns. Finally, we provide a demonstrative example of the new weaknesses by extracting code fragments, issue reports and commits from real software repositories. For example, Figure 8 shows a code fragment extracted from the Github commit ad5e213 [Footnote 19: https://github.com/moodle/moodle/commit/ad5e2135c5d2ccd7f53a08fc0c66de66d431cfdf] in Issue MDL-62309 [Footnote 20: https://tracker.moodle.org/browse/MDL-62309] of the Moodle project. This example demonstrates the existence of the missing consent withdrawal weakness in practice.
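To make the mitigation for the missing consent withdrawal weakness concrete, the following is a minimal sketch of a consent store that offers withdrawal alongside granting. It is not the Moodle code or its fix; the class names, fields and the in-memory storage are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    purpose: str
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None


class ConsentRegistry:
    """In-memory consent store that supports both granting and withdrawing consent."""

    def __init__(self):
        self._records = {}  # (user_id, purpose) -> ConsentRecord

    def grant(self, user_id, purpose):
        self._records[(user_id, purpose)] = ConsentRecord(purpose, datetime.now(timezone.utc))

    def withdraw(self, user_id, purpose):
        """Mark consent as withdrawn; return False if there is no active consent."""
        record = self._records.get((user_id, purpose))
        if record is None or record.withdrawn_at is not None:
            return False
        record.withdrawn_at = datetime.now(timezone.utc)
        return True

    def has_consent(self, user_id, purpose):
        record = self._records.get((user_id, purpose))
        return record is not None and record.withdrawn_at is None


registry = ConsentRegistry()
registry.grant("user-1", "analytics")
registry.withdraw("user-1", "analytics")
```

Keeping the withdrawal timestamp instead of deleting the record is a design choice of this sketch: it leaves a trace of when consent was active, which a real system may need when deciding which processing was permitted.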
Table 5 presents a new common privacy weakness for collecting personal data without user consent/permissions. This weakness belongs to the improper personal data collection category. Figure 9 shows a code fragment confirming the existence of the weakness of a missing consent check before collecting personal data. The code fragment is extracted from the Github commit 0b09df0 [Footnote 21: https://github.com/HumanDynamics/rhythm-server/commit/0b09df0fa0a35dd1fe8a6b2160fb4e68299574d4] of the HumanDynamics repository (HumanDynamics).
|Class: Collecting personal data without user consent/permissions|
|Name: Missing a consent check before collecting personal data|
|Description: The software does not check for user consent prior to personal data collection. As a result, the software collects personal data that users have not given consent to (e.g., location and speech).|
|Mode of introduction: Phase: Implementation. This weakness is caused by missing a consent check before collecting personal data.|
|Common consequence: The software violates user privacy since users have not given consent/permissions to collect their personal data.|
|Detection method: Method: Manual analysis. Description: Perform a code check at points of personal data collection.|
|Potential mitigations: Phase: Implementation. Strategy: Check for user consent before collecting data. Description: The software development team should perform a consent check at every point that collects personal data in the software.|
|Demonstrative example: Commit 0b09df0 in the HumanDynamics repository collects user speech without a consent check.|
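The mitigation in Table 5 amounts to guarding every collection point with a consent check. The sketch below shows one way such a guard could look; it is not taken from the rhythm-server repository, and the function name, the exception class and the shape of the `consents` mapping are all hypothetical.

```python
class ConsentRequiredError(PermissionError):
    """Raised when personal data would be collected without the user's consent."""


def collect_personal_data(user_id, data_type, payload, consents, store):
    """Store a piece of personal data only if the user consented to this data type.

    consents: mapping of user id -> set of data types the user has consented to
    store:    list acting as the data sink in this sketch
    """
    if data_type not in consents.get(user_id, set()):
        raise ConsentRequiredError(f"no consent from {user_id} for {data_type}")
    store.append((user_id, data_type, payload))


# Hypothetical consent state: the user agreed to location collection only.
consents = {"user-1": {"location"}}
store = []
collect_personal_data("user-1", "location", (51.5, -0.1), consents, store)

try:
    collect_personal_data("user-1", "speech", b"audio-bytes", consents, store)
except ConsentRequiredError:
    pass  # speech is rejected because consent was never given for it
```

Raising an error rather than silently dropping the data makes a missing consent visible at the call site, which also supports the manual code check named as the detection method.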
We propose 11 new common privacy weaknesses to be added to CWE. These will significantly improve the coverage of privacy weaknesses and vulnerabilities in CWE, and subsequently CVE.
6 Threats to validity
Our keyword-based method for extracting privacy vulnerabilities from CWE and CVE might not produce a complete list. However, we have used several strategies to mitigate this threat, such as determining the keywords based on alternate terms described in CWE, using the frequent terms identified in studies that performed large-scale analyses of privacy policies, and considering general terms to cover unseen materials (e.g., regulation, data protection and privacy standard). Our taxonomy of common privacy threats is constructed and refined based on the existing privacy threats taxonomy. We have made our best effort to ensure the comprehensiveness of the study by examining popular software engineering publication venues, well-established data protection regulations and privacy frameworks, and reputable industry sources. However, we acknowledge that there might be other sources of privacy threats that we have not yet identified. We have carefully defined a set of inclusion and exclusion criteria to select the most relevant papers, so that a reasonable number of papers could be examined individually. Future work would involve expanding our explanatory study to increase the generalisability of our taxonomy of common privacy threats. In addition, classifying the privacy-related vulnerabilities in CWE and CVE into the common privacy threats in the taxonomy involved subjective judgements. We have applied several strategies (e.g., using multiple coders, applying inter-rater reliability assessments and conducting disagreement resolution) to mitigate this threat. Future work could explore the use of external subject matter experts in these tasks.
7 Conclusions and future work
The increasing use of software applications in people’s daily lives has put privacy under constant threat as personal data are collected, processed and transferred by many software applications. In this paper, we performed a number of studies on privacy vulnerabilities in software applications. Our study on the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) systems found that the coverage of privacy-related vulnerabilities in both systems is quite limited (4.45% in the CWE system and 0.1% in the CVE system).
We have also investigated how the privacy-related vulnerabilities identified in CWE and CVE address the common privacy threats in software applications. To do so, we developed a taxonomy of common privacy threats based on selected privacy engineering research, well-established data protection regulations and privacy frameworks, and industry resources. We found that only 13 out of 24 common privacy vulnerabilities in the taxonomy are covered by the existing weaknesses and vulnerabilities reported in CWE and CVE. The three most addressed vulnerabilities are exposing personal data to an unauthorised actor, insufficient levels of protection and personal data attacks. Based on these insights, we proposed 11 new common privacy weaknesses to be added to the CWE system. We also mined code fragments from real software repositories to confirm the existence of those privacy weaknesses. These newly proposed weaknesses will significantly improve the coverage of privacy weaknesses and vulnerabilities in CWE, and subsequently CVE.
Future work involves expanding our taxonomy to cover additional common privacy threats that may have been raised or discussed in sources that we have not included in this study. We will also perform a study to characterise privacy vulnerabilities in software applications. This will enable us to develop new techniques and tools for automatically detecting privacy vulnerabilities in software and suggesting fixes.