Extracting Insights from the Topology of the JavaScript Package Ecosystem

10/02/2017
by Nuttapon Lertwittayatrai, et al.
Kasetsart University

Software ecosystems have had a tremendous impact on computing and society, capturing the attention of businesses, researchers, and policy makers alike. Massive ecosystems like the JavaScript node package manager (npm) are evidence of how readily packages are available for use by software projects. Due to its high-dimensional and complex properties, software ecosystem analysis has been limited. In this paper, we leverage topological methods to visualize the high-dimensional datasets from a software ecosystem. Topological Data Analysis (TDA) is an emerging technique for analyzing high-dimensional datasets, which enables us to study the shape of data. We generate the npm software ecosystem topology to uncover insights and extract patterns of existing libraries by studying its localities. Our real-world example reveals many interesting insights and patterns that describe the shape of a software ecosystem.



I Introduction

Software ecosystems have a tremendous impact on contemporary software development. Software developers are likely to rely on third-party libraries from the ecosystem to gain the benefits of quality, speed to market, and ease of use. One example of an emerging software ecosystem is the JavaScript package ecosystem. Since its inception, the ecosystem has exploded in growth to over half a million packages (as of July 2017, the size of the npm repository reached 475,000 packages) available for its users, making it one of the biggest and most popular Open Source Software ecosystems in recent times. An ecosystem itself is comprised of many social and technical aspects that can be represented as high-dimensional data.

The datasets gathered from software ecosystems are vastly high-dimensional and noisy, and are generally challenging when attempting to identify patterns or insights at a higher level. A recent study by Wittern et al. [Wittern2016] investigated some of the dynamics in the ecosystem. They studied the ecosystem from the perspectives of evolution, popularity, and adoption; however, the different features were analyzed separately rather than for the patterns between them. Other related work [Decan2017], [Kikas2017] focuses on the dependencies between the different packages within the npm ecosystem, while others have studied certain socio-technical aspects [Constantinou2017].

In this paper, we apply topological methods to study complex high-dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Leveraging concepts and algorithms from the mathematics field of Topological Data Analysis (TDA), we provide a geometric representation of complex datasets. TDA permits the analysis of relationships between related dataset features. We illustrate our approach by applying it to a representative sample of packages and six key features that describe the ecosystem topology. We are able to extract the following insights from our generated npm ecosystem topology:

  • The topological shape becomes more refined as more data is added.

  • The number of dependencies for a package is a strong feature in the topology.

  • Packages that are more likely to be used within the ecosystem are located separately from packages meant for application usage outside the ecosystem.

  • Top authors of packages tend to release packages intended for usage within the ecosystem itself.

  • Packages are not likely to be affiliated to an organizational domain.

  • Packages are licensed under a dominant software license (i.e., MIT).

  • Popular tagged keywords are common across packages in the topology.

Furthermore, we show how the topology is more insightful than standard alternative methods such as archetypal analysis. We envision that the study of topology and investigation of additional key features may lead to better understanding of software ecosystems.

II Background

Lum et al. [Lum13] showed how the shape of a topology can be leveraged to extract useful insights. Lum and colleagues show how TDA allows exploration of the data without first having to formulate a query or hypothesis, highlighting the importance of understanding the “shape” of data in order to extract meaningful insights.

Topology is the field within mathematics that deals with the study of shapes. It has its origins in the 18th century, with the work of the Swiss mathematician Leonhard Euler. TDA is the result of a concerted effort to adapt topological methods to various applied problems, one of which is the study of large and high dimensional data sets. In Lum’s study, they applied topology to three very different kinds of data, namely gene expression from breast tumors, voting data from the United States House of Representatives and player performance data from the NBA, in each case finding stratifications of the data which are more refined than those that are produced by standard methods.

The only other work in which TDA was applied in a software setting was by Costa et al. [Costa2017]. Using a more complex range of techniques, they concluded that topological analysis might be useful for characterizing software system behavior and reliability early on, which may contribute to software reliability modeling. In this work, we focus only on the topology mapper algorithm [SPBG:SPBG07:091-100] to generate a topological map of the software ecosystem.


Fig. 1: Taken from [Lum13]. A) A 3D object (hand) represented as a point cloud. B) A filter value is applied to the point cloud and the object is now colored by the values of the filter function. C) The data set is binned into overlapping groups. D) Each bin is clustered and a network is built. In this work, each filter is represented by the extracted features of the npm ecosystem.

III A Software Ecosystem Topology

In this section, we discuss the method by which the topology of an ecosystem is mapped. We use the definition of software ecosystem as “a collection of software systems, which are developed and co-evolve in the same environment” [Lungu2008].

III-A Topological Mapper Method

The mapper algorithm [SPBG:SPBG07:091-100] is a method for constructing useful combinatorial representations of geometric information about high-dimensional point cloud data. It can be used to reduce high-dimensional data sets into mathematical objects, namely simplicial complexes with far fewer points, that capture topological and geometric information at a specified resolution. As shown in Figure 1, instead of acting directly on the data set, it assumes a choice of a filter or combination of filters, which can be viewed as a map to a metric space, and builds an informative representation based on clustering the various subsets of the data set associated with the choices of filter values.
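The steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the algorithm as implemented in TDA tooling: the filter range is covered by overlapping one-dimensional bins, and per-bin clustering is collapsed to a single cluster per bin.

```python
from itertools import combinations

def mapper(points, filter_fn, n_bins=4, overlap=0.5):
    """Toy mapper: filter the points, cover the filter range with
    overlapping bins, treat each bin as one cluster (real mapper
    clusters within each bin), and connect clusters sharing points."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    width = step * (1 + overlap)  # each bin overlaps the next one
    nodes = []
    for b in range(n_bins):
        start = lo + b * step
        members = frozenset(i for i, v in enumerate(values)
                            if start <= v <= start + width)
        if members:
            nodes.append(members)
    # two nodes are linked if their clusters share at least one point
    edges = {(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]}
    return nodes, edges
```

On eleven evenly spaced points with the identity filter, this yields four overlapping nodes joined in a chain, mirroring how overlapping bins stitch a point cloud into a connected shape.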

III-B JavaScript Package Ecosystem Features

In order to create a point cloud for the ecosystem topology, we first identify six filters (referred to as features in this paper) that are indicative of our dataset. Based mainly on the work of Wittern et al. [Wittern2016] and other work that studied software ecosystems [Decan2017], [Kikas2017], [Bavota2015], we identify six features that can be used as geometric features of a package. As in Wittern et al., these six features are present in the meta-file package.json, and are shown in Listing 1:

  • f1- Author: Name of the person who built this package. This indicator should be able to group packages built by the same author. For example, in line 11 of Listing 1, the author of this package is “James Halliday”.

  • f2- Author Domain: Email domain of the person who built this package. This indicator should show packages built by authors from the same organization or company. For example, in line 12 of Listing 1, the author domain of this package is “substack.net”.

  • f3- License: The license tells users under what terms the publishing organization permits the package to be used. For example, in line 5 of Listing 1, the license of this package is “MIT”.

  • f4- Tagged Keywords: An array of strings that helps people discover the package when it is listed in npm search. For example, in line 16 of Listing 1, the keywords of this package are “browser”, “require”, …, “javascript”.

  • f5- Version Released: The version forms an identifier that is assumed to be completely unique. Changes to the package should come with changes to the version. For example, in line 3 of Listing 1, the version of this package is “14.4.0”.

  • f6- Number of Dependencies: The number of package dependencies mapped to a version range. For example, in line 22 of Listing 1, the dependencies of this package are “JSONStream”, “assert”, and “through”.

\begin{lstlisting}[caption={The package.json meta-file of the browserify package}, label={code:package.json}]
1{
2  "name": "browserify",
3  "version": "14.4.0",
4  "description": "browser-side require() the node way",
5  "license": "MIT",
6  "repository": {
7    "type": "git",
9  },
10  "author": {
11    "name": "James Halliday",
12    "email": "mail@substack.net",
13    "url": "http://substack.net"
14  },
15  ...
16  "keywords": [
17    "browser",
18    "require",
19        ...
20    "javascript"
21  ],
22  "dependencies": {
23    "JSONStream": "^1.0.3",
24    "assert": "^1.4.0",
25    "through": "^2.3.4"
26  },
27  ...
28}
\end{lstlisting}
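As a concrete illustration of how the six features can be read out of a \texttt{package.json} file, the following sketch parses the listing above with Python's standard \texttt{json} module. The function and key names are ours, and the author field is assumed to be an object (npm also allows a plain string):

```python
import json

def extract_features(package_json):
    """Pull the six features f1-f6 out of a package.json string."""
    meta = json.loads(package_json)
    author = meta.get("author", {})
    return {
        "f1_author": author.get("name"),
        # the domain is everything after the '@' of the author email
        "f2_author_domain": author.get("email", "@").split("@")[-1],
        "f3_license": meta.get("license"),
        "f4_keywords": meta.get("keywords", []),
        "f5_version": meta.get("version"),
        "f6_n_dependencies": len(meta.get("dependencies", {})),
    }

raw = '''{
  "name": "browserify",
  "version": "14.4.0",
  "license": "MIT",
  "author": {"name": "James Halliday", "email": "mail@substack.net"},
  "keywords": ["browser", "require", "javascript"],
  "dependencies": {"JSONStream": "^1.0.3", "assert": "^1.4.0", "through": "^2.3.4"}
}'''
features = extract_features(raw)
```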

\subsection{Feature Vector Calculation}
One of the complexities of the data is the dimensions within each feature.
To cope with this complexity, we adopt a \textit{Vector Space Model (VSM)} from the Information Retrieval (IR) field to represent the high dimension of each feature.
For a vector space, we first need a corpus for each of the features.
Suppose we have three packages $P_1$, $P_2$, $P_3$.
For instance, for the license feature \textit{f3}, our corpus will be constructed as follows:
$$
\begin{array}{c|ccc}
z & P_1 & P_2 & P_3 \\ \hline
MIT & 0 & 1 & 0 \\
ISC & 1 & 0 & 1 \\
Apache & 0 & 0 & 0 \\
\end{array}
$$
In this example, the license corpus contains the terms MIT ($z_1$), ISC ($z_2$) and Apache ($z_3$).
Thus, we can represent the feature vector for a license as follows:
$$
\vec{P_1}^{\,f3}=
\begin{array}{ccc}
z_1 & z_2 & z_3 \\ \hline
0 & 1 & 0 \\
\end{array}
$$
Note that the corpus is an $m \times n$ matrix whose $(i,j)^{th}$ entry is a binary 0 or 1 indicating whether term $z_i$ is used by package $P_j$.
In this example, we can see that the MIT license is used only by $P_2$.
This kind of binary weighting function is used to construct the \textit{$\vec{P_x}^{\,f1}$, $\vec{P_x}^{\,f2}$, $\vec{P_x}^{\,f3}$} vectors.
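The binary weighting can be sketched directly; the helper name and toy corpus below are ours, chosen to match the license example:

```python
def binary_vector(corpus_terms, package_terms):
    """Binary VSM weighting: 1 if the package uses the corpus term, else 0."""
    return [1 if z in package_terms else 0 for z in corpus_terms]

license_corpus = ["MIT", "ISC", "Apache"]     # terms z1, z2, z3
p1 = binary_vector(license_corpus, {"ISC"})   # P1 is ISC-licensed -> [0, 1, 0]
p2 = binary_vector(license_corpus, {"MIT"})   # P2 is MIT-licensed -> [1, 0, 0]
```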

For the tagged keywords \textit{f4}, we use the \texttt{word2vec} technique of Mikolov et al. \cite{DBLP:journals/corr/abs-1301-3781} as a weighting function instead of the binary function.
The model learns vector representations of words, such that words that share common contexts in the corpus are located in close proximity to one another (i.e., measured by a similarity score) in that space.
We use the \texttt{word2vec} function from the gensim python library\footnote{gensim is a topic modelling library for python. Available at \url{https://radimrehurek.com/gensim/index.html}} to calculate a similarity score between words.
Below, we construct $f4$ for packages $P_1$, $P_2$, $P_3$.
$$
\begin{array}{c|ccc}
k & P_1 & P_2 & P_3 \\ \hline
Web & 1 & 0.733261108 & 0 \\
Http & 0.925269127 & 0.686954796 & 0 \\
Console & 0 & 0.889973283 & 0.476446807 \\
Server & 0 & 0.76110518 & 1 \\
\end{array}
$$

In this example, we can visually observe from the similarity scores that package $P_2$ has a much closer similarity to all four keywords \textit{Web} ($k_1$), \textit{Http} ($k_2$), \textit{Console} ($k_3$) and \textit{Server} ($k_4$) than $P_1$ and $P_3$.
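The word2vec weighting replaces binary entries with similarity scores in $[0,1]$. As a gensim-free stand-in, the sketch below scores keyword pairs by cosine similarity over two-dimensional toy embeddings; the vectors are invented for illustration and are not trained word2vec output:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# toy 2-D "embeddings"; real word2vec vectors have hundreds of dimensions
embeddings = {
    "web":     [0.9, 0.1],
    "http":    [0.8, 0.2],
    "console": [0.1, 0.9],
}
score = cosine(embeddings["web"], embeddings["http"])  # high: related keywords
```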
The remaining features (i.e., \textit{f5} and \textit{f6}) use a simpler set of metrics.
For the version released \textit{f5}, we use the release versioning to estimate the current maturity of the package.
In our example, browserify (i.e., $P_1$) is at current version 14.4.0.
Therefore, we represent this package as ${P_1}^{f5}=14.4$.
Similarly, we use a count of the dependencies as a measure of how dependent a package is on the ecosystem.
In this example, we find that browserify lists 3 dependencies (JSONStream, assert and through), hence
${P_1}^{f6}=3$.
Finally, we combine all the feature vectors to end up with a single vector for each package.
For instance:
\begin{equation*}
\vec{P_x} = \vec{P_x}^{\,f1} \wedge \vec{P_x}^{\,f2} \wedge \vec{P_x}^{\,f3} \wedge \vec{P_x}^{\,f4} \wedge {P_x}^{f5} \wedge {P_x}^{f6}
\end{equation*}

Since each vector is a matrix, it is important to note that the size of the dimension for each feature depends on the number of terms ($z_1, \ldots, z_x$) in that feature.
The key advantage of our TDA approach is the ability to process and visualize these types of high-dimensional datasets.
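Putting the pieces together, a package vector concatenates the binary vectors (f1-f3), the similarity-weighted keyword vector (f4), and the two scalars (f5, f6). In the sketch below the f1-f4 values are made-up toy vectors; only the version and dependency count come from the browserify example:

```python
def package_vector(f1, f2, f3, f4, version, n_deps):
    """Concatenate per-feature vectors and scalars into one package vector."""
    major, minor = version.split(".")[:2]
    f5 = float(f"{major}.{minor}")   # "14.4.0" -> 14.4, as for P1 above
    return f1 + f2 + f3 + f4 + [f5, float(n_deps)]

# toy f1-f4 vectors; f5/f6 taken from browserify (version 14.4.0, 3 deps)
vec = package_vector([1, 0], [0, 1], [1, 0, 0], [0.93, 0.0], "14.4.0", 3)
```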
101\subsection{Data Collection and Topology Representation}
102
103%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
104\begin{table}[t]
105  \centering
106  \caption{Summary of Data Collected}
107    \label{tab:data_collected}
  \begin{tabular}{l|c}
    \hline\hline
    \multicolumn{2}{c}{Dataset Statistics} \\ \hline
        Snapshot Date & July-1st-2016 $\sim$ July-15th-2016 \\
        \# Collected Packages (after filtering) & 72,650 \\
        \# Sampled packages (topology generation) & 10,000 \\
    \hline
  \end{tabular}
117\end{table}
118%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
119
120To evaluate our topology methodology, we used a sample of the npm ecosystem.
121In this section, we will describe the data collection and visualization analysis.
122
123\subsubsection{Data Extraction and Preprocessing}
124
125As shown in Listing \ref{code:package.json}, we are able to extract all our metrics by mining the \texttt{package.json} from each package.
Using the same method as prior work, we randomly selected and mined 151,100 JavaScript npm packages.
To improve data quality, we only selected packages that include all the required features.
128
To extract each dimension, we used python scripts with supporting libraries.
For the \texttt{word2vec} analysis, we used a standard threshold (i.e., 500 words) as a base for the algorithm.
131
132%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
133\begin{table}[t]
134  \centering
135  \caption{Size of dimension for each granularity of $\vec{P_x}$. Note the Top features are related to $f1$,$f2$,$f3$ and $f4$ features.}
136    \label{tab:scalability}
  \begin{tabular}{l|ccc}
138    \hline\hline
139    Degree & \# Dimensions & Data Size & Topology Generation Time \\ \hline
140    Top 20 &  82 x 10,000 & 1.9 MB & 10.53 minutes \\
141    Top 50 & 202 x 10,000 & 4.7 MB & 18.47 minutes \\
142    Top 100 & 402 x 10,000 & 9.4 MB & 39.42 minutes \\
143    Top 1000 & 3,380 x 10,000 & 86.8 MB & 52.16 minutes \\
144    \hline
145  \end{tabular}
146\end{table}
147%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
148
149
150
154
155
156%%%%%%%%%%%%%%%%%%%%%%%%%%%%f6%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\centering
  \begin{tabular}{cccc}
      \hspace{-10mm}
      \begin{minipage}{0.25\linewidth}
      \centering
            \includegraphics[keepaspectratio,scale=0.15,angle=0]{ResultPicture/f6-top20.jpg}
      \\
            (a) Top 20
      \end{minipage}
            &
      \begin{minipage}{0.25\linewidth}
      \centering
            \includegraphics[keepaspectratio,scale=0.15,angle=0]{ResultPicture/f6-top50.jpg}
            \\
            (b) Top 50
      \end{minipage}
            &
      \begin{minipage}{0.25\linewidth}
      \centering
            \includegraphics[keepaspectratio,scale=0.17,angle=0]{ResultPicture/f6-top100.jpg}
            \\
            (c) Top 100
      \end{minipage}
      &
      \begin{minipage}{0.25\linewidth}
      \centering
            \includegraphics[keepaspectratio,scale=0.15,angle=0]{ResultPicture/f6-top1000.jpg}
            \\
            (d) Top 1000
      \end{minipage}
      \end{tabular}
      \\
    \caption{Summary view of the JavaScript Package topology at different granularities. We find that the shape evolves, yet is able to maintain its key points.}
     \label{fig:evolution}
  \end{figure*}
207%%%%%%%%%%%%%%%%%%%%%%%%%%%%f6%%%%%%%%%%%%%%%%%%%%%%%%%%
208
\subsubsection{Using the Mapper algorithm in TDA}
We use the Knotter tool\footnote{\url{https://github.com/rosinality/knotter}}, which is an implementation of the mapper algorithm for TDA \cite{SPBG:SPBG07:091-100}.
The method provides a common framework which includes the notions of density clustering trees, disconnectivity graphs, and Reeb graphs, but which substantially generalizes all three.
We use t-Distributed Stochastic Neighbor Embedding (t-SNE) \cite{Maaten2008}, a technique for dimensionality reduction, together with our defined features, as the filters for the visualization construction.
214
216
217We use different layers of granularity of the features.
Table \ref{tab:scalability} shows the scale of the high-dimensional features (i.e., $f1, f2, f3, f4$) at the four levels of Top 20, Top 50, Top 100 and Top 1,000.
The intention is to understand whether the key features can still be seen at higher dimensions of the data.
220Due to the limitations of the tool, we were only able to load up to 10,000 packages (i.e., with loading times of over 50 minutes) into the topology.
221Table \ref{tab:scalability} details the data size and topology generation loading times.
222
287
288
289%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
290  \begin{figure*}[!t]
291     \center
292       \centering
293      \includegraphics[keepaspectratio,scale=0.5]{ResultPicture/f6-top1000-rc.jpg}\\
294    \caption{The npm Ecosystem Topology. Color is related to the \textit{f6} feature.}
295     \label{fig:overview}
296    \end{figure*}
297%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
298
299\subsection{Analysis and Interpretation of the Topology}
300
Our analysis consists of a visual inspection and identification of important features in the shape of the topology.
We interpret the results of the topology using two levels of analysis:
303
304\subsubsection{Topology Overview and Shape Analysis}
We analyze the topology of the ecosystem, examining the shape of the data over the different granularities (i.e., Top 20, Top 50, Top 100, Top 1,000).
We then investigate the dominant features that outline the shape of the topology.
307
308%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}[t]
  \centering
  \caption{‘GitHub strong’ vs. npm strong tagged keywords ($f4$) as discussed by Wittern et al. \cite{Wittern2016}.}
    \label{tab:strong}
  \begin{tabular}{c|c}
    \hline\hline
    ‘GitHub strong’ & npm strong \\ \hline
    gruntplugin & util \\
    gulpplugin & array \\
    express & buffer \\
    react & string \\
    authenticate & file \\
        \hline
  \end{tabular}
\end{table}
324%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
325
326One particular analysis is the categorization of tagged keywords.
As shown in Table \ref{tab:strong}, Wittern et al. \cite{Wittern2016} found a set of keywords likely to be related either to applications (i.e., \texttt{GitHub strong}) or to the npm package ecosystem (i.e., \texttt{npm strong}).
328As part of our analysis, we would like to identify the locations of libraries that use these sets of keywords.
329
\subsubsection{Deeper Topology Feature Analysis}
We analyze each feature to identify interesting observations and their relationships to the other features.
Our method is to identify the locations of the most frequently occurring terms (i.e., the top 5 $z_x$) of each feature.
For instance, in reference to the authors ($f1$), we map the libraries that belong to the top 5 authors of npm packages.
334In the case study, we specifically look at the Author ($f1$), Author Domain ($f2$), License ($f3$) and Tagged Keywords ($f4$) features of the topology.
335
336
337\section{Results}
338In this section, we discuss our results in terms of (1) the topology overview and (2) topology features for our constructed npm ecosystem topology.
339
340\subsection{JavaScript Package Ecosystem Topology}
341\vspace{2mm}
342\begin{quote}
343\textit{“The topological shape becomes more refined as more data is added”}
344\end{quote}
345
Figure \ref{fig:evolution} depicts the shape of the data at the different granularity levels of dimension (i.e., Top 20, Top 50, Top 100 and Top 1,000).
We can see from the figures that each shape is geometrically different; however, the key points of the shape remain the same.
This result indicates that libraries with high feature values are already apparent at the Top 20.
However, it could be argued that the shape becomes more refined as more data is added.
By refined, we mean that the extreme points of the data (i.e., represented by the edges of the shape) become more apparent.
For this reason, we decided to perform the rest of our analysis at the Top 1,000.
352
353Figure \ref{fig:overview} depicts a detailed analysis of important points in the topology mapped to some of the feature attributes.
354From this figure, we are able to extract the following insights.
355
356\vspace{2mm}
357\begin{quote}
358\textit{“The number of package dependencies is a strong feature in the topology”}
359\end{quote}
360
We find that the shape of the topology is influenced by the number of dependencies adopted by a package ($f6$).
This is clearly highlighted at the top-right of the shape.
In fact, such high-dependency packages may risk becoming blacklisted\footnote{a blog on these types of packages and their impact on the ecosystem is at \url{https://github.com/jfhbrook/hoarders/issues/2}} amid the debate over whether or not they are simply hoarding other packages.
Conversely, the other two points show a lower set of dependencies.
365
366%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
367\begin{table}[!t]
368  \centering
  \caption{Top 5 ranking (highest frequency) for features $f5$ and $f6$}
    \label{tab:f56}
  \begin{tabular}{c|cc}
    \hline\hline
    Frequency & Versions (f5) & \# Dependencies (f6) \\ \hline
374    1 & 0.0 (2,214) & cordova-plugin-require-bluetoothle (112) \\
375    2 & 1.0 (2,202) & npm (85)\\
376    3 & 0.1 (1,682) & gtb (62) \\
377    4 & 0.2 (675) & mikser (61) \\
378    5 & 1.1 (579) & react-setup (61)\\
379         & 42.2 (1) & …\\
380        \hline
381  \end{tabular}
382\end{table}
383%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
384
385\vspace{2mm}
386\begin{quote}
\textit{“Packages that are more likely to be used within the ecosystem are located separately from packages meant for application usage outside the ecosystem”}
388\end{quote}
389
Figure \ref{fig:overview} clearly shows that packages containing the \texttt{‘npm strong’} (i.e., ecosystem-use) keywords are located apart from libraries that are \texttt{‘GitHub strong’} (i.e., application-use).
We found that the released version was not an important feature in the topology.
However, as shown in Figure \ref{fig:overview}, we can identify that the package with the most releases is located near the \texttt{‘npm strong’} libraries.
In fact, we found this package to be \textit{ydr-utils}, which is indeed used by specialized packages within the npm ecosystem\footnote{inspection of the readme.md file shows that it is used by a specialized set of npm packages \url{https://github.com/cloudcome/nodejs-ydr-utils}}.
394
395\subsection{Topology Features}
396%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
397\begin{table*}[!t]
398  \centering
399  \caption{Top 5 ranking (most frequent terms ($z_x$)) for each Feature}
400    \label{tab:f1234}
  \begin{tabular}{c|cccc}
    \hline\hline
    Frequency Rank & Author (f1) & Author Domain (f2) & License (f3) & Tagged Keywords (f4) \\ \hline
    1 & Author 1 (437) & gmail.com (7,576) & MIT (6,715) & react (3,084) \\
    2 & Author 2 (436) & substack.net (328) & ISC (1,191) & api (2,984) \\
    3 & Author 3 (328) & outlook.com (173) & APACHE-2.0 (950) & yeoman-generator (2,280) \\
    4 & Author 4 (275) & gmx.de (134) & BSD-2-CLAUSE (524) & cli (2,210) \\
    5 & Author 5 (265) & qq.com (97) & BSD-3-CLAUSE (452) & css (2,173) \\
409    \hline
410  \end{tabular}
411\end{table*}
412%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[p]
\centering
\hfil
\hspace*{-6mm}
 \begin{tabular}{cc}
  \begin{minipage}{0.5\linewidth}
  \centering
    \includegraphics[keepaspectratio,scale=0.4,angle=90]{ResultPicture/f1-top1000-r.jpg}
  \\
    \caption{Author Top 1000}
  \label{fig:f1}
  \end{minipage}
  &%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{minipage}{0.5\linewidth}
  \centering
    \includegraphics[keepaspectratio,scale=0.4,angle=90]{ResultPicture/f2-top1000-r.jpg}
  \\
    \caption{Author Domain Top 1000}
  \label{fig:f2}
  \end{minipage}
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\hspace*{-6mm}
 \begin{tabular}{cc}
  \begin{minipage}{0.5\linewidth}
  \centering
    \includegraphics[keepaspectratio,scale=0.4,angle=90]{ResultPicture/f3-top1000-r.jpg}
  \\
    \caption{License Top 1000}
  \label{fig:f3}
  \end{minipage}
  &%%%%%%%%%%%%%%%%%%%%
    \begin{minipage}{0.5\linewidth}
  \centering
  \includegraphics[keepaspectratio,scale=0.4,angle=90]{ResultPicture/f4-top1000-r.jpg}
  \\
    \caption{Tagged Keywords Top 1000}
  \label{fig:f4}
  \end{minipage}
\end{tabular}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%figures and table of archetype analysis
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{figure*}[!t]
     \center
    \hspace{-5mm}
       \centering \includegraphics[keepaspectratio,scale=0.545]{ResultPicture/pcplot.pdf}\\
    \caption{Parallel coordinate plot for the Top 20 npm dataset. The red line is archetype 1 (A1), the green line is archetype 2 (A2), and the blue line is archetype 3 (A3).}
     \label{fig:pcplot}
    \end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
  \centering
  \caption{Packages that were identified as close to the extreme points of each archetype A1, A2 and A3.}
    \label{tab:aa}
  \begin{tabular}{c|l}
    \hline\hline
    Archetype & Identified Packages\\ \hline
    A1 & tar-parse, turtle-run, marked-sanitized, haversort, bmxplayjs \\
        A2 & statsd-influxdb-backend, ardeidae, demo-blog-system, git-ssb-web, social-media-resolver \\
        A3 & stream-viz, programify, polyclay-couch, meshblu-core-task-check-update-device-is-valid, apidoc-almond \\
        \hline
  \end{tabular}
\end{table*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{figure}[!t]
     \center
       \centering
      \includegraphics[keepaspectratio,scale=0.4]{ResultPicture/simplexplot.pdf}\\
    \caption{A simplex plot in which each dot represents a package in the ecosystem. Note that the extreme points represent the archetypes.}
     \label{fig:simplexplot}
    \end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Figures \ref{fig:f1}, \ref{fig:f2}, \ref{fig:f3} and \ref{fig:f4} show detailed feature information related to authors ($f1$), author domains ($f2$), licenses ($f3$) and tagged keywords ($f4$).
Tables \ref{tab:f1234} and \ref{tab:f56} supplement these figures by showing the Top 5 most frequent terms for all features.
Drawing from all the presented information, we are able to make the following observations:
\vspace{2mm}
\begin{quote}
\textit{“Top authors of packages tend to release packages intended for usage within the ecosystem itself”}
\end{quote}
Figure \ref{fig:f1} and Table \ref{tab:f1234} show that the Top 3 authors (i.e., author 1 with 437 packages, author 2 with 436 packages and author 3 with 328 packages) are located in the same region as the npm strong libraries, thus providing evidence that the Top 3 authors were more likely to develop packages for the npm ecosystem.
However, the $4^{th}$ (with 275 packages) and $5^{th}$ (with 265 packages) ranked authors develop packages aimed at applications (i.e., ‘GitHub strong’).

\vspace{2mm}
\begin{quote}
\textit{“Packages are not likely to be affiliated to an organizational domain”}
\end{quote}

Figure \ref{fig:f2} and Table \ref{tab:f1234} show that the \texttt{gmail} domain is by far the most used by authors of npm packages (7,576 packages).
The second most used domain is \texttt{substack.net}\footnote{author GitHub profile at \url{https://github.com/substack}}, which belongs to Author 3.
This evidence suggests that many of the packages are indeed contributed by individuals rather than by developers representing a single organization.
Furthermore, authors with no organizational affiliation are more likely to contribute packages that will be used within the npm ecosystem.

\vspace{2mm}
\begin{quote}
\textit{“Packages are licensed by a dominant software license (i.e., MIT)”}
\end{quote}

Figure \ref{fig:f3} and Table \ref{tab:f1234} show the MIT license to be the most widespread license used for packages in the npm ecosystem (6,715 packages).
The next most popular licenses are ISC (1,191 packages) and APACHE-2.0 (950 packages).

\vspace{2mm}
\begin{quote}
\textit{“Popular tagged keywords are common with packages across the topology”}
\end{quote}

Figure \ref{fig:f4} and Table \ref{tab:f1234} illustrate how the most frequent keywords (i.e., react, api, yeoman-generator, cli and css) are used across the topology of packages.
This result provides evidence that the individual tagged keywords are generic, and therefore not strong indicators for a software ecosystem topology.
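The frequency rankings reported in Tables \ref{tab:f1234} and \ref{tab:f56} reduce to simple term counting per feature. A minimal sketch of such a ranking is shown below; the package names and keyword lists are illustrative placeholders, not drawn from our dataset:

```python
from collections import Counter

# Hypothetical tagged-keyword lists for a handful of packages
# (illustrative values only, not the actual npm dataset).
packages = [
    {"name": "pkg-a", "keywords": ["react", "api", "cli"]},
    {"name": "pkg-b", "keywords": ["react", "css"]},
    {"name": "pkg-c", "keywords": ["api", "react", "css"]},
    {"name": "pkg-d", "keywords": ["cli", "api"]},
]

def top_terms(packages, feature, n=5):
    """Rank the n most frequent terms of a feature across all packages."""
    counts = Counter(term for p in packages for term in p[feature])
    return counts.most_common(n)

print(top_terms(packages, "keywords"))
# -> [('react', 3), ('api', 3), ('cli', 2), ('css', 2)]
```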
\section{Comparison with Archetypal Analysis}

An archetypal analysis is a statistical method that synthesizes a set of multivariate observations through a few \textit{\textbf{archetypes}}, which lie on the boundary of the data scatter and represent pure individual types \cite{Porzio:2008:UAB:1416593.1416598}.
We use archetypal analysis as a point of comparison to assess how useful the topology is for visualizing and analyzing high-dimensional data.
In detail, archetypal analysis describes individual data points based on their distance from extreme points (i.e., the archetypes), whereas cluster analysis describes its segments using average members as prototypes.
We used the \textsf{R} package \textsf{archetypes} \cite{Eugster:Leisch:2009:JSSOBK:v30i08} to analyze the Top 20 npm dataset (see Figure \ref{fig:evolution} and Table \ref{tab:scalability}) used in the previous TDA analysis.
Using the elbow criterion on the curve of the residual sum of squares (RSS), we determined $k = 3$ as the number of archetypes.
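The elbow criterion we applied can be sketched as follows; the RSS values below are illustrative placeholders, not the actual values from our archetypes fit:

```python
def elbow_k(rss):
    """Pick k at the 'elbow' of an RSS curve: the point where the
    decrease in RSS slows down the most (largest second difference).
    rss[i] is the residual sum of squares for k = i + 1 archetypes."""
    # Second difference: (rss[i-1] - rss[i]) - (rss[i] - rss[i+1])
    second_diff = [rss[i - 1] - 2 * rss[i] + rss[i + 1]
                   for i in range(1, len(rss) - 1)]
    return second_diff.index(max(second_diff)) + 2  # +2: k is 1-based

# Illustrative RSS curve that flattens after k = 3
rss = [10.0, 6.0, 3.0, 2.6, 2.4, 2.3]
print(elbow_k(rss))  # -> 3
```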

Figure \ref{fig:pcplot} presents a parallel coordinate plot of all packages, where each line represents one package with all of its feature values.
The three colored lines are the archetypes in this data (red is Archetype 1, green is Archetype 2, and blue is Archetype 3).
We find that the keywords are the strongest feature indicators.
In addition, Table \ref{tab:aa} presents some of the actual packages closest to the three obtained archetypes.
From our analysis, we were able to qualitatively summarize our findings related to tagged keywords ($f4$):
\vspace{2mm}
\begin{itemize}
\item \textit{Archetype 1} (A1) contains packages with keywords such as \textit{web, plugin, test, http, express, node, api and server}.
\item \textit{Archetype 2} (A2) contains packages with keywords such as \textit{html, gulpplugin, css, javascript and gulp}.
\item \textit{Archetype 3} (A3) contains fewer packages than the other two archetypes.
\end{itemize}

Figure \ref{fig:simplexplot} shows a triangular plot of the archetypal analysis results, where each dot represents a package in the npm ecosystem.
From the figure, we can observe that the packages are widely distributed among the three archetypes.
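Each dot in Figure \ref{fig:simplexplot} is positioned according to the convex-combination weights of a package over the three archetypes. As an illustrative sketch (using made-up 2-D archetype positions, not our actual feature space), such weights can be recovered as barycentric coordinates:

```python
def barycentric(p, a1, a2, a3):
    """Weights (w1, w2, w3), summing to 1, that express point p as a
    convex combination of the 2-D archetype positions a1, a2, a3."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a1, a2, a3
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2

# Archetypes at the corners of a triangle; a package midway between A1 and A2
A1, A2, A3 = (0.0, 0.0), (1.0, 0.0), (0.5, 1.0)
print(barycentric((0.5, 0.0), A1, A2, A3))  # -> (0.5, 0.5, 0.0)
```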
%Note that the archetypes are not always actual contributors who are being observed, but are generated from the multidimensional data to be representatives of pure archetypes. Each observed library can be regarded as a mixture of those archetypes.
This provides us with evidence to argue that the topology yields more insights and patterns than this archetypal analysis.
One explanation is the limited number of archetypes.
We observe that summarizing the data with such a small number of representations is prone to losing information, leaving the raw data too complex for interpretation.
With respect to the topology, investigating the topology's shape provides us with a more flexible analysis while reducing the dataset.

\section{Discussion}
In this section, we discuss the implications of our study and then follow up with the threats to its validity.
This includes the possible applications of how developers can leverage the software ecosystem topology.

\subsection{Implications}
\label{sec:imp}
We discuss two areas where understanding the topology is beneficial for both practitioners and researchers alike.
The first is the search and selection of components (i.e., packages) within the ecosystem.
There has been some work that empirically studies the update and dependency relationships within the ecosystem \cite{Decan2017,Kikas2017,KulaEMSE2017,Ihara2017,Ishio2017}.
We envision that, based on a query of features, a developer should be able to assess their options from the topology and make a more informed decision on similar or recommended libraries.
The topology may also be used as a guide for novice developers.
For example, a developer can use the topology as a guide to some of the more common practices (e.g., licensing the package under MIT).
For future work, we would like to further explore how an ecosystem topology can be leveraged to search for and recommend similar or useful packages to a developer.

The second benefit is related to the sustainability and scalability of a software ecosystem.
We believe that methods such as the ecosystem topology provide a more empirical means to assess various inconspicuous patterns within an ecosystem.
For instance, the topology can reveal the location of packages that are made either for the npm ecosystem or for application usage.
Such patterns may become indicators of ecosystem health.
For future work, we would like to explore additional features, especially the more social (e.g., contributors and open source development organization activities) or technical aspects (e.g., source code evolution) of packages within the ecosystem.
Furthermore, we would like to study the evolution of the ecosystem topology as additional future work.

There are many challenges related to software ecosystems.
Work by Serebrenik and Mens \cite{Serebrenik2015} grouped these challenges as: \textit{Architecture and Design, Governance, Dynamics and Evolution, Data Analytics, Domain-Specific Ecosystems Solutions, and Ecosystems Analysis}.
We believe that topology analysis may prove to be useful in addressing some of these issues at a higher level.
Other future avenues for research are related to the extension of our method and techniques.
We find that the topology provides a holistic method to visualize and compare these features at a higher level.
In this work we only implement the topology method (i.e., TDA Mapper) within the TDA field.
Future work may include a more in-depth analysis using other TDA concepts such as persistent homology.
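For reference, the core of the Mapper construction reduces to a few steps: project each point through a lens (filter) function, cover the lens range with overlapping intervals, cluster the points falling in each interval, and connect clusters that share points. The following is a toy sketch only, with a 1-D lens, scalar points, and a simplified single-linkage clustering; real Mapper implementations offer far more options:

```python
def mapper_1d(points, lens, n_intervals=4, overlap=0.5, eps=1.5):
    """Toy Mapper with a 1-D lens over scalar points. Returns
    (nodes, edges): each node is a set of point indices (one cluster),
    each edge joins two nodes whose clusters share at least one point."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    nodes = []
    for i in range(n_intervals):
        # Overlapping interval of the cover along the lens axis
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        members = [j for j, v in enumerate(values) if a <= v <= b]
        # Simplified single-linkage clustering: points closer than eps
        # (in the original space) end up in the same cluster
        clusters = []
        for j in members:
            hits = [c for c in clusters
                    if any(abs(points[j] - points[k]) <= eps for k in c)]
            merged = {j}.union(*hits) if hits else {j}
            clusters = [c for c in clusters if c not in hits] + [merged]
        nodes.extend(clusters)
    edges = [(u, v) for u in range(len(nodes))
             for v in range(u + 1, len(nodes)) if nodes[u] & nodes[v]]
    return nodes, edges

# Two well-separated groups on a line yield two components in the graph
nodes, edges = mapper_1d([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], lens=lambda x: x)
print(edges)  # -> [(0, 1), (2, 3)]
```

The overlap of the cover is what produces the edges: a point that falls into two adjacent intervals appears in two clusters, linking them in the output graph.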

Our overall vision is towards a more concrete means of automated library recommendation and categorization.
We believe that this study is a step in this direction.
Hence, future work will include a deeper look at TDA to provide more realistic and useful library categorizations that are actionable for software developers.


\subsection{Threats to Validity}

\subsubsection{Construct}
This validity is concerned with threats to the construction of the topology, which is the selection of the features.
We understand that there is a plethora of other, potentially stronger indicators that could be used in the topology.
In this work, we use six features that are popularly used in prior works \cite{Wittern2016,Decan2017,Kikas2017,Bavota2015}.
As discussed in the prior section (Section \ref{sec:imp}), we plan to expand our feature list in the future.

\subsubsection{Internal}
This validity is related to the accuracy of the data collected and the tools used in the experiments.
For the topology generation, we randomly selected 10,000 npm projects (discarding any package that was missing any of the features).
We understand that, given the rate at which an ecosystem changes, the results may quickly become outdated.
However, based on the different granularities (Top 20, Top 50, Top 100, Top 1,000), we believe that the shape of the data may change but the structure may still resemble our current topology.
To validate this, we would have to experiment with much more data.

A minor threat to our study is our \textit{vsm} formulation for each package.
For instance, we use the \texttt{word2vec} technique for the tagged keyword feature.
Our main rationale is that word2vec provides a more robust technique than the basic binary representation.
Although this is not empirically evaluated, we are confident that the results are representative.
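For reference, the basic binary representation mentioned above can be sketched as follows (the package names and keywords are illustrative placeholders); unlike word2vec embeddings, it treats every pair of distinct keywords as equally dissimilar:

```python
def binary_vsm(packages):
    """Build binary keyword vectors: vec[i] = 1 iff the package
    is tagged with vocabulary term i."""
    vocab = sorted({kw for kws in packages.values() for kw in kws})
    index = {kw: i for i, kw in enumerate(vocab)}
    vectors = {}
    for name, kws in packages.items():
        vec = [0] * len(vocab)
        for kw in kws:
            vec[index[kw]] = 1
        vectors[name] = vec
    return vocab, vectors

# Illustrative packages (not actual npm data)
pkgs = {"pkg-a": ["react", "css"], "pkg-b": ["api", "react"]}
vocab, vectors = binary_vsm(pkgs)
print(vocab)             # -> ['api', 'css', 'react']
print(vectors["pkg-a"])  # -> [0, 1, 1]
```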
The second threat to validity is the accuracy of the tool used