Array Programming with NumPy

06/18/2020 · by Charles R. Harris, et al.

Array programming provides a powerful, compact, expressive syntax for accessing, manipulating, and operating on data in vectors, matrices, and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It plays an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance, and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves and the first imaging of a black hole. Here we show how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring, and analyzing scientific data. NumPy is the foundation upon which the entire scientific Python universe is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Because of its central position in the ecosystem, NumPy increasingly plays the role of an interoperability layer between these new array computation libraries.


NumPy arrays

The NumPy array is a data structure that efficiently stores and accesses multidimensional arrays [18], also known as tensors, and enables a wide variety of scientific computation. It consists of a pointer to memory, along with metadata used to interpret the data stored there, notably data type, shape, and strides (Fig. 1a).

The data type describes the nature of elements stored in an array. An array has a single data type, and each array element occupies the same number of bytes in memory. Examples of data types include real and complex numbers (of lower and higher precision), strings, timestamps, and pointers to Python objects.

The shape of an array determines the number of elements along each axis, and the number of axes is the array’s dimensionality. For example, a vector of numbers can be stored as a one-dimensional array of shape (n,), while color videos are four-dimensional arrays of shape (T, M, N, 3).

Strides are necessary to interpret computer memory, which stores elements linearly, as multidimensional arrays. They describe the number of bytes to move forward in memory to jump from row to row, column to column, and so forth. Consider, for example, a 2-D array of floating-point numbers with shape (4, 3), where each element occupies 8 bytes in memory. To move between consecutive columns, we need to jump forward 8 bytes in memory, and to access the next row, 3 × 8 = 24 bytes. The strides of that array are therefore (24, 8). NumPy can store arrays in either C or Fortran memory order, iterating first over either rows or columns. This allows external libraries written in those languages to access NumPy array data in memory directly.
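
For illustration, here is a minimal sketch (using the same 4 × 3 array as above) that inspects this metadata through the dtype, shape, and strides attributes of an ndarray:

import numpy as np

x = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 rows, 3 columns of 8-byte floats
print(x.dtype)    # float64
print(x.shape)    # (4, 3)
print(x.strides)  # (24, 8): 24 bytes to the next row, 8 bytes to the next column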

Users interact with NumPy arrays using indexing (to access subarrays or individual elements), operators (e.g., +, -, × for vectorized operations and @ for matrix multiplication), as well as array-aware functions; together, these provide an easily readable, expressive, high-level API for array programming, while NumPy deals with the underlying mechanics of making operations fast.

Indexing an array returns single elements, subarrays, or elements that satisfy a specific condition (Fig. 1b). Arrays can even be indexed using other arrays (Fig. 1c). Wherever possible, indexing that retrieves a subarray returns a view on the original array, such that data is shared between the two arrays. This provides a powerful way to operate on subsets of array data while limiting memory usage.
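
As a brief sketch of these indexing modes (element, slice, boolean, and array indexing are all standard NumPy syntax; the variable names are ours):

import numpy as np

x = np.arange(9).reshape(3, 3)
row = x[0]        # view of the first row; shares memory with x
sub = x[:2, 1:]   # 2x2 subarray; slicing also returns a view
big = x[x > 4]    # boolean indexing copies the elements that satisfy the condition
pick = x[[0, 2]]  # indexing with an array of indices selects rows 0 and 2
row[0] = 100      # modifies x as well, because row is a view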

To complement the array syntax, NumPy includes functions that perform vectorized calculations on arrays, including arithmetic, statistics, and trigonometry (Fig. 1d). Vectorization—operating on whole arrays rather than their individual elements—is essential to array programming. This means that operations that would take many tens of lines to express in languages such as C can often be implemented as a single, clear Python expression. The result is concise code that frees users to focus on the details of their analysis, while NumPy handles looping over array elements near-optimally, taking into consideration, for example, strides to make the best use of the computer’s fast cache memory.
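
For example, the following sketch evaluates a trigonometric identity over a million elements in a single expression, where a C implementation would require an explicit loop:

import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1_000_000)
y = np.sin(x) ** 2 + np.cos(x) ** 2  # the loop over elements runs in compiled code
assert np.allclose(y, 1.0)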

When performing a vectorized operation (such as addition) on two arrays with the same shape, it is clear what should happen. Through broadcasting, NumPy allows the dimensions to differ, while still producing results that appeal to intuition. A trivial example is the addition of a scalar value to an array, but broadcasting also generalizes to more complex examples such as scaling each column of an array or generating a grid of coordinates. In broadcasting, one or both arrays are virtually duplicated (that is, without copying any data in memory), so that the shapes of the operands match (Fig. 1d). Broadcasting is also applied when an array is indexed using arrays of indices (Fig. 1c).
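
A minimal sketch of these broadcasting cases (scalar addition, per-column scaling, and a coordinate grid):

import numpy as np

a = np.ones((3, 4))
b = a + 5.0                      # the scalar is broadcast against every element
scale = np.array([1.0, 2.0, 3.0, 4.0])
c = a * scale                    # shape (4,) is broadcast across rows, scaling each column
rows = np.arange(3).reshape(3, 1)
cols = np.arange(4)
grid = rows * 10 + cols          # shapes (3, 1) and (4,) broadcast to a (3, 4) grid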

Other array-aware functions, such as sum, mean, and maximum, perform element-by-element reductions, aggregating results across one, multiple, or all axes of a single array. For example, summing an n-dimensional array over d axes results in an (n-d)-dimensional array (Fig. 1f).
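
For instance (a short sketch in which d = 2 axes of an n = 3 dimensional array are summed away):

import numpy as np

x = np.arange(24).reshape(2, 3, 4)  # n = 3 dimensions
total = x.sum()                     # reduce over all axes: a scalar
per_page = x.sum(axis=0)            # shape (3, 4): one axis removed
per_row = x.sum(axis=(0, 2))        # shape (3,): n - d = 1 dimension remains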

NumPy also includes array-aware functions for creating, reshaping, concatenating, and padding arrays; searching, sorting, and counting data; and reading and writing files. It provides extensive support for generating pseudorandom numbers, includes an assortment of probability distributions, and performs accelerated linear algebra, utilizing one of several backends such as OpenBLAS [19, 20] or Intel MKL optimized for the CPUs at hand.

Altogether, the combination of a simple in-memory array representation, a syntax that closely mimics mathematics, and a variety of array-aware utility functions forms a productive and powerfully expressive array programming language.

Figure 2: NumPy is the base of the scientific Python ecosystem. Essential libraries and projects that depend on NumPy’s API gain access to new array implementations that support NumPy’s array protocols (Fig. 3).

Scientific Python ecosystem

Python is an open-source, general-purpose, interpreted programming language well-suited to standard programming tasks such as cleaning data, interacting with web resources, and parsing text. Adding fast array operations and linear algebra allows scientists to do all their work within a single language—and one that has the advantage of being famously easy to learn and teach, as witnessed by its adoption as a primary learning language in many universities.

Even though NumPy is not part of Python’s standard library, it benefits from a good relationship with the Python developers. Over the years, the Python language has added new features and special syntax so that NumPy could offer a more succinct and easier-to-read array notation. Since it is not part of the standard library, NumPy is able to dictate its own release policies and development patterns.

SciPy and Matplotlib are tightly coupled with NumPy—in terms of history, development, and use. SciPy provides fundamental algorithms for scientific computing, including mathematical, scientific, and engineering routines. Matplotlib generates publication-ready figures and visualizations. The combination of NumPy, SciPy, and Matplotlib, together with an advanced interactive environment like IPython [21] or Jupyter [22], provides a solid foundation for array programming in Python. The scientific Python ecosystem (Fig. 2) builds on top of this foundation to provide several widely used, technique-specific libraries [16, 17, 23], which in turn underlie numerous domain-specific projects [24, 25, 26, 27, 28, 29]. NumPy, at the base of the ecosystem of array-aware libraries, sets documentation standards, provides array testing infrastructure, and adds build support for Fortran and other compilers.

Many research groups have designed large, complex scientific libraries that add application-specific functionality to the ecosystem. For example, the eht-imaging library [30], developed by the Event Horizon Telescope collaboration for radio interferometry imaging, analysis, and simulation, relies on many lower-level components of the scientific Python ecosystem. NumPy arrays are used to store and manipulate numerical data at every step in the processing chain: from raw data through calibration and image reconstruction. SciPy supplies tools for general image processing tasks such as filtering and image alignment, while scikit-image, an image processing library that extends SciPy, provides higher-level functionality such as edge filters and Hough transforms. The scipy.optimize module performs mathematical optimization. NetworkX [23], a package for complex network analysis, is used to verify image comparison consistency. Astropy [24, 25] handles standard astronomical file formats and computes time/coordinate transformations. Matplotlib is used to visualize data and to generate the final image of the black hole.

The interactive environment created by the array programming foundation along with the surrounding ecosystem of tools—inside of IPython or Jupyter—is ideally suited to exploratory data analysis. Users fluidly inspect, manipulate, and visualize their data, and rapidly iterate to refine programming statements. These statements are then stitched together into imperative or functional programs, or notebooks containing both computation and narrative. Scientific computing beyond exploratory work is often done in a text editor or an integrated development environment (IDE) such as Spyder. This rich and productive environment has made Python popular for scientific research.

To complement this facility for exploratory work and rapid prototyping, NumPy has developed a culture of employing time-tested software engineering practices to improve collaboration and reduce error [31]. This culture is not only adopted by leaders in the project but also enthusiastically taught to newcomers. The NumPy team was an early adopter of distributed revision control and code review to improve collaboration on code, and of continuous testing that runs an extensive battery of automated tests for every proposed change to NumPy. The project also has comprehensive, high-quality documentation, integrated with the source code [32, 33, 34].

This culture of using best practices for producing reliable scientific software has been adopted by the ecosystem of libraries that build on NumPy. For example, the citation for a recent award given by the Royal Astronomical Society to Astropy states:

The Astropy Project has provided hundreds of junior scientists with experience in professional-standard software development practices including use of version control, unit testing, code review and issue tracking procedures. This is a vital skill set for modern researchers that is often missing from formal university education in physics or astronomy.

Community members explicitly work to address this lack of formal education through courses and workshops [35, 36, 37].

The recent rapid growth of data science, machine learning, and artificial intelligence has further and dramatically boosted the usage of scientific Python. Examples of its significant application, such as the eht-imaging library, now exist in almost every discipline in the natural and social sciences. These tools have become the primary software environment in many fields. NumPy and its ecosystem are commonly taught in university courses, boot camps, and summer schools, and are the focus of community conferences and workshops worldwide.

NumPy and its API have become truly ubiquitous.

Array proliferation and interoperability

Figure 3: NumPy’s API and array protocols expose new arrays to the ecosystem. In this example, NumPy’s mean function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (i.e., Dask in this case) and results in a new Dask array. Compare this code to the example code in Fig. 1g.

NumPy provides in-memory, multidimensional, homogeneously typed (i.e., single pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world’s largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific data sets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). Due to its in-memory data model, NumPy is currently unable to utilize such storage and specialized hardware directly. However, both distributed data and the parallel execution of GPUs, TPUs, and FPGAs map well to the paradigm of array programming: a gap, therefore, existed between available modern hardware architectures and the tools necessary to leverage their computational power.

The community’s efforts to fill this gap led to a proliferation of new array implementations. For example, each deep learning framework created its own arrays; the PyTorch [38], Tensorflow [39], Apache MXNet [40], and JAX arrays all have the capability to run on CPUs and GPUs, in a distributed fashion, utilizing lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays—which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on top of NumPy arrays as a data container and extend its capabilities. Distributed arrays are made possible in that way by Dask, and labeled arrays—referring to dimensions of an array by name rather than by index for clarity, compare x[:, 1] vs. x.loc[:, 'time']—by xarray [41].

Such libraries often mimic the NumPy API, because it lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms like the divergence of Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries—such as Theano and Caffe—have already ceased development. And each time that a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once, and would then benefit from switching between NumPy arrays, GPU arrays, distributed arrays, and so forth, as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well-specified API (Fig. 2).

To facilitate this interoperability, NumPy provides “protocols” (or contracts of operation), that allow for specialized arrays to be passed to NumPy functions (Fig. 3). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray, and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask. The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy’s high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes [42].

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. As with the rest of NumPy, we iteratively refine and add protocol designs to improve utility and simplify adoption.

Discussion

NumPy combines the expressive power of array programming, the performance of C, and the readability, usability, and versatility of Python in a mature, well-tested, well-documented, and community-developed library. Libraries in the scientific Python ecosystem provide fast implementations of most important algorithms. Where extreme optimization is warranted, compiled languages such as Cython [43], Numba [44], and Pythran [45], that extend Python and transparently accelerate bottlenecks, can be used. Because of NumPy’s simple memory model, it is easy to write low-level, hand-optimized code, usually in C or Fortran, to manipulate NumPy arrays and pass them back to Python. Furthermore, using array protocols, it is possible to utilize the full spectrum of specialized hardware acceleration with minimal changes to existing code.

NumPy was initially developed by students, faculty, and researchers to provide an advanced, open-source array programming library for Python, which was free to use and unencumbered by license servers, dongles, and the like. There was a sense of building something consequential together, for the benefit of many others. Participating in such an endeavor, within a welcoming community of like-minded individuals, held a powerful attraction for many early contributors.

These user-developers frequently had to write code from scratch to solve their own or their colleagues’ problems—often in low-level languages that precede Python, like Fortran [46] and C. To them, the advantages of an interactive, high-level array library were evident. The design of this new tool was informed by other powerful interactive programming languages for scientific computing such as Basis [47], Yorick [48], R [49], and APL [50], as well as commercial languages and environments like IDL and MATLAB.

What began as an attempt to add an array object to Python became the foundation of a vibrant ecosystem of tools. Now, a large amount of scientific work depends on NumPy being correct, fast, and stable. It is no longer a small community project, but is core scientific infrastructure.

The developer culture has matured: while initial development was highly informal, NumPy now has a roadmap and a process for proposing and discussing large changes. The project has formal governance structures and is fiscally sponsored by NumFOCUS, a nonprofit that promotes open practices in research, data, and scientific computing. Over the past few years, the project attracted its first funded development, sponsored by the Moore and Sloan Foundations, and received an award as part of the Chan Zuckerberg Initiative’s Essentials of Open Source Software program. With this funding, the project was (and is) able to have sustained focus over multiple months to implement substantial new features and improvements. That said, it still depends heavily on contributions made by graduate students and researchers in their free time.

NumPy is no longer just the foundational array library underlying the scientific Python ecosystem, but has also become the standard API for tensor computation and a central coordinating mechanism between array types and technologies in Python. Work continues to expand on and improve these interoperability features.

Over the next decade, we will face several challenges. New devices will be developed, and existing specialized hardware will evolve, to meet diminishing returns on Moore’s law. There will be more, and a wider variety of, data science practitioners, a significant proportion of whom will be using NumPy. The scale of scientific data gathering will continue to expand, with the adoption of devices and instruments such as light sheet microscopes and the Large Synoptic Survey Telescope (LSST) [51]. New generation languages, interpreters, and compilers, such as Rust [52], Julia [53], and LLVM [54], will invent and determine the viability of new concepts and data structures.

Through various mechanisms described in this paper, NumPy is poised to embrace such a changing landscape, and to continue playing a leading role in interactive scientific computation. To do so will require sustained funding from government, academia, and industry. But, importantly, it will also need a new generation of graduate students and other developers to engage, to build a NumPy that meets the needs of the next decade of data science.

References

  • [1] K. E. Iverson, “Notation as a tool of thought,” Communications of the ACM, vol. 23, p. 444–465, Aug. 1980.
  • [2] P. F. Dubois, “Python: Batteries included,” Computing in Science & Engineering, vol. 9, no. 3, pp. 7–9, 2007.
  • [3] T. E. Oliphant, “Python for scientific computing,” Computing in Science & Engineering, vol. 9, pp. 10–20, May-June 2007.
  • [4] K. J. Millman and M. Aivazis, “Python for scientists and engineers,” Computing in Science & Engineering, vol. 13, no. 2, pp. 9–12, 2011.
  • [5] F. Pérez, B. E. Granger, and J. D. Hunter, “Python: an ecosystem for scientific computing,” Computing in Science & Engineering, vol. 13, no. 2, pp. 13–21, 2011.
  • [6] B. P. Abbott, R. Abbott, T. Abbott, M. Abernathy, F. Acernese, K. Ackley, C. Adams, T. Adams, P. Addesso, R. Adhikari, et al., “Observation of gravitational waves from a binary black hole merger,” Physical Review Letters, vol. 116, no. 6, p. 061102, 2016.
  • [7] A. A. Chael, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, and K. L. Bouman, “High-resolution linear polarimetric imaging for the event horizon telescope,” The Astrophysical Journal, vol. 829, no. 1, p. 11, 2016.
  • [8] P. F. Dubois, K. Hinsen, and J. Hugunin, “Numerical Python,” Computers in Physics, vol. 10, no. 3, pp. 262–267, 1996.
  • [9] D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. E. Oliphant, “An open source project: Numerical Python,” 2001.
  • [10] T.-Y. Yang, G. Furnish, and P. F. Dubois, “Steering object-oriented scientific computations,” in Proceedings of TOOLS USA 97. International Conference on Technology of Object Oriented Systems and Languages, pp. 112–119, IEEE, 1997.
  • [11] P. Greenfield, J. T. Miller, J. Hsu, and R. L. White, “numarray: A new scientific array package for Python,” PyCon DC, 2003.
  • [12] T. E. Oliphant, Guide to NumPy. Trelgol Publishing USA, 1st ed., 2006.
  • [13] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0—fundamental algorithms for scientific computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.
  • [14] J. D. Hunter, “Matplotlib: A 2D graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
  • [15] W. McKinney, “Data structures for statistical computing in Python,” in Proceedings of the 9th Python in Science Conference (S. van der Walt and J. Millman, eds.), pp. 51 – 56, 2010.
  • [16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
  • [17] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors, “scikit-image: image processing in Python,” PeerJ, vol. 2, p. e453, 2014.
  • [18] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011.
  • [19] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, “Augem: automatically generate high performance dense linear algebra kernels on x86 cpus,” in SC’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–12, IEEE, 2013.
  • [20] Z. Xianyi, W. Qian, and Z. Yunquan, “Model-driven level 3 blas performance optimization on loongson 3a processor,” in 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pp. 684–691, IEEE, 2012.
  • [21] F. Pérez and B. E. Granger, “IPython: a system for interactive scientific computing,” Computing in Science & Engineering, vol. 9, no. 3, pp. 21–29, 2007.
  • [22] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, and C. Willing, “Jupyter Notebooks—a publishing format for reproducible computational workflows,” in Positioning and Power in Academic Publishing: Players, Agents and Agendas (F. Loizides and B. Schmidt, eds.), pp. 87–90, IOS Press, 2016.
  • [23] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using NetworkX,” in Proceedings of the 7th Python in Science Conference (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), (Pasadena, CA USA), pp. 11–15, 2008.
  • [24] Astropy Collaboration, T. P. Robitaille, E. J. Tollerud, P. Greenfield, M. Droettboom, E. Bray, T. Aldcroft, M. Davis, A. Ginsburg, A. M. Price-Whelan, W. E. Kerzendorf, A. Conley, N. Crighton, K. Barbary, D. Muna, H. Ferguson, F. Grollier, M. M. Parikh, P. H. Nair, H. M. Unther, C. Deil, J. Woillez, S. Conseil, R. Kramer, J. E. H. Turner, L. Singer, R. Fox, B. A. Weaver, V. Zabalza, Z. I. Edwards, K. Azalee Bostroem, D. J. Burke, A. R. Casey, S. M. Crawford, N. Dencheva, J. Ely, T. Jenness, K. Labrie, P. L. Lim, F. Pierfederici, A. Pontzen, A. Ptak, B. Refsdal, M. Servillat, and O. Streicher, “Astropy: A community Python package for astronomy,” Astronomy & Astrophysics, vol. 558, p. A33, Oct. 2013.
  • [25] A. M. Price-Whelan, B. M. Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil, D. L. Shupe, M. W. Craig, N. Dencheva, A. Ginsburg, J. T. VanderPlas, L. D. Bradley, D. Pérez-Suárez, M. de Val-Borro, P. Paper Contributors, T. L. Aldcroft, K. L. Cruz, T. P. Robitaille, E. J. Tollerud, A. Coordination Committee, C. Ardelean, T. Babej, Y. P. Bach, M. Bachetti, A. V. Bakanov, S. P. Bamford, G. Barentsen, P. Barmby, A. Baumbach, K. L. Berry, F. Biscani, M. Boquien, K. A. Bostroem, L. G. Bouma, G. B. Brammer, E. M. Bray, H. Breytenbach, H. Buddelmeijer, D. J. Burke, G. Calderone, J. L. Cano Rodríguez, M. Cara, J. V. M. Cardoso, S. Cheedella, Y. Copin, L. Corrales, D. Crichton, D. D’Avella, C. Deil, É. Depagne, J. P. Dietrich, A. Donath, M. Droettboom, N. Earl, T. Erben, S. Fabbro, L. A. Ferreira, T. Finethy, R. T. Fox, L. H. Garrison, S. L. J. Gibbons, D. A. Goldstein, R. Gommers, J. P. Greco, P. Greenfield, A. M. Groener, F. Grollier, A. Hagen, P. Hirst, D. Homeier, A. J. Horton, G. Hosseinzadeh, L. Hu, J. S. Hunkeler, Ž. Ivezić, A. Jain, T. Jenness, G. Kanarek, S. Kendrew, N. S. Kern, W. E. Kerzendorf, A. Khvalko, J. King, D. Kirkby, A. M. Kulkarni, A. Kumar, A. Lee, D. Lenz, S. P. Littlefair, Z. Ma, D. M. Macleod, M. Mastropietro, C. McCully, S. Montagnac, B. M. Morris, M. Mueller, S. J. Mumford, D. Muna, N. A. Murphy, S. Nelson, G. H. Nguyen, J. P. Ninan, M. Nöthe, S. Ogaz, S. Oh, J. K. Parejko, N. Parley, S. Pascual, R. Patil, A. A. Patil, A. L. Plunkett, J. X. Prochaska, T. Rastogi, V. Reddy Janga, J. Sabater, P. Sakurikar, M. Seifert, L. E. Sherbert, H. Sherwood-Taylor, A. Y. Shih, J. Sick, M. T. Silbiger, S. Singanamalla, L. P. Singer, P. H. Sladen, K. A. Sooley, S. Sornarajah, O. Streicher, P. Teuben, S. W. Thomas, G. R. Tremblay, J. E. H. Turner, V. Terrón, M. H. van Kerkwijk, A. de la Vega, L. L. Watkins, B. A. Weaver, J. B. Whitmore, J. Woillez, V. Zabalza, and A. Contributors, “The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package,” The Astronomical Journal, vol. 156, p. 123, Sept. 2018.
  • [26] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, and M. J. L. de Hoon, “Biopython: freely available Python tools for computational molecular biology and bioinformatics,” Bioinformatics, vol. 25, no. 11, pp. 1422–1423, 2009.
  • [27] K. J. Millman and M. Brett, “Analysis of functional Magnetic Resonance Imaging in Python,” Computing in Science & Engineering, vol. 9, no. 3, pp. 52–55, 2007.
  • [28] T. SunPy Community, S. J. Mumford, S. Christe, D. Pérez-Suárez, J. Ireland, A. Y. Shih, A. R. Inglis, S. Liedtke, R. J. Hewett, F. Mayer, K. Hughitt, N. Freij, T. Meszaros, S. M. Bennett, M. Malocha, J. Evans, A. Agrawal, A. J. Leonard, T. P. Robitaille, B. Mampaey, J. Iván Campos-Rozo, and M. S. Kirk, “SunPy—Python for solar physics,” Computational Science and Discovery, vol. 8, p. 014009, Jan. 2015.
  • [29] J. Hamman, M. Rocklin, and R. Abernathy, “Pangeo: A Big-data Ecosystem for Scalable Earth System Science,” in EGU General Assembly Conference Abstracts, EGU General Assembly Conference Abstracts, p. 12146, Apr 2018.
  • [30] A. A. Chael, K. L. Bouman, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, L. L. Blackburn, K. Akiyama, M. Wielgus, C.-k. Chan, et al., “ehtim: Imaging, analysis, and simulation software for radio interferometry,” Astrophysics Source Code Library, 2019.
  • [31] K. J. Millman and F. Pérez, “Developing open-source scientific practice,” Implementing Reproducible Research. CRC Press, Boca Raton, FL, pp. 149–183, 2014.
  • [32] S. van der Walt, “The SciPy documentation project (technical overview),” in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 27–28, 2008.
  • [33] J. Harrington, “The SciPy documentation project,” in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 33–35, 2008.
  • [34] J. Harrington and D. Goldsmith, “Progress report: NumPy and SciPy documentation in 2009,” in Proceedings of the 8th Python in Science Conference (SciPy 2009) (G. Varoquaux, S. van der Walt, and K. J. Millman, eds.), pp. 84–87, 2009.
  • [35] G. Wilson, “Software carpentry: Getting scientists to write better code by making them more productive,” Computing in Science & Engineering, November–December 2006.
  • [36] J. E. Hannay, H. P. Langtangen, C. MacLeod, D. Pfahl, J. Singer, and G. Wilson, “How do scientists develop and use scientific software?,” in Proc. 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, 2009.
  • [37] K. J. Millman, M. Brett, R. Barnowski, and J.-B. Poline, “Teaching computational reproducibility for neuroimaging,” Frontiers in Neuroscience, vol. 12, p. 727, 2018.
  • [38] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 8024–8035, Curran Associates, Inc., 2019.
  • [39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [40] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
  • [41] S. Hoyer and J. Hamman, “xarray: N-D labeled arrays and datasets in Python,” Journal of Open Research Software, vol. 5, no. 1, 2017.
  • [42] P. Entschev, “Distributed multi-GPU computing with Dask, CuPy and RAPIDS.” EuroPython 2019, 2019.
  • [43] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith, “Cython: The best of both worlds,” Computing in Science & Engineering, vol. 13, no. 2, pp. 31–39, 2011.
  • [44] S. K. Lam, A. Pitrou, and S. Seibert, “Numba: A LLVM-based Python JIT compiler,” in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM ’15, (New York, NY, USA), pp. 7:1–7:6, ACM, 2015.
  • [45] S. Guelton, P. Brunet, M. Amini, A. Merlini, X. Corbillon, and A. Raynaud, “Pythran: Enabling static optimization of scientific python programs,” Computational Science & Discovery, vol. 8, no. 1, p. 014001, 2015.
  • [46] J. Dongarra, G. H. Golub, E. Grosse, C. Moler, and K. Moore, “Netlib and na-net: Building a scientific computing community,” IEEE Annals of the History of Computing, vol. 30, no. 2, pp. 30–41, 2008.
  • [47] P. F. Dubois, “The basis system,” tech. rep., Lawrence Livermore National Laboratory, CA (USA), 1989. UCRL-MA-118543, Parts I-VI.
  • [48] D. H. Munro and P. F. Dubois, “Using the yorick interpreted language,” Computers in Physics, vol. 9, no. 6, pp. 609–615, 1995.
  • [49] R. Ihaka and R. Gentleman, “R: a language for data analysis and graphics,” Journal of Computational and Graphical Statistics, vol. 5, no. 3, pp. 299–314, 1996.
  • [50] K. E. Iverson, “A programming language,” in Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, pp. 345–351, 1962.
  • [51] T. Jenness, F. Economou, K. Findeisen, F. Hernandez, J. Hoblitt, K. S. Krughoff, K. Lim, R. H. Lupton, F. Mueller, W. O’Mullane, et al., “Lsst data management software development practices and tools,” in Software and Cyberinfrastructure for Astronomy V, vol. 10707, p. 1070709, International Society for Optics and Photonics, 2018.
  • [52] N. D. Matsakis and F. S. Klock, “The rust language,” Ada Letters, vol. 34, pp. 103–104, Oct. 2014.
  • [53] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A fresh approach to numerical computing,” SIAM Review, vol. 59, no. 1, pp. 65–98, 2017.
  • [54] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis and transformation,” (San Jose, CA, USA), pp. 75–88, Mar 2004.
  • [55] P. Peterson, “F2PY: a tool for connecting Fortran and Python programs,” International Journal of Computational Science and Engineering, vol. 4, no. 4, pp. 296–305, 2009.
  • [56] The NumPy Project Community, “NumPy project governance,” 2015.
  • [57] The NumPy Project Community, “NumPy code of conduct,” 2018.
  • [58] D. Holth, “PEP 427 – the wheel binary package format 1.0,” 2012.
  • [59] M. Brett et al., “multibuild,” 2016.
  • [60] B. Griffith, P. Virtanen, N. Smith, M. van Kerkwijk, and S. Hoyer, “NEP 13 – a mechanism for overriding ufuncs,” 2013.
  • [61] S. Hoyer, M. Rocklin, M. van Kerkwijk, H. Abbasi, and E. Wieser, “NEP 18 – a dispatch mechanism for numpy’s high level array functions,” 2018.
  • [62] M. E. O’Neill, “Pcg: A family of simple fast space-efficient statistically good algorithms for random number generation,” Tech. Rep. HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, Sept. 2014.
  • [63] J. K. Salmon, M. A. Moraes, R. O. Dror, and D. E. Shaw, “Parallel random numbers: As easy as 1, 2, 3,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, (New York, NY, USA), pp. 16:1–16:12, ACM, 2011.
  • [64] C. Doty-Humphrey, “Practrand, version 0.94.”
  • [65] M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Transactions on Modeling and Computer Simulation, vol. 8, pp. 3–30, Jan. 1998.
  • [66] K. Sheppard, B. Duvenhage, P. de Buyl, and D. A. Ham, “bashtage/randomgen: Release 1.16.2,” Apr. 2019.
  • [67] G. Marsaglia and W. W. Tsang, “The ziggurat method for generating random variables,” Journal of Statistical Software, Articles, vol. 5, no. 8, pp. 1–7, 2000.
  • [68] D. Lemire, “Fast random integer generation in an interval,” ACM Transactions on Modeling and Computer Simulation, vol. 29, pp. 1–12, Jan 2019.
  • [69] TOP500, “Top 10 sites for November 2019,” 2019.
  • [70] WikiChip, “Astra – supercomputers,” 2019.
  • [71] Wikipedia, “Arm architecture,” 2019.
  • [72] NumPy Developers, “NumPy roadmap,” 2019.
  • [73] D. Ingram, “PEP 599 – the manylinux2014 platform tag,” 2019.

Methods

We use Git for version control and GitHub as the public hosting service for our official upstream repository (https://github.com/numpy/numpy). We each work in our own copy (or fork) of the project and use the upstream repository as our integration point. To get new code into the upstream repository, we use GitHub’s pull request (PR) mechanism. This allows us to review code before integrating it as well as to run a large number of tests on the modified code to ensure that the changes do not break expected behavior.

We also use GitHub’s issue tracking system to collect and triage problems and proposed improvements.

Library organization

Broadly, the NumPy library consists of the following parts: the NumPy array data structure ndarray; the so-called universal functions; a set of library functions for manipulating arrays and doing scientific computation; infrastructure libraries for unit tests and Python package building; and the program f2py for wrapping Fortran code in Python [55]. The ndarray and the universal functions are generally considered the core of the library. In the following, we give a brief summary of these components of the library.

Core.

The ndarray data structure and the universal functions make up the core of NumPy.

The ndarray is the data structure at the heart of NumPy. It stores regularly strided, homogeneously typed data inside a contiguous block of memory, allowing for the efficient representation of n-dimensional data. More details about the data structure are given in “The NumPy array: a structure for efficient numerical computation” [18].

The universal functions, or more concisely, ufuncs, are functions written in C that implement efficient looping over NumPy arrays. An important feature of ufuncs is the built-in implementation of broadcasting. For example, the function arctan2(x, y) is a ufunc that accepts two values and computes tan⁻¹(x/y). When arrays are passed in as the arguments, the ufunc will take care of looping over the dimensions of the inputs in such a way that if, say, x is a 1-D array with length 3, and y is a 2-D array with shape (2, 1), the output will be an array with shape (2, 3) (Fig. 1c). The ufunc machinery takes care of calling the function with all the appropriate combinations of input array elements to complete the output array. The elementary arithmetic operations of addition, multiplication, etc., are implemented as ufuncs, so that broadcasting also applies to expressions such as x + y * z.
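
A sketch of this behavior, using the shapes from the example above:

import numpy as np

x = np.array([0.0, 1.0, 2.0])  # 1-D array with length 3
y = np.array([[1.0], [2.0]])   # 2-D array with shape (2, 1)
z = np.arctan2(x, y)           # broadcasting produces an array of shape (2, 3)
print(z.shape)                 # (2, 3)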

Computing libraries.

NumPy provides a large library of functions for array manipulation and scientific computing, including functions for: creating, reshaping, concatenating, and padding arrays; searching, sorting and counting data in arrays; computing elementary statistics, such as the mean, median, variance, and standard deviation; file I/O; and more.

A suite of functions for computing the fast Fourier transform (FFT) and its inverse is provided.
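
For example (a minimal sketch; the signal and sampling choices are ours), a transform and its inverse round-trip a simple signal:

import numpy as np

t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2.0 * np.pi * 8.0 * t)  # an 8-cycle sine wave
spectrum = np.fft.fft(signal)           # forward FFT
recovered = np.fft.ifft(spectrum)       # inverse FFT
assert np.allclose(recovered.real, signal)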

NumPy’s linear algebra library includes functions for: solving linear systems of equations; computing various functions of a matrix, including the determinant, the norm, the inverse, and the pseudo-inverse; computing the Cholesky, eigenvalue, and singular value decompositions of a matrix; and more.
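
A brief sketch of several of these routines applied to a small symmetric positive-definite matrix (the matrix is our own example):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)    # solve the linear system A @ x == b
d = np.linalg.det(A)         # determinant
w, v = np.linalg.eigh(A)     # eigendecomposition of a symmetric matrix
U, s, Vt = np.linalg.svd(A)  # singular value decomposition
L = np.linalg.cholesky(A)    # Cholesky factor (A is positive definite)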

The random number generator library in NumPy provides alternative bit stream generators that supply the core function of generating random integers. A higher-level generator class that implements an assortment of probability distributions is provided. It includes the beta, gamma, and Weibull distributions, the univariate and multivariate normal distributions, and more.
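
As a sketch using the newer Generator API described under “Recent technical improvements” (the seed and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
rng.beta(a=2.0, b=5.0, size=3)           # beta distribution
rng.gamma(shape=2.0, scale=1.0, size=3)  # gamma distribution
rng.weibull(a=1.5, size=3)               # Weibull distribution
rng.standard_normal(3)                   # univariate normal
rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=2)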

Infrastructure libraries.

NumPy provides utilities for writing tests and for building Python packages.

The testing subpackage provides functions such as assert_allclose(actual, desired) that may be used in test suites for code that uses NumPy arrays.
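
For example (a minimal sketch), a floating-point result can be compared against an expected value within a tolerance:

import numpy as np
from numpy.testing import assert_allclose

result = np.sin(np.pi)                    # not exactly zero in floating point
assert_allclose(result, 0.0, atol=1e-12)  # passes: equal within the tolerance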

NumPy provides the subpackage distutils, which includes functions and classes to facilitate configuration, installation, and packaging of libraries that depend on NumPy. These can be used, for example, when publishing to the PyPI website.

F2py.

The program f2py is a tool for building NumPy-aware Python wrappers of Fortran functions. NumPy itself does not use any Fortran code; F2PY is part of NumPy for historical reasons.

Governance

NumPy adopted an official Governance Document on October 5, 2015 [56]. Project decisions are usually made by consensus of interested contributors. This means that, for most decisions, everyone is entrusted with veto power. A Steering Council, currently composed of 12 members, facilitates this process and oversees daily development of the project by contributing code and reviewing contributions from the community.

NumPy’s official Code of Conduct was approved on September 1, 2018 [57]. In brief, we strive to: be open; be empathetic, welcoming, friendly, and patient; be collaborative; be inquisitive; and be careful in the words that we choose. The Code of Conduct also specifies how breaches can be reported and outlines the process for responding to such reports.

Funding

In 2017, NumPy received its first large grants totaling 1.3M USD from the Gordon & Betty Moore and the Alfred P. Sloan foundations. Stéfan van der Walt is the PI and manages four programmers working on the project. These two grants focus on addressing the technical debt accrued over the years and on setting in place standards and architecture to encourage more sustainable development.

NumPy received a third grant for 195K USD from the Chan Zuckerberg Initiative at the end of 2019 with Ralf Gommers as the PI. This grant focuses on better serving NumPy’s large number of beginning to intermediate level users and on growing the community of NumPy contributors. It will also provide support to OpenBLAS, on which NumPy depends for accelerated linear algebra.

Finally, since May 2019 the project receives a small amount annually from Tidelift, which is used to fund things like documentation and website improvements.

Developers

NumPy is currently maintained by a group of 23 contributors with commit rights to the NumPy code base. Out of these, 17 maintainers were active in 2019, 4 of whom were paid to work on the project full-time. Additionally, there are a few long-term developers who contribute to and maintain specific parts of NumPy but are not officially maintainers.

Over the course of its history, NumPy has attracted PRs from 823 contributors. However, its development relies heavily on a small number of active maintainers, who together account for more than half of the contributions.

At a release cycle of about every half year, the five recent releases in the years 2018 and 2019 averaged about 450 PRs each, with each release attracting more than a hundred new contributors. (Note that before mid-2011, NumPy development did not happen on github.com; all data provided here are based on the development that happened through GitHub PRs, and in some cases contributions by maintainers may not be categorized as such.) Figure 4 shows the number of PRs merged into the NumPy master branch. Although the number of PRs being merged fluctuates, the plot indicates an increased number of contributions over the past years.

Figure 4: Number of pull requests merged into the NumPy master branch for each quarter since 2012. The total number of PRs is indicated with the lower blue area showing the portion contributed by current or previous maintainers.

Community calls

The massive number of scientific Python packages built on NumPy means that it has an unusually high need for stability. To guide our development, we therefore formalized the feature proposal process and constructed a development roadmap with extensive input and feedback from the community.

Weekly community calls alternate between triage and higher level discussion. The calls not only involve developers from the community, but provide a venue for vendors and other external groups to provide input. For example, after Intel produced a forked version of NumPy, one of their developers joined a call to discuss community concerns.

NumPy enhancement proposals

Given the complexity of the codebase and the massive number of projects depending on it, large changes require careful planning and substantial work. NumPy Enhancement Proposals (NEPs) are modeled after Python Enhancement Proposals (PEPs) for “proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python” (https://numpy.org/neps/nep-0000.html). To date there have been 19 proposed NEPs—6 have been implemented, 4 have been accepted and are being implemented, 4 are under consideration, 3 have been deferred or superseded, and 2 have been rejected or withdrawn.

Central role

NumPy plays a central role in building and standardizing much of the scientific Python community infrastructure. NumPy’s docstring standard is now widely adopted. We are also now using the NEP system as a way to help coordinate the larger scientific Python community. For example, in NEP 29, we recommend, along with leaders from various other projects, that all projects across the Scientific Python ecosystem adopt a common “time window-based” policy for support of Python and NumPy versions. This standard will simplify downstream project and release planning.

Wheels build system

A Python wheel [58] is a standard file format for distributing Python libraries. In addition to Python code, a wheel may include compiled C extensions and other binary data. This is important, because many libraries, including NumPy, require a C compiler and other build tools to build the software from the source code, making it difficult for many users to install the software on their own. The introduction of wheels to the Python packaging system has made it much easier for users to install precompiled libraries.

A GitHub repository containing scripts to build NumPy wheels has been configured so that a simple commit to the repository triggers an automated build system that creates NumPy wheels for several computer platforms, including Windows, Mac OS X, and Linux. The wheels are uploaded to a public server and made available for anyone to use. This system makes it easy for users to install precompiled versions of NumPy on these platforms.

The technology used to build the wheels evolves continually. At the time of writing, a key component is the multibuild suite of tools developed by Matthew Brett and other developers [59]. Currently, scripts using multibuild are written for the continuous integration platforms Travis-CI (for Linux and Mac OS X) and Appveyor (for Windows).

Recent technical improvements

With the recent infusion of funding and a clear process for coordinating with the developer community, we have been able to tackle a number of important large scale changes. We highlight two of those below, as well as changes made to our testing infrastructure to support hardware platforms used in large scale computing.

Array function protocol

A vast number of projects are built on NumPy; these projects are consumers of the NumPy API. Over the last several years, a growing number of projects are providers of a NumPy-like API and array objects targeting audiences with specialized needs beyond NumPy’s capabilities. For example, the NumPy API is implemented by several popular tensor computation libraries including CuPy (https://cupy.chainer.org/), JAX (https://jax.readthedocs.io/en/latest/jax.numpy.html), and Apache MXNet (https://numpy.mxnet.io/). PyTorch (https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html) and Tensorflow (https://www.tensorflow.org/tutorials/customization/basics) provide tensor APIs with NumPy-inspired semantics. It is also implemented in packages that support sparse arrays such as scipy.sparse and PyData/Sparse. Another notable example is Dask, a library for parallel computing in Python. Dask adopts the NumPy API and therefore presents a familiar interface to existing NumPy users, while adding powerful abilities to parallelize and distribute tasks.

The multitude of specialized projects creates the difficulty that consumers of these NumPy-like APIs write code specific to a single project and do not support all of the above array providers. This is a burden for users relying on a specialized array-like, since a tool they need may not work for them. It also creates challenges for end-users who need to transition from NumPy to a more specialized array. The growing multitude of specialized projects with NumPy-like APIs threatened to again fracture the scientific Python community.

To address these issues, NumPy has the goal of providing the fundamental API for interoperability between the various NumPy-like APIs. An earlier step in this direction was the implementation of the __array_ufunc__ protocol in NumPy 1.13, which enabled interoperability for most mathematical functions [60]. In 2019 this was expanded more generally with the inclusion of the __array_function__ protocol into NumPy 1.17. These two protocols allow providers of array objects to be interoperable with the NumPy API: their arrays work correctly with almost all NumPy functions [61]. For the users relying on specialized array projects it means that even though much code is written specifically for NumPy arrays and uses the NumPy API via import numpy as np, it can nevertheless work for them. For example, here is how a CuPy GPU array can be passed through NumPy for processing, with all operations being dispatched back to CuPy:

import numpy as np
import cupy as cp
x_gpu = cp.array([1, 2, 3])
y = np.sum(x_gpu)  # Returns a GPU array

Similarly, user defined functions composed using NumPy can now be applied to, e.g., multi-node distributed Dask arrays:

import numpy as np
import dask.array as da
def f(x):
    """FunctionusingNumPyAPIcalls"""
    y = np.tensordot(x, x.T)
    return np.mean(np.log(y + 1))
x_local = np.random.random([10000, 10000])  # random local array
x_distr = da.random.random([10000, 10000])  # random distributed array
f(x_local)  # returns a NumPy array
f(x_distr)  # works, returns a Dask array

Random number generation

The NumPy random module provides pseudorandom numbers from a wide range of distributions. In legacy versions of NumPy, simulated random values are produced by a RandomState object that: handles seeding and state initialization; wraps the core pseudorandom number generator based on a Mersenne Twister implementation (to be precise, the standard 32-bit version of MT19937); interfaces with the underlying code that transforms random bits into variates from other distributions; and supplies a singleton instance exposed in the root of the random module.

The RandomState object makes a compatibility guarantee so that a fixed seed and sequence of function calls produce the same set of values. This guarantee has slowed progress since improving the underlying code requires extending the API with additional keyword arguments. This guarantee continues to apply to RandomState.
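
A minimal sketch of this guarantee: re-seeding a RandomState with the same value reproduces the same stream of values.

import numpy as np

first = np.random.RandomState(12345).random_sample(3)
second = np.random.RandomState(12345).random_sample(3)
assert (first == second).all()  # the fixed seed reproduces the same values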

NumPy 1.17 introduced a new API for generating random numbers that use a more flexible structure that can be extended by libraries or end-users. The new API is built using components that separate the steps required to generate random variates. Pseudorandom bits are generated by a bit generator. These bits are then transformed into variates from complex distributions by a generator. Finally, seeding is handled by an object that produces sequences of high-quality initial values.

Bit generators are simple classes that manage the state of an underlying pseudorandom number generator. NumPy ships with four bit generators. The default bit generator is a 64-bit implementation of the Permuted Congruential Generator [62] (PCG64). The three other bit generators are a 64-bit version of the Philox generator [63] (Philox), Chris Doty-Humphrey’s Small Fast Chaotic generator [64] (SFC64), and the 32-bit Mersenne Twister [65] (MT19937), which has been used in older versions of NumPy. (The randomgen project supplies a wide range of alternative bit generators, such as cryptographic counter-based generators (AESCtr) and generators that expose hardware random number generators (RDRAND) [66].) Bit generators provide functions, exposed both in Python and C, for generating random integer and floating-point numbers.

The Generator consumes one of the bit generators and produces variates from complicated distributions. Many improved methods for generating random variates from common distributions were implemented, including the Ziggurat method for normal, exponential, and gamma variates [67], and Lemire’s method for bounded random integer generation [68]. The Generator is more similar to the legacy RandomState, and its API is substantially the same. The key differences all relate to state management, which has been delegated to the bit generator. The Generator does not make the same stream guarantee as the RandomState object, and so variates may differ across versions as improved generation algorithms are introduced. (Despite the removal of the compatibility guarantee, simple reproducibility across versions is encouraged, and minor changes that do not produce meaningful performance gains or fix underlying bugs are not generally adopted.)

Finally, a SeedSequence is used to initialize a bit generator. The seed sequence can be initialized with no arguments, in which case it reads entropy from a system-dependent provider, or with a user-provided seed. The seed sequence then transforms the initial set of entropy into a sequence of high-quality pseudorandom integers, which can be used to initialize multiple bit generators deterministically. The key feature of a seed sequence is that it can be used to spawn child SeedSequences to initialize multiple distinct bit generators. This capability allows a seed sequence to facilitate large distributed applications where the number of workers required is not known. The sequences generated from the same initial entropy and spawns are fully deterministic to ensure reproducibility.

The three components are combined to construct a complete random number generator.

from numpy.random import (
    Generator,
    PCG64,
    SeedSequence,
)
seq = SeedSequence(1030424547444117993331016959)
pcg = PCG64(seq)
gen = Generator(pcg)

This approach retains access to the seed sequence which can then be used to spawn additional generators.

children = seq.spawn(2)
gen_0 = Generator(PCG64(children[0]))
gen_1 = Generator(PCG64(children[1]))

While this approach retains complete flexibility, the method np.random.default_rng can be used to instantiate a Generator when reproducibility is not needed.
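
For instance (a sketch; to the best of our knowledge this one-liner is equivalent to the explicit Generator/PCG64/SeedSequence construction shown above when given the same seed):

import numpy as np

rng = np.random.default_rng(1030424547444117993331016959)
rng.standard_normal(3)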

The final goal of the new API is to improve extensibility. RandomState is a monolithic object that obscures all of the underlying state and functions. The component architecture is one part of the extensibility improvements. The underlying functions (written in C) which transform the output of a bit generator to other distributions are available for use through CFFI. This allows the same code to be run in both NumPy and dependent projects that can consume CFFI, e.g., Numba. Both the bit generators and the low-level functions can also be used in C or Cython code. (As of 1.18.0, this scenario requires access to the NumPy source; alternative approaches that avoid this extra step are being explored.)

Testing on multiple architectures

At the time of writing the two fastest supercomputers in the world, Summit and Sierra, both have IBM POWER9 architectures [69]. In late 2018, Astra, the first ARM-based supercomputer to enter the TOP500 list, went into production [70]. Furthermore, over 100 billion ARM processors have been produced as of 2017 [71], making it the most widely used instruction set architecture in the world.

Clearly there are motivations for a large scientific computing software library to support POWER and ARM architectures. We have extended our continuous integration (CI) testing to include ppc64le (POWER8 on Travis CI) and ARMv8 (on the Shippable service). We also test with the s390x architecture (IBM Z CPUs on Travis CI) so that we can probe the behavior of our library on a big-endian machine. This satisfies one of the major components of improved CI testing laid out in a version of our roadmap [72]—specifically, “CI for more exotic platforms.”

PEP 599 [73] lays out a plan for new Python binary wheel distribution support, manylinux2014, that adds support for a number of architectures supported by the CentOS Alternative Architecture Special Interest Group, including ARMv8, ppc64le, as well as s390x. We are thus well-positioned for a future where provision of binaries on these architectures will be expected for a library at the base of the ecosystem.

Acknowledgments

We thank Ross Barnowski, Paul Dubois, Michael Eickenberg, and Perry Greenfield, who suggested text and provided helpful feedback on the manuscript.

We also thank the many members of the community who provided feedback, submitted bug reports, made improvements to the documentation, code, or website, promoted NumPy’s use in their scientific fields, and built the vast ecosystem of tools and libraries around NumPy. We also gratefully acknowledge the Numeric and Numarray developers on whose work we built.

Jim Hugunin wrote Numeric in 1995, while a graduate student at MIT. Hugunin based his package on previous work by Jim Fulton, then working at the US Geological Survey, with input from many others. After he graduated, Paul Dubois at the Lawrence Livermore National Laboratory became the maintainer. Many people contributed to the project including T.E.O. (a co-author of this paper), David Ascher, Tim Peters, and Konrad Hinsen.

In 1998 the Space Telescope Science Institute started using Python and in 2000 began developing a new array package called Numarray, written almost entirely by Jay Todd Miller, starting from a prototype developed by Perry Greenfield. Other contributors included Richard L. White, J. C. Hsu, Jochen Krupper, and Phil Hodge. The Numeric/Numarray split divided the community, yet ultimately pushed progress much further and faster than would otherwise have been possible.

Shortly after Numarray development started, T.E.O. took over maintenance of Numeric. In 2005, he led the effort and did most of the work to unify Numeric and Numarray, and produce the first version of NumPy.

Eric Jones co-founded (along with T.E.O. and P.P.) the SciPy community, gave early feedback on array implementations, and provided funding and travel support to several community members. Numerous people contributed to the creation and growth of the larger SciPy ecosystem, which gives NumPy much of its value. Others injected new energy and ideas by creating experimental array packages.

K.J.M. and S.J.v.d.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF3834 and by the Alfred P. Sloan Foundation through Grant 2013-10-27 to the University of California, Berkeley. S.J.v.d.W., S.B., M.P., and W.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF5447 and by the Alfred P. Sloan Foundation through Grant G-2017-9960 to the University of California, Berkeley.

Author Contributions Statement

K.J.M. and S.J.v.d.W. composed the manuscript with input from others. S.B., R.G., K.S., W.W., M.B., and T.J.R. contributed text. All authors have contributed significant code, documentation, and/or expertise to the NumPy project. All authors reviewed the manuscript.

Competing Interests

The authors declare no competing interests.