## Abstract

This paper presents a new method for deriving the energy yield generated by photovoltaic solar energy systems in the Netherlands, on a daily and regional basis, in 2016 and 2017. We identify two new data sources to construct our methodology: pvoutput, an online portal with real-time solar energy yield information, and high-resolution irradiance data from the Royal Netherlands Meteorological Institute. Combining these sources allows us to link irradiance and energy on a daily basis. We apply this information to our PV systems database, allowing us to derive daily and annual solar energy yields. We examine the variation in our daily and annual estimates that results from taking different subsets of pvoutput systems with certain specifications, such as orientation, tilt and inverter to PV capacity ratio. We obtain specific annual energy yields of 877–946 kWh/kWp and 838–899 kWh/kWp for 2016 and 2017 respectively, implying that the current method used at Statistics Netherlands underestimated the 2016 annual yield and overestimated the 2017 annual yield. Finally, we translate our national estimates into municipality solar energy yields. This research demonstrates that an irradiance-based measure of solar energy generation is necessary to produce more accurate energy yields on both a national and a regional level.

## 1 Introduction

One of the largest challenges facing our modern-day societies is climate change. Around the world, nations are devising policies to transition away from polluting fossil fuels towards clean and renewable energy sources. The 2015 Paris climate agreement, part of the United Nations Framework Convention on Climate Change, has been the biggest driver for these systemic changes: a commitment limiting climate heating to well below 2 °C, and preferably 1.5 °C, relative to pre-industrial levels, was agreed upon (UN2019). On a European level, the European Commission has set various energy and climate targets for its member states. One such objective was formulated in the EU renewable energy directive: by the end of 2020, renewable sources must make up 20% of the EU energy mix (EU2014). These targets differ per member state, depending on each state's renewable potential: for the Netherlands this target was set at 14% (EU2017) and will likely not be met. In 2018, the share stood at 7.2% (euobserver). By 2030 the renewable share must increase to 32%, irrespective of the member state (EU2018).

As an increasing number of policy measures start to take effect, attention to detailed and accurate measurements of energy yields from renewable sources will grow. One of these renewables is solar energy, the subject of this paper. Statistics Netherlands (SN) is the body responsible for publishing annual solar energy yields from photovoltaic (PV) systems in the Netherlands (SNstatline). Currently, the annual yield is calculated by translating the installed PV capacity into a yield using a scaling constant, otherwise known as the specific annual energy yield (Sark2014). This approach was created to measure the PV yields from small households, as these are not directly registered. The yields from larger systems such as solar parks are registered on a monthly basis due to subsidy schemes that were put in place (CertiQ).

The adoption of the specific annual yield factor as a means of calculating the solar energy yield was commissioned by the Netherlands Enterprise Agency and agreed upon by a number of Dutch organisations (Sark2014). This factor was subsequently adopted by the Protocol Monitoring Hernieuwbare Energie, or the protocol for monitoring renewable energy (RVO2019). The current figure is set at 875 kWh/kWp. Equation 1 shows how the annual solar energy yield is calculated, taking into account the variation of the installed capacity throughout the year:

$$E = 875 \times \frac{1}{2}\left(\sum_{i=1}^{N_{\mathrm{first}}} P_i + \sum_{i=1}^{N_{\mathrm{last}}} P_i\right) \tag{1}$$

where $E$ is the total solar energy generated in a year, $P_i$ the power capacity of one PV system in the database, and $N_{\mathrm{first}}$ and $N_{\mathrm{last}}$ the number of PV systems in the database on the first and last days of the year respectively.
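As a minimal sketch of the calculation described by Equation 1, the snippet below multiplies the specific annual yield by the installed capacity averaged over the first and last day of the year. The averaging form, the function name and the example capacities are assumptions made for illustration; they are not taken from the SN production code.

```python
def annual_yield_kwh(capacities_first_day_kwp, capacities_last_day_kwp,
                     specific_yield_kwh_per_kwp=875.0):
    """Annual yield from the mean of the installed capacity on the first
    and last day of the year (assumed form of Equation 1)."""
    mean_capacity_kwp = 0.5 * (sum(capacities_first_day_kwp)
                               + sum(capacities_last_day_kwp))
    return specific_yield_kwh_per_kwp * mean_capacity_kwp


# Hypothetical database: three systems on 1 January, five on 31 December.
first_day = [3.0, 4.5, 2.5]            # kWp, 10 kWp in total
last_day = [3.0, 4.5, 2.5, 6.0, 4.0]   # kWp, 20 kWp in total
print(annual_yield_kwh(first_day, last_day))  # mean capacity of 15 kWp
```

Because the factor is fixed, the result responds only to changes in installed capacity, not to how sunny a given year was.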

The specific annual solar energy yield was determined using two datasets of PV systems in 2012 and 2013 (Sark2014), yielding one value for each of the two years. We would like to highlight a few constraints of this method, mainly relating to representativity. The true distribution of PV system parameters (such as orientation and tilt) was not known at the time and was not taken into account; the samples were therefore assumed to be representative. Similarly, the geographical distribution of the PV systems in the sample was assumed to be representative of the population as a whole. It is known that western parts of the country enjoy up to 10% more sunshine, on an annual basis, than eastern parts (litjens2017). Western areas of the country are the most densely populated, implying a high density of household PV systems in these areas. At the same time, eastern and northern parts are far more sparsely populated, allowing for large-scale systems such as solar parks. Finally, the specific annual solar energy yield is an indirect measure of the irradiance, which we know to vary per year: the total incident solar irradiance was 989 and 1003 kWh/m² in 2012 and 2013 respectively, two typical years when compared to the thirty-year average of 986 kWh/m². The past four years (2016–2019) have been substantially higher, with measured irradiance values of 1040, 1020, 1137 and 1099 kWh/m² (knmi2016_jow; knmi2017_jow; knmi2018_jow; knmi2019_jow). The current factor, however, is applied independently of the year.

Our European neighbours have different ways of measuring the yields originating from PV systems. In the UK, the energy yield is estimated through a combination of various sources: a survey and various subsidy schemes for large producers. For small PV systems (up to 5 MW), the capacity is known and average solar load factors are applied to these capacities. These factors are revised annually with generation data from the network operators. In Belgium, the specific annual solar energy yield is determined from PV systems that were awarded green certificates, a subsidy scheme set up to stimulate renewable energy sources. These are usually larger PV systems. This factor is applied to a database containing the installed capacity of households for which no measured data are available. Finally, in Germany, a monthly survey is conducted, asking the network operators how much solar energy is fed into the grid.

The Center for Big Data Statistics at SN examines new methods for official statistics based on new data sources. In this paper, we report on a new, alternative method to estimate the daily and regional solar energy yield for the Netherlands, and hence a more accurate annual yield. The emphasis of our new method lies in using readily available measured energy data as well as modelled irradiance data, allowing us to produce an irradiance-based yield estimate. Our approach is fundamentally different in that we do not aim to provide one figure for the daily and annual yields, but rather explore the uncertainty margins of these figures when different subpopulations of PV systems are selected. The parameters we have chosen to focus on are orientation, tilt and inverter to PV capacity ratio. In Figure 1 we provide the reader with a schematic overview of the many different factors which influence the yield. In section 2 we describe the new data sources we used to construct our methodology. These are pvoutput, an online platform with real-time data of PV energy generation, and KNMI (Royal Netherlands Meteorological Institute) irradiance data at the level of 3 × 5 km grid cells. These sources are cleaned, as briefly described in section 3. We outline in detail our method for computing daily and annual solar energy yields in section 4 and then describe how we use these to produce regional estimates in section 5. A summary and conclusions are provided in section 6.

## 2 Data

### 2.1 pvoutput

Pvoutput is an Australian online portal with near real-time information on photovoltaic energy generation at various locations throughout the world. After Australia, the Netherlands has the largest share (20.57%) of installed capacity registered on the website. As of October 2019, this capacity stands just shy of 50 MW (pvoutput_latest). The number of Dutch PV systems in 2016 and 2017 was of the order of 5600. Other countries with substantial registered capacity are the United States, Italy, Germany, the United Kingdom and Belgium. Owners register on the website, inputting a range of different specifics relating to their PV systems, such as the orientation and tilt of the system, but also other variables such as the inverter brand or the system size. For more details regarding this metadata, please see (pvoutput_overview).

### 2.2 Dutch meteorological weather data

The Koninklijk Nederlands Meteorologisch Instituut or Royal Netherlands Meteorological Institute (hereafter KNMI) is the body responsible for measuring variables related to the weather in the Netherlands (knmi_about). Besides the 30 different weather stations, which measure different variables such as wind speed, rainfall, temperature and irradiance, the institute has also developed a physics-based empirical model to calculate many of these same variables (deneke2008; greuell2013; knmiviewer). This model uses input data from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) instrument (seviri) on board the Meteosat Second Generation satellites, located in a geostationary orbit at approximately 36,000 km. These are operated by EUMETSAT (eumetsat). SEVIRI observes properties of the atmosphere every 15 minutes and has a resolution of 3 × 3 km at nadir (knmisatellite). Due to projection effects, this results in a resolution of 3 × 5 km for the Netherlands. It is important to note that the irradiance data are only available at times when the Sun is above a minimum elevation angle above the horizon (greuell2013). The modelling does not work well outside this regime due to 3D cloud effects.

### 2.3 Solar Panel Database

Statistics Netherlands constructs its own database of PV systems based on different data sources. The biggest and most important of these is the Product Installation Register or PIR (SNsolar), provided by the network operators. It is thought that this source is not fully complete in terms of capacity, and the completeness figure is itself an estimate. We therefore supplement these data with tax data from the Dutch tax authority. People are incentivised to purchase PV systems by the possibility of filing a VAT return for the cost and installation of the panels. Although filing is not obligatory, it is assumed that many people do so. There is also an increasing number of large-scale PV systems in the Netherlands. SN has monthly administrative data on the capacity and yield of such solar farms from the government-led certification process for renewable energy as executed by CertiQ (CertiQ), a 100% subsidiary of TenneT, the electricity transmission system operator for the Netherlands (TenneT). Finally, a project is under way at SN to use machine learning techniques to identify missing solar panels. For more information on this topic, we refer the reader to section 3.2.1 in debroe2019.

## 3 Data Cleaning

### 3.1 pvoutput

We identified two data cleaning challenges for pvoutput. The first was uniformising and correcting the metadata such that apparently similar quantities, inputted by different owners, mean the same thing. The second involved analysing the real-time pvoutput data to determine how realistic the data is, also in relation to the metadata. The choices we make here have implications for the results we derive in sections 4 and 5. We will re–visit these influences in those sections.

#### 3.1.1 Uniformising and correcting the metadata

While most fields are reasonably consistent (e.g. there are only eight cardinal directions to choose from for the orientation), some other variables are less clear in their reliability. Such examples are the panel power, number of panels and the system size. If inputted correctly, the sum of the power of each panel ought to equal the total system size. We discovered a discrepancy between these two figures in around 10% of all PV systems. To resolve this, we chose the total system size to be the more reliable variable, on the assumption that pvoutput owners were more likely to get this correct than the number of panels and panel power.

Two different types of geographical co-ordinates are available for the pvoutput locations: postal codes and latitudes/longitudes (lat/lon hereafter). In the Netherlands postal codes consist of four digits followed by two letters. Pvoutput postal codes have a precision of four digits (PC4), narrowing the location down to a spatial area of 1 to 8 km² with a varying number of households, depending on the location. The lat/lon co-ordinates allow for pinpointing an exact location. We verified the lat/lon data by calculating the distances between them and the centroids of the PC4 areas for all the systems in the pvoutput data that have a valid PC4. For about 5% of these systems the distance is more than 5 km. Manual inspection of the PC4 to lat/lon differences for some municipalities shows an interesting pattern: about half of the households took the effort to specify a lat/lon location near their dwelling, while the other half seem to have chosen a generic lat/lon based on the municipality. These findings led us to retain the postal code as the location, at the expense of losing, in some cases, more accurate information from the lat/lon. Our choice is motivated by the assumption that people, in general, know what their postal code is. This level of precision is also sufficient for the goal of this research. Figure 2 displays the distances between the lat/lon and PC4 centroids for the reliable set of PV systems in 2016 (see section 3.1.2 about reliability).
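The distance check between an owner-supplied lat/lon and the PC4 centroid can be sketched as follows. The use of the haversine (great-circle) formula, the function names and the example co-ordinates are assumptions made for illustration; the text does not state which distance measure was used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def latlon_matches_pc4(system_latlon, pc4_centroid_latlon, max_km=5.0):
    """True if the owner-supplied lat/lon lies within max_km of the centroid."""
    return haversine_km(*system_latlon, *pc4_centroid_latlon) <= max_km


# Hypothetical system whose lat/lon points to a different city than its PC4.
print(latlon_matches_pc4((52.37, 4.90), (52.09, 5.12)))  # False
```

A fixed 5 km threshold mirrors the figure quoted in the text, but larger PC4 areas might justify a looser threshold.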

#### 3.1.2 Constructing a reliable set of measurements per day

Having uniformised and corrected the metadata, we now turn to the real–time data itself. For an example of a solar energy profile at a pvoutput location, we point the reader to the lower panels of the two subfigures in Figure 6. We devise and impose four different quality checks on the data such that we can construct a reliable or cleaned set. These checks are performed per day and per system. If the system in question passes every single check, the measurement is retained, otherwise it is discarded. The first two checks relate to the reliability of the cumulative energy measurement at the end of the day, whereas the last two investigate the reliability of a given day based on the number of measurements and the time intervals between measurements.

Two different energy measurements are provided by pvoutput: an instantaneous and a cumulative one. It therefore follows that the last measurement of the cumulative energy for a given day ought to equal the cumulative energy calculated from all the instantaneous energy measurements. When performing this first check, we allow for some leeway on either side: a daily cumulative energy measurement is deemed reliable if it lies in the range of 90–110% of the value obtained from the instantaneous measurements. Secondly, we identify the peak energy per day of a system and determine whether this is a plausible value when compared to the system size quoted in the metadata. Again allowing for some leeway, we impose that a day is reliable if the maximum energy is smaller than 1.2 times the system size. It should be noted that this second restriction could be too harsh on some days, when cloud-induced superirradiance is capable of temporarily producing a peak energy which is higher than the system size (zhang2018). We decide not to take this into account as it is difficult for us to ascertain when this local effect may be occurring.

The third check examines the time intervals between each measurement on a given day. Some PV systems suffer from measurement gaps, e.g. a system may measure the generated solar yield every five minutes at the beginning of the day, suddenly have a two-hour gap with no measurements, and then once more record the solar yield at five-minute intervals. This check, in turn, influences quality check number one. We remove daily observations of PV systems that have gaps larger than 15 minutes. The fourth check is closely related to the third one, but is more general: we check whether there is a measurement for each day of the year for a given PV system. It should be noted that the decision to remove PV systems for specific days resulting from quality checks three and four is not straightforward. We cannot ascertain whether the gaps or missing days are the result of malfunctioning software or whether, for example, the PV system is really not producing any energy or has been turned off. Figure 3 shows the number of PV systems per day for 2016 after data cleaning is performed.
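The four checks can be combined into a single per-day, per-system filter, sketched below. Several details are assumptions made for illustration: the instantaneous measurement is treated as power in watts, it is integrated with the trapezoidal rule, and the tuple layout and function name are hypothetical rather than the pvoutput data format.

```python
def passes_daily_checks(samples, system_size_w,
                        max_gap_min=15, tolerance=0.10, peak_factor=1.2):
    """samples: time-sorted list of (minute_of_day, power_w, cumulative_wh)."""
    # Check 4: the day must contain measurements (two or more to integrate).
    if len(samples) < 2:
        return False
    pairs = list(zip(samples, samples[1:]))
    # Check 3: no gap between consecutive measurements above max_gap_min.
    if any(t1 - t0 > max_gap_min for (t0, _, _), (t1, _, _) in pairs):
        return False
    # Check 2: the daily peak must stay below peak_factor times system size.
    if max(power for _, power, _ in samples) >= peak_factor * system_size_w:
        return False
    # Check 1: the reported cumulative energy must lie within the tolerance
    # of the energy integrated from the instantaneous measurements.
    integrated_wh = sum(0.5 * (p0 + p1) * (t1 - t0) / 60.0
                        for (t0, p0, _), (t1, p1, _) in pairs)
    if integrated_wh <= 0:
        return False
    reported_wh = samples[-1][2]
    return ((1 - tolerance) * integrated_wh
            <= reported_wh
            <= (1 + tolerance) * integrated_wh)


# Hypothetical day: 1200 W measured every five minutes for one hour.
day = [(600 + 5 * i, 1200.0, 100.0 * i) for i in range(13)]
print(passes_daily_checks(day, system_size_w=2000.0))  # True
```

The checks are ordered so that the gap test runs before the integration, since a large gap would make the integrated energy unreliable in the first place.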

### 3.2 Dutch meteorological weather data

We do not perform data cleaning on the KNMI irradiance data, since these data are very reliable. We do, however, perform a number of processing steps to get the irradiance data into the most instructive form for our research purposes. We download the quarter-hourly total (diffuse and direct) irradiance data. Per grid cell, we sum all the quarter-hourly irradiances for a given day. We do this by converting the irradiance from power per unit area to energy per unit area. We notice that occasionally a quarter of an hour of data is missing. We decide not to impute additional data; instead, we account for the longer time difference in our conversion from power to energy. The number of days with some missing data is 32 for 2016. We decided not to compare the grid data with weather station irradiance data because this has already been done in (greuell2013). Small, negligible discrepancies arise between the measured and modelled values. One of the reasons for this is that the weather station measurements may not always be entirely accurate due to the apparatus not being continually maintained. Figure 4 shows the total irradiance for two consecutive days in June 2016, illustrating the differences in irradiance for two days with different weather conditions.
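The daily summation with missing quarter-hours can be sketched as below. The units (W/m² in, Wh/m² out), the rule that each value covers the interval since the previous measurement, and the function name are assumptions made for illustration; the text only states that the longer time difference is accounted for.

```python
def daily_irradiation_wh_m2(samples, nominal_step_min=15):
    """samples: time-sorted list of (minute_of_day, irradiance_w_m2).

    Each value is weighted by the time elapsed since the previous
    measurement, so a missing quarter-hour widens the next interval
    instead of being imputed.
    """
    total_wh_m2 = 0.0
    previous_minute = None
    for minute, irradiance in samples:
        if previous_minute is None:
            step_min = nominal_step_min  # first sample: nominal interval
        else:
            step_min = minute - previous_minute
        total_wh_m2 += irradiance * step_min / 60.0
        previous_minute = minute
    return total_wh_m2


# Hypothetical hour at a constant 400 W/m2, with and without a missing sample.
complete = [(600, 400.0), (615, 400.0), (630, 400.0), (645, 400.0)]
with_gap = [(600, 400.0), (615, 400.0), (645, 400.0)]
print(daily_irradiation_wh_m2(complete), daily_irradiation_wh_m2(with_gap))
```

Under constant irradiance both series give the same total, which is the behaviour one would want from a gap-aware conversion.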

## 4 Daily solar energy production in the Netherlands

In this section, we outline our new method for determining solar energy yield in the Netherlands. We do this in very broad terms in section 4.1, followed by sections 4.2–4.4, which expand the method into greater detail, paying particular attention to the question of representativity. Section 4.5 concludes with a presentation of our results, followed by an in–depth discussion and comparison with other literature sources.

### 4.1 Roadmap: towards a daily solar energy yield

Per reliable pvoutput system we determine the daily irradiance and daily cumulative energy. This allows us to visualise all pvoutput locations on a 2D plane, relating irradiance and energy. One could view this as a 2D probability density function describing the PV systems database in the Netherlands. By drawing PV systems from the database and determining the observed daily irradiance at those locations, we can assign the most likely yield given the 2D probability density function. This, in turn, allows us to calculate the total solar yield for the Netherlands on a daily, and hence annual basis. The act of randomly assigning energy values to entries in the PV database can be repeated multiple times, such that we obtain the most likely daily (and annual) energy yield with an associated uncertainty. Since our probability density function is reliant on the subset of systems in pvoutput and we do not know if this set is representative, we can repeat this process many times by assuming different scenarios with regards to the distribution of PV systems in the orientation–tilt–inverter/PV–parameter space. For example, what would the impact on the daily and annual figure be, were we to only select systems facing South? By exploring the extremes, we can pinpoint the total yield within a certain regime.

### 4.2 Combining data sources

We assign each pvoutput system to the nearest irradiance grid cell by comparing the lat/lon of the PC4 centroid with those of the grid cells. This allows us to compare total daily irradiance and daily (cumulative) energy per pvoutput system. Finally, we normalise the energy yield by the system size to obtain the specific daily yield, allowing us to compare all systems using the same metric. The pvoutput systems can then be binned in 2D irradiance–energy parameter space. Figure 5 shows the cumulative irradiance versus the specific daily yield for all pvoutput systems on two different days in 2016 (13 June and 13 September). The former shows a day which was quite variable, with a wide margin in total irradiance and yield. The latter shows an exceptionally clear day over the whole country; however, this produced a wide spread in output, which must depend on various PV system parameters such as the orientation and tilt. Essentially, one could interpret these plots as a probability density function. In this figure, the bins have a fixed size in both irradiance and specific daily yield.
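The coupling and binning steps can be sketched as follows. The flat-Earth distance approximation, the cell identifiers, the bin sizes and all names are hypothetical choices made for illustration; the bin sizes actually used in Figure 5 are not reproduced here.

```python
import math
from collections import Counter


def nearest_cell(lat, lon, cell_centres):
    """Return the id of the grid cell whose centre is closest to (lat, lon).

    cell_centres: dict mapping cell id -> (lat, lon) in degrees. Longitude
    differences are scaled by cos(lat), a local flat-Earth approximation.
    """
    def squared_distance(cell_id):
        cell_lat, cell_lon = cell_centres[cell_id]
        dx = (lon - cell_lon) * math.cos(math.radians(lat))
        dy = lat - cell_lat
        return dx * dx + dy * dy
    return min(cell_centres, key=squared_distance)


def histogram_2d(points, irradiance_bin, yield_bin):
    """Count systems per (irradiance bin, specific daily yield bin)."""
    return Counter((math.floor(irr / irradiance_bin),
                    math.floor(spec_yield / yield_bin))
                   for irr, spec_yield in points)


cells = {"a": (52.0, 5.0), "b": (52.0, 5.1)}
print(nearest_cell(52.01, 5.08, cells))  # b

# Hypothetical (daily irradiance, specific daily yield) pairs, bins of 0.5.
points = [(5.2, 3.1), (5.4, 3.3), (6.1, 4.0)]
print(histogram_2d(points, 0.5, 0.5))
```

The resulting counter is the raw, unnormalised version of the 2D distribution: two systems share one bin and the third falls in another.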

In the aforementioned approach we make some assumptions. The first is that the assignment to the nearest grid cell is correct. This need not be the case, since PC4 areas are a proxy for population density and range from 1 to 8 km² in size (nemo). The grid cells, as mentioned earlier, are 3 × 5 km in size. It is conceivable that for the larger PC4 areas an adjacent cell would have been preferable when coupling PC4 areas and grid cells. A second assumption is that weather behaves according to the resolution of a grid cell. Local variations within a cell could lead to different specific yields at different pvoutput locations within the same cell. Figure 6 demonstrates this point: at the first location (upper subfigure) the irradiance data closely follow the pvoutput pattern. At the second location (lower subfigure) this is broadly the case, but it is also obvious that more local effects can be seen in the pvoutput data which are not captured by the irradiance data. A final noteworthy comment is that the energy output reflects a day in its entirety, whereas the irradiance does not. In section 2.2, we explained that the KNMI data are only available when the Sun is above a minimum elevation angle above the horizon. This is not an issue, since we only indirectly use the irradiance to obtain the solar energy. This will become clearer as the section progresses.

### 4.3 Sampling and bootstrapping

Before applying the 2D probability density function to our PV systems database, we perform two more actions. We construct daily PV system databases, using the recorded installation dates of the systems, such that we have the correctly recorded amount of installed capacity for any given day and region. Then, we determine the daily irradiances at the database locations, in the same way as described in section 4.2. We still need to normalise our 2D probability density function: per irradiance bin, each energy yield bin is normalised with respect to the total number of systems observed in the irradiance bin. Hence, we assign probabilities of an energy yield being observed given an irradiance bin.
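The normalisation per irradiance bin amounts to turning counts into conditional probabilities. The following is a minimal sketch; the input layout (a mapping from bin pairs to counts) and the function name are hypothetical.

```python
from collections import defaultdict


def conditional_probabilities(counts):
    """Turn counts per (irradiance_bin, yield_bin) into P(yield | irradiance).

    Returns a dict: irradiance_bin -> {yield_bin: probability}, where the
    probabilities within each irradiance bin sum to one.
    """
    totals = defaultdict(int)
    for (irradiance_bin, _), n in counts.items():
        totals[irradiance_bin] += n
    probabilities = defaultdict(dict)
    for (irradiance_bin, yield_bin), n in counts.items():
        probabilities[irradiance_bin][yield_bin] = n / totals[irradiance_bin]
    return dict(probabilities)


# Hypothetical counts: four systems in irradiance bin 10, two in bin 12.
counts = {(10, 6): 3, (10, 7): 1, (12, 8): 2}
print(conditional_probabilities(counts))
```

Normalising within each irradiance bin, rather than over the whole plane, is what allows a yield to be drawn for a database system once its irradiance bin is known.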

We can now bin our daily PV system databases in the irradiance dimension and assign yields by drawing PV systems and randomly scattering them in the energy yield dimension, respecting the proportions, as recorded in the 2D probability density function. The act of randomly drawing systems and assigning energies can be performed many times, i.e. bootstrapping (EfroTibs93). This allows us to obtain the uncertainty margins on our calculation. Figure 7 shows the most likely energy yields for four different days (Spring and Autumn equinoxes, Summer and Winter solstices) and the associated uncertainty margins, obtained through bootstrapping 500 different times. These histograms are computed for scenario 1, one of our 15 different scenarios, for which we calculate daily and annual yields. This will become clearer in section 4.5.

It should be noted that the efficiency of this method depends on the size of our irradiance bins. The higher the resolution in irradiance becomes, the more likely it is that an irradiance bin is not observed in pvoutput but is present in the database. We choose the bin size such that the resolution is high enough to be meaningful, but not so high that many PV system database irradiances fall outside the pvoutput irradiance bins. For the database entries which do not fall in a pvoutput bin, we use the mean specific daily yield obtained from all the other PV systems: multiplying this by the capacity of this small subset of systems gives an estimated yield for those PV systems.
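The repeated random assignment, including the fallback for database systems whose irradiance bin is not observed in pvoutput, can be sketched as below. This is not the authors' implementation: the use of bin centres as yields, the capacity-weighted fallback mean, the summary statistics and all names are assumptions made for illustration.

```python
import random
import statistics


def simulate_daily_yield(systems, probabilities, yield_bin_size,
                         n_draws=500, seed=0):
    """systems: list of (irradiance_bin, capacity_kwp) from the PV database.
    probabilities: irradiance_bin -> {yield_bin: P(yield | irradiance)}.
    Returns (mean, standard deviation) of the total daily yield in kWh.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_draws):
        matched_yield = matched_capacity = unmatched_capacity = 0.0
        for irradiance_bin, capacity in systems:
            distribution = probabilities.get(irradiance_bin)
            if distribution is None:
                unmatched_capacity += capacity  # bin not seen in pvoutput
                continue
            bins = list(distribution)
            weights = [distribution[b] for b in bins]
            yield_bin = rng.choices(bins, weights=weights, k=1)[0]
            specific_yield = (yield_bin + 0.5) * yield_bin_size  # bin centre
            matched_yield += specific_yield * capacity
            matched_capacity += capacity
        # Fallback: mean specific yield of all the other (matched) systems.
        mean_specific = (matched_yield / matched_capacity
                         if matched_capacity else 0.0)
        totals.append(matched_yield + mean_specific * unmatched_capacity)
    return statistics.mean(totals), statistics.stdev(totals)


# Hypothetical database day; the last system falls outside the pvoutput bins.
probabilities = {10: {6: 0.75, 7: 0.25}}
systems = [(10, 4.0), (10, 6.0), (99, 10.0)]
print(simulate_daily_yield(systems, probabilities, yield_bin_size=0.5))
```

The spread across draws is what provides the uncertainty margin; with a fixed seed the sketch is reproducible.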

### 4.4 Representativity

One of the main concerns of any statistical office and indeed any statistician is the case of representativity. We identify three different representativity issues. The first issue was briefly touched upon in section 3 (Data Cleaning) and concerned how to deal with certain observations in the data such as measuring gaps, zero measurements etc. The (non-) inclusion of such systems can have an impact on the daily and annual yields which we calculate later on in this section. For more details about this, we refer the reader back to section 3. At this point it suffices to say that certain choices are made with regards to data cleaning, which we motivated in section 3.

The outstanding two representativity issues are interconnected and discussed further in the following sections. The first concerns the differences in PV populations on different days. The second relates to how representative the pvoutput population is with respect to the total Dutch PV population.

#### 4.4.1 Two representativity issues

From our PV systems database we know that PV systems are installed throughout the year, with a boost in Spring and Summer (SNinnovatie). This means that, in theory, the distributions of PV system parameters, such as orientation and tilt, could vary as a function of time. Unfortunately we do not know what these intrinsic properties are or how they could vary. We make the assumption, on the basis of large number statistics, that on average these PV parameter distributions will remain constant. There are of course some obvious exceptions: a large solar park could start generating solar electricity from one day to the next and have a set-up that varies significantly from small households, e.g. the solar panels track the Sun. However, we estimate this effect to also be small at the moment because these types of set-ups are very rare. While our assumption would seem to hold over the course of different days, it is less obvious whether this should be the case over several years.

In section 3 we saw that data cleaning on a daily basis produces a varying number of PV systems per day. This, in turn, means that the parameter distributions of the PV systems of our daily reliable set vary throughout the year. For example: on 15 January, the percentage of systems facing South could be 50%, whereas on 15 June this is 20%. Similarly, on the first day the percentage of systems with a tilt at or below a given angle could be 20%, but on the second day 35%. We mention two extreme cases to make our point clear; however, in practice the data cleaning never drastically changes the population of reliable PV systems from one day to the next. To continue the comparison, this means that the probability density function (Figure 5), which links irradiance with specific daily yield, on 15 January is not directly comparable to that of 15 June, as it contains different populations.

We now turn to our second representativity issue. If we wish to account for varying daily populations, we need to correct these relative to a reference distribution in parameter space (or an assumed ground truth), which describes the population of Dutch PV systems. We have already established that a complete view on this is unavailable. Other research has also investigated the issue of representativity: for example, killinger2018 derive probability distribution functions for orientation, tilt, capacity and yield. In the case of the Netherlands this is based largely on pvoutput data (75% of PV systems) supplemented with other smaller data sources. The best thing to do, therefore, is to assume different distributions for different parameters and propagate these in our calculations. This will, in turn, allow us to get a feeling for our uncertainty margins.

#### 4.4.2 Normalising the daily pvoutput sets

We normalise our reliable set of systems on a daily basis relative to a reference distribution in parameter space. We choose our reference distribution to be the distribution of parameters observed in pvoutput on the first day of the year. In other words, we define a set of parameter distributions and impose that these be the same every day of the year. We can reach this goal by doing two things: randomly dropping systems which have characteristics of which there are a surplus compared to the reference distributions. Similarly, certain systems can be duplicated to boost certain populations that are under represented in the sample as a result of data cleaning.
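The dropping and duplicating of systems can be sketched as a resampling step per group. The rule of keeping the day's total number of systems fixed, the rounding of target counts and all names are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def resample_to_reference(systems, key, reference_fractions, seed=0):
    """Drop or duplicate systems so that group shares match a reference.

    systems: list of records for one day.
    key: function mapping a record to its group (e.g. an orientation bin).
    reference_fractions: group -> share on the reference day (sums to one).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for system in systems:
        groups[key(system)].append(system)
    n_total = len(systems)
    resampled = []
    for group, fraction in reference_fractions.items():
        members = groups.get(group, [])
        target = round(fraction * n_total)
        if not members:
            continue  # nothing available to duplicate for this group
        if len(members) >= target:
            # Surplus: randomly drop systems by sampling without replacement.
            resampled.extend(rng.sample(members, target))
        else:
            # Shortage: keep all members and duplicate random ones.
            resampled.extend(members)
            resampled.extend(rng.choices(members, k=target - len(members)))
    return resampled


# Hypothetical day with 2 south-facing and 8 east-facing systems, while the
# reference day had equal shares of both orientations.
day = [("S", i) for i in range(2)] + [("E", i) for i in range(8)]
balanced = resample_to_reference(day, key=lambda s: s[0],
                                 reference_fractions={"S": 0.5, "E": 0.5})
print(sorted(group for group, _ in balanced))
```

When a group is very small, the duplicates are drawn from only a handful of systems, which is why the number of normalisation variables has to stay limited.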

To drop and duplicate PV systems, we need to determine which factors are crucial in determining the yield. In Figure 1 we identified a plethora of different factors. Ideally, one would like to play around with as many variables as possible, but in practice our sample is so small that we cannot afford that luxury: the more variables we choose, the more normalising of our reliable systems we will need to do. Randomly dropping and duplicating systems can become unwise if certain population groups within the sample become so small that there are not many systems to choose from for which to drop or duplicate. From all those elements, we choose three which we think will be the most influential in determining the daily yield in the Netherlands.

The three PV system variables we choose for normalisation are orientation ($o$), tilt ($t$) and inverter to PV capacity ratio ($r$). From literature, it is widely known that the first two factors are very important in determining the yield (eia). From the pvoutput data, it was also apparent that the inverter and installed PV capacity were rarely the same size. It is quite customary for inverters to be undersized, i.e. have a smaller capacity than the PV system’s. This is often done at locations where other specifications, such as the orientation and tilt, will reduce the output from the system, so that a higher capacity inverter is not needed. At better performing locations there can be other benefits from doing this: smaller inverters make the PV system more efficient in low-light conditions, e.g. mornings, evenings and cloudy days. It can therefore be beneficial to have a smaller inverter optimising these parts of the day at the expense of losing (or clipping) some energy at the peak performance time, e.g. midday (smaoversizing; solaredgeoversizing). While inverters are often undersized, oversizing also occurs. This does not have much benefit in terms of the yield, but can be taken into consideration by owners who may have plans to increase their installed capacity in the future (solarchoice).

Up until now we have devised a method to keep days consistent with respect to one another, by normalising distributions in 3D parameter space. There is, however, a last factor we need to take care of: the geographical spread of PV systems in the Netherlands. The geographical distributions of pvoutput systems and database systems need not be the same. To account for this, we use the satellite irradiance distributions ($I$) from the database and pvoutput. In other words: we normalise the daily pvoutput sets such that the same proportion of systems is observed within some irradiance bin compared to that same irradiance bin in the database. We decide not to bin the data geographically, e.g. on a municipal or provincial level, for several reasons. Weather does not follow administrative boundaries and has different resolution scales on different days, i.e. on a clear and sunny day the weather is the same everywhere, but on another day weather could be a lot more local. The pvoutput sample is also too small to split up into smaller geographical elements.

To normalise our daily datasets, we begin by binning the pvoutput data in 4D parameter space: orientation, tilt, inverter to PV capacity ratio and irradiance. Let us now introduce some mathematical notation to make things clearer. For any given day, we have a number of observations (or PV systems) in our sample. The number of systems per bin, normalised by the total number of systems, is given by equation 5, which defines the distribution function for each of the four parameters listed in equation 2. If we sum all the bins for a given parameter, we obtain equation 6, which must always equal one (or 100%). The lower and upper limits of each bin in equation 5 are defined by equation 4 for a given bin size. Finally, the number of bins is defined by equation 3.

(2) |

(3) |

(4) |

(5) |

(6) |
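The binning described by equations 3 to 6 can be illustrated with a minimal Python sketch. The function name `bin_fractions`, its arguments and the example tilt values are hypothetical and serve only as an illustration; the sketch assumes equally sized, half-open bins and that all values lie between the chosen minimum and maximum.

```python
import math

def bin_fractions(values, x_min, x_max, bin_size):
    """Return the fraction of observations per bin (cf. equations 3 to 6)."""
    # Number of bins needed to span [x_min, x_max] (cf. equation 3).
    n_bins = math.ceil((x_max - x_min) / bin_size)
    counts = [0] * n_bins
    for x in values:
        # Bin i covers [x_min + i*bin_size, x_min + (i+1)*bin_size) (cf. equation 4);
        # a value equal to x_max is folded into the last bin.
        i = min(int((x - x_min) // bin_size), n_bins - 1)
        counts[i] += 1
    total = len(values)
    # Normalise by the number of systems so that the fractions sum to one
    # (cf. equations 5 and 6).
    return [c / total for c in counts]

# Hypothetical tilt angles (degrees) for eight PV systems on one day.
tilts = [10, 15, 20, 30, 35, 35, 40, 45]
fractions = bin_fractions(tilts, x_min=0, x_max=90, bin_size=15)
print(fractions)  # [0.125, 0.25, 0.5, 0.125, 0.0, 0.0]
```

The same function could be applied separately to each of the four parameters, which is how the per-parameter distributions are treated in this sketch.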

Table 1 shows the minima, maxima and bin sizes for the four parameters. It should be noted that the bins for the inverter to PV capacity ratio do not correspond to physical values: 0 means the inverter and system sizes are the same, -1 means the system size is smaller than the inverter size and 1 means the system size is larger than the inverter size. Our choice for the other bin sizes is motivated by practical limitations. pvoutput only allows the input of one of eight cardinal directions, which means that the minimum step in orientation must be 45°. We choose the tilt bin size such that we retain a significant number of PV systems per bin. Our choice for the irradiance bin size is motivated by our earlier choice, as discussed in section 4.2 (combining data sources), which led to the 2D probability density functions shown in Figure 5.

parameter | min | max | bin size |
---|---|---|---|
orientation | | | |
tilt | | | |
inverter to PV capacity ratio | -1 | 1 | 1 |
irradiance | min(irradiance) | max(irradiance) | |

Having defined the binned distribution function, we can now express our normalisation. We normalise the distributions over orientation, tilt and inverter to PV capacity ratio with respect to our reference distribution on the first day of the year, as expressed by equation 7, whereas we normalise the distribution over irradiance for a given day with respect to the irradiance distribution in the database on that day, as given by equation 8:

(7) |

(8) |

To satisfy equations 7 and 8, every bin fraction in equation 5 for a given day must equal (or come close to) the corresponding fraction in the reference distribution. Each bin contains a number of observations (equation 5), in other words a vector of values falling in that bin. We can randomly draw an element from this vector and either add a copy of it to the vector or remove it from the vector. This act can be repeated until the two equations above are satisfied. Since it is not possible for these equations to be satisfied exactly, we allow a leeway of 1.5%. This figure is a compromise: it is strict enough that the distributions in equations 7 and 8 are close to equal, but not so strict that the method becomes difficult to implement.

Figure 8 shows how the number of PV systems changes after duplicating and dropping certain systems. On some days the overall number of systems increases; on most days, however, the set decreases. While the effect does not seem very large in the figure, we draw the reader’s attention to the fact that many systems may have been dropped and duplicated on a given day, resulting in a total similar to the original one. A similar total therefore does not mean that the specifications remained similar. Here again we perform bootstrapping, since the duplicating and dropping of PV systems is done at random: we simulate 50 different realisations of the daily metadata. Figure 9 shows the distribution of normalised reliable PV systems on 1 June 2016 for orientation, tilt, inverter to PV capacity ratio and irradiance. The distribution of PV systems on 1 January 2016 was chosen as the reference and 1 June was normalised accordingly. The irradiance, on the other hand, is normalised by comparing to the irradiance distribution observed in the database on that particular day.
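The duplicate-or-drop procedure can be sketched in Python as follows. This is a simplified illustration for a single parameter, not the implementation used in this work: the function name `rebalance`, the rule of always correcting the bin with the largest deviation, the iteration cap and the example numbers are all assumptions made for the sketch.

```python
import random
from collections import Counter

def rebalance(bins, target, tolerance=0.015, max_steps=100_000, seed=0):
    """Duplicate or drop randomly chosen systems until every bin fraction
    lies within `tolerance` of its reference fraction.

    bins   : list holding one bin index per PV system in the daily sample
    target : dict mapping bin index -> reference fraction
    """
    rng = random.Random(seed)
    sample = list(bins)
    for _ in range(max_steps):
        counts = Counter(sample)
        n = len(sample)
        # Signed deviation of each bin fraction from its reference fraction.
        deviation = {b: counts[b] / n - target.get(b, 0.0)
                     for b in set(counts) | set(target)}
        worst = max(deviation, key=lambda b: abs(deviation[b]))
        if abs(deviation[worst]) <= tolerance:
            return sample
        members = [i for i, b in enumerate(sample) if b == worst]
        if deviation[worst] > 0:
            # Over-represented bin: drop one randomly chosen system.
            sample.pop(rng.choice(members))
        elif members:
            # Under-represented bin: duplicate one randomly chosen system.
            sample.append(sample[rng.choice(members)])
        else:
            raise ValueError(f"bin {worst} is empty, nothing to duplicate")
    raise RuntimeError("no convergence within max_steps")

# Hypothetical daily sample: 60, 30 and 10 systems in bins 0, 1 and 2.
daily_bins = [0] * 60 + [1] * 30 + [2] * 10
reference = {0: 0.4, 1: 0.4, 2: 0.2}
balanced = rebalance(daily_bins, reference)
print(len(balanced), Counter(balanced))
```

Running the function with different seeds gives different realisations of the same day, which is the source of randomness that the bootstrapping step accounts for.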

### 4.5 Putting it all together: daily solar energy in the Netherlands

The framework set out in sections 4.2 to 4.4 can now be used to calculate the specific daily and annual energy yields for different scenarios. By scenario, we mean different configurations in 3D parameter space (orientation, tilt and inverter to PV capacity ratio). These 15 scenarios and the annual results these deliver can be seen in Tables 2 and 3. Scenario 1 takes all the pvoutput data as it is and re–balances the parameters for all subsequent days. Scenarios 2–6 examine the effects of different orientations whereas scenarios 7–9 explore different tilt regimes. The choice for different inverter to PV capacity ratios can be seen in scenarios 10–13. Finally, scenarios 14 and 15 explore combinations of all three different parameters.

Scenario 2, closely followed by scenario 15, is the best performing scenario. This is exactly what one would expect, given that the systems face South in scenario 2. Restricting the tilt to a more optimal angle range, as is the case in scenario 15, at the expense of relaxing the orientation, gives similar results. Scenario 15 outperforms scenario 14 by a small margin, hinting that a smaller inverter may pay off over a year, as was previously mentioned in section 4.4.2. Scenario 6 delivers the poorest yield, which is unsurprising given that no systems face South. Another noteworthy outcome is scenario 7, which restricts PV systems to quite flat or fully flat systems. It gives the second lowest yield, presumably because a lot of potential energy is lost on Winter days. The yields of the other scenarios lie in between these extrema. Figure 10 shows the yield on a daily basis for 2016 according to scenario 2 (the best performing one). This visualisation demonstrates the differences in solar energy between Summer and Winter. It is also apparent that a good Winter’s day is a lot better than one might expect, especially when compared to a bad Summer’s day.

#### 4.5.1 A discussion on uncertainty margins

The uncertainty margins of the (specific) annual yields are very small (of the order of 0.05%). We briefly return to the method described in sections 4.2–4.4 to examine why this could be the case. The outcomes in Tables 2 and 3 were obtained through a two–part process. First, we made 50 different "observations" of the metadata through our normalisation process: for a chosen scenario, we imposed that the 3D distributions of our parameters were the same for every day of the year. Second, we randomly selected one of these 50 observations and used it to construct a 2D probability density function, which we applied to our PV systems database. This step was performed 500 times. The small uncertainty margins could arise from both steps. Since the pvoutput sample is so small, making 50 different observations of the metadata may not produce much variation between the resulting probability density functions. Another reason could be that the resolution of the 2D probability density function is too coarse, smoothing out many smaller scale effects. It would be interesting to check by how much the uncertainties change as a function of the grid resolution. Another contributing factor is that the PV systems database is very large, containing several hundred thousand records, so that differences on a micro level average out. This is further compounded by the fact that differences on a daily basis average out when summing over 365 (or 366 in the case of 2016) days. This is confirmed by the observation that the uncertainties on the daily yield are an order of magnitude higher than those on the annual yield. Finally, we remind the reader that a bias is possible due to the choices we made when cleaning the data, which will have influenced the types of populations that made our final cut.
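The averaging effect of summing over a year can be made explicit with a simplified estimate. The symbols below are introduced for illustration only, and the estimate assumes that the daily errors are independent and that every day has the same yield $E_{\text{day}}$ and the same uncertainty $\sigma_{\text{day}}$, which is an idealisation:

$$
\frac{\sigma_{\text{year}}}{E_{\text{year}}}
= \frac{\sqrt{365}\,\sigma_{\text{day}}}{365\,E_{\text{day}}}
= \frac{1}{\sqrt{365}}\,\frac{\sigma_{\text{day}}}{E_{\text{day}}}
\approx 0.052\,\frac{\sigma_{\text{day}}}{E_{\text{day}}}
$$

Under these assumptions the relative uncertainty on the annual total is roughly a factor of 19 smaller than on a single day, which is of the same order as the difference noted above.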

Sc | Specific annual yield in 2016 (kWh/kWp) | Specific annual yield in 2017 (kWh/kWp) |
---|---|---|
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
6 | | |
7 | | |
8 | | |
9 | | |
10 | | |
11 | | |
12 | | |
13 | | |
14 | | |
15 | | |

Sc | Energy yield in 2016 (GWh) | Energy yield in 2017 (GWh) |
---|---|---|
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
6 | | |
7 | | |
8 | | |
9 | | |
10 | | |
11 | | |
12 | | |
13 | | |
14 | | |
15 | | |

#### 4.5.2 Comparison with current SN figures

All specific annual yields for 2016 are in the range 877–946 kWh/kWp, which is significantly higher than the figure adopted by SN (875 kWh/kWp), especially when considering that 877 kWh/kWp corresponds to an unrealistic set-up: no panels facing South (scenario 6). This naturally translates into the energy yields also being consistently higher, ranging from 1605 GWh to 1697 GWh (with the exception of scenario 6), compared to 1602 GWh (SNstatline). The picture is more mixed for 2017, with the SN figure lying within the range of 838–899 kWh/kWp. What is more striking is that the SN method ends up delivering a higher energy yield for the same specific yield (875 kWh/kWp). This can be seen by normalising our 2016 and 2017 specific yields (910 and 868 kWh/kWp, using scenario 1, which we think is the most representative) by 875 kWh/kWp and multiplying them by the energy yield as computed by SN (1602 GWh for 2016). For 2016 and 2017 this delivers 1666 GWh and 2190 GWh, slightly higher than the results we obtain, which are 1632 GWh and 2131 GWh.
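As a check on the 2016 figure, the rescaling described above can be written out explicitly. The symbol $E^{\text{rescaled}}_{2016}$ is introduced here for illustration only; the numbers are those quoted in the text:

$$
E^{\text{rescaled}}_{2016} = \frac{910}{875} \times 1602\ \text{GWh} = 1.04 \times 1602\ \text{GWh} \approx 1666\ \text{GWh}
$$

This is about 34 GWh, or roughly 2%, above the 1632 GWh obtained for scenario 1.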

We identify two possible sources for this discrepancy. The first is quite simply that the sample of scenario 1 is very different to that used to calculate the 875 kWh/kWp figure in 2012 and 2013. This is probably true to some extent, since improvements will have been made to PV systems. Systems in 2012 and 2013 could have been relatively young compared to systems in our sample, since solar energy is now a lot more established. The second and far more likely source for the discrepancy is our approach for calculating the yield, compared to the current method, as was shown in equation 1. We calculate the yield bottom up: we use installation dates from the database to determine the installed capacity per day, which we multiply by specific daily yield factors. This, in turn, gives us annual yields. Currently, an average installed capacity is calculated on a yearly basis (see equation 1) and multiplied by the specific annual yield of 875 kWh/kWp.
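The difference between the two approaches can be illustrated with a toy calculation in Python. The function names and all numbers are hypothetical, and the year-average capacity is assumed here to be the mean of the capacities at the start and end of the year, which may differ from the exact form of equation 1.

```python
def bottom_up_yield(daily_capacity_kwp, daily_specific_yield):
    """Sum of installed capacity times specific daily yield over all days (kWh)."""
    return sum(c * y for c, y in zip(daily_capacity_kwp, daily_specific_yield))

def average_capacity_yield(capacity_start_kwp, capacity_end_kwp,
                           specific_annual_yield=875.0):
    """Annual yield from a year-average capacity and a fixed factor (kWh)."""
    return 0.5 * (capacity_start_kwp + capacity_end_kwp) * specific_annual_yield

# Toy "year" of four periods: capacity grows while the specific yield
# follows a seasonal pattern that sums to 875 kWh/kWp.
capacity = [100.0, 120.0, 140.0, 160.0]   # kWp installed in each period
specific = [100.0, 350.0, 325.0, 100.0]   # kWh/kWp in each period

bottom_up = bottom_up_yield(capacity, specific)
averaged = average_capacity_yield(capacity[0], capacity[-1])
print(bottom_up)  # 113500.0
print(averaged)   # 113750.0
```

Even though both calculations use the same annual specific yield, the totals differ because the bottom-up sum weights each period's yield by the capacity actually installed at that time.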

#### 4.5.3 Comparison with large PV systems

We now compare our results for 2016 with data we received from large PV systems (a sample of 1800 systems). We only have this data on a monthly basis and do not know the configuration details of the systems (orientation, tilt etc.). We calculated the share of each month in the annual yield for both pvoutput and the large PV systems. This can be seen in Table 4. While the Winter months agree closely, there are small discrepancies for the Summer months, with pvoutput seemingly underestimating the output. The most striking offsets can be seen for June and September. It is difficult to ascertain why this is the case. It could be that many of the large PV systems are more optimally oriented, thus delivering better yields in the Summer months. Finally, the specific annual yield from the large systems is 904 kWh/kWp, which comes quite close to our result of 910 kWh/kWp in scenario 1.

2016 | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
pvoutput | 2.5 | 4.9 | 8.1 | 11.6 | 13.9 | 11.9 | 13.2 | 12.5 | 10.6 | 5.8 | 2.9 | 2.2 |
large installations | 2.5 | 4.7 | 8.0 | 11.5 | 14.2 | 12.5 | 13.5 | 12.4 | 10.1 | 5.6 | 2.9 | 2.2 |
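The monthly offsets can be read off directly from Table 4. The short Python sketch below, with variable names chosen for illustration, computes the difference in monthly share (pvoutput minus large installations, in percentage points) and picks out the two largest offsets.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Monthly shares of the 2016 annual yield (%), copied from Table 4.
pvoutput = [2.5, 4.9, 8.1, 11.6, 13.9, 11.9, 13.2, 12.5, 10.6, 5.8, 2.9, 2.2]
large = [2.5, 4.7, 8.0, 11.5, 14.2, 12.5, 13.5, 12.4, 10.1, 5.6, 2.9, 2.2]

# Difference in percentage points, rounded to the precision of the table.
diff = {m: round(p - l, 1) for m, p, l in zip(months, pvoutput, large)}
largest = sorted(diff, key=lambda m: abs(diff[m]), reverse=True)[:2]
print(largest)                     # ['Jun', 'Sep']
print(diff["Jun"], diff["Sep"])    # -0.6 0.5
```

The signs show that pvoutput has a lower share than the large installations in June and a higher share in September.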

#### 4.5.4 Comparison with other estimates

There are a couple of other sources with which we can compare our results. The most notable ones are Siderea (Siderea) and SolarCare (Solarmagazine); they are summarised in Table 5. The former only contains five PV systems (we quote the average of the average oriented systems from the website), spread over the Netherlands, whereas the latter contains some 2500 PV systems. The very small sample size of Siderea most likely explains the optimistic figures quoted on their website. We expect the figures quoted by SolarCare to be more robust due to the far larger sample. Compared to SolarCare, we notice a consistent offset for 2016 and 2017 of around 10 kWh/kWp. Our results therefore look broadly in line with those recorded by other parties.

Year | Siderea | SolarCare | SN | This research |
---|---|---|---|---|
2012 | n.a. | 900 | 875 | n.a. |
2013 | n.a. | 890 | 875 | n.a. |
2016 | 943 | 920 | 875 | 910 |
2017 | 900 | 880 | 875 | 868 |

#### 4.5.5 Comparison between 2016 and 2017

The higher specific annual yield for 2016 seems to be consistent with the higher than (30 year) average irradiance observed at the central weather station De Bilt: 1039 kWh/m², compared to 989 kWh/m² and 1003 kWh/m² recorded in 2012 and 2013, which were average years (knmi2016_jow; knmi2012_jow; knmi2013_jow). This does not seem to hold for 2017: a higher than average irradiance was also observed at De Bilt (1020 kWh/m²; knmi2017_jow), yet the specific annual yield was lower. We can only hypothesise as to why this discrepancy has occurred. Again, our sample may differ in representativity from the 2012 and 2013 samples. Another possible explanation is that the irradiance observed at De Bilt is not representative of irradiance throughout the country. This seems unlikely, given that De Bilt is chosen precisely because the weather measurements recorded there are a good average for the country.

Figure 11 shows the ratio of specific solar yield to the irradiance measured at De Bilt for 2016 (blue) and 2017 (green), using scenario 1. From this plot, we can clearly see that in May and June 2017 the conversion of irradiance to solar energy seems to have been less efficient than in 2016. These are of course amongst the best months for solar energy in the Netherlands. We note that June 2017, in particular, was an abnormally warm month, with eight Summer days (maximum temperature of at least 25 °C) and two tropical days (maximum temperature of at least 30 °C). From the weather records, it appears that June 2017 was the warmest June on record (knmi2017june). June 2016 had five Summer days, consistent with the average number expected in June. The average temperature in June 2016 was around 1 degree warmer than in a normal June (knmi2016june). We therefore hypothesise that a temperature effect may have been responsible for the decrease in solar energy.

#### 4.5.6 Differences in daily yields depending on scenario

Figure 12 shows all scenarios for two months in particular (January and July). Scenarios 2–15 are shown normalised with respect to scenario 1. The y–axis is thus an index: if the yield for a day is higher than 100, then that particular scenario was more efficient than scenario 1 on that day. The opposite is true for values lower than 100. The performance of the scenarios in these two months, which differ greatly in e.g. day length and solar angle, says something about the validity of the model. This can be seen when contrasting scenario 7 (relative to the other scenarios) in January and July. This scenario underperforms in January and overperforms in July, which is wholly consistent with the fact that these are low tilt systems. Similarly, scenario 6 underperforms in January, but is about average in July. The less than optimal orientation of these systems has a smaller effect in the Summer, when it is mitigated by the very long days, than in January. Finally, contrasting scenarios 2, 3 and 4 in the Summer also indicates whether on some days the weather was better in the morning, in the afternoon or both. These observations support the soundness of the presented model.

## 5 Regional solar energy production in the Netherlands

In this section, we outline a method to translate the national specific daily yields into regional specific daily yields for any of the given scenarios, explored in section 4. Our starting point is the specific daily yield. Table 10 already showed these on an annual basis.

### 5.1 Solar energy on a municipality level

The specific yield at a database location on a given day is calculated by boosting or decreasing the national specific yield by a factor proportional to the offset between the irradiance at that location on that day and the mean irradiance on that same day. Equation 9 gives the offset in irradiance; equation 10 uses that factor to increase or decrease the national specific yield at the location. Equation 11 predicts the generated solar energy at the database location by multiplying the installed capacity by the regional specific yield. Finally, for any given administrative structure, in our case Dutch municipalities, the total generated solar energy on a given day can be calculated using equation 12. This approach can be used to calculate the local yields for all the different scenarios.

(9) |

(10) |

(11) |

(12) |
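The chain of equations 9 to 12 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: the function name `municipal_energy` and the example values are hypothetical, the proportionality constant between irradiance offset and yield is taken to be one, and the mean irradiance is taken as the unweighted mean over all database locations.

```python
from collections import defaultdict

def municipal_energy(national_specific_yield, systems):
    """Spread a national specific daily yield over locations, then sum per municipality.

    national_specific_yield : national specific yield for the day (kWh/kWp)
    systems : list of (municipality, installed_capacity_kwp, irradiance) tuples
    """
    mean_irradiance = sum(irr for _, _, irr in systems) / len(systems)
    totals = defaultdict(float)
    for municipality, capacity_kwp, irradiance in systems:
        # Relative irradiance offset at this location (cf. equation 9).
        offset = (irradiance - mean_irradiance) / mean_irradiance
        # Scale the national specific yield up or down (cf. equation 10).
        local_specific_yield = national_specific_yield * (1.0 + offset)
        # Energy = installed capacity times local specific yield (cf. equation 11),
        # accumulated per municipality (cf. equation 12).
        totals[municipality] += capacity_kwp * local_specific_yield
    return dict(totals)

# Hypothetical day: three database locations in two municipalities.
systems = [
    ("A", 10.0, 5.0),
    ("A", 20.0, 4.0),
    ("B", 10.0, 3.0),
]
energy = municipal_energy(4.0, systems)
print(energy)  # {'A': 130.0, 'B': 30.0}
```

With a proportionality constant of one, the local specific yield simply scales with the ratio of local to mean irradiance; a different constant would dampen or amplify the regional contrast.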

Figure 13 shows the specific annual yields per municipality for 2016 and 2017 respectively. The pattern shown here is exactly what one would expect, with coastal regions enjoying more Sun and therefore also more solar energy production. The lower overall yield in 2017 compared to 2016 is also entirely consistent with other results (see section 4.5). The pattern over a whole year does not appear to vary much between 2016 and 2017. Finally Figure 14 shows the regional yield for every single day in June, showing a large variety per day, which is in line with what could be expected.

### 5.2 Caveats and discussion

We would like to point out a few subtleties and caveats here. While we produce daily yields per database location, we emphasise that this does not mean that we can accurately predict the solar energy produced at any location in the Netherlands. Rather, the point of our method is that, when aggregated to sufficiently large areas, we expect the totals on those levels to come close to reflecting reality, should the PV systems be installed according to the scenario configurations specified in Tables 2 and 3. Even then, it should be noted that if the share of e.g. a large solar park is large relative to household PV systems in a given municipality, this could be an important source of skewing, which would not be reflected in our results. Another source of uncertainty comes from our satellite data. As we mentioned in section 2, we do not have data when the Sun is lower than 12 degrees. This means that the offsets we calculate in equation 9 become more uncertain the closer we get to the Winter solstice, since the relative importance of the portion of irradiance received while the Sun is lower than 12 degrees is not captured in the data. It is for example possible that the offsets calculated on a Winter’s day at a location are not much higher than the average, but that at that location there was a lot more irradiance than average in the morning and afternoon. This is not reflected in the irradiance data. A final caveat of our method is the assumption that irradiance acts on the yield in the same way in two different locations that have received an equal amount of irradiance. This need not be true, since other local weather factors could have an effect on the yield.

## 6 Conclusion and Summary

We have presented a new method to determine the daily and regional solar energy generated in the Netherlands. This was achieved with two new data sources: an online portal called pvoutput with near real–time information on generated solar energy, and high resolution irradiance data from the Royal Netherlands Meteorological Institute. By matching pvoutput locations with irradiance grid data, we relate irradiance and energy through a 2D probability density function. We apply this relation to the database containing most solar panels in the Netherlands, using bootstrapping to estimate uncertainty margins. Different scenarios, related to PV system specifics such as orientation, tilt and inverter to PV capacity ratio, are explored: subsets of pvoutput systems with certain characteristics are selected and the calculations are repeated to see what the effect of these subsets is on the daily and hence annual yield. For 2016, we find specific annual yields in the range of 877–946 kWh/kWp, consistently higher than the 875 kWh/kWp currently adopted by Statistics Netherlands. For 2017, the 875 kWh/kWp factor adopted by Statistics Netherlands may have overestimated the yield: we find specific annual yields in the range of 838–899 kWh/kWp. These results highlight the need for a specific yield, determined on a daily and annual basis, that is a function of the irradiance. We convert our daily and annual yields into regional maps per municipality. We believe this can be of great use for policy making at a municipality and provincial level, where efforts are undertaken to stimulate renewable energy initiatives.

## Acknowledgements

The authors wish to thank Wilfried van Sark and Frank Pijpers for carefully reading the manuscript prior to submission, greatly improving its quality. We also wish to thank Alex Priem, Dick Windmeijer, Jurriën Vroom, Reinoud Segers, Anne Miek Kremer, André Meurink, Otto Swertz, Lyana Curier and Sofie De Broe for their help and discussions. The authors also wish to thank colleagues at other European statistical offices who provided information on the calculation of solar PV in their respective countries. These are Matt Laycock, Warren Evans, Jörg Decker, Nadine Dufait, Manon Urbain and Aline Guilmot. Finally, B.P.M. Laevens wishes to thank his colleagues at the ministry of Economic Affairs and Climate Policy for their support.
