In January 2019, Andrew Penn, the CEO of Telstra – Australia’s largest Telecommunications company – announced that the company will be expanding its new ‘Innovation and Capability Centre’ in Bangalore, India. This will create approximately 300 Network and Software Engineering jobs, with the potential for more (Pearce, 2019). Penn cited ’skill shortages’ as the main reason for this outsourcing decision:
“We need these capabilities now, but the fact is we cannot find in Australia enough of the skills that we need on the scale that we need them, such as software engineers. Why? There simply are not enough of them. The pipeline is too small.” (for Economic Development of Australia, )
This coincides with Telstra announcing a goal of a net reduction of 8,000 jobs by 2022 (mainly in Australia), as the company seeks to automate labor tasks and simplify processes (Chalmers, 2019). While an isolated example, the evolving labor demands of Telstra highlight both the opportunity costs of labor shortages and the precariousness of workers’ security in the face of automation and globalization. As a result of these claimed skill shortages, the Australian labor market will not enjoy the benefits afforded by 300 highly skilled jobs – benefits that materialize in the form of greater economic activity, labor productivity, and economic competitiveness. This is not specific to Telstra or Australia: such labor shortages limit employment opportunities for individuals, impede firm-level investment and technological adoption, and hamper labor productivity for most industries in all economies.
In this work, we focus on three open problems relating to labor shortages at the occupational level. Labor shortages occur when the labor demand for specific skills required by occupations exceeds the supply of workers who possess those skills at a prevailing market wage (Junankar, 2009; Healy et al., 2015). A distinction can also be made between skills, knowledge, abilities, and occupations. ‘Skills’ are the proficiencies developed through training and/or experience (OECD, 2019); ‘knowledge’ is the theoretical and/or practical understanding of an area; ‘ability’ is the competency to achieve a task (Gardiner et al., 2018); and ‘occupations’ are standardised jobs that amalgamate the skills, knowledge, and abilities used by an individual to perform the set of tasks required by their vocation. For simplicity, throughout this paper the term ‘skill’ will include ‘knowledge’ and ‘ability’. Additionally, we define occupational labor shortages as occupations whose required skills are in relatively high demand and short supply at the prevailing market wage rate.
The first open problem relates to predicting labor shortages.
While the adverse effects of labor shortages have been well-documented (OECD, 2019; Brunello and Wruuck, 2019; Haskel and Martin, 1993), predicting labor shortages is difficult.
Even more challenging is predicting temporal changes to the labor shortage status of an occupation.
For example, accurately predicting an occupation shifting from being classified as Not in Shortage in one time period to In Shortage the next.
These difficulties reflect the limited understanding of which variables are most predictive of labor shortages.
The question is therefore: can we leverage modern Data Science and Machine Learning techniques to predict occupational labor shortages?
The second open problem relates to the available sources of data. Labor market data is usually fragmented, measured by a variety of techniques and sources. Matching jobs data of different types and from different sources can be time-consuming or infeasible, particularly when the data must be organized into standardized occupational classes. Consequently, labor market prediction models are often biased toward either the demand or supply side of labor market data. For clarification, ‘Labor Demand’ refers to the demand for workers by firms as a function of input prices and other exogenous variables; ‘Labor Supply’ refers to the supply of workers who trade off consumption and leisure at a given real wage rate (Nechyba, 2016). Ideally, both Labor Demand and Labor Supply data are represented for labor prediction tasks, such as predicting labor shortages. However, accounting for inertia between both sides is an added complication. Labor Demand data, such as online job advertisements (ads) data, can provide information in near real-time, whereas Labor Supply data, such as official employment statistics, are lagging indicators. The question is therefore: can we leverage both labor supply and labor demand features to measure labor shortages, while accounting for auto-regressive inertia and noisy data?
The third open problem relates to leveraging online market data. Traditionally, labor modelling has oriented towards theoretical and Econometric approaches that formally represent hypotheses (Brunello and Wruuck, 2019). However, the confluence of more available labor market data facilitated by the Internet (for example, job ads), advances in computation, and greater access to analytical tools (such as Machine Learning) is enabling more data-driven approaches for labor prediction tasks. While data-driven and Machine Learning approaches are becoming more commonplace for labor modelling tasks (Börner et al., 2018), applying such tools and techniques remains a relatively under-explored area in Labor Economics. The open questions are: how well do these models perform, and how robust are they, at predicting periodic changes to labor shortages for occupations, and which variables are most predictive?
We address the above-stated open questions by constructing a supervised Machine Learning model framework to predict occupational classifications as In Shortage or not. These binary classification models are built using eXtreme Gradient Boosting (XGBoost), a scalable Machine Learning system for tree boosting (Chen and Guestrin, 2016). We incorporate Labor Demand and Labor Supply occupational data from Australia as input, which are organized and matched according to the official Australian occupational standards. On the Labor Demand side, we use 1,229,608 job ads from across all of Australia spanning from 2012-01-01 to 2018-12-31. For the Labor Supply side, we use ‘Detailed Labor Force’ data from the Australian Bureau of Statistics over the same time period (Australian Bureau of Statistics, 2019). Lastly, the ‘ground-truth’, or target variable, is taken from the longitudinal list of occupational shortages, recorded by the Australian Federal Department of Education, Skills and Employment (Department of Education, Skills and Employment, Australian Government, 2019). These official labor shortage classifications directly inform national and state policies in the areas of education, training, employment and skilled immigration.
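To make the tree-boosting idea concrete, the following is a minimal sketch of gradient boosting for binary classification, written in plain Python with one-split decision trees (stumps) as the weak learners. It is an illustration only and not the XGBoost implementation: it omits XGBoost's second-order optimization and regularization, and the function names, the toy feature values, and the settings `n_rounds` and `learning_rate` are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Find the one-feature threshold split that best fits the residuals (least squares)."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:
        # No valid split exists: fall back to a constant prediction.
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Additively fit stumps to the negative gradient of the logistic loss."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For the logistic loss, the negative gradient w.r.t. the raw score is (y - p).
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        j, t, lm, rm = fit_stump(xs, residuals)
        stump = (j, t, learning_rate * lm, learning_rate * rm)  # shrink each step
        stumps.append(stump)
        scores = [s + stump_predict(stump, x) for s, x in zip(scores, xs)]
    return stumps


def predict_proba(stumps, x):
    """Probability of the positive class (here: In Shortage)."""
    return sigmoid(sum(stump_predict(s, x) for s in stumps))


# Hypothetical toy data: two made-up features per occupation-year; 1 = In Shortage.
xs = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [6.0, 1.0], [7.0, 4.0], [8.0, 6.0]]
ys = [0, 0, 0, 1, 1, 1]
model = fit_boosted_stumps(xs, ys)
```

Each round fits a weak learner to the current errors and adds a shrunken copy of it to the ensemble, which is the sense in which boosting optimizes a differentiable loss function step by step.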
The main contributions of this work include:
We develop a data-driven modeling framework for predicting labor shortages of occupations, and their periodic changes, by using robust Data Science and Machine Learning techniques;
We analyze the predictability of different modeled features, which includes both Labor Demand and Labor Supply data sources;
We share the source code of the constructed models and the compiled new dataset, which will be available for download upon publication.
2. Related Work
We structure the discussion of the related work into three areas. First, in Section 2.1, we visit work dealing with measuring labor shortages; next, in Section 2.2, we discuss work addressing the cyclical and structural factors affecting labor shortages; finally, in Section 2.3, we investigate the economic costs of labor shortages.
2.1. Measuring Labor Shortages
The broader problem. Labor shortages can be considered a subset of the broader problem of ‘Labor Mismatch’ (or ‘Skills Mismatch’). At the macro-level, labor mismatch refers to the disequilibrium of aggregate supply and demand of labor skills, usually with reference to a specific geographic unit (Brunello and Wruuck, 2019). While labor shortages arise when the demand for specific skills exceeds the available supply of workers at real wage rates, ‘Labor Surpluses’ are a product of excess skill supply (Quintini, 2011). That is, there are more workers who possess specific skills than the labor market demands on aggregate. Therefore, labor shortages are usually calculated as a component of measuring labor mismatches.
Measures using surveys. Labor shortages are typically measured at the firm-level through the use of surveys to examine the extent of unfilled and hard-to-fill vacancies (McGuinness et al., 2017). A shortcoming of this approach is that labor shortages can be overstated and such surveys are often unrepresentative. For instance, employers may label as labor shortages their own inability to offer a sufficient wage-level, attractive working conditions, or a desirable location. These micro-level factors can distort the presence of genuine labor shortages, where employers extrapolate their firm-specific challenges as macro-level issues (CEDEFOP, 2015). This is particularly problematic in unrepresentative surveys of labor shortages.
Over several studies, the European Centre for the Development of Vocational Training (Cedefop) has attempted to differentiate ‘genuine’ labor shortages from misclassified labor shortage claims from surveyed employers (Cedefop, 2018; CEDEFOP, 2015). With reference to the Flash Eurobarometer 304 survey on ‘Employers’ perception of graduate employability’ (Directorate-General for Communication, EU Open Data Portal, ), 47 per cent of employers report difficulties in hiring suitably skilled graduates. However, the research found that the share of graduate employers facing genuine labor shortages was much lower, at 34 per cent.
Use of indirect measures. To differentiate between perceived and genuine labor shortages, other studies have complemented survey results with indirect measures, such as wage growth, employment growth, vacancy rates, and work intensity. The rationale underlying these approaches is that occupations experiencing labor shortages are typically characterised by wage premiums, greater employment growth, growing vacancy rates, and higher levels of overtime (Brunello and Wruuck, 2019). The OECD implemented such indirect measures in concert with employer surveys to construct a series of indicators and composite indexes on skills for employment, including labor shortages (OECD, 2017). The ‘World Indicators of Skills for Employment’ (WISE) database calculates an occupational indicator of skill shortages based on wage growth, employment growth, and growth in the hours worked (OECD, 2015). Next, this indicator is transformed into a composite skill index that uses the O*NET database (National Center for O*NET Development, ) to map occupations into groups of skills and tasks. This allows for international comparability between OECD countries for skills challenges and performances, including the extent of labor shortages.
Other approaches have used indicators from job ads data to assess labor shortages. Dawson et al. (Dawson et al., 2019) analyzed a large temporal dataset of online job ads to detect labor shortages of Data Science and Analytics occupations in Australia. The authors use a range of indicators to evaluate the presence and extent of shortages, such as posting frequency, salary levels, educational requirements, and experience demands. They also contend that error metrics from Machine Learning models that are trained to predict posting frequency could be useful features for predicting labor shortages. Essentially, occupations experiencing high posting growth are difficult to predict. Given that high and growing posting frequency is often used as a proxy for rising Labor Demand for occupations, the authors argue that high error metrics, combined with the other indicators, can help detect labor shortages. We use a range of the features outlined by Dawson et al. as Labor Demand features in this research, seen in Table 1.
The current work. Unlike most of the prior work, which relies on surveys and composite measures, the present work takes a data-driven machine learning approach to measure and predict labor shortages. We leverage a set of recently proposed labor demand features extracted from job ads data (Dawson et al., 2019), together with labor supply features, to build a machine learning model which classifies whether an occupation is in shortage or, more importantly, whether an occupation will transition to a Shortage state (where before it was Not in Shortage).
2.2. Cyclical and Structural Factors Affecting Labor Shortages
Macroeconomic cycles can affect labor shortages. During periods of economic expansion, labor shortages tend to increase as firms seek to hire skilled labor to meet new and growing market demands (Brunello and Wruuck, 2019). The ‘Manpower Talent Shortage Survey’ (ManpowerGroup, 2018) is the largest labor shortage survey in the world. The global survey found that labor shortages have increased from 30% in 2009 to 45% in 2018, equating to a 12 year high. Similarly, the annual Cedefop skills mismatch survey in Europe (Cedefop, 2018) found that labor market shifts in the aftermath of the economic crisis have resulted in the stated inability of employers to fill their vacancies with suitably skilled workers.
Structural changes to labor markets also influence labor shortages. These most notably take the form of demographic changes, technological advances, and globalisation. Demographic changes affect the demand for goods and services. For instance, as the average age of a population increases, so does their demand for healthcare services. This subsequently increases the aggregate labor demand for workers with healthcare related skills (OECD, ). As the average age is increasing for almost all advanced economies (World Health Organization, 2018), these structural demographic changes are likely to affect labor shortages for specific occupational classes, such as healthcare services.
Technological advances introduce structural changes that can exacerbate labor shortages. As firms adopt new technologies, they seek skilled labor to implement and make productive use of these new technologies. This can create dynamics of ‘skill biased technological change’ (Katz and others, 1999; Acemoglu and Autor, 2011), whereby the acceleration of demand for technical skills outweighs the available supply of workers who possess such skills. There is evidence of these dynamics currently occurring as a result of the growing demands for Data Science and Machine Learning skills (Dawson et al., 2019). While the capacity to collect, store, and process information may have sharply risen, it is argued that these advances have far outstripped present capacities to analyse and make productive use of such information (Hey et al., 2009). Claims of Data Science and Advanced Analytics (DSA) skill shortages are being made in labor markets around the world (Blake, 2019; LinkedIn Economic Graph Team, 2018; Manyika et al., 2011). Two studies conducted using job ads data assessed DSA labor demands and the extent of labor shortages. The first was an industry research collaboration between Burning Glass Technologies (BGT), IBM, and the Business-Higher Education Forum in the US (Markow et al., 2017). The research found that in 2017 DSA jobs earned a wage premium of more than US$8,700 and DSA job postings were projected to grow 15% by 2020, which is significantly higher than average. In another study commissioned by The Royal Society UK (Blake, 2019), job ads data were analysed for DSA jobs in the UK. The results again showed high and growing levels of demand for DSA skills (measured through posting frequency) and wage premiums for DSA related occupations.
Globalisation can act as a shock to labor markets that induces or deepens labor shortages. The offshoring of labor tasks can increase the polarisation of labor markets by reducing the domestic demand for middle-skilled jobs (Brunello and Wruuck, 2019). This causes a process of labor reallocation, as workers attempt to transition between jobs. If the reallocation of labor is inefficient, labor shortages can increase as the supply of skilled workers is insufficient to meet the evolving labor demands of growing sectors.
The current work proposes a robust data-driven method that assesses labor shortages and which uses machine learning to account for the factors that affect labor shortages.
2.3. Economic Costs of Labor Shortages
The costs of labor shortages manifest at both the micro and macro-levels of economies. They affect individuals, firms, and aggregate markets.
Individual-level. Labor shortages can negatively affect earnings and reduce development opportunities for workers whose skills are not in shortage. Markets experiencing labor shortages can force individuals to accept less desirable and insecure work. Quintini (2011) analysed household survey data from the European Community Household Panel to investigate the effects of qualification mismatch on earnings. Quintini found that ‘over-qualified’ individuals earn approximately three per cent less than individuals with the same occupations but who have been appropriately matched. The presence of labor shortages exacerbates the inefficient allocation of labor, which can negatively affect the earnings and employment opportunities for individuals.
Firm-level. Several studies have examined the implications of labor shortages on firm-level productivity. Bennett and McGuinness (2009), Forth and Mason (2006), Tang and Wang (2005) and Haskel and Martin (1996) all concluded that labor shortages negatively impact firm-level productivity. In a study using the Australian Business Longitudinal Database, Healy et al. (2015) found that most Australian firms respond to labor shortages through longer working hours and higher wages for occupations in shortage. However, there is evidence to suggest that such labor shortages are usually short-lived. Bellmann and Hübler (2014) analyse the existence of labor shortages in German firms and conclude that while their effects can be acute, they are typically a temporary and short-term phenomenon.
Macroeconomic-level. Lastly, the economic costs of labor shortages accumulate to macroeconomic effects. Frogner (2002) uses data from the Employers Skill Survey to identify the negative impacts of labor shortages on productivity, Gross Domestic Product, employment levels, and wage earnings. From the perspective of private investment, Nickell et al. (1997) calculate that a 10 percent increase in firms reporting labor shortages decreases private investment by 10 percent and Research & Development investment by 4 percent. The inefficient allocation of resources caused by labor shortages therefore hampers productivity, which can compromise macroeconomic growth.
The current work proposes a method to predict labor shortages in advance, using publicly available labor demand and supply data. This could in turn be used by companies to prepare for such situations.
| Category | Name | Meaning and explanation |
|---|---|---|
| Labour Demand | Posting Frequency | number of job advertisement vacancies |
| Labour Demand | Max Median Salary | maximum median salary advertised |
| Labour Demand | Min Median Salary | minimum median salary advertised |
| Labour Demand | Max Average Salary | maximum average salary advertised |
| Labour Demand | Min Average Salary | minimum average salary advertised |
| Labour Demand | Max Average Experience | maximum average years of experience required |
| Labour Demand | Min Average Experience | minimum average years of experience required |
| Labour Demand | Max Average Education | maximum average years of formal education required |
| Labour Demand | Min Average Education | minimum average years of formal education required |
| Labour Demand | Specialised Count | total count of required skills considered specialised to a specific vocation |
| Labour Demand | Baseline Count | total count of skills that are considered applicable across vocations |
| Labour Demand | Software Count | total count of skills that are software-related |
| Labour Supply | Unit Total Employed | total number employed at ANZSCO Unit level (000’s) |
| Labour Supply | Unit Total Hours Worked | total hours worked at ANZSCO Unit level (000’s) |
| Labour Supply | Sub FT Employed | total employed full-time at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Sub PT Employed | total employed part-time at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Sub Total Employed | total employed at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Sub FT Hours Worked | total full-time hours worked at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Sub PT Hours Worked | total part-time hours worked at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Sub Total Hours Worked | total hours worked at ANZSCO Sub-Major level (000’s) |
| Labour Supply | Major FT Employed | total employed full-time at ANZSCO Major level (000’s) |
| Labour Supply | Major PT Employed | total employed part-time at ANZSCO Major level (000’s) |
| Labour Supply | Major Total Employed | total employed at ANZSCO Major level (000’s) |
| Labour Supply | Major FT Hours Worked | total full-time hours worked at ANZSCO Major level (000’s) |
| Labour Supply | Major PT Hours Worked | total part-time hours worked at ANZSCO Major level (000’s) |
| Labour Supply | Major Total Hours Worked | total hours worked at ANZSCO Major level (000’s) |
| Labour Supply | Major Unemployed FT Seekers | total unemployed seekers full-time at ANZSCO Major level (000’s) |
| Labour Supply | Major Unemployed PT Seekers | total unemployed seekers part-time at ANZSCO Major level (000’s) |
| Labour Supply | Major Unemployed Total Seekers | total unemployed seekers at ANZSCO Major level (000’s) |
| Labour Supply | Major Total Weeks Searching | total number of weeks unemployed persons job searching at ANZSCO Major level (000’s) |
| Labour Supply | Major Underemployed Total | total number of persons underemployed at ANZSCO Major level (000’s) |
| Labour Supply | Major Underemployed Ratio | ratio of underemployed persons at ANZSCO Major level |
3. Data and Model Framework
In this section, we first detail the employed data sources and the constructed labor demand and labor supply features (Section 3.1). Next, we perform an exploratory data analysis of the constructed data set (Section 3.2), and finally we detail the prediction model and setup, hyper-parameter tuning and the used performance measures (Section 3.3).
3.1. Data Sources and Constructed Features
In this work, we employ both Labor Demand and Labor Supply data as explanatory variables (features, henceforth) for predicting occupational labor shortages. The data set we construct relates to occupations in Australia during the period 2012-2018. Table 1 summarizes the constructed features, which we further detail in the rest of this section.
Labor demand features. For labor demand, we have used job ads data, which was generously provided by Burning Glass Technologies (BGT), a leading vendor of online job ads data (https://www.burning-glass.com/). The data has been collected via web scraping and systematically processed into structured formats. The dataset consists of detailed information on individual job ads, such as location, salary, employer, educational requirements, experience demands, and more. Each job ad is also categorized into its relevant occupational classification. The full list of labor demand features is shown in Table 1 and in the compiled dataset, which will be made publicly available upon publication. We build upon the results of Dawson et al. (2019) and we incorporate a range of the engineered job ads indicators that the authors found predictive of labor shortages, as discussed in Section 2.1.
While job ads data can provide a useful and near real-time proxy for Labor Demand, it is argued that they are an incomplete representation. Some employers continue to use traditional forms of advertising for vacancies, such as newspaper classifieds, their own hiring platforms, or recruitment agency procurement. Job ads data also over-represent occupations with higher-skill requirements and higher wages, colloquially referred to as ‘white collar’ jobs (Carnevale et al., 2014).
Labor supply features. The labor supply data used for this research was collated from the ‘Quarterly Detailed Labor Force’ statistics by the Australian Bureau of Statistics (Australian Bureau of Statistics, 2019). This consists of statistics on employment levels, unemployment, underemployment, and hours worked. The full list of Labor Supply features can be viewed in Table 1 and the compiled data set. As the Labor Supply statistics are measured quarterly, the yearly average for each feature was calculated to match the labor shortage target variable, which is measured in yearly periods.
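The quarterly-to-yearly alignment described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `yearly_average`, the tuple layout, the occupation code, and the employment figures are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def yearly_average(quarterly_rows):
    """Average quarterly values per (occupation code, year).

    Each row is a tuple (code, year, quarter, value). Years with fewer
    than four quarters are averaged over the quarters that are present.
    """
    buckets = defaultdict(list)
    for code, year, _quarter, value in quarterly_rows:
        buckets[(code, year)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical quarterly 'total employed' figures (000's) for one occupation code.
quarterly = [
    ("2613", 2017, 1, 100.0),
    ("2613", 2017, 2, 104.0),
    ("2613", 2017, 3, 108.0),
    ("2613", 2017, 4, 112.0),
    ("2613", 2018, 1, 116.0),
    ("2613", 2018, 2, 120.0),
]

annual = yearly_average(quarterly)
```

The result holds one value per occupation and year, which can then be joined to the yearly shortage label.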
Labor shortage ground truth. The ground truth that we employ in this work originates from the ‘Historical List of Skill Shortages in Australia’, measured by the Australian Federal Department of Education, Skills and Employment (DoE, henceforth) (Department of Education, Skills and Employment, Australian Government, 2019). For over three decades, the DoE has conducted ongoing labor shortage research in Australia. The main aim of this research has been to identify shortages for skilled occupations where long lead times for training mean that such shortages cannot be addressed immediately. The DoE tracks 136 occupations nationally, and also provides more detailed analyses on select occupations at the State and Territory levels. To assess labor shortages, the DoE surveys employers every year through the ‘Survey of Employers who have Recently Advertised’ (SERA). The SERA collects both qualitative data from employers and recruitment professionals, and quantifiable data on employers’ recruitment experiences (Australian Government, 2018). The output of this DoE activity is that, for every year, each of the 136 tracked occupations is classified as In Shortage or Not In Shortage at the national-level. The results of these classifications have direct implications for education, training, employment and migration policies.
There are, however, several important limitations of the DoE’s methodology for measuring labor shortages. First, the DoE acknowledges that the survey is not a statistically valid sample of Australia’s labor market. Second, there are inherent limitations of determining labor shortages from surveying employers, as discussed above. Nonetheless, the Australian Bureau of Statistics evaluated the methodology and found that it was “appropriate for its purpose” (Australian Government, 2018). Third, the surveyed occupations in this research are biased towards ‘Technicians and Trades’ workers. Fourth, the dataset is imbalanced with a greater number of occupations classified as Not in Shortage. Fifth and finally, there are limitations that emerge from employing a standardized occupation taxonomy, detailed in the rest of this section.
Using a standardized occupation taxonomy – ANZSCO. All data sources mentioned above correspond to their respective occupational classes according to the Australian and New Zealand Standard Classification of Occupations (ANZSCO) (Australian Bureau of Statistics, 2013). ANZSCO provides a basis for the standardized collection, analysis and dissemination of occupational data for Australia and New Zealand. The structure of ANZSCO has five hierarchical levels: major group, sub-major group, minor group, unit group and occupation. The categories at the most detailed level of the classification are termed ‘occupations’. Depending on data availability, labor statistics were included in the models from the occupation level through to the major group level.
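As an illustration of how statistics at different levels of the hierarchy can be attached to a single occupation, the sketch below derives the parent group codes from an occupation code. It assumes the standard ANZSCO coding convention in which codes nest by prefix (1 digit for major group, 2 for sub-major, 3 for minor, 4 for unit group, and 6 for occupation); the function name `anzsco_levels` and the example code are hypothetical and not taken from the paper.

```python
def anzsco_levels(code):
    """Split a 6-digit ANZSCO occupation code into its hierarchical prefixes."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a 6-digit ANZSCO occupation code")
    return {
        "major": code[:1],
        "sub_major": code[:2],
        "minor": code[:3],
        "unit": code[:4],
        "occupation": code,
    }


# Example occupation code, used here only to show the prefix structure.
levels = anzsco_levels("261312")
```

With these keys, unit-level, sub-major-level, and major-level features such as those in Table 1 can be looked up for each occupation and placed in the same row.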
There are some significant shortcomings to analyzing occupations within ANZSCO classifications. Official occupational classifications, like ANZSCO, are often static taxonomies and are rarely updated. They therefore fail to capture and adapt to emerging skills, which can misrepresent the true labor dynamics of particular jobs. For example, a ‘Data Scientist’ is a relatively new occupation that has not yet received its own ANZSCO classification. Instead, it is classified as an ‘ICT Business & Systems Analyst’ by ANZSCO, grouped with other job titles like ‘Data Analysts’, ‘Data Engineers’, and ‘IT Business Analysts’. However, as ANZSCO is the official and prevailing occupational classification system, all data used for this research are in accordance with the ANZSCO standards.
3.2. Exploratory Data Analysis and Dataset Profiling
In this section, we perform an exploratory data analysis and profiling of the dataset. The purpose is to understand the biases and imbalances introduced during the dataset’s construction.
The constructed dataset. The compiled dataset describes 132 unique occupations during the period 2012-2018. Each row consists of a tuple (occupation, year), and it describes the given occupation during that particular year using its ANZSCO identifiers, the values for each of the descriptive features described in Table 1, and the auto-regressive lagged features (described in Section 3.3). The target variable (i.e. the ground-truth label to be predicted) is its shortage status during that year: In Shortage or Not in Shortage. In constructing this dataset, we analyzed the auto-correlations within the constructed features, which are presented in the online supplement (Appendix, 2020). We next profile the contributed data set, and we uncover a series of specifics that should be considered during the modeling process.
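The row layout with auto-regressive lagged features can be sketched as below. This is a hedged illustration only: the number of lags, the choice of lagged variable, the column naming scheme, and the helper `add_lagged_features` are assumptions, and the posting counts are made up.

```python
def add_lagged_features(rows, feature, lags=(1,)):
    """Return copies of rows with `<feature>_lag<k>` columns added per occupation.

    A lag is None when the occupation has no row for the earlier year.
    """
    by_key = {(r["occupation"], r["year"]): r for r in rows}
    out = []
    for r in rows:
        new = dict(r)
        for k in lags:
            prev = by_key.get((r["occupation"], r["year"] - k))
            new[f"{feature}_lag{k}"] = prev[feature] if prev is not None else None
        out.append(new)
    return out


# Hypothetical (occupation, year) rows with a single demand feature.
rows = [
    {"occupation": "2613", "year": 2016, "posting_frequency": 120},
    {"occupation": "2613", "year": 2017, "posting_frequency": 150},
    {"occupation": "2613", "year": 2018, "posting_frequency": 210},
]

lagged = add_lagged_features(rows, "posting_frequency", lags=(1, 2))
```

Each output row then carries the current value alongside the values from earlier years for the same occupation, so a model can use past observations as predictors.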
Prevalence of Technicians and Professionals. Fig. 1(a) shows that the occupational classes measured by the DoE disproportionately represent ‘Technicians and Trades’ and ‘Professionals’. Collectively, these two major occupational groups account for 94% of occupations included in the dataset. This is significantly higher than the share of workers actually employed in these occupational classes. For instance, the Australian Bureau of Statistics (ABS) indicates that ‘Professionals’ represent approximately 24% of employment in Australia (Australian Bureau of Statistics, 2019).
Most occupations are Not in Shortage. The labor shortage ground-truth data are biased towards occupations ‘Not in Shortage’. Fig. 1(b) shows that there are over three times as many occupations classified as Not in Shortage as In Shortage. This has important modeling implications and requires hyper-parameter tuning to sufficiently adjust for these imbalances, as discussed below.
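One common way to quantify such an imbalance is the ratio of negative to positive labels. The sketch below computes it on made-up labels; the function name `imbalance_ratio` is hypothetical. XGBoost exposes a `scale_pos_weight` parameter for which this ratio is a typical starting value, although which hyper-parameters were actually tuned in this work is not stated in this section.

```python
from collections import Counter


def imbalance_ratio(labels, positive="In Shortage"):
    """Ratio of negative to positive examples in a list of class labels."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Hypothetical labels mirroring a roughly 3:1 imbalance.
labels = ["Not in Shortage"] * 6 + ["In Shortage"] * 2
ratio = imbalance_ratio(labels)
```

A ratio well above 1 signals that errors on the minority class need to be weighted more heavily, or that accuracy alone will be a misleading performance measure.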
Some classes are more often In Shortage than others. Fig. 1(c) shows that ‘Technicians and Trades’ workers have been classified as In Shortage more often than all other occupational classes, including ‘Professionals’. This finding, coupled with the distribution of occupational classes seen in Fig. 1(a), shows that the ground-truth exhibits biases toward the ‘Technicians and Trades’ occupational class. Therefore, we conclude that it is worthwhile testing the occupational classes of ‘Technicians and Trades’ and ‘Professionals’ in separate models, in addition to constructing models for all occupational classes in the dataset.
Temporal changes in labor shortage status. Changes to labor shortages of occupations are a key factor that determines education, skilled immigration, and labor market policies. The ability to predict such yearly classification changes is therefore critical to models attempting to predict labor shortages. In Fig. 1(d), we show the incidences of yearly shortage changes compared to the total yearly count of occupational shortages (dashed gray line). The blue line shows occupations that were classified as In Shortage one year that changed to Not in Shortage the next. The orange line shows the opposite. Yearly shortage changes have become less commonplace since 2013, which suggests that some occupations are experiencing periods of entrenched shortage. This trend also suggests that the ground-truth could contain auto-regressive properties, which is an important modeling consideration, particularly for predicting shortage changes. This is explored further below.
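Counting these yearly status changes amounts to comparing each occupation's label with its label in the previous year. The following is a minimal sketch with made-up labels; the function name `shortage_transitions`, the dictionary layout, and the occupation identifiers are hypothetical.

```python
def shortage_transitions(status):
    """Count yearly entries into and exits from shortage.

    `status` maps (occupation, year) to True (In Shortage) or False
    (Not in Shortage). Years without a previous-year label are skipped.
    """
    counts = {}
    for (occupation, year), now in status.items():
        prev = status.get((occupation, year - 1))
        if prev is None:
            continue
        entry = counts.setdefault(year, {"entered": 0, "exited": 0})
        if now and not prev:
            entry["entered"] += 1
        elif prev and not now:
            entry["exited"] += 1
    return counts


# Hypothetical labels for two occupations over three years.
status = {
    ("A", 2012): False, ("A", 2013): True, ("A", 2014): True,
    ("B", 2012): True, ("B", 2013): False, ("B", 2014): False,
}

transitions = shortage_transitions(status)
```

Years in which both counts are low relative to the number of occupations in shortage indicate persistent labels, which is the auto-regressive behaviour noted above.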
Highest and lowest levels of growth. Lastly, changes to employment levels of the occupations represented in the dataset provide interesting insights. Fig. 2 shows the occupations that have experienced the highest levels of growth from 2012 to 2018. These high-growth occupations are clearly oriented towards higher skilled and service-based jobs. For instance, ‘Corporate Service Managers’, ‘Economists’, ‘Engineers’, ‘Medical Professionals’, ‘Barristers’, and others typically require higher levels of education and training to meet non-routine and socio-cognitive skill demands. In contrast, Fig. 3 shows the occupations that have experienced the lowest levels of growth from 2012 to 2018. The types of occupations differ significantly. These declining occupations are much more oriented towards routine and/or physical labor that tends to require lower levels of education and training. For example, ‘Upholsterers’, ‘Switchboard Operators’, ‘Survey Interviewers’ and ‘Secretaries’ are all occupational classes whose labor tasks contain high levels of routine labor (National Center for O*NET Development, ). In general, these changes in occupational employment levels are consistent with other research stating that occupations with lower levels of education and training, and that are more routine-based, face greater risks from technological automation and globalization (Acemoglu and Autor, 2011; Frey and Osborne, 2017; Frank et al., 2019).
3.3. Setting up the Predictive Model
Training a regressive model. In this work, we predict labor shortages by employing XGBoost (Chen and Guestrin, 2016), an off-the-shelf regression algorithm. XGBoost, which stands for eXtreme Gradient Boosting, is an implementation of gradient boosted tree algorithms. It has achieved state-of-the-art results on many standard classification benchmarks and is a well-established Machine Learning framework (Orzechowski et al., 2018). In a nutshell, gradient boosted trees are a machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models (here, decision trees) by optimizing a differentiable loss function (Chen, 2014). We chose XGBoost because it is currently the state of the art in regression tasks for medium-sized amounts of data (i.e. where neural networks cannot be fully deployed). It also offers several advantages that we leverage in our task: it automatically handles missing data values, and it supports parallelization of tree construction.
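To make the idea of an ensemble of weak learners concrete, the following is a minimal, standard-library sketch of gradient boosting with one-split decision trees (‘stumps’) under squared-error loss. It is a toy illustration of the general technique only, not XGBoost itself and not the models trained in this work; the function names, the toy data, and the values of `n_rounds` and `learning_rate` are assumptions chosen for the example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    overall = mean(residuals)
    # Fallback for a constant feature: no split is possible, predict the mean.
    best = (float("inf"), float("inf"), overall, overall)
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit an additive ensemble of stumps on a single numeric feature."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken contributions of every stump for one input value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        lv if x <= t else rv for t, lv, rv in stumps
    )


# Toy data: the label switches from 0 to 1 once the feature exceeds 3.
model = fit_boosted_stumps([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

Each round fits a weak learner to the current residuals and adds a shrunken copy of it to the ensemble, so the combined prediction improves gradually even though every individual stump is crude.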
Accounting for the temporal inertia of shortage classifications. Labor shortages are constantly evolving and labor markets take time to adjust. Therefore, any models that are constructed to predict labor shortages must account for these temporal characteristics. This can be observed in Section 4, where the labor shortage ground-truth exhibits strong auto-regressive properties.
XGBoost, however, was not specifically built for time series prediction tasks. A fundamental assumption of XGBoost is the independence of observations. Nevertheless, XGBoost has been applied to several time series prediction tasks and has achieved impressive results (Zhou et al., 2019; Ji et al., 2019; Pavlyshenko, 2016). We also use XGBoost to make predictions on temporal data in this research. To account for the temporal nature of labor shortages, we engineer and implement ‘auto-regressive lagged features’. This means that for each feature included in the models, we also include its offset values over a specified number of past periods. As discussed in Section 4, we implement models with two auto-regressive lag periods for this research. The inclusion of such auto-regressive lagged features provides each observation with temporal characteristics.
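The construction of lagged features can be sketched as follows. This is an illustrative example only: the row layout, the column names (`occupation`, `year`, `job_ads`, `shortage`) and the values are hypothetical, and dropping observations without a full two-period history is an assumption made here for simplicity (XGBoost can also handle missing values directly).

```python
def add_lagged_features(rows, feature_names, n_lags=2):
    """Append '<feature>_lag<k>' values taken from the same occupation k years earlier.

    rows: list of dicts, each with 'occupation', 'year' and the named features.
    Rows that lack a complete n_lags history are dropped (an assumption).
    """
    by_key = {(r["occupation"], r["year"]): r for r in rows}
    lagged_rows = []
    for row in rows:
        new_row = dict(row)
        complete = True
        for k in range(1, n_lags + 1):
            past = by_key.get((row["occupation"], row["year"] - k))
            if past is None:
                complete = False
                break
            for name in feature_names:
                new_row[f"{name}_lag{k}"] = past[name]
        if complete:
            lagged_rows.append(new_row)
    return lagged_rows


# Hypothetical values for a single occupation over three years.
rows = [
    {"occupation": "Architects", "year": 2014, "job_ads": 10, "shortage": 0},
    {"occupation": "Architects", "year": 2015, "job_ads": 12, "shortage": 0},
    {"occupation": "Architects", "year": 2016, "job_ads": 15, "shortage": 1},
]
lagged = add_lagged_features(rows, ["job_ads", "shortage"], n_lags=2)
```

Only the 2016 observation has two earlier periods available, so it is the only row returned; it carries its own values plus those of 2015 (lag 1) and 2014 (lag 2).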
We also include all available information for a given period to make predictions on the ground-truth. This reflects the nature of the DoE labor shortages data. These data are lagging indicators and are usually published by the DoE over a year after the latest reported year. Therefore, it is realistic to assume that all yearly data would be available for the latest available target period.
Training model hyper-parameters. Like most machine learning algorithms, XGBoost has a set of hyper-parameters, that is, parameters related to the internal design of the algorithm that cannot be fit from the training data. The hyper-parameters are usually tuned through search and cross-validation. In this work, we employ a Randomized-Search (Bergstra and Bengio, 2012), which randomly selects a (small) number of hyper-parameter configurations and evaluates them on the training set via cross-validation. We tune the hyper-parameters for each model, at each learning fold, using 2500 random combinations, each evaluated using 5-fold cross-validation. We also implemented ‘oversampling’ to accommodate the imbalanced ground-truth that is biased towards occupations Not in Shortage, as seen in Fig. (b) and Fig. (c). This technique involves randomly duplicating observations from the minority class (In Shortage) and adding them to the training dataset. The main benefit of oversampling is that it creates a balanced distribution of target variables without the loss of information that occurs from ‘under-sampling’ (that is, randomly removing observations from the majority class). Creating a balanced distribution of predictive classes is particularly important for a range of classification algorithms (Branco et al., 2015). However, a shortcoming of oversampling is that it can increase the likelihood of overfitting, as exact copies of the minority class are constructed (Fernández et al., 2018). The oversampling ratio is defined as:
The output of this ratio was specified as a hyper-parameter value in each model type that we constructed.
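As an illustration of random oversampling, the sketch below duplicates minority-class observations until both classes are equally represented. The formula for the oversampling ratio is not reproduced in the text above, so the ratio computed here (majority-class count divided by minority-class count) is an assumption, as are the function names and the toy labels.

```python
import random


def oversampling_ratio(labels, minority=1):
    """Assumed definition: majority-class count divided by minority-class count."""
    n_minority = sum(1 for y in labels if y == minority)
    n_majority = len(labels) - n_minority
    return n_majority / n_minority


def random_oversample(rows, labels, minority=1, seed=0):
    """Randomly duplicate minority-class observations until the classes balance."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(n_majority - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training labels: six Not in Shortage (0) and two In Shortage (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
rows = [{"id": i} for i in range(len(labels))]
ratio = oversampling_ratio(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training data only, so that duplicated observations never appear in the test set.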
Here, we measure the performance of our predictions using three standard Machine Learning performance measures: precision, recall, and F1. Precision measures how many of the predictions were correct. Recall measures the completeness of the prediction, that is, how many of the true answers were correctly uncovered. F1 is the harmonic mean of precision and recall; a classifier needs to achieve both a high precision and a high recall in order to obtain a high F1. Formally, these are defined as:
\[ \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]
where $TP$ is the number of true positives (correctly identified items of the class of interest); $FP$ is the number of false positives (items incorrectly predicted as pertaining to the class of interest); and $FN$ is the number of false negatives (items incorrectly predicted as not being of interest). Note that one can compute the precision, recall and F1 for each class of interest (here, both In Shortage and Not in Shortage), and the scores for each class could be wildly different, as one class might be more predictable than the other. In our results in Section 4, we report the macro-precision, macro-recall and macro-F1, which are the means of the indicators over the two classes. In this way, we ensure that the minority class (here, the In Shortage class) is not under-represented in the results.
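The macro-averaged measures can be computed as in the following sketch, which scores each class separately and then takes the unweighted mean over the two classes. The function names and the toy labels are illustrative; treating an undefined ratio (zero denominator) as 0 is a common convention adopted here as an assumption.

```python
def class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class precision, recall and F1."""
    per_class = [class_scores(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[i] for s in per_class) / len(classes) for i in range(3))


# Toy example: 1 = In Shortage, 0 = Not in Shortage.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
```

In this toy example the In Shortage class scores 0.5 on all three measures and the Not in Shortage class scores 0.75, so each macro average is 0.625. Because both classes carry equal weight, poor performance on the smaller class cannot be masked by the larger one.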
Train-test split. Consistent with established Machine Learning practices, we separated the dataset into ‘training’ and ‘testing’ sets. This split was implemented temporally, with observations from 2012-2016 included in the training dataset, and observations from 2017-2018 included in the testing dataset. When all occupations were modeled, the training dataset consisted of 660 observations (71% of total observations) and the testing dataset consisted of 264 observations (29% of total observations). However, these totals decreased when constructing occupation-specific prediction models. The models that exclusively consisted of occupations in the ‘Technicians and Trades’ class had 340 observations in the training set and 136 in the testing set, whereas the models that exclusively consisted of occupations in the ‘Professionals’ class had 280 observations in the training set and 112 in the testing set. Segmenting the dataset into temporal training and testing sets ensures objectivity in the evaluation process and reflects the temporal nature of the ground-truth.
As the first step, we trained and evaluated models to determine the appropriate number of auto-regressive lagged periods to include. These models consisted of all available features seen in Table 1. After experimenting with different lag periods, we find that 2 lag periods provide the best trade-off between overall prediction performance and the amount of information included in the classifiers. Further details are provided in the online supplement (Appendix, 2020).
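A temporal split of this kind can be sketched as below. The row layout and occupation identifiers are hypothetical; the figure of 132 occupations is not stated in the text but is the number implied by the reported counts if every occupation appears once in every year (an assumption).

```python
def temporal_split(rows, last_train_year=2016):
    """Split observations by year so that the test set lies strictly in the future."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


# Hypothetical panel: 132 occupation ids observed yearly from 2012 to 2018.
rows = [
    {"occupation": occupation_id, "year": year}
    for occupation_id in range(132)
    for year in range(2012, 2019)
]
train, test = temporal_split(rows)
train_share = len(train) / len(rows)
```

Splitting by year rather than at random prevents information from 2017-2018 leaking into a model that is meant to predict those years.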
We then constructed three prediction model classes, all implementing the approach described in Section 3.3. These consist of: (1) All Occupations; (2) Technicians and Trades occupations; and (3) Professionals. We created separate models for these two occupational classes in response to the biases observed in the dataset, as seen in Fig. 1. For each of these modeling categories, we then trained and evaluated prediction models using the following feature input configurations:
All-In: All features included;
LD: Labor Demand features included only;
LS: Labor Supply features included only;
LD + LS: Labor Demand and Labor Supply features included;
Auto-regressive Predictor: Lagged target features included only;
Naive Predictor: Copy target variable from the previous time period.
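The feature configurations above can be expressed as selections over groups of columns, with the Naive Predictor requiring no model at all. In the sketch below, every column name is hypothetical, and the assumption that the All-In configuration includes the lagged targets is inferred from the description rather than stated explicitly.

```python
# Hypothetical column names grouped by feature class.
FEATURE_GROUPS = {
    "LD": ["job_ads_count", "median_advertised_salary"],
    "LS": ["employed_total", "hours_worked"],
    "AR": ["shortage_lag1", "shortage_lag2"],
}

# Feature classes used by each trained configuration.
CONFIGURATIONS = {
    "All-In": ["LD", "LS", "AR"],
    "LD": ["LD"],
    "LS": ["LS"],
    "LD + LS": ["LD", "LS"],
    "Auto-regressive Predictor": ["AR"],
}


def select_columns(configuration):
    """Return the input columns used by a named feature configuration."""
    columns = []
    for group in CONFIGURATIONS[configuration]:
        columns.extend(FEATURE_GROUPS[group])
    return columns


def naive_predict(labels_by_key, occupation, year):
    """Naive Predictor: copy the occupation's label from the previous year."""
    return labels_by_key[(occupation, year - 1)]


# Hypothetical ground-truth labels (1 = In Shortage, 0 = Not in Shortage).
labels_by_key = {("Architects", 2016): 0, ("Architects", 2017): 1}
naive_2017 = naive_predict(labels_by_key, "Architects", 2017)
```

The Naive Predictor serves as a baseline: any trained model should be judged against how well simply repeating last year's classification performs.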
We also evaluated the model performances to predict yearly changes to occupations’ labor shortage status. We did this for each of the three model classes above. To achieve this, we filtered occupations in the testing dataset to include only those that had a different labor shortage classification to the previous year. For example, as ‘Architects’ were classified as In Shortage in 2017 but were Not in Shortage in 2016, they were therefore included in the performance evaluation. However, after filtering for occupational shortage changes, we did not retrain these models to make new predictions. Instead, we evaluated performance based on the prediction outputs for all included occupations in each of the models. The rationale for this decision was to assess how well each of the models performed at predicting changes in the context of it being rare that an occupation’s labor shortage status changes, as seen in Fig. (d)d.
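The change-only evaluation can be sketched as a filter applied to existing predictions, with no retraining. The labels below are made up for illustration (only the ‘Architects’ change mirrors the example in the text), the field names are hypothetical, and accuracy is used instead of the macro measures purely to keep the example short.

```python
def evaluate_on_changes(test_rows, predictions):
    """Score existing predictions only on rows whose label changed since last year.

    test_rows[i] holds 'shortage' (current label) and 'shortage_prev' (last year's).
    Returns (number of changed rows, accuracy on them or None if there are none).
    """
    changed = [
        i for i, r in enumerate(test_rows) if r["shortage"] != r["shortage_prev"]
    ]
    if not changed:
        return 0, None
    correct = sum(
        1 for i in changed if predictions[i] == test_rows[i]["shortage"]
    )
    return len(changed), correct / len(changed)


# Illustrative 2017 test rows; only 'Architects' changed status.
test_rows = [
    {"occupation": "Architects", "shortage": 1, "shortage_prev": 0},
    {"occupation": "Economists", "shortage": 0, "shortage_prev": 0},
    {"occupation": "Secretaries", "shortage": 0, "shortage_prev": 0},
]
naive_predictions = [r["shortage_prev"] for r in test_rows]
n_changed, naive_accuracy = evaluate_on_changes(test_rows, naive_predictions)
```

By construction, a predictor that copies the previous year's label is wrong on every changed observation, which is why this evaluation is a demanding test for models that rely on auto-regressive features.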
The Naive Predictor and the Auto-regressive Predictor achieve the highest performance scores when all occupations in the dataset are modeled. This shows that the ground-truth exhibits strong auto-regressive properties. However, Labor Demand alone achieves an F1 macro average score within 10 per cent of the highest performing, auto-regressive models.
With regard to predicting changes, the models containing auto-regressive features experience a significant deterioration in performance. The All-In and Auto-regressive Predictor models achieve F1 macro average scores of 3 per cent and 8 per cent, respectively. While the Labor Demand model also experiences a decline in performance, it is not nearly as significant as for the other models, and the model is better able to predict labor shortage changes according to all performance metrics.
4.2. Technicians & Trades
Exclusively using ‘Technicians and Trades’ occupations achieved the highest levels of modeling performance in this research. The All-In model achieved an F1 macro average of 86 per cent, which was the highest score recorded in this research. Similarly, the Auto-regressive Predictor and Naive Predictor both achieved 85 per cent, which further demonstrates the auto-regressive properties of the ground-truth. Interestingly, the Labor Demand model here performed the worst according to the F1 macro average. We believe that this is due to the inherent biases of job ads data towards ‘Professionals’ and ‘Managers’, as discussed in the previous section.
Despite these limitations with job ads data, the Labor Demand model is again the highest performing for predicting changes in labor shortages. While the differences in performance scores between Labor Demand and the other models are less significant than for the equivalent All-In models (Fig. (b)), it nonetheless achieves higher F1 macro average scores. This shows that labor shortages for ‘Technicians and Trades’ are more predictable than other occupations, both in general and for predicting labor shortage changes.
The prediction models built exclusively with ‘Professionals’ resulted in a deterioration of performance across all model types when compared to the ‘Technicians and Trades’ and ‘Overall’ models. However, the general performance trends remained largely consistent. That is, strong auto-regressive properties are present when modeling all relevant occupations. Similar to the ‘Overall’ models, the ‘Professionals’ Labor Demand model performed only marginally worse than the high performing, auto-regressive models. A key difference, however, is the high macro average Precision scores. This can be explained by the relatively low number of ‘Professionals’ that are classified as In Shortage in the 2017-2018 testing dataset, as seen in Fig. (c).
When predicting labor shortage changes for ‘Professionals’, the Labor Demand model again exceeds the performance of all other model types across all metrics. While the predictability scores are not as strong as ‘Technicians and Trades’, these results nonetheless reinforce the relative strengths of Labor Demand features for predicting labor shortage changes on this dataset.
In this section, we interpret the results outlined above in reference to the research questions.
As shown in the Results section, we leveraged established Data Science and Machine Learning techniques to predict labor shortage classifications for occupations. We did so achieving F1 macro average scores up to 86 per cent. However, performance outcomes were variable across the different models we constructed. The highest performing models that used all relevant occupational data as input were the models that included lagged target features. This shows that the labor shortage ground-truth data exhibits strong auto-regressive properties.
There were also some notable differences between the model results for ‘Technicians and Trades’ and those for ‘Professionals’. Specifically, the input data were more predictive of labor shortages for occupations in the ‘Technicians and Trades’ class. This could be partly due to a greater number of represented ‘Technicians and Trades’ occupations compared to ‘Professionals’. More significant, however, is that there are many more ‘Technicians and Trades’ occupations In Shortage than ‘Professionals’ in the 2017-2018 testing dataset, as seen in Fig. (c). Consequently, the ‘Technicians and Trades’ models are able to learn from a greater diversity of occupations classified as In Shortage. This allows the classifier models to uncover more nuanced patterns from the features to predict labor shortages.
Leveraging Labor Supply and Labor Demand features while accounting for auto-regressive inertia and noisy data. In building these prediction models, we incorporated 32 occupational indicators, plus their offset lagged values, from both Labor Demand and Labor Supply data sources. The advantage of a Machine Learning approach, and specifically the XGBoost algorithm, is that the models autonomously learn the relative importance of input features for the prediction task. It can therefore account for noisy data, controlling for problems such as strong correlations among features.
To test the predictive capabilities of the different feature classes, and to control for the strong auto-regressive properties of the labor shortage ground-truth data, we isolated features that did not contain the auto-regressive lagged targets. After retraining the models on the Labor Demand and/or Labor Supply features with all relevant occupations, we found that in many instances the performance deterioration was clear but not drastic. This highlights the predictive capabilities of the Labor Demand and Labor Supply feature sets, although they are not as strong as the auto-regressive features.
Predicting labor shortage changes. Performance scores deteriorated when evaluating model predictions based on yearly labor shortage changes for occupations. These difficulties reflect the complexities inherent to labor market dynamics and the aforementioned shortcomings of the ground-truth. Nonetheless, an important finding was uncovered through evaluating such predictions. Namely, that Labor Demand data achieve the highest performance scores for predicting yearly changes in labor shortage status for occupations. This finding was consistent across all three model types.
These findings are significant because they highlight the value of job ads data (the proxy used here for Labor Demand) for predicting labor market conditions, such as occupational changes to labor shortages. Additionally, job ads data are a near real-time data source of Labor Demand indicators. Therefore, applying job ads data in the ways that we have demonstrated could assist policy-makers to better preempt labor shortage changes of occupations using near real-time data. This could help with critical tasks such as forward planning for labor market policies pertaining to education, skilled immigration, and workforce transitions.
6. Conclusions and Future Research
In this research, we constructed a Machine Learning framework to predict labor shortages for occupations. We did so by compiling a unique dataset that incorporates both Labor Demand and Labor Supply data for occupations in Australia from 2012-2018. By using established Data Science and Machine Learning techniques, we were able to achieve macro-F1 average scores of up to 86 per cent.
Predicting labor shortage changes proved more difficult; however, we found that job ads data were the most predictive features for preempting yearly labor shortage changes for occupations. This was the case across all constructed model types and all measured performance metrics. These findings are significant because they highlight the predictive value of job ads data when used as proxies for Labor Demand and incorporated into labor market prediction models.
Ultimately, changes to an occupation’s labor shortage status have a wide impact on policies and on people’s lives. Education, skilled immigration, and broader labor policies adapt to the evolving skill demands of labor markets. Therefore, a greater capacity to preempt such changes (or at least to predict them in a timely manner) can assist policy-makers and business operators. This research provides a robust framework for predicting labor shortages, and their changes, which can assist policy-makers and businesses responsible for preparing labor markets for the future of work.
Limitations and future work.
This research employs only Australian data. Future work will apply this framework for predicting labor shortages in other labor markets. Additionally, different features could be constructed as descriptive variables, and more auto-regressive lag periods could be considered. Another research avenue to assess is how these results could be improved by applying other predictive tools, such as Deep Learning approaches.
- Skills, tasks and technologies: implications for employment and earnings. In Handbook of Labor Economics, D. Card and O. Ashenfelter (Eds.), Vol. 4, pp. 1043–1171. Cited by: §2.2, §3.2.
- Appendix: . Note: https://www.dropbox.com/s/ycohp7420qqa7r0/EC2020-online-supplement.pdf?dl=0 Cited by: §3.2, §4.
- 1220.0 - ANZSCO – australian and new zealand standard classification of occupations, 2013, version 1.2. Australian Bureau of Statistics (en). Note: https://www.abs.gov.au/AUSSTATS/abs@.nsf/Lookup/1220.0Main+Features12013,%20Version%201.2?OpenDocument Accessed: 2019-8-1 Cited by: §3.1.
- 6291.0.55.003 - labour force, australia, detailed, quarterly, may 2019. Australian Bureau of Statistics (en). Cited by: §1, §3.1, §3.2.
- Skill shortage research methodology. Note: https://docs.employment.gov.au/documents/skill-shortage-research-methodology-0 Accessed: 2020-2-9 Cited by: §3.1, §3.1.
- The skill shortage in german establishments before, during and after the great recession. Jahrbücher für Nationalökonomie und Statistik 234 (6), pp. 800–828. Cited by: §2.3.
- Assessing the impact of skill shortages on the productivity performance of high-tech firms in northern ireland. Applied Economics 41 (6), pp. 727–737. Cited by: §2.3.
- Random search for hyper-parameter optimization. Journal of machine learning research 13 (Feb), pp. 281–305. Cited by: §3.3.
- Dynamics of data science skills. Technical report The Royal Society. Cited by: §2.2.
- Skill discrepancies between research, education, and jobs reveal the critical need to supply soft skills for the data economy. Proc. Natl. Acad. Sci. U. S. A. 115 (50), pp. 12630–12637 (en). Cited by: §1.
- A survey of predictive modelling under imbalanced distributions. Cited by: §3.3.
- Skill shortages and skill mismatch in europe: a review of the literature. Cited by: §1, §1, §2.1, §2.1, §2.2, §2.2.
- Understanding online job ads data. Technical report Georgetown University. Cited by: §3.1.
- Shortages and gaps in european enterprises: striking a balance between vocational education and training and the labour market. Cedefop reference series, Luxembourg 102. Cited by: §2.1, §2.1.
- Insights into skill shortages and skill mismatch: learning from cedefop’s european skills and jobs survey. Publications Office Luxembourg. Cited by: §2.1, §2.2.
- ’9,500 jobs to save one’: how telstra is slashing its workforce to meet NBN challenge. ABC News (en). Cited by: §1.
- XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, Vol. 13-17-Augu, New York, New York, USA, pp. 785–794. Cited by: §1, §3.3.
- Introduction to boosted trees. University of Washington Computer Science 22, pp. 115. Cited by: §3.3.
- Adaptively selecting occupations to detect skill shortages from online job ads. In Proceedings for the 2019 IEEE International Conference on Big Data. Cited by: §2.1, §2.1, §2.2, §3.1.
- Historical list of skill shortages in australia. Cited by: §1, §3.1.
- Flash eurobarometer 304: employers’ perception of graduate employability. Cited by: §2.1.
- Learning from imbalanced data sets. 1st ed. 2018 edition, Springer (en). Cited by: §3.3.
- We need to build skills, not walls: telstra CEO. Note: https://www.ceda.com.au/News-and-analysis/CEDA-Events/We-need-to-build-skills-not-walls-Telstra-CEO Accessed: 2020-2-6 Cited by: §1.
- Do ict skill shortages hamper firms’ performance? evidence from uk benchmarking surveys. National Institute of Economic and Social Research, Discussion Paper 281. Cited by: §2.3.
- Toward understanding the impact of artificial intelligence on labor. Proc. Natl. Acad. Sci. U. S. A., pp. 201900949 (en). Cited by: §3.2.
- The future of employment: how susceptible are jobs to computerisation?. Technol. Forecast. Soc. Change 114, pp. 254–280. Cited by: §3.2.
- Skills shortages an examination of the supply and demand for skills, and the links between skills shortages and the labour market and earnings. Labour Market Trends 110 (1), pp. 17–28. Cited by: §2.3.
- Skill requirements in big data: a content analysis of job advertisements. Journal of Computer Information Systems 58 (4), pp. 374–384. Cited by: §1.
- Do skill shortages reduce productivity? theory and evidence from the united kingdom. The Economic Journal 103 (417), pp. 386–394. Cited by: §1.
- Skill shortages, productivity growth and wage inflation. Acquiring Skills: Market Failures: Their Symptoms and Policy Responses, pp. 147–174. Cited by: §2.3.
- Adjusting to skill shortages in australian smes. Applied Economics 47 (24), pp. 2470–2487. Cited by: §1, §2.3.
- The fourth paradigm: Data-Intensive scientific discovery. 1st edition, Microsoft Research (en). Cited by: §2.2.
- XG-SF: an XGBoost classifier based on shapelet features for time series classification. Procedia Comput. Sci. 147, pp. 24–28. Cited by: §3.3.
- Was there a skills shortage in australia?. Cited by: §1.
- Changes in the wage structure and earnings inequality. In Handbook of labor economics, Vol. 3, pp. 1463–1555. Cited by: §2.2.
- LinkedIn workforce report — united states. Technical report LinkedIn. Cited by: §2.2.
- Solving the talent shortage: 2018 talent shortage survey. Technical report ManpowerGroup. Cited by: §2.2.
- Big data: the next frontier for innovation, competition, and productivity. Technical report McKinsey Global Institute. Cited by: §2.2.
- The quant crunch: how the demand for data science skills is disrupting the job market. Technical report Burning Glass Technologies. Cited by: §2.2.
- How useful is the concept of skills mismatch?. Cited by: §2.1.
- O*NET OnLine. Cited by: §2.1, §3.2.
- Microeconomics: an intuitive approach with calculus. Nelson Education. Cited by: §1.
- Human capital, investment and innovation: what are the connections?. Technical report Cited by: §2.3.
- Long-term care workforce: caring for the ageing population with dignity. Note: https://www.oecd.org/health/health-systems/long-term-care-workforce.htm Accessed: 2020-2-8 Cited by: §2.2.
- World indicators of skills for employment (WISE) database. Cited by: §2.1.
- Getting skills right: skills for jobs indicators. Cited by: §2.1.
- OECD skills strategy 2019 - skills to shape a better future. Technical report OECD. Cited by: §1.
- Getting skills right: future-ready adult learning systems. Cited by: §1.
- Where are we now? a large benchmark study of recent symbolic regression methods. Proceedings of the Genetic and Evolutionary Computation Conference, GECCO’18, New York, NY, USA, pp. 1183–1190. Cited by: §3.3.
- Linear, machine learning and probabilistic approaches for time series analysis. In 2016 IEEE First International Conference on Data Stream Mining Processing (DSMP), pp. 377–381. Cited by: §3.3.
- Telstra seeks to boost tech skills pipeline. Note: https://www.computerworld.com/article/3462420/telstra-seeks-to-boost-tech-skills-pipeline.html Accessed: 2020-2-6 Cited by: §1.
- Right for the job. Cited by: §2.1, §2.3.
- Product market competition, skill shortages and productivity: evidence from canadian manufacturing firms. Journal of Productivity Analysis 23 (3), pp. 317–339. Cited by: §2.3.
- Ageing and health. Note: https://www.who.int/news-room/fact-sheets/detail/ageing-and-health Accessed: 2020-2-8 Cited by: §2.2.
- A CEEMDAN and XGBOOST-Based approach to forecast crude oil prices. Complexity 2019 (en). Cited by: §3.3.