Overview

The sub-module covered in this document focuses on industry skills change from data shared by the LinkedIn-World Bank partnership. Simple data visualisations are publicly available at <https://linkedindata.worldbank.org/data>.

Literature Review

The interactive visualisation in Industry Skills Needs panel allows users to pick an industry group and industry, but only the skills valued in Year 2019 are shown. No option to change the year is provided.

One is hence unable to see the change in skills over the years in an industry. By presenting the changes over the years, the user is able to review trends and gather insights to the industry skills shift and sought-after skills in a particular industry.

Data Preparation and Exploration

We will be creating two visualizations.

Data table to show top 10 skill needs changes in each industry over the years
Bar graph to show the change in distribution of skill group categories in each industry over the years

Data Source

Excel: Industry Skills Needs is downloaded from https://datacatalog.worldbank.org/dataset/skills-linkedin-data. We open public_use-industry-skills-needs.xlsx in Microsoft Excel and save the tab: Industry Skills Needs as a csv file titled industry_skills_needs.csv (to be imported in R Studio later).

Installing and launching R Packages

Below is a list of of packages and their purpose. We will install and launch all these packages via a loop in R.

readr: to load data from csv file to R
dplyr: data manipulation select()
tidyr: pivot_wider()
DT: explore data presented in tabular form (1)
formattable: explore data presented in tabular form (2)
reactable: explore data presented in tabular form (3)
crosstalk: allow reactable() to be filtered by user input from filter_select()
forcats: randomly permute factor levels to categorical data i.e. skill_group_name
randomcoloR: color code skill_group_name, which are text values, in the table
tidyverse: ggplot2
formatR: wrap r code chunks

packages = c("readr", "dplyr", "tidyr", "DT", "formattable", 
    "reactable", "crosstalk", "randomcoloR", "forcats", 
    "tidyverse", "formatR")

for (p in packages) {
    if (!require(p, character.only = T)) {
        install.packages(p)
    }
    library(p, character.only = T)
}

Loading the data with read_csv

We use read_csv to load the csv file.

industryskillsneeds <- read_csv("assignment/data/industry_skills_needs.csv")

Then we use str() to view the structure of the data. Since we want see the shift in industry skill needs over the years, we have to pivot our data to have year as columns and skill_group_name as values.

str(industryskillsneeds)

spec_tbl_df [3,500 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year                : num [1:3500] 2015 2015 2015 2015 2015 ...
 $ isic_section_index  : chr [1:3500] "B" "B" "B" "B" ...
 $ isic_section_name   : chr [1:3500] "Mining and quarrying" "Mining and quarrying" "Mining and quarrying" "Mining and quarrying" ...
 $ industry_name       : chr [1:3500] "Mining & Metals" "Mining & Metals" "Mining & Metals" "Mining & Metals" ...
 $ skill_group_category: chr [1:3500] "Specialized Industry Skills" "Soft Skills" "Business Skills" "Business Skills" ...
 $ skill_group_name    : chr [1:3500] "Mining" "Negotiation" "Project Management" "Business Management" ...
 $ skill_group_rank    : num [1:3500] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "spec")=
  .. cols(
  ..   year = col_double(),
  ..   isic_section_index = col_character(),
  ..   isic_section_name = col_character(),
  ..   industry_name = col_character(),
  ..   skill_group_category = col_character(),
  ..   skill_group_name = col_character(),
  ..   skill_group_rank = col_double()
  .. )

Pivoting the data

To pivot our data to show the change in top 10 skills by industry by year, we wrangle the data: industryskillsneeds as follows.

1. select(): select only columns industry_name, year, skill_group_name, skill_group_rank

2. pivot_wider(): pivot the data with year in columns and skill_group_name as values

3. select(): select all data frame columns to allow sorting

4. arrange(): sort the data by industry name, then by skill_group_rank

We use str() to view the structure of the data.

pt_by_year <- industryskillsneeds %>% select(industry_name, 
    year, skill_group_name, skill_group_rank) %>% pivot_wider(names_from = year, 
    values_from = skill_group_name) %>% select(1:7) %>% 
    arrange(industry_name, skill_group_rank)

str(pt_by_year)

tibble [700 × 7] (S3: tbl_df/tbl/data.frame)
 $ industry_name   : chr [1:700] "Accounting" "Accounting" "Accounting" "Accounting" ...
 $ skill_group_rank: num [1:700] 1 2 3 4 5 6 7 8 9 10 ...
 $ 2015            : chr [1:700] "Auditing" "Financial Accounting" "Tax Accounting" "Bookkeeping" ...
 $ 2016            : chr [1:700] "Financial Accounting" "Auditing" "Tax Accounting" "Bookkeeping" ...
 $ 2017            : chr [1:700] "Financial Accounting" "Auditing" "Tax Accounting" "Teamwork" ...
 $ 2018            : chr [1:700] "Financial Accounting" "Auditing" "Tax Accounting" "Teamwork" ...
 $ 2019            : chr [1:700] "Financial Accounting" "Tax Accounting" "Auditing" "Bookkeeping" ...

Exploring various packages to show skill needs changes

To create interactivity in tables, we explored various data table packages and their functions. Below is a summary of our observations. Of all the different packages, we decided on reactable to incorporate with crosstalk for our interactive table.

DT package

By default, there is only a global search function in the top right hand corner; we add column filters by introducing the filter argument, and set page length to 10 observations.

However the table width exceeds the width of the post.

(The results of the code below are hidden intentionally.)

datatable(pt_by_year, rownames = FALSE, filter = list(position = "top"), 
    options = list(pageLength = 10))

formattable package

Formattable presents all 700 observations immediately and is hence not suitable to show as a datatable on a dashboard.

(The results of the code below are hidden intentionally.)

formattable(pt_by_year, align = c("c", "c", "c", "c", 
    "c", "c"))

reactable package

Table width is adjusted to the width of the post with horizontal scroll bar. Column filters are introduced and we can possible restrict the filters to only the column: industry_name. However, the text input is free text while we would prefer to include a dropdown for selection of industry_name.

reactable(pt_by_year, style = list(fontSize = "12px"), 
    filterable = TRUE, )

Of all three packages, we decide to use reactable. The width of the table auto-fits the post. We will explore further customization for reactable in the following section.

Creating data table to show industry skills shift

Introducing crosstalk’s filtering API filter_select() to reactable

Only the filter for Industry is relevant. Instead of entering free text in the search bar for industry name, we use crosstalk’s filtering API to insert a searchbar with dropdown. Industry names are populated in the dropdown for the user to select.
Header row is styled with grey background for differentiation.
Column names for industry_name and skill_rank are renamed to Industry Name and Skill Rank respectively.
To reflect the movement in skill ranks over the years, we introduce colors to the skills in the table using two packages: forcats and randomcoloR.
- fct_shuffle() in forcats package is called to randomly permute factor levels to each skill_group_name (categorical)
- distinctColorPalette in randomcoloR package is called to set a color palette that has the same number of colors as the number of distinct skill_group_name
Since the industry_name that are equal to the skill_group_name have their text colored, we override the color for industry_name by introducing a style argument ni the column definition of industry_name.
We indicate minRows=10

We want to show only one industry on one page and there are total of 10 skills ranked from 1 to 10 per industry per year.
Pagination is set to jump

There are many pages for 700 observations and it would be time-consuming for the user to keep clicking the forward arrow to get through the pages. Hence we choose jump to allow user to jump to a particular page by directly entering the number in the box.
Highlight=TRUE is set to highlight the row that the cursor hovers onto.

sd <- SharedData$new(pt_by_year)

skillFactor <- fct_shuffle(as.factor(unique(industryskillsneeds$skill_group_name)))
colors <- distinctColorPalette(length(skillFactor))

table1 <- bscols(widths = c(6, 12), list(filter_select("industry_name", 
    "Industry Name", sd, ~industry_name, multiple = FALSE, 
    )), reactable(sd, defaultColDef = colDef(header = function(value) gsub(".", 
    " ", value, fixed = TRUE), align = "center", minWidth = 70, 
    headerStyle = list(background = "#f7f7f8"), style = function(value) {
        color <- colors[match(value, levels(skillFactor))]
        list(color = color, fontSize = "12px")
    }), highlight = TRUE, minRows = 10, paginationType = "jump", 
    columns = list(industry_name = colDef(name = "Industry Name", 
        style = function(value) {
            list(color = "black", fontSize = "12px")
        }, show = TRUE), skill_group_rank = colDef(name = "Skill Rank", 
        align = "center", maxWidth = 60))))

table1

Industry Name

Exploring various packages to show skill category distribution changes

For this section we will filter the data for industry_name = Accounting as example data for the various plots. When creating the app in Shiny, we will allow industry name as an input variable for users to select.

Unless otherwise stated, we are using the tidyverse package, in particular ggplot, to create the plots.

acctg <- industryskillsneeds %>% filter(industry_name == 
    "Accounting")

Stacked bar plot

A stacked bar plot is created with the legend positioned at the top. The title of the bar plot is intended to by dynamic as the industry (i.e. Accounting) should change according to user selection.

The stacked bars are great in showcasing the sum of skills (10 skills in total). However, skills groups in the middle of the bar cannot be compared easily across the years because their bottom and top edges of each bar vary across the years.

plot1 <- ggplot(data = acctg, aes(year, fill = skill_group_category)) + 
    geom_bar() + scale_y_continuous(breaks = c(2, 4, 
    6, 8, 10)) + ggtitle("Distribution of top 10 skills in Accounting") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")

plot1

Dodge bar plots

The bar plot below has a slight edge over the stacked bar as viewers can easily see the changes in distribution of skill group categories over the years. Each color represents a skill group cateogry.

In the bar below for the accounting industry, it is apparent that Specialized Industry Skills was not identified from 2015 - 2018 (both years included) and this skill group category was included only in 2019.

plot2 <- ggplot(data = acctg, aes(year, fill = skill_group_category)) + 
    geom_bar(position = "dodge") + scale_y_continuous(breaks = c(2, 
    4, 6, 8, 10)) + ggtitle("Distribution of top 10 skills in Accounting") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")

plot2

The facets below clearly shows each skill group category changes across the years. However, it does have its limitations in showing the distribution within the same year. There are 5 different skill group categories in total and the odd number five makes it impossible to evenly arrange the grids in more than 1 row or more than 1 column.

plot4 <- ggplot(data = acctg, aes(x = year, group = skill_group_category, 
    fill = skill_group_category)) + geom_bar(stat = "count") + 
    scale_fill_brewer(palette = "Set2") + facet_wrap(~skill_group_category)

plot4

Dot Plot

The size of the dots reflect the count of skills in each category. However, as the size differences are not apparent to the human eye, the dot plot below is not an ideal visualization to show distribution of skill group categories over time.

plot3 <- ggplot(data = acctg, aes(x = year, y = skill_group_category)) + 
    geom_count()

plot3

Line plot

In the line chart below where each skill group category is a series, the lines overlap and hence several data points are hidden. Below, Specialized Industry Skills is hidden - it has the value of 1 in 2019.

We decide not to introduce facets for the same reasons discussed previously.

plot4 <- ggplot(data = acctg, aes(x = year, group = skill_group_category, 
    color = skill_group_category)) + geom_line(stat = "count", 
    ) + geom_point(stat = "count") + scale_colour_brewer(palette = "Set2")

plot4

Creating bar plot to show distribution changes in skill group categories

We decide on the dodged bar plot to reflect the changes in skill group categories’ distribution over the years. Below are examples for four industries.

barplot_acctg <- ggplot(data = acctg, aes(year, fill = skill_group_category)) + 
    geom_bar(position = "dodge") + scale_y_continuous(breaks = c(2, 
    4, 6, 8, 10)) + ggtitle("Distribution of top 10 skills in Accounting") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")

barplot_vet <- industryskillsneeds %>% filter(industry_name == 
    "Veterinary") %>% ggplot(aes(year, fill = skill_group_category)) + 
    geom_bar(position = "dodge") + scale_y_continuous(breaks = c(2, 
    4, 6, 8, 10)) + ggtitle("Distribution of top 10 skills in Veterinary") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")

barplot_pp <- industryskillsneeds %>% filter(industry_name == 
    "Paper & Forest Products") %>% ggplot(aes(year, 
    fill = skill_group_category)) + geom_bar(position = "dodge") + 
    scale_y_continuous(breaks = c(2, 4, 6, 8, 10)) + 
    ggtitle("Distribution of top 10 skills in Paper & Forest Products") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")


barplot_banking <- industryskillsneeds %>% filter(industry_name == 
    "Banking") %>% ggplot(aes(year, fill = skill_group_category)) + 
    geom_bar(position = "dodge") + scale_y_continuous(breaks = c(2, 
    4, 6, 8, 10)) + ggtitle("Distribution of top 10 skills in Banking") + 
    theme_minimal() + theme(legend.position = "top", 
    legend.title = element_blank()) + scale_fill_brewer(palette = "Set2")

barplot_acctg
barplot_vet
barplot_pp
barplot_banking

Storyboard

Putting the bar plot and the table side-by-side, the selection of Industry Name is meant to apply to both figures. The table complements the bar plot by showing the top 10 skills in each year.

For our shiny app, when a user hovers over a particular bar in the bar plot, we intend to highlight the corresponding cells (skills) in the table.

The selection of industry is a dropdown filter.

The bar plot is colored according to skill categories.
The skills are values in the data table and their text will be colored accordingly (see Creating data table to show industry skills shift).
For layout interactivity, when the width of the dashboard is reduced, the table will move to the bottom of the bar plot.

On the shiny app, we will have multiple tabs for analyses for migration, industry skills and growth trends. In particular, the change in skills and skill categories in each industry over the years is reflected with the visualization above.

For instance, it is apparent from the bar plot below that Veterinary is a highly specialized industry as specialized industry skills forms majority of the top 10 skills in the industry, with soft skills becoming increasingly desired over the years.

Assignment