Advanced data visualizations using R
 

SIMP59: Data Selection and Visualisation VT25
7.5 credits

nils.holmberg@iko.lu.se

Course literature

Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Wilke (2019)

Watt and Naidoo (2025)

Lecture overview


  • Multidimensional (lecture)
  • Geospatial and networks
  • Uncertainty
  • Multi-panel, cowplot
  • Data and context
  • Don’t go 3D
  • Data storytelling
  • Interactive plots (lab)
  • Plotly
  • Shiny data app
  • Self-publishing to web
  • Github codespaces
  • Flowcharts (mermaid)
  • Observable (quarto)

Data analysis framework

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted.

Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.

Multivariate analysis

  • Use mtcars to create a summary table showing the mean and standard error of miles per gallon (mpg) for different cylinder counts (cyl) and transmission types (am) in the mtcars dataset using dplyr.
cyl  am          n  mean_mpg  sem_mpg
4    Automatic   3  22.90000  0.8386497
4    Manual      8  28.07500  1.5852839
6    Automatic   4  19.12500  0.8158584
6    Manual      3  20.56667  0.4333333
8    Automatic  12  15.05000  0.8008991
8    Manual      2  15.40000  0.4000000

#| echo: true
#| output: false

# Load required library
library(dplyr)

# Use mtcars dataset (built-in R dataset)
data(mtcars)

# Create summary table with group size, mean, and SEM of mpg grouped by cyl and am
summary_table <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(
    n = n(),  # Number of cars per group (the n column in the table)
    mean_mpg = mean(mpg, na.rm = TRUE),
    sem_mpg = sd(mpg, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"  # Drop grouping structure after summarising
  ) %>%
  mutate(
    am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual")),
    cyl = factor(cyl)
  )

# Display the summary table
print(summary_table)

Aesthetic: Mapping

Figure 2.5: Fuel efficiency versus displacement, for 32 cars (1973–74 models). This figure uses five separate scales to represent data: (i) the x axis (displacement); (ii) the y axis (fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points (weight); and (v) the shape of the data points (number of cylinders). Four of the five variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous. The remaining one (number of cylinders) can be considered to be either numerical discrete or qualitative ordered. Data source: Motor Trend, 1974.
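
The five mappings described in Figure 2.5 can be sketched with the built-in mtcars data. This is a minimal illustration rather than the book's own code: palette and theme are left at the ggplot2 defaults, and the axis labels are our own wording.

```r
library(ggplot2)

# Five scales in one plot: x, y, color, size, and shape
p_scales <- ggplot(
  mtcars,
  aes(
    x = disp,            # displacement (x position)
    y = mpg,             # fuel efficiency (y position)
    color = hp,          # power (continuous color scale)
    size = wt,           # weight (point size)
    shape = factor(cyl)  # number of cylinders (discrete shape)
  )
) +
  geom_point() +
  labs(
    x = "displacement (cu. in.)", y = "fuel efficiency (mpg)",
    color = "power (hp)", size = "weight (1000 lbs)", shape = "cylinders"
  )

p_scales
```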

Amounts: Grouped And Stacked Bars

Figure 6.7: 2016 median U.S. annual household income versus age group and race. Age groups are shown along the x axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively. Data source: United States Census Bureau

Amounts: Grouped And Stacked Bars

Figure 6.8: 2016 median U.S. annual household income versus age group and race. In contrast to Figure 6.7, now race is shown along the x axis, and for each race we show seven bars according to the seven age groups. Data source: United States Census Bureau
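
The difference between Figures 6.7 and 6.8 comes down to which variable goes on the x axis and which one is mapped to fill. A hedged sketch with mtcars standing in for the census data (the income data itself is not bundled with R):

```r
library(dplyr)
library(ggplot2)

# Mean mpg for each combination of cylinder count and transmission type
mpg_means <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
  ) %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")

# Grouped bars: cyl along the x axis, one bar per transmission type
p_grouped <- ggplot(mpg_means, aes(x = cyl, y = mean_mpg, fill = am)) +
  geom_col(position = "dodge")

# Swapping x and fill regroups the same bars by transmission type
p_swapped <- ggplot(mpg_means, aes(x = am, y = mean_mpg, fill = cyl)) +
  geom_col(position = "dodge")
```

Both plots show identical numbers; the choice decides which comparison is easiest for the reader.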

Amounts: Bar Plots (code)

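# Assumes `boxoffice` (weekend box-office data with columns title_short, rank,
# and amount) is already defined as in Wilke (2019), and that theme_dviz_hgrid()
# is available from Wilke's dviz.supp companion package.
library(dplyr)      # %>%
library(ggplot2)
library(forcats)    # fct_reorder()
library(dviz.supp)  # theme_dviz_hgrid()

boxoffice %>%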
  ggplot(aes(x = fct_reorder(title_short, rank), y = amount)) +
    geom_col(fill = "#56B4E9", width = 0.6, alpha = 0.9) +
    scale_y_continuous(expand = c(0, 0),
                       breaks = c(0, 2e7, 4e7, 6e7),
                       labels = c("0", "20", "40", "60"),
                       name = "weekend gross (million USD)") +
    scale_x_discrete(name = NULL,
                     expand = c(0, 0.4)) +
    coord_cartesian(clip = "off") +
    theme_dviz_hgrid(12, rel_small = 1) +
    theme(
      #axis.ticks.length = grid::unit(0, "pt"),
      axis.line.x = element_blank(),
      axis.ticks.x = element_blank()
    )

Amounts: Dot Plots And Heatmaps

Figure 6.14: Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016. Data source: World Bank

Histograms: Density plots

Figure 7.8: Density estimates of the ages of male and female Titanic passengers. To highlight that there were more male than female passengers, the density curves were scaled such that the area under each curve corresponds to the total number of male and female passengers with known age (468 and 288, respectively).
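
Scaling density curves by group size, as in Figure 7.8, can be sketched with after_stat(count) in ggplot2. The example uses mtcars as a stand-in, because the passenger-level Titanic ages are not part of base R.

```r
library(ggplot2)

mt <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# after_stat(count) is density * n, so each curve is scaled by its group size
p_density <- ggplot(mt, aes(x = mpg, fill = am)) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5)

p_density
```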

Visualizing Associations

Figure 12.2: Head length versus body mass for 123 blue jays. The birds’ sex is indicated by color. At the same body mass, male birds tend to have longer heads (and specifically, longer bills) than female birds. Data source: Keith Tarvin, Oberlin College

Visualizing Associations

Figure 12.3: Head length versus body mass for 123 blue jays. The birds’ sex is indicated by color, and the birds’ skull size by symbol size. Head-length measurements include the length of the bill while skull-size measurements do not. Head length and skull size tend to be correlated, but there are some birds with unusually long or short bills given their skull size. Data source: Keith Tarvin, Oberlin College

Time Series

Figure 13.7: Monthly submissions to three preprint servers covering biomedical research. By direct labeling the lines instead of providing a legend, we have reduced the cognitive load required to read the figure. And the elimination of the legend removes the need for points of different shapes. Thus, we could streamline the figure further by eliminating the dots. Data source: Jordan Anaya, http://www.prepubmed.org/

Geospatial Data

Figure 4.4: Median annual income in Texas counties. The highest median incomes are seen in major Texas metropolitan areas, in particular near Houston and Dallas. No median income estimate is available for Loving County in West Texas and therefore that county is shown in gray. Data source: 2015 Five-Year American Community Survey.

Geospatial Data

Figure 4.6: Percentage of people identifying as white in Texas counties. Whites are in the majority in North and East Texas but not in South or West Texas. Data source: 2010 Decennial U.S. Census.

Geospatial Data

Figure 15.11: Population density in every U.S. county, shown as a choropleth map. Population density is reported as persons per square kilometer. Data source: 2015 Five-Year American Community Survey.

Geospatial and Networks

#| echo: true
#| output: false

library(igraph)

# Create a random graph with 10 nodes and 20 edges (seeded for reproducibility)
set.seed(42)
g <- sample_gnm(10, 20)

# Detect communities using the Louvain method
comm <- cluster_louvain(g)
plot(comm, g)

Visualizing Uncertainty

Figure 16.5: Relationship between sample, sample mean, standard deviation, standard error, and confidence intervals, in an example of chocolate bar ratings. The observations (shown as jittered green dots) that make up the sample represent expert ratings of 125 chocolate bars from manufacturers in Canada, rated on a scale from 1 (unpleasant) to 5 (elite). The large orange dot represents the mean of the ratings. Error bars indicate, from top to bottom, twice the standard deviation, twice the standard error (standard deviation of the mean), and 80%, 95%, and 99% confidence intervals of the mean. Data source: Brady Brelinski, Manhattan Chocolate Society.
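
The quantities named in Figure 16.5 can be computed in a few lines of base R. The sketch below uses mtcars$mpg as the sample (the chocolate ratings are not bundled with R) and assumes t-based intervals; the helper ci() is our own illustrative function.

```r
# Sample: fuel efficiency of the 32 cars in mtcars
x <- mtcars$mpg
n <- length(x)

m  <- mean(x)      # sample mean
s  <- sd(x)        # sample standard deviation
se <- s / sqrt(n)  # standard error of the mean

# t-based confidence interval of the mean at a given level
ci <- function(level) {
  m + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * se
}

intervals <- rbind(`80%` = ci(0.80), `95%` = ci(0.95), `99%` = ci(0.99))
colnames(intervals) <- c("lower", "upper")
intervals
```

The standard deviation describes the spread of the observations, while the standard error and the confidence intervals describe the uncertainty of the mean, which shrinks as n grows.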

Visualizing Uncertainty

Figure 16.6: Confidence intervals widen with smaller sample size. Chocolate bars from Canada and Switzerland have comparable mean ratings and comparable standard deviations (indicated with simple black error bars). However, over three times as many Canadian bars were rated as Swiss bars, and therefore the confidence intervals (indicated with error bars of different colors and thickness drawn on top of one another) are substantially wider for the mean of the Swiss ratings than for the mean of the Canadian ratings. Data source: Brady Brelinski, Manhattan Chocolate Society.

Visualizing Uncertainty

Figure 16.7: Mean chocolate flavor ratings and associated confidence intervals for chocolate bars from manufacturers in six different countries. Data source: Brady Brelinski, Manhattan Chocolate Society.

Visualizing Uncertainty

Figure 16.8: Mean chocolate flavor ratings for manufacturers from five different countries, relative to the mean rating of U.S. chocolate bars. Canadian chocolate bars are significantly higher rated than U.S. bars. For the other four countries there is no significant difference in mean rating to the U.S. at the 95% confidence level. Confidence levels have been adjusted for multiple comparisons using Dunnett’s method. Data source: Brady Brelinski, Manhattan Chocolate Society.

Visualizing Uncertainty

Figure 16.9: Mean chocolate flavor ratings for manufacturers from four different countries, relative to the mean rating of U.S. chocolate bars. Each panel uses a different approach to visualizing the same uncertainty information. (a) Graded error bars with cap. (b) Graded error bars without cap. (c) Single-interval error bars with cap. (d) Single-interval error bars without cap. (e) Confidence strips. (f) Confidence distributions.

Visualizing Uncertainty

Figure 16.10: Mean butterfat contents in the milk of four cattle breeds. Error bars indicate +/- one standard error of the mean. Visualizations of this type are frequently seen in the scientific literature. While they are technically correct, they represent neither the variation within each category nor the uncertainty of the sample means particularly well. See Figure 7.11 for the variation in butterfat contents within individual breeds. Data Source: Canadian Record of Performance for Purebred Dairy Cattle.

Visualizing Uncertainty

Figure 16.15: Head length versus body mass for male blue jays, as in Figure 14.7. The straight blue line represents the best linear fit to the data, and the gray band around the line shows the uncertainty in the linear fit. The gray band represents a 95% confidence level. Data source: Keith Tarvin, Oberlin College.
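
A confidence band like the one in Figure 16.15 can be drawn with geom_smooth(), or computed directly from the fitted model. The sketch uses mtcars (weight versus mpg) as a stand-in for the blue jay data.

```r
library(ggplot2)

# Linear fit with a 95% confidence band drawn by ggplot2
p_band <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

# The same kind of band computed directly from the fitted model
fit  <- lm(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
band <- cbind(
  grid,
  predict(fit, newdata = grid, interval = "confidence", level = 0.95)
)

head(band)  # columns: wt, fit, lwr, upr
```

Computing the band by hand is useful when several levels (for example 80%, 95%, and 99%) should be layered as graded bands with geom_ribbon().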

Visualizing Uncertainty

Figure 16.17: Head length versus body mass for male blue jays. As in the case of error bars, we can draw graded confidence bands to highlight the uncertainty in the estimate. Data source: Keith Tarvin, Oberlin College.

Multi Panel figures, cowplots

Figure 21.1: Breakdown of passengers on the Titanic by gender, survival, and class in which they traveled (1st, 2nd, or 3rd).

Multi Panel figures, cowplots

Figure 21.3: Trends in Bachelor’s degrees conferred by U.S. institutions of higher learning. Shown are all degree areas that represent, on average, more than 4% of all degrees. This figure is labeled as “bad” because all panels use different y-axis ranges. This choice obscures the relative sizes of the different degree areas and it overexaggerates the changes that have happened in some of the degree areas. Data Source: National Center for Education Statistics.

Multi Panel figures, cowplots

Figure 21.4: Trends in Bachelor’s degrees conferred by U.S. institutions of higher learning. Shown are all degree areas that represent, on average, more than 4% of all degrees. Data Source: National Center for Education Statistics.
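
In ggplot2, the contrast between Figures 21.3 and 21.4 corresponds to the scales argument of facet_wrap(). A minimal sketch using economics_long, a dataset that ships with ggplot2 (not the degree data from the figures):

```r
library(ggplot2)

# economics_long: several monthly U.S. economic series in long format
base_plot <- ggplot(economics_long, aes(x = date, y = value)) +
  geom_line()

# Independent y axes: every series fills its panel,
# but magnitudes cannot be compared across panels
p_free <- base_plot + facet_wrap(~variable, scales = "free_y")

# Shared y axis (the default): magnitudes are comparable across panels
p_fixed <- base_plot + facet_wrap(~variable)
```

With these particular series the shared axis flattens the small ones, so the choice is a trade-off; a shared axis is the safer default when the panels measure the same quantity, as in the degree example.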

Multi Panel figures, cowplots

Figure 21.5: Trends in Bachelor’s Degrees conferred by U.S. institutions of higher learning. (a) From 1970 to 2015, the total number of degrees nearly doubled. (b) Among the most popular degree areas, social sciences, history, and education experienced a major decline, while business and health professions grew. Data Source: National Center for Education Statistics.

Multi Panel figures, cowplots

Figure 21.6: Variation of Figure 21.5 with poor labeling. The labels are too large and thick, they are in the wrong font, and they are placed in an awkward location. Also, while labeling with capital letters is fine and is in fact quite common, labeling needs to be consistent across all figures in a document. In this book, the convention is that multi-panel figures use lower-case labels, and thus this figure is inconsistent with the other figures in this book.

Multi Panel figures, cowplots

Figure 21.7: Physiology and body-composition of male and female athletes. (a) The data set encompasses 73 female and 85 male professional athletes. (b) Male athletes tend to have higher red blood cell (RBC, reported in units of $10^{12}$ per liter) counts than female athletes, but there are no such differences for white blood cell counts (WBC, reported in units of $10^{9}$ per liter). (c) Male athletes tend to have a lower body fat percentage than female athletes performing in the same sport. Data source: Telford and Cunningham (1991).

Multi Panel figures, cowplots

Figure 21.8: Physiology and body-composition of male and female athletes. This figure shows the exact same data as Figure 21.7, but now using a consistent visual language. Data for female athletes is always shown to the left of the corresponding data for male athletes, and genders are consistently color-coded throughout all elements of the figure. Data source: Telford and Cunningham (1991).

Multi Panel figures, cowplots

Figure 21.9: Variation of Figure 21.8 where all figure panels are slightly misaligned. Misalignments are ugly and should be avoided.
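
Consistent lower-case labels and aligned panels can both be handled by cowplot's plot_grid(). The two panels below are placeholder plots built from mtcars, not the athlete figures.

```r
library(ggplot2)
library(cowplot)

p_a <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p_b <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  xlab("cylinders")

# labels = "auto" adds lower-case panel labels (a, b, ...);
# align and axis line up the plot panels along their bottom and top edges
p_combined <- plot_grid(p_a, p_b, labels = "auto", align = "h", axis = "bt")

p_combined
```

Using labels = "AUTO" instead would give upper-case labels; whichever is chosen should be used for every figure in the document.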

Data and context

Figure 23.1: Percent body fat versus height in professional male Australian athletes. Each point represents one athlete. This figure devotes way too much ink to non-data. There are unnecessary frames around the entire figure, around the plot panel, and around the legend. The coordinate grid is very prominent, and its presence draws attention away from the data points. Data source: Telford and Cunningham (1991).

Data and context

Figure 23.2: Percent body fat versus height in professional male Australian athletes. This figure is a cleaned-up version of Figure 23.1. Unnecessary frames have been removed, minor grid lines have been removed, and major grid lines have been drawn in light gray to stand back relative to the data points. Data source: Telford and Cunningham (1991).
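
The clean-up steps listed for Figure 23.2 map onto a handful of theme() settings. This sketch applies them to an mtcars scatter plot as a stand-in; the specific gray value is our own choice.

```r
library(ggplot2)

p_clean <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cylinders") +
  theme_bw() +
  theme(
    panel.border = element_blank(),                     # no frame around the panel
    panel.grid.minor = element_blank(),                 # drop minor grid lines
    panel.grid.major = element_line(color = "grey90"),  # light major grid lines
    legend.background = element_blank(),                # no box around the legend
    legend.key = element_blank()
  )

p_clean
```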

Data and context

Figure 23.3: Percent body fat versus height in professional male Australian athletes. In this example, the concept of removing non-data ink has been taken too far. The axis tick labels and title are too faint and are barely visible. The data points seem to float in space. The points in the legend are not sufficiently set off from the data points, and the casual observer might think they are part of the data. Data source: Telford and Cunningham (1991)

Data and context

Figure 23.4: Percent body fat versus height in professional male Australian athletes. This figure adds a frame around the plot panel of Figure 23.2, and this frame helps separate the legend from the data. Data source: Telford and Cunningham (1991)

Data and context

Figure 23.5: Survival of passengers on the Titanic, broken down by gender and class. This small-multiples plot is too minimalistic. The individual facets are not framed, so it’s difficult to see which part of the figure belongs to which facet. Further, the individual bars are not anchored to a clear baseline, and they seem to float.

Data and context

Figure 23.6: Survival of passengers on the Titanic, broken down by gender and class. This is an improved version of Figure 23.5. The gray background in each facet clearly delineates the six groupings (survived or died in first, second, or third class) that make up this plot. Thin horizontal lines in the background provide a reference for the bar heights and facilitate comparison of bar heights among facets.

Data and context

Figure 23.7: Stock price over time for four major tech companies. The stock price for each company has been normalized to equal 100 in June 2012. This figure mimics the ggplot2 default look, with white major and minor grid lines on a gray background. In this particular example, I think the grid lines overpower the data lines, and the result is a figure that is not well balanced and that doesn’t place sufficient emphasis on the data. Data source: Yahoo Finance

Data and context

Figure 23.8: Indexed stock price over time for four major tech companies. In this variant of Figure 23.7, the data lines are not sufficiently anchored. This makes it difficult to ascertain to what extent they have deviated from the index value of 100 at the end of the covered time interval. Data source: Yahoo Finance

Data and context

Figure 23.9: Indexed stock price over time for four major tech companies. Adding a thin horizontal line at the index value of 100 to Figure 23.8 helps provide an important reference throughout the entire time period the plot spans. Data source: Yahoo Finance
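
Indexing several series to a common starting value of 100 and adding a reference line takes only a few steps. The sketch uses EuStockMarkets, a built-in time series of four European stock indices, in place of the tech-stock data, which is not bundled with R.

```r
library(ggplot2)

prices <- as.data.frame(EuStockMarkets)

# Index every series so that its first observation equals 100
indexed <- lapply(prices, function(x) 100 * x / x[1])

# Long format: one row per (time, index) pair
long <- data.frame(
  time  = rep(as.numeric(time(EuStockMarkets)), times = ncol(prices)),
  index = rep(colnames(prices), each = nrow(prices)),
  value = unlist(indexed, use.names = FALSE)
)

p_indexed <- ggplot(long, aes(x = time, y = value, color = index)) +
  geom_hline(yintercept = 100, color = "grey50") +  # reference line at the index value
  geom_line()

p_indexed
```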

Data and context

Figure 23.10: Indexed stock price over time for four major tech companies. Adding thin horizontal lines at all major y axis ticks provides a better set of reference points than just the one horizontal line of Figure 23.9. This design also removes the need for prominent x and y axis lines, since the evenly spaced horizontal lines create a visual frame for the plot panel. Data source: Yahoo Finance

Don’t go 3D

Figure 26.1: The same 3D pie chart shown from four different angles. Rotating a pie into the third dimension makes pie slices in the front appear larger than they really are and pie slices in the back appear smaller. Here, in parts (a), (b), and (c), the blue slice corresponding to 25% of the data visually occupies more than 25% of the area representing the pie. Only part (d) is an accurate representation of the data.

Don’t go 3D

Figure 26.2: Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class, shown as a 3D stacked bar plot. The total numbers of passengers in 1st, 2nd, and 3rd class are 322, 279, and 711, respectively (see Figure 6.10). Yet in this plot, the 1st class bar appears to represent fewer than 300 passengers, the 3rd class bar appears to represent fewer than 700 passengers, and the 2nd class bar seems to be closer to 210–220 passengers than the actual 279 passengers. Furthermore, the 3rd class bar visually dominates the figure and makes the number of passengers in 3rd class appear larger than it actually is.

Don’t go 3D

Figure 26.3: Fuel efficiency versus displacement and power for 32 cars (1973–74 models). Each dot represents one car, and the dot color represents the number of cylinders of the car. The four panels (a)–(d) show exactly the same data but use different perspectives. Data source: Motor Trend, 1974.

Don’t go 3D

Figure 26.4: Fuel efficiency versus displacement and power for 32 cars (1973–74 models). The four panels (a)–(d) correspond to the same panels in Figure 26.3, only that all grid lines providing depth cues have been removed. Data source: Motor Trend, 1974.

Don’t go 3D

Figure 26.5: Fuel efficiency versus displacement (a) and power (b). Data source: Motor Trend, 1974.

Don’t go 3D

Figure 26.6: Power versus displacement for 32 cars, with fuel efficiency represented by dot size. Data source: Motor Trend, 1974.

Don’t go 3D

Figure 26.7: Mortality rates in Virginia in 1940, visualized as a 3D bar plot. Mortality rates are shown for four groups of people (urban and rural females and males) and five age categories (50–54, 55–59, 60–64, 65–69, 70–74), and they are reported in units of deaths per 1000 persons. This figure is labeled as “bad” because the 3D perspective makes the plot difficult to read. Data source: Molyneaux, Gilliam, and Florant (1947)

Don’t go 3D

Figure 26.8: Mortality rates in Virginia in 1940, visualized as a Trellis plot. Mortality rates are shown for four groups of people (urban and rural females and males) and five age categories (50–54, 55–59, 60–64, 65–69, 70–74), and they are reported in units of deaths per 1000 persons. Data source: Molyneaux, Gilliam, and Florant (1947)
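
The Virginia mortality data is available in R as the built-in VADeaths matrix, so a Trellis-style alternative to the 3D bar plot can be sketched directly. Layout and labels here are our own and will differ from the book figure.

```r
library(ggplot2)

# VADeaths: rows are age groups, columns are population groups
va <- as.data.frame(as.table(VADeaths))
names(va) <- c("age", "group", "rate")  # rate = deaths per 1000 persons

# One panel per population group instead of a third spatial dimension
p_trellis <- ggplot(va, aes(x = rate, y = age)) +
  geom_col() +
  facet_wrap(~group) +
  labs(x = "mortality rate (deaths per 1000 persons)", y = "age group")

p_trellis
```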

Data Storytelling

Figure 29.1: Growth in monthly submissions to the quantitative biology (q-bio) section of the preprint server arXiv.org. A sharp transition in the rate of growth can be seen around 2014. While growth was rapid up to 2014, almost no growth occurred from 2014 to 2018. Note that the y axis is logarithmic, so a linear increase in y corresponds to exponential growth in preprint submissions. Data source: Jordan Anaya, http://www.prepubmed.org/.

Data Storytelling

Figure 29.2: The leveling off of submission growth to q-bio coincided with the introduction of the bioRxiv server. Shown are the growth in monthly submissions to the q-bio section of the general-purpose preprint server arxiv.org and to the dedicated biology preprint server bioRxiv. The bioRxiv server went live in November 2013, and its submission rate has grown exponentially since. It seems likely that many scientists who otherwise would have submitted preprints to q-bio chose to submit to bioRxiv instead. Data source: Jordan Anaya, http://www.prepubmed.org/.

Data Storytelling

Figure 29.3: Mean arrival delay versus distance from New York City. Each point represents one destination, and the size of each point represents the number of flights from one of the three major New York City airports (Newark, JFK, or LaGuardia) to that destination in 2013. Negative delays imply that the flight arrived early. Solid lines represent the mean trends between arrival delay and distance. Delta has consistently lower arrival delays than other airlines, regardless of distance traveled. American has among the lowest delays, on average, for short distances, but has among the highest delays for longer distances traveled. This figure is labeled as “bad” because it is overly complex. Most readers will find it confusing and will not intuitively grasp what it is the figure is showing. Data source: U.S. Dept. of Transportation, Bureau of Transportation Statistics.

Data Storytelling

Figure 29.4: Mean arrival delay for flights out of the New York City area in 2013, by airline. American and Delta have the lowest mean arrival delays of all airlines flying out of the New York City area. Data source: U.S. Dept. of Transportation, Bureau of Transportation Statistics.

Data Storytelling

Figure 29.5: Number of flights out of the New York City area in 2013, by airline. Delta and American are the fourth and fifth largest carriers by flights out of the New York City area. Data source: U.S. Dept. of Transportation, Bureau of Transportation Statistics.

Data Storytelling

Figure 29.6: United Airlines departures out of Newark Airport (EWR) in 2013, by weekday. Most weekdays show approximately the same number of departures, but there are fewer departures on weekends. Data source: U.S. Dept. of Transportation, Bureau of Transportation Statistics.

Data Storytelling

Figure 29.7: Departures out of airports in the New York City area in 2013, broken down by airline, airport, and weekday. United Airlines and ExpressJet make up most of the departures out of Newark Airport (EWR), JetBlue, Delta, American, and Endeavor make up most of the departures out of JFK, and Delta, American, Envoy, and US Airways make up most of the departures out of LaGuardia (LGA). Most but not all airlines have fewer departures on weekends than during the work week. Data source: U.S. Dept. of Transportation, Bureau of Transportation Statistics.

Lecture overview


  • Multidimensional (lecture)
  • Geospatial and networks
  • Uncertainty
  • Multi-panel, cowplot
  • Data and context
  • Don’t go 3D
  • Data storytelling
  • Interactive plots (lab)
  • Plotly
  • Shiny data app
  • Self-publishing to web
  • Github codespaces
  • Flowcharts (mermaid)
  • Observable (quarto)

Course literature

Sievert (2020)

Wilke (2019)

Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Some use cases

  1. Exploratory Data Analysis (EDA)
    • Advantage: Interactive plots allow users to zoom, filter, and hover over data points, making it easier to detect outliers, trends, and anomalies.
  2. Time Series Analysis
    • Advantage: Users can dynamically adjust time ranges, pan through historical data, and highlight trends, improving insights into seasonality and forecasting.
  3. Geospatial Visualization
    • Advantage: Interactive maps (e.g., with leaflet or ggplot2 + plotly) enable users to explore geographic patterns, drill down into regions, and overlay multiple data layers.

Some use cases

  1. Network and Relationship Analysis
    • Advantage: Graph visualizations (e.g., with ggraph or visNetwork) allow users to interactively explore relationships, zoom into clusters, and track influence across networks.
  2. Dashboard and Reporting
    • Advantage: Interactive dashboards (e.g., using shiny or flexdashboard) allow decision-makers to dynamically filter data, generate custom reports, and make data-driven decisions in real-time.

Plotly: Origins and Philosophy

  • Plotly started in 2012, founded by Alex Johnson, Jack Parmer, Chris Parmer, and Matthew Sundquist to democratize interactive graphing.
  • Its R package, launched in 2015, builds on the plotly.js library, which renders charts in the browser (SVG by default, with WebGL available for larger datasets).
  • Plotly’s philosophy centers on interactivity, favoring dynamic exploration over static charts.
  • It promotes intuitive tools like zooming and hovering to make data accessible and engaging.
  • The focus on browser-based rendering supports seamless sharing across platforms.

Plotly, Code Sample

#| echo: true
#| output: false

# Load required packages
library(plotly)
library(ggplot2)

# Load the mtcars dataset (built-in R dataset)
data(mtcars)

# --- Example 1: Standalone Plotly Plot ---
# Create an interactive scatter plot with Plotly
plotly_plot <- plot_ly(
  data = mtcars,
  x = ~wt,              # Weight (x-axis)
  y = ~mpg,             # Miles per gallon (y-axis)
  color = ~factor(cyl), # Color by number of cylinders
  size = ~hp,           # Size points by horsepower
  type = "scatter",
  mode = "markers",
  text = ~paste("HP:", hp, "<br>Cyl:", cyl), # Hover text
  hoverinfo = "text"
) %>%
  layout(
    title = "Standalone Plotly: Weight vs MPG",
    xaxis = list(title = "Weight (1000 lbs)"),
    yaxis = list(title = "Miles per Gallon"),
    legend = list(title = list(text = "Cylinders"))
  )

# Display the standalone Plotly plot
plotly_plot

Plotly: Extending ggplot2

  • Plotly extends ggplot2 with ggplotly(), adding interactivity to static plots effortlessly.
  • Beyond ggplot2’s static 2D output, Plotly natively supports 3D plots, animations, and hover details.
  • As an alternative, Plotly offers standalone functions, skipping ggplot2’s grammar for direct plotting.
  • Because rendering is JavaScript-driven and happens in the browser, Plotly is better suited than ggplot2 to web deployment, and its WebGL trace types can help with large datasets.
  • Plotly complements ggplot2’s print-ready design with a focus on browser-compatible, interactive outputs.

Plotly, Code Sample

#| echo: true
#| output: false

# Load required packages
library(plotly)
library(ggplot2)

# Load the mtcars dataset (built-in R dataset)
data(mtcars)

# --- Example 2: ggplot2 with ggplotly ---
# Create a ggplot2 scatter plot
ggplot_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp)) +
  geom_point() +
  labs(
    title = "ggplot2: Weight vs MPG",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders",
    size = "Horsepower"
  ) +
  theme_minimal()

# Convert the ggplot2 plot to an interactive Plotly plot
ggplotly_plot <- ggplotly(ggplot_plot)

# Display the ggplotly plot
ggplotly_plot

# Optional: Save both plots as HTML files for sharing
#htmlwidgets::saveWidget(plotly_plot, "plotly_standalone.html")
#htmlwidgets::saveWidget(ggplotly_plot, "ggplotly_converted.html")

Plotly: Practical Applications in R

  • Plotly integrates with Shiny to build interactive dashboards for user-driven data exploration.
  • It animates time-series data, revealing trends with dynamic trace updates.
  • Researchers use Plotly’s 3D scatter plots to visualize spatial data in R effectively.
  • It creates network graphs with hover tooltips, perfect for social or biological networks.
  • Plotly’s HTML export makes advanced visualizations shareable online with ease.

Plotly: Basic Building Blocks

  • Plotly’s core is the plot_ly() function, initializing a plot with data and type specifications.
  • Traces define the plot type (e.g., scatter, bar, surface), added via add_trace() or plot_ly().
  • Layouts customize axes, titles, and legends using the layout() function for fine-tuning.
  • Data frames or vectors supply the input, mapped to aesthetics like x, y, or color.
  • Attributes like mode (e.g., “lines”, “markers”) and marker control visual styling within traces.

Plotly: Key Functions in R

  • plot_ly() starts a plot, combining data, type (e.g., “scatter”), and mappings in one call.
  • add_trace() layers additional data series, such as lines or points, onto an existing plot.
  • layout() adjusts non-data elements, like titles or axis ranges, for presentation polish.
  • ggplotly() converts ggplot2 objects into interactive Plotly plots with minimal effort.
  • config() tweaks interactivity options, like hiding the toolbar or enabling downloads.
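
The functions listed above can be chained in one pipeline. The following is a minimal sketch with mtcars; the fitted line is just an illustrative second trace, and the trace names are our own.

```r
library(plotly)

# Linear fit used for the second trace
fit <- lm(mpg ~ wt, data = mtcars)
line_df <- data.frame(wt = sort(mtcars$wt))
line_df$mpg <- predict(fit, newdata = line_df)

p_layers <- plot_ly(
  mtcars, x = ~wt, y = ~mpg,
  type = "scatter", mode = "markers", name = "cars"
) %>%
  # add_trace() layers a second data series onto the existing plot
  add_trace(
    data = line_df, x = ~wt, y = ~mpg,
    type = "scatter", mode = "lines", name = "linear fit"
  ) %>%
  # layout() adjusts non-data elements
  layout(
    title = "Weight vs MPG",
    xaxis = list(title = "Weight (1000 lbs)"),
    yaxis = list(title = "Miles per Gallon")
  ) %>%
  # config() controls interactivity options; here the toolbar is hidden
  config(displayModeBar = FALSE)

p_layers
```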

Shiny: Origins and Philosophy

  • Shiny was created in 2012 by Joe Cheng at RStudio to simplify interactive web app development in R.
  • Its R package debuted as an open-source tool, making R’s analytics accessible via browsers.
  • Shiny’s philosophy emphasizes reactivity, linking user inputs to real-time outputs seamlessly.
  • It aims to empower R users to build web apps without needing HTML, CSS, or JavaScript expertise.
  • The focus on rapid prototyping supports data scientists in sharing insights interactively.

Shiny: Extending R Base Graphics

  • Shiny extends R’s base graphics by embedding plots in dynamic, user-controlled web interfaces.
  • It adds interactivity to static outputs like plot(), enabling live data exploration.
  • As an alternative, Shiny bypasses base graphics’ static nature with reactive, web-ready visuals.
  • It outperforms base graphics for real-time applications, leveraging browser rendering over R’s defaults.
  • Shiny complements R’s plotting ecosystem by integrating with tools like Plotly or ggplot2.

Shiny: Practical Applications in R

  • Shiny builds dashboards for business analytics, letting users filter and visualize data on the fly.
  • It powers educational tools, offering interactive simulations for teaching statistics or modeling.
  • Researchers use Shiny to share experiment results with customizable, web-based interfaces.
  • It creates data exploration apps, allowing non-R users to interact with complex datasets.
  • Shiny’s deployment options make it ideal for hosting visualizations on the web or locally.

Shiny: Basic Building Blocks

  • Shiny’s core is the ui object, defining the app’s layout and input/output elements.
  • The server function handles logic, linking inputs to reactive outputs dynamically.
  • Inputs like sliderInput() or selectInput() capture user interactions for real-time updates.
  • Outputs such as plotOutput() or tableOutput() display results driven by server calculations.
  • Reactivity ties it all together, automatically updating outputs when inputs change.
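
These building blocks fit into a very small app. The sketch below is a minimal example with one input and one output; the ids max_wt and scatter are arbitrary names chosen for illustration.

```r
library(shiny)

# UI: one input (slider) and one output (plot)
ui <- fluidPage(
  titlePanel("mtcars explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("max_wt", "Maximum weight (1000 lbs):",
                  min = 2, max = 6, value = 6, step = 0.5)
    ),
    mainPanel(plotOutput("scatter"))
  )
)

# Server: the plot re-renders whenever input$max_wt changes
server <- function(input, output, session) {
  output$scatter <- renderPlot({
    shown <- mtcars[mtcars$wt <= input$max_wt, ]
    plot(shown$wt, shown$mpg,
         xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
  })
}

app <- shinyApp(ui, server)
# Start the app interactively with: runApp(app)
```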

Shiny: Key Functions in R

  • shinyApp() launches an app by combining the ui and server components into one call.
  • renderPlot() generates dynamic plots in the server, responding to user inputs.
  • reactive() creates reactive expressions, caching computations for efficient updates.
  • observe() triggers side effects, like printing messages, based on input changes.
  • runApp() starts a Shiny app from a directory or file, streamlining development and testing.
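
The difference between reactive() and observe() is easiest to see side by side. In this sketch, which uses illustrative ids (cyl, n_cars, mean_mpg), the reactive expression returns a value that two outputs share, while the observer only produces a side effect.

```r
library(shiny)

ui_demo <- fluidPage(
  selectInput("cyl", "Cylinders:", choices = c(4, 6, 8)),
  textOutput("n_cars"),
  textOutput("mean_mpg")
)

server_demo <- function(input, output, session) {
  # reactive(): recomputed only when input$cyl changes, then cached
  subset_cars <- reactive({
    req(input$cyl)  # wait until the input has a value
    mtcars[mtcars$cyl == as.numeric(input$cyl), ]
  })

  # Two outputs reuse the cached subset
  output$n_cars   <- renderText(nrow(subset_cars()))
  output$mean_mpg <- renderText(round(mean(subset_cars()$mpg), 1))

  # observe(): returns nothing, runs for its side effect (a console message)
  observe({
    message("Cylinders selected: ", input$cyl)
  })
}

app_demo <- shinyApp(ui_demo, server_demo)
# Start the app interactively with: runApp(app_demo)
```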

Shiny, Code Sample

#| echo: true
#| output: false

# app.R
library(shiny)
library(threejs)
library(igraph)
library(htmlwidgets)

# Load the Zachary Karate Club network
zachary <- make_graph("Zachary")

# Extract nodes and edges for initial setup
nodes <- data.frame(id = 1:vcount(zachary), label = paste("Node", 1:vcount(zachary)))
edges <- as.data.frame(as_edgelist(zachary))
colnames(edges) <- c("from", "to")

# Define the Shiny UI
ui <- fluidPage(
  titlePanel("Interactive 3D Zachary Network"),
  sidebarLayout(
    sidebarPanel(
      h4("Select Nodes"),
      checkboxGroupInput(
        inputId = "selected_nodes",
        label = "Choose nodes to display:",
        choices = nodes$id,
        selected = nodes$id  # All nodes selected by default
      )
    ),
    mainPanel(
      htmlOutput("network", width = "100%", height = "600px")
    )
  )
)

# Define the Shiny server
server <- function(input, output, session) {
  # Reactive graph based on selected nodes
  filtered_graph <- reactive({
    # Get selected nodes (checkbox values arrive as character strings)
    selected <- as.integer(input$selected_nodes)
    
    # If no nodes are selected, return an empty graph
    if (length(selected) == 0) {
      return(make_empty_graph(n = 0, directed = FALSE))
    }
    
    # Name every vertex by its original ID so labels survive the filtering.
    # (graph_from_edgelist() on numeric IDs would re-create every vertex up to
    # the largest ID, including unselected ones.)
    full <- set_vertex_attr(zachary, "name", value = as.character(nodes$id))
    
    # Keep the selected vertices and the edges among them; selected nodes
    # without remaining edges stay in the graph as isolated vertices
    induced_subgraph(full, vids = selected)
  })
  
  # Render the 3D network graph
  output$network <- renderUI({
    g <- filtered_graph()
    
    # If the graph is empty, return a blank HTML message
    if (vcount(g) == 0) {
      return(HTML("<p>No nodes selected.</p>"))
    }
    
    # Generate a 3D layout and pass it to graphjs explicitly
    coords <- layout_with_fr(g, dim = 3)
    
    # Create the graphjs widget
    graph_widget <- graphjs(
      g,
      layout = coords,
      vertex.size = 0.5,
      vertex.label = paste("Node", V(g)$name),
      edge.width = 1,
      edge.color = "gray",
      vertex.color = "blue"
    )
    
    # Return the widget as HTML
    graph_widget
  })
}

# Run the Shiny app
shinyApp(ui, server)

Shiny, Code Sample

#| echo: true
#| output: false

# app.R
library(shiny)
library(visNetwork)
library(igraph)

# Load the Zachary Karate Club network
zachary <- make_graph("Zachary")

# Extract nodes and edges
nodes <- data.frame(id = 1:vcount(zachary), label = paste("Node", 1:vcount(zachary)))
edges <- as.data.frame(as_edgelist(zachary))
colnames(edges) <- c("from", "to")

# Define UI
ui <- fluidPage(
  titlePanel("Interactive Zachary Network"),
  sidebarLayout(
    sidebarPanel(
      h4("Select Nodes"),
      checkboxGroupInput(
        inputId = "selected_nodes",
        label = "Choose nodes to display:",
        choices = nodes$id,
        selected = nodes$id
      )
    ),
    mainPanel(visNetworkOutput("network"))
  )
)

# Define Server
server <- function(input, output, session) {
  filtered_graph <- reactive({
    # Checkbox values arrive as character strings
    selected <- as.integer(input$selected_nodes)
    
    # Return empty graph early if no nodes selected
    if (length(selected) == 0) {
      return(make_empty_graph(n = 0, directed = FALSE))
    }
    
    # Name every vertex by its original ID so IDs and labels survive the
    # filtering. (Rebuilding the graph with graph_from_edgelist() and renaming
    # vertices afterwards can attach names to the wrong vertices.)
    full <- set_vertex_attr(zachary, "name", value = as.character(nodes$id))
    
    # Keep the selected vertices and the edges among them; selected nodes
    # without remaining edges stay in the graph as isolated vertices
    induced_subgraph(full, vids = selected)
  })
  
  output$network <- renderVisNetwork({
    g <- filtered_graph()
    
    # Handle empty graph case
    if (vcount(g) == 0) {
      return(visNetwork(nodes = data.frame(id = integer(0), label = character(0)), 
                        edges = data.frame(from = integer(0), to = integer(0))))
    }
    
    # Create node data frame with unique IDs
    node_data <- data.frame(
      id = as.numeric(V(g)$name),  # Ensure IDs are numeric and unique
      label = paste("Node", V(g)$name)
    )
    # Remove any duplicate IDs (safeguard)
    node_data <- node_data[!duplicated(node_data$id), ]
    
    # Create edge data frame, handling case with no edges
    edge_data <- if (ecount(g) > 0) {
      edge_matrix <- ends(g, E(g))
      data.frame(
        from = as.numeric(edge_matrix[,1]),
        to = as.numeric(edge_matrix[,2])
      )
    } else {
      data.frame(from = integer(0), to = integer(0))
    }
    
    # The karate club network is undirected, so edges are drawn without arrowheads
    visNetwork(nodes = node_data, edges = edge_data) %>%
      visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
  })
}

# Run the app
shinyApp(ui, server)

Next steps

Computer lab 4

References

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC. https://plotly-r.com/.
Watt, H., and T. Naidoo. 2025. “Data Wrangling Recipes in R.” https://bookdown.org/hcwatt99/Data_Wrangling_Recipes_in_R/#why-data-wrangling-recipes-in-r.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.
Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media. https://clauswilke.com/dataviz/.