Basic data visualizations using R
 

SIMP59: Data Selection and Visualisation VT25
7.5 credits

nils.holmberg@iko.lu.se

Course literature

Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Wilke (2019)


Watt and Naidoo (2025)

Lecture overview


  • Aesthetics
  • Coordinates
  • Colors
  • Amounts
  • Distribution, histograms
  • Distribution, ecdf-qq
  • Distribution, boxplots
  • Proportions
  • Associations
  • Time-series
  • Trends
  • Geospatial
  • Uncertainty
  • Advanced data visualizations

Data analysis framework

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted.

Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.

Open science communication

  • Replicability: R enables reproducible research with shareable scripts.
  • Transparency: Open-source code makes analyses fully visible.
  • Collaboration: Supports teamwork via GitHub and R Markdown.
  • Open Source: Free and community-driven for open science. 🚀

Notebook formats

  • R Markdown (.Rmd)
    • RStudio
    • VS Code
  • Jupyter notebooks (.ipynb)
    • Python
    • Google Colab
  • Quarto (.qmd)
    • Multilingual

Variable descriptives

  • Measures of central tendency and dispersion, mpg and hp
  • Univariate analysis, summary table of descriptive stats


Variable   Count   Mean        SD          SEM
mpg        32      20.09062    6.026948    1.065424
hp         32      146.68750   68.562868   12.120317
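
The summary table above can be reproduced with base R. This is a minimal sketch, assuming the table was computed from the built-in `mtcars` dataset (the 32 rows and the variable names suggest so); `describe()` is a helper name introduced here for illustration only.

```r
# Descriptive statistics for mpg and hp in the built-in mtcars data.
# describe() is a hypothetical helper, not part of any package.
describe <- function(x) {
  n <- sum(!is.na(x))                  # number of non-missing observations
  s <- sd(x, na.rm = TRUE)             # sample standard deviation
  data.frame(Count = n,
             Mean  = mean(x, na.rm = TRUE),
             SD    = s,
             SEM   = s / sqrt(n))      # standard error of the mean
}

vars <- c("mpg", "hp")
desc <- do.call(rbind, lapply(mtcars[vars], describe))
desc <- cbind(Variable = vars, desc)
rownames(desc) <- NULL
print(desc)
```

The SEM column is simply SD divided by the square root of the count, which is why it shrinks as the sample grows.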

Research questions

  • How does penguin weight vary by species?
  • Expectation (hypothesis) about relationships
  • Measurement variable (weight) by explanatory variable (species)
  • Bivariate analysis, summary table with descriptive stats


species     Count   NA_Body_Mass   Mean_Body_Mass   SD_Body_Mass   SEM_Body_Mass
Adelie      152     1              3700.662         458.5661       37.19462
Chinstrap   68      0              3733.088         384.3351       46.60747
Gentoo      124     1              5076.016         504.1162       45.27097
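
A grouped summary like the one above could be computed as follows. This is a sketch, assuming the data come from the `palmerpenguins` package (column `body_mass_g`) and that `dplyr` is installed; neither is named in the slides. The SEM values in the table appear to divide by the full count, including rows with missing weight, so the sketch does the same.

```r
# Assumes the palmerpenguins and dplyr packages are installed.
library(dplyr)
library(palmerpenguins)

penguin_stats <- penguins %>%
  group_by(species) %>%
  summarise(
    Count          = n(),                        # rows per species, missing weights included
    NA_Body_Mass   = sum(is.na(body_mass_g)),    # number of missing weights
    Mean_Body_Mass = mean(body_mass_g, na.rm = TRUE),
    SD_Body_Mass   = sd(body_mass_g, na.rm = TRUE),
    # Dividing by sqrt(Count) mirrors the table above; dividing by the
    # number of non-missing values would be the stricter choice.
    SEM_Body_Mass  = SD_Body_Mass / sqrt(Count)
  )
print(penguin_stats)
```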

Aesthetic: Mapping

Figure 2.1: Commonly used aesthetics in data visualization: position, shape, size, color, line width, line type. Some of these aesthetics can represent both continuous and discrete data (position, size, line width, color) while others can usually only represent discrete data (shape, line type).

Aesthetic: Mapping

Figure 2.2: Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped onto a position scale, a shape scale, and a color scale. For each scale, each number corresponds to a unique position, shape, or color and vice versa.

Aesthetic: Mapping

Figure 2.5: Fuel efficiency versus displacement, for 32 cars (1973–74 models). This figure uses five separate scales to represent data: (i) the x axis (displacement); (ii) the y axis (fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points (weight); and (v) the shape of the data points (number of cylinders). Four of the five variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous. The remaining one (number of cylinders) can be considered to be either numerical discrete or qualitative ordered. Data source: Motor Trend, 1974.
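
The five scales described in the caption can be sketched in ggplot2 with the built-in `mtcars` data, which also originates from Motor Trend 1974. The axis and legend labels are my own wording, not the book's.

```r
library(ggplot2)

# One aesthetic per variable: x, y, color, size, and shape.
p_scales <- ggplot(mtcars,
                   aes(x = disp, y = mpg,
                       color = hp,             # continuous -> color gradient
                       size = wt,              # continuous -> point size
                       shape = factor(cyl))) + # discrete -> point shape
  geom_point() +
  labs(x = "displacement (cu. in.)", y = "fuel efficiency (mpg)",
       color = "power (hp)", size = "weight (1000 lbs)", shape = "cylinders")
```

Calling `print(p_scales)` draws the figure. Note that `cyl` is wrapped in `factor()` because shape can only represent discrete data.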

Color: Basics

Figure 4.1: Example qualitative color scales. The Okabe Ito scale is the default scale used throughout this book (Okabe and Ito 2008). The ColorBrewer Dark2 scale is provided by the ColorBrewer project (Brewer 2017). The ggplot2 hue scale is the default qualitative scale in the widely used plotting software ggplot2.

Color: Basics

Figure 4.2: Population growth in the U.S. from 2000 to 2010. States in the West and South have seen the largest increases, whereas states in the Midwest and Northeast have seen much smaller increases or even, in the case of Michigan, a decrease. Data source: U.S. Census Bureau

Color: Basics

Figure 4.3: Example sequential color scales. The ColorBrewer Blues scale is a monochromatic scale that varies from dark to light blue. The Heat and Viridis scales are multi-hue scales that vary from dark red to light yellow and from dark blue via green to light yellow, respectively.

Color: Basics

Figure 4.4: Median annual income in Texas counties. The highest median incomes are seen in major Texas metropolitan areas, in particular near Houston and Dallas. No median income estimate is available for Loving County in West Texas and therefore that county is shown in gray. Data source: 2015 Five-Year American Community Survey

Color: Basics

Figure 4.5: Example diverging color scales. Diverging scales can be thought of as two sequential scales stitched together at a common midpoint color. Common color choices for diverging scales include brown to greenish blue, pink to yellow-green, and blue to red.

Color: Basics

Figure 4.6: Percentage of people identifying as white in Texas counties. Whites are in the majority in North and East Texas but not in South or West Texas. Data source: 2010 Decennial U.S. Census
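
The three kinds of color scale shown in the figures above (qualitative, sequential, diverging) map onto different ggplot2 scale functions. The sketch below uses datasets that ship with R and ggplot2 as stand-ins, since the census data behind the maps are not part of the slides; the specific hex colors for the diverging scale are an arbitrary choice.

```r
library(ggplot2)

# Qualitative: the Okabe-Ito colors ship with base R (>= 4.0).
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))
p_qual <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = okabe_ito[2:4])    # skip black, the first entry

# Sequential: a single hue running from light to dark.
p_seq <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_distiller(palette = "Blues", direction = 1)

# Diverging: two sequential scales joined at a neutral midpoint.
cors <- as.data.frame(as.table(cor(mtcars)))     # columns Var1, Var2, Freq
p_div <- ggplot(cors, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "#B2182B", mid = "white", high = "#2166AC",
                       midpoint = 0, limits = c(-1, 1), name = "correlation")
```

A diverging scale only makes sense when the data have a meaningful midpoint, here a correlation of zero.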

Coordinate: Systems axes

Figure 3.1: Standard cartesian coordinate system. The horizontal axis is conventionally called x and the vertical axis y. The two axes form a grid with equidistant spacing. Here, both the x and the y grid lines are separated by units of one. The point (2, 1) is located two x units to the right and one y unit above the origin (0, 0). The point (-1, -1) is located one x unit to the left and one y unit below the origin.

Coordinate: Systems axes

Figure 3.2: Daily temperature normals for Houston, TX. Temperature is mapped to the y axis and day of the year to the x axis. Parts (a), (b), and (c) show the same figure in different aspect ratios. All three parts are valid visualizations of the temperature data. Data source: NOAA.

Coordinate: Systems axes

Figure 3.4: Relationship between linear and logarithmic scales. The dots correspond to data values 1, 3.16, 10, 31.6, 100, which are evenly-spaced numbers on a logarithmic scale. We can display these data points on a linear scale, we can log-transform them and then show on a linear scale, or we can show them on a logarithmic scale. Importantly, the correct axis title for a logarithmic scale is the name of the variable shown, not the logarithm of that variable.
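
The difference between log-transforming the data and using a logarithmic axis can be sketched with the five values from the caption. Variable names and plot objects are illustrative.

```r
library(ggplot2)

d <- data.frame(x = c(1, 3.16, 10, 31.6, 100), y = 1)

# Logarithmic axis: ticks show original values, the title is the variable name.
p_log <- ggplot(d, aes(x, y)) +
  geom_point() +
  scale_x_log10(name = "x")

# Log-transformed data on a linear axis: the title must say that it is a log.
p_lin <- ggplot(d, aes(log10(x), y)) +
  geom_point() +
  labs(x = "log10(x)")

# The points are evenly spaced on the log scale: steps of about 0.5.
log_steps <- diff(log10(d$x))
print(log_steps)
```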

Coordinate: Systems axes

Figure 3.5: Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have approximately 100 times more inhabitants than the median county, and the least populous counties have approximately 100 times fewer inhabitants than the median county. Data source: 2010 Decennial U.S. Census.

Coordinate: Systems axes

Figure 3.6: Population sizes of Texas counties relative to their median value. By displaying a ratio on a linear scale, we have overemphasized ratios > 1 and have obscured ratios < 1. As a general rule, ratios should not be displayed on a linear scale. Data source: 2010 Decennial U.S. Census.

Coordinate: Systems axes

Figure 3.7: Relationship between linear and square-root scales. The dots correspond to data values 0, 1, 4, 9, 16, 25, 36, 49, which are evenly-spaced numbers on a square-root scale, since they are the squares of the integers from 0 to 7. We can display these data points on a linear scale, we can square-root-transform them and then show on a linear scale, or we can show them on a square-root scale.

Coordinate: Systems axes

Figure 3.8: Areas of Northeastern U.S. states. (a) Areas shown on a linear scale. (b) Areas shown on a square-root scale. Data source: Google.

Coordinate: Systems axes

Figure 3.9: Relationship between Cartesian and polar coordinates. (a) Three data points shown in a Cartesian coordinate system. (b) The same three data points shown in a polar coordinate system. We have taken the x coordinates from part (a) and used them as angular coordinates and the y coordinates from part (a) and used them as radial coordinates. The circular axis runs from 0 to 4 in this example, and therefore x = 0 and x = 4 are the same locations in this coordinate system.
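
In ggplot2, switching from Cartesian to polar coordinates is a one-line change. The sketch below uses three made-up points (not the ones in the book figure) and a helper `to_angle()`, introduced here, to show why x = 0 and x = 4 coincide.

```r
library(ggplot2)

pts <- data.frame(x = c(1, 2, 3), y = c(1, 2, 1))   # illustrative values only

p_cart <- ggplot(pts, aes(x, y)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 4)) +
  scale_y_continuous(limits = c(0, 2.5))

# theta = "x" uses x as the angle and y as the radius.
p_polar <- p_cart + coord_polar(theta = "x")

# With an angular axis running from 0 to x_max, x maps to this angle in radians.
to_angle <- function(x, x_max = 4) 2 * pi * x / x_max
to_angle(c(0, 4))   # 0 and a full turn, i.e. the same direction
```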

Amounts: Bar Plots

Figure 6.1: Highest grossing movies for the weekend of December 22-24, 2017, displayed as a bar plot. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission

Amounts: Bar Plots (code)

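# Assumes the tidyverse is installed and that theme_dviz_hgrid() comes from
# the dviz.supp package that accompanies Wilke (2019).
library(tidyverse)
library(dviz.supp)

# boxoffice is assumed to be a data frame with the columns
# title_short, rank, and amount (weekend gross in USD).
boxoffice %>%
  # order the bars by box-office rank instead of alphabetically
  ggplot(aes(x = fct_reorder(title_short, rank), y = amount)) +
    geom_col(fill = "#56B4E9", width = 0.6, alpha = 0.9) +
    # let the bars start at the axis and label the ticks in millions
    scale_y_continuous(expand = c(0, 0),
                       breaks = c(0, 2e7, 4e7, 6e7),
                       labels = c("0", "20", "40", "60"),
                       name = "weekend gross (million USD)") +
    scale_x_discrete(name = NULL,
                     expand = c(0, 0.4)) +
    coord_cartesian(clip = "off") +
    theme_dviz_hgrid(12, rel_small = 1) +
    # drop the x axis line and ticks; the bars themselves mark the categories
    theme(
      #axis.ticks.length = grid::unit(0, "pt"),
      axis.line.x = element_blank(),
      axis.ticks.x = element_blank()
    )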

Amounts: Bar Plots

Figure 6.2: Highest grossing movies for the weekend of December 22-24, 2017, displayed as a bar plot with rotated axis tick labels. Rotated axis tick labels tend to be difficult to read and require awkward space use underneath the plot. For these reasons, I generally consider plots with rotated tick labels to be ugly. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission

Amounts: Bar Plots

Figure 6.3: Highest grossing movies for the weekend of December 22-24, 2017, displayed as a horizontal bar plot. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission
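
A horizontal bar plot ordered by value can be sketched as follows. Because the box-office data are not included in the slides, the `mpg` dataset from ggplot2 serves as a stand-in, counting car models per manufacturer. Mapping the category to `y` relies on the automatic orientation detection available in ggplot2 3.3.0 and later.

```r
library(ggplot2)

# Number of car models per manufacturer; columns: manufacturer, Freq.
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))

p_hbar <- ggplot(counts, aes(x = Freq, y = reorder(manufacturer, Freq))) +
  geom_col(fill = "#56B4E9") +
  # let the bars start directly at the axis
  scale_x_continuous(expand = expansion(mult = c(0, 0.05)),
                     name = "number of car models") +
  labs(y = NULL)
```

`reorder(manufacturer, Freq)` sorts the bars by their length, which is the meaningful ordering when the categories have no natural order of their own.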

Amounts: Bar Plots

Figure 6.4: Highest grossing movies for the weekend of December 22-24, 2017, displayed as a horizontal bar plot. Here, the bars have been placed in descending order of the lengths of the movie titles. This arrangement of bars is arbitrary, it doesn’t serve a meaningful purpose, and it makes the resulting figure much less intuitive than Figure 6.3. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission

Amounts: Bar Plots

Figure 6.5: 2016 median U.S. annual household income versus age group. The 45–54 year age group has the highest median income. Data source: United States Census Bureau

Amounts: Bar Plots

Figure 6.6: 2016 median U.S. annual household income versus age group, sorted by income. While this order of bars looks visually appealing, the order of the age groups is now confusing. Data source: United States Census Bureau

Amounts: Dot Plots And Heatmaps

Figure 6.11: Life expectancies of countries in the Americas, for the year 2007. Data source: Gapminder project

Amounts: Dot Plots And Heatmaps

Figure 6.12: Life expectancies of countries in the Americas, for the year 2007, shown as bars. This dataset is not suitable for being visualized with bars. The bars are too long and they draw attention away from the key feature of the data, the differences in life expectancy among the different countries. Data source: Gapminder project

Amounts: Dot Plots And Heatmaps

Figure 6.13: Life expectancies of countries in the Americas, for the year 2007. Here, the countries are ordered alphabetically, which causes the dots to form a disordered cloud of points. This makes the figure difficult to read, and therefore it deserves to be labeled as “bad.” Data source: Gapminder project

Amounts: Dot Plots And Heatmaps

Figure 6.14: Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016. Data source: World Bank

Amounts: Grouped And Stacked Bars

Figure 6.7: 2016 median U.S. annual household income versus age group and race. Age groups are shown along the x axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively. Data source: United States Census Bureau

Amounts: Grouped And Stacked Bars

Figure 6.8: 2016 median U.S. annual household income versus age group and race. In contrast to Figure 6.7, now race is shown along the x axis, and for each race we show seven bars according to the seven age groups. Data source: United States Census Bureau

Amounts: Grouped And Stacked Bars

Figure 6.9: 2016 median U.S. annual household income versus age group and race. Instead of displaying this data as a grouped bar plot, as in Figures 6.7 and 6.8, we now show the data as four separate regular bar plots. This choice has the advantage that we don’t need to encode either categorical variable by bar color. Data source: United States Census Bureau

Amounts: Grouped And Stacked Bars

Figure 6.10: Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class.
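
Stacked and grouped bars differ only in the `position` argument of `geom_col()`. The sketch below uses the `Titanic` contingency table that ships with R, which may differ in detail from the passenger data used in the book figure.

```r
library(ggplot2)

titanic <- as.data.frame(Titanic)                  # columns: Class, Sex, Age, Survived, Freq
passengers <- droplevels(subset(titanic, Class != "Crew"))   # keep 1st, 2nd, 3rd class
by_class_sex <- aggregate(Freq ~ Class + Sex, data = passengers, FUN = sum)

# Stacked: one bar per class, segments for female and male passengers.
p_stacked <- ggplot(by_class_sex, aes(x = Class, y = Freq, fill = Sex)) +
  geom_col(position = "stack") +
  labs(y = "passengers")

# Grouped: female and male bars side by side within each class.
p_grouped <- ggplot(by_class_sex, aes(x = Class, y = Freq, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(y = "passengers")
```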

Histograms: Density plots

Figure 7.1: Histogram of the ages of Titanic passengers.

Histograms: Density plots

Figure 7.2: Histograms depend on the chosen bin width. Here, the same age distribution of Titanic passengers is shown with four different bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years.

Histograms: Density plots

Figure 7.3: Kernel density estimate of the age distribution of passengers on the Titanic. The height of the curve is scaled such that the area under the curve equals one. The density estimate was performed with a Gaussian kernel and a bandwidth of 2.

Histograms: Density plots

Figure 7.4: Kernel density estimates depend on the chosen kernel and bandwidth. Here, the same age distribution of Titanic passengers is shown for four different combinations of these parameters: (a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2; (c) Gaussian kernel, bandwidth = 5; (d) Rectangular kernel, bandwidth = 2.
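
The effect of kernel and bandwidth can be explored with base R's `density()`. The Titanic ages are not bundled with R, so the sketch uses `faithful$waiting` as stand-in data; the helper `area()` is introduced here to check that each curve integrates to roughly one.

```r
x <- faithful$waiting   # stand-in data, not the Titanic ages

d_narrow <- density(x, bw = 0.5, kernel = "gaussian")
d_mid    <- density(x, bw = 2,   kernel = "gaussian")
d_wide   <- density(x, bw = 5,   kernel = "gaussian")
d_rect   <- density(x, bw = 2,   kernel = "rectangular")

# Trapezoid rule: area under an estimated density curve.
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
areas <- sapply(list(d_narrow, d_mid, d_wide, d_rect), area)
print(round(areas, 3))

# Draw the four estimates in a 2 x 2 layout.
op <- par(mfrow = c(2, 2))
plot(d_narrow, main = "Gaussian, bw = 0.5")
plot(d_mid,    main = "Gaussian, bw = 2")
plot(d_wide,   main = "Gaussian, bw = 5")
plot(d_rect,   main = "Rectangular, bw = 2")
par(op)
```

In ggplot2 the same parameters are available as `geom_density(bw = 2, kernel = "gaussian")`.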

Histograms: Density plots

Figure 7.5: Kernel density estimates can extend the tails of the distribution into areas where no data exist and no data are even possible. Here, the density estimate has been allowed to extend into the negative age range. This is clearly nonsensical and should be avoided.

Histograms: Density plots

Figure 7.6: Histogram of the ages of Titanic passengers stratified by gender. This figure has been labeled as “bad” because stacked histograms are easily confused with overlapping histograms (see also Figure 7.7). In addition, the heights of the bars representing female passengers cannot easily be compared to each other.

Histograms: Density plots

Figure 7.7: Age distributions of male and female Titanic passengers, shown as two overlapping histograms. This figure has been labeled as “bad” because there is no clear visual indication that all blue bars start at a count of 0.

Histograms: Density plots

Figure 7.8: Density estimates of the ages of male and female Titanic passengers. To highlight that there were more male than female passengers, the density curves were scaled such that the area under each curve corresponds to the total number of male and female passengers with known age (468 and 288, respectively).

Boxplots: Violins

Figure 9.1: Mean daily temperatures in Lincoln, Nebraska in 2016. Points represent the average daily mean temperatures for each month, averaged over all days of the month, and error bars represent twice the standard deviation of the daily mean temperatures within each month. This figure has been labeled as “bad” because error bars are conventionally used to visualize the uncertainty of an estimate, not the variability in a population. Data source: Weather Underground

Boxplots: Violins

Figure 9.2: Anatomy of a boxplot. Shown are a cloud of points (left) and the corresponding boxplot (right). Only the y values of the points are visualized in the boxplot. The line in the middle of the boxplot represents the median, and the box encloses the middle 50% of the data. The top and bottom whiskers extend either to the maximum and minimum of the data or to the maximum or minimum that falls within 1.5 times the height of the box, whichever yields the shorter whisker. The distances of 1.5 times the height of the box in either direction are called the upper and the lower fences. Individual data points that fall beyond the fences are referred to as outliers and are usually shown as individual dots.
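
The anatomy described in the caption can be worked through numerically with `boxplot.stats()` from base R. The ten values are made up for illustration. Note that `boxplot.stats()` defines the box by Tukey's hinges, which can differ slightly from the quartiles that `geom_boxplot()` computes.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # illustrative values with one extreme point

bs <- boxplot.stats(x, coef = 1.5)
hinges <- bs$stats[c(2, 4)]              # lower and upper edge of the box
box_height <- diff(hinges)               # 8 - 3 = 5
fences <- c(lower = hinges[1] - 1.5 * box_height,
            upper = hinges[2] + 1.5 * box_height)

bs$stats   # 1 3 5.5 8 9: lower whisker, box bottom, median, box top, upper whisker
fences     # -4.5 and 15.5
bs$out     # 30: the only point beyond the fences
```

The upper whisker stops at 9, the largest value inside the upper fence, rather than at the fence itself.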

Boxplots: Violins

Figure 9.3: Mean daily temperatures in Lincoln, Nebraska, visualized as boxplots.

Boxplots: Violins

Figure 9.4: Anatomy of a violin plot. Shown are a cloud of points (left) and the corresponding violin plot (right). Only the y values of the points are visualized in the violin plot. The width of the violin at a given y value represents the point density at that y value. Technically, a violin plot is a density estimate rotated by 90 degrees and then mirrored. Violins are therefore symmetric. Violins begin and end at the minimum and maximum data values, respectively. The thickest part of the violin corresponds to the highest point density in the dataset.

Boxplots: Violins

Figure 9.5: Mean daily temperatures in Lincoln, Nebraska, visualized as violin plots.

Boxplots: Violins

Figure 9.6: Mean daily temperatures in Lincoln, Nebraska, visualized as a strip chart. Each point represents the mean temperature for one day. This figure is labeled as “bad” because so many points are plotted on top of each other that it is not possible to ascertain which temperatures were the most common in each month.

Boxplots: Violins

Figure 9.7: Mean daily temperatures in Lincoln, Nebraska, visualized as a strip chart. The points have been jittered along the x axis to better show the density of points at each temperature value.
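
Boxplots, violins, and strip charts share the same mapping and differ only in the geom. The Lincoln temperature data are not part of the slides, so this sketch uses the built-in `airquality` dataset (daily maximum temperatures in New York, May to September 1973) as a stand-in.

```r
library(ggplot2)

# Turn the numeric month into an ordered label (May ... Sep).
temps <- transform(airquality,
                   Month = factor(month.abb[Month], levels = month.abb[5:9]))

base <- ggplot(temps, aes(x = Month, y = Temp)) +
  labs(x = NULL, y = "daily maximum temperature (F)")

p_box    <- base + geom_boxplot()
p_violin <- base + geom_violin()
p_strip  <- base + geom_point()                       # heavy overplotting
p_jitter <- base + geom_point(                        # jitter along x only
  position = position_jitter(width = 0.15, height = 0),
  alpha = 0.6)
```

Setting `height = 0` matters: jittering along the y axis would change the temperature values that the plot is meant to show.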

Boxplots: Violins

Figure 9.12: Voting patterns in the U.S. House of Representatives have become increasingly polarized. DW-NOMINATE scores are frequently used to compare voting patterns of representatives between parties and over time. Here, score distributions are shown for each Congress from 1963 to 2013 separately for Democrats and Republicans. Each Congress is represented by its first year. Original figure concept: McDonald (2017).

Visualizing: Proportions

Figure 10.1: Party composition of the 8th German Bundestag, 1976–1980, visualized as a pie chart. This visualization shows clearly that the ruling coalition of SPD and FDP had a small majority over the opposition CDU/CSU.

Visualizing: Proportions

Figure 10.2: Party composition of the 8th German Bundestag, 1976–1980, visualized as stacked bars. (a) Bars stacked vertically. (b) Bars stacked horizontally. It is not immediately obvious that SPD and FDP jointly had more seats than CDU/CSU.

Visualizing: Proportions

Figure 10.3: Party composition of the 8th German Bundestag, 1976–1980, visualized as side-by-side bars. As in Figure 10.2, it is not immediately obvious that SPD and FDP jointly had more seats than CDU/CSU.

Visualizing: Proportions

Figure 10.4: Market share of five hypothetical companies, A–E, for the years 2015–2017, visualized as pie charts. This visualization has two major problems: 1. A comparison of relative market share within years is nearly impossible. 2. Changes in market share across years are difficult to see.

Visualizing: Proportions

Figure 10.5: Market share of five hypothetical companies for the years 2015–2017, visualized as stacked bars. This visualization has two major problems: 1. A comparison of relative market shares within years is difficult. 2. Changes in market share across years are difficult to see for the middle companies B, C, and D, because the location of the bars changes across years.

Visualizing: Proportions

Figure 10.6: Market share of five hypothetical companies for the years 2015–2017, visualized as side-by-side bars.
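
Pie charts, stacked bars, and side-by-side bars can all be built from the same table of proportions. The company data in the figures are hypothetical and not available here, so this sketch uses the share of drive types (`drv`) per model year in ggplot2's `mpg` dataset as a stand-in.

```r
library(ggplot2)

# Share of each drive type within a model year; columns: year, drv, Freq.
shares <- as.data.frame(
  prop.table(table(year = mpg$year, drv = mpg$drv), margin = 1))

# Stacked bars: one bar per year.
p_stack <- ggplot(shares, aes(x = year, y = Freq, fill = drv)) +
  geom_col() +
  labs(y = "share")

# Side-by-side bars: every share starts at zero, so they are easy to compare.
p_side <- ggplot(shares, aes(x = drv, y = Freq, fill = year)) +
  geom_col(position = "dodge") +
  labs(y = "share")

# Pie charts: a stacked bar drawn in polar coordinates, one pie per year.
p_pie <- ggplot(shares, aes(x = "", y = Freq, fill = drv)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  facet_wrap(~ year)
```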

Visualizing: Associations

Figure 12.1: Head length (measured from the tip of the bill to the back of the head, in mm) versus body mass (in gram), for 123 blue jays. Each dot corresponds to one bird. There is a moderate tendency for heavier birds to have longer heads. Data source: Keith Tarvin, Oberlin College

Visualizing: Associations

Figure 12.2: Head length versus body mass for 123 blue jays. The birds’ sex is indicated by color. At the same body mass, male birds tend to have longer heads (and specifically, longer bills) than female birds. Data source: Keith Tarvin, Oberlin College

Visualizing: Associations

Figure 12.3: Head length versus body mass for 123 blue jays. The birds’ sex is indicated by color, and the birds’ skull size by symbol size. Head-length measurements include the length of the bill while skull-size measurements do not. Head length and skull size tend to be correlated, but there are some birds with unusually long or short bills given their skull size. Data source: Keith Tarvin, Oberlin College

Visualizing: Associations

Figure 12.4: All-against-all scatter plot matrix of head length, body mass, and skull size, for 123 blue jays. This figure shows the exact same data as Figure 12.3. However, because we are better at judging position than symbol size, correlations between skull size and the other two variables are easier to perceive in the pairwise scatter plots than in Figure 12.3. Data source: Keith Tarvin, Oberlin College

Visualizing: Associations

Figure 12.5: Examples of correlations of different magnitude and direction, with associated correlation coefficient r. In both rows, from left to right correlations go from weak to strong. In the top row the correlations are positive (larger values for one quantity are associated with larger values for the other) and in the bottom row they are negative (larger values for one quantity are associated with smaller values for the other). In all six panels, the sets of x and y values are identical, but the pairings between individual x and y values have been reshuffled to generate the specified correlation coefficients.
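
The reshuffling idea in the caption can be sketched in a few lines: keep the sets of x and y values fixed and change only how they are paired. The simulated values and the seed are arbitrary, and this is not the procedure used to generate the book figure.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(50)
y_sorted <- sort(rnorm(50))

y_pos    <- y_sorted[rank(x)]        # smallest x paired with smallest y
y_random <- sample(y_sorted)         # random pairing
y_neg    <- rev(y_sorted)[rank(x)]   # smallest x paired with largest y

# Same y values in all three cases, very different correlation coefficients.
r <- c(positive = cor(x, y_pos),
       random   = cor(x, y_random),
       negative = cor(x, y_neg))
print(round(r, 2))
```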

Visualizing: Associations

Figure 12.6: Correlations in mineral content for 214 samples of glass fragments obtained during forensic work. The dataset contains seven variables measuring the amounts of magnesium (Mg), calcium (Ca), iron (Fe), potassium (K), sodium (Na), aluminum (Al), and barium (Ba) found in each glass fragment. The colored tiles represent the correlations between pairs of these variables. Data source: B. German

Time: Series

Figure 13.1: Monthly submissions to the preprint server bioRxiv, from its inception in November 2013 until April 2018. Each dot represents the number of submissions in one month. There has been a steady increase in submission volume throughout the entire 4.5-year period. Data source: Jordan Anaya, http://www.prepubmed.org/

Time: Series

Figure 13.2: Monthly submissions to the preprint server bioRxiv, shown as dots connected by lines. The lines do not represent data but are only meant as a guide to the eye. By connecting the individual dots with lines, we emphasize that there is an order between the dots: each dot has exactly one neighbor that comes before and one that comes after. Data source: Jordan Anaya, http://www.prepubmed.org/

Time: Series

Figure 13.3: Monthly submissions to the preprint server bioRxiv, shown as a line graph without dots. Omitting the dots emphasizes the overall temporal trend while de-emphasizing individual observations at specific time points. It is particularly useful when the time points are spaced very densely. Data source: Jordan Anaya, http://www.prepubmed.org/

Time: Series

Figure 13.4: Monthly submissions to the preprint server bioRxiv, shown as a line graph with filled area underneath. By filling the area under the curve, we put even more emphasis on the overarching temporal trend than if we just draw a line (Figure 13.3). Data source: Jordan Anaya, http://www.prepubmed.org/
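
The four variants above (dots, dots with lines, line only, filled area) differ only in the geoms layered onto the same mapping. The preprint data are not included in the slides, so the sketch uses the `economics` dataset from ggplot2 (monthly U.S. unemployment, in thousands) as a stand-in.

```r
library(ggplot2)

base <- ggplot(economics, aes(x = date, y = unemploy / 1000)) +
  labs(x = NULL, y = "unemployed (millions)")

p_dots       <- base + geom_point(size = 0.5)
p_dots_lines <- base + geom_line() + geom_point(size = 0.5)
p_line       <- base + geom_line()
p_area       <- base +
  geom_area(fill = "#56B4E9", alpha = 0.5) +   # filled area emphasizes the trend
  geom_line()
```

`geom_area()` anchors the fill at zero, so it is appropriate only when the y axis starts at zero and the quantity is an amount.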

Time: Series

Figure 13.5: Monthly submissions to three preprint servers covering biomedical research: bioRxiv, the q-bio section of arXiv, and PeerJ Preprints. Each dot represents the number of submissions in one month to the respective preprint server. This figure is labeled “bad” because the three time courses visually interfere with each other and are difficult to read. Data source: Jordan Anaya, http://www.prepubmed.org/

Time: Series

Figure 13.6: Monthly submissions to three preprint servers covering biomedical research. By connecting the dots in Figure 13.5 with lines, we help the viewer follow each individual time course. Data source: Jordan Anaya, http://www.prepubmed.org/

Time: Series

Figure 13.7: Monthly submissions to three preprint servers covering biomedical research. By directly labeling the lines instead of providing a legend, we have reduced the cognitive load required to read the figure. And the elimination of the legend removes the need for points of different shapes. Thus, we could streamline the figure further by eliminating the dots. Data source: Jordan Anaya, http://www.prepubmed.org/

Geospatial: Data

Figure 15.1: Orthographic projection of the world, showing Europe and Northern Africa as they would be visible from space. The lines emanating from the north pole and running south are called meridians, and the lines running orthogonal to the meridians are called parallels. All meridians have the same length but parallels become shorter the closer we are to either pole.

Geospatial: Data

Figure 15.2: Mercator projection of the world. In this projection, parallels are straight horizontal lines and meridians are straight vertical lines. It is a conformal projection preserving local angles, but it introduces severe distortions in areas near the poles. For example, Greenland appears to be bigger than Africa in this projection, when in reality Africa is fourteen times bigger than Greenland (see Figures 15.1 and 15.3).
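
The area distortion of the Mercator projection follows from its definition: at latitude φ the local linear scale is 1 / cos(φ), so areas are inflated by 1 / cos²(φ). The short sketch below evaluates this; the function names are introduced here, and 72°N is only a rough, assumed latitude for central Greenland.

```r
# Mercator y coordinate (unit sphere) for a latitude given in degrees.
mercator_y <- function(lat_deg) log(tan(pi / 4 + lat_deg * pi / 180 / 2))

# Factor by which areas are inflated at a given latitude.
area_inflation <- function(lat_deg) 1 / cos(lat_deg * pi / 180)^2

inflation <- area_inflation(c(0, 30, 60, 72))
print(round(inflation, 2))   # no distortion at the equator, about 10x at 72 degrees
```

Land near the equator, such as most of Africa, is drawn close to its true relative size, while land at high latitudes is enlarged roughly tenfold or more.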

Geospatial: Data

Figure 15.3: Interrupted Goode homolosine projection of the world. This projection accurately preserves areas while minimizing angular distortions, at the cost of showing oceans and some land masses (Greenland, Antarctica) in a non-contiguous way.

Geospatial: Data

Figure 15.4: Relative locations of Alaska, Hawaii, and the lower 48 states shown on a globe.

Geospatial: Data

Figure 15.5: Map of the United States of America, using an area-preserving Albers projection (ESRI:102003, commonly used to project the lower 48 states). Alaska and Hawaii are shown in their true locations.

Geospatial: Data

Figure 15.6: Visualization of the United States, with the states of Alaska and Hawaii moved to lie underneath the lower 48 states. Alaska also has been scaled so its linear extent is only 35% of the state’s true size. (In other words, the state’s area has been reduced to approximately 12% of its true size.) Such a scaling is frequently applied to Alaska, to make it visually appear to be of similar size as typical midwestern or western states. However, the scaling is highly misleading, and therefore the figure has been labeled as “bad”.
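
The relationship between the 35% and 12% figures in the caption is plain geometry: scaling both linear dimensions by a factor scales the area by its square.

```r
linear_scale <- 0.35             # Alaska's linear extent relative to its true size
area_scale   <- linear_scale^2   # area shrinks with the square of the linear scale
area_scale                       # 0.1225, i.e. roughly 12% of the true area
```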

Geospatial: Data

Figure 15.7: Visualization of the United States, with the states of Alaska and Hawaii moved to lie underneath the lower 48 states.

Geospatial: Data

Figure 15.8: Wind turbines in the San Francisco Bay Area. Individual wind turbines are shown as purple-colored dots. Two regions with a high concentration of wind turbines are highlighted with black rectangles. I refer to the wind turbines near Rio Vista collectively as the Shiloh Wind Farm. Map tiles by Stamen Design, under CC BY 3.0. Map data by OpenStreetMap, under ODbL. Wind turbine data: United States Wind Turbine Database

Geospatial: Data

Figure 15.11: Population density in every U.S. county, shown as a choropleth map. Population density is reported as persons per square kilometer. Data source: 2015 Five-Year American Community Survey

Visualizing: Uncertainty

Figure 16.5: Relationship between sample, sample mean, standard deviation, standard error, and confidence intervals, in an example of chocolate bar ratings. The observations (shown as jittered green dots) that make up the sample represent expert ratings of 125 chocolate bars from manufacturers in Canada, rated on a scale from 1 (unpleasant) to 5 (elite). The large orange dot represents the mean of the ratings. Error bars indicate, from top to bottom, twice the standard deviation, twice the standard error (standard deviation of the mean), and 80%, 95%, and 99% confidence intervals of the mean. Data source: Brady Brelinski, Manhattan Chocolate Society
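
The quantities named in the caption can be computed step by step in base R. The chocolate ratings are not bundled with R, so the sketch uses `mtcars$mpg` as a stand-in sample; `ci()` is a helper introduced here and assumes a t-based interval for the mean.

```r
x  <- mtcars$mpg          # stand-in sample, not the chocolate ratings
n  <- length(x)
m  <- mean(x)             # sample mean
s  <- sd(x)               # standard deviation: spread of the observations
se <- s / sqrt(n)         # standard error: uncertainty of the mean

# t-based confidence interval of the mean for a given confidence level.
ci <- function(level) {
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(lower = m - half, upper = m + half)
}

intervals <- t(sapply(c("80%" = 0.80, "95%" = 0.95, "99%" = 0.99), ci))
print(intervals)
```

Because the standard error divides by the square root of n, a sample three times as large gives intervals that are narrower by a factor of roughly 1.7, other things being equal.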

Visualizing: Uncertainty

Figure 16.6: Confidence intervals widen with smaller sample size. Chocolate bars from Canada and Switzerland have comparable mean ratings and comparable standard deviations (indicated with simple black error bars). However, over three times as many Canadian bars were rated as Swiss bars, and therefore the confidence intervals (indicated with error bars of different colors and thickness drawn on top of one another) are substantially wider for the mean of the Swiss ratings than for the mean of the Canadian ratings. Data source: Brady Brelinski, Manhattan Chocolate Society

Visualizing: Uncertainty

Figure 16.7: Mean chocolate flavor ratings and associated confidence intervals for chocolate bars from manufacturers in six different countries. Data source: Brady Brelinski, Manhattan Chocolate Society

Visualizing: Uncertainty

Figure 16.8: Mean chocolate flavor ratings for manufacturers from five different countries, relative to the mean rating of U.S. chocolate bars. Canadian chocolate bars are significantly higher rated than U.S. bars. For the other four countries there is no significant difference in mean rating to the U.S. at the 95% confidence level. Confidence levels have been adjusted for multiple comparisons using Dunnett’s method. Data source: Brady Brelinski, Manhattan Chocolate Society

Visualizing: Uncertainty

Figure 16.9: Mean chocolate flavor ratings for manufacturers from four different countries, relative to the mean rating of U.S. chocolate bars. Each panel uses a different approach to visualizing the same uncertainty information. (a) Graded error bars with cap. (b) Graded error bars without cap. (c) Single-interval error bars with cap. (d) Single-interval error bars without cap. (e) Confidence strips. (f) Confidence distributions.

Visualizing: Uncertainty

Figure 16.10: Mean butterfat contents in the milk of four cattle breeds. Error bars indicate +/- one standard error of the mean. Visualizations of this type are frequently seen in the scientific literature. While they are technically correct, they represent neither the variation within each category nor the uncertainty of the sample means particularly well. See Figure 7.11 for the variation in butterfat contents within individual breeds. Data source: Canadian Record of Performance for Purebred Dairy Cattle

Visualizing: Uncertainty

Figure 16.15: Head length versus body mass for male blue jays, as in Figure 14.7. The straight blue line represents the best linear fit to the data, and the gray band around the line shows the uncertainty in the linear fit. The gray band represents a 95% confidence level. Data source: Keith Tarvin, Oberlin College
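
A linear fit with a 95% confidence band is what `geom_smooth(method = "lm")` draws. The blue jay data are not part of the slides, so this sketch fits fuel efficiency against weight in `mtcars` as a stand-in, and also computes the same band by hand with `predict()`.

```r
library(ggplot2)

# Fit and band drawn by ggplot2.
p_fit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95,
              color = "#0072B2", fill = "grey70")

# The same band computed explicitly: fitted value plus lower and upper limits.
fit  <- lm(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
head(band)   # columns: fit, lwr, upr
```

Graded bands can be drawn by adding several `geom_smooth()` layers with different `level` values, widest first.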

Visualizing: Uncertainty

Figure 16.16: Head length versus body mass for male blue jays. In contrast to Figure 16.15, the straight blue lines now represent equally likely alternative fits randomly drawn from the posterior distribution. Data source: Keith Tarvin, Oberlin College

Visualizing: Uncertainty

Figure 16.17: Head length versus body mass for male blue jays. As in the case of error bars, we can draw graded confidence bands to highlight the uncertainty in the estimate. Data source: Keith Tarvin, Oberlin College

Next steps

Computer lab 2

References

Watt, H., and T. Naidoo. 2025. “Data Wrangling Recipes in R.” https://bookdown.org/hcwatt99/Data_Wrangling_Recipes_in_R/#why-data-wrangling-recipes-in-r.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.
Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media. https://clauswilke.com/dataviz/.