Basic data selection and transformations using R
 

SIMP59: Data Selection and Visualisation VT25
7.5 credits

nils.holmberg@iko.lu.se

Presentation

  • Nils Holmberg
  • Computational content analysis
  • Cognitive communication effects

Course literature

Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Wilke (2019)

💻

Watt and Naidoo (2025)

Lecture overview


  • Data analysis framework
  • Data types
  • Data frames
  • Data structures
  • Selecting data
  • Importing data
  • Subsetting data
  • Research questions
  • Transforming data
  • Adding data
  • Grouping data
  • Reshaping data
  • Exporting data frames
  • Visualizing data

Data analysis framework

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted.

Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.

Open science communication

  • Replicability: R enables reproducible research with shareable scripts.
  • Transparency: Open-source code makes analyses fully visible.
  • Collaboration: Supports teamwork via GitHub and R Markdown.
  • Open Source: Free and community-driven for open science. 🚀

Notebook formats

  • R Markdown (.Rmd)
    • RStudio
    • VS Code
  • Jupyter notebooks (.ipynb)
    • Python
    • Google Colab
  • Quarto (.qmd)
    • Multilingual

What is data?

  • Plural of the Latin "datum"
  • A datum is a single piece of information
  • Data consist of multiple observations
  • Can be qualitative or quantitative

Data types in R

# Numeric (default type for numbers in R)
print(class(3.14))  # "numeric"
[1] "numeric"
# Integer (specified with an "L" suffix)
print(class(42L))  # "integer"
[1] "integer"
# Character (string)
print(class("Hello, R!"))  # "character"
[1] "character"
# Logical (Boolean: TRUE or FALSE)
print(class(TRUE))  # "logical"
[1] "logical"
# Factor (categorical variable)
print(class(factor(c("low", "medium", "high", "medium"))))  # "factor"
[1] "factor"

What is a variable?

# Checking type conversion
print(class(as.character(3.14)))  # "character"
[1] "character"
print(class(as.numeric("3.14")))  # "numeric"
[1] "numeric"
# Vector (homogeneous data type)
print(class(c(1, 2, 3, 4, 5)))  # "numeric"
[1] "numeric"
# List (heterogeneous data types)
print(class(list(3.14, "Hello, R!", TRUE)))  # "list"
[1] "list"
# Data Frame (tabular structure with columns of different types)
print(class(data.frame(ID = 1:3, Name = c("Alice", "Bob", "Charlie"), Score = c(85.5, 90.3, 78.9))))  # "data.frame"
[1] "data.frame"

Variable descriptives

  • Measures of central tendency and dispersion, mpg and hp
  • Univariate analysis, summary table of descriptive stats


Variable  Count  Mean       SD         SEM
mpg       32     20.09062   6.026948   1.065424
hp        32     146.68750  68.562868  12.120317
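
A summary table like the one above can be computed from the built-in mtcars data. This is a minimal sketch: the helper name describe() is an illustrative assumption, and the SEM is taken as the standard deviation divided by the square root of the number of non-missing values.

```r
# Hypothetical helper: count, mean, standard deviation and
# standard error of the mean (SEM) for one numeric vector
describe <- function(x) {
  n <- sum(!is.na(x))
  c(
    Count = n,
    Mean  = mean(x, na.rm = TRUE),
    SD    = sd(x, na.rm = TRUE),
    SEM   = sd(x, na.rm = TRUE) / sqrt(n)
  )
}

# Apply the helper to mpg and hp; transpose so each variable is a row
descriptives <- t(sapply(mtcars[c("mpg", "hp")], describe))
print(descriptives)
```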

Data frames in R

  • Structure: A data frame is R’s standard tabular data structure, while a tibble is its modern alternative from the tidyverse.
  • Printing: Data frames print every row (up to the max.print limit), whereas tibbles display only the first ten rows and the columns that fit on screen.
  • Column Behavior: Before R 4.0.0, data.frame() converted strings to factors by default (stringsAsFactors = TRUE); tibbles never convert column types.
  • Subsetting: Extracting a single column with df[, "x"] returns a vector from a data frame but a one-column tibble from a tibble; $ and [[ ]] return a vector in both cases.
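
The subsetting difference is easiest to see side by side. A small sketch with made-up columns x and y:

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- tibble(x = 1:3, y = c("a", "b", "c"))

df_col  <- df[, "x"]   # data frame: simplified to a plain integer vector
tbl_col <- tbl[, "x"]  # tibble: stays a one-column tibble

class(df_col)
class(tbl_col)

# $ and [[ ]] extract a plain vector from both structures
identical(df$x, tbl[["x"]])
```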

Data structures

Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell.

Figure 2: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.

Data structures: Tidy data

  • Tidy data follows a structured format with variables in columns.
  • Each row represents an observation.
  • Each column corresponds to a variable.
  • Tidy data makes data analysis easier using tidyverse tools.
  • Functions like pivot_longer() and pivot_wider() help reshape data.
library(tidyr)
tidy_data <- tibble::tibble(
  Name = c("Alice", "Bob"),
  Score = c(85, 90)
)
print(tidy_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Data structures: Hierarchical table

  • Hierarchical tables store nested data.
  • Useful for representing complex relationships.
  • Can be created using lists or nested data frames.
  • tidyr::nest() helps structure hierarchical tables.
  • Often used in grouped data analysis.
library(tidyr)
hierarchical_data <- tibble::tibble(
  Group = c("A", "B"),
  Data = list(
    data.frame(Value = 1:3),
    data.frame(Value = 4:6)
  )
)
print(hierarchical_data)

# A tibble: 2 × 2
  Group Data           
  <chr> <list>         
1 A     <data.frame [3×1]>
2 B     <data.frame [3×1]>

Data structures: JSON, XML

  • JSON and XML store structured data.
  • JSON is commonly used in web applications.
  • XML is hierarchical and used for document storage.
  • R has jsonlite for JSON parsing.
  • XML can be read using xml2 package.
library(jsonlite)
# A JSON array of objects is parsed into a data frame
# (a single JSON object would be parsed into a named list)
json_data <- '[{"Name": "Alice", "Score": 85}]'
parsed_json <- fromJSON(json_data)
print(parsed_json)

   Name Score
1 Alice    85

Data structures: Networks

  • Networks represent relationships between entities.
  • Graphs consist of nodes and edges.
  • R packages like igraph help analyze network data.
  • Useful in social network and graph analysis.
  • Can visualize networks with ggraph.
library(igraph)
g <- graph_from_edgelist(
  matrix(c("A", "B", "B", "C", "C", "A"), 
         ncol = 2, 
         byrow = TRUE)
)
print(g)

IGRAPH d6b6bcd DN-- 3 3 -- 
+ edges:
[1] A->B B->C C->A

Data structures: Unstructured text

  • Unstructured text data lacks predefined format.
  • Used in NLP applications like sentiment analysis.
  • Common sources include emails, social media, and articles.
  • R provides tm and tidytext for text analysis.
  • Tokenization helps convert text into structured data.
library(tidytext)
text_data <- tibble::tibble(
  ID = 1, 
  Text = "This is an example sentence."
)
tokenized_text <- text_data |> 
  unnest_tokens(word, Text)
print(tokenized_text)

# A tibble: 5 × 2
     ID word   
  <dbl> <chr>  
1     1 this   
2     1 is     
3     1 an     
4     1 example
5     1 sentence

Selecting data

  • Find or generate a dataset that is relevant to the research question (RQ)
  • Import the dataset into an R data frame
  • Subset the data, e.g. “select from x where y==1 order by z”
  • Select columns, e.g. select(x, y)
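
The SQL-like pattern above maps directly onto dplyr verbs. A sketch using the built-in mtcars data, where the choice of columns (mpg, hp, am) is only for illustration:

```r
library(dplyr)

# Roughly: "select mpg, hp from mtcars where am == 1 order by mpg"
manual_cars <- mtcars |>
  filter(am == 1) |>   # where
  arrange(mpg) |>      # order by
  select(mpg, hp)      # select

head(manual_cars, 3)
```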

Selecting data: Built-in datasets

  • R includes several built-in datasets for practice.
  • The datasets package provides easy access to these.
  • Use data() to list available datasets.
  • Use head() to preview a dataset.
  • Built-in datasets are useful for learning data manipulation.
data()
head(mtcars, 3)

                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Selecting data: Finding a real dataset

  • Real-world datasets can be found online in repositories.
  • Websites like Kaggle and UCI Machine Learning Repository offer datasets.
  • R packages like tidyverse and readr help with importing datasets.
  • Government and academic institutions publish open datasets.
# Example: Loading a dataset from an online source
library(readr)
dataset <- read_csv(
  "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
)
head(dataset)

  Month   X1958 X1959 X1960
1   JAN    340    360    417
2   FEB    318    342    391
3   MAR    362    406    419

Selecting data: Importing data to R

  • Data can be imported from CSV, Excel, or databases.
  • read.csv() is commonly used for CSV files.
  • read_excel() from readxl package handles Excel files.
  • Databases can be accessed using DBI and dplyr.
  • Always check encoding and delimiters when importing.
# Example: Importing a CSV file
data <- read.csv("data.csv")
# For a tab-separated (.tsv) file, use read.delim() instead of read.csv():
# data <- read.delim("../../../../dev/quarto/osm-cda/csv/palmerpenguins.tsv")
head(data)

  ID    Name Score
1  1   Alice  85.5
2  2     Bob  90.3
3  3 Charlie  78.9
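
Checking the delimiter and encoding can be done before the actual import. The sketch below writes a small made-up semicolon-separated file to a temporary location, inspects its raw lines, and then states the delimiter, decimal mark and encoding explicitly. The file contents are hypothetical.

```r
# Write a small semicolon-separated file with decimal commas (made-up data)
path <- tempfile(fileext = ".csv")
writeLines(c("ID;Name;Score", "1;Alice;85,5", "2;Bob;90,3"), path)

# Look at the raw lines first to see which delimiter the file uses
readLines(path, n = 2)

# Declare delimiter, decimal mark and encoding explicitly
scores <- read.csv(path, sep = ";", dec = ",", fileEncoding = "UTF-8")
str(scores)
```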

Selecting data: Subsetting using base R

  • Base R provides indexing for selecting subsets of data.
  • Use [] for selecting rows and columns.
  • Logical conditions can filter data.
  • subset() simplifies selection based on conditions.
  • Useful for quick exploratory data analysis.
# Example: Subsetting a dataset
data <- mtcars[mtcars$mpg > 20, ]
head(data, 3)

                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
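
The subset() function mentioned above can express the same kind of selection more compactly. A sketch on mtcars, with the conditions chosen only for illustration:

```r
# Rows with mpg above 20 and 4 cylinders; keep three columns
efficient <- subset(mtcars, mpg > 20 & cyl == 4, select = c(mpg, cyl, hp))

# The same selection written with [] indexing
efficient_idx <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4, c("mpg", "cyl", "hp")]

# Both approaches return the same rows and columns
identical(efficient, efficient_idx)
```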

Selecting data: Using dplyr select()

  • dplyr provides efficient data manipulation functions.
  • The select() function chooses specific columns.
  • Columns can be chosen by name or pattern.
  • rename() can be used for renaming columns.
  • tidyverse functions are chainable with |>.
# Example: Selecting columns with dplyr
library(dplyr)
data <- mtcars |> 
  select(mpg, cyl, hp)
head(data, 3)

                    mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
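
Selecting by pattern and renaming can be combined in one pipeline. A brief sketch, where the new column name miles_per_gallon is just an example:

```r
library(dplyr)

# Keep mpg plus every column whose name starts with "d" (disp, drat),
# then give mpg a more descriptive name
selected <- mtcars |>
  select(mpg, starts_with("d")) |>
  rename(miles_per_gallon = mpg)

names(selected)
```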

Research questions

  • How does penguin weight vary by species?
  • Expectation (hypothesis) about relationships
  • Measurement variable (weight) by explanatory variable (species)
  • Bivariate analysis, summary table with descriptive stats


species    Count  NA_Body_Mass  Mean_Body_Mass  SD_Body_Mass  SEM_Body_Mass
Adelie     152    1             3700.662        458.5661      37.19462
Chinstrap  68     0             3733.088        384.3351      46.60747
Gentoo     124    1             5076.016        504.1162      45.27097
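
A grouped summary of this shape can be sketched with group_by() and summarize(). The example uses a tiny made-up data frame rather than the penguins data, and the column names are assumptions. Note that the SEM values in the table above appear to divide by the full Count, whereas this sketch divides by the number of non-missing observations.

```r
library(dplyr)

# Hypothetical toy data (not the real penguin measurements)
toy <- tibble::tibble(
  species   = c("A", "A", "A", "B", "B"),
  body_mass = c(3500, 3700, NA, 5000, 5200)
)

summary_table <- toy |>
  group_by(species) |>
  summarize(
    Count          = n(),
    NA_Body_Mass   = sum(is.na(body_mass)),
    Mean_Body_Mass = mean(body_mass, na.rm = TRUE),
    SD_Body_Mass   = sd(body_mass, na.rm = TRUE),
    # SEM based on the non-missing observations only
    SEM_Body_Mass  = SD_Body_Mass / sqrt(Count - NA_Body_Mass)
  )

print(summary_table)
```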

Transforming data

  • Numeric transformations, e.g. scaling, log transformation, centering, normalizing
  • String variables, regular expressions
  • Imputing and extrapolating values, handling missing values
  • Reshaping data (basic), merging and joining relational data (advanced)
  • Summarizing, grouping, aggregating, etc.

Transforming data: Filtering rows

  • Filtering extracts specific rows based on conditions.
  • The filter() function from dplyr simplifies row selection.
  • Logical conditions define which rows to keep.
  • Multiple conditions can be combined using & or |.
  • Useful for subsetting large datasets.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
filtered_data <- data |> 
  filter(Score > 80)
print(filtered_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Transforming data: select() and filter()

  • select() extracts specific columns.
  • filter() selects rows based on conditions.
  • Used together to refine data views.
  • Helps in feature selection for modeling.
  • Makes data frames more manageable.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75), 
  Age = c(25, 30, 22)
)
selected_filtered_data <- data |> 
  select(Name, Score) |> 
  filter(Score > 80)
print(selected_filtered_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Transforming data: dplyr arrange()

  • Sorting arranges data in ascending or descending order.
  • The arrange() function in dplyr performs sorting.
  • Use desc() for descending order.
  • Sorting helps in ranking and comparisons.
  • Can sort by multiple columns.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
sorted_data <- data |> 
  arrange(desc(Score))
print(sorted_data)

# A tibble: 3 × 2
  Name    Score
  <chr>   <dbl>
1 Bob        90
2 Alice      85
3 Charlie    75

Transforming data: dplyr mutate()

  • mutate() creates new variables.
  • Can modify or transform existing columns.
  • Useful for feature engineering.
  • Supports mathematical operations.
  • Keeps original data intact while adding new fields.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
mutated_data <- data |> 
  mutate(Grade = ifelse(Score > 80, "Pass", "Fail"))
print(mutated_data)

# A tibble: 3 × 3
  Name    Score Grade
  <chr>   <dbl> <chr>
1 Alice      85 Pass 
2 Bob        90 Pass 
3 Charlie    75 Fail 

Transforming data: dplyr group_by()

  • group_by() groups data by one or more variables.
  • Used with summarize() to compute group statistics.
  • Aggregates data efficiently.
  • Helps in analyzing categorical variables.
  • Useful for reporting and summaries.
library(dplyr)
data <- tibble::tibble(
  Group = c("A", "A", "B", "B"), 
  Score = c(85, 90, 75, 80)
)
grouped_data <- data |> 
  group_by(Group) |> 
  summarize(Average_Score = mean(Score))
print(grouped_data)

# A tibble: 2 × 2
  Group Average_Score
  <chr>        <dbl>
1 A               87.5
2 B               77.5

Transforming data: Aggregations

  • Aggregation summarizes data using functions like sum(), mean().
  • group_by() in dplyr helps compute grouped summaries.
  • Common aggregate functions include min(), max(), sd().
  • Aggregations are useful for statistical summaries and reporting.
  • Can be applied across multiple columns.
library(dplyr)
data <- tibble::tibble(
  Group = c("A", "A", "B", "B"), 
  Score = c(85, 90, 75, 80)
)
agg_data <- data |> 
  group_by(Group) |> 
  summarize(Average_Score = mean(Score))
print(agg_data)

# A tibble: 2 × 2
  Group Average_Score
  <chr>        <dbl>
1 A               87.5
2 B               77.5
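
Other aggregate functions, and aggregation across several columns, follow the same pattern. A sketch in which the Hours column is a made-up addition:

```r
library(dplyr)

data <- tibble::tibble(
  Group = c("A", "A", "B", "B"),
  Score = c(85, 90, 75, 80),
  Hours = c(10, 12, 6, 8)   # hypothetical extra column
)

# Several aggregate functions for one column
score_stats <- data |>
  group_by(Group) |>
  summarize(
    N         = n(),
    Min_Score = min(Score),
    Max_Score = max(Score),
    SD_Score  = sd(Score)
  )

# One function applied across multiple columns
group_means <- data |>
  group_by(Group) |>
  summarize(across(c(Score, Hours), mean))

print(score_stats)
print(group_means)
```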

Reshaping data

A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names ("bp1" and "bp2") become the values in a new `measurement` column. They are repeated three times because there were three rows in the input.

Figure 3: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.

Transforming data: pivot_longer()

  • pivot_longer() converts wide data into long format.
  • Useful for making datasets tidy.
  • Helps reshape data for visualization.
  • Used when multiple columns represent a single variable.
  • Reduces redundant column headers.
library(tidyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob"), 
  Math = c(85, 90), 
  Science = c(88, 92)
)
long_data <- data |> 
  pivot_longer(
    cols = c(Math, Science), 
    names_to = "Subject", 
    values_to = "Score"
  )
print(long_data)

# A tibble: 4 × 3
  Name  Subject  Score
  <chr> <chr>    <dbl>
1 Alice Math        85
2 Alice Science     88
3 Bob   Math        90
4 Bob   Science     92

Transforming data: pivot_wider()

  • pivot_wider() converts long data into wide format.
  • Useful for restructuring data for analysis.
  • Spreads multiple values into separate columns.
  • Helps in comparisons across categories.
  • Often used for creating summary tables.
library(tidyr)
long_data <- tibble::tibble(
  Name = c("Alice", "Alice", "Bob", "Bob"), 
  Subject = c("Math", "Science", "Math", "Science"), 
  Score = c(85, 88, 90, 92)
)
wide_data <- long_data |> 
  pivot_wider(
    names_from = Subject, 
    values_from = Score
  )
print(wide_data)

# A tibble: 2 × 3
  Name  Math Science
  <chr> <dbl>   <dbl>
1 Alice    85      88
2 Bob      90      92

Transforming data: Logical vectors

  • Logical vectors store TRUE or FALSE values.
  • Created using logical conditions in filtering.
  • Can be used in ifelse() statements.
  • Useful for subsetting data efficiently.
  • Helps in conditional data manipulation.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
data <- data |> 
  mutate(Pass = Score > 80)
print(data)

# A tibble: 3 × 3
  Name    Score Pass 
  <chr>   <dbl> <lgl>
1 Alice      85 TRUE 
2 Bob        90 TRUE 
3 Charlie    75 FALSE

Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. x & !y is x but none of y; x & y is the intersection of x and y; !x & y is y but none of x; x is all of x; xor(x, y) is everything except the intersection of x and y; y is all of y; and x | y is everything.

Figure 4: The complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
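
The Boolean operators can be tried on two short logical vectors. This sketch also shows that logical values count as 1 and 0 in arithmetic, which makes counting and proportions straightforward.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y       # TRUE only where both are TRUE
x | y       # TRUE where at least one is TRUE
xor(x, y)   # TRUE where exactly one is TRUE
x & !y      # TRUE where x holds but y does not

# Logical vectors behave as 1/0 in arithmetic
scores <- c(85, 90, 75)
passed <- scores > 80
sum(passed)    # number of passing scores
mean(passed)   # proportion of passing scores
```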

Transforming data: Numbers

  • Numeric transformations modify or create numerical values.
  • Common functions include log(), sqrt(), and round().
  • Scaling and standardization are useful for modeling.
  • Arithmetic operations can be applied element-wise.
  • mutate() in dplyr can create new transformed columns.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
data <- data |> 
  mutate(
    Log_Score = log(Score), 
    Scaled_Score = Score / max(Score)
  )
print(data)

# A tibble: 3 × 4
  Name    Score Log_Score Scaled_Score
  <chr>   <dbl>     <dbl>        <dbl>
1 Alice      85      4.44        0.944
2 Bob        90      4.50        1    
3 Charlie    75      4.32        0.833
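
Centering and standardization (z-scores) can be added in the same way. A sketch with illustrative column names:

```r
library(dplyr)

data <- tibble::tibble(
  Name  = c("Alice", "Bob", "Charlie"),
  Score = c(85, 90, 75)
)

data <- data |>
  mutate(
    Centered_Score = Score - mean(Score),               # mean becomes 0
    Z_Score        = (Score - mean(Score)) / sd(Score), # mean 0, SD 1
    Rounded_Log    = round(log(Score), 2)
  )

print(data)
```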

Transforming data: Strings

  • String transformations modify text data.
  • Functions like toupper(), tolower(), and str_replace() help clean text.
  • String manipulation is useful for categorical data.
  • stringr package provides additional tools for text handling.
  • Common transformations include trimming, replacing, and extracting patterns.
library(dplyr)    # for mutate()
library(stringr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie")
)
data <- data |> 
  mutate(
    Upper_Name = toupper(Name), 
    Short_Name = str_sub(Name, 1, 3)
  )
print(data)

# A tibble: 3 × 3
  Name    Upper_Name Short_Name
  <chr>   <chr>      <chr>     
1 Alice   ALICE      Ali       
2 Bob     BOB        Bob       
3 Charlie CHARLIE    Cha       

Transforming data: Regular expressions

  • Regular expressions help search and manipulate patterns in text.
  • str_detect() finds matching patterns.
  • str_replace() replaces specific text parts.
  • Useful for cleaning and extracting structured data.
  • Can be used in filtering and subsetting datasets.
library(dplyr)    # for mutate()
library(stringr)
data <- tibble::tibble(
  Text = c("apple123", "banana456", "cherry789")
)
data <- data |> 
  mutate(
    Extracted_Number = str_extract(Text, "\\d+")
  )
print(data)

# A tibble: 3 × 2
  Text       Extracted_Number
  <chr>      <chr>           
1 apple123   123             
2 banana456  456             
3 cherry789  789             
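
str_detect() and str_replace() work with the same kind of patterns. The sketch below filters rows by a pattern and then strips the digits; the patterns are chosen only for illustration.

```r
library(dplyr)
library(stringr)

data <- tibble::tibble(
  Text = c("apple123", "banana456", "cherry789")
)

cleaned <- data |>
  # keep rows whose text starts with "a" or "b"
  filter(str_detect(Text, "^[ab]")) |>
  # remove the first run of digits
  mutate(Letters = str_replace(Text, "\\d+", ""))

print(cleaned)
```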

Transforming data: Factors

  • Factors represent categorical data in R.
  • factor() function converts character data to categorical format.
  • Ordered factors help in ranked categories.
  • Factors can have predefined levels.
  • Useful for statistical modeling and visualization.
library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Grade = c("A", "B", "A")
)
data <- data |> 
  mutate(
    Grade = factor(
      Grade, 
      levels = c("A", "B", "C"), 
      ordered = TRUE
    )
  )
print(data)

# A tibble: 3 × 2
  Name    Grade
  <chr>   <ord>
1 Alice   A    
2 Bob     B    
3 Charlie A    
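
Predefined levels and ordering have visible consequences. A short sketch using the same grades as above:

```r
grade <- factor(
  c("A", "B", "A"),
  levels = c("A", "B", "C"),
  ordered = TRUE
)

levels(grade)   # all predefined levels, including the unused "C"
table(grade)    # unused levels are counted as 0
grade < "B"     # comparisons follow the level order A < B < C
```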

Transforming data: Dates and times

  • Dates and times are stored as special data types in R.
  • The lubridate package simplifies date-time manipulation.
  • Functions like ymd() and mdy() parse dates, while hms() parses times.
  • Arithmetic operations work on date-time objects.
  • Useful for time-series analysis and scheduling.
library(dplyr)      # for mutate()
library(lubridate)
data <- tibble::tibble(
  Name = c("Alice", "Bob"), 
  Birthdate = c("1990-05-15", "1985-10-30")
)
data <- data |> 
  mutate(
    Birthdate = ymd(Birthdate), 
    # Approximate age in years (ignores leap years);
    # the result depends on the date the code is run
    Age = as.numeric(Sys.Date() - Birthdate) / 365
  )
print(data)

# A tibble: 2 × 3
  Name  Birthdate   Age
  <chr> <date>    <dbl>
1 Alice 1990-05-15  33.9
2 Bob   1985-10-30  38.3
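
Because the age calculation above depends on the current date, its output changes from day to day. A reproducible variant fixes a reference date and uses lubridate's interval() and time_length(). The reference date below is an arbitrary assumption.

```r
library(lubridate)

birthdates <- ymd(c("1990-05-15", "1985-10-30"))
reference  <- ymd("2025-02-21")   # arbitrary fixed reference date

# Exact length of each interval in years, then completed years
age_exact <- time_length(interval(birthdates, reference), "years")
age_years <- floor(age_exact)

age_years
```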

Transforming data: Missing values

  • Missing values are represented as NA in R.
  • is.na() checks for missing values.
  • na.omit() removes rows that contain missing values.
  • mutate() with replace_na() from tidyr fills missing values.
  • Handling missing values is crucial for accurate analysis.
library(dplyr)
library(tidyr)    # for replace_na()
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, NA, 75)
)
data <- data |> 
  mutate(
    Score = replace_na(Score, mean(Score, na.rm = TRUE))
  )
print(data)

# A tibble: 3 × 2
  Name    Score
  <chr>   <dbl>
1 Alice      85
2 Bob        80
3 Charlie    75
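
Detecting and dropping missing values uses base R only. A minimal sketch on the same made-up scores:

```r
scores <- data.frame(
  Name  = c("Alice", "Bob", "Charlie"),
  Score = c(85, NA, 75)
)

is.na(scores$Score)        # which values are missing
sum(is.na(scores$Score))   # how many values are missing

# Drop every row that contains at least one NA
complete <- na.omit(scores)
nrow(complete)

mean(scores$Score)                 # NA, because one value is missing
mean(scores$Score, na.rm = TRUE)   # mean of the observed values
```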

Transforming data: Joins

  • Joins combine two data frames based on a common key.
  • left_join() keeps all rows from the first table.
  • inner_join() keeps only matching rows.
  • full_join() includes all rows from both tables.
  • Helps merge datasets efficiently.
library(dplyr)
data1 <- tibble::tibble(
  ID = c(1, 2, 3), 
  Name = c("Alice", "Bob", "Charlie")
)
data2 <- tibble::tibble(
  ID = c(2, 3, 4), 
  Score = c(90, 75, 88)
)
joined_data <- left_join(data1, data2, by = "ID")
print(joined_data)

# A tibble: 3 × 3
     ID Name    Score
  <dbl> <chr>   <dbl>
1     1 Alice      NA
2     2 Bob        90
3     3 Charlie    75
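
The other join types differ only in which rows they keep. A sketch reusing the two tables from the example above:

```r
library(dplyr)

data1 <- tibble::tibble(
  ID   = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie")
)
data2 <- tibble::tibble(
  ID    = c(2, 3, 4),
  Score = c(90, 75, 88)
)

# Only IDs present in both tables (2 and 3)
inner <- inner_join(data1, data2, by = "ID")

# All IDs from either table (1 to 4); gaps are filled with NA
full <- full_join(data1, data2, by = "ID")

print(inner)
print(full)
```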

Exporting data frames

  • Data frames can be saved to CSV, Excel, or RDS formats.
  • write.csv() (base R) or write_csv() (readr) writes a data frame to a CSV file.
  • write_rds() saves a single R object, preserving column types, for later use in R.
  • read_csv() and read_rds() reload saved files.
  • Exporting data allows sharing and persistence.
library(readr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
write_csv(data, "../../csv/data.csv")
write_rds(data, "../../csv/data.rds")

# Files "data.csv" and "data.rds" have been created.
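
A round trip confirms that the exported files can be read back. This sketch writes to a temporary directory instead of the project folder, which is an assumption made so that it runs anywhere.

```r
library(readr)

data <- tibble::tibble(
  Name  = c("Alice", "Bob", "Charlie"),
  Score = c(85, 90, 75)
)

# Temporary file paths instead of a project directory
csv_path <- file.path(tempdir(), "data.csv")
rds_path <- file.path(tempdir(), "data.rds")

write_csv(data, csv_path)
write_rds(data, rds_path)

# Reload both files and compare with the original
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

all.equal(as.data.frame(data), as.data.frame(from_rds))
```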

Visualizing data

Next steps

Computer lab 1, Feb 21

References

Watt, H., and T. Naidoo. 2025. “Data Wrangling Recipes in R.” https://bookdown.org/hcwatt99/Data_Wrangling_Recipes_in_R/#why-data-wrangling-recipes-in-r.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.
Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media. https://clauswilke.com/dataviz/.