Basic data selection and transformations using R

SIMP59: Data Selection and Visualisation VT25
7.5 credits

nils.holmberg@iko.lu.se

Presentation

Nils Holmberg
Computational content analysis
Cognitive communication effects

Course literature

Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Wilke (2019)

Watt and Naidoo (2025)

Lecture overview

Data analysis framework
Data types
Data frames
Data structures
Selecting data
Importing data
Subsetting data

Research questions
Transforming data
Adding data
Grouping data
Reshaping data
Exporting data frames
Visualizing data

Data analysis framework

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted.

Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.

Open science communication

Replicability: R enables reproducible research with shareable scripts.
Transparency: Open-source code makes analyses fully visible.
Collaboration: Supports teamwork via GitHub and R Markdown.
Open Source: Free and community-driven for open science. 🚀

Notebook formats

R Markdown (.Rmd)
- RStudio
- VS Code
Jupyter notebooks (.ipynb)
- Python
- Google Colab
Quarto (.qmd)
- Multilingual

What is data?

Plural of the Latin datum
A datum is a single piece of information
Data are multiple observations
Data can be qualitative or quantitative

Data types in R

# Numeric (default type for numbers in R)
print(class(3.14))  # "numeric"

[1] "numeric"

# Integer (specified with an "L" suffix)
print(class(42L))  # "integer"

[1] "integer"

# Character (string)
print(class("Hello, R!"))  # "character"

[1] "character"

# Logical (Boolean: TRUE or FALSE)
print(class(TRUE))  # "logical"

[1] "logical"

# Factor (categorical variable)
print(class(factor(c("low", "medium", "high", "medium"))))  # "factor"

[1] "factor"

What is a variable?

# Checking type conversion
print(class(as.character(3.14)))  # "character"

[1] "character"

print(class(as.numeric("3.14")))  # "numeric"

[1] "numeric"

# Vector (homogeneous data type)
print(class(c(1, 2, 3, 4, 5)))  # "numeric"

[1] "numeric"

# List (heterogeneous data types)
print(class(list(3.14, "Hello, R!", TRUE)))  # "list"

[1] "list"

# Data Frame (tabular structure with columns of different types)
print(class(data.frame(ID = 1:3, Name = c("Alice", "Bob", "Charlie"), Score = c(85.5, 90.3, 78.9))))  # "data.frame"

[1] "data.frame"

Variable descriptives

Measures of central tendency and dispersion, mpg and hp
Univariate analysis, summary table of descriptive stats

Variable	Count	Mean	SD	SEM
mpg	32	20.09062	6.026948	1.065424
hp	32	146.68750	68.562868	12.120317
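The summary table above can be reproduced with a few base R calls. The following is a minimal sketch, assuming the values come from the built-in mtcars dataset and that SEM is defined as the SD divided by the square root of the count. The helper name `describe` is hypothetical.

```r
# Hypothetical helper: count, mean, SD and SEM for one numeric vector
describe <- function(x) {
  n <- sum(!is.na(x))                 # number of non-missing observations
  s <- sd(x, na.rm = TRUE)            # sample standard deviation
  c(Count = n, Mean = mean(x, na.rm = TRUE), SD = s, SEM = s / sqrt(n))
}

# Apply the helper to mpg and hp and stack the results into a table
descriptives <- t(sapply(mtcars[c("mpg", "hp")], describe))
print(descriptives)
```

Each row of `descriptives` corresponds to one variable, so further variables can be added by extending the column selection.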

Data frames in R

Structure: A data frame is R’s standard tabular data structure, while a tibble is its modern alternative from the tidyverse.
Printing: Data frames print the full dataset by default, whereas tibbles display only the first ten rows and the columns that fit on screen.
Column Behavior: Before R 4.0.0, data.frame() converted strings to factors by default; tibbles never convert column types or change column names.
Subsetting: Extracting a single column from a data frame with [ can return a vector, while [ on a tibble always returns a tibble; $ and [[ return a vector for both.
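The subsetting difference is easiest to see side by side. This is a small sketch with made-up values, assuming the tibble package is installed and R is version 4.0.0 or later.

```r
library(tibble)

df <- data.frame(Name = c("Alice", "Bob"), Score = c(85, 90))
tb <- tibble(Name = c("Alice", "Bob"), Score = c(85, 90))

# Single-column extraction with [ , ]: a data frame drops to a vector
class(df[, "Score"])                # "numeric"
# ... unless drop = FALSE is given
class(df[, "Score", drop = FALSE])  # "data.frame"
# A tibble stays a tibble with [ , ]
class(tb[, "Score"])                # "tbl_df" "tbl" "data.frame"
# $ and [[ ]] return a vector for both
class(tb$Score)                     # "numeric"
# Since R 4.0.0, data.frame() keeps strings as character by default
class(df$Name)                      # "character"
```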

Data structures

Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell.

Figure 2: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.

Tidy data follows a structured format with variables in columns.
Each row represents an observation.
Each column corresponds to a variable.
Tidy data makes data analysis easier using tidyverse tools.
Functions like pivot_longer() and pivot_wider() help reshape data.

library(tidyr)
tidy_data <- tibble::tibble(
  Name = c("Alice", "Bob"),
  Score = c(85, 90)
)
print(tidy_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Data structures: Hierarchical table

Hierarchical tables store nested data.
Useful for representing complex relationships.
Can be created using lists or nested data frames.
tidyr::nest() helps structure hierarchical tables.
Often used in grouped data analysis.

library(tidyr)
hierarchical_data <- tibble::tibble(
  Group = c("A", "B"),
  Data = list(
    data.frame(Value = 1:3),
    data.frame(Value = 4:6)
  )
)
print(hierarchical_data)

# A tibble: 2 × 2
  Group Data           
  <chr> <list>         
1 A     <data.frame [3×1]>
2 B     <data.frame [3×1]>
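The example above builds the list column by hand. As a complementary sketch, tidyr::nest() can create the same kind of structure from a flat table, and unnest() reverses it. The table `flat` is made up for illustration.

```r
library(tidyr)

flat <- tibble::tibble(
  Group = c("A", "A", "A", "B", "B", "B"),
  Value = 1:6
)

# Collapse the Value column into one nested data frame per group
nested <- flat |> nest(data = Value)
print(nested)

# unnest() reverses the operation and restores the flat table
restored <- nested |> unnest(data)
```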

Data structures: JSON, XML

JSON and XML store structured data.
JSON is commonly used in web applications.
XML is hierarchical and used for document storage.
R has jsonlite for JSON parsing.
XML can be read using the xml2 package.

library(jsonlite)
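# A JSON array of records is parsed into a data frame;
# a single JSON object would be parsed into a named list instead
json_data <- '[{"Name": "Alice", "Score": 85}]'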
parsed_json <- fromJSON(json_data)
print(parsed_json)

  Name Score
1 Alice    85
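For the XML side, a comparable sketch with xml2 might look as follows. The XML snippet, its element names, and its attribute names are hypothetical and chosen to mirror the JSON example.

```r
library(xml2)

# Hypothetical XML snippet with the same kind of records as the JSON example
xml_text_input <- '<students>
  <student name="Alice" score="85"/>
  <student name="Bob" score="90"/>
</students>'

doc <- read_xml(xml_text_input)
students <- xml_find_all(doc, ".//student")   # select all <student> nodes

# Collect the attributes into a regular data frame
parsed_xml <- data.frame(
  Name = xml_attr(students, "name"),
  Score = as.numeric(xml_attr(students, "score"))
)
print(parsed_xml)
```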

Data structures: Networks

Networks represent relationships between entities.
Graphs consist of nodes and edges.
R packages like igraph help analyze network data.
Useful in social network and graph analysis.
Can visualize networks with ggraph.

library(igraph)
g <- graph_from_edgelist(
  matrix(c("A", "B", "B", "C", "C", "A"), 
         ncol = 2, 
         byrow = TRUE)
)
print(g)

IGRAPH d6b6bcd DN-- 3 3 -- 
+ edges:
[1] A->B B->C C->A

Data structures: Unstructured text

Unstructured text data lacks a predefined format.
Used in NLP applications like sentiment analysis.
Common sources include emails, social media, and articles.
R provides tm and tidytext for text analysis.
Tokenization helps convert text into structured data.

library(tidytext)
text_data <- tibble::tibble(
  ID = 1, 
  Text = "This is an example sentence."
)
tokenized_text <- text_data |> 
  unnest_tokens(word, Text)
print(tokenized_text)

# A tibble: 5 × 2
     ID word   
  <dbl> <chr>  
1     1 this   
2     1 is     
3     1 an     
4     1 example
5     1 sentence
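To make the idea of tokenization concrete, here is a rough base R approximation of the same steps. It is a sketch only and is not how unnest_tokens() works internally. It lower-cases the text, strips punctuation, and splits on whitespace.

```r
text <- "This is an example sentence."

# Lower-case the text and strip punctuation
clean <- gsub("[[:punct:]]", "", tolower(text))

# Split on whitespace: one token per element
tokens <- strsplit(clean, "\\s+")[[1]]

# One row per distinct token, plus a frequency count per word
token_table <- as.data.frame(table(word = tokens), stringsAsFactors = FALSE)
print(token_table)
```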

Selecting data

Find or generate a dataset that is relevant to the research question (RQ)
Import the dataset into an R data frame
Subset rows, e.g. “select from x where y==1 order by z”
Select columns, e.g. select(x, y)
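The SQL-style phrase in the list above maps onto dplyr verbs fairly directly. The sketch below uses the built-in mtcars dataset; the choice of columns and of the condition `am == 1` is only an illustration.

```r
library(dplyr)

# SQL analogue: SELECT mpg, cyl, am FROM mtcars WHERE am = 1 ORDER BY mpg
manual_cars <- mtcars |>
  filter(am == 1) |>       # WHERE: keep rows with manual transmission
  arrange(mpg) |>          # ORDER BY: sort ascending by mpg
  select(mpg, cyl, am)     # SELECT: keep three columns
head(manual_cars, 3)
```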

Selecting data: Built-in datasets

R includes several built-in datasets for practice.
The datasets package provides easy access to these.
Use data() to list available datasets.
Use head() to preview a dataset.
Built-in datasets are useful for learning data manipulation.

data()
head(mtcars, 3)  # show the first three rows

                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Selecting data: Finding a real dataset

Real-world datasets can be found online in repositories.
Websites like Kaggle and UCI Machine Learning Repository offer datasets.
R packages like tidyverse and readr help with importing datasets.
Government and academic institutions publish open datasets.

# Example: Loading a dataset from an online source
library(readr)
dataset <- read_csv(
  "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
)
head(dataset)

  Month   X1958 X1959 X1960
1   JAN    340    360    417
2   FEB    318    342    391
3   MAR    362    406    419

Selecting data: Importing data to R

Data can be imported from CSV, Excel, or databases.
read.csv() is commonly used for CSV files.
read_excel() from the readxl package handles Excel files.
Databases can be accessed using DBI and dbplyr, the database backend for dplyr.
Always check encoding and delimiters when importing.

# Example: Importing a CSV file
data <- read.csv("data.csv")
# For a tab-separated file such as the one below, read.delim() applies sep = "\t":
# data <- read.delim("../../../../dev/quarto/osm-cda/csv/palmerpenguins.tsv")
head(data)

  ID Name   Score
1  1 Alice  85.5
2  2 Bob    90.3
3  3 Charlie 78.9
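The advice about delimiters can be illustrated with a small self-contained sketch. It writes a made-up semicolon-delimited file to a temporary path and reads it twice, once with the default separator and once with the correct one.

```r
# Write a small semicolon-delimited file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("ID;Name;Score", "1;Alice;85.5", "2;Bob;90.3"), path)

# Reading with the default comma separator puts everything in one column
wrong <- read.csv(path)
ncol(wrong)

# Stating the delimiter (and, if needed, the encoding) fixes the import
right <- read.csv(path, sep = ";", fileEncoding = "UTF-8")
str(right)
```

Checking ncol() or str() right after import is a cheap way to catch a wrong delimiter before any analysis starts.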

Selecting data: Subsetting using base R

Base R provides indexing for selecting subsets of data.
Use [] for selecting rows and columns.
Logical conditions can filter data.
subset() simplifies selection based on conditions.
Useful for quick exploratory data analysis.

# Example: Subsetting a dataset
data <- mtcars[mtcars$mpg > 20, ]
head(data, 3)  # show the first three matching rows

                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
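Bracket indexing and subset() can express the same selection. The sketch below combines two conditions and picks three columns; the specific conditions are only an example.

```r
# Bracket indexing: rows by logical condition, columns by name
by_index <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4, c("mpg", "cyl", "hp")]

# subset(): the same selection with less repetition
by_subset <- subset(mtcars, mpg > 20 & cyl == 4, select = c(mpg, cyl, hp))

# Both approaches should select the same rows and columns
all.equal(by_index, by_subset)
```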

Selecting data: Using dplyr select()

dplyr provides efficient data manipulation functions.
The select() function chooses specific columns.
Columns can be chosen by name or pattern.
rename() can be used for renaming columns.
tidyverse functions are chainable with |>.

# Example: Selecting columns with dplyr
library(dplyr)
data <- mtcars |> 
  select(mpg, cyl, hp)
head(data, 3)  # show the first three rows

                    mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
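Selection by pattern and renaming, both mentioned above, could look like this. It is a minimal sketch on mtcars, and the new column name `miles_per_gallon` is arbitrary.

```r
library(dplyr)

selected <- mtcars |>
  select(starts_with("d"), mpg) |>      # disp, drat, then mpg
  rename(miles_per_gallon = mpg)        # new_name = old_name
names(selected)
```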

Research questions

How does penguin weight vary by species?
Expectation (hypothesis) about relationships
Measurement variable (weight) by explanatory variable (species)
Bivariate analysis, summary table with descriptive stats

species	Count	NA_Body_Mass	Mean_Body_Mass	SD_Body_Mass	SEM_Body_Mass
Adelie	152	1	3700.662	458.5661	37.19462
Chinstrap	68	0	3733.088	384.3351	46.60747
Gentoo	124	1	5076.016	504.1162	45.27097
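A table of this shape can be produced with group_by() and summarize(). The sketch below uses made-up numbers rather than the penguin data, so that it runs without extra packages. The SEM values in the table above are consistent with SD divided by the square root of Count; the sketch instead divides by the number of non-missing values, which is an assumption and gives slightly different results when values are missing.

```r
library(dplyr)

# Made-up measurements for illustration; these are not the penguin data
toy <- tibble::tibble(
  species = c("A", "A", "A", "B", "B", "B"),
  body_mass = c(3700, 3800, NA, 5000, 5100, 5200)
)

summary_table <- toy |>
  group_by(species) |>
  summarize(
    Count = n(),                                   # rows, including missing
    NA_Body_Mass = sum(is.na(body_mass)),          # missing values
    Mean_Body_Mass = mean(body_mass, na.rm = TRUE),
    SD_Body_Mass = sd(body_mass, na.rm = TRUE),
    # SEM based on the number of non-missing values (an assumption)
    SEM_Body_Mass = SD_Body_Mass / sqrt(Count - NA_Body_Mass)
  )
print(summary_table)
```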

Transforming data

Numeric transformations, e.g. scaling, log transformation, centering, normalizing
String variables and regular expressions
Handling missing values, e.g. imputing or extrapolating values
Reshaping data (basic); merging and joining relational data (advanced)
Summarizing, grouping, aggregating, etc.

Transforming data: Filtering rows

Filtering extracts specific rows based on conditions.
The filter() function from dplyr simplifies row selection.
Logical conditions define which rows to keep.
Multiple conditions can be combined using & or |.
Useful for subsetting large datasets.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
filtered_data <- data |> 
  filter(Score > 80)
print(filtered_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Transforming data: select() and filter()

select() extracts specific columns.
filter() selects rows based on conditions.
Used together to refine data views.
Helps in feature selection for modeling.
Makes data frames more manageable.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75), 
  Age = c(25, 30, 22)
)
selected_filtered_data <- data |> 
  select(Name, Score) |> 
  filter(Score > 80)
print(selected_filtered_data)

# A tibble: 2 × 2
  Name  Score
  <chr> <dbl>
1 Alice    85
2 Bob      90

Transforming data: dplyr arrange()

Sorting arranges data in ascending or descending order.
The arrange() function in dplyr performs sorting.
Use desc() for descending order.
Sorting helps in ranking and comparisons.
Can sort by multiple columns.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
sorted_data <- data |> 
  arrange(desc(Score))
print(sorted_data)

# A tibble: 3 × 2
  Name    Score
  <chr>   <dbl>
1 Bob        90
2 Alice      85
3 Charlie    75

Transforming data: dplyr mutate()

mutate() creates new variables.
Can modify or transform existing columns.
Useful for feature engineering.
Supports mathematical operations.
Keeps original data intact while adding new fields.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
mutated_data <- data |> 
  mutate(Grade = ifelse(Score > 80, "Pass", "Fail"))
print(mutated_data)

# A tibble: 3 × 3
  Name    Score Grade
  <chr>   <dbl> <chr>
1 Alice      85 Pass 
2 Bob        90 Pass 
3 Charlie    75 Fail

Transforming data: dplyr group_by()

group_by() groups data by one or more variables.
Used with summarize() to compute group statistics.
Aggregates data efficiently.
Helps in analyzing categorical variables.
Useful for reporting and summaries.

library(dplyr)
data <- tibble::tibble(
  Group = c("A", "A", "B", "B"), 
  Score = c(85, 90, 75, 80)
)
grouped_data <- data |> 
  group_by(Group) |> 
  summarize(Average_Score = mean(Score))
print(grouped_data)

# A tibble: 2 × 2
  Group Average_Score
  <chr>        <dbl>
1 A               87.5
2 B               77.5

Transforming data: Aggregations

Aggregation summarizes data using functions like sum(), mean().
group_by() in dplyr helps compute grouped summaries.
Common aggregate functions include min(), max(), sd().
Aggregations are useful for statistical summaries and reporting.
Can be applied across multiple columns.

library(dplyr)
data <- tibble::tibble(
  Group = c("A", "A", "B", "B"), 
  Score = c(85, 90, 75, 80)
)
agg_data <- data |> 
  group_by(Group) |> 
  summarize(Average_Score = mean(Score))
print(agg_data)

# A tibble: 2 × 2
  Group Average_Score
  <chr>        <dbl>
1 A               87.5
2 B               77.5
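Several of the aggregate functions listed above can be combined in one summarize() call. This sketch reuses the same small example data; the output column names are arbitrary.

```r
library(dplyr)

data <- tibble::tibble(
  Group = c("A", "A", "B", "B"),
  Score = c(85, 90, 75, 80)
)

# Several aggregate functions in a single summarize() call
agg_multi <- data |>
  group_by(Group) |>
  summarize(
    N = n(),
    Total = sum(Score),
    Min = min(Score),
    Max = max(Score),
    SD = sd(Score)
  )
print(agg_multi)
```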

Reshaping data

A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names ("bp1" and "bp2") become the values in a new `measurement` column. They are repeated three times because there were three rows in the input.

Figure 3: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.

Transforming data: pivot_longer()

pivot_longer() converts wide data into long format.
Useful for making datasets tidy.
Helps reshape data for visualization.
Used when multiple columns represent a single variable.
Reduces redundant column headers.

library(tidyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob"), 
  Math = c(85, 90), 
  Science = c(88, 92)
)
long_data <- data |> 
  pivot_longer(
    cols = c(Math, Science), 
    names_to = "Subject", 
    values_to = "Score"
  )
print(long_data)

# A tibble: 4 × 3
  Name  Subject  Score
  <chr> <chr>    <dbl>
1 Alice Math        85
2 Alice Science     88
3 Bob   Math        90
4 Bob   Science     92

Transforming data: pivot_wider()

pivot_wider() converts long data into wide format.
Useful for restructuring data for analysis.
Spreads multiple values into separate columns.
Helps in comparisons across categories.
Often used for creating summary tables.

library(tidyr)
long_data <- tibble::tibble(
  Name = c("Alice", "Alice", "Bob", "Bob"), 
  Subject = c("Math", "Science", "Math", "Science"), 
  Score = c(85, 88, 90, 92)
)
wide_data <- long_data |> 
  pivot_wider(
    names_from = Subject, 
    values_from = Score
  )
print(wide_data)

# A tibble: 2 × 3
  Name  Math Science
  <chr> <dbl>   <dbl>
1 Alice    85      88
2 Bob      90      92
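Since pivot_longer() and pivot_wider() are inverses of each other for data like this, a round trip should return the original table. This is a small sketch of that check, using the same example values.

```r
library(tidyr)

wide <- tibble::tibble(
  Name = c("Alice", "Bob"),
  Math = c(85, 90),
  Science = c(88, 92)
)

# Wide -> long -> wide again
round_trip <- wide |>
  pivot_longer(cols = -Name, names_to = "Subject", values_to = "Score") |>
  pivot_wider(names_from = Subject, values_from = Score)

all.equal(wide, round_trip)
```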

Transforming data: Logical vectors

Logical vectors store TRUE or FALSE values.
Created using logical conditions in filtering.
Can be used in ifelse() statements.
Useful for subsetting data efficiently.
Helps in conditional data manipulation.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
data <- data |> 
  mutate(Pass = Score > 80)
print(data)

# A tibble: 3 × 3
  Name    Score Pass 
  <chr>   <dbl> <lgl>
1 Alice      85 TRUE 
2 Bob        90 TRUE 
3 Charlie    75 FALSE

Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. x & !y is x but none of y; x & y is the intersection of x and y; !x & y is y but none of x; x is all of x; xor(x, y) is everything except the intersection of x and y; y is all of y; and x | y is everything.

Figure 4: The complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.
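The Boolean operations in the figure can be tried out on two short logical vectors. This is a minimal base R sketch that builds a truth table with one row per combination of x and y; the column names are arbitrary.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# Element-wise Boolean operators, one row per combination of x and y
truth_table <- data.frame(
  x, y,
  and = x & y,
  or = x | y,
  xor = xor(x, y),
  x_not_y = x & !y,
  y_not_x = !x & y
)
print(truth_table)
```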

Transforming data: Numbers

Numeric transformations modify or create numerical values.
Common functions include log(), sqrt(), and round().
Scaling and standardization are useful for modeling.
Arithmetic operations can be applied element-wise.
mutate() in dplyr can create new transformed columns.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
data <- data |> 
  mutate(
    Log_Score = log(Score), 
    Scaled_Score = Score / max(Score)
  )
print(data)

# A tibble: 3 × 4
  Name    Score Log_Score Scaled_Score
  <chr>   <dbl>     <dbl>        <dbl>
1 Alice      85      4.44        0.944
2 Bob        90      4.50        1    
3 Charlie    75      4.32        0.833
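Centering, standardization, and min-max normalization are not shown in the example above. The following base R sketch applies them to the same three scores and compares the hand-written z-score with scale().

```r
score <- c(85, 90, 75)

centered <- score - mean(score)                  # mean becomes 0
z_score  <- (score - mean(score)) / sd(score)    # mean 0, SD 1
min_max  <- (score - min(score)) / (max(score) - min(score))  # range 0 to 1

# scale() performs the same standardization and returns a one-column matrix
all.equal(z_score, as.numeric(scale(score)))
```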

Transforming data: Strings

String transformations modify text data.
Functions like toupper(), tolower(), and str_replace() help clean text.
String manipulation is useful for categorical data.
The stringr package provides additional tools for text handling.
Common transformations include trimming, replacing, and extracting patterns.

library(dplyr)    # mutate()
library(stringr)  # str_sub()
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie")
)
data <- data |> 
  mutate(
    Upper_Name = toupper(Name), 
    Short_Name = str_sub(Name, 1, 3)
  )
print(data)

# A tibble: 3 × 3
  Name    Upper_Name Short_Name
  <chr>   <chr>      <chr>     
1 Alice   ALICE      Ali       
2 Bob     BOB        Bob       
3 Charlie CHARLIE    Cha

Transforming data: Regular expressions

Regular expressions help search and manipulate patterns in text.
str_detect() finds matching patterns.
str_replace() replaces specific text parts.
Useful for cleaning and extracting structured data.
Can be used in filtering and subsetting datasets.

library(dplyr)    # mutate()
library(stringr)  # str_extract()
data <- tibble::tibble(
  Text = c("apple123", "banana456", "cherry789")
)
data <- data |> 
  mutate(
    Extracted_Number = str_extract(Text, "\\d+")
  )
print(data)

# A tibble: 3 × 2
  Text       Extracted_Number
  <chr>      <chr>           
1 apple123   123             
2 banana456  456             
3 cherry789  789
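The example above uses str_extract(); the two functions named in the bullet points, str_detect() and str_replace(), can be sketched in the same way. The input strings are made up, with one entry deliberately lacking digits.

```r
library(stringr)

text <- c("apple123", "banana", "cherry789")

has_digits <- str_detect(text, "\\d")          # TRUE where a digit occurs
letters_only <- str_replace(text, "\\d+", "")  # drop the first run of digits

# A logical vector from str_detect() can subset the original vector
text[has_digits]
```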

Transforming data: Factors

Factors represent categorical data in R.
The factor() function converts character data to categorical format.
Ordered factors help in ranked categories.
Factors can have predefined levels.
Useful for statistical modeling and visualization.

library(dplyr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Grade = c("A", "B", "A")
)
data <- data |> 
  mutate(
    Grade = factor(
      Grade, 
      levels = c("A", "B", "C"), 
      ordered = TRUE
    )
  )
print(data)

# A tibble: 3 × 2
  Name    Grade
  <chr>   <ord>
1 Alice   A    
2 Bob     B    
3 Charlie A
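What an ordered factor adds over a plain character vector can be seen in a short base R sketch. The category labels here are hypothetical.

```r
size <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

levels(size)        # order of the categories, as declared
table(size)         # counts follow the level order, not the alphabet
size < "high"       # comparisons are defined for ordered factors
as.integer(size)    # underlying integer codes
```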

Transforming data: Dates and times

Dates and times are stored as special data types in R.
The lubridate package simplifies date-time manipulation.
Functions like ymd() and mdy() parse dates, and hms() parses times.
Arithmetic operations work on date-time objects.
Useful for time-series analysis and scheduling.

library(dplyr)      # mutate()
library(lubridate)  # ymd()
data <- tibble::tibble(
  Name = c("Alice", "Bob"), 
  Birthdate = c("1990-05-15", "1985-10-30")
)
data <- data |> 
  mutate(
    Birthdate = ymd(Birthdate), 
    Age = as.numeric(Sys.Date() - Birthdate) / 365
  )
print(data)

# A tibble: 2 × 3
  Name  Birthdate   Age
  <chr> <date>    <dbl>
1 Alice 1990-05-15  33.9
2 Bob   1985-10-30  38.3
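Because the example above depends on Sys.Date(), its output changes from day to day. The sketch below uses a fixed, hypothetical reference date instead, and shows parsing with two input formats as well as simple date arithmetic with lubridate.

```r
library(lubridate)

birthdate <- ymd("1990-05-15")
same_day  <- mdy("05/15/1990")       # month-day-year input, same date
birthdate == same_day

# A fixed, hypothetical reference date keeps the result reproducible
reference <- ymd("2024-06-01")
age_years <- time_length(interval(birthdate, reference), "years")
floor(age_years)

# Date arithmetic: adding days moves the date forward
birthdate + days(30)
```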

Transforming data: Missing values

Missing values are represented as NA in R.
is.na() checks for missing values.
na.omit() removes missing values from a dataset.
mutate() with replace_na() from tidyr fills missing values.
Handling missing values is crucial for accurate analysis.

library(dplyr)  # mutate()
library(tidyr)  # replace_na()
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, NA, 75)
)
data <- data |> 
  mutate(
    Score = replace_na(Score, mean(Score, na.rm = TRUE))
  )
print(data)

# A tibble: 3 × 2
  Name    Score
  <chr>   <dbl>
1 Alice      85
2 Bob        80
3 Charlie    75
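The base R functions from the bullet points, is.na() and na.omit(), can be sketched on the same scores. The example also shows why na.rm = TRUE is needed when summarizing.

```r
score <- c(85, NA, 75)

is.na(score)               # flags the missing entry
sum(is.na(score))          # number of missing values
mean(score)                # NA: missing values propagate by default
mean(score, na.rm = TRUE)  # ignore missing values in the calculation

# na.omit() drops every row that contains at least one NA
df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Score = score)
complete <- na.omit(df)
nrow(complete)
```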

Transforming data: Joins

Joins combine two data frames based on a common key.
left_join() keeps all rows from the first table.
inner_join() keeps only matching rows.
full_join() includes all rows from both tables.
Helps merge datasets efficiently.

library(dplyr)
data1 <- tibble::tibble(
  ID = c(1, 2, 3), 
  Name = c("Alice", "Bob", "Charlie")
)
data2 <- tibble::tibble(
  ID = c(2, 3, 4), 
  Score = c(90, 75, 88)
)
joined_data <- left_join(data1, data2, by = "ID")
print(joined_data)

# A tibble: 3 × 3
     ID Name    Score
  <dbl> <chr>   <dbl>
1     1 Alice      NA
2     2 Bob        90
3     3 Charlie    75
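For comparison with the left join above, the same two tables can be combined with inner_join() and full_join(). This is a minimal sketch using the same example values.

```r
library(dplyr)

data1 <- tibble::tibble(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
data2 <- tibble::tibble(ID = c(2, 3, 4), Score = c(90, 75, 88))

inner <- inner_join(data1, data2, by = "ID")  # IDs 2 and 3 only
full  <- full_join(data1, data2, by = "ID")   # IDs 1 to 4, NA where unmatched

nrow(inner)
nrow(full)
```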

Exporting data frames

Data frames can be saved to CSV, Excel, or RDS formats.
write.csv() (base R) and write_csv() (readr) write a data frame to a CSV file.
write_rds() saves data for use in R.
read_csv() and read_rds() reload saved files.
Exporting data allows sharing and persistence.

library(readr)
data <- tibble::tibble(
  Name = c("Alice", "Bob", "Charlie"), 
  Score = c(85, 90, 75)
)
write_csv(data, "../../csv/data.csv")
write_rds(data, "../../csv/data.rds")

# Files "data.csv" and "data.rds" have been created.
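A round trip is a simple way to confirm that an export worked. The sketch below uses base R equivalents (write.csv(), saveRDS(), readRDS()) and temporary files in place of real output paths, which is an assumption made so the example can run anywhere.

```r
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(85, 90, 75)
)

# Temporary files stand in for real output paths
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(data, csv_path, row.names = FALSE)  # plain text, readable anywhere
saveRDS(data, rds_path)                       # R-specific, keeps column types

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)
identical(data, from_rds)
```

The RDS copy comes back unchanged, while the CSV copy has its column types inferred again on import (here the whole-number scores are read back as integers).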

Visualizing data

Next steps

Computer lab 1, Feb 21

Link to lecture notes: https://cca-cce.github.io/osm-comm/web/edu/simp59-01tra.html

References

Watt, H., and T. Naidoo. 2025. “Data Wrangling Recipes in R.” https://bookdown.org/hcwatt99/Data_Wrangling_Recipes_in_R/#why-data-wrangling-recipes-in-r.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.

Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media. https://clauswilke.com/dataviz/.