[1] "numeric"
[1] "integer"
[1] "character"
[1] "logical"
# Factor (categorical variable)
print(class(factor(c("low", "medium", "high", "medium")))) # "factor"
[1] "factor"
SIMP59: Data Selection and Visualisation VT25
7.5 credits
Selecting data
Transforming data
Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.
R Markdown (.Rmd)
Jupyter Notebook (.ipynb)
Quarto (.qmd)
datum
Summary statistics for mpg and hp:

| Variable | Count | Mean | SD | SEM |
|---|---|---|---|---|
| mpg | 32 | 20.09062 | 6.026948 | 1.065424 |
| hp | 32 | 146.68750 | 68.562868 | 12.120317 |
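
A table like the one above can be sketched in base R. This assumes the values come from the built-in `mtcars` dataset, which the notes do not name explicitly (its 32 rows and the `mpg` and `hp` means match the table). The object name `summary_tbl` is only illustrative.

```r
# Assumption: the table summarises the built-in mtcars dataset.
vars <- c("mpg", "hp")

summary_tbl <- data.frame(
  Variable = vars,
  Count    = sapply(vars, function(v) sum(!is.na(mtcars[[v]]))),
  Mean     = sapply(vars, function(v) mean(mtcars[[v]], na.rm = TRUE)),
  SD       = sapply(vars, function(v) sd(mtcars[[v]], na.rm = TRUE)),
  row.names = NULL
)

# Standard error of the mean: SD divided by the square root of the count
summary_tbl$SEM <- summary_tbl$SD / sqrt(summary_tbl$Count)

print(summary_tbl)
```

The SEM column is derived from the SD and Count columns, so it needs no separate pass over the data.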
A data frame is R’s standard tabular data structure, while a tibble is its modern alternative from the tidyverse.
Figure 2: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.
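
As a minimal sketch of the data frame versus tibble distinction, the example below builds a small tidy table and converts it. The column names and values are made up for illustration, and the `tibble` package is assumed to be installed.

```r
library(tibble)  # part of the tidyverse

# Hypothetical tidy data: one column per variable, one row per observation
df <- data.frame(
  subject = c("a", "b", "c"),
  trial   = c(1, 1, 1),
  value   = c(0.5, 0.7, 0.9)
)

tb <- as_tibble(df)

class(df)  # "data.frame"
class(tb)  # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, so base R functions keep working
nrow(tb)
names(tb)
```

Because a tibble inherits from `data.frame`, code written for data frames generally accepts tibbles as well. The main visible differences are in printing and in stricter subsetting.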
tidyverse tools.
pivot_longer() and pivot_wider() help reshape data.
tidyr::nest() helps structure hierarchical tables.
jsonlite for JSON parsing.
xml2 package.
igraph help analyze network data.
ggraph.
tm and tidytext for text analysis.
find or generate a relevant dataset with regard to the RQ?
import the dataset into an R data frame
subsetting data, e.g. “select from x where y==1 order by z”
columns, e.g. select(x, y)
datasets package provides easy access to these.
data() to list available datasets.
head() to preview a dataset.
tidyverse and readr help with importing datasets.
read.csv() is commonly used for CSV files.
read_excel() from the readxl package handles Excel files.
DBI and dplyr.
[] for selecting rows and columns.
subset() simplifies selection based on conditions.
dplyr provides efficient data manipulation functions.
select() function chooses specific columns.
rename() can be used for renaming columns.
tidyverse functions are chainable with |>.
weight vary by species?

| species | Count | NA_Body_Mass | Mean_Body_Mass | SD_Body_Mass | SEM_Body_Mass |
|---|---|---|---|---|---|
| Adelie | 152 | 1 | 3700.662 | 458.5661 | 37.19462 |
| Chinstrap | 68 | 0 | 3733.088 | 384.3351 | 46.60747 |
| Gentoo | 124 | 1 | 5076.016 | 504.1162 | 45.27097 |
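
A per-species table of this shape could be produced with `group_by()` and `summarize()` from dplyr. The sketch below runs on a small hypothetical data frame (`toy`, with a made-up `body_mass_g` column) rather than the real penguin data, since the notes show neither the code nor the dataset object used.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguin measurements
toy <- data.frame(
  species     = c("A", "A", "A", "B", "B", "B"),
  body_mass_g = c(3500, 3700, NA, 5000, 5200, 4800)
)

by_species <- toy |>
  group_by(species) |>
  summarize(
    Count          = n(),
    NA_Body_Mass   = sum(is.na(body_mass_g)),
    Mean_Body_Mass = mean(body_mass_g, na.rm = TRUE),
    SD_Body_Mass   = sd(body_mass_g, na.rm = TRUE),
    # SEM here divides by the number of non-missing values, not the group size
    SEM_Body_Mass  = SD_Body_Mass / sqrt(sum(!is.na(body_mass_g)))
  )

print(by_species)
```

The SEM values in the table above are consistent with dividing the SD by the square root of Count, which includes rows with a missing body mass. The sketch divides by the number of non-missing values instead, so it would give slightly larger SEMs for groups that contain NAs.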

summarizing, grouping, aggregating etc.
filter() function from dplyr simplifies row selection.
& or |.
select() extracts specific columns.
filter() selects rows based on conditions.
arrange() function in dplyr performs sorting.
desc() for descending order.
mutate() creates new variables.
group_by() groups data by one or more variables.
summarize() to compute group statistics.
sum(), mean().
group_by() in dplyr helps compute grouped summaries.
min(), max(), sd().
Figure 3: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.
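
The pivoting behaviour described in Figure 3 can be sketched with tidyr. The table `wide` and its column names are hypothetical, chosen only to make the reshaping visible.

```r
library(tidyr)

# Hypothetical wide table: one row per id, one column per year
wide <- data.frame(
  id    = c(1, 2),
  y2020 = c(10, 20),
  y2021 = c(11, 21)
)

# Wide -> long: the column names y2020/y2021 become values of `year`,
# and each id is repeated once per pivoted column
long <- pivot_longer(
  wide,
  cols      = c(y2020, y2021),
  names_to  = "year",
  values_to = "value"
)
print(long)

# Long -> wide: the inverse operation restores the original layout
wide_again <- pivot_wider(long, names_from = year, values_from = value)
print(wide_again)
```

Two rows and two pivoted columns give four rows in the long table, one per original cell.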
pivot_longer() converts wide data into long format.
pivot_wider() converts long data into wide format.
TRUE or FALSE values.
ifelse() statements.
x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
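
A short base R sketch of how logical vectors, the `&` and `|` operators, and `ifelse()` fit together, and how the same conditions drive row selection with `[]` and `subset()`. The vector `x` and the cut-off values are arbitrary.

```r
x <- c(1, 5, 10, 15, NA)

# Comparisons return logical vectors (TRUE, FALSE, or NA)
big   <- x > 4
small <- x < 12

both    <- big & small   # TRUE only where both conditions hold
either  <- big | small   # TRUE where at least one condition holds
not_big <- !big          # negation

# ifelse() picks a value per element; an NA condition gives an NA result
size <- ifelse(x > 4, "large", "small")

# The same conditions select rows of a data frame
df <- data.frame(x = x, size = size)
sel_bracket <- df[which(df$x > 4 & df$x < 12), ]  # which() drops the NA case
sel_subset  <- subset(df, x > 4 & x < 12)         # subset() also drops NA matches

print(sel_subset)
```

Wrapping the condition in `which()` matters with `[]`: a bare logical index containing NA would return an extra all-NA row, whereas `subset()` treats NA as "do not select".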
log(), sqrt(), and round().
mutate() in dplyr can create new transformed columns.
toupper(), tolower(), and str_replace() help clean text.
stringr package provides additional tools for text handling.
str_detect() finds matching patterns.
str_replace() replaces specific text parts.
factor() function converts character data to categorical format.
lubridate package simplifies date-time manipulation.
ymd(), mdy(), and hms() parse dates and times.
NA in R.
is.na() checks for missing values.
na.omit() removes rows with missing values from a dataset.
mutate() with replace_na() fills missing values.
left_join() keeps all rows from the first table.
inner_join() keeps only matching rows.
full_join() includes all rows from both tables.
write.csv() writes a data frame to a CSV file.
write_rds() saves data for use in R.
read_csv() and read_rds() reload saved files.
Computer lab 1, Feb 21
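
The join and missing-value functions listed above can be tried on two small tables. This is a sketch under assumptions: `people` and `scores` are hypothetical, and filling with 0 is an arbitrary choice for illustration, not a recommendation.

```r
library(dplyr)
library(tidyr)

# Hypothetical tables sharing the key column `id`
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
scores <- data.frame(id = c(2, 3, 4), score = c(80, NA, 95))

left  <- left_join(people, scores, by = "id")   # keeps ids 1, 2, 3
inner <- inner_join(people, scores, by = "id")  # keeps ids 2, 3
full  <- full_join(people, scores, by = "id")   # keeps ids 1, 2, 3, 4

# Rows without a match get NA in the joined columns; is.na() locates them
missing_score <- is.na(left$score)

# Drop every row that contains an NA ...
complete <- na.omit(left)

# ... or replace missing scores with a chosen value
filled <- left |>
  mutate(score = replace_na(score, 0))

print(filled)
```

Note that `left` contains two kinds of NA: id 1 has no matching row in `scores`, while id 3 matched a row whose score was already missing. After the join the two cases look identical.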
References