Skip to main content
Ajay Dhangar
EditReport

R Programming Notes

Welcome to your comprehensive guide to R. This modern, structured handbook is designed to take you from writing your first line of code to executing advanced data engineering and statistical workflows.

Introduction to the R Ecosystem​

R is more than just a programming language; it is a highly specialized environment designed for data analysis, statistical computing, and stunning graphics.

Why Use R?​

  • Built for Data: Unlike general-purpose languages, R's core data types are tailor-made for statistical modeling.
  • The Tidyverse Ecosystem: A powerful, cohesive collection of packages designed explicitly for data science.
  • Academic & Production Standard: The gold standard for reproducible research, statistical benchmarking, and reporting.

Core Ecosystem Pillars​

  • tidyverse β€” The definitive toolkit for data manipulation and visualization (dplyr, ggplot2, tidyr, readr).
  • data.table β€” High-performance, memory-efficient data manipulation for massive datasets.
  • shiny β€” The go-to framework for building interactive web applications directly in R.
  • tidymodels / caret β€” Modern, unified frameworks for machine learning pipelines.

Installation & Environment Setup​

  1. Install R: Download the base engine from the Comprehensive R Archive Network (CRAN).
  2. Install RStudio: Download the industry-standard IDE from Posit.

Essential Console Commands​

Get familiar with your workspace using these quick foundational commands:

Workspace Navigation
# Check your active R version
version

# Find out where R is looking for files (Working Directory)
getwd()

# Change your working directory to a specific path
setwd("~/my_project")

Managing Packages​

Packages expand R's capabilities. You only need to install a package once, but you must load it every time you start a new R session.

Package Management
# Download from CRAN
install.packages("tidyverse")

# Load into your current session
library(tidyverse)

Variables & Basic Data Types​

In R, we use the arrow operator (<-) for assignment. While = works, <- is the idiomatic standard that explicitly signals a directional value assignment.

Variables & Types
# Assigning values
age <- 42
user_name <- "Alice"
is_active <- TRUE

# Checking data types
class(age) # Returns: "numeric" (stored as a double by default)
class(user_name) # Returns: "character"
class(is_active) # Returns: "logical"

The 5 Core Data Types​

  • Numeric: Decimals/doubles (42.5) and explicit integers (42L).
  • Character: Text strings ("Hello R").
  • Logical: Boolean flags (TRUE or FALSE).
  • Factor: Categorical data with pre-defined levels (e.g., factor(c("Low", "Medium", "High"))).
  • Dates/POSIXct: Standard date formats and calendar times with timezones.

Operators & Expressions​

Operators
# 1. Arithmetic Operators
5 + 3 # Addition
10 / 2 # Division
3 ^ 2 # Exponentiation (3 squared)
11 %% 3 # Modulo (Remainder of division = 2)

# 2. Comparison Operators
5 > 3 # TRUE
age == 42 # TRUE

# 3. Logical Operators
TRUE & FALSE # Element-wise AND -> FALSE
TRUE | FALSE # Element-wise OR -> TRUE
!TRUE # NOT -> FALSE

Control Flow​

Control flow lets your code make decisions based on data conditions.

Conditional Logic (If-Else)​

If-Else Statements
score <- 85

if (score >= 90) {
print("Grade: A")
} else if (score >= 75) {
print("Grade: B")
} else {
print("Grade: C")
}

Pattern Matching (Switch)​

Use switch as a cleaner alternative to multi-layered if-else blocks when matching specific strings or values.

Switch Statements
day <- "Monday"

message <- switch(day,
"Monday" = "Back to the grind!",
"Friday" = "Weekend is almost here!",
"Mid-week blues..." # Default fallback value
)

print(message)

Loops​

While R favors vectorized operations over explicit loops, loops are essential for iterative tasks.

For & While Loops
# For Loop: Iterate over a defined sequence
for (i in 1:5) {
print(paste("Iteration:", i))
}

# While Loop: Repeat as long as a condition is met
count <- 0
while (count < 3) {
print(paste("Count is:", count))
count <- count + 1
}

Functions​

Functions allow you to write reusable blocks of code. R functions automatically return the last evaluated expression, but using return() explicitly makes your intent clearer to others.

Defining Functions
# Standard function with a default parameter
calculate_total <- function(price, tax_rate = 0.05) {
total <- price + (price * tax_rate)
return(total)
}

# Invoking the function
calculate_total(price = 100) # Uses default tax (105)
calculate_total(price = 100, tax_rate = 0.1) # Overrides default tax (110)

Scope Warning

Variables created inside a function stay inside the function. If you must modify a global variable from within a function, use the global assignment operator (<<-), though this should be used sparingly.

Core Data Structures​

R’s superpower lies in its native data structures. Mastering these is crucial for effective data manipulation.

1. Vectors​

The foundational building block of R. They are one-dimensional arrays that must contains the same data type. Created using the combine function c().

Vectors
numbers <- c(1, 2, 3, 4, 5)
fruits <- c("apple", "banana", "cherry")

# Vectorized Math: Operations are automatically applied to every element!
numbers * 2 # Returns: 2, 4, 6, 8, 10
mean(numbers) # Returns summary statistic: 3

2. Matrices​

Two-dimensional structures where all elements must be of the exact same data type.

Matrices
# Create a 3x3 grid using numbers 1 through 9
matrix_grid <- matrix(1:9, nrow = 3, ncol = 3)
print(matrix_grid)

3. Lists​

The chameleons of R. Lists can hold elements of different types and shapes, including other lists.

Lists
user_profile <- list(
name = "John",
age = 30,
scores = c(85, 90, 95)
)

# Access items using named dollar-sign notation or double brackets
user_profile$name # "John"
user_profile[[3]][1] # Accesses first element of scores: 85

4. Data Frames (and Tibbles)​

The most common structure for data analysis. A data frame is a list of vectors of equal length, creating a traditional table with rows and columns. Columns can be different data types.

Data Frames
employees <- data.frame(
name = c("Alice", "Bob"),
age = c(25, 30),
performance_score = c(88, 92),
stringsAsFactors = FALSE # Standard default in modern R
)

# View a snapshot of the top of the data frame
head(employees)

String Manipulation​

Base R String Toolkit​

Base R Strings
phrase <- "Data Science with R"

toupper(phrase) # UPPERCASE
tolower(phrase) # lowercase
nchar(phrase) # Count characters
strsplit(phrase, " ") # Split string into a list of words

Modern Text Processing with stringr​

The tidyverse stringr package offers clean, predictable functions that all start with str_.

stringr Examples
library(stringr)

text <- "Learn R Programming"

# Search for a pattern
str_detect(text, "Programming") # TRUE

# Replace a pattern
str_replace(text, "Learn", "Master") # "Master R Programming"

Data Input/Output (File Handling)​

Getting data into and out of R efficiently is the first step of any analytics pipeline.

Data I/O Workflows
# --- Base R Approach ---
# Read standard CSV
my_data <- read.csv("raw_data.csv")
# Export data safely without row index numbers
write.csv(my_data, "clean_output.csv", row.names = FALSE)


# --- Modern Tidyverse Approach (Faster & Smart Parsing) ---
library(readr)

# Reads as a user-friendly "tibble" data frame
my_data <- read_csv("raw_data.csv")
write_csv(my_data, "clean_output.csv")

Data Visualization​

1. Quick Exploratory Visuals (Base R)​

Great for instantaneous data checks directly from your console.

Base R Plots
# Scatter plot
plot(x = 1:10, y = (1:10)^2, type = "b", main = "Quadratic Growth", xlab = "X", ylab = "X^2")

# Distribution histogram
hist(rnorm(1000), breaks = 30, col = "skyblue", main = "Normal Distribution")

2. Publication-Ready Visuals (ggplot2)​

Built on the Grammar of Graphics, you build plots layer-by-layer: Data β†’\rightarrow Aesthetics (mapping variables) β†’\rightarrow Geometries (visual shapes).

ggplot2 Workflow
library(ggplot2)

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "steelblue", size = 2) + # Layer 1: Scatter points
geom_smooth(method = "lm", se = FALSE, color = "red") + # Layer 2: Linear trend line
labs(
title = "Fuel Efficiency vs. Vehicle Weight",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon"
) +
theme_minimal() # Layer 3: Clean presentation theme

Data Wrangling with dplyr​

The dplyr package uses the forward-pipe operator %>% (or the native R pipe |>) to chain analytical steps together in readable, logical sequences. Read it as saying "and then".

dplyr Pipelines
library(dplyr)

rank_report <- employees %>%
# 1. Filter rows based on a condition
filter(age >= 25) %>%

# 2. Keep only specific columns
select(name, performance_score) %>%

# 3. Sort by highest score down to lowest
arrange(desc(performance_score)) %>%

# 4. Create or modify columns based on calculations
mutate(grade = ifelse(performance_score >= 90, "Excellent", "Good"))

print(rank_report)

Statistical Analysis​

R was made by statisticians, for statisticians. Complex mathematical calculations take only a line or two.

Basic Modeling Workflow
# 1. Broad Descriptive Statistics Summary
summary(mtcars$mpg)

# 2. Calculate Pearson Correlation Coefficient
cor(mtcars$wt, mtcars$mpg)

# 3. Fit a Multiple Linear Regression Model (y ~ x1 + x2)
# Formula: Predict mpg based on weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# 4. View coefficients, p-values, and R-squared values
summary(fit)

Error Handling & Debugging​

Prevent your script from crashing when unexpected data anomalies occur.

Defensive Programming (tryCatch)​

Graceful Failure
tryCatch({
# Attempt an operation that might break
bad_calculation <- log("not a number")
}, error = function(e) {
# Execute fallback logic safely if an error triggers
message("⚠️ Warning: Could not calculate log. Returning NA instead.")
print(e$message)
})

Essential Debugging Toolkit​

  • traceback(): Run this right after a crash to see the exact function call stack where the code failed.
  • browser(): Insert this directly inside a custom function to pause code execution and open an interactive console environment inside the function scope.

Coding Best Practices​

  • Use Explicit Assignments: Always use <- for object storage and = exclusively for assigning function parameters.
  • Readable Naming: Use snake_case for your variables and function names (e.g., clean_customer_data <- ...).
  • Keep Code Clean: Put spaces around mathematical operators (+, -, ==, <-) to maximize scannability.
  • Leverage Notebooks: Use Quarto (.qmd) or R Markdown (.Rmd) to blend your code execution, text commentary, and visualization outputs into beautiful HTML or PDF reports.

Hands-On Practice & Exercises​

Exercise 1: Building a Dynamic Summary Function​

Goal: Create a user-defined function that takes a numeric vector, strips out missing data (NA), and returns both the mean and standard deviation packaged inside a labeled list.

calc_stats <- function(vec) {
clean_vec <- na.omit(vec)

results <- list(
avg = mean(clean_vec),
st_dev = sd(clean_vec)
)

return(results)
}

# Test it out
calc_stats(c(10, 20, 30, NA, 40))

Exercise 2: Exploratory Case Study using iris​

Goal: Load R's native historical iris flower dataset, view a statistical digest of its metrics, and isolate structural patterns visually using ggplot2.

Case Study Execution
# Load built-in data
data(iris)

# Get baseline dimensions and metric summaries
summary(iris)

# Map features to explore differences across classifications
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 2.5, alpha = 0.8) +
labs(
title = "Fisher's Iris Dataset Analysis",
x = "Sepal Length",
y = "Sepal Width"
) +
theme_classic()

References & Resources​

Finished reading? Mark this topic as complete.