01 - Welcome & Setup

Instructor and Agenda Introduction

Welcome to Bioinformatics Analysis of RNA-seq Data! This course is designed for researchers who want to master the computational analysis of RNA sequencing data. Whether you're studying gene expression changes, differential expression analysis, or pathway enrichment, you'll learn to transform raw sequencing data into meaningful biological insights using Python and specialized bioinformatics tools.

My name is Victor Gambarini. I have been working with Python for more than ten years. I got my PhD from The University of Auckland, where I did extensive bioinformatics and programming work. One of my main research outputs is PlasticDB, an online database of microorganisms that can biodegrade plastics. PlasticDB was coded entirely in Python and has been running stably for many years. Throughout my research, I've analyzed countless RNA-seq datasets, from microbial communities to human samples, and I'm excited to share these practical skills with you!

About This Course

Over the next five days, we'll build your RNA-seq analysis skills from raw data to publication-ready results. You'll learn not just the theory, but hands-on computational approaches that researchers use in real projects. This course emphasizes practical implementation - we'll work with actual RNA-seq datasets and build complete analysis pipelines. Practice is crucial in bioinformatics, so we'll build lots of computational muscle memory!

Course Structure

Welcome & Setup - Local environment and Jupyter Lab setup
RNA-seq Fundamentals - Understanding the biology and data
Quality Control & Preprocessing - FastQC, trimming, and filtering
Read Alignment - Mapping reads to reference genomes
Count Matrix Generation - From alignments to gene counts
Exploratory Data Analysis - PCA, clustering, and visualization
Differential Expression Analysis - Statistical testing with DESeq2/edgeR
Multiple Testing Correction - Controlling false discovery rates
Functional Annotation - Gene ontology and pathway analysis
Advanced Visualization - Heatmaps, volcano plots, and more
Pipeline Integration - Workflow management and reproducibility
Case Study Project - Complete analysis of a real dataset
Additional Resources - Tools, databases, and best practices

What You'll Accomplish

By the end of this course, you'll be able to:

Process raw RNA-seq data from FASTQ to count matrices
Perform comprehensive quality control and preprocessing
Conduct statistical differential expression analysis
Create publication-quality visualizations
Interpret results in biological context
Build reproducible analysis pipelines
Apply best practices for bioinformatics workflows

Prerequisites

Basic Python knowledge is recommended (variables, lists, functions). Familiarity with the following is also helpful:

Command line basics
Basic statistics concepts
Molecular biology fundamentals (genes, transcription, etc.)

You'll need:

A computer with at least 8GB RAM (16GB recommended)
Admin privileges to install software
Stable internet connection for downloads

Why Use Local Jupyter Lab

What is Jupyter Lab?

Jupyter Lab is a powerful, interactive development environment that runs locally on your computer. It provides a web-based interface for creating notebooks, managing files, and running code. For bioinformatics, it offers the perfect balance of interactivity and computational power.

Why Local Environment for RNA-seq Analysis

1. Computational Requirements

RNA-seq datasets are large (GBs to TBs)
Analysis requires substantial memory and processing power
Local control over computational resources
No session time limits or cloud restrictions

2. Data Security and Privacy

Sensitive genomic data stays on your machine
No upload limitations or privacy concerns
Full control over data access and sharing
Compliance with institutional data policies

3. Tool Ecosystem

Access to specialized bioinformatics software
Easy integration with command-line tools
Custom environment configuration
Installation of specific package versions

4. Long-Running Analyses

Alignment and processing can take hours/days
No session timeouts or disconnections
Background processing capabilities
Persistent data storage

5. Real-World Workflow

Mirrors professional bioinformatics environments
Better preparation for research computing
Integration with HPC clusters
Industry-standard practices

Jupyter Lab vs. Other Options

Feature	Jupyter Lab	Google Colab	RStudio
Data Size Limits	None	25GB	None
Runtime Limits	None	12 hours	None
Custom Software	Full control	Limited	R-focused
Privacy	Complete	Cloud-based	Complete
Computational Power	Your hardware	Limited free tier	Your hardware
Bioinformatics Tools	Full ecosystem	Limited	R packages

Installing Jupyter Lab and Dependencies

Step 1: Install Python and Conda

We'll use Anaconda or Miniconda for package management:

Option A: Anaconda (Recommended for beginners)

Download from anaconda.com
Run the installer for your operating system
Follow installation prompts (accept defaults)
Restart your terminal/command prompt

Option B: Miniconda (Lightweight alternative)

Download from docs.conda.io/en/latest/miniconda.html
Install following the same process as Anaconda

Step 2: Create a Bioinformatics Environment

Open your terminal/command prompt and create a dedicated environment:

```bash
# Create new environment
conda create -n rnaseq python=3.11

# Activate the environment
conda activate rnaseq

# Install Jupyter Lab
conda install jupyterlab

# Install essential packages
conda install pandas numpy matplotlib seaborn scipy
conda install -c bioconda biopython pysam
conda install -c conda-forge scanpy
```

Step 3: Install Additional Bioinformatics Tools

```bash
# Quality control tools
conda install -c bioconda fastqc multiqc

# Alignment tools
conda install -c bioconda star hisat2 bowtie2

# Count and analysis tools
conda install -c bioconda subread htseq
conda install -c bioconda samtools bedtools

# R integration (for DESeq2)
conda install r-base r-essentials
conda install -c bioconda bioconductor-deseq2
```

Step 4: Launch Jupyter Lab

```bash
# Make sure your environment is activated
conda activate rnaseq

# Launch Jupyter Lab
jupyter lab
```

Your default web browser should open the Jupyter Lab interface at http://localhost:8888
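
Once Jupyter Lab is open, it's worth confirming that your notebook is really running inside the rnaseq environment. Here is a minimal sketch you can paste into a notebook cell. The function name describe_environment is just an illustration, and the check assumes Jupyter Lab was started from a terminal where conda activate had been run (that is what sets the CONDA_DEFAULT_ENV variable).

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter running this notebook."""
    return {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        # CONDA_DEFAULT_ENV is set by `conda activate`; it may be missing
        # if Jupyter Lab was started some other way.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "unknown"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If conda_env does not say rnaseq, or the executable path does not point inside your rnaseq environment, close Jupyter Lab, activate the environment, and launch it again.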

Jupyter Lab Interface Tour

Main Interface Components

1. File Browser (Left Sidebar)

Purpose: Navigate your file system
Features: Create folders, upload files, rename items
Tip: Right-click for context menu options

2. Launcher Tab

Purpose: Create new notebooks, terminals, and files
Options:
Python 3 notebook
Terminal
Text file
Markdown file

3. Notebook Interface

Code Cells: Execute Python code
Markdown Cells: Add documentation and explanations
Output: Results appear below cells

Creating Your First Notebook

Click "Python 3" under "Notebook" in the Launcher
A new notebook opens with an empty code cell
Rename by right-clicking the tab and selecting "Rename"

Essential Jupyter Lab Features

Cell Types

Code Cell: Run Python code (default)
Markdown Cell: Format text with Markdown
Raw Cell: Plain text (rarely used)

Keyboard Shortcuts

Shift + Enter: Run cell and move to next
Ctrl + Enter: Run cell and stay
A: Insert cell above
B: Insert cell below
DD: Delete cell
M: Change to Markdown cell
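Y: Change to Code cell

The single-letter shortcuts (A, B, DD, M, Y) only work in command mode. Press Esc to leave the cell editor first.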

Magic Commands

Time execution: %time code_here
Time multiple runs: %timeit code_here
Run shell commands: !ls -la
Load external Python files: %load filename.py
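
Magic commands only work inside Jupyter or IPython. If you ever need the same kind of timing in a plain Python script, the standard library's timeit module does roughly what %timeit does. The sketch below is only an illustration: gc_content and the example read are made up for the demo.

```python
import timeit


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


read = "ATGCGCGTATTAGCGC" * 10

# Roughly what `%timeit gc_content(read)` does: call the function many
# times and report the average time per call.
n_calls = 1000
total_seconds = timeit.timeit(lambda: gc_content(read), number=n_calls)
print(f"{total_seconds / n_calls * 1e6:.2f} microseconds per call")
```

In a notebook you would simply write %timeit gc_content(read) instead of the last three lines.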

Setting Up Project Structure

Creating a Bioinformatics Project

Let's create a well-organized project structure:

Create project folder in Jupyter Lab file browser
Right-click in the file browser and select "New Folder"
Name it: RNA_seq_analysis_2025

Recommended Directory Structure

RNA_seq_analysis_2025/
├── data/
│   ├── raw/            # Original FASTQ files
│   ├── processed/      # Quality-controlled data
│   ├── aligned/        # Alignment results
│   └── counts/         # Count matrices
├── results/
│   ├── qc/             # Quality control reports
│   ├── figures/        # Generated plots
│   └── tables/         # Result tables
├── scripts/
│   ├── notebooks/      # Analysis notebooks
│   └── python/         # Standalone Python scripts
├── references/
│   ├── genome/         # Reference genome files
│   └── annotation/     # Gene annotation files
└── README.md           # Project documentation

Create the Structure

Open a new terminal in Jupyter Lab (File > New > Terminal) and run:

```bash
cd RNA_seq_analysis_2025

# Create directories
mkdir -p data/{raw,processed,aligned,counts}
mkdir -p results/{qc,figures,tables}
mkdir -p scripts/{notebooks,python}
mkdir -p references/{genome,annotation}

# Create README file
touch README.md
```
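
If you'd rather stay in Python (for example, from a notebook cell), the same layout can be created with pathlib. This is a sketch, not part of the official course material: LAYOUT and create_project are names chosen for the example, and the demo writes into a temporary directory so nothing touches your real project until you pass your own path.

```python
import tempfile
from pathlib import Path

# Same layout as the recommended directory structure.
LAYOUT = {
    "data": ["raw", "processed", "aligned", "counts"],
    "results": ["qc", "figures", "tables"],
    "scripts": ["notebooks", "python"],
    "references": ["genome", "annotation"],
}


def create_project(root):
    """Create the project folders and an empty README.md under `root`."""
    root = Path(root)
    for parent, children in LAYOUT.items():
        for child in children:
            (root / parent / child).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch(exist_ok=True)
    return root


# Demo in a throwaway directory; use your real project path when ready.
with tempfile.TemporaryDirectory() as tmp:
    project_root = create_project(Path(tmp) / "RNA_seq_analysis_2025")
    created = sorted(
        p.relative_to(project_root).as_posix() for p in project_root.rglob("*")
    )
    print("\n".join(created))
```

Because every mkdir call uses exist_ok=True, running the function a second time on an existing project is harmless.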

Download Practice Dataset

We'll work with a subset of a real RNA-seq dataset throughout the course.

Step 1: Create Download Script

Create a new notebook called 01_data_download.ipynb in the scripts/notebooks/ folder:


import urllib.request
import os

# Create data directories if they don't exist
os.makedirs('../../data/raw', exist_ok=True)

# Download sample FASTQ files (small subset for practice)
# NOTE: base_url is a placeholder; the real link will be provided in class
base_url = "https://example-rnaseq-data.com/samples/"
samples = ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
           "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]
for sample in samples:
    url = base_url + sample
    filepath = f"../../data/raw/{sample}"
    print(f"Downloading {sample}...")
    # Uncomment once base_url points to the real dataset
    # urllib.request.urlretrieve(url, filepath)
    print(f"Saved to {filepath}")
print("Download complete!")
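
Large downloads sometimes get cut off, so it's a good habit to verify each file against a checksum when the data provider publishes one. The sketch below is one way to do that with the standard library. The helper names file_checksum and verify_download are made up for this example, and SHA-256 is an assumption: some archives publish MD5 sums instead, in which case you can pass algorithm="md5".

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a file's checksum, reading it in chunks to save memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected, algorithm="sha256"):
    """Return True if the file exists and its checksum matches `expected`."""
    path = Path(path)
    return path.is_file() and file_checksum(path, algorithm) == expected.lower()


# Demo with a throwaway file; in practice `expected` comes from the provider.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "sample1_R1.fastq.gz"
    demo_file.write_bytes(b"not a real FASTQ, just demo bytes")
    expected = file_checksum(demo_file)
    ok = verify_download(demo_file, expected)
    missing_ok = verify_download(Path(tmp) / "missing.fastq.gz", expected)

print(f"existing file verified: {ok}, missing file verified: {missing_ok}")
```

Reading in chunks matters here because FASTQ files can be many gigabytes, far more than you want to load into memory at once.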

Step 2: Download Reference Files


import os

# Download reference genome and annotation
# NOTE: ref_url is a placeholder; the real link will be provided in class
ref_url = "https://example-references.com/"
references = {
    "genome.fa": "references/genome/",
    "genes.gtf": "references/annotation/"
}
for filename, folder in references.items():
    os.makedirs(f"../../{folder}", exist_ok=True)
    # Download code here
    print(f"Downloaded {filename} to {folder}")

Note: We'll use publicly available test datasets. Real download links will be provided in class.
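
After a download finishes, a quick look at the first few records tells you whether the file is a readable FASTQ. Here is a minimal sketch using only the standard library. The function name read_fastq_records is hypothetical, and the code assumes the common four-lines-per-record layout, so it is a sanity check rather than a full parser. The demo builds its own tiny file so you can run it before the real data arrives.

```python
import gzip
import tempfile
from pathlib import Path


def read_fastq_records(path, limit=None):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        count = 0
        while limit is None or count < limit:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # separator line starting with "+"
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"Malformed FASTQ record near: {header!r}")
            yield header, sequence, quality
            count += 1


# Demo: write a two-read file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.fastq.gz"
    with gzip.open(demo_path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
    records = list(read_fastq_records(demo_path, limit=2))

for header, sequence, quality in records:
    print(header, sequence, quality)
```

For real analyses we'll rely on dedicated tools; this is just a quick way to catch truncated or mislabelled files early.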

Environment Testing

Test Your Installation

Create a new notebook called 00_environment_test.ipynb:


# Test basic Python packages
import sys
print(f"Python version: {sys.version}")
import pandas as pd
print(f"Pandas version: {pd.__version__}")
import numpy as np
print(f"NumPy version: {np.__version__}")
import matplotlib.pyplot as plt
print("Matplotlib imported successfully")
import seaborn as sns
print(f"Seaborn version: {sns.__version__}")

# Test bioinformatics packages
try:
    import Bio
    print(f"Biopython version: {Bio.__version__}")
except ImportError:
    print("Biopython not installed")
try:
    import pysam
    print(f"Pysam version: {pysam.__version__}")
except ImportError:
    print("Pysam not installed")

# Test external tools
import subprocess
tools = ['fastqc', 'multiqc', 'STAR', 'hisat2']
for tool in tools:
    try:
        result = subprocess.run([tool, '--version'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"{tool}: Available")
        else:
            print(f"{tool}: Not found")
    except FileNotFoundError:
        print(f"{tool}: Not found")

Expected Output

Python version: 3.11.x
Pandas version: 2.x.x
NumPy version: 1.x.x
Matplotlib imported successfully
Seaborn version: 0.x.x
Biopython version: 1.x.x
Pysam version: 0.x.x
fastqc: Available
multiqc: Available
STAR: Available
hisat2: Available
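
If you want a version report without importing every package, the standard library can look up installed versions and locate command-line tools directly. The sketch below is an alternative to the test notebook, not a replacement for it. The function names package_versions and tool_paths are made up for this example. Note that the lookup uses distribution names, so Biopython is listed as biopython even though you import it as Bio.

```python
import shutil
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def tool_paths(names):
    """Map each command-line tool to its location on PATH, or None."""
    return {name: shutil.which(name) for name in names}


report = package_versions(
    ["pandas", "numpy", "matplotlib", "seaborn", "biopython", "pysam"]
)
tools = tool_paths(["fastqc", "multiqc", "STAR", "hisat2"])

for name, value in {**report, **tools}.items():
    print(f"{name}: {value if value else 'not found'}")
```

Anything reported as not found is either missing from the rnaseq environment or installed in a different one.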

Best Practices for RNA-seq Analysis

1. Documentation and Reproducibility

Document everything: Use markdown cells extensively
Version control: We'll set up Git in a later session
Environment tracking: Export conda environments
Parameter logging: Record all analysis parameters
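
To make parameter logging concrete, here is a minimal sketch that bundles a step's settings with a timestamp and some environment details. Everything specific in it is an assumption for illustration: the function name build_run_record, the step name, and the parameter values are all made up.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(step, parameters):
    """Bundle analysis parameters with when and where they were used."""
    return {
        "step": step,
        "parameters": parameters,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Hypothetical step name and settings, for illustration only.
record = build_run_record("trimming", {"min_quality": 20, "min_length": 36})
print(json.dumps(record, indent=2))
```

Saving one such JSON record next to each result file (for example, under results/) makes it much easier to reconstruct later exactly how a figure or table was produced.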

2. Data Management

Raw data is sacred: Never modify original files
Intermediate files: Keep processing intermediates
Backup important results: Multiple copies of key outputs
File naming: Use consistent, descriptive names
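
One simple way to enforce "raw data is sacred" is to remove write permission from the original files as soon as they are downloaded. The sketch below shows the idea. The helpers make_read_only and protect_raw_data are names invented for this example, and the demo works on a throwaway file rather than your real data.

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(path):
    """Remove write permission for user, group, and others."""
    path = Path(path)
    current = path.stat().st_mode
    path.chmod(current & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def protect_raw_data(raw_dir, pattern="*.fastq.gz"):
    """Make every matching file in `raw_dir` read-only; return their names."""
    protected = []
    for fastq in sorted(Path(raw_dir).glob(pattern)):
        make_read_only(fastq)
        protected.append(fastq.name)
    return protected


# Demo on an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    raw_dir.mkdir(parents=True)
    (raw_dir / "sample1_R1.fastq.gz").write_bytes(b"")
    protected = protect_raw_data(raw_dir)
    mode_after = (raw_dir / "sample1_R1.fastq.gz").stat().st_mode
    owner_can_write = bool(mode_after & stat.S_IWUSR)

print(f"protected: {protected}, owner can write: {owner_can_write}")
```

This guards against accidental overwrites, not deliberate ones, so you still want a backup of the original files.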

3. Quality Control

QC at every step: Raw data, processed data, results
Visual inspection: Always plot your data
Sanity checks: Verify results make biological sense
Negative controls: Include where appropriate

4. Computational Efficiency

Resource monitoring: Watch memory and CPU usage
Parallel processing: Use multiple cores when possible
Checkpointing: Save intermediate results
Code optimization: Profile slow steps
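
Checkpointing can be as simple as "skip this step if its output already exists". Here is a minimal sketch of that pattern. The function run_with_checkpoint and the fake expensive_step are hypothetical names for the demo, and the sketch assumes the step produces text that fits in memory.

```python
import tempfile
from pathlib import Path


def run_with_checkpoint(output_path, compute):
    """Run `compute` only if `output_path` does not exist yet.

    `compute` is any function that returns the text to save.
    Returns True if the step ran, False if the saved output was reused.
    """
    output_path = Path(output_path)
    if output_path.exists():
        return False
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name first so an interrupted run
    # never leaves a half-written checkpoint behind.
    partial = output_path.with_name(output_path.name + ".partial")
    partial.write_text(compute())
    partial.replace(output_path)
    return True


calls = []


def expensive_step():
    """Stand-in for a slow analysis step."""
    calls.append(1)
    return "gene_id\tcount\ngeneA\t10\n"


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "results" / "tables" / "counts.tsv"
    first_run = run_with_checkpoint(output, expensive_step)
    second_run = run_with_checkpoint(output, expensive_step)

print(f"first run executed: {first_run}, second run executed: {second_run}")
```

Workflow managers handle this far more thoroughly, but the same idea sits underneath them: a step is skipped when its outputs are already in place.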

Troubleshooting Common Issues

Installation Problems

Conda conflicts:

```bash
# Clear conda cache
conda clean --all

# Update conda
conda update conda

# Create fresh environment
conda create -n rnaseq_new python=3.11
```

Package conflicts:

```bash
# Use mamba (faster conda alternative)
conda install -c conda-forge mamba
mamba install package_name
```

Jupyter Lab Issues

Lab won't start:

```bash
# Check if jupyter is installed
jupyter --version

# Try different port
jupyter lab --port=8889
```

Kernel problems:

```bash
# Install kernel for your environment
python -m ipykernel install --user --name rnaseq
```

Memory Issues

Large datasets:

Process data in chunks
Use memory-efficient formats (HDF5, Parquet)
Monitor memory usage with htop or Activity Monitor
Consider using a high-memory machine or cluster
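
"Process data in chunks" means never holding the whole file in memory at once. The sketch below sums each sample's counts from a tab-separated table one row at a time, using only the standard library. The function name column_totals and the tiny in-memory table are made up for the demo, and the code assumes the first column holds gene IDs.

```python
import csv
import io
from collections import Counter


def column_totals(handle, delimiter="\t"):
    """Sum each sample column of a count table, one row at a time."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold gene IDs
    totals = Counter()
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            totals[sample] += int(value)
    return dict(totals)


# A tiny in-memory table stands in for a large file on disk.
demo = io.StringIO("gene_id\tsample1\tsample2\ngeneA\t10\t0\ngeneB\t5\t7\n")
totals = column_totals(demo)
print(totals)
```

With a real file you would pass an open file handle instead of the StringIO object. pandas offers the same idea through the chunksize parameter of read_csv, which yields the table in pieces rather than all at once.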

Quick Reference

Essential Conda Commands

```bash
# Environment management
conda create -n envname python=3.11
conda activate envname
conda deactivate
conda env list

# Package management
conda install package_name
conda install -c channel package_name
conda update package_name
conda list

# Environment export/import
conda env export > environment.yml
conda env create -f environment.yml
```

Jupyter Lab Shortcuts

Shift + Enter: Run cell and advance
Ctrl + Enter: Run cell in place
A: Insert cell above
B: Insert cell below
DD: Delete cell
M: Markdown cell
Y: Code cell
Ctrl + S: Save notebook

Getting Help

In Jupyter: help(function) or ?function
Source code: ??function shows the source when it is available
Package docs: Most have online documentation
Bioinformatics: Biostars, SEQanswers
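
The ? and ?? shortcuts only exist in Jupyter and IPython. In a plain Python script, the inspect module gives you roughly the same information. This is a small sketch using json.dumps as the example function; any function written in Python would work the same way.

```python
import inspect
import json

# help(json.dumps) prints the full help page; getdoc returns the docstring.
doc = inspect.getdoc(json.dumps)
print(doc.splitlines()[0])

# Similar to ?json.dumps: show the call signature.
signature = inspect.signature(json.dumps)
print(signature)

# Similar to ??json.dumps: show the source code. This works for functions
# written in Python, not for those implemented in C.
source = inspect.getsource(json.dumps)
print(source.splitlines()[0])
```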

What's Next?

In our next session, we'll dive into RNA-seq fundamentals - understanding the biology, data formats, and quality metrics. Make sure you can successfully launch Jupyter Lab, create notebooks, and have all required packages installed before we continue.

Homework:

1. Complete the environment test notebook
2. Familiarize yourself with the Jupyter Lab interface
3. Review basic molecular biology concepts (transcription, gene expression)
4. Read about FASTQ file format if unfamiliar
