01 - Welcome & Setup
Instructor and Agenda Introduction
Welcome to Bioinformatics Analysis of RNA-seq Data! This course is designed for researchers who want to master the computational analysis of RNA sequencing data. Whether you're studying gene expression changes, differential expression analysis, or pathway enrichment, you'll learn to transform raw sequencing data into meaningful biological insights using Python and specialized bioinformatics tools.
My name is Victor Gambarini. I have been working with Python for more than ten years. I got my PhD from The University of Auckland, where I did extensive bioinformatics and programming. One of my greatest research outputs is PlasticDB, an online database of microorganisms that can biodegrade plastics. PlasticDB was 100% coded in Python and has been running stably for many years. Throughout my research, I've analyzed countless RNA-seq datasets, from microbial communities to human samples, and I'm excited to share these practical skills with you!
About This Course
Over the next five days, we'll build your RNA-seq analysis skills from raw data to publication-ready results. You'll learn not just the theory, but hands-on computational approaches that researchers use in real projects. This course emphasizes practical implementation - we'll work with actual RNA-seq datasets and build complete analysis pipelines. Practice is crucial in bioinformatics, so we'll build lots of computational muscle memory!
Course Structure
- Welcome & Setup - Local environment and Jupyter Lab setup
- RNA-seq Fundamentals - Understanding the biology and data
- Quality Control & Preprocessing - FastQC, trimming, and filtering
- Read Alignment - Mapping reads to reference genomes
- Count Matrix Generation - From alignments to gene counts
- Exploratory Data Analysis - PCA, clustering, and visualization
- Differential Expression Analysis - Statistical testing with DESeq2/edgeR
- Multiple Testing Correction - Controlling false discovery rates
- Functional Annotation - Gene ontology and pathway analysis
- Advanced Visualization - Heatmaps, volcano plots, and more
- Pipeline Integration - Workflow management and reproducibility
- Case Study Project - Complete analysis of a real dataset
- Additional Resources - Tools, databases, and best practices
What You'll Accomplish
By the end of this course, you'll be able to:
- Process raw RNA-seq data from FASTQ to count matrices
- Perform comprehensive quality control and preprocessing
- Conduct statistical differential expression analysis
- Create publication-quality visualizations
- Interpret results in biological context
- Build reproducible analysis pipelines
- Apply best practices for bioinformatics workflows
Prerequisites
Basic Python knowledge is recommended (variables, lists, functions). Familiarity with the following is also helpful:
- Command line basics
- Basic statistics concepts
- Molecular biology fundamentals (genes, transcription, etc.)
You'll need:
- A computer with at least 8GB RAM (16GB recommended)
- Admin privileges to install software
- Stable internet connection for downloads
Why Use Local Jupyter Lab
What is Jupyter Lab?
Jupyter Lab is a powerful, interactive development environment that runs locally on your computer. It provides a web-based interface for creating notebooks, managing files, and running code. For bioinformatics, it offers the perfect balance of interactivity and computational power.
Why Local Environment for RNA-seq Analysis
1. Computational Requirements
- RNA-seq datasets are large (GBs to TBs)
- Analysis requires substantial memory and processing power
- Local control over computational resources
- No session time limits or cloud restrictions
2. Data Security and Privacy
- Sensitive genomic data stays on your machine
- No upload limitations or privacy concerns
- Full control over data access and sharing
- Compliance with institutional data policies
3. Tool Ecosystem
- Access to specialized bioinformatics software
- Easy integration with command-line tools
- Custom environment configuration
- Installation of specific package versions
4. Long-Running Analyses
- Alignment and processing can take hours/days
- No session timeouts or disconnections
- Background processing capabilities
- Persistent data storage
5. Real-World Workflow
- Mirrors professional bioinformatics environments
- Better preparation for research computing
- Integration with HPC clusters
- Industry-standard practices
Jupyter Lab vs. Other Options
Feature | Jupyter Lab | Google Colab | RStudio |
---|---|---|---|
Data Size Limits | None | 25GB | None |
Runtime Limits | None | 12 hours | None |
Custom Software | Full control | Limited | R-focused |
Privacy | Complete | Cloud-based | Complete |
Computational Power | Your hardware | Limited free tier | Your hardware |
Bioinformatics Tools | Full ecosystem | Limited | R packages |
Installing Jupyter Lab and Dependencies
Step 1: Install Python and Conda
We'll use Anaconda or Miniconda for package management:
Option A: Anaconda (Recommended for beginners)
- Download from anaconda.com
- Run the installer for your operating system
- Follow installation prompts (accept defaults)
- Restart your terminal/command prompt
Option B: Miniconda (Lightweight alternative)
- Download from docs.conda.io/en/latest/miniconda.html
- Install following the same process as Anaconda
Step 2: Create a Bioinformatics Environment
Open your terminal/command prompt and create a dedicated environment:
```bash
# Create new environment
conda create -n rnaseq python=3.11

# Activate the environment
conda activate rnaseq

# Install Jupyter Lab
conda install jupyterlab

# Install essential packages
conda install pandas numpy matplotlib seaborn scipy
conda install -c bioconda biopython pysam
conda install -c conda-forge scanpy
```
Step 3: Install Additional Bioinformatics Tools
```bash
# Quality control tools
conda install -c bioconda fastqc multiqc

# Alignment tools
conda install -c bioconda star hisat2 bowtie2

# Count and analysis tools
conda install -c bioconda subread htseq
conda install -c bioconda samtools bedtools

# R integration (for DESeq2)
conda install r-base r-essentials
conda install -c bioconda bioconductor-deseq2
```
Step 4: Launch Jupyter Lab
```bash
# Make sure your environment is activated
conda activate rnaseq

# Launch Jupyter Lab
jupyter lab
```
Your default web browser should open the Jupyter Lab interface at http://localhost:8888.
Jupyter Lab Interface Tour
Main Interface Components
1. File Browser (Left Sidebar)
- Purpose: Navigate your file system
- Features: Create folders, upload files, rename items
- Tip: Right-click for context menu options
2. Launcher Tab
- Purpose: Create new notebooks, terminals, and files
- Options:
  - Python 3 notebook
  - Terminal
  - Text file
  - Markdown file
3. Notebook Interface
- Code Cells: Execute Python code
- Markdown Cells: Add documentation and explanations
- Output: Results appear below cells
Creating Your First Notebook
- Click "Python 3" under "Notebook" in the Launcher
- A new notebook opens with an empty code cell
- Rename by right-clicking the tab and selecting "Rename"
Essential Jupyter Lab Features
Cell Types
- Code Cell: Run Python code (default)
- Markdown Cell: Format text with Markdown
- Raw Cell: Plain text (rarely used)
Keyboard Shortcuts
- Shift + Enter: Run cell and move to next
- Ctrl + Enter: Run cell and stay
- A: Insert cell above
- B: Insert cell below
- DD: Delete cell
- M: Change to Markdown cell
- Y: Change to Code cell
Magic Commands
```python
# Time execution
%time code_here

# Time multiple runs
%timeit code_here

# Run shell commands
!ls -la

# Load external Python files
%load filename.py
```
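The `%time` and `%timeit` magics only work inside IPython and Jupyter. In a standalone script, the standard-library `timeit` module gives comparable measurements. The following is a minimal sketch; the `gc_content` function and the test sequence are hypothetical examples chosen for illustration, not part of the course material.

```python
import timeit


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a sequence."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical test sequence of 10,000 bases
sequence = "ACGT" * 2500

# Call the function 1,000 times and report the average time per call
n_runs = 1000
total_seconds = timeit.timeit(lambda: gc_content(sequence), number=n_runs)
print(f"Average per call: {total_seconds / n_runs * 1e6:.1f} microseconds")
```

Timings vary between machines and runs, so treat the numbers as rough guides when comparing two implementations.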
Setting Up Project Structure
Creating a Bioinformatics Project
Let's create a well-organized project structure:
- Create project folder in Jupyter Lab file browser
- Right-click in the file browser and select "New Folder"
- Name it: RNA_seq_analysis_2025
Recommended Directory Structure
RNA_seq_analysis_2025/
├── data/
│ ├── raw/ # Original FASTQ files
│ ├── processed/ # Quality-controlled data
│ ├── aligned/ # Alignment results
│ └── counts/ # Count matrices
├── results/
│ ├── qc/ # Quality control reports
│ ├── figures/ # Generated plots
│ └── tables/ # Result tables
├── scripts/
│ ├── notebooks/ # Analysis notebooks
│ └── python/ # Standalone Python scripts
├── references/
│ ├── genome/ # Reference genome files
│ └── annotation/ # Gene annotation files
└── README.md # Project documentation
Create the Structure
Open a new terminal in Jupyter Lab (File > New > Terminal) and run:
```bash
cd RNA_seq_analysis_2025

# Create directories
mkdir -p data/{raw,processed,aligned,counts}
mkdir -p results/{qc,figures,tables}
mkdir -p scripts/{notebooks,python}
mkdir -p references/{genome,annotation}

# Create README file
touch README.md
```
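The same layout can also be created from a notebook cell, which helps on systems where the shell does not support brace expansion (for example, the Windows command prompt). This is a minimal sketch using `pathlib`; the function name `create_project_structure` is illustrative.

```python
from pathlib import Path

# Sub-directories taken from the recommended structure
PROJECT_DIRS = [
    "data/raw", "data/processed", "data/aligned", "data/counts",
    "results/qc", "results/figures", "results/tables",
    "scripts/notebooks", "scripts/python",
    "references/genome", "references/annotation",
]


def create_project_structure(root):
    """Create the project folders and an empty README.md under root."""
    root = Path(root)
    for sub in PROJECT_DIRS:
        # parents=True creates intermediate folders; exist_ok allows re-running
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch(exist_ok=True)
    return root
```

Calling `create_project_structure("RNA_seq_analysis_2025")` builds the tree relative to the current working directory. Running it a second time leaves existing folders and files untouched.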
Download Practice Dataset
We'll work with a subset of a real RNA-seq dataset throughout the course.
Step 1: Create Download Script
Create a new notebook called 01_data_download.ipynb in the scripts/notebooks/ folder:
```python
import os
import urllib.request

# Create data directories if they don't exist
os.makedirs('../../data/raw', exist_ok=True)

# Download sample FASTQ files (small subset for practice)
# NOTE: base_url is a placeholder; real links will be provided in class
base_url = "https://example-rnaseq-data.com/samples/"
samples = ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
           "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"]

for sample in samples:
    url = base_url + sample
    filepath = f"../../data/raw/{sample}"
    print(f"Downloading {sample}...")
    # Uncomment once base_url points to a real location:
    # urllib.request.urlretrieve(url, filepath)
    print(f"Saved to {filepath}")

print("Download complete!")
```
Step 2: Download Reference Files
```python
import os

# Download reference genome and annotation
# NOTE: ref_url is a placeholder; real links will be provided in class
ref_url = "https://example-references.com/"
references = {
    "genome.fa": "references/genome/",
    "genes.gtf": "references/annotation/"
}

for filename, folder in references.items():
    os.makedirs(f"../../{folder}", exist_ok=True)
    # Download code here
    print(f"Downloaded {filename} to {folder}")
```
Note: We'll use publicly available test datasets. Real download links will be provided in class.
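Once the real files are downloaded, it is worth checking that each one is a readable gzip archive with complete four-line FASTQ records before moving on. The following is a minimal sketch using only the standard library; `count_fastq_reads` is an illustrative helper, not a function from any package used in this course.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, checking the basic record layout.

    Each record has four lines: '@' header, sequence, '+' separator, qualities.
    Raises ValueError if a record is malformed or the file is truncated.
    """
    n_reads = 0
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            seq = handle.readline().rstrip("\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"Malformed record {n_reads + 1} in {path}")
            if len(seq) != len(qual):
                raise ValueError(f"Sequence/quality length mismatch in record {n_reads + 1}")
            n_reads += 1
    return n_reads
```

For paired-end data, the R1 and R2 files of a sample should report the same number of reads, so comparing the two counts is a quick way to spot an incomplete download.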
Environment Testing
Test Your Installation
Create a new notebook called 00_environment_test.ipynb:
```python
# Test basic Python packages
import sys
print(f"Python version: {sys.version}")

import pandas as pd
print(f"Pandas version: {pd.__version__}")

import numpy as np
print(f"NumPy version: {np.__version__}")

import matplotlib.pyplot as plt
print("Matplotlib imported successfully")

import seaborn as sns
print(f"Seaborn version: {sns.__version__}")

# Test bioinformatics packages
try:
    import Bio
    print(f"Biopython version: {Bio.__version__}")
except ImportError:
    print("Biopython not installed")

try:
    import pysam
    print(f"Pysam version: {pysam.__version__}")
except ImportError:
    print("Pysam not installed")

# Test external tools
import subprocess

tools = ['fastqc', 'multiqc', 'STAR', 'hisat2']
for tool in tools:
    try:
        result = subprocess.run([tool, '--version'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"{tool}: Available")
        else:
            print(f"{tool}: Not found")
    except FileNotFoundError:
        print(f"{tool}: Not found")
```
Expected Output
Python version: 3.11.x
Pandas version: 2.x.x
NumPy version: 1.x.x
Matplotlib imported successfully
Seaborn version: 0.x.x
Biopython version: 1.x.x
Pysam version: 0.x.x
fastqc: Available
multiqc: Available
STAR: Available
hisat2: Available
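A lighter-weight check that does not run the tools at all is to ask whether each executable is on the `PATH`. The sketch below uses `shutil.which` from the standard library; the helper name `find_tools` is illustrative.

```python
import shutil


def find_tools(tools):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools(["fastqc", "multiqc", "STAR", "hisat2"]).items():
    status = path if path else "Not found on PATH"
    print(f"{tool}: {status}")
```

If a tool shows a path outside your conda environment, a different installation is probably being picked up first, which is worth knowing before comparing versions.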
Best Practices for RNA-seq Analysis
1. Documentation and Reproducibility
- Document everything: Use markdown cells extensively
- Version control: We'll set up Git in a later session
- Environment tracking: Export conda environments
- Parameter logging: Record all analysis parameters
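Parameter logging can be as simple as writing a small JSON file next to each result. This is a minimal sketch under that assumption; the function name `log_parameters` and the fields it records are illustrative choices, not a fixed convention.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def log_parameters(params, out_path):
    """Write analysis parameters plus basic run metadata to a JSON file."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "parameters": params,
    }
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2))
    return record
```

For example, `log_parameters({"aligner": "STAR", "threads": 4}, "results/run_params.json")` would store those (hypothetical) settings together with a timestamp. The conda environment itself is captured separately with `conda env export > environment.yml`.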
2. Data Management
- Raw data is sacred: Never modify original files
- Intermediate files: Keep processing intermediates
- Backup important results: Multiple copies of key outputs
- File naming: Use consistent, descriptive names
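One way to confirm that raw files have never been modified is to record a checksum for each file when it arrives and compare against it later. The sketch below uses SHA-256 from the standard library; the helper names `sha256sum` and `write_manifest` and the manifest layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(raw_dir, manifest_path):
    """Record a checksum for every file in raw_dir, one 'checksum  filename' per line."""
    raw_dir = Path(raw_dir)
    lines = [f"{sha256sum(p)}  {p.name}" for p in sorted(raw_dir.iterdir()) if p.is_file()]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines
```

Store the manifest outside the raw data folder so that it is not included in its own listing on a later run. Reading in chunks keeps memory use low even for multi-gigabyte FASTQ files.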
3. Quality Control
- QC at every step: Raw data, processed data, results
- Visual inspection: Always plot your data
- Sanity checks: Verify results make biological sense
- Negative controls: Include where appropriate
4. Computational Efficiency
- Resource monitoring: Watch memory and CPU usage
- Parallel processing: Use multiple cores when possible
- Checkpointing: Save intermediate results
- Code optimization: Profile slow steps
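Checkpointing can be sketched as a small wrapper that reuses a saved result when one exists and only recomputes otherwise. The helper name `load_or_compute` is illustrative, and `pickle` is used here only for simplicity; it should be used solely on files you created yourself.

```python
import pickle
from pathlib import Path


def load_or_compute(checkpoint_path, compute):
    """Return the cached result if the checkpoint exists, otherwise compute and save it."""
    checkpoint_path = Path(checkpoint_path)
    if checkpoint_path.exists():
        with open(checkpoint_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
    with open(checkpoint_path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

A slow step is passed in as a function with no arguments, for example `load_or_compute("results/tables/step1.pkl", lambda: run_step1(data))`, where `run_step1` and `data` stand in for your own code. Delete the checkpoint file whenever the inputs or parameters change, since this sketch does not detect stale results.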
Troubleshooting Common Issues
Installation Problems
Conda conflicts:
```bash
# Clear conda cache
conda clean --all

# Update conda
conda update conda

# Create fresh environment
conda create -n rnaseq_new python=3.11
```
Package conflicts:
```bash
# Use mamba (faster conda alternative)
conda install -c conda-forge mamba
mamba install package_name
```
Jupyter Lab Issues
Lab won't start:
```bash
# Check if jupyter is installed
jupyter --version

# Try different port
jupyter lab --port=8889
```
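To pick a port that is actually available, you can test whether anything is already listening on it. The following is a minimal sketch using the standard `socket` module; the helper names `port_is_free` and `first_free_port` are illustrative, and the check only looks at the local machine.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use
        return sock.connect_ex((host, port)) != 0


def first_free_port(start=8888, attempts=20):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    return None
```

The value returned by `first_free_port()` can then be passed to `jupyter lab --port=...`. Another program could still claim the port between the check and the launch, so treat the result as a hint.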
Kernel problems:
```bash
# Install kernel for your environment
python -m ipykernel install --user --name rnaseq
```
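To confirm which Python environment a notebook kernel is really using, you can inspect the interpreter from inside a cell. This is a minimal sketch; `describe_interpreter` is an illustrative name, and the `CONDA_DEFAULT_ENV` variable is only set when an environment was activated with conda, so it may be missing.

```python
import os
import sys


def describe_interpreter():
    """Collect details that show which Python environment is running this code."""
    return {
        "executable": sys.executable,  # path to the running Python binary
        "prefix": sys.prefix,          # root folder of the environment
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # may be None
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

With a default conda setup, the executable path for the course environment would normally contain `rnaseq`. If it points elsewhere, the notebook is running on a different kernel than the one you installed the packages into.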
Memory Issues
Large datasets:
- Process data in chunks
- Use memory-efficient formats (HDF5, Parquet)
- Monitor memory usage with htop or Activity Monitor
- Consider using a high-memory machine or cluster
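As an illustration of processing data in pieces, the sketch below computes the total count per sample from a gene-by-sample table while holding only one row in memory at a time. The file layout (tab-separated, header row, gene ID in the first column, integer counts elsewhere) and the function name `library_sizes` are assumptions made for this example.

```python
import csv
from collections import Counter


def library_sizes(counts_path, delimiter="\t"):
    """Sum counts per sample by streaming a gene-by-sample table row by row.

    Assumes the first row is a header (gene ID column followed by sample names)
    and every other row is a gene ID followed by integer counts.
    """
    totals = Counter()
    with open(counts_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        samples = next(reader)[1:]  # skip the gene ID column name
        for row in reader:
            for sample, value in zip(samples, row[1:]):
                totals[sample] += int(value)
    return dict(totals)
```

The same idea applies in pandas, where `pd.read_csv(path, chunksize=...)` returns an iterator of smaller DataFrames instead of loading the whole file at once.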
Quick Reference
Essential Conda Commands
```bash
# Environment management
conda create -n envname python=3.11
conda activate envname
conda deactivate
conda env list

# Package management
conda install package_name
conda install -c channel package_name
conda update package_name
conda list

# Environment export/import
conda env export > environment.yml
conda env create -f environment.yml
```
Jupyter Lab Shortcuts
- Shift + Enter: Run cell and advance
- Ctrl + Enter: Run cell in place
- A: Insert cell above
- B: Insert cell below
- DD: Delete cell
- M: Markdown cell
- Y: Code cell
- Ctrl + S: Save notebook
Getting Help
- In Jupyter: help(function) or ?function
- Documentation: ??function for source code
- Package docs: Most have online documentation
- Bioinformatics: Biostars, SEQanswers
What's Next?
In our next session, we'll dive into RNA-seq fundamentals - understanding the biology, data formats, and quality metrics. Make sure you can successfully launch Jupyter Lab, create notebooks, and have all required packages installed before we continue.
Homework:
1. Complete the environment test notebook
2. Familiarize yourself with the Jupyter Lab interface
3. Review basic molecular biology concepts (transcription, gene expression)
4. Read about FASTQ file format if unfamiliar