Bioinformatics RNA-seq Analysis

01 - Welcome & Setup

Instructor and Agenda Introduction

Welcome to Bioinformatics Analysis of RNA-seq Data! This course is designed for researchers who want to master the computational analysis of RNA sequencing data. Whether you're studying gene expression changes, differential expression analysis, or pathway enrichment, you'll learn to transform raw sequencing data into meaningful biological insights using Python and specialized bioinformatics tools.

My name is Victor Gambarini. I have been working with Python for more than ten years. I earned my PhD at The University of Auckland, where I did extensive bioinformatics and programming work. One of my main research outputs is PlasticDB, an online database of microorganisms that can biodegrade plastics. PlasticDB was coded entirely in Python and has been running stably for many years. Throughout my research, I've analyzed many RNA-seq datasets, from microbial communities to human samples, and I'm excited to share these practical skills with you!

About This Course

Over the next five days, we'll build your RNA-seq analysis skills from raw data to publication-ready results. You'll learn not just the theory, but hands-on computational approaches that researchers use in real projects. This course emphasizes practical implementation - we'll work with actual RNA-seq datasets and build complete analysis pipelines. Practice is crucial in bioinformatics, so we'll build lots of computational muscle memory!

Course Structure

  1. Welcome & Setup - Local environment and Jupyter Lab setup
  2. RNA-seq Fundamentals - Understanding the biology and data
  3. Quality Control & Preprocessing - FastQC, trimming, and filtering
  4. Read Alignment - Mapping reads to reference genomes
  5. Count Matrix Generation - From alignments to gene counts
  6. Exploratory Data Analysis - PCA, clustering, and visualization
  7. Differential Expression Analysis - Statistical testing with DESeq2/edgeR
  8. Multiple Testing Correction - Controlling false discovery rates
  9. Functional Annotation - Gene ontology and pathway analysis
  10. Advanced Visualization - Heatmaps, volcano plots, and more
  11. Pipeline Integration - Workflow management and reproducibility
  12. Case Study Project - Complete analysis of a real dataset
  13. Additional Resources - Tools, databases, and best practices

What You'll Accomplish

By the end of this course, you'll be able to:

  • Process raw RNA-seq data from FASTQ to count matrices
  • Perform comprehensive quality control and preprocessing
  • Conduct statistical differential expression analysis
  • Create publication-quality visualizations
  • Interpret results in biological context
  • Build reproducible analysis pipelines
  • Apply best practices for bioinformatics workflows

Prerequisites

Basic Python knowledge is recommended (variables, lists, functions). Familiarity with the following is also helpful:

  • Command line basics
  • Basic statistics concepts
  • Molecular biology fundamentals (genes, transcription, etc.)

You'll need:

  • A computer with at least 8GB RAM (16GB recommended)
  • Admin privileges to install software
  • Stable internet connection for downloads


Why Use Local Jupyter Lab

What is Jupyter Lab?

Jupyter Lab is an interactive development environment that runs locally on your computer. It provides a web-based interface for creating notebooks, managing files, and running code. For bioinformatics, it combines interactive exploration with direct access to your machine's full computing resources.

Why Local Environment for RNA-seq Analysis

1. Computational Requirements

  • RNA-seq datasets are large (GBs to TBs)
  • Analysis requires substantial memory and processing power
  • Local control over computational resources
  • No session time limits or cloud restrictions

2. Data Security and Privacy

  • Sensitive genomic data stays on your machine
  • No upload limitations or privacy concerns
  • Full control over data access and sharing
  • Compliance with institutional data policies

3. Tool Ecosystem

  • Access to specialized bioinformatics software
  • Easy integration with command-line tools
  • Custom environment configuration
  • Installation of specific package versions

4. Long-Running Analyses

  • Alignment and processing can take hours/days
  • No session timeouts or disconnections
  • Background processing capabilities
  • Persistent data storage

5. Real-World Workflow

  • Mirrors professional bioinformatics environments
  • Better preparation for research computing
  • Integration with HPC clusters
  • Industry-standard practices

Jupyter Lab vs. Other Options

| Feature | Jupyter Lab | Google Colab | RStudio |
|---|---|---|---|
| Data Size Limits | None | 25GB | None |
| Runtime Limits | None | 12 hours | None |
| Custom Software | Full control | Limited | R-focused |
| Privacy | Complete | Cloud-based | Complete |
| Computational Power | Your hardware | Limited free tier | Your hardware |
| Bioinformatics Tools | Full ecosystem | Limited | R packages |

Installing Jupyter Lab and Dependencies

Step 1: Install Python and Conda

We'll use Anaconda or Miniconda for package management:

Option A: Anaconda (Recommended for beginners)

  1. Download from anaconda.com
  2. Run the installer for your operating system
  3. Follow installation prompts (accept defaults)
  4. Restart your terminal/command prompt

Option B: Miniconda (Lightweight alternative)

  1. Download from docs.conda.io/en/latest/miniconda.html
  2. Install following the same process as Anaconda

Step 2: Create a Bioinformatics Environment

Open your terminal/command prompt and create a dedicated environment:

```bash
# Create new environment
conda create -n rnaseq python=3.11

# Activate the environment
conda activate rnaseq

# Install Jupyter Lab
conda install jupyterlab

# Install essential packages
conda install pandas numpy matplotlib seaborn scipy
# Bioconda packages depend on conda-forge, so list both channels
conda install -c conda-forge -c bioconda biopython pysam
conda install -c conda-forge scanpy
```

Step 3: Install Additional Bioinformatics Tools

```bash
# Quality control tools
conda install -c conda-forge -c bioconda fastqc multiqc

# Alignment tools
conda install -c conda-forge -c bioconda star hisat2 bowtie2

# Count and analysis tools
conda install -c conda-forge -c bioconda subread htseq
conda install -c conda-forge -c bioconda samtools bedtools

# R integration (for DESeq2)
conda install r-base r-essentials
conda install -c conda-forge -c bioconda bioconductor-deseq2
```

Step 4: Launch Jupyter Lab

```bash
# Make sure your environment is activated
conda activate rnaseq

# Launch Jupyter Lab
jupyter lab
```

Your default web browser should open the Jupyter Lab interface at http://localhost:8888.


Jupyter Lab Interface Tour

Main Interface Components

1. File Browser (Left Sidebar)

  • Purpose: Navigate your file system
  • Features: Create folders, upload files, rename items
  • Tip: Right-click for context menu options

2. Launcher Tab

  • Purpose: Create new notebooks, terminals, and files
  • Options:
      • Python 3 notebook
      • Terminal
      • Text file
      • Markdown file

3. Notebook Interface

  • Code Cells: Execute Python code
  • Markdown Cells: Add documentation and explanations
  • Output: Results appear below cells

Creating Your First Notebook

  1. Click "Python 3" under "Notebook" in the Launcher
  2. A new notebook opens with an empty code cell
  3. Rename by right-clicking the tab and selecting "Rename"

Essential Jupyter Lab Features

Cell Types

  • Code Cell: Run Python code (default)
  • Markdown Cell: Format text with Markdown
  • Raw Cell: Plain text (rarely used)

Keyboard Shortcuts

  • Shift + Enter: Run cell and move to next
  • Ctrl + Enter: Run cell and stay
  • A: Insert cell above (command mode; press Esc first)
  • B: Insert cell below (command mode)
  • DD: Delete cell (command mode)
  • M: Change to Markdown cell (command mode)
  • Y: Change to Code cell (command mode)

Magic Commands

    # Time a single execution
    %time code_here

    # Time multiple runs
    %timeit code_here

    # Run shell commands
    !ls -la

    # Load external Python files
    %load filename.py
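
The `%time` and `%timeit` magics only work inside IPython and Jupyter. In a standalone Python script, the standard library's `timeit` module gives comparable measurements. The sketch below is illustrative only: `gc_content` and `test_sequence` are made-up names, not part of the course material.

```python
import timeit


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


# A made-up 4,000-base sequence, used only to have something to time
test_sequence = "ATGC" * 1000

# Call the function 1,000 times and report the average time per call
total_seconds = timeit.timeit(lambda: gc_content(test_sequence), number=1000)
print(f"GC content: {gc_content(test_sequence):.2f}")
print(f"Average time per call: {total_seconds / 1000 * 1e6:.1f} microseconds")
```

Timings vary between machines and runs, so treat the numbers as relative rather than absolute.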


Setting Up Project Structure

Creating a Bioinformatics Project

Let's create a well-organized project structure:

  1. Create project folder in Jupyter Lab file browser
  2. Right-click in the file browser and select "New Folder"
  3. Name it: RNA_seq_analysis_2025

Recommended Directory Structure

RNA_seq_analysis_2025/
├── data/
│   ├── raw/            # Original FASTQ files
│   ├── processed/      # Quality-controlled data
│   ├── aligned/        # Alignment results
│   └── counts/         # Count matrices
├── results/
│   ├── qc/             # Quality control reports
│   ├── figures/        # Generated plots
│   └── tables/         # Result tables
├── scripts/
│   ├── notebooks/      # Analysis notebooks
│   └── python/         # Standalone Python scripts
├── references/
│   ├── genome/         # Reference genome files
│   └── annotation/     # Gene annotation files
└── README.md           # Project documentation

Create the Structure

Open a new terminal in Jupyter Lab (File > New > Terminal) and run:

```bash
cd RNA_seq_analysis_2025

# Create directories
mkdir -p data/{raw,processed,aligned,counts}
mkdir -p results/{qc,figures,tables}
mkdir -p scripts/{notebooks,python}
mkdir -p references/{genome,annotation}

# Create README file
touch README.md
```
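
Brace expansion such as `data/{raw,processed}` is a bash feature and is not available in the Windows Command Prompt. If the shell commands do not work on your system, the same layout can be created from a notebook cell with `pathlib`. This is a minimal sketch; `create_project` and `SUBDIRECTORIES` are names chosen here for illustration.

```python
from pathlib import Path

# Sub-directories from the recommended project layout
SUBDIRECTORIES = [
    "data/raw", "data/processed", "data/aligned", "data/counts",
    "results/qc", "results/figures", "results/tables",
    "scripts/notebooks", "scripts/python",
    "references/genome", "references/annotation",
]


def create_project(root):
    """Create the project folders and an empty README.md under root."""
    root = Path(root)
    for subdirectory in SUBDIRECTORIES:
        # parents=True creates missing parents; exist_ok=True makes re-runs safe
        (root / subdirectory).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch(exist_ok=True)
    return root
```

Calling `create_project("RNA_seq_analysis_2025")` from the folder that should contain the project creates the whole tree, and running it again leaves existing files untouched.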


Download Practice Dataset

We'll work with a subset of a real RNA-seq dataset throughout the course.

Step 1: Create Download Script

Create a new notebook called 01_data_download.ipynb in the scripts/notebooks/ folder:


    import os
    import urllib.request

    # Create data directories if they don't exist
    os.makedirs('../../data/raw', exist_ok=True)

    # Download sample FASTQ files (small subset for practice)
    # base_url is a placeholder; real links will be provided in class
    base_url = "https://example-rnaseq-data.com/samples/"
    samples = [
        "sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
        "sample2_R1.fastq.gz", "sample2_R2.fastq.gz",
    ]

    for sample in samples:
        url = base_url + sample
        filepath = f"../../data/raw/{sample}"
        print(f"Downloading {sample}...")
        # Uncomment the next line once base_url points at the real data
        # urllib.request.urlretrieve(url, filepath)
        print(f"Saved to {filepath}")

    print("Download complete!")
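
Large downloads are often interrupted or repeated by accident. One possible refinement, sketched here under the assumption that a plain `urllib` download is sufficient, is a helper that skips files already on disk and only moves a file into place once the transfer has finished. The name `download_if_missing` is hypothetical.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url, destination):
    """Download url to destination unless a non-empty file is already there."""
    destination = Path(destination)
    if destination.exists() and destination.stat().st_size > 0:
        print(f"Skipping {destination.name}: already downloaded")
        return destination
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Download under a temporary name first so an interrupted transfer
    # never leaves a truncated file under the final name
    partial = destination.with_name(destination.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(destination)
    print(f"Saved {destination}")
    return destination
```

Inside a download loop, this would replace the direct `urlretrieve` call, for example `download_if_missing(url, filepath)`.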

Step 2: Download Reference Files

    # Download reference genome and annotation
    # ref_url is a placeholder; real links will be provided in class
    ref_url = "https://example-references.com/"
    references = {
        "genome.fa": "references/genome/",
        "genes.gtf": "references/annotation/",
    }

    for filename, folder in references.items():
        os.makedirs(f"../../{folder}", exist_ok=True)
        # Download code here
        print(f"Downloaded {filename} to {folder}")

Note: We'll use publicly available test datasets. Real download links will be provided in class.
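
Once real FASTQ files are in place, a quick structural check can catch truncated downloads before any analysis starts. The sketch below assumes gzip-compressed files with the common four-lines-per-record layout (no wrapped sequences); `count_fastq_records` is a hypothetical helper, not a function from any library.

```python
import gzip
from itertools import islice


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file, checking basic structure."""
    records = 0
    with gzip.open(path, "rt") as handle:
        while True:
            # Each record is four lines: header, sequence, '+', qualities
            block = list(islice(handle, 4))
            if not block:
                break
            if len(block) < 4:
                raise ValueError(f"Truncated record after {records} reads")
            header, sequence, separator, quality = (
                line.rstrip("\r\n") for line in block
            )
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"Malformed record number {records + 1}")
            if len(sequence) != len(quality):
                raise ValueError(f"Length mismatch in record {records + 1}")
            records += 1
    return records
```

For paired-end data, the R1 and R2 files of a sample should report the same number of records; a difference usually points to an incomplete download.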


Environment Testing

Test Your Installation

Create a new notebook called 00_environment_test.ipynb:

    # Test basic Python packages
    import sys
    print(f"Python version: {sys.version}")

    import pandas as pd
    print(f"Pandas version: {pd.__version__}")

    import numpy as np
    print(f"NumPy version: {np.__version__}")

    import matplotlib.pyplot as plt
    print("Matplotlib imported successfully")

    import seaborn as sns
    print(f"Seaborn version: {sns.__version__}")

    # Test bioinformatics packages
    try:
        import Bio
        print(f"Biopython version: {Bio.__version__}")
    except ImportError:
        print("Biopython not installed")

    try:
        import pysam
        print(f"Pysam version: {pysam.__version__}")
    except ImportError:
        print("Pysam not installed")

    # Test external tools
    import subprocess

    tools = ['fastqc', 'multiqc', 'STAR', 'hisat2']
    for tool in tools:
        try:
            result = subprocess.run([tool, '--version'], capture_output=True, text=True)
            if result.returncode == 0:
                print(f"{tool}: Available")
            else:
                print(f"{tool}: Not found")
        except FileNotFoundError:
            print(f"{tool}: Not found")

Expected Output

    Python version: 3.11.x
    Pandas version: 2.x.x
    NumPy version: 1.x.x
    Matplotlib imported successfully
    Seaborn version: 0.x.x
    Biopython version: 1.x.x
    Pysam version: 0.x.x
    fastqc: Available
    multiqc: Available
    STAR: Available
    hisat2: Available


Best Practices for RNA-seq Analysis

1. Documentation and Reproducibility

  • Document everything: Use markdown cells extensively
  • Version control: We'll set up Git in a later session
  • Environment tracking: Export conda environments
  • Parameter logging: Record all analysis parameters
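
Environment tracking and parameter logging can be combined in a small record saved next to each result. The following is a sketch under assumptions: the function name `describe_run`, the package list, and the example parameters are all illustrative, and the package versions are looked up with the standard library's `importlib.metadata`.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def describe_run(parameters, packages=("pandas", "numpy", "scipy")):
    """Collect analysis parameters and installed package versions in one dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "parameters": parameters,
    }


# Hypothetical parameters, shown only to illustrate the structure
run_record = describe_run({"min_read_length": 36, "aligner": "STAR"})
print(json.dumps(run_record, indent=2))
```

Writing the JSON text to a file in the results folder keeps the settings and the outputs they produced together.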

2. Data Management

  • Raw data is sacred: Never modify original files
  • Intermediate files: Keep processing intermediates
  • Backup important results: Multiple copies of key outputs
  • File naming: Use consistent, descriptive names
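
One way to make "never modify original files" verifiable is to record a checksum for every raw file as soon as it arrives. This is a minimal sketch using `hashlib`; the helper names and the manifest file name `checksums.sha256` are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(raw_directory, manifest_name="checksums.sha256"):
    """Write one 'digest  filename' line per file in raw_directory."""
    raw_directory = Path(raw_directory)
    manifest = raw_directory / manifest_name
    lines = [
        f"{sha256_of_file(path)}  {path.name}"
        for path in sorted(raw_directory.iterdir())
        if path.is_file() and path.name != manifest_name
    ]
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Recomputing the digests later and comparing them with the manifest shows whether any raw file has changed since it was downloaded.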

3. Quality Control

  • QC at every step: Raw data, processed data, results
  • Visual inspection: Always plot your data
  • Sanity checks: Verify results make biological sense
  • Negative controls: Include where appropriate

4. Computational Efficiency

  • Resource monitoring: Watch memory and CPU usage
  • Parallel processing: Use multiple cores when possible
  • Checkpointing: Save intermediate results
  • Code optimization: Profile slow steps
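
Checkpointing can be as simple as skipping a step whose output file already exists. The sketch below assumes each step returns text that fits in memory; `run_with_checkpoint` is a hypothetical helper, and dedicated workflow managers handle this far more thoroughly.

```python
from pathlib import Path


def run_with_checkpoint(output_path, step, *args, **kwargs):
    """Run step(*args, **kwargs) only if output_path does not exist yet.

    The step is expected to return the text to store in output_path.
    """
    output_path = Path(output_path)
    if output_path.exists():
        print(f"Checkpoint found, skipping: {output_path}")
        return output_path.read_text()
    result = step(*args, **kwargs)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so a crash cannot leave
    # a half-written checkpoint behind
    temporary = output_path.with_name(output_path.name + ".tmp")
    temporary.write_text(result)
    temporary.replace(output_path)
    return result
```

Deleting the output file is then the way to force a step to run again, for example after changing its parameters.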

Troubleshooting Common Issues

Installation Problems

Conda conflicts:

```bash
# Clear conda cache
conda clean --all

# Update conda
conda update conda

# Create fresh environment
conda create -n rnaseq_new python=3.11
```

Package conflicts:

```bash
# Use mamba (faster conda alternative), installed from conda-forge into the base environment
conda install -n base -c conda-forge mamba
mamba install package_name
```

Jupyter Lab Issues

Lab won't start:

```bash
# Check if jupyter is installed
jupyter --version

# Try different port
jupyter lab --port=8889
```
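
To see which ports are already taken before picking one, a short Python check can probe the local machine. This is a rough sketch: `port_is_free` is a hypothetical helper, and it only tests whether something accepts connections on `127.0.0.1`.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 when a connection succeeds, i.e. the port is taken
        return probe.connect_ex((host, port)) != 0


for candidate in range(8888, 8893):
    status = "free" if port_is_free(candidate) else "in use"
    print(f"Port {candidate}: {status}")
```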

Kernel problems:

```bash
# Install kernel for your environment
python -m ipykernel install --user --name rnaseq
```
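
A frequent cause of "package not found" errors is a notebook kernel running a different Python than the one in the `rnaseq` environment. Running the cell below inside a notebook shows which interpreter is in use. It is a small diagnostic sketch; `describe_kernel` is a name chosen for this example.

```python
import os
import sys


def describe_kernel():
    """Report which Python interpreter and conda environment are active."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # CONDA_DEFAULT_ENV is set by 'conda activate'; it may be absent
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "not set"),
    }


for key, value in describe_kernel().items():
    print(f"{key}: {value}")
```

The `executable` path is the most reliable indicator: it should point inside the `rnaseq` environment folder. The `conda_env` value reflects the shell that launched Jupyter, which is not necessarily the environment the kernel itself uses.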

Memory Issues

Large datasets:

  • Process data in chunks
  • Use memory-efficient formats (HDF5, Parquet)
  • Monitor memory usage with htop or Activity Monitor
  • Consider using a high-memory machine or cluster
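
As an illustration of processing data in chunks, the sketch below sums counts per sample while reading a count matrix one row at a time, so memory use stays flat regardless of file size. It assumes a tab-separated file whose first column holds gene IDs and whose remaining columns hold integer counts; `library_sizes` is a hypothetical helper.

```python
import csv


def library_sizes(count_file, delimiter="\t"):
    """Sum counts per sample while reading the matrix one row at a time."""
    with open(count_file, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        samples = header[1:]  # first column is assumed to hold gene IDs
        totals = [0] * len(samples)
        for row in reader:
            for index, value in enumerate(row[1:]):
                totals[index] += int(value)
    return dict(zip(samples, totals))
```

With pandas, the `chunksize` argument of `read_csv` offers a similar row-batch approach while keeping DataFrame operations available.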


Quick Reference

Essential Conda Commands

```bash
# Environment management
conda create -n envname python=3.11
conda activate envname
conda deactivate
conda env list

# Package management
conda install package_name
conda install -c channel package_name
conda update package_name
conda list

# Environment export/import
conda env export > environment.yml
conda env create -f environment.yml
```

Jupyter Lab Shortcuts

  • Shift + Enter: Run cell and advance
  • Ctrl + Enter: Run cell in place
  • A: Insert cell above
  • B: Insert cell below
  • DD: Delete cell
  • M: Markdown cell
  • Y: Code cell
  • Ctrl + S: Save notebook

Getting Help

  • In Jupyter: help(function) or ?function
  • Source code: ??function shows the source code when it is available
  • Package docs: Most have online documentation
  • Bioinformatics: Biostars, SEQanswers

What's Next?

In our next session, we'll dive into RNA-seq fundamentals - understanding the biology, data formats, and quality metrics. Make sure you can successfully launch Jupyter Lab, create notebooks, and have all required packages installed before we continue.

Homework:

  1. Complete the environment test notebook
  2. Familiarize yourself with the Jupyter Lab interface
  3. Review basic molecular biology concepts (transcription, gene expression)
  4. Read about the FASTQ file format if it is unfamiliar
