Tutorial for CCC-RISE: Cell-Cell Communication Analysis
========================================================

This tutorial demonstrates the complete CCC-RISE workflow for analyzing cell-cell communication in single-cell RNA-seq data across experimental conditions.

Installation
------------

To add CCC-RISE to your Python package, add the following line to your ``requirements.txt`` and remake your virtual environment::

    git+https://github.com/meyer-lab/cellcommunication-Pf2.git@main

Preprocessing the Dataset
--------------------------

Input Requirements
^^^^^^^^^^^^^^^^^^

Your AnnData object must meet the following requirements:

1. **Condition Index**: Include an observations column ``condition_unique_idxs`` that is a 0-indexed array indicating which condition each cell is derived from, along with the cell barcode. Condition 1 cells are indexed as 0, Condition 2 as 1, and so on.

2. **Preprocessing**: Your AnnData object must be preprocessed (doublets removed, genes filtered, normalized, and log-transformed) before running CCC-RISE. Standard preprocessing functions can assist with gene filtering, normalization, and assigning ``condition_unique_idxs``.

Ligand-Receptor Pairs
^^^^^^^^^^^^^^^^^^^^^

CCC-RISE requires a DataFrame of ligand-receptor pairs to analyze cell-cell communication. This can be obtained from resources like CellChat or other ligand-receptor databases.

**Required DataFrame Format**:

- Must contain columns named ``ligand`` and ``receptor``
- Gene names should match those in your AnnData object (typically uppercase)
- For protein complexes, subunits should be separated by ``&`` (e.g., ``CD74&CD44``)
- Example structure::

    ligand     receptor
    CCL19      CCR7
    PTN        PTPRZ1
    CD74&CD44  CD44

The package includes a default function ``import_ligand_receptor_pairs()`` that loads a curated database from CellChat, but you can provide your own DataFrame following this format.

Import and Prepare the Dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``prepare_dataset`` function assists with preprocessing your data. Parameters:

- ``X``: AnnData object containing raw count data in sparse matrix format
- ``condition_name``: Name of the column in ``X.obs`` that specifies experimental conditions for each cell
- ``geneThreshold``: Minimum mean expression threshold for gene filtering (genes with mean expression below this value are removed)
- ``deviance``: If True, applies deviance transformation instead of log normalization (default: False)

The function performs the following steps:

- Filters cells with fewer than 10 total counts
- Filters genes based on the ``geneThreshold`` parameter
- Normalizes total counts per cell to the median
- Scales gene expression values
- Applies log₁₀((1000 × normalized_value) + 1) transformation (or deviance transformation if specified)
- Creates ``condition_unique_idxs`` column in ``X.obs`` with 0-indexed condition assignments
- Pre-calculates gene means and stores in ``X.var["means"]``

**Example Code**:

Import your dataset as an AnnData object with preprocessed data::

    from cellcommunicationpf2.import_data import prepare_dataset, import_ligand_receptor_pairs
    import anndata
    
    # Load your data
    X = anndata.read_h5ad("your_data.h5ad")
    
    # Prepare the dataset
    X = prepare_dataset(X, condition_name="condition", geneThreshold=0.01)
    
    # Load ligand-receptor pairs
    lr_pairs = import_ligand_receptor_pairs()

Choosing the Rank
------------------

CCC-RISE involves two sequential factorization steps, each requiring rank selection:

1. **RISE Rank Selection**: Determines how many components to extract from the raw scRNA-seq data
2. **CPD Rank Selection**: Determines how many components to use when factorizing the interaction tensor

Both steps require careful evaluation using two complementary metrics:

- **R²X (Variance Explained)**: Measures the proportion of variance in the data explained by the model. Higher values indicate better model fit, but values plateau as rank increases, suggesting diminishing returns.
- **FMS (Factor Match Score)**: Measures the stability/reproducibility of components across different random initializations or bootstrap samples. Values above ~0.6 indicate stable, reliable components. Low FMS suggests overfitting or noise.

RISE Rank Selection
^^^^^^^^^^^^^^^^^^^

**What is RISE?**

RISE (Reduction and Insight in Single-cell Exploration) is an unsupervised, tensor-based computational method designed for the integrative analysis of single-cell RNA sequencing (scRNA-seq) data across multiple experimental conditions, such as drug treatments, patient cohorts, or time points. Built upon the PARAFAC2 tensor decomposition framework, RISE preserves the inherent three-dimensional structure of multi-condition single-cell data—conditions × cells × genes—instead of flattening it into a conventional two-dimensional matrix. The RISE rank determines how many latent patterns are extracted before computing cell-cell communication scores.

**Evaluate RISE Ranks**

Test different RISE ranks to find the optimal balance between model complexity and stability::

    from cellcommunicationpf2.figures.commonFuncs.plotGeneral import plot_fms_r2x_diff_ranks
    import matplotlib.pyplot as plt
    
    # Create figure with two subplots
    fig, ax = plt.subplots(1, 2, figsize=(12, 4))
    
    rise_ranks = list(range(5, 41, 5))
    
    # Plot FMS and R2X for different RISE ranks
    plot_fms_r2x_diff_ranks(
        X, 
        condition_name="condition_unique_idxs",
        ax1=ax[0], 
        ax2=ax[1], 
        ranksList=rise_ranks, 
        runs=e
    )
    
    ax[0].set_title('RISE: Factor Match Score (FMS)')
    ax[0].set_xlabel('RISE Rank')
    ax[1].set_title('RISE: Variance Explained (R²X)')
    ax[1].set_xlabel('RISE Rank')
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step1_rise_fms_r2x.png
   :width: 700px
   :align: center

**Interpretation**:

- **R²X**: Look for where the curve begins to plateau. Beyond this point, additional components explain minimal additional variance.
- **FMS**: Choose a rank where FMS is above 0.6, indicating stable components. Higher FMS means the factors are more reproducible.
- **Recommendation**: Balance both metrics. Typically, select the rank where FMS exceeds 0.6 and R²X shows diminishing returns (e.g., RISE rank = 20-40 for most datasets).

CPD Rank Selection
^^^^^^^^^^^^^^^^^^

**What is CPD?**

After RISE, an interaction tensor is computed from cell-cell communication scores between all cell pairs across L-R pairs and conditions. CPD (Canonical Polyadic Decomposition) factorizes this tensor to identify interpretable communication patterns. The CPD rank determines how many communication components are extracted.

**Evaluate CPD Ranks**

First, calculate the interaction tensor using a selected RISE rank, then test different CPD ranks::

    from cellcommunicationpf2.tensor import calculate_interaction_tensor, run_fms_r2x_analysis
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Calculate interaction tensor with chosen RISE rank
    rise_rank = 35  # Based on RISE analysis above
    interaction_tensor = calculate_interaction_tensor(X, lr_pairs, rise_rank=rise_rank)
    
    # Test CPD ranks
    rank_list = list(range(1, 11, 2))
    df = run_fms_r2x_analysis(
        interaction_tensor, 
        rank_list=rank_list, 
        runs=3,
        svd_init="random"
    )
    
    # Plot CPD FMS and R2X
    fig, ax = plt.subplots(1, 2, figsize=(12, 4))
    
    sns.lineplot(data=df, x="Component", y="FMS", ax=ax[0], label="FMS")
    ax[0].set_ylim(0, 1)
    ax[0].set_xlabel('CPD Rank')
    
    sns.lineplot(data=df, x="Component", y="R2X", ax=ax[1], color="orange", label="R2X")
    ax[1].set_ylim(0, np.max(df["R2X"]) + 0.02)
    ax[1].set_xlabel('CPD Rank')
    
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step2_cpd_fms_r2x.png
   :width: 700px
   :align: center

**Interpretation**:

- **R²X**: Monitor how much of the interaction tensor's variance is explained. Look for an elbow point where additional components provide minimal improvement.
- **FMS**: Ensure components are stable (FMS > 0.6). Lower FMS indicates components may vary significantly with different initializations.
- **Recommendation**: Select the CPD rank where both FMS is high (>0.6) and R²X shows an elbow. Typically 5-20 components capture most meaningful communication patterns.


Running the Factorization
--------------------------

**Perform CCC-RISE Factorization**

Based on the rank selection analysis above, choose appropriate RISE and CPD ranks and run the complete CCC-RISE workflow. This performs the two-step factorization: first extracting latent communication modes from expression data (RISE), then factorizing the resulting interaction tensor (CPD)::

    from cellcommunicationpf2.tensor import run_ccc_rise_workflow
    
    # Select ranks based on FMS/R2X analysis
    rise_rank = 35  # Selected from RISE rank analysis
    cp_rank = 8     # Selected from CPD rank analysis
    
    # Run the complete workflow
    X, r2x = run_ccc_rise_workflow(
        adata=X,
        rise_rank=rise_rank,
        cp_rank=cp_rank,
        lr_pairs=lr_pairs,
        condition_column="condition_column",
        n_iter_max=10000,
        tol=1e-9,
        random_state=42,
        complex_sep="&",
        doEmbedding=True
    )
    
    print(f"Variance Explained (R²X): {r2x:.4f}")

**Function Parameters**

The ``run_ccc_rise_workflow`` function executes the complete CCC-RISE pipeline:

- ``adata``: AnnData object with preprocessed scRNA-seq data
- ``rise_rank``: Number of PARAFAC2 components to extract from expression data
- ``cp_rank``: Number of CPD components for factorizing the interaction tensor (if None, defaults to ``rise_rank``)
- ``lr_pairs``: DataFrame of ligand-receptor pairs with 'ligand' and 'receptor' columns
- ``condition_column``: Column name in ``adata.obs`` containing condition identifiers (default: "sample")
- ``n_iter_max``: Maximum iterations for decomposition (default: 100)
- ``tol``: Convergence tolerance (default: 1e-3)
- ``random_state``: Random seed for reproducibility (default: None)
- ``complex_sep``: Separator for protein complexes in L-R pairs (default: None)
- ``doEmbedding``: If True, computes PaCMAP embeddings for visualization (default: True)
- ``svd_init``: Initialization method for CPD ('svd' or 'random', default: 'svd')

**What Gets Stored**

The function returns the AnnData object with added results and the final R²X value. The following are stored in the AnnData object:

**Factor Matrices** (in ``X.uns``):

- ``X.uns["weights"]`` - Component weights (array of length ``cp_rank``)
- ``X.uns["A"]`` - Condition factor matrix (n_conditions × ``cp_rank``)
- ``X.uns["B"]`` - Sender cell type factor matrix (``rise_rank`` × ``cp_rank``)
- ``X.uns["C"]`` - Receiver cell type factor matrix (``rise_rank`` × ``cp_rank``)
- ``X.uns["D"]`` - Ligand-receptor pair factor matrix (n_lr_pairs × ``cp_rank``)

**Cell Projections** (in ``X.obsm``):

- ``X.obsm["projections"]`` - Cell projections from PARAFAC2 (n_cells × ``rise_rank``)
- ``X.obsm["sc_B"]`` - Sender cell loadings (n_cells × ``cp_rank``)
- ``X.obsm["rc_C"]`` - Receiver cell loadings (n_cells × ``cp_rank``)
- ``X.obsm["PaCMAP"]`` - PaCMAP embedding for visualization (n_cells × 2, if ``doEmbedding=True``)

**Additional Information**:

- ``X.uns["lr_pairs"]`` - Array of ligand-receptor pair names used in the analysis
- ``X.uns["r2x"]`` - Variance explained by the CPD factorization

Visualizing the Factors
------------------------

CCC-RISE decomposes cell-cell communication into four factor matrices, each revealing different aspects of the communication patterns. The package provides specialized plotting functions to visualize each factor type.

**Overview of the Four Factors**

The factorization produces four interpretable factor matrices:

1. **Factor A (Condition Factor)**: Shows how each experimental condition (e.g., patient sample, time point) contributes to each component
2. **Factor B (Sender Cell Eigenstate)**: Represents sender cell patterns in the latent RISE space
3. **Factor C (Receiver Cell Eigenstate)**: Represents receiver cell patterns in the latent RISE space
4. **Factor D (Ligand-Receptor Pairs)**: Shows which L-R pairs drive each communication component

**Visualize All Four Factors**

Create visualizations for each of the four factor types::

    from cellcommunicationpf2.figures.commonFuncs.plotFactors import (
        plot_condition_factors,
        plot_eigenstate_factors,
        plot_lr_factors
    )
    import matplotlib.pyplot as plt
    
    # Prepare condition grouping (if you have multiple conditions per group)
    condition_column = "sample"  # or your condition column name
    group_col = "condition"      # optional: higher-level grouping
    
    # Create mapping from samples to their condition groups
    sample_to_group = X.obs.drop_duplicates(
        subset=[condition_column, group_col]
    ).set_index(condition_column)[group_col]
    
    # Factor A: Condition factors
    fig, ax = plt.subplots(figsize=(6, 8))
    plot_condition_factors(
        X,
        ax,
        cond=condition_column,
        cond_group_labels=sample_to_group,
        group_cond=True,   # Sort by condition groups
        normalize=True     # Normalize each component
    )
    ax.set_title("Factor A: Condition")
    plt.show()
    
    # Factor B: Sender cell eigenstates
    fig, ax = plt.subplots(figsize=(6, 8))
    plot_eigenstate_factors(X, ax, factor_type="B")
    ax.set_title("Factor B: Sender Cell Eigenstate")
    plt.tight_layout()
    plt.show()
    
    # Factor C: Receiver cell eigenstates
    fig, ax = plt.subplots(figsize=(6, 8))
    plot_eigenstate_factors(X, ax, factor_type="C")
    ax.set_title("Factor C: Receiver Cell Eigenstate")
    plt.tight_layout()
    plt.show()
    
    # Factor D: Ligand-receptor pairs
    fig, ax = plt.subplots(figsize=(6, 8))
    plot_lr_factors(
        X, 
        ax, 
        trim=True,      # Show only top L-R pairs
        weight=0.08     # Threshold for inclusion
    )
    ax.set_title("Factor D: LR Pairs")
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step3_factor_a.png
   :width: 500px
   :align: center

.. image:: _static/tutorial_images/step3_factor_b.png
   :width: 500px
   :align: center

.. image:: _static/tutorial_images/step3_factor_c.png
   :width: 500px
   :align: center

.. image:: _static/tutorial_images/step3_factor_d.png
   :width: 500px
   :align: center

**Understanding Each Factor**

- **Factor A (Condition)**: Heatmap rows represent conditions/samples, columns represent components. High values indicate which conditions are enriched in each communication pattern.

- **Factor B (Sender Eigenstate)**: Heatmap shows sender cell patterns in the latent space. Each row is a latent dimension from RISE, each column is a CPD component.

- **Factor C (Receiver Eigenstate)**: Similar to Factor B but for receiver cells. Reveals which latent receiver cell states participate in each component.

- **Factor D (LR Pairs)**: Shows which ligand-receptor interactions drive each component. Only top-weighted pairs are displayed for clarity.

**Visualize Cell Projections with PaCMAP**

Explore how individual cells are positioned in the latent communication space::

    from cellcommunicationpf2.figures.commonFuncs.plotPaCMAP import plot_labels_pacmap
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Create figure with 3 subplots
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # Plot colored by cell type
    pal = sns.color_palette(palette="Set3")
    pal = pal.as_hex()
    plot_labels_pacmap(X, labelType="celltype", ax=axes[0], color_key=pal)
    axes[0].set_title("Cells Colored by Cell Type")
    
    # Plot colored by condition
    pal = sns.color_palette("Set2")
    pal = [pal[0], pal[1], pal[2]]
    pal = [f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}" for r, g, b in pal]
    plot_labels_pacmap(X, labelType="condition", ax=axes[1], color_key=pal)
    axes[1].set_title("Cells Colored by Condition")
    
    # Plot colored by sample
    plot_labels_pacmap(X, labelType="sample", ax=axes[2])
    axes[2].set_title("Cells Colored by Sample")
    
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step5_pacmap_embedding.png
   :width: 800px
   :align: center

Interpreting a Component
-------------------------

Once you've identified components of interest, dig deeper into the specific cell types and ligand-receptor pairs driving the communication patterns.

**Visualize Specific Component Loadings**

Use specialized plotting functions to visualize component weightings for a specific component::

    from cellcommunicationpf2.figures.commonFuncs.plotPaCMAP import plot_wc_pacmap
    import matplotlib.pyplot as plt
    
    # Select component to visualize
    component = 6
    
    # Create figure with 2 subplots
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    
    # Plot sender cell weightings
    plot_wc_pacmap(X, component-1, ax[0], factor_matrix="B", cbarMax=0.3)
    ax[0].set_title(f"Cmp.{component} Sender Cells")
    
    # Plot receiver cell weightings
    plot_wc_pacmap(X, component-1, ax[1], factor_matrix="C", cbarMax=0.3)
    ax[1].set_title(f"Cmp.{component} Receiver Cells")
    
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step7_all_component_weightings.png
   :width: 800px
   :align: center


**Violin Plots of Component Weights by Cell Type**

Examine the distribution of component weights for specific cell types::

    import seaborn as sns
    import pandas as pd
    
    # Select a component and cell type of interest
    component = 6 
    cell_type = "mDC"
    
    # Filter cells by cell type
    X_celltype = X[X.obs["celltype"] == cell_type]
    
    # Extract sender cell weights for this component
    sender_weights = X_celltype.obsm["sc_B"][:, component-1]
    
    # Create violin plot
    fig, ax = plt.subplots(figsize=(6, 6))
    sns.violinplot(data=sender_weights, ax=ax)
    ax.set_ylim(-0.1, max(sender_weights) + 0.1)
    ax.set_xlabel(f"{cell_type} Weight Distribution")
    ax.set_ylabel(f"Sender Cell Component {component} Association")
    ax.set_title(f"Distribution of Component {component} Weights in {cell_type}")
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step9_violin_weights.png
   :width: 500px
   :align: center

This visualization reveals whether a component is broadly active across all cells of a type or concentrated in a subset.


**Analyze Top LR Pairs per Component**

Visualize the most important ligand-receptor pairs for a specific component::

    from cellcommunicationpf2.figures.commonFuncs.plotFactors import plot_lr_factors_partial
    import matplotlib.pyplot as plt
    
    cmp = 6
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    
    # Plot top 10 LR pairs
    plot_lr_factors_partial(X, cmp, ax[0], geneAmount=10, top=True)
    ax[0].set_title(f"Component {cmp}: Top 10 LR Pairs")
    
    # Plot bottom 10 LR pairs
    plot_lr_factors_partial(X, cmp, ax[1], geneAmount=10, top=False)
    ax[1].set_title(f"Component {cmp}: Bottom 10 LR Pairs")
    
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step7_top_lr_pairs.png
   :width: 800px
   :align: center


**Expression Product Heatmaps**

Visualize the ligand-receptor expression products between sender and receiver cell populations to validate predicted interactions::

    from cellcommunicationpf2.utils import (
        expression_product_matrix,
        average_product_matrix_ccc
    )
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Select component and L-R pair of interest
    ccc_rise_cmp = 6
    ligand = "CCL19"
    receptor = "CCR7"
    
    # Filter and sort sender cells by component weight
    X_mdc_sender = X[X.obs["celltype"] == "mDC"]
    X_mdc_sender = X_mdc_sender[np.argsort(-X_mdc_sender.obsm["sc_B"][:, ccc_rise_cmp-1])]
    
    # Filter and sort receiver cells by component weight
    X_mdc_receiver = X[X.obs["celltype"] == "mDC"]
    X_mdc_receiver = X_mdc_receiver[np.argsort(X_mdc_receiver.obsm["rc_C"][:, ccc_rise_cmp-1])]
    
    # Calculate expression product matrix for mDC -> mDC
    df = expression_product_matrix(X_mdc_sender, X_mdc_receiver, ligand, receptor)
    df = average_product_matrix_ccc(df)
    
    # Create heatmap
    fig, ax = plt.subplots(figsize=(6, 6))
    sns.heatmap(df, ax=ax, cmap="viridis", vmax=0.12)
    ax.set_xlabel("Receiver mDCs")
    ax.set_ylabel("Sender mDCs")
    ax.set_title("CCL19-CCR7 Interaction")
    ax.set_xticks([])
    ax.set_yticks([])
    plt.tight_layout()
    plt.show()

.. image:: _static/tutorial_images/step10_expression_product.png
   :width: 600px
   :align: center

This heatmap shows the product of ligand expression (in senders) and receptor expression (in receivers), revealing which cell pairs have the highest potential for this specific interaction.