picasso.build_tree module

PICASSO: Phylogenetic Inference of Copy number Alterations in Single-cell Sequencing data Optimization.

This module implements the core PICASSO algorithm for reconstructing tumor phylogenies from noisy, inferred copy number alteration (CNA) data derived from single-cell RNA sequencing. The algorithm uses iterative binary splitting with categorical mixture models to handle uncertainty and noise typical in scRNA-seq-inferred CNAs.

Classes

Picasso

Main class implementing the phylogenetic inference algorithm with noise handling capabilities designed specifically for scRNA-seq-inferred CNA data.

Examples

Basic phylogenetic reconstruction:

>>> from picasso import Picasso, load_data
>>>
>>> # Load example CNA data
>>> cna_data = load_data()
>>>
>>> # Initialize with parameters suitable for noisy data
>>> picasso = Picasso(cna_data,
...                  min_clone_size=10,  # Larger for noisy data
...                  assignment_confidence_threshold=0.8)
>>>
>>> # Reconstruct phylogeny
>>> picasso.fit()
>>> phylogeny = picasso.get_phylogeny()
>>> assignments = picasso.get_clone_assignments()

Notes

The PICASSO algorithm is specifically designed to handle the challenges of:

  • Noise and artifacts in scRNA-seq-inferred CNAs

  • Uncertainty in copy number state assignments

  • Variable clone sizes and imbalanced data

  • Over-fitting to noise patterns

See also

CloneTree

Visualization and analysis of phylogenetic results

utils

Utility functions for data preprocessing and loading

itol_utils

Functions for creating iTOL-compatible visualizations

class picasso.build_tree.Picasso(character_matrix, min_depth=None, max_depth=None, min_clone_size=5, terminate_by='probability', assignment_confidence_threshold=0.75, assignment_confidence_proportion=0.8, bic_penalty_strength=1.0)[source]

Bases: object

__init__(character_matrix, min_depth=None, max_depth=None, min_clone_size=5, terminate_by='probability', assignment_confidence_threshold=0.75, assignment_confidence_proportion=0.8, bic_penalty_strength=1.0)[source]

Initialize the PICASSO model for phylogenetic inference from noisy CNA data.

PICASSO (Phylogenetic Inference of Copy number Alterations in Single-cell Sequencing data Optimization) reconstructs tumor phylogenies from inferred copy number alterations (CNAs) derived from single-cell RNA sequencing data. Unlike direct scDNA-seq data, scRNA-seq-inferred CNAs are noisy and require specialized handling for more accurate phylogenetic reconstruction.

Parameters:
  • character_matrix (pd.DataFrame) – An integer matrix where rows are single cells/samples and columns are genomic features (e.g., chromosome arms, genes, or genomic bins). Values represent inferred copy number states (e.g., 0=deletion, 1=neutral, 2=amplification). For noisy scRNA-seq-inferred data, values may include noise artifacts that PICASSO handles through probabilistic modeling.

  • min_depth (int, optional) – The minimum depth (number of splitting iterations) of the phylogeny. Forces algorithm to continue splitting even if termination criteria are met, useful for exploring deeper clonal structure in noisy data. Default is None (no minimum enforced).

  • max_depth (int, optional) – The maximum depth of the phylogeny to prevent over-fitting in noisy data. Default is None (unlimited depth).

  • min_clone_size (int, default=5) – The minimum number of cells required in a clone for it to be split further. Larger values help avoid spurious clones arising from noise in scRNA-seq-inferred CNAs. Recommended: 50-100 cells for noisy data, 10-50 for high-quality data.

  • terminate_by ({'probability', 'BIC'}, default='probability') – The criterion used to terminate clone splitting:

      • 'probability': Uses assignment confidence to handle uncertainty in noisy data

      • 'BIC': Uses Bayesian Information Criterion for model selection

  • assignment_confidence_threshold (float, default=0.75) – Minimum confidence threshold for clone assignments when terminate_by='probability'. Higher values (0.8-0.9) recommended for very noisy scRNA-seq data to ensure confident assignments. Must be between 0 and 1.

  • assignment_confidence_proportion (float, default=0.8) – Minimum proportion of cells with confident assignments required for clone splitting when terminate_by='probability'. Higher values help avoid splitting based on uncertain assignments in noisy data. Must be between 0 and 1.

  • bic_penalty_strength (float, default=1.0) – Strength of BIC penalty term. Higher values (>1.0) encourage simpler models, useful for noisy data to prevent over-fitting.

terminal_clones

Dictionary tracking clones marked as terminal (no further splitting). Keys are clone identifiers, values are pandas Index objects of cell identifiers.

Type:

Dict[str, pd.Index]

clones

Dictionary mapping current clone IDs to pandas Index objects of cell identifiers belonging to each clone. Updated during tree construction.

Type:

Dict[str, pd.Index]

depth

Current depth of phylogenetic tree construction.

Type:

int

Raises:
  • AssertionError – If character_matrix is not a pandas DataFrame.

  • ValueError – If character_matrix cannot be converted to integer values.

  • AssertionError – If confidence thresholds are not between 0 and 1, or if min/max depth values are invalid.

Examples

Basic usage with scRNA-seq-inferred CNA data:

>>> from picasso import Picasso, load_data
>>>
>>> # Load example CNA data
>>> character_matrix = load_data()
>>>
>>> # Initialize PICASSO with parameters suitable for noisy data
>>> picasso = Picasso(character_matrix,
...                  min_clone_size=10,  # Choose a larger value for very noisy data
...                  assignment_confidence_threshold=0.85,  # Higher confidence
...                  assignment_confidence_proportion=0.9)
>>>
>>> # Fit the model
>>> picasso.fit()
>>>
>>> # Get results
>>> phylogeny = picasso.get_phylogeny()
>>> clone_assignments = picasso.get_clone_assignments()

For very noisy data, use stricter parameters:

>>> # Parameters for very noisy scRNA-seq-inferred CNAs
>>> picasso_strict = Picasso(character_matrix,
...                         min_clone_size=50,
...                         max_depth=8,  # Limit depth to avoid over-fitting
...                         assignment_confidence_threshold=0.9,
...                         assignment_confidence_proportion=0.95)  # Require confident assignments for 95% of cells
>>> # Alternatively, use BIC-based termination
>>> picasso_strict = Picasso(character_matrix,
...                         min_clone_size=50,
...                         min_depth=3, # Force splitting to a depth of 3
...                         max_depth=8,  # Limit depth to avoid over-fitting
...                         terminate_by='BIC')
>>> picasso_strict.fit()

Notes

The PICASSO algorithm proceeds through the following steps:

  1. Initialization: All cells start in a single root clone

  2. Iterative Splitting: At each depth level:

       • For each current clone, fit Categorical Mixture Models with k=1 and k=2 components

       • Evaluate splitting criteria (BIC or assignment confidence)

       • Split clones that meet criteria into two daughter clones

  3. Termination: Stop when no clones can be split further or max_depth is reached

  4. Tree Construction: Build phylogenetic tree from clone hierarchy. Leaves are clones containing cells whose CNAs cannot be further distinguished reliably.

Handling Noisy scRNA-seq Data:

  • Uses probabilistic assignment with confidence thresholds

  • Minimum clone size prevents spurious small clones from noise

  • BIC penalty prevents over-fitting to noise artifacts

  • Confidence-based termination handles assignment uncertainty

Model Assumptions:

  • CNAs are acquired progressively but can be acquired multiple times independently (no perfect phylogeny assumption)

  • Each genomic feature evolves independently

  • Copy number states follow categorical distributions within clones

  • Noise is handled through mixture model uncertainty quantification
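The model assumptions above imply a simple likelihood for a single clone: each feature has its own categorical distribution over copy number states, and features contribute independently. The sketch below computes that single-component (k=1) log-likelihood in plain Python. The name clone_log_likelihood and the toy matrix are illustrative assumptions, not part of the picasso API, and the library's own mixture model may differ (for example by smoothing probabilities).

```python
import math
from collections import Counter


def clone_log_likelihood(rows):
    """Log-likelihood of cells under one categorical distribution per feature.

    rows: list of equal-length lists of integer copy number states
    (cells x features). State probabilities are the maximum-likelihood
    frequencies observed in each feature (an assumption of this sketch).
    """
    n_cells = len(rows)
    total = 0.0
    for feature_values in zip(*rows):  # iterate feature by feature
        counts = Counter(feature_values)
        for count in counts.values():
            # each of the `count` cells in this state adds log(count / n_cells)
            total += count * math.log(count / n_cells)
    return total


# Toy data (hypothetical): 4 cells x 2 features
toy_matrix = [[1, 2], [1, 2], [1, 0], [1, 0]]
toy_ll = clone_log_likelihood(toy_matrix)
```

A constant feature contributes nothing to the log-likelihood, while a feature that is evenly divided between two states is exactly the kind of signal a two-component model can explain better.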

See also

CloneTree

Class for phylogenetic tree visualization and analysis

get_phylogeny

Method to extract the reconstructed phylogeny

get_clone_assignments

Method to get cell-to-clone assignments

terminal_clones: Dict[str, Index]
clones: Dict[str, Index]
depth: int
split_clone(clone, force_split=False)[source]

Attempt to split a single clone into two daughter clones using mixture modeling.

Evaluates whether a clone should be split by fitting Categorical Mixture Models and applying termination criteria. This is the core method for handling noisy CNA data through probabilistic modeling and confidence-based decisions.

Parameters:
  • clone (str) – Identifier of the clone to attempt splitting. Should be a key in self.clones.

  • force_split (bool, default=False) – If True, override normal termination criteria and force splitting (used when min_depth hasn’t been reached). Still respects minimum clone size constraints.

Returns:

Dictionary mapping new clone identifiers to pandas Index objects containing the cell/sample identifiers assigned to each clone:

  • If split successful: {'{clone}-0': cells_0, '{clone}-1': cells_1}

  • If terminated: {'{clone}-STOP': original_cells}

  • If already terminal: {clone: original_cells}

Return type:

dict
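The three return shapes can be told apart by their keys alone. The helper below is a hypothetical convenience function (not part of picasso) that labels a split_clone-style result; plain lists stand in for pandas Index objects so the sketch runs without dependencies.

```python
def classify_split_result(clone, result):
    """Label a split_clone-style result as 'split', 'terminated' or 'terminal'.

    clone: the clone ID that was passed to split_clone.
    result: the returned dict (clone ID -> cell identifiers).
    """
    keys = set(result)
    if keys == {f"{clone}-0", f"{clone}-1"}:
        return "split"
    if keys == {f"{clone}-STOP"}:
        return "terminated"
    if keys == {clone}:
        return "terminal"
    raise ValueError(f"unexpected keys for clone {clone!r}: {sorted(keys)}")


# Hypothetical results, using lists of cell names in place of pd.Index
split_label = classify_split_result("1", {"1-0": ["a", "b"], "1-1": ["c"]})
stop_label = classify_split_result("1-0", {"1-0-STOP": ["a", "b"]})
same_label = classify_split_result("1-0-STOP", {"1-0-STOP": ["a", "b"]})
```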

Examples

>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> picasso.step()  # Run one splitting iteration so that daughter clones exist
>>> # Try splitting a specific clone (the ID must be a key in picasso.clones)
>>> result = picasso.split_clone('1-0')
>>> print(f"Split result: {list(result.keys())}")

Force splitting (ignoring confidence criteria):

>>> forced_result = picasso.split_clone('1-1', force_split=True)

Notes

Splitting Process:

  1. Check if clone is already terminal (return unchanged)

  2. Extract CNA profiles for cells in the clone

  3. Keep only features with variance above 1e-10 (dropping constant features speeds up fitting)

  4. Fit mixture models with k=1 and k=2 components

  5. Evaluate termination criteria (BIC or confidence)

  6. Apply minimum clone size constraint

  7. Return split result or mark as terminal

Termination Criteria:

  • BIC: k=1 model has lower BIC than k=2 model

  • Probability: Insufficient assignment confidence or proportion

  • Size constraint: Either daughter clone below min_clone_size
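The three termination criteria can be written as small predicates. This is a sketch under stated assumptions, not the library's code: the function names are hypothetical, the BIC is the textbook form k·ln(n) − 2·ln(L) with the penalty term scaled by penalty_strength (how picasso applies bic_penalty_strength internally is not documented here), and thresholds are compared with >=.

```python
import math


def bic(log_likelihood, n_params, n_cells, penalty_strength=1.0):
    """Textbook BIC with a scaled penalty term (the scaling is an assumption)."""
    return penalty_strength * n_params * math.log(n_cells) - 2.0 * log_likelihood


def stop_by_bic(bic_k1, bic_k2):
    """Terminate when the single-component model has the lower BIC."""
    return bic_k1 < bic_k2


def stop_by_probability(responsibilities, threshold=0.75, proportion=0.8):
    """Terminate when too few cells are confidently assigned.

    responsibilities: one sequence per cell holding its posterior
    probabilities for the two mixture components.
    """
    confident = sum(1 for probs in responsibilities if max(probs) >= threshold)
    return confident / len(responsibilities) < proportion


def stop_by_size(size_0, size_1, min_clone_size=5):
    """Terminate when either daughter clone would be too small."""
    return min(size_0, size_1) < min_clone_size


# Hypothetical posteriors for four cells: three are confident at 0.75
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.95, 0.05]]
too_uncertain = stop_by_probability(posteriors)  # 3/4 confident is below 0.8
```

Under this form, raising penalty_strength increases the BIC of the k=2 model more than that of the k=1 model (it has more parameters), which is why larger values favour stopping.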

Noise Handling:

  • Confidence thresholds prevent splits based on uncertain assignments

  • Minimum clone sizes avoid spurious small clusters

  • Variance filtering removes uninformative features

  • Multiple model fitting attempts with different initializations
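Variance filtering is easy to picture on a toy matrix: a feature whose value is identical in every cell of the clone cannot help separate two daughter clones. The sketch below uses only the standard library; informative_features and the feature names are hypothetical, and the 1e-10 cutoff is taken from the splitting process described above.

```python
from statistics import pvariance


def informative_features(rows, feature_names, min_variance=1e-10):
    """Return names of features whose variance across cells exceeds the cutoff.

    rows: list of equal-length lists of copy number states (cells x features).
    """
    columns = zip(*rows)  # transpose to one tuple per feature
    return [
        name
        for name, column in zip(feature_names, columns)
        if pvariance(column) > min_variance
    ]


# Toy clone (hypothetical): only the second feature varies between cells
toy_rows = [[1, 2, 1], [1, 0, 1], [1, 2, 1]]
kept = informative_features(toy_rows, ["chr1p", "chr1q", "chr2p"])
```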

See also

step

Apply split_clone to all current leaf clones

_select_model

Internal method for mixture model fitting

fit

Main method that orchestrates the complete splitting process

step(force_split=False)[source]

Execute one complete iteration of clone splitting across all current leaf clones.

Applies the split_clone method to every current leaf clone in turn, representing one depth level of the phylogenetic reconstruction process. This method coordinates the evaluation of all clones at the current tree depth; clones are currently processed sequentially, not in parallel.

Parameters:

force_split (bool, default=False) – If True, attempts to force splits even when normal termination criteria are met. Used when enforcing minimum tree depth requirements. Individual clones may still be terminated if size constraints are violated.

Return type:

None

Notes

Single Step Process:

  1. Iterate through all current leaf clones

  2. Apply split_clone() to each clone

  3. Collect all resulting clones (split or terminal)

  4. Update self.clones with the new clone structure

  5. Terminal clones are tracked in self.terminal_clones

Progress Tracking:

  • Uses tqdm progress bar to show splitting progress

  • Logs clone processing information at debug level

  • Reports clone sizes and splitting decisions

State Modification:

  • Updates self.clones with new clone structure

  • Adds terminal clones to self.terminal_clones

  • Preserves cell-to-clone assignment mappings

Parallelization Note: Currently processes clones sequentially. Future versions may implement parallel processing for large datasets.
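The bookkeeping of a single step can be sketched with plain dictionaries. Everything here is illustrative: step_sketch and toy_split are hypothetical names, lists stand in for pandas Index objects, and toy_split halves a clone by position instead of fitting a mixture model.

```python
def step_sketch(clones, terminal_clones, split_fn):
    """Apply split_fn to every current clone and collect the results.

    clones, terminal_clones: dicts mapping clone ID -> list of cell IDs.
    split_fn(clone_id, cells): returns a dict shaped like split_clone's result.
    """
    new_clones = {}
    for clone_id, cells in clones.items():  # sequential, one clone at a time
        for new_id, new_cells in split_fn(clone_id, cells).items():
            new_clones[new_id] = new_cells
            if new_id.endswith("-STOP"):
                terminal_clones[new_id] = new_cells  # track terminal clones
    return new_clones


def toy_split(clone_id, cells, min_clone_size=2):
    """Stand-in splitter: halve the clone unless a half would be too small."""
    if clone_id.endswith("-STOP"):
        return {clone_id: cells}  # already terminal, returned unchanged
    half = len(cells) // 2
    if half < min_clone_size:
        return {f"{clone_id}-STOP": cells}
    return {f"{clone_id}-0": cells[:half], f"{clone_id}-1": cells[half:]}


terminal = {}
start = {"1": ["c1", "c2", "c3", "c4", "c5"]}
after_one = step_sketch(start, terminal, toy_split)
after_two = step_sketch(after_one, terminal, toy_split)
```

After the second call every clone ID ends in '-STOP', so a further step would leave the structure unchanged.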

Examples

>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> print(f"Initial clones: {len(picasso.clones)}")
>>> picasso.step()  # Perform one splitting iteration
>>> print(f"After step: {len(picasso.clones)} clones, {len(picasso.terminal_clones)} terminal")

Force splitting to explore deeper structure:

>>> picasso.step(force_split=True)

See also

split_clone

Method applied to individual clones during this step

fit

Complete algorithm that calls step() iteratively until termination

fit()[source]

Fit the PICASSO phylogenetic model to the noisy CNA data.

Executes the complete PICASSO algorithm by iteratively splitting clones until termination criteria are met. The algorithm is designed to handle noise and uncertainty in scRNA-seq-inferred CNA data through probabilistic modeling and confidence-based termination.

Parameters:

None

Returns:

Modifies the instance in-place by updating clones, terminal_clones, and depth.

Return type:

None

Notes

The fitting process proceeds as follows:

  1. Iterative Splitting: At each depth level, all current leaf clones are evaluated for splitting using Categorical Mixture Models

  2. Noise Handling: Uses confidence thresholds and minimum clone sizes to avoid splits driven by noise artifacts

  3. Forced Splitting: If min_depth is specified, forces splits until that depth is reached (unless clone size is insufficient)

  4. Termination: Stops when all clones are terminal, max_depth is reached, or no clones meet splitting criteria

Termination Conditions:

  • All leaf clones have been marked as terminal

  • Maximum depth limit reached (if specified)

  • No clones have sufficient size for splitting

  • Confidence/BIC criteria not met for any clones

For Noisy scRNA-seq Data:

  • Higher confidence thresholds prevent spurious splits

  • Larger minimum clone sizes reduce noise-driven artifacts

  • BIC penalty helps prevent over-fitting to noise
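How min_depth and max_depth interact with the stepping loop can be sketched as follows. This is a guess at the control flow based only on the description above: fit_sketch and ToyModel are hypothetical, and the real method may order its checks differently.

```python
def fit_sketch(step_fn, all_terminal, min_depth=None, max_depth=None):
    """Call step_fn until every clone is terminal or max_depth is reached.

    step_fn(force_split=...): performs one splitting iteration.
    all_terminal(): returns True when no clone can be split further.
    Returns the number of iterations performed (the tree depth).
    """
    depth = 0
    while not all_terminal():
        if max_depth is not None and depth >= max_depth:
            break  # depth limit reached
        # Force splits while the tree is shallower than min_depth
        force = min_depth is not None and depth < min_depth
        step_fn(force_split=force)
        depth += 1
    return depth


class ToyModel:
    """Hypothetical model whose clones all become terminal after a few steps."""

    def __init__(self, stop_after):
        self.stop_after = stop_after
        self.force_flags = []  # records force_split for each step taken

    def step(self, force_split=False):
        self.force_flags.append(force_split)

    def all_terminal(self):
        return len(self.force_flags) >= self.stop_after


model = ToyModel(stop_after=5)
final_depth = fit_sketch(model.step, model.all_terminal, min_depth=2, max_depth=3)
```

In this toy run the first two iterations are forced, the third is not, and the loop then stops at max_depth even though the model could have continued.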

Examples

>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix, min_clone_size=8)
>>> picasso.fit()  # Fit the model
>>> print(f"Final tree depth: {picasso.depth}")
>>> print(f"Number of terminal clones: {len(picasso.terminal_clones)}")

See also

step

Perform a single splitting iteration

split_clone

Split an individual clone

get_phylogeny

Extract the fitted phylogenetic tree

get_phylogeny()[source]

Extract the reconstructed phylogenetic tree from the fitted PICASSO model.

Converts the hierarchical clone structure into an ete3.Tree object for visualization and downstream analysis. The tree represents the inferred evolutionary relationships between clones based on their CNA profiles.

Returns:

Phylogenetic tree where leaves represent terminal clones and internal nodes represent ancestral clones. Node names correspond to clone IDs from the splitting process (e.g., ‘1’, ‘1-0’, ‘1-1’, ‘1-0-STOP’).

Return type:

ete3.Tree

Examples

>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> picasso.fit()
>>> tree = picasso.get_phylogeny()
>>> print(tree.get_ascii())  # Display tree structure
>>> print(f"Tree has {len(tree.get_leaves())} terminal clones")

Get leaf names:

>>> leaf_names = tree.get_leaf_names()
>>> print(f"Terminal clones: {leaf_names}")

Notes

  • The tree topology reflects the binary splitting process used by PICASSO

  • Internal nodes represent decision points where clones were split

  • Terminal nodes (leaves) represent final clones that could not be split further

  • Node names encode the splitting history (e.g., ‘1-0-1’ = root -> left -> right)

  • Trees from noisy data may have different topologies due to uncertainty handling
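Because node names encode the splitting history, a clone ID can be decoded with ordinary string handling. The helper below is a hypothetical illustration (not a picasso function) that assumes the '-' separator and '-STOP' suffix described in these notes.

```python
def describe_clone_id(clone_id, separator="-"):
    """Decode a clone ID such as '1-0-1-STOP' into its splitting history."""
    parts = clone_id.split(separator)
    is_terminal = parts[-1] == "STOP"
    if is_terminal:
        parts = parts[:-1]  # drop the terminal marker
    root, branches = parts[0], parts[1:]
    return {
        "root": root,
        "branches": [int(b) for b in branches],  # 0 or 1 at each split
        "depth": len(branches),
        "terminal": is_terminal,
    }


info = describe_clone_id("1-0-1-STOP")
```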

See also

get_clone_assignments

Get cell-to-clone assignments

CloneTree

Class for enhanced tree visualization and analysis

create_tree_from_paths

Static method for tree construction from paths

get_clone_assignments()[source]

Extract cell-to-clone assignments from the fitted PICASSO model.

Returns a DataFrame mapping each cell/sample to its assigned terminal clone. These assignments represent the final clustering result after the phylogenetic reconstruction process.

Returns:

DataFrame with cell/sample identifiers as index and a ‘clone_id’ column containing the assigned clone ID for each cell. Clone IDs correspond to the terminal nodes in the phylogenetic tree.

Return type:

pd.DataFrame

Examples

>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> picasso.fit()
>>> assignments = picasso.get_clone_assignments()
>>> print(assignments.head())
>>> print(f"Number of clones: {assignments['clone_id'].nunique()}")

Get cells in a specific clone:

>>> clone_cells = assignments[assignments['clone_id'] == '1-0-STOP'].index
>>> print(f"Cells in clone 1-0-STOP: {list(clone_cells)}")

Clone size distribution:

>>> clone_sizes = assignments['clone_id'].value_counts()
>>> print("Clone sizes:")
>>> print(clone_sizes)

Notes

  • Each cell is assigned to exactly one terminal clone

  • Clone IDs reflect the splitting hierarchy (e.g., ‘1-0-STOP’, ‘1-1-0-STOP’)

  • The ‘-STOP’ suffix indicates terminal clones that were not split further

  • Assignment quality depends on the noise level in the input CNA data

  • For very noisy data, some assignments may have lower confidence
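Conceptually, the assignment table is the inverse of the clone-to-cells mapping held in terminal_clones. The sketch below shows that inversion with plain dictionaries and lists; assignments_from_terminal_clones is a hypothetical name, and the real method returns a pandas DataFrame with a 'clone_id' column, not a dict.

```python
from collections import Counter


def assignments_from_terminal_clones(terminal_clones):
    """Invert {clone ID: cells} into {cell: clone ID}.

    Raises ValueError if a cell appears in more than one clone, since each
    cell should belong to exactly one terminal clone.
    """
    assignments = {}
    for clone_id, cells in terminal_clones.items():
        for cell in cells:
            if cell in assignments:
                raise ValueError(f"cell {cell!r} assigned to two clones")
            assignments[cell] = clone_id
    return assignments


# Hypothetical terminal clones with made-up cell names
toy_terminal = {"1-0-STOP": ["cellA", "cellB"], "1-1-STOP": ["cellC"]}
toy_assignments = assignments_from_terminal_clones(toy_terminal)
toy_sizes = Counter(toy_assignments.values())  # clone size distribution
```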

See also

get_phylogeny

Get the phylogenetic tree structure

CloneTree

Class for integrated analysis of assignments and phylogeny

fit

Method that performs the clustering and phylogeny reconstruction

static create_tree_from_paths(paths, separator=':')[source]

Construct phylogenetic tree from hierarchical clone path identifiers.

Converts a list of clone path strings into an ete3 tree structure by parsing the hierarchical splitting history encoded in each path. This is used internally by PICASSO to generate the final phylogenetic tree from the clone splitting process.

Parameters:
  • paths (list of str) – List of clone path identifiers representing the hierarchical structure. Each path encodes the splitting history (e.g., '1', '1-0', '1-0-STOP', '1-1-0'). All paths must start with the same root character.

  • separator (str, default=':') – Character used to separate levels in the path hierarchy. The default is ':', but PICASSO clone paths use '-', so pass separator='-' when parsing them.

Returns:

Root node of the constructed phylogenetic tree where:

  • Leaves represent terminal clones

  • Internal nodes represent ancestral states/splitting points

  • Node names correspond to the original path identifiers

Return type:

ete3.TreeNode

Examples

Basic tree construction from clone paths:

>>> from picasso.build_tree import Picasso
>>>
>>> # Example clone paths from PICASSO splitting
>>> clone_paths = ['1', '1-0-STOP', '1-1-0-STOP', '1-1-1-STOP']
>>> tree = Picasso.create_tree_from_paths(clone_paths, '-')
>>> print(tree.get_ascii())
>>> print(f"Leaves: {tree.get_leaf_names()}")

Notes

Path Structure:

  • Root level: Single character (typically '1')

  • Subsequent levels: Added via separator (e.g., '1-0', '1-1')

  • Terminal indicator: Often ends with '-STOP' for final clones

Tree Construction Logic:

  • Identifies common root from all paths

  • Builds tree level by level based on path prefixes

  • Creates parent-child relationships following path hierarchy

  • Handles variable depth paths automatically

Internal Use:

  • Called by get_phylogeny() to convert clone structure to tree

  • Maintains clone ID information in node names

  • Preserves splitting history for downstream analysis
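The prefix-based construction can be illustrated without ete3 by building a nested dictionary in which each key is one path component. This is only a sketch of the idea: nested_tree_from_paths and leaf_paths are hypothetical helpers, and the real method returns an ete3 tree whose node naming may differ.

```python
def nested_tree_from_paths(paths, separator="-"):
    """Build a nested dict {component: children} from clone path strings."""
    roots = {path.split(separator)[0] for path in paths}
    assert len(roots) == 1, "all paths must share a common root"
    tree = {}
    for path in paths:
        node = tree
        for part in path.split(separator):
            node = node.setdefault(part, {})  # descend, creating nodes as needed
    return tree


def leaf_paths(tree, prefix=(), separator="-"):
    """List the full path of every leaf, in insertion order."""
    leaves = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            leaves.extend(leaf_paths(children, path, separator))
        else:
            leaves.append(separator.join(path))
    return leaves


clone_paths = ["1", "1-0-STOP", "1-1-0-STOP", "1-1-1-STOP"]
toy_tree = nested_tree_from_paths(clone_paths)
toy_leaves = leaf_paths(toy_tree)
```

Paths of different depths need no special handling because each path simply extends the branch it shares a prefix with.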

Raises:

AssertionError – If paths don’t share a common root character.

See also

get_phylogeny

Public method that uses this function to create phylogenetic trees

fit

Method that generates the clone paths through iterative splitting