picasso.build_tree module
PICASSO: Phylogenetic Inference of Copy number Alterations in Single-cell Sequencing data Optimization.
This module implements the core PICASSO algorithm for reconstructing tumor phylogenies from noisy, inferred copy number alteration (CNA) data derived from single-cell RNA sequencing. The algorithm uses iterative binary splitting with categorical mixture models to handle uncertainty and noise typical in scRNA-seq-inferred CNAs.
Classes
- Picasso
Main class implementing the phylogenetic inference algorithm with noise handling capabilities designed specifically for scRNA-seq-inferred CNA data.
Examples
Basic phylogenetic reconstruction:
>>> from picasso import Picasso, load_data
>>>
>>> # Load example CNA data
>>> cna_data = load_data()
>>>
>>> # Initialize with parameters suitable for noisy data
>>> picasso = Picasso(cna_data,
... min_clone_size=10, # Larger for noisy data
... assignment_confidence_threshold=0.8)
>>>
>>> # Reconstruct phylogeny
>>> picasso.fit()
>>> phylogeny = picasso.get_phylogeny()
>>> assignments = picasso.get_clone_assignments()
Notes
The PICASSO algorithm is specifically designed to handle the challenges of:
- Noise and artifacts in scRNA-seq-inferred CNAs
- Uncertainty in copy number state assignments
- Variable clone sizes and imbalanced data
- Over-fitting to noise patterns
See also
CloneTree: Visualization and analysis of phylogenetic results
utils: Utility functions for data preprocessing and loading
itol_utils: Functions for creating iTOL-compatible visualizations
- class picasso.build_tree.Picasso(character_matrix, min_depth=None, max_depth=None, min_clone_size=5, terminate_by='probability', assignment_confidence_threshold=0.75, assignment_confidence_proportion=0.8, bic_penalty_strength=1.0)[source]
Bases: object
- __init__(character_matrix, min_depth=None, max_depth=None, min_clone_size=5, terminate_by='probability', assignment_confidence_threshold=0.75, assignment_confidence_proportion=0.8, bic_penalty_strength=1.0)[source]
Initialize the PICASSO model for phylogenetic inference from noisy CNA data.
PICASSO (Phylogenetic Inference of Copy number Alterations in Single-cell Sequencing data Optimization) reconstructs tumor phylogenies from inferred copy number alterations (CNAs) derived from single-cell RNA sequencing data. Unlike direct scDNA-seq data, scRNA-seq-inferred CNAs are noisy and require specialized handling for more accurate phylogenetic reconstruction.
- Parameters:
character_matrix (pd.DataFrame) – An integer matrix where rows are single cells/samples and columns are genomic features (e.g., chromosome arms, genes, or genomic bins). Values represent inferred copy number states (e.g., 0=deletion, 1=neutral, 2=amplification). For noisy scRNA-seq-inferred data, values may include noise artifacts that PICASSO handles through probabilistic modeling.
min_depth (int, optional) – The minimum depth (number of splitting iterations) of the phylogeny. Forces algorithm to continue splitting even if termination criteria are met, useful for exploring deeper clonal structure in noisy data. Default is None (no minimum enforced).
max_depth (int, optional) – The maximum depth of the phylogeny to prevent over-fitting in noisy data. Default is None (unlimited depth).
min_clone_size (int, default=5) – The minimum number of cells required in a clone for it to be split further. Larger values help avoid spurious clones arising from noise in scRNA-seq-inferred CNAs. Recommended: 50-100 cells for noisy data, 10-50 for high-quality data.
terminate_by ({'probability', 'BIC'}, default='probability') – The criterion used to terminate clone splitting:
- 'probability': Uses assignment confidence to handle uncertainty in noisy data
- 'BIC': Uses Bayesian Information Criterion for model selection
assignment_confidence_threshold (float, default=0.75) – Minimum confidence threshold for clone assignments when terminate_by='probability'. Higher values (0.8-0.9) are recommended for very noisy scRNA-seq data to ensure confident assignments. Must be between 0 and 1.
assignment_confidence_proportion (float, default=0.8) – Minimum proportion of cells with confident assignments required for clone splitting when terminate_by='probability'. Higher values help avoid splitting based on uncertain assignments in noisy data. Must be between 0 and 1.
bic_penalty_strength (float, default=1.0) – Strength of BIC penalty term. Higher values (>1.0) encourage simpler models, useful for noisy data to prevent over-fitting.
- terminal_clones
Dictionary tracking clones marked as terminal (no further splitting). Keys are clone identifiers, values are pandas Index objects of cell identifiers.
- Type:
Dict[str, pd.Index]
- clones
Dictionary mapping current clone IDs to pandas Index objects of cell identifiers belonging to each clone. Updated during tree construction.
- Type:
Dict[str, pd.Index]
- Raises:
AssertionError – If character_matrix is not a pandas DataFrame.
ValueError – If character_matrix cannot be converted to integer values.
AssertionError – If confidence thresholds are not between 0 and 1, or if min/max depth values are invalid.
Examples
Basic usage with scRNA-seq-inferred CNA data:
>>> from picasso import Picasso, load_data
>>>
>>> # Load example CNA data
>>> character_matrix = load_data()
>>>
>>> # Initialize PICASSO with parameters suitable for noisy data
>>> picasso = Picasso(character_matrix,
...     min_clone_size=10,  # Choose a larger value for very noisy data
...     assignment_confidence_threshold=0.85,  # Higher confidence
...     assignment_confidence_proportion=0.9)
>>>
>>> # Fit the model
>>> picasso.fit()
>>>
>>> # Get results
>>> phylogeny = picasso.get_phylogeny()
>>> clone_assignments = picasso.get_clone_assignments()
For very noisy data, use stricter parameters:
>>> # Parameters for very noisy scRNA-seq-inferred CNAs
>>> picasso_strict = Picasso(character_matrix,
...     min_clone_size=50,
...     max_depth=8,  # Limit depth to avoid over-fitting
...     assignment_confidence_threshold=0.9,
...     assignment_confidence_proportion=0.95)
>>>
>>> # Alternatively, use BIC-based termination
>>> picasso_strict = Picasso(character_matrix,
...     min_clone_size=50,
...     min_depth=3,  # Force splitting to a depth of 3
...     max_depth=8,  # Limit depth to avoid over-fitting
...     terminate_by='BIC')
>>> picasso_strict.fit()
Notes
The PICASSO algorithm proceeds through the following steps:
1. Initialization: All cells start in a single root clone.
2. Iterative Splitting: At each depth level:
   - For each current clone, fit Categorical Mixture Models with k=1 and k=2 components
   - Evaluate splitting criteria (BIC or assignment confidence)
   - Split clones that meet the criteria into two daughter clones
3. Termination: Stop when no clones can be split further or max_depth is reached.
4. Tree Construction: Build the phylogenetic tree from the clone hierarchy. Leaves are clones containing cells whose CNAs cannot be reliably distinguished further.
Handling Noisy scRNA-seq Data:
- Uses probabilistic assignment with confidence thresholds
- Minimum clone size prevents spurious small clones from noise
- BIC penalty prevents over-fitting to noise artifacts
- Confidence-based termination handles assignment uncertainty
Model Assumptions:
- CNAs are acquired progressively but can be acquired multiple times independently (no perfect phylogeny assumption)
- Each genomic feature evolves independently
- Copy number states follow categorical distributions within clones
- Noise is handled through mixture model uncertainty quantification
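To make the model assumptions above concrete, the following minimal sketch scores a single clone under independent categorical features and computes a BIC value. The function names, the parameter count, and the way `penalty_strength` multiplies the penalty term are assumptions for illustration only; the PICASSO implementation may compute these quantities differently.

```python
import math
from collections import Counter


def categorical_log_likelihood(rows):
    """Maximum-likelihood log-likelihood of one clone with independent categorical features."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate feature by feature
        counts = Counter(column)
        # The MLE of each state probability is count / n
        total += sum(c * math.log(c / n) for c in counts.values())
    return total


def n_free_parameters(n_components, n_features, n_states):
    """State probabilities per feature and component, plus the mixing weights."""
    return n_components * n_features * (n_states - 1) + (n_components - 1)


def bic(log_likelihood, n_params, n_cells, penalty_strength=1.0):
    """BIC with an assumed multiplicative penalty strength; lower is better."""
    return penalty_strength * n_params * math.log(n_cells) - 2.0 * log_likelihood


# Four hypothetical cells, three features, copy number states in {0, 1, 2}
cells = [
    [1, 1, 2],
    [1, 1, 2],
    [1, 0, 2],
    [1, 0, 2],
]
ll_k1 = categorical_log_likelihood(cells)
bic_k1 = bic(ll_k1, n_free_parameters(1, 3, 3), len(cells))
bic_k1_strict = bic(ll_k1, n_free_parameters(1, 3, 3), len(cells), penalty_strength=2.0)
```

In this sketch a larger `penalty_strength` raises the BIC of every model in proportion to its parameter count, so the k=2 model (which has roughly twice as many parameters) is penalized more than the k=1 model.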
See also
CloneTree: Class for phylogenetic tree visualization and analysis
get_phylogeny: Method to extract the reconstructed phylogeny
get_clone_assignments: Method to get cell-to-clone assignments
- split_clone(clone, force_split=False)[source]
Attempt to split a single clone into two daughter clones using mixture modeling.
Evaluates whether a clone should be split by fitting Categorical Mixture Models and applying termination criteria. This is the core method for handling noisy CNA data through probabilistic modeling and confidence-based decisions.
- Parameters:
- Returns:
Dictionary mapping new clone identifiers to pandas Index objects containing the cell/sample identifiers assigned to each clone:
- If split successful: {'{clone}-0': cells_0, '{clone}-1': cells_1}
- If terminated: {'{clone}-STOP': original_cells}
- If already terminal: {clone: original_cells}
- Return type:
Examples
>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> # After some fitting steps, try splitting a specific clone
>>> result = picasso.split_clone('1-0')
>>> print(f"Split result: {list(result.keys())}")
Force splitting (ignoring confidence criteria):
>>> forced_result = picasso.split_clone('1-1', force_split=True)
Notes
Splitting Process:
1. Check if clone is already terminal (return unchanged)
2. Extract CNA profiles for cells in the clone
3. Filter features with sufficient variance (> 1e-10) for performance improvements
4. Fit mixture models with k=1 and k=2 components
5. Evaluate termination criteria (BIC or confidence)
6. Apply minimum clone size constraint
7. Return split result or mark as terminal
Termination Criteria:
- BIC: k=1 model has lower BIC than k=2 model
- Probability: Insufficient assignment confidence or proportion
- Size constraint: Either daughter clone below min_clone_size
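The probability and size criteria above can be sketched in a few lines. This is an illustrative reading, not the library's code: the helper names are hypothetical, and whether the comparisons are inclusive (`>=`) is an assumption.

```python
def confident_enough(responsibilities, threshold=0.75, proportion=0.8):
    """True if enough cells have a dominant component probability at or above threshold."""
    confident = sum(1 for probs in responsibilities if max(probs) >= threshold)
    return confident / len(responsibilities) >= proportion


def daughters_large_enough(labels, min_clone_size=5):
    """True if both daughter clones (labels 0 and 1) meet the minimum size."""
    return min(labels.count(0), labels.count(1)) >= min_clone_size


# Posterior probabilities of the two mixture components for five hypothetical cells
resp = [[0.95, 0.05], [0.10, 0.90], [0.60, 0.40], [0.85, 0.15], [0.20, 0.80]]
labels = [probs.index(max(probs)) for probs in resp]  # hard assignment per cell

split_default = confident_enough(resp)                 # 4 of 5 cells are confident
split_strict = confident_enough(resp, threshold=0.9)   # only 2 of 5 cells are confident
size_ok = daughters_large_enough(labels, min_clone_size=2)
```

Under these assumptions a split would go ahead only when both `confident_enough` and `daughters_large_enough` hold; raising `threshold` or `proportion` makes termination more likely.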
Noise Handling:
- Confidence thresholds prevent splits based on uncertain assignments
- Minimum clone sizes avoid spurious small clusters
- Variance filtering removes uninformative features
- Multiple model fitting attempts with different initializations
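As a rough illustration of the variance filtering step, the sketch below drops features whose variance within the clone does not exceed 1e-10. It uses plain Python lists and a hypothetical helper name; the library itself operates on a pandas DataFrame and may filter differently.

```python
from statistics import pvariance


def informative_features(rows, min_variance=1e-10):
    """Return indices of columns whose population variance exceeds min_variance."""
    return [j for j, column in enumerate(zip(*rows)) if pvariance(column) > min_variance]


# Three hypothetical cells; the first feature is constant within the clone
cells = [
    [1, 2, 1],
    [1, 0, 1],
    [1, 2, 2],
]
keep = informative_features(cells)
filtered = [[row[j] for j in keep] for row in cells]
```

A feature that is constant across the clone cannot separate two daughter clones, so removing it shrinks the model without changing which split is preferred.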
- step(force_split=False)[source]
Execute one complete iteration of clone splitting across all current leaf clones.
Applies the split_clone method to every current leaf clone in turn, representing one depth level of the phylogenetic reconstruction process. This method coordinates the evaluation of all clones at the current tree depth.
- Parameters:
force_split (bool, default=False) – If True, attempts to force splits even when normal termination criteria are met. Used when enforcing minimum tree depth requirements. Individual clones may still be terminated if size constraints are violated.
- Return type:
Notes
Single Step Process:
1. Iterate through all current leaf clones
2. Apply split_clone() to each clone
3. Collect all resulting clones (split or terminal)
4. Update self.clones with the new clone structure
5. Terminal clones are tracked in self.terminal_clones
Progress Tracking:
- Uses tqdm progress bar to show splitting progress
- Logs clone processing information at debug level
- Reports clone sizes and splitting decisions
State Modification:
- Updates self.clones with new clone structure
- Adds terminal clones to self.terminal_clones
- Preserves cell-to-clone assignment mappings
Parallelization Note: Currently processes clones sequentially. Future versions may implement parallel processing for large datasets.
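The single-step bookkeeping described above can be sketched with plain dictionaries. `step_sketch` and `toy_split` are hypothetical stand-ins: the toy splitter simply halves a list of cells instead of fitting mixture models, and only the naming convention ('-0', '-1', '-STOP') is taken from this page.

```python
def step_sketch(clones, terminal_clones, split_fn):
    """Apply split_fn to each current clone once and return the updated clone mapping."""
    updated = {}
    for clone_id, cells in clones.items():
        if clone_id in terminal_clones:
            updated[clone_id] = cells  # already terminal: carried over unchanged
            continue
        for new_id, new_cells in split_fn(clone_id, cells).items():
            updated[new_id] = new_cells
            if new_id.endswith("-STOP"):
                terminal_clones[new_id] = new_cells  # remember terminal clones
    return updated


def toy_split(clone_id, cells, min_clone_size=2):
    """Hypothetical splitter: halve the cell list unless a half would be too small."""
    half = len(cells) // 2
    if half < min_clone_size:
        return {f"{clone_id}-STOP": cells}
    return {f"{clone_id}-0": cells[:half], f"{clone_id}-1": cells[half:]}


terminal = {}
clones = {"1": ["c1", "c2", "c3", "c4", "c5", "c6"]}
after_first = step_sketch(clones, terminal, toy_split)
after_second = step_sketch(after_first, terminal, toy_split)
```

Calling `step_sketch` again once every clone is terminal returns the same mapping, which is the condition a surrounding loop can use to stop.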
Examples
>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> print(f"Initial clones: {len(picasso.clones)}")
>>> picasso.step()  # Perform one splitting iteration
>>> print(f"After step: {len(picasso.clones)} clones, {len(picasso.terminal_clones)} terminal")
Force splitting to explore deeper structure:
>>> picasso.step(force_split=True)
See also
split_clone: Method applied to individual clones during this step
fit: Complete algorithm that calls step() iteratively until termination
- fit()[source]
Fit the PICASSO phylogenetic model to the noisy CNA data.
Executes the complete PICASSO algorithm by iteratively splitting clones until termination criteria are met. The algorithm is designed to handle noise and uncertainty in scRNA-seq-inferred CNA data through probabilistic modeling and confidence-based termination.
- Parameters:
None
- Returns:
Modifies the instance in-place by updating clones, terminal_clones, and depth.
- Return type:
None
Notes
The fitting process proceeds as follows:
1. Iterative Splitting: At each depth level, all current leaf clones are evaluated for splitting using Categorical Mixture Models
2. Noise Handling: Uses confidence thresholds and minimum clone sizes to avoid splits driven by noise artifacts
3. Forced Splitting: If min_depth is specified, forces splits until that depth is reached (unless clone size is insufficient)
4. Termination: Stops when all clones are terminal, max_depth is reached, or no clones meet splitting criteria
Termination Conditions:
- All leaf clones have been marked as terminal
- Maximum depth limit reached (if specified)
- No clones have sufficient size for splitting
- Confidence/BIC criteria not met for any clones
For Noisy scRNA-seq Data:
- Higher confidence thresholds prevent spurious splits
- Larger minimum clone sizes reduce noise-driven artifacts
- BIC penalty helps prevent over-fitting to noise
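The depth control described above (forced splitting below min_depth, a hard stop at max_depth) can be sketched as a small driver loop. Everything here is hypothetical: `fit_sketch` and `toy_step` are illustrative names, the toy step halves lists of cells rather than fitting mixture models, and the way depth is counted is an assumption.

```python
def fit_sketch(clones, step_fn, min_depth=None, max_depth=None):
    """Call step_fn until every clone is terminal or max_depth is reached."""
    terminal = {}
    depth = 0
    while True:
        if max_depth is not None and depth >= max_depth:
            break  # depth limit reached
        if all(clone_id in terminal for clone_id in clones):
            break  # every leaf clone is terminal
        force_split = min_depth is not None and depth < min_depth
        clones = step_fn(clones, terminal, force_split)
        depth += 1
    return clones, depth


def toy_step(clones, terminal, force_split):
    """Hypothetical step: halve each clone, standing in for the mixture-model split."""
    updated = {}
    for clone_id, cells in clones.items():
        half = len(cells) // 2
        too_small = half < 1            # size constraint applies even when forced
        criteria_met = len(cells) >= 4  # stand-in for the BIC/confidence test
        if clone_id in terminal:
            updated[clone_id] = cells
        elif too_small or not (criteria_met or force_split):
            terminal[f"{clone_id}-STOP"] = cells
            updated[f"{clone_id}-STOP"] = cells
        else:
            updated[f"{clone_id}-0"] = cells[:half]
            updated[f"{clone_id}-1"] = cells[half:]
    return updated


root = {"1": list(range(8))}
final_clones, depth = fit_sketch(root, toy_step)
capped_clones, capped_depth = fit_sketch(root, toy_step, max_depth=1)
forced_clones, forced_depth = fit_sketch(root, toy_step, min_depth=3)
```

With eight toy cells, the unconstrained run ends with four terminal clones, `max_depth=1` stops after a single split, and `min_depth=3` forces splitting down to single-cell clones before the size constraint ends the process.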
Examples
>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix, min_clone_size=8)
>>> picasso.fit()  # Fit the model
>>> print(f"Final tree depth: {picasso.depth}")
>>> print(f"Number of terminal clones: {len(picasso.terminal_clones)}")
See also
step: Perform a single splitting iteration
split_clone: Split an individual clone
get_phylogeny: Extract the fitted phylogenetic tree
- get_phylogeny()[source]
Extract the reconstructed phylogenetic tree from the fitted PICASSO model.
Converts the hierarchical clone structure into an ete3.Tree object for visualization and downstream analysis. The tree represents the inferred evolutionary relationships between clones based on their CNA profiles.
- Returns:
Phylogenetic tree where leaves represent terminal clones and internal nodes represent ancestral clones. Node names correspond to clone IDs from the splitting process (e.g., '1', '1-0', '1-1', '1-0-STOP').
- Return type:
ete3.Tree
Examples
>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> picasso.fit()
>>> tree = picasso.get_phylogeny()
>>> print(tree.get_ascii())  # Display tree structure
>>> print(f"Tree has {len(tree.get_leaves())} terminal clones")
Get leaf names:
>>> leaf_names = tree.get_leaf_names()
>>> print(f"Terminal clones: {leaf_names}")
Notes
The tree topology reflects the binary splitting process used by PICASSO
Internal nodes represent decision points where clones were split
Terminal nodes (leaves) represent final clones that could not be split further
Node names encode the splitting history (e.g., '1-0-1' = root -> left -> right)
Trees from noisy data may have different topologies due to uncertainty handling
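The naming convention in these notes can be decoded mechanically. The helper below is a hypothetical illustration that assumes '0' denotes the left daughter, '1' the right daughter, and 'STOP' a terminal marker, as in the examples on this page.

```python
def describe_path(node_name, separator="-"):
    """Translate a clone ID such as '1-0-1' into a readable splitting history."""
    parts = node_name.split(separator)
    words = {"0": "left", "1": "right", "STOP": "terminal"}
    # The first part is the root label; later parts are branch choices
    return " -> ".join(["root"] + [words[part] for part in parts[1:]])


history = describe_path("1-0-1")
terminal_history = describe_path("1-1-0-STOP")
```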
See also
get_clone_assignments: Get cell-to-clone assignments
CloneTree: Class for enhanced tree visualization and analysis
create_tree_from_paths: Static method for tree construction from paths
- get_clone_assignments()[source]
Extract cell-to-clone assignments from the fitted PICASSO model.
Returns a DataFrame mapping each cell/sample to its assigned terminal clone. These assignments represent the final clustering result after the phylogenetic reconstruction process.
- Returns:
DataFrame with cell/sample identifiers as index and a 'clone_id' column containing the assigned clone ID for each cell. Clone IDs correspond to the terminal nodes in the phylogenetic tree.
- Return type:
pd.DataFrame
Examples
>>> from picasso import Picasso, load_data
>>> character_matrix = load_data()
>>> picasso = Picasso(character_matrix)
>>> picasso.fit()
>>> assignments = picasso.get_clone_assignments()
>>> print(assignments.head())
>>> print(f"Number of clones: {assignments['clone_id'].nunique()}")
Get cells in a specific clone:
>>> clone_cells = assignments[assignments['clone_id'] == '1-0-STOP'].index
>>> print(f"Cells in clone 1-0-STOP: {list(clone_cells)}")
Clone size distribution:
>>> clone_sizes = assignments['clone_id'].value_counts()
>>> print("Clone sizes:")
>>> print(clone_sizes)
Notes
Each cell is assigned to exactly one terminal clone
Clone IDs reflect the splitting hierarchy (e.g., '1-0-STOP', '1-1-0-STOP')
The '-STOP' suffix indicates terminal clones that were not split further
Assignment quality depends on the noise level in the input CNA data
For very noisy data, some assignments may have lower confidence
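Conceptually, the assignment table is the inverse of the clone-to-cells mapping. The sketch below shows that inversion with plain dictionaries; the helper name and the cell labels are hypothetical, and the real method returns a pandas DataFrame with a 'clone_id' column rather than a dict.

```python
from collections import Counter


def assignments_from_clones(clones):
    """Invert a clone -> cells mapping into a cell -> clone mapping."""
    assignments = {}
    for clone_id, cells in clones.items():
        for cell in cells:
            if cell in assignments:
                # Each cell must belong to exactly one terminal clone
                raise ValueError(f"cell {cell!r} appears in more than one clone")
            assignments[cell] = clone_id
    return assignments


clones = {"1-0-STOP": ["cell_a", "cell_b"], "1-1-0-STOP": ["cell_c"]}
assignments = assignments_from_clones(clones)
clone_sizes = Counter(assignments.values())  # number of cells per clone
```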
See also
get_phylogeny: Get the phylogenetic tree structure
CloneTree: Class for integrated analysis of assignments and phylogeny
fit: Method that performs the clustering and phylogeny reconstruction
- static create_tree_from_paths(paths, separator=':')[source]
Construct phylogenetic tree from hierarchical clone path identifiers.
Converts a list of clone path strings into an ete3 tree structure by parsing the hierarchical splitting history encoded in each path. This is used internally by PICASSO to generate the final phylogenetic tree from the clone splitting process.
- Parameters:
paths (list of str) – List of clone path identifiers representing the hierarchical structure. Each path encodes the splitting history (e.g., '1', '1-0', '1-0-STOP', '1-1-0'). All paths must start with the same root character.
separator (str, default=':') – Character used to separate levels in the path hierarchy. PICASSO uses '-' by default for clone paths.
- Returns:
Root node of the constructed phylogenetic tree where:
- Leaves represent terminal clones
- Internal nodes represent ancestral states/splitting points
- Node names correspond to the original path identifiers
- Return type:
ete3.TreeNode
Examples
Basic tree construction from clone paths:
>>> from picasso.build_tree import Picasso
>>>
>>> # Example clone paths from PICASSO splitting
>>> clone_paths = ['1', '1-0-STOP', '1-1-0-STOP', '1-1-1-STOP']
>>> tree = Picasso.create_tree_from_paths(clone_paths, '-')
>>> print(tree.get_ascii())
>>> print(f"Leaves: {tree.get_leaf_names()}")
Notes
Path Structure:
- Root level: Single character (typically '1')
- Subsequent levels: Added via separator (e.g., '1-0', '1-1')
- Terminal indicator: Often ends with '-STOP' for final clones
Tree Construction Logic:
- Identifies common root from all paths
- Builds tree level by level based on path prefixes
- Creates parent-child relationships following path hierarchy
- Handles variable depth paths automatically
Internal Use:
- Called by get_phylogeny() to convert clone structure to tree
- Maintains clone ID information in node names
- Preserves splitting history for downstream analysis
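The prefix-based construction logic above can be illustrated without ete3 by building a nested dictionary. This is a simplified sketch under stated assumptions: `tree_from_paths` and `leaf_names` are hypothetical helpers, nodes are named by their full path prefix, and intermediate nodes such as '1-0' are created even when only '1-0-STOP' is listed. The actual static method returns an ete3 tree and may treat such nodes differently.

```python
def tree_from_paths(paths, separator="-"):
    """Build a nested dict {node_name: {child_name: {...}}} from clone path strings."""
    roots = {path.split(separator)[0] for path in paths}
    assert len(roots) == 1, "all paths must share the same root"
    tree = {}
    for path in sorted(paths):
        parts = path.split(separator)
        level = tree
        # Walk down one prefix at a time, creating missing intermediate nodes
        for depth in range(1, len(parts) + 1):
            name = separator.join(parts[:depth])
            level = level.setdefault(name, {})
    return tree


def leaf_names(tree):
    """Collect names of nodes without children, in depth-first order."""
    leaves = []
    for name, children in tree.items():
        leaves.extend(leaf_names(children) if children else [name])
    return leaves


tree = tree_from_paths(["1", "1-0-STOP", "1-1-0-STOP", "1-1-1-STOP"])
leaves = leaf_names(tree)
```

Because every node is keyed by its full prefix, paths of different depths can be mixed freely, and the leaves recovered from the nested dict are exactly the terminal clone IDs that were passed in.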
- Raises:
AssertionError – If paths don’t share a common root character.
See also
get_phylogeny: Public method that uses this function to create phylogenetic trees
fit: Method that generates the clone paths through iterative splitting