Detailed Examples
This section provides comprehensive examples for different types of analyses with PICASSO. Each example demonstrates best practices for specific scenarios you might encounter with single-cell CNA data.
Example 1: Basic Phylogeny Reconstruction
This example demonstrates the standard PICASSO workflow using the built-in example dataset.
Setup and Data Loading
import picasso
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load example CNA data
character_matrix = picasso.load_data()
print(f'Dataset: {character_matrix.shape[0]} cells × {character_matrix.shape[1]} features')
# Examine the data structure
print("Data range:", character_matrix.min().min(), "to", character_matrix.max().max())
print("First few rows:")
print(character_matrix.head())
Data Visualization
# Visualize the CNA data as a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(character_matrix.iloc[:100],
cmap='coolwarm', center=0,
cbar_kws={'label': 'Copy Number'})
plt.title('Copy Number Alterations (first 100 cells)')
plt.xlabel('Genomic Features')
plt.ylabel('Cells')
plt.tight_layout()
plt.show()
Data Encoding for Complex CNAs
# Encode complex CNAs as ternary values for better similarity handling
character_matrix = picasso.encode_cnvs_as_ternary(character_matrix)
# Visualize the encoded data
plt.figure(figsize=(12, 8))
sns.heatmap(character_matrix.iloc[:100],
cmap='coolwarm', center=0,
cbar_kws={'label': 'Copy Number'})
plt.title('Copy Number Alterations (first 100 cells)')
plt.xlabel('Genomic Features')
plt.ylabel('Cells')
plt.tight_layout()
plt.show()
Basic Phylogenetic Reconstruction
# Initialize PICASSO with standard parameters
model = picasso.Picasso(
character_matrix,
min_clone_size=5,
assignment_confidence_threshold=0.8,
assignment_confidence_proportion=0.9
)
# Fit the model
print("Reconstructing phylogeny...")
model.fit()
# Extract results
phylogeny = model.get_phylogeny()
clone_assignments = model.get_clone_assignments()
print(f"Reconstructed phylogeny with {len(phylogeny.get_leaves())} terminal clones")
Visualize Clone Size Distribution
# Plot the distribution of clone sizes
plt.figure(figsize=(10, 6))
sns.ecdfplot(clone_assignments['clone_id'].value_counts())
plt.title('Distribution of Clone Sizes')
plt.xlabel('Clone Size')
plt.ylabel('Count')
plt.show()
Alternative: BIC-based Termination
# We can also use BIC-based termination; for small datasets,
# it may terminate with less resolved clones
model = picasso.Picasso(
character_matrix,
min_clone_size=5,
terminate_by='BIC'
)
# Fit the model
print("Reconstructing phylogeny...")
model.fit()
# Extract results
phylogeny = model.get_phylogeny()
clone_assignments = model.get_clone_assignments()
print(f"Reconstructed phylogeny with {len(phylogeny.get_leaves())} terminal clones")
print(f"Clone size distribution:")
print(clone_assignments['clone_id'].value_counts().head())
Tree Analysis & Downstream Visualization
# Create CloneTree for advanced analysis
tree = picasso.CloneTree(phylogeny, clone_assignments, character_matrix)
# Root the tree at the most ancestral clone
outgroup = tree.get_most_ancestral_clone()
tree.root_tree(outgroup)
print(f"Tree rooted at clone: {outgroup}")
# Generate visualizations showing the clones and their groupings (not phylogenetic structure)
tree.plot_clone_sizes()
tree.plot_alterations()
# Get clone phylogeny as Newick string for external tools
clone_tree = tree.get_clone_phylogeny()
print("Newick format (first 100 characters):")
print(clone_tree.write()[:100] + "...")
Example 2: Filtering Very Noisy scRNA-seq Data
This example shows how to handle very noisy CNA data typically obtained from scRNA-seq inference.
Data Preparation
# Load data
character_matrix = picasso.load_data()
# Encode complex CNAs as ternary values for better similarity handling
encoded_matrix = picasso.encode_cnvs_as_ternary(character_matrix)
print(f'Original: {character_matrix.shape[1]} features')
print(f'Encoded: {encoded_matrix.shape[1]} features')
Feature Filtering for Noise Reduction & Performance Improvements
# Use encoded data for noisy data handling
data = encoded_matrix
# Remove features with very low variance (uninformative)
print(f'Features before filtering: {data.shape[1]}')
# Calculate modal proportion for each feature
modal_proportions = (data.values == data.mode(axis=0).values).mean(axis=0)
# Keep features where <99% of cells have the modal value
informative_features = modal_proportions < 0.99
filtered_data = data.loc[:, informative_features]
print(f'Features after filtering: {filtered_data.shape[1]}')
print(f'Removed {data.shape[1] - filtered_data.shape[1]} uninformative features')
Conservative Parameter Settings
# Use conservative parameters for noisy data
model = picasso.Picasso(
filtered_data,
min_depth=2, # Force minimum depth to explore structure
max_depth=12, # Limit depth to prevent overfitting
min_clone_size=50, # Larger clones for noise robustness
terminate_by='BIC', # Use conservative BIC-based termination
bic_penalty_strength=1.2 # Stronger penalty against complexity
)
print("Fitting model with conservative parameters...")
model.fit()
# Analyze results
phylogeny = model.get_phylogeny()
clone_assignments = model.get_clone_assignments()
print(f"Conservative approach: {len(phylogeny.get_leaves())} clones")
print("Clone size distribution:")
print(clone_assignments['clone_id'].value_counts().describe())
Example 3: Advanced Tree Analysis with CloneTree Class
This example shows how to extract detailed phylogenetic information from PICASSO results.
Comprehensive Tree Analysis
# Start with a fitted model (from previous examples)
data = picasso.load_data()
model = picasso.Picasso(data, min_clone_size=10)
model.fit()
# Create CloneTree with modal aggregation
tree = picasso.CloneTree(
model.get_phylogeny(),
model.get_clone_assignments(),
data,
clone_aggregation='mode' # Use modal values for clone profiles
)
# Root the tree
outgroup = tree.get_most_ancestral_clone()
tree.root_tree(outgroup)
print(f"Tree rooted at: {outgroup}")
Clone Profile Analysis
We can examine the overall CNA profile that characterizes each clone:
import numpy as np
# Get modal CNA profiles for each clone
modal_profiles, modal_frequencies = tree.get_modal_clone_profiles()
print(f"Modal profiles shape: {modal_profiles.shape}")
# Visualize clone profiles
plt.figure(figsize=(12, 8))
sns.clustermap(modal_profiles,
cmap='coolwarm', center=0,
figsize=(12, 8),
cbar_kws={'label': 'Modal Copy Number'},
col_cluster=False)
plt.title('Clone CNA Profiles (Modal Values)')
plt.show()
# Visualize the frequencies of the modal values to get a sense of how noisy the leaves are
plt.figure(figsize=(12, 8))
sns.clustermap(modal_frequencies,
cmap='Blues', vmin=0,
cbar_kws={'label': 'Modal Frequency'},
col_cluster=False)
plt.title('Clone CNA Profiles (Modal Frequencies)')
plt.show()
Sample-Level Phylogeny and Tree Statistics
# Create sample-level phylogeny (may be large as it contains all cells)
print("Creating sample phylogeny...")
sample_tree = tree.get_sample_phylogeny()
print(f"Sample tree has {len(sample_tree.get_leaves())} leaves")
# For visualization, we'll work with clone tree
clone_phylogeny = tree.get_clone_phylogeny()
# Tree statistics
print(f"Clone tree depth: {clone_phylogeny.get_farthest_leaf()[1]}")
print(f"Number of internal nodes: {len(clone_phylogeny.get_descendants()) - len(clone_phylogeny.get_leaves())}")
Example 4: iTOL Export for Publication Figures
iTOL is a sophisticated visualization tool for phylogenies. This example shows how to create publication-ready visualizations using iTOL and helper functions.
Prepare Phylogeny for iTOL
# Prepare data
data = picasso.load_data()
model = picasso.Picasso(data, min_clone_size=15)
model.fit()
tree = picasso.CloneTree(model.get_phylogeny(),
model.get_clone_assignments(),
data)
outgroup = tree.get_most_ancestral_clone()
tree.root_tree(outgroup)
# Get cell-level tree for iTOL (use clone tree if too large)
cell_tree = tree.get_sample_phylogeny()
newick_string = cell_tree.write()
# Save tree file for iTOL
with open('cell_phylogeny.nwk', 'w') as f:
f.write(newick_string)
print("Saved phylogeny to cell_phylogeny.nwk")
CNA Heatmap Annotation
# Create heatmap annotation showing CNA profiles
heatmap_annotation = picasso.itol_utils.dataframe_to_itol_heatmap(
data,
dataset_label="Copy Number Alterations",
color_min='#053061', # Dark blue for deletions
color_max='#67001f' # Dark red for amplifications
)
# Save annotation file
with open('cna_heatmap.txt', 'w') as f:
f.write(heatmap_annotation)
print("Saved CNA heatmap annotation to cna_heatmap.txt")
print("First few lines:")
print('\\n'.join(heatmap_annotation.split('\\n')[:10]))
Metadata Color Strips
# Create sample metadata for demonstration
clone_assignments = model.get_clone_assignments()
# Simulate tissue sites
np.random.seed(42) # For reproducibility
sites = np.random.choice(['Primary', 'Metastasis_1', 'Metastasis_2', 'Normal'],
size=len(clone_assignments))
sites_series = pd.Series(sites, index=clone_assignments.index, name='Tissue_Site')
# Define color mapping
site_colors = {
'Primary': '#e41a1c',
'Metastasis_1': '#377eb8',
'Metastasis_2': '#4daf4a',
'Normal': '#984ea3'
}
# Create color strip annotation
colorstrip_annotation = picasso.itol_utils.dataframe_to_itol_colorstrip(
sites_series,
site_colors,
dataset_label='Tissue Site'
)
with open('tissue_sites.txt', 'w') as f:
f.write(colorstrip_annotation)
print("Saved tissue site annotation to tissue_sites.txt")
Clone Composition Stacked Bars
# Analyze tissue composition within each clone
clone_tissue_data = clone_assignments.merge(sites_series,
left_index=True,
right_index=True)
# Calculate proportions of each tissue type within each clone
site_proportions = (clone_tissue_data.groupby('clone_id')['Tissue_Site']
.value_counts(normalize=True)
.unstack(fill_value=0))
print("Tissue proportions by clone:")
print(site_proportions.head())
# Create stacked bar annotation for clone tree
stackedbar_annotation = picasso.itol_utils.dataframe_to_itol_stackedbar(
site_proportions,
site_colors,
dataset_label='Tissue Composition'
)
with open('clone_composition.txt', 'w') as f:
f.write(stackedbar_annotation)
print("Saved clone composition annotation to clone_composition.txt")
iTOL Workflow Summary
Files created for iTOL:
1. cell_phylogeny.nwk - Main phylogenetic tree
2. cna_heatmap.txt - CNA profile heatmap
3. tissue_sites.txt - Tissue site color strips
4. clone_composition.txt - Clone composition stacked bars
Steps for iTOL visualization:
1. Go to https://itol.embl.de/
2. Upload cell_phylogeny.nwk
3. Drag and drop annotation files to add visualizations
4. Customize colors, labels, and layout
5. Export high-resolution figures
Pro tip: Use clone tree instead of cell tree for large datasets
to improve iTOL performance and readability
Summary and Best Practices
Key Takeaways
Start Simple: Begin with default parameters and basic workflow
Understand Your Data: Examine noise levels, feature characteristics, and data quality
Data Preprocessing: Use ternary encoding and feature filtering for noisy datasets
Parameter Tuning: Choose conservative parameters for noisy data
Tree Analysis: Use CloneTree for detailed phylogenetic analysis
Visualization: Use iTOL for publication-ready phylogenetic figures
Parameter Selection Guidelines
Here are general guidelines for parameter selection based on data quality:
Parameter |
Clean Data |
Noisy Data |
Very Noisy Data |
|---|---|---|---|
min_clone_size |
5-15 |
15-50 |
50-100 |
confidence_thresh |
0.7-0.8 |
0.8-0.85 |
0.85-0.95 |
max_depth |
unlimited |
10-15 |
8-12 |
terminate_by |
BIC/probability |
probability |
BIC |
bic_penalty |
1.0 |
1.0-1.2 |
1.2-1.5 |
Common Pitfalls to Avoid
Using too small
min_clone_sizewith noisy data (leads to over-fitting)Setting
max_depthtoo high with noisy data (computational burden, over-fitting)Ignoring feature filtering for high-dimensional noisy datasets
Not using ternary encoding for complex copy number data
Skipping data quality assessment before parameter selection
Next Steps
For more advanced usage, consult the API Reference documentation for detailed parameter descriptions and method specifications.