picasso.utils module
Utilities: Data preprocessing and loading functions for PICASSO.
This module provides utility functions for preprocessing copy number alteration (CNA) data and loading example datasets. It includes specialized functions for handling noisy scRNA-seq-inferred CNA data and converting complex copy number states into formats suitable for phylogenetic analysis.
Functions
- encode_cnvs_as_ternary
Convert integer CNA data to ternary encoding for improved phylogenetic inference.
- load_data
Load example CNA dataset for testing and demonstration purposes.
Examples
Data preprocessing workflow:
>>> from picasso import Picasso, load_data, encode_cnvs_as_ternary
>>>
>>> # Load example dataset
>>> cna_data = load_data()
>>> print(f"Loaded data: {cna_data.shape}")
>>>
>>> # Optional: Convert to ternary encoding for complex copy number states
>>> ternary_data = encode_cnvs_as_ternary(cna_data)
>>> print(f"Ternary encoded: {ternary_data.shape}")
>>>
>>> # Use with PICASSO
>>> picasso = Picasso(cna_data, min_clone_size=8)
>>> picasso.fit()
Notes
These utilities are specifically designed for: - Handling noisy scRNA-seq-inferred CNA data - Converting complex copy number states to phylogeny-compatible formats - Providing realistic example data for algorithm development - Supporting data preprocessing workflows
See also
PicassoMain phylogenetic inference class
CloneTreeAnalysis and visualization of phylogenetic results
- picasso.utils.encode_cnvs_as_ternary(data)[source]
Convert CNA data to ternary encoding for phylogenetic analysis.
Transforms integer copy number alteration (CNA) data into a ternary format suitable for phylogenetic inference algorithms like PICASSO. This encoding is particularly useful for handling complex copy number states and ensuring compatibility with categorical mixture models.
- Parameters:
data (pd.DataFrame or np.ndarray) – Input CNA data where rows represent cells/samples and columns represent genomic features. Values should be integers representing copy number states (e.g., 0=deletion, 1=neutral, 2=single amplification, 3=double amplification). Can handle both positive and negative copy number values.
- Returns:
Ternary-encoded DataFrame with values in {-1, 0, 1}. The number of columns is expanded based on the maximum absolute value in each original column. Column names follow the pattern ‘original_column-position’ (e.g., ‘chr1p-1’, ‘chr1p-2’).
- Return type:
pd.DataFrame
Examples
Basic encoding of copy number states:
>>> import pandas as pd >>> import numpy as np >>> from picasso.utils import encode_cnvs_as_ternary >>> >>> # Create sample CNA data >>> cna_data = pd.DataFrame({ ... 'chr1p': [0, 1, 2, 3], ... 'chr2q': [0, 0, 1, 2] ... }, index=['Cell_A', 'Cell_B', 'Cell_C', 'Cell_D']) >>> >>> print(cna_data) chr1p chr2q Cell_A 0 0 Cell_B -1 0 Cell_C 2 1 Cell_D 3 2
>>> # Encode to ternary format >>> ternary_data = encode_cnvs_as_ternary(cna_data) >>> print(ternary_data) chr1p-1 chr1p-2 chr1p-3 chr2q-1 chr2q-2 Cell_A 0 0 0 0 0 Cell_B -1 0 0 0 0 Cell_C 1 1 0 1 0 Cell_D 1 1 1 1 1
Handling deletions (negative values):
>>> # Data with deletions >>> cna_with_dels = pd.DataFrame({ ... 'chr3p': [-2, -1, 0, 1, 2], ... }, index=[f'Cell_{i}' for i in range(5)]) >>> >>> ternary_dels = encode_cnvs_as_ternary(cna_with_dels) >>> print(ternary_dels) chr3p-1 chr3p-2 Cell_0 -1 -1 Cell_1 -1 0 Cell_2 0 0 Cell_3 1 0 Cell_4 1 1
Notes
Encoding Rules: - Positive integers n are encoded as n ones followed by zeros: [1, 1, …, 1, 0, 0, …] - Negative integers -n are encoded as n negative ones: [-1, -1, …, -1] - Zero values are encoded as all zeros: [0, 0, …] - Column width is determined by the maximum absolute value in each original column
Use Cases: - Preprocessing CNA data for PICASSO phylogenetic inference - Converting complex copy number states to categorical format - Ensuring proper handling of amplifications and deletions in mixture models
Performance Considerations: - Output size scales with maximum copy number values - Memory usage increases significantly for high-amplitude CNAs - Consider binning extreme values before encoding for very noisy data. We recommend binning into ‘amplified’ and ‘highly amplified’ categories.
- Raises:
ValueError – If input data cannot be converted to integer format.
See also
PicassoMain phylogenetic inference class that accepts ternary-encoded data
load_dataFunction to load example CNA datasets
- picasso.utils.load_data()[source]
Load example single-cell copy number alteration (CNA) dataset.
Provides a sample dataset of inferred CNAs from single-cell RNA sequencing data for testing and demonstration purposes. This dataset represents the type of noisy, inferred CNA data that PICASSO is designed to handle.
- Returns:
Example CNA dataset with cells as rows and genomic features as columns. Values represent inferred copy number states, typically integers where: - 0 indicates deletions/loss - 1 indicates neutral copy number - 2+ indicates amplifications/gains Index contains cell/sample identifiers, columns contain feature names.
- Return type:
pd.DataFrame
Examples
Load and explore the example dataset:
>>> from picasso import Picasso, load_data >>> >>> # Load example data >>> cna_data = load_data() >>> print(f"Dataset shape: {cna_data.shape}") >>> print(f"Copy number range: {cna_data.min().min()} to {cna_data.max().max()}") >>> print("First few rows:") >>> print(cna_data.head()) >>> >>> # Use with PICASSO >>> picasso = Picasso(cna_data, min_clone_size=5) >>> picasso.fit()
Inspect data characteristics:
>>> # Check for missing values >>> print(f"Missing values: {cna_data.isnull().sum().sum()}") >>> >>> # Distribution of copy number states >>> print("Copy number state distribution:") >>> print(cna_data.values.flatten().astype(int)) >>> >>> # Feature-wise statistics >>> print("Per-feature statistics:") >>> print(cna_data.describe())
Notes
Dataset Characteristics: - Representative of scRNA-seq-inferred CNA data - Contains typical noise patterns and artifacts - Suitable for algorithm testing and parameter tuning - May include both amplifications and deletions
Data Origin: - Loaded from sample_data/cnas.txt in the package directory - Tab-separated format with sample IDs as first column - Preprocessed to remove extreme outliers and artifacts
Intended Use: - Algorithm development and testing - Parameter optimization for noisy datasets - Tutorial and documentation examples - Benchmarking against other methods
- Raises:
FileNotFoundError – If the sample data file cannot be located in the expected directory.
pd.errors.EmptyDataError – If the data file is empty or corrupted.
See also
PicassoMain phylogenetic inference class for analyzing the loaded data
encode_cnvs_as_ternaryPreprocessing function for complex copy number states
CloneTreeClass for visualizing and analyzing phylogenetic results