last updated: 2025-12-10 15:59:01

STRING Database

Overview

STRING is a comprehensive database of known and predicted protein-protein interactions covering 59M proteins and 20B+ interactions across 5000+ organisms. Use its REST API to query interaction networks, perform functional enrichment, and discover interaction partners for systems biology and pathway analysis.

When to Use This Skill

This skill should be used when:

  • Retrieving protein-protein interaction networks for single or multiple proteins
  • Performing functional enrichment analysis (GO, KEGG, Pfam) on protein lists
  • Discovering interaction partners and expanding protein networks
  • Testing if proteins form significantly enriched functional modules
  • Generating network visualizations with evidence-based coloring
  • Analyzing homology and protein family relationships
  • Conducting cross-species protein interaction comparisons
  • Identifying hub proteins and network connectivity patterns

Quick Start

The skill provides:

  1. Python helper functions (scripts/string_api.py) for all STRING REST API operations
  2. Comprehensive reference documentation (references/string_reference.md) with detailed API specifications

When users request STRING data, determine which operation is needed and use the appropriate function from scripts/string_api.py.

Core Operations

1. Identifier Mapping (string_map_ids)

Convert gene names, protein names, and external IDs to STRING identifiers.

When to use: Starting any STRING analysis, validating protein names, finding canonical identifiers.

Usage:

from scripts.string_api import string_map_ids

# Map single protein
result = string_map_ids('TP53', species=9606)

# Map multiple proteins
result = string_map_ids(['TP53', 'BRCA1', 'EGFR', 'MDM2'], species=9606)

# Map with multiple matches per query
result = string_map_ids('p53', species=9606, limit=5)

Parameters:

  • species: NCBI taxon ID (9606 = human, 10090 = mouse, 7227 = fly)
  • limit: Number of matches per identifier (default: 1)
  • echo_query: Include query term in output (default: 1)

Best practice: Always map identifiers first for faster subsequent queries.
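Once identifiers are mapped, it helps to turn the result into a lookup table and to check which names failed to map. The sketch below is a minimal example, assuming the mapping result is tab-separated text with queryItem and stringId columns (the layout expected when echo_query=1). The helper names mapping_to_dict and unmapped are hypothetical, and the sample row is illustrative rather than captured API output.

```python
import csv
import io

# Illustrative sample only: one mapped row in the assumed TSV layout.
SAMPLE_MAPPING_TSV = (
    "queryIndex\tqueryItem\tstringId\tncbiTaxonId\tpreferredName\n"
    "0\tTP53\t9606.ENSP00000269305\t9606\tTP53\n"
)


def mapping_to_dict(tsv_text):
    """Return {query term: STRING identifier} from a mapping TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lookup = {}
    for row in reader:
        # Keep the first match listed for each query term.
        lookup.setdefault(row["queryItem"], row["stringId"])
    return lookup


def unmapped(queries, lookup):
    """List the query terms that did not map to any STRING identifier."""
    return [q for q in queries if q not in lookup]


lookup = mapping_to_dict(SAMPLE_MAPPING_TSV)
missing = unmapped(["TP53", "NOT_A_GENE"], lookup)
print(lookup)
print("Unmapped:", missing)
```

Passing the mapped identifiers (the dictionary values) to later calls avoids repeated name resolution, and the unmapped list flags typos early.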

2. Network Retrieval (string_network)

Get protein-protein interaction network data in tabular format.

When to use: Building interaction networks, analyzing connectivity, retrieving interaction evidence.

Usage:

from scripts.string_api import string_network

# Get network for single protein
network = string_network('9606.ENSP00000269305', species=9606)

# Get network with multiple proteins
proteins = ['9606.ENSP00000269305', '9606.ENSP00000275493']
network = string_network(proteins, required_score=700)

# Expand network with additional interactors
network = string_network('TP53', species=9606, add_nodes=10, required_score=400)

# Physical interactions only
network = string_network('TP53', species=9606, network_type='physical')

Parameters:

  • required_score: Confidence threshold (0-1000)
    • 150: low confidence (exploratory)
    • 400: medium confidence (default, standard analysis)
    • 700: high confidence (conservative)
    • 900: highest confidence (very stringent)
  • network_type: 'functional' (all evidence, default) or 'physical' (direct binding only)
  • add_nodes: Add N most connected proteins (0-10)

Output columns: Interaction pairs, combined confidence scores, and individual evidence scores (neighborhood, fusion, phylogenetic profile, coexpression, experimental, database, text-mining).
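A retrieved network can be re-filtered locally without another API call. The sketch below assumes the tabular output carries preferredName_A, preferredName_B, and score columns, and that scores are reported as fractions between 0 and 1 while required_score uses the 0-1000 scale, so the threshold is divided by 1000. The function name filter_edges is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows in the assumed TSV layout; not real STRING scores.
SAMPLE_NETWORK_TSV = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "TP53\tMDM2\t0.999\n"
    "TP53\tATM\t0.850\n"
    "ATM\tXRCC1\t0.450\n"
)


def filter_edges(tsv_text, required_score):
    """Keep edges whose combined score meets a 0-1000 threshold."""
    threshold = required_score / 1000.0  # convert to the 0-1 scale
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
        for row in reader
        if float(row["score"]) >= threshold
    ]


high_conf = filter_edges(SAMPLE_NETWORK_TSV, required_score=700)
for a, b, score in high_conf:
    print(f"{a} - {b}: {score:.3f}")
```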

3. Network Visualization (string_network_image)

Generate a network visualization as a PNG image.

When to use: Creating figures, visual exploration, presentations.

Usage:

from scripts.string_api import string_network_image

# Get network image
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1']
img_data = string_network_image(proteins, species=9606, required_score=700)

# Save image
with open('network.png', 'wb') as f:
    f.write(img_data)

# Evidence-colored network
img = string_network_image(proteins, species=9606, network_flavor='evidence')

# Confidence-based visualization
img = string_network_image(proteins, species=9606, network_flavor='confidence')

# Actions network (activation/inhibition)
img = string_network_image(proteins, species=9606, network_flavor='actions')

Network flavors:

  • 'evidence': Colored lines show evidence types (default)
  • 'confidence': Line thickness represents confidence
  • 'actions': Shows activating/inhibiting relationships

4. Interaction Partners (string_interaction_partners)

Find all proteins that interact with given protein(s).

When to use: Discovering novel interactions, finding hub proteins, expanding networks.

Usage:

from scripts.string_api import string_interaction_partners

# Get top 10 interactors of TP53
partners = string_interaction_partners('TP53', species=9606, limit=10)

# Get high-confidence interactors
partners = string_interaction_partners('TP53', species=9606,
                                      limit=20, required_score=700)

# Find interactors for multiple proteins
partners = string_interaction_partners(['TP53', 'MDM2'],
                                      species=9606, limit=15)

Parameters:

  • limit: Maximum number of partners to return (default: 10)
  • required_score: Confidence threshold (0-1000)

Use cases:

  • Hub protein identification
  • Network expansion from seed proteins
  • Discovering indirect connections
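Hub identification comes down to counting distinct partners per protein. The following sketch works on a plain list of (protein A, protein B) pairs, which would first have to be extracted from the partner table. The function name degree_counts and the edge list are hypothetical examples, not STRING output.

```python
from collections import Counter


def degree_counts(edges):
    """Count distinct partners per protein in an undirected edge list."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return Counter({protein: len(ns) for protein, ns in neighbors.items()})


# Hypothetical edges; the reversed duplicate (MDM2, TP53) is counted once.
edges = [
    ("TP53", "MDM2"),
    ("TP53", "ATM"),
    ("TP53", "CHEK2"),
    ("ATM", "CHEK2"),
    ("MDM2", "TP53"),
]

degrees = degree_counts(edges)
top_hub = degrees.most_common(1)[0]
print("Top hub:", top_hub)
```

Using sets of neighbors rather than raw edge counts keeps duplicated or reversed pairs from inflating a protein's degree.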

5. Functional Enrichment (string_enrichment)

Perform enrichment analysis across Gene Ontology, KEGG pathways, Pfam domains, and more.

When to use: Interpreting protein lists, pathway analysis, functional characterization, understanding biological processes.

Usage:

from scripts.string_api import string_enrichment
import io
import pandas as pd

# Enrichment for a protein list
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1', 'ATR', 'TP73']
enrichment = string_enrichment(proteins, species=9606)

# Parse the tab-separated result to find significant terms
df = pd.read_csv(io.StringIO(enrichment), sep='\t')
significant = df[df['fdr'] < 0.05]

Enrichment categories:

  • Gene Ontology: Biological Process, Molecular Function, Cellular Component
  • KEGG Pathways: Metabolic and signaling pathways
  • Pfam: Protein domains
  • InterPro: Protein families and domains
  • SMART: Domain architecture
  • UniProt Keywords: Curated functional keywords

Output columns:

  • category: Annotation database (e.g., "KEGG Pathways", "GO Biological Process")
  • term: Term identifier
  • description: Human-readable term description
  • number_of_genes: Input proteins with this annotation
  • p_value: Uncorrected enrichment p-value
  • fdr: False discovery rate (corrected p-value)

Statistical method: Fisher's exact test with Benjamini-Hochberg FDR correction.

Interpretation: FDR < 0.05 indicates statistically significant enrichment.
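To build intuition for the fdr column, the Benjamini-Hochberg adjustment can be written in a few lines. This is a generic textbook version for illustration only, not STRING's server-side code; benjamini_hochberg is a hypothetical helper name and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [p for p, q in zip(raw, fdr) if q < 0.05]
print(fdr)
```

Each raw p-value is scaled by the number of tests divided by its rank, so terms tested alongside many others need a smaller raw p-value to pass the same FDR cutoff.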

6. PPI Enrichment (string_ppi_enrichment)

Test if a protein network has significantly more interactions than expected by chance.

When to use: Validating whether proteins form a functional module, testing network connectivity.

Usage:

from scripts.string_api import string_ppi_enrichment
import json

# Test network connectivity
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1']
result = string_ppi_enrichment(proteins, species=9606, required_score=400)

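# Parse JSON result (STRING's JSON output may wrap the record in a one-element list)
data = json.loads(result)
if isinstance(data, list):
    data = data[0]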
print(f"Observed edges: {data['number_of_edges']}")
print(f"Expected edges: {data['expected_number_of_edges']}")
print(f"P-value: {data['p_value']}")

Output fields:

  • number_of_nodes: Proteins in network
  • number_of_edges: Observed interactions
  • expected_number_of_edges: Expected in random network
  • p_value: Statistical significance

Interpretation:

  • p-value < 0.05: Network is significantly enriched (proteins likely form a functional module)
  • p-value ≥ 0.05: No significant enrichment (proteins may be unrelated)
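The interpretation rule can be wrapped in a small function that also reports how far the observed edge count exceeds expectation. This is a sketch under the assumption that the parsed record is a dictionary with the output fields listed above; summarize_ppi is a hypothetical name and the example values are invented.

```python
def summarize_ppi(record, alpha=0.05):
    """Summarize a parsed PPI enrichment record."""
    observed = record["number_of_edges"]
    expected = record["expected_number_of_edges"]
    # The ratio is undefined when no edges are expected.
    fold = observed / expected if expected else None
    return {
        "fold_over_expected": fold,
        "significant": record["p_value"] < alpha,
    }


# Invented values for illustration, not real STRING output.
example = {
    "number_of_nodes": 5,
    "number_of_edges": 10,
    "expected_number_of_edges": 2,
    "p_value": 1e-6,
}
summary = summarize_ppi(example)
print(summary)
```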

7. Homology Scores (string_homology)

Retrieve protein similarity and homology information.

When to use: Identifying protein families, paralog analysis, cross-species comparisons.

Usage:

from scripts.string_api import string_homology

# Get homology between proteins
proteins = ['TP53', 'TP63', 'TP73']  # p53 family
homology = string_homology(proteins, species=9606)

Use cases:

  • Protein family identification
  • Paralog discovery
  • Evolutionary analysis

8. Version Information (string_version)

Get current STRING database version.

When to use: Ensuring reproducibility, documenting methods.

Usage:

from scripts.string_api import string_version

version = string_version()
print(f"STRING version: {version}")

Common Analysis Workflows

Workflow 1: Protein List Analysis (Standard Workflow)

Use case: Analyze a list of proteins from an experiment (e.g., differential expression, proteomics).

from scripts.string_api import (string_map_ids, string_network,
                                string_enrichment, string_ppi_enrichment,
                                string_network_image)

# Step 1: Map gene names to STRING IDs
gene_list = ['TP53', 'BRCA1', 'ATM', 'CHEK2', 'MDM2', 'ATR', 'BRCA2']
mapping = string_map_ids(gene_list, species=9606)

# Step 2: Get interaction network
network = string_network(gene_list, species=9606, required_score=400)

# Step 3: Test if network is enriched
ppi_result = string_ppi_enrichment(gene_list, species=9606)

# Step 4: Perform functional enrichment
enrichment = string_enrichment(gene_list, species=9606)

# Step 5: Generate network visualization
img = string_network_image(gene_list, species=9606,
                          network_flavor='evidence', required_score=400)
with open('protein_network.png', 'wb') as f:
    f.write(img)

# Step 6: Parse and interpret results

Workflow 2: Single Protein Investigation

Use case: Deep dive into one protein's interactions and partners.

from scripts.string_api import (string_map_ids, string_interaction_partners,
                                string_network_image)

# Step 1: Map protein name
protein = 'TP53'
mapping = string_map_ids(protein, species=9606)

# Step 2: Get all interaction partners
partners = string_interaction_partners(protein, species=9606,
                                      limit=20, required_score=700)

# Step 3: Visualize expanded network
img = string_network_image(protein, species=9606, add_nodes=10,
                          network_flavor='confidence', required_score=700)
with open('tp53_network.png', 'wb') as f:
    f.write(img)

Workflow 3: Pathway-Centric Analysis

Use case: Identify and visualize proteins in a specific biological pathway.

from scripts.string_api import string_enrichment, string_network

# Step 1: Start with known pathway proteins
dna_repair_proteins = ['TP53', 'ATM', 'ATR', 'CHEK1', 'CHEK2',
                       'BRCA1', 'BRCA2', 'RAD51', 'XRCC1']

# Step 2: Get network
network = string_network(dna_repair_proteins, species=9606,
                        required_score=700, add_nodes=5)

# Step 3: Enrichment to confirm pathway annotation
enrichment = string_enrichment(dna_repair_proteins, species=9606)

# Step 4: Parse enrichment for DNA repair pathways
import pandas as pd
import io
df = pd.read_csv(io.StringIO(enrichment), sep='\t')
dna_repair = df[df['description'].str.contains('DNA repair', case=False)]

Workflow 4: Cross-Species Analysis

Use case: Compare protein interactions across different organisms.

from scripts.string_api import string_network

# Human network
human_network = string_network('TP53', species=9606, required_score=700)

# Mouse network
mouse_network = string_network('Trp53', species=10090, required_score=700)

# Yeast network (replace 'gene_name' with the yeast ortholog, if one exists)
yeast_network = string_network('gene_name', species=4932, required_score=700)

Workflow 5: Network Expansion and Discovery

Use case: Start with seed proteins and discover connected functional modules.

from scripts.string_api import (string_interaction_partners, string_network,
                                string_enrichment)

# Step 1: Start with seed protein(s)
seed_proteins = ['TP53']

# Step 2: Get first-degree interactors
partners = string_interaction_partners(seed_proteins, species=9606,
                                      limit=30, required_score=700)

# Step 3: Parse partners to get protein list
import pandas as pd
import io
df = pd.read_csv(io.StringIO(partners), sep='\t')
all_proteins = list(set(df['preferredName_A'].tolist() +
                       df['preferredName_B'].tolist()))

# Step 4: Perform enrichment on expanded network
enrichment = string_enrichment(all_proteins[:50], species=9606)

# Step 5: Filter for interesting functional modules
enrichment_df = pd.read_csv(io.StringIO(enrichment), sep='\t')
modules = enrichment_df[enrichment_df['fdr'] < 0.001]

Common Species

When specifying species, use NCBI taxon IDs:

Organism                  Common Name  Taxon ID
Homo sapiens              Human        9606
Mus musculus              Mouse        10090
Rattus norvegicus         Rat          10116
Drosophila melanogaster   Fruit fly    7227
Caenorhabditis elegans    C. elegans   6239
Saccharomyces cerevisiae  Yeast        4932
Arabidopsis thaliana      Thale cress  3702
Escherichia coli          E. coli      511145
Danio rerio               Zebrafish    7955

Full list available at: https://string-db.org/cgi/input?input_page_active_form=organisms
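For scripts that accept organism names from users, a small lookup keeps taxon IDs consistent. The sketch below encodes only the organisms from the table above; TAXON_IDS and taxon_id are hypothetical names, not part of scripts/string_api.py.

```python
# Common names from the table above, lower-cased for matching.
TAXON_IDS = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "fruit fly": 7227,
    "c. elegans": 6239,
    "yeast": 4932,
    "thale cress": 3702,
    "e. coli": 511145,
    "zebrafish": 7955,
}


def taxon_id(species):
    """Resolve a common name or numeric ID to an NCBI taxon ID."""
    if isinstance(species, int):
        return species
    key = species.strip().lower()
    if key.isdigit():
        return int(key)
    try:
        return TAXON_IDS[key]
    except KeyError:
        raise ValueError(f"Unknown species: {species!r}") from None


print(taxon_id("Human"), taxon_id(10090), taxon_id("Zebrafish"))
```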

Understanding Confidence Scores

STRING provides combined confidence scores (0-1000) integrating multiple evidence types:

Evidence Channels

  1. Neighborhood (nscore): Conserved genomic neighborhood across species
  2. Fusion (fscore): Gene fusion events
  3. Phylogenetic Profile (pscore): Co-occurrence patterns across species
  4. Coexpression (ascore): Correlated RNA expression
  5. Experimental (escore): Biochemical and genetic experiments
  6. Database (dscore): Curated pathway and complex databases
  7. Text-mining (tscore): Literature co-occurrence and NLP extraction
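To see how several weak channels can add up to a strong combined score, consider a simplified noisy-OR combination with a correction for the prior probability of interaction. This is an illustrative sketch, not STRING's exact procedure: the prior value of 0.041 is an assumption, the homology corrections that STRING applies to some channels are omitted, and combine_scores is a hypothetical name. Scores here are on the 0-1 scale.

```python
def combine_scores(channel_scores, prior=0.041):
    """Combine per-channel scores (0-1) with a prior-corrected noisy-OR.

    The default prior is an assumption for illustration.
    """
    def remove_prior(score):
        # Strip the prior from a channel score, clamping at zero.
        return max(score - prior, 0.0) / (1.0 - prior)

    not_interacting = 1.0
    for score in channel_scores:
        not_interacting *= 1.0 - remove_prior(score)

    combined = 1.0 - not_interacting
    # Add the prior back once for the final score.
    return combined * (1.0 - prior) + prior


single = combine_scores([0.7])
double = combine_scores([0.5, 0.5])
print(f"one channel at 0.7 -> {single:.3f}")
print(f"two channels at 0.5 -> {double:.3f}")
```

Under this scheme a single channel passes through unchanged, while two independent medium-confidence channels reinforce each other and yield a higher combined score than either alone.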

Recommended Thresholds

Choose threshold based on analysis goals:

  • 150 (low confidence): Exploratory analysis, hypothesis generation
  • 400 (medium confidence): Standard analysis, balanced sensitivity/specificity
  • 700 (high confidence): Conservative analysis, high-confidence interactions
  • 900 (highest confidence): Very stringent, experimental evidence preferred

Trade-offs:

  • Lower thresholds: More interactions (higher recall, more false positives)
  • Higher thresholds: Fewer interactions (higher precision, more false negatives)

Network Types

Functional Networks (Default)

Includes all evidence types (experimental, computational, text-mining). Represents proteins that are functionally associated, even without direct physical binding.

When to use:

  • Pathway analysis
  • Functional enrichment studies
  • Systems biology
  • Most general analyses

Physical Networks

Only includes evidence for direct physical binding (experimental data and database annotations for physical interactions).

When to use:

  • Structural biology studies
  • Protein complex analysis
  • Direct binding validation
  • When physical contact is required

API Best Practices

  1. Always map identifiers first: Use string_map_ids() before other operations for faster queries
  2. Use STRING IDs when possible: Use format 9606.ENSP00000269305 instead of gene names
  3. Specify species for networks >10 proteins: Required for accurate results
  4. Respect rate limits: Wait 1 second between API calls
  5. Use versioned URLs for reproducibility: Available in reference documentation
  6. Handle errors gracefully: Check for "Error:" prefix in returned strings
  7. Choose appropriate confidence thresholds: Match threshold to analysis goals
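Practices 4 and 6 can be combined in a small wrapper that splits a long identifier list into batches, pauses between calls, and stops on an error string. The sketch below is one possible approach: chunked, is_error, and call_in_batches are hypothetical helpers, the batch size of 100 is an arbitrary assumption, and fake_query stands in for a real function such as string_network so the example runs offline.

```python
import time


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def is_error(result):
    """Detect the "Error:" prefix used for failed calls."""
    return isinstance(result, str) and result.startswith("Error:")


def call_in_batches(func, identifiers, batch_size=100, delay=1.0,
                    sleep=time.sleep):
    """Call func once per batch, waiting `delay` seconds between calls."""
    results = []
    for n, batch in enumerate(chunked(identifiers, batch_size)):
        if n > 0:
            sleep(delay)  # pause only between calls, not before the first
        result = func(batch)
        if is_error(result):
            raise RuntimeError(result)
        results.append(result)
    return results


def fake_query(batch):
    """Stand-in for a real API helper; just joins the identifiers."""
    return ",".join(batch)


outputs = call_in_batches(fake_query, ["A", "B", "C"], batch_size=2, delay=0)
print(outputs)
```

Making the sleep function a parameter lets the wrapper be tested without real waiting; in normal use the default of one second between calls applies.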

Detailed Reference

For comprehensive API documentation, complete parameter lists, output formats, and advanced usage, refer to references/string_reference.md. This includes:

  • Complete API endpoint specifications
  • All supported output formats (TSV, JSON, XML, PSI-MI)
  • Advanced features (bulk upload, values/ranks enrichment)
  • Error handling and troubleshooting
  • Integration with other tools (Cytoscape, R, Python libraries)
  • Data license and citation information

Troubleshooting

No proteins found:

  • Verify species parameter matches identifiers
  • Try mapping identifiers first with string_map_ids()
  • Check for typos in protein names

Empty network results:

  • Lower confidence threshold (required_score)
  • Check if proteins actually interact
  • Verify species is correct

Timeout or slow queries:

  • Reduce number of input proteins
  • Use STRING IDs instead of gene names
  • Split large queries into batches

"Species required" error:

  • Add species parameter for networks with >10 proteins
  • Always include species for consistency

Results look unexpected:

  • Check STRING version with string_version()
  • Verify network_type is appropriate (functional vs physical)
  • Review confidence threshold selection

Additional Resources

For proteome-scale analysis or complete species network upload:

  • Visit https://string-db.org
  • Use "Upload proteome" feature
  • STRING will generate complete interaction network and predict functions

For bulk downloads of complete datasets, see the download section of the STRING website (https://string-db.org).

Data License

STRING data is freely available under the Creative Commons BY 4.0 license:

  • Free for academic and commercial use
  • Attribution required when publishing
  • Cite latest STRING publication

Citation

When using STRING in publications, cite the most recent publication from: https://string-db.org/cgi/about