Data Caching and Performance¶

This tutorial explains how pycancensus caches data to improve performance and how to manage the cache effectively.

Why Caching Matters¶

Census data requests can be slow due to:

Large datasets (millions of records)
Complex geographic boundaries
API rate limits
Network latency

pycancensus automatically caches API responses on disk, so a repeated identical request is read from the cache instead of being downloaded again.

import pycancensus as pc
import pandas as pd
import os
from pathlib import Path
import time

print("pycancensus caching tutorial")

pycancensus caching tutorial

How Caching Works¶

pycancensus caches every API response automatically. The cache has the following properties:

# Cache system overview
print("pycancensus Cache System:")
print("="*30)
print("• Automatic caching of all API responses")
print("• Unique cache keys based on request parameters")
print("• Persistent storage between sessions")
print("• Configurable cache location")
print("• Cache management tools")

pycancensus Cache System:
==============================
• Automatic caching of all API responses
• Unique cache keys based on request parameters
• Persistent storage between sessions
• Configurable cache location
• Cache management tools
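
To make "unique cache keys based on request parameters" concrete, here is a minimal sketch of one way such a key could be derived. It is not pycancensus's actual implementation: the helper name `make_cache_key` and the choice of SHA-256 over a JSON encoding are assumptions made for illustration only.

```python
import hashlib
import json


def make_cache_key(dataset, regions, vectors, level):
    """Hypothetical helper: hash the request parameters into a stable key."""
    payload = {
        "dataset": dataset,
        "regions": regions,
        # Sort vectors so the same set in a different order maps to the same key
        "vectors": sorted(vectors),
        "level": level,
    }
    # sort_keys makes the JSON text independent of dict insertion order
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


key_a = make_cache_key("CA21", {"CMA": "59933"}, ["v_CA21_1", "v_CA21_434"], "CSD")
key_b = make_cache_key("CA21", {"CMA": "59933"}, ["v_CA21_434", "v_CA21_1"], "CSD")
key_c = make_cache_key("CA21", {"CMA": "59933"}, ["v_CA21_1"], "CSD")

print(key_a == key_b)  # same request, vectors listed in a different order
print(key_a == key_c)  # a different set of vectors gives a different key
```

Whatever the real key scheme is, the practical consequence is the same: only requests with identical parameters can share a cache entry.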

Cache Configuration¶

Viewing Current Cache Settings¶

try:
    # Check current cache settings
    cache_path = pc.get_cache_path()
    print(f"Current cache path: {cache_path}")

    # Check if cache directory exists
    if os.path.exists(cache_path):
        print("Cache directory exists: ✓")

        # List cache contents
        cache_files = list(Path(cache_path).glob("*"))
        print(f"Cache contains {len(cache_files)} files")
    else:
        print("Cache directory doesn't exist yet")

except Exception as e:
    print(f"Error checking cache: {e}")

Current cache path: /home/docs/.cancensus_cache
Cache directory exists: ✓
Cache contains 23 files

Setting Custom Cache Location¶

try:
    # Set custom cache location
    custom_cache = os.path.expanduser("~/my_census_cache")
    pc.set_cache_path(custom_cache)

    print(f"Cache path set to: {pc.get_cache_path()}")

    # Create directory if it doesn't exist
    os.makedirs(custom_cache, exist_ok=True)
    print("Custom cache directory created")

except Exception as e:
    print(f"Error setting cache path: {e}")

Cache path set to: /home/docs/my_census_cache for current session.
Cache path set to: /home/docs/my_census_cache
Custom cache directory created

Cache in Action¶

The following example sends the same request twice and times both calls to show the effect of the cache:

try:
    # First request (uncached) - will be slower
    print("Making first request (will be cached)...")
    start_time = time.time()

    data1 = pc.get_census(
        dataset="CA21",
        regions={"CMA": "59933"},  # Vancouver
        vectors=["v_CA21_1", "v_CA21_434"],
        level="CSD"
    )

    first_time = time.time() - start_time
    print(f"First request: {first_time:.2f} seconds")
    print(f"Retrieved {len(data1)} records")

    # Second identical request (cached) - will be faster
    print("\nMaking identical request (will use cache)...")
    start_time = time.time()

    data2 = pc.get_census(
        dataset="CA21",
        regions={"CMA": "59933"},
        vectors=["v_CA21_1", "v_CA21_434"],
        level="CSD"
    )

    second_time = time.time() - start_time
    print(f"Second request: {second_time:.2f} seconds")

    # Compare performance
    if first_time > 0 and second_time > 0:
        speedup = first_time / second_time
        print(f"\nSpeedup: {speedup:.1f}x faster!")

    # Verify data is identical
    print(f"Data identical: {data1.equals(data2)}")

except Exception as e:
    print(f"Error demonstrating cache: {e}")
    print("This requires API access to demonstrate fully")

Making first request (will be cached)...
📋 Request Preview:
   Dataset: CA21
   Level: CSD
   Regions: 1 region(s)
   Variables: 2 vector(s)
🔍 Estimated Size: small (100 rows)
⏱️  Expected Time: < 5 seconds
🔄 Querying CensusMapper API for 1 region(s)...
📊 Retrieving 2 variable(s) at CSD level...

✅ Successfully retrieved data for 38 regions
📈 Data includes 2 vector columns
First request: 0.77 seconds
Retrieved 38 records

Making identical request (will use cache)...
📋 Request Preview:
   Dataset: CA21
   Level: CSD
   Regions: 1 region(s)
   Variables: 2 vector(s)
🔍 Estimated Size: small (100 rows)
⏱️  Expected Time: < 5 seconds
Reading data from cache...
Second request: 0.04 seconds

Speedup: 18.9x faster!
Data identical: True

Cache Management¶

Listing Cached Data¶

The list_cache() function returns a table with one row per cached item and columns describing each item:

try:
    # List what's in the cache
    cached_items = pc.list_cache()

    print(f"Found {len(cached_items)} cached items:")

    # list_cache() returns a DataFrame, so iterating over it yields
    # column names rather than rows; list the columns explicitly
    print("\nColumns describing each cached item:")
    for column in cached_items.columns:
        print(column)

except Exception as e:
    print(f"Error listing cache: {e}")
    print("Cache may be empty or cache directory may not exist")

Found 1 cached items:

Columns describing each cached item:
cache_key
file_path
size_mb
created
modified
dataset
level
vectors
version
geo_version

Cache Statistics¶

try:
    cache_path = pc.get_cache_path()

    if os.path.exists(cache_path):
        # Calculate cache size
        total_size = 0
        file_count = 0

        for file_path in Path(cache_path).rglob("*"):
            if file_path.is_file():
                total_size += file_path.stat().st_size
                file_count += 1

        # Convert to readable format
        if total_size > 1024**3:  # GB
            size_str = f"{total_size / 1024**3:.2f} GB"
        elif total_size > 1024**2:  # MB
            size_str = f"{total_size / 1024**2:.2f} MB"
        elif total_size > 1024:  # KB
            size_str = f"{total_size / 1024:.2f} KB"
        else:
            size_str = f"{total_size} bytes"

        print("Cache Statistics:")
        print("="*20)
        print(f"Location: {cache_path}")
        print(f"Files: {file_count}")
        print(f"Total size: {size_str}")

    else:
        print("Cache directory doesn't exist")

except Exception as e:
    print(f"Error calculating cache stats: {e}")

Cache Statistics:
====================
Location: /home/docs/my_census_cache
Files: 2
Total size: 5.57 KB

Controlling Cache Behavior¶

Refreshing the Cache¶

You can force fresh data with the use_cache=False parameter. It skips any locally cached copy and re-downloads the data. Matching the R cancensus behavior, the fresh result then replaces the cached entry, so subsequent calls with use_cache=True see the updated data:

try:
    # Force a fresh download; the cache entry is refreshed afterwards
    print("Forcing fresh data download...")

    fresh_data = pc.get_census(
        dataset="CA21",
        regions={"PR": "59"},  # British Columbia
        vectors=["v_CA21_1"],
        level="PR",
        use_cache=False  # Skip stale cache, re-download, refresh cache
    )

    print(f"Fresh data retrieved: {len(fresh_data)} records")

except Exception as e:
    print(f"Error refreshing cache: {e}")
    print("This requires API access")

Forcing fresh data download...
📋 Request Preview:
   Dataset: CA21
   Level: PR
   Regions: 1 region(s)
   Variables: 1 vector(s)
🔍 Estimated Size: small (1 rows)
⏱️  Expected Time: < 5 seconds
🔄 Querying CensusMapper API for 1 region(s)...
📊 Retrieving 1 variable(s) at PR level...

✅ Successfully retrieved data for 1 regions
📈 Data includes 1 vector columns
Fresh data retrieved: 1 records
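
The refresh semantics described above can be sketched with a small read-through cache. This is an illustration only, not pycancensus code: `fetch_with_cache`, its JSON file layout, and the `fetch` callable standing in for an API call are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def fetch_with_cache(key, fetch, cache_dir, use_cache=True):
    """Hypothetical read-through cache; `fetch` stands in for an API call."""
    cache_file = Path(cache_dir) / f"{key}.json"

    # Only read the cached copy when the caller allows it
    if use_cache and cache_file.exists():
        return json.loads(cache_file.read_text())

    # Cache miss or forced refresh: download, then overwrite the cached entry
    data = fetch()
    cache_file.write_text(json.dumps(data))
    return data


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("bc", lambda: {"version": 1}, tmp)
    # The cached copy wins, so this fetch function is never called
    cached = fetch_with_cache("bc", lambda: {"version": 2}, tmp)
    # use_cache=False skips the cached copy and replaces it
    refreshed = fetch_with_cache("bc", lambda: {"version": 3}, tmp, use_cache=False)
    # Later cached reads see the refreshed entry
    after = fetch_with_cache("bc", lambda: {"version": 4}, tmp)

print(first, cached, refreshed, after)
```

The point of the sketch is the last call: after a forced refresh, ordinary cached reads return the new data rather than the stale copy.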

In-Memory Session Cache¶

Metadata lists (list_census_vectors(), list_census_regions()) are additionally held in an in-memory session cache: repeated calls within the same Python session skip the disk entirely, which makes hierarchy navigation and variable search snappy. The session cache is invalidated automatically by remove_from_cache() and clear_cache().
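
As a rough mental model of a session cache sitting in front of the disk cache, consider the sketch below. The `SessionCache` class and its `load_from_disk` callable are hypothetical names for illustration; pycancensus's internal structure may differ.

```python
class SessionCache:
    """Hypothetical two-tier cache: an in-memory dict in front of a slower store."""

    def __init__(self, load_from_disk):
        self._memory = {}
        self._load_from_disk = load_from_disk
        self.disk_reads = 0  # counts how often the slower store was consulted

    def get(self, key):
        if key not in self._memory:
            self.disk_reads += 1
            self._memory[key] = self._load_from_disk(key)
        return self._memory[key]

    def invalidate(self, key=None):
        """Drop one key, or everything when no key is given."""
        if key is None:
            self._memory.clear()
        else:
            self._memory.pop(key, None)


disk = {"vectors_CA21": ["v_CA21_1", "v_CA21_434"]}
cache = SessionCache(load_from_disk=lambda key: disk[key])

cache.get("vectors_CA21")
cache.get("vectors_CA21")      # served from memory, no second disk read
reads_before = cache.disk_reads

cache.invalidate()             # what a cache-clearing call would trigger
cache.get("vectors_CA21")      # has to go back to disk
reads_after = cache.disk_reads

print(reads_before, reads_after)
```

Invalidating the in-memory layer whenever disk entries are removed keeps the two tiers from disagreeing, which is why clearing functions need to reset both.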

Cache Behavior¶

# pycancensus cache behavior
print("Cache Behavior:")
print("="*25)
print("• Census data rarely changes, so cache persists")
print("• Vector lists and region lists are cached")
print("• Geographic boundaries are cached")
print("• API responses include metadata for freshness")
print("\nTo force refresh:")
print("• Use use_cache=False in get_census() (re-downloads AND updates the cache)")
print("• Remove specific cache entries with remove_from_cache()")
print("• Clear entire cache with clear_cache()")

Cache Behavior:
=========================
• Census data rarely changes, so cache persists
• Vector lists and region lists are cached
• Geographic boundaries are cached
• API responses include metadata for freshness

To force refresh:
• Use use_cache=False in get_census() (re-downloads AND updates the cache)
• Remove specific cache entries with remove_from_cache()
• Clear entire cache with clear_cache()

Recalled Data¶

Statistics Canada occasionally recalls published census data after discovering errors. CensusMapper tracks these recalls, and pycancensus checks your local cache against them: each cached get_census() result records the server’s data version, and reading recalled data from the cache produces a warning.

You can inspect and clean recalled data explicitly:

try:
    # List locally cached entries that have been recalled by StatCan
    recalled = pc.list_recalled_cached_data()
    if recalled is None:
        print("Recall database unavailable")
    elif recalled.empty:
        print("No recalled data in the local cache")
    else:
        print(f"{len(recalled)} recalled cache entries:")
        print(recalled[["cache_key", "dataset", "level"]])

    # Remove them (no-op if there are none)
    pc.remove_recalled_cached_data()

except Exception as e:
    print(f"Error checking recalls: {e}")

No recalled data in the local cache
No recalled data in cached data.

Data downloaded after a recall reflects the corrected values and is not flagged. Cache entries written by pycancensus versions before 0.2.0 have no version metadata, so the recall check skips them. Clear them with clear_cache() if you want full coverage.
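
The recall check can be pictured as a comparison between the version recorded with each cache entry and a list of recalled versions. The sketch below is a simplified illustration under assumptions: the `check_recalls` helper, the record layout, and the version strings are invented here and do not reflect the real recall database format.

```python
import warnings


def check_recalls(entries, recalled_versions):
    """Hypothetical check: split cache keys into recalled and uncheckable ones."""
    recalled, unchecked = [], []
    for entry in entries:
        version = entry.get("version")
        if version is None:
            # No version metadata: the entry cannot be compared against recalls
            unchecked.append(entry["cache_key"])
        elif (entry["dataset"], version) in recalled_versions:
            recalled.append(entry["cache_key"])
    if recalled:
        warnings.warn(f"{len(recalled)} cached entries contain recalled data")
    return recalled, unchecked


entries = [
    {"cache_key": "a1", "dataset": "CA21", "version": "d.1"},
    {"cache_key": "b2", "dataset": "CA21", "version": "d.2"},
    {"cache_key": "c3", "dataset": "CA16", "version": None},  # no metadata
]
recalled_versions = {("CA21", "d.1")}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    recalled, unchecked = check_recalls(entries, recalled_versions)

print(recalled, unchecked, len(caught))
```

Entries without version metadata end up in neither the "safe" nor the "recalled" group, which is the reason clearing them is the only way to get full coverage.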

Cache Maintenance¶

Selective Cache Removal¶

You can remove specific items from the cache:

try:
    # Example: Remove a specific cached item
    # First, let's see what's in the cache
    cached_items = pc.list_cache()

    if len(cached_items) > 0:
        print(f"Cache contains {len(cached_items)} items")
        print("\nTo remove a specific item, use:")
        print("pc.remove_from_cache('cache_key_name')")
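        # Keys are stored in the cache_key column of the DataFrame returned by
        # list_cache(). Indexing it with cached_items[0] raises KeyError: 0,
        # so select the column first, for example:
        # pc.remove_from_cache(cached_items["cache_key"].iloc[0])
    else:
        print("Cache is empty - nothing to remove")

except Exception as e:
    print(f"Note: {e}")

Cache contains 2 items

To remove a specific item, use:
pc.remove_from_cache('cache_key_name')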

Complete Cache Reset¶

try:
    # Clear entire cache
    print("Complete cache reset:")
    print("="*20)

    # Get cache size before
    cache_path = pc.get_cache_path()
    if os.path.exists(cache_path):
        before_files = len(list(Path(cache_path).rglob("*")))
        print(f"Files before reset: {before_files}")

        # To clear cache, uncomment the line below:
        # pc.clear_cache()
        print("\nTo clear entire cache: pc.clear_cache()")

        print("\n⚠️  Warning: This removes all cached data!")
        print("   Use with caution as it will slow down future requests")
    else:
        print("No cache to clear")

except Exception as e:
    print(f"Error with cache reset: {e}")

Complete cache reset:
====================
Files before reset: 4

To clear entire cache: pc.clear_cache()

⚠️  Warning: This removes all cached data!
   Use with caution as it will slow down future requests

Advanced Cache Strategies¶

Preloading Common Data¶

For applications that frequently access certain data, preload it into cache:

# Strategy for preloading frequently used data
common_datasets = ["CA21", "CA16"]
# CensusMapper CMA IDs carry a province prefix; Ottawa-Gatineau spans
# two provinces and therefore has none
major_cmas = {
    "Toronto": "35535",
    "Montreal": "24462",
    "Vancouver": "59933",
    "Calgary": "48825",
    "Ottawa": "505"
}

print("Preloading Strategy:")
print("="*20)
print("For applications that frequently access certain data:")
print()

for city, cma_code in list(major_cmas.items())[:2]:  # Show 2 examples
    print(f"# Preload {city} data")
    print(f"""data_{city.lower()} = pc.get_census(
    dataset="CA21",
    regions={{"CMA": "{cma_code}"}},
    vectors=["v_CA21_1", "v_CA21_434"],  # Common vectors
    level="CSD"
)""")
    print()

print("This ensures fast access to commonly requested data.")

Preloading Strategy:
====================
For applications that frequently access certain data:

# Preload Toronto data
data_toronto = pc.get_census(
    dataset="CA21",
    regions={"CMA": "35535"},
    vectors=["v_CA21_1", "v_CA21_434"],  # Common vectors
    level="CSD"
)

# Preload Montreal data
data_montreal = pc.get_census(
    dataset="CA21",
    regions={"CMA": "24462"},
    vectors=["v_CA21_1", "v_CA21_434"],  # Common vectors
    level="CSD"
)

This ensures fast access to commonly requested data.
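
A preloading step can be written as a small loop that issues each request once and keeps going when one of them fails. The sketch below is a suggestion, not part of pycancensus: `preload` is a hypothetical helper, and `fake_fetch` stands in for pc.get_census so the example runs without API access.

```python
def preload(fetch, requests):
    """Call `fetch` once per request so later identical calls can hit the cache.

    Returns the names of requests that failed, so one bad request
    does not stop the others from being preloaded.
    """
    failed = []
    for name, params in requests.items():
        try:
            fetch(**params)
        except Exception as exc:
            print(f"Could not preload {name}: {exc}")
            failed.append(name)
    return failed


requests = {
    "vancouver_csd": dict(dataset="CA21", regions={"CMA": "59933"},
                          vectors=["v_CA21_1", "v_CA21_434"], level="CSD"),
    "vancouver_ct": dict(dataset="CA21", regions={"CMA": "59933"},
                         vectors=["v_CA21_1", "v_CA21_434"], level="CT"),
}

# Stand-in for pc.get_census: records which levels were requested
calls = []


def fake_fetch(**params):
    calls.append(params["level"])


failed = preload(fake_fetch, requests)
print(calls, failed)
```

With API access configured, the same loop could be run as preload(pc.get_census, requests), for example at application start-up.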

Cache-Aware Application Design¶

def efficient_census_analysis(regions_list, vectors_list):
    """
    Example of cache-aware function design
    """
    print("Cache-Aware Function Design:")
    print("="*30)

    # Check what's already cached
    try:
        cached = pc.list_cache()
        print(f"Found {len(cached)} cached items")
    except Exception:
        print("Cache check not available")

    # Best practices
    print("\nBest practices:")
    print("• Group requests by dataset and geography level")
    print("• Request multiple vectors in single call")
    print("• Reuse data objects when possible")
    print("• Check cache before making requests")

    return "Design pattern demonstrated"

# Example usage
result = efficient_census_analysis(
    regions_list=["35535", "24462", "59933"],  # Toronto, Montreal, Vancouver
    vectors_list=["v_CA21_1", "v_CA21_434"]
)

Cache-Aware Function Design:
==============================
Found 2 cached items

Best practices:
• Group requests by dataset and geography level
• Request multiple vectors in single call
• Reuse data objects when possible
• Check cache before making requests
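
The first two practices, grouping by dataset and level and requesting several vectors in one call, can be sketched as a batching step that runs before any request is sent. `batch_requests` and the tuple layout it expects are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def batch_requests(requests):
    """Merge requests that differ only in their vector into one call each."""
    grouped = defaultdict(list)
    for dataset, level, region, vector in requests:
        key = (dataset, level, region)
        # Skip duplicates so each vector is requested once per group
        if vector not in grouped[key]:
            grouped[key].append(vector)
    return dict(grouped)


wanted = [
    ("CA21", "CSD", "59933", "v_CA21_1"),
    ("CA21", "CSD", "59933", "v_CA21_434"),
    ("CA21", "CSD", "59933", "v_CA21_1"),  # duplicate of the first entry
    ("CA21", "PR", "59", "v_CA21_1"),
]

batches = batch_requests(wanted)
for (dataset, level, region), vectors in batches.items():
    print(dataset, level, region, vectors)
```

Four individual lookups collapse into two calls here, and each call produces one cache entry that later requests with the same parameters can reuse.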

Troubleshooting Cache Issues¶

Common Problems and Solutions¶

print("Common Cache Issues and Solutions:")
print("="*35)
print()

print("1. Cache Directory Permissions:")
print("   Problem: Can't write to cache directory")
print("   Solution: Check directory permissions or set new cache path")
print("   pc.set_cache_path('/path/with/write/access')")
print()

print("2. Disk Space:")
print("   Problem: Cache grows too large")
print("   Solution: Regular cache cleanup")
print("   pc.clear_cache()")
print()

print("3. Stale Data:")
print("   Problem: Using old cached data")
print("   Solution: Force fresh download")
print("   pc.get_census(..., use_cache=False)")
print()

print("4. Cache Corruption:")
print("   Problem: Corrupted cache files")
print("   Solution: Clear cache and start fresh")
print("   pc.clear_cache()")

Common Cache Issues and Solutions:
===================================

1. Cache Directory Permissions:
   Problem: Can't write to cache directory
   Solution: Check directory permissions or set new cache path
   pc.set_cache_path('/path/with/write/access')

2. Disk Space:
   Problem: Cache grows too large
   Solution: Regular cache cleanup
   pc.clear_cache()

3. Stale Data:
   Problem: Using old cached data
   Solution: Force fresh download
   pc.get_census(..., use_cache=False)

4. Cache Corruption:
   Problem: Corrupted cache files
   Solution: Clear cache and start fresh
   pc.clear_cache()
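
For the permissions problem, one possible fallback is to probe a list of candidate directories and use the first writable one. The helper `first_writable_dir` below is a hypothetical sketch built only on the standard library; the directory names are examples.

```python
import os
import tempfile


def first_writable_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for candidate in candidates:
        path = os.path.expanduser(candidate)
        try:
            os.makedirs(path, exist_ok=True)
            # Creating a temporary file proves write access; it is removed on close
            with tempfile.TemporaryFile(dir=path):
                pass
            return path
        except OSError:
            continue
    return None


with tempfile.TemporaryDirectory() as tmp:
    # A regular file blocks directory creation underneath it
    blocker = os.path.join(tmp, "not_a_dir")
    with open(blocker, "w") as f:
        f.write("x")
    bad = os.path.join(blocker, "cache")
    good = os.path.join(tmp, "cache")

    chosen = first_writable_dir([bad, good])
    picked_fallback = chosen == good
    nothing_usable = first_writable_dir([bad])

print(picked_fallback, nothing_usable)
```

The chosen path could then be passed to pc.set_cache_path(), with a clear error raised when the helper returns None.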

Cache Diagnostics¶

def diagnose_cache_health():
    """Diagnostic function for cache health"""

    print("Cache Health Diagnostic:")
    print("="*25)

    try:
        cache_path = pc.get_cache_path()
        print(f"✓ Cache path accessible: {cache_path}")

        # Check if writable
        test_file = os.path.join(cache_path, ".test_write")
        try:
            os.makedirs(cache_path, exist_ok=True)
            with open(test_file, 'w') as f:
                f.write("test")
            os.remove(test_file)
            print("✓ Cache directory is writable")
        except OSError:
            print("✗ Cache directory is not writable")

        # Check space usage
        if os.path.exists(cache_path):
            file_count = len(list(Path(cache_path).rglob("*")))
            print(f"✓ Cache contains {file_count} files")
        else:
            print("! Cache directory doesn't exist yet")

    except Exception as e:
        print(f"✗ Cache diagnostic error: {e}")

# Run diagnostics
diagnose_cache_health()

Cache Health Diagnostic:
=========================
✓ Cache path accessible: /home/docs/my_census_cache
✓ Cache directory is writable
✓ Cache contains 4 files

Performance Monitoring¶

Measuring Cache Effectiveness¶

class CacheMonitor:
    """Simple cache performance monitor"""

    def __init__(self):
        self.requests = []

    def log_request(self, request_type, duration, cached=False):
        self.requests.append({
            'type': request_type,
            'duration': duration,
            'cached': cached,
            'timestamp': time.time()
        })

    def get_stats(self):
        if not self.requests:
            # Print like the other branch instead of returning a string
            print("No requests logged")
            return

        cached_requests = [r for r in self.requests if r['cached']]
        uncached_requests = [r for r in self.requests if not r['cached']]

        print("Cache Performance Stats:")
        print("="*25)
        print(f"Total requests: {len(self.requests)}")
        print(f"Cache hits: {len(cached_requests)}")
        print(f"Cache misses: {len(uncached_requests)}")

        if cached_requests and uncached_requests:
            avg_cached = sum(r['duration'] for r in cached_requests) / len(cached_requests)
            avg_uncached = sum(r['duration'] for r in uncached_requests) / len(uncached_requests)
            speedup = avg_uncached / avg_cached if avg_cached > 0 else 0
            print(f"Average speedup: {speedup:.1f}x")

# Example usage
monitor = CacheMonitor()
monitor.log_request("get_census", 2.5, cached=False)
monitor.log_request("get_census", 0.1, cached=True)
monitor.get_stats()

Cache Performance Stats:
=========================
Total requests: 2
Cache hits: 1
Cache misses: 1
Average speedup: 25.0x
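
Rather than entering durations by hand, they can be measured with a small decorator. The `timed` decorator below is a hypothetical helper, and `slow_lookup` is a stand-in for a real data request; `time.perf_counter()` is used because it is intended for measuring short intervals.

```python
import time
from functools import wraps


def timed(log):
    """Decorator factory: append (function name, seconds) to `log` on every call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised
                log.append((func.__name__, time.perf_counter() - start))
        return wrapper
    return decorator


timings = []


@timed(timings)
def slow_lookup():
    time.sleep(0.01)  # stand-in for an API call
    return "done"


result = slow_lookup()
name, seconds = timings[0]
print(f"{name} took {seconds:.3f} s")
```

Each recorded duration can then be passed to a monitor such as the CacheMonitor.log_request() method shown above.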

Summary¶

This tutorial covered how pycancensus caches data and how to manage that cache:

Key Concepts Learned:

How pycancensus caching works automatically
Configuring cache location and settings
Measuring cache performance improvements
Managing and maintaining the cache
Troubleshooting common cache issues
Cache-aware application design patterns

Available Cache Functions:

get_cache_path() - Get current cache directory
set_cache_path(path) - Set custom cache directory
list_cache() - List cached items (with dataset/vector/version metadata)
remove_from_cache(key) - Remove specific cache entry
clear_cache() - Clear all cached data
list_recalled_cached_data() - List cached data recalled by StatCan
remove_recalled_cached_data() - Remove recalled cached data
use_cache=False - Parameter of get_census() that skips the cached copy, re-downloads, and refreshes the cache entry

Best Practices Summary:¶

Let cache work automatically - Default behavior is optimized
Monitor cache size - Clean up periodically if needed
Use consistent requests - Same parameters = cache hits
Batch requests - Request multiple vectors together
Handle cache errors - Have fallbacks for cache issues

Next Steps:¶

Implement cache monitoring in your applications
Set up cache maintenance routines
Experiment with preloading strategies for your use cases
Combine caching with other performance optimizations

The cache system makes pycancensus much more responsive for interactive analysis and production applications!