Getting Started with pycancensus

This tutorial walks through the main pycancensus workflow: browsing census vectors, navigating their parent-child hierarchies, selecting regions, and retrieving real data from the CensusMapper API.

Key Features Demonstrated:

  • list_census_vectors() - Browse all available data variables

  • Vector Hierarchies - Navigate parent-child relationships, with ASCII tree visualization

  • find_census_vectors() - Exact, semantic, and keyword search

  • Region Selection - Filter region lists and de-duplicate ambiguous names

  • Real Data Retrieval - Get actual census data

Note

You’ll need a free API key from CensusMapper to run these examples with real data.

Setup and Installation

First, let’s import pycancensus and set up our environment:

import pycancensus
from pycancensus import (
    list_census_datasets, 
    list_census_vectors, 
    get_census,
    parent_census_vectors,
    child_census_vectors,
    find_census_vectors
)
import pandas as pd

# Set your API key (replace with your actual key)
# pycancensus.set_api_key("your_api_key_here")
print("pycancensus imported successfully!")
pycancensus imported successfully!
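Pasting a key straight into a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is kept in an environment variable: the variable name CANCENSUS_API_KEY and the helper resolve_api_key() are illustrative assumptions, not documented pycancensus behaviour. Only set_api_key() comes from the setup cell above, and it is left commented out so the sketch runs without the package installed.

```python
import os


def resolve_api_key(env_var="CANCENSUS_API_KEY"):
    """Return the API key stored in `env_var`, or None if it is unset or blank.

    The helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    return key or None


api_key = resolve_api_key()
if api_key is None:
    print("No API key found; set the environment variable before requesting data.")
else:
    # set_api_key() is the call shown (commented out) in the setup cell above.
    # import pycancensus
    # pycancensus.set_api_key(api_key)
    print("API key loaded from the environment.")
```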

1. Exploring Census Vectors

The list_census_vectors() function shows all available data variables:

# List all vectors for 2021 Census
try:
    vectors_ca21 = list_census_vectors('CA21')
    print(f"CA21 Census has {len(vectors_ca21):,} vectors available")
    print(f"Columns: {list(vectors_ca21.columns)}")

    # Show how many vectors have parent relationships
    with_parents = vectors_ca21[vectors_ca21['parent_vector'].notna()]
    print(f"Vectors with parent relationships: {len(with_parents):,} out of {len(vectors_ca21):,}")
    print("\nSample hierarchy examples:")
    display(with_parents[['vector', 'parent_vector', 'label']].head())
    
except Exception as e:
    print(f"Error: {e}")
    print("Make sure you have set your API key!")
Reading vectors from cache...
CA21 Census has 7,709 vectors available
Columns: ['vector', 'label', 'type', 'units', 'aggregation', 'parent_vector', 'details']
Vectors with parent relationships: 7,448 out of 7,709

Sample hierarchy examples:
    vector     parent_vector  label
4   v_CA21_5   v_CA21_4       Private dwellings occupied by usual residents
10  v_CA21_11  v_CA21_8       0 to 14 years
11  v_CA21_12  v_CA21_9       0 to 14 years
12  v_CA21_13  v_CA21_10      0 to 14 years
13  v_CA21_14  v_CA21_11      0 to 4 years
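
Because the vector list is an ordinary pandas DataFrame, the usual filtering tools apply to it. The sketch below runs offline on a handful of rows copied from this tutorial's outputs (the root label is shortened as in the code comment later on). It shows three common questions: which vectors are roots, which labels contain a phrase, and how many direct children each parent has. The variable names are illustrative only.

```python
import pandas as pd

# Sample rows copied from the outputs in this tutorial; the real table has
# more columns (type, units, aggregation, details) and thousands of rows.
sample_vectors = pd.DataFrame(
    {
        "vector": ["v_CA21_923", "v_CA21_924", "v_CA21_939", "v_CA21_942", "v_CA21_943"],
        "label": [
            "Household total income groups in 2020",  # shortened label
            "Under $5,000",
            "$100,000 and over",
            "$150,000 to $199,999",
            "$200,000 and over",
        ],
        "parent_vector": [None, "v_CA21_923", "v_CA21_923", "v_CA21_939", "v_CA21_939"],
    }
)

# Root vectors are the rows without a parent.
roots = sample_vectors[sample_vectors["parent_vector"].isna()]

# Case-insensitive literal substring match on the label column.
over = sample_vectors[
    sample_vectors["label"].str.contains("and over", case=False, regex=False)
]

# Number of direct children recorded for each parent (missing parents are skipped).
children_per_parent = sample_vectors["parent_vector"].value_counts()

print(roots["vector"].tolist())
print(over["vector"].tolist())
print(children_per_parent.to_dict())
```

The same expressions should work on the full vectors_ca21 table, since they rely only on the vector, label, and parent_vector columns listed above.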

2. Vector Hierarchy Navigation

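Each vector records its parent in the parent_vector column, so a hierarchy can be walked from a total down to its sub-categories. The example below starts at the household income root vector and lists its direct children: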

try:
    # Find household income vector (this is our ROOT with real hierarchy)
    income_root = "v_CA21_923"  # Household total income groups in 2020
    
    # Get the vector details for context
    income_info = vectors_ca21[vectors_ca21['vector'] == income_root]
    if not income_info.empty:
        print("Household Income Hierarchy\n")
        print(f"ROOT: {income_root} - {income_info['label'].iloc[0][:50]}...")
        print("\nLEVEL 1 - Income Brackets:")
    
    # Get its direct children (income brackets)
    income_children = child_census_vectors(income_root, 'CA21')
    display(income_children[['vector', 'label', 'parent_vector']].head(8))  # Show first 8 brackets
    
except Exception as e:
    print(f"Error exploring hierarchy: {e}")
Household Income Hierarchy

ROOT: v_CA21_923 - Household total income groups in 2020 for private ...

LEVEL 1 - Income Brackets:
   vector      label               parent_vector
0  v_CA21_924  Under $5,000        v_CA21_923
1  v_CA21_925  $5,000 to $9,999    v_CA21_923
2  v_CA21_926  $10,000 to $14,999  v_CA21_923
3  v_CA21_927  $15,000 to $19,999  v_CA21_923
4  v_CA21_928  $20,000 to $24,999  v_CA21_923
5  v_CA21_929  $25,000 to $29,999  v_CA21_923
6  v_CA21_930  $30,000 to $34,999  v_CA21_923
7  v_CA21_931  $35,000 to $39,999  v_CA21_923

Drilling Down Further

try:
    # Drill down into the high-income bracket (shows grandparent -> parent -> child)
    high_income_bracket = "v_CA21_939"  # $100,000 and over
    print(f"LEVEL 2 - High-income sub-categories for '{high_income_bracket}':")

    # Get the children of the $100,000+ bracket
    high_income_subcats = child_census_vectors(high_income_bracket, 'CA21')
    display(high_income_subcats[['vector', 'label', 'parent_vector']])

    # Show the parent relationship for context
    parent_info = parent_census_vectors(high_income_bracket, 'CA21')
    if not parent_info.empty:
        print(f"\nParent of this bracket: {parent_info['vector'].iloc[0]} - {parent_info['label'].iloc[0][:50]}...")
    
except Exception as e:
    print(f"Error exploring detailed hierarchy: {e}")
LEVEL 2 - High-income sub-categories for 'v_CA21_939':
   vector      label                 parent_vector
0  v_CA21_940  $100,000 to $124,999  v_CA21_939
1  v_CA21_941  $125,000 to $149,999  v_CA21_939
2  v_CA21_942  $150,000 to $199,999  v_CA21_939
3  v_CA21_943  $200,000 and over     v_CA21_939
Parent of this bracket: v_CA21_923 - Household total income groups in 2020 for private ...
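
To see several levels at once, the parent_vector column is enough to draw a small ASCII tree. The sketch below is an offline illustration on rows copied from the outputs above. render_tree() is a hypothetical helper written for this example and is not the library's own tree visualization.

```python
import pandas as pd

# Rows copied from the outputs above (a subset of the income hierarchy).
income = pd.DataFrame(
    {
        "vector": [
            "v_CA21_923", "v_CA21_924", "v_CA21_939",
            "v_CA21_940", "v_CA21_941", "v_CA21_942", "v_CA21_943",
        ],
        "label": [
            "Household total income groups in 2020",  # shortened label
            "Under $5,000",
            "$100,000 and over",
            "$100,000 to $124,999",
            "$125,000 to $149,999",
            "$150,000 to $199,999",
            "$200,000 and over",
        ],
        "parent_vector": [
            None, "v_CA21_923", "v_CA21_923",
            "v_CA21_939", "v_CA21_939", "v_CA21_939", "v_CA21_939",
        ],
    }
)


def render_tree(table, root):
    """Return an indented ASCII tree of `root` and all of its descendants."""
    labels = dict(zip(table["vector"], table["label"]))

    # Map each parent to its children, keeping the table's row order.
    children = {}
    for vector, parent in zip(table["vector"], table["parent_vector"]):
        if pd.notna(parent):
            children.setdefault(parent, []).append(vector)

    lines = [f"{root}: {labels[root]}"]

    def walk(node, prefix):
        kids = children.get(node, [])
        for position, kid in enumerate(kids):
            is_last = position == len(kids) - 1
            connector = "`-- " if is_last else "|-- "
            lines.append(f"{prefix}{connector}{kid}: {labels[kid]}")
            # Deeper levels keep a vertical bar while siblings remain.
            walk(kid, prefix + ("    " if is_last else "|   "))

    walk(root, "")
    return "\n".join(lines)


print(render_tree(income, "v_CA21_923"))
```

The recursion assumes the table contains no cycles, which holds for a parent-child hierarchy like this one.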

Finding Parent Vectors

You can also navigate upward in the hierarchy:

try:
    # Find parent of a specific income bracket
    income_bracket = "v_CA21_942"  # $150,000 to $199,999
    parent = parent_census_vectors(income_bracket, 'CA21')
    print(f"Finding parent of income bracket '{income_bracket}':")
    display(parent[['vector', 'label', 'parent_vector']])
    
except Exception as e:
    print(f"Error finding parent: {e}")
Finding parent of income bracket 'v_CA21_942':
   vector      label                                              parent_vector
0  v_CA21_939  $100,000 and over                                  v_CA21_923
1  v_CA21_923  Household total income groups in 2020 for priv...  NaN
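
The result lists the nearest ancestor first and ends at the root, whose own parent is missing. A minimal sketch of that upward walk, run offline on three rows copied from the output above: ancestors() is a hypothetical helper for illustration and makes no claim about how parent_census_vectors() is implemented.

```python
import pandas as pd

# Three rows copied from the outputs above.
income = pd.DataFrame(
    {
        "vector": ["v_CA21_923", "v_CA21_939", "v_CA21_942"],
        "parent_vector": [None, "v_CA21_923", "v_CA21_939"],
    }
)


def ancestors(table, vector):
    """Return the ancestors of `vector`, nearest first, by following parent_vector."""
    parent_of = dict(zip(table["vector"], table["parent_vector"]))
    chain = []
    seen = {vector}  # guards against accidental cycles in the input
    current = parent_of.get(vector)
    while pd.notna(current) and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


print(ancestors(income, "v_CA21_942"))
```

A root vector has no ancestors, so the helper returns an empty list for v_CA21_923.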

4. Selecting Regions

Region lists can be filtered like any DataFrame and passed straight to get_census() with as_census_region_list(). When municipality names are ambiguous (Metro Vancouver has two Langleys and two North Vancouvers), add_unique_names_to_region_list() de-duplicates them by adding a Name column:

from pycancensus import (
    list_census_regions,
    as_census_region_list,
    add_unique_names_to_region_list,
)

try:
    regions = list_census_regions("CA21", quiet=True)
    metro_van = regions[(regions["level"] == "CSD") &
                        (regions["CMA_UID"] == "59933")]

    named = add_unique_names_to_region_list(metro_van)
    print("De-duplicated names for duplicated municipalities:")
    display(named.loc[named["name"].duplicated(keep=False),
                      ["region", "name", "Name"]])

    # Convert the selection into a get_census() regions argument
    region_arg = as_census_region_list(metro_van.head(5))
    print(f"\nRegions argument for get_census(): {region_arg}")

except Exception as e:
    print(f"Error selecting regions: {e}")
De-duplicated names for duplicated municipalities:
     region   name             Name
400  5915001  Langley          Langley (DM)
425  5915046  North Vancouver  North Vancouver (DM)
452  5915051  North Vancouver  North Vancouver (CY)
517  5915002  Langley          Langley (CY)
Regions argument for get_census(): {'CSD': ['5915022', '5915004', '5915025', '5915015', '5915034']}
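
Both helpers produce shapes that are easy to reason about with plain pandas. The sketch below reproduces them offline on a tiny table: the duplicated rows are copied from the output above, Vancouver (5915022) is added as a non-duplicated case, and the municipal_status column is an assumption about where the "(DM)" and "(CY)" suffixes come from. unique_names() and to_region_dict() are hypothetical stand-ins, not the library functions.

```python
import pandas as pd

regions = pd.DataFrame(
    {
        "region": ["5915001", "5915046", "5915051", "5915002", "5915022"],
        "name": ["Langley", "North Vancouver", "North Vancouver", "Langley", "Vancouver"],
        "level": ["CSD"] * 5,
        # Assumed column holding the municipal status code.
        "municipal_status": ["DM", "DM", "CY", "CY", "CY"],
    }
)


def unique_names(table):
    """Add a `Name` column that appends the status only to duplicated names."""
    out = table.copy()
    duplicated = out["name"].duplicated(keep=False)
    suffixed = out["name"] + " (" + out["municipal_status"] + ")"
    # Keep the plain name where it is already unique, otherwise use the suffix.
    out["Name"] = out["name"].where(~duplicated, suffixed)
    return out


def to_region_dict(table):
    """Group region codes by geographic level: {level: [region, ...]}."""
    return {
        level: group["region"].tolist()
        for level, group in table.groupby("level", sort=False)
    }


named = unique_names(regions)
region_arg = to_region_dict(regions)
print(named[["region", "name", "Name"]])
print(region_arg)
```

The dictionary has the same shape as the regions argument printed above, with one key per geographic level and a list of region codes as its value.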

5. Real Data Retrieval

Finally, let’s get actual census data using our hierarchy vectors:

try:
    # Get real data for Toronto CMA using our income hierarchy vectors
    toronto_data = get_census(
        dataset='CA21',
        regions={'CMA': '35535'},  # Toronto CMA
        vectors=['v_CA21_923', 'v_CA21_939', 'v_CA21_942', 'v_CA21_943'],  # Income categories
        level='CMA',
        labels='short',
        use_cache=False
    )
    
    print("Toronto CMA Income Demographics:")
    print("\nHousehold Income Distribution:")
    total_households = toronto_data['v_CA21_923'].iloc[0]
    high_income = toronto_data['v_CA21_939'].iloc[0]  # $100,000+
    very_high_1 = toronto_data['v_CA21_942'].iloc[0]  # $150,000-$199,999
    very_high_2 = toronto_data['v_CA21_943'].iloc[0]  # $200,000+
    
    print(f"• Total households: {total_households:,}")
    print(f"• $100,000+ income: {high_income:,} ({high_income/total_households*100:.1f}%)")
    print(f"  - $150,000-$199,999: {very_high_1:,} ({very_high_1/total_households*100:.1f}%)")
    print(f"  - $200,000+: {very_high_2:,} ({very_high_2/total_households*100:.1f}%)")
    
except Exception as e:
    print(f"Error retrieving data: {e}")
    print("This requires a valid API key and internet connection")
📋 Request Preview:
   Dataset: CA21
   Level: CMA
   Regions: 1 region(s)
   Variables: 4 vector(s)
🔍 Estimated Size: small (5 rows)
⏱️  Expected Time: < 5 seconds
🔄 Querying CensusMapper API for 1 region(s)...
📊 Retrieving 4 variable(s) at CMA level...
✅ Successfully retrieved data for 1 regions
📈 Data includes 4 vector columns
Toronto CMA Income Demographics:

Household Income Distribution:
• Total households: 2,262,475
• $100,000+ income: 1,096,595 (48.5%)
  - $150,000-$199,999: 278,755 (12.3%)
  - $200,000+: 343,145 (15.2%)
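
The hierarchy also lets you derive figures that were not requested. As a worked example using only the counts printed above, and assuming child brackets add up to their parent, the households between $100,000 and $149,999 (v_CA21_940 plus v_CA21_941) are what remains of the $100,000+ group after removing the two top brackets. Published census counts are rounded, so a derived figure like this may differ from the published brackets by a few households.

```python
# Counts copied from the Toronto CMA output above.
total_households = 2_262_475
over_100k = 1_096_595          # v_CA21_939
from_150k_to_200k = 278_755    # v_CA21_942
over_200k = 343_145            # v_CA21_943


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(part / whole * 100, 1)


# Assumption: the four children of v_CA21_939 sum to their parent, so the two
# brackets that were not requested account for the remainder.
from_100k_to_150k = over_100k - from_150k_to_200k - over_200k

print(f"$100,000-$149,999: {from_100k_to_150k:,} "
      f"({share(from_100k_to_150k, total_households)}% of all households)")
print(f"$200,000+ as a share of the $100,000+ group: "
      f"{share(over_200k, over_100k)}%")
```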

Summary

This tutorial demonstrates the enhanced pycancensus capabilities:

  1. list_census_vectors() - Browse 7,709+ available variables with explicit parent-child relationships

  2. Hierarchy Navigation - Navigate through income hierarchies from main categories to detailed brackets

  3. parent_census_vectors() & child_census_vectors() - Navigate up and down the hierarchy

  4. find_census_vectors() & visualize_vector_hierarchy() - Exact, semantic, and keyword search; ASCII hierarchy trees

  5. as_census_region_list() & add_unique_names_to_region_list() - Region selection helpers

  6. Real Data - Actual census data retrieved and analyzed

Key Improvement: Unlike previous versions, these hierarchy functions now work with clear, well-defined parent-child relationships in the census data structure.

Next Steps:

  • Explore other hierarchies, such as education and housing

  • Try different geographic levels (province, census division, etc.)

  • Use geo_format='geopandas' for spatial analysis

  • Check out the gallery examples for more advanced use cases

Getting Help

  • Documentation: Explore the API reference and other tutorials

  • Examples: Browse the example gallery for specific use cases

  • Issues: Report problems on GitHub

  • API Key: Get your free key at CensusMapper