Getting Started with pycancensus

This tutorial walks through the main pycancensus workflow: browsing census vectors, navigating their parent-child hierarchy, searching for vectors, selecting regions, and retrieving data from the CensusMapper API.

Key Features Demonstrated:

list_census_vectors() - Browse all available data variables
Vector Hierarchies - Navigate parent-child relationships, with ASCII tree visualization
find_census_vectors() - Exact, semantic, and keyword search
Region Selection - Filter region lists and de-duplicate ambiguous names
Real Data Retrieval - Get actual census data

Note

You’ll need a free API key from CensusMapper to run these examples with real data.

Setup and Installation

First, let’s import pycancensus and set up our environment:

import pycancensus
from pycancensus import (
    list_census_datasets, 
    list_census_vectors, 
    get_census,
    parent_census_vectors,
    child_census_vectors,
    find_census_vectors
)
import pandas as pd

# Set your API key (replace with your actual key)
# pycancensus.set_api_key("your_api_key_here")
print("pycancensus imported successfully!")

pycancensus imported successfully!

1. Exploring Census Vectors

The list_census_vectors() function shows all available data variables:

# List all vectors for 2021 Census
try:
    vectors_ca21 = list_census_vectors('CA21')
    print(f"CA21 Census has {len(vectors_ca21):,} vectors available")
    print(f"Columns: {list(vectors_ca21.columns)}")

    # Show how many vectors have parent relationships
    with_parents = vectors_ca21[vectors_ca21['parent_vector'].notna()]
    print(f"Vectors with parent relationships: {len(with_parents):,} out of {len(vectors_ca21):,}")
    print("\nSample hierarchy examples:")
    display(with_parents[['vector', 'parent_vector', 'label']].head())
    
except Exception as e:
    print(f"Error: {e}")
    print("Make sure you have set your API key!")

Reading vectors from cache...
CA21 Census has 7,709 vectors available
Columns: ['vector', 'label', 'type', 'units', 'aggregation', 'parent_vector', 'details']
Vectors with parent relationships: 7,448 out of 7,709

Sample hierarchy examples:

	vector	parent_vector	label
4	v_CA21_5	v_CA21_4	Private dwellings occupied by usual residents
10	v_CA21_11	v_CA21_8	0 to 14 years
11	v_CA21_12	v_CA21_9	0 to 14 years
12	v_CA21_13	v_CA21_10	0 to 14 years
13	v_CA21_14	v_CA21_11	0 to 4 years
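
The flat vector table encodes a tree through its parent_vector column. As a minimal sketch of that idea, the snippet below groups the five sample rows shown above by parent, using plain Python rather than the DataFrame returned by list_census_vectors(). The children_of name is introduced here for illustration only and is not part of pycancensus.

```python
from collections import defaultdict

# (vector, parent_vector, label) rows copied from the sample output above
rows = [
    ("v_CA21_5", "v_CA21_4", "Private dwellings occupied by usual residents"),
    ("v_CA21_11", "v_CA21_8", "0 to 14 years"),
    ("v_CA21_12", "v_CA21_9", "0 to 14 years"),
    ("v_CA21_13", "v_CA21_10", "0 to 14 years"),
    ("v_CA21_14", "v_CA21_11", "0 to 4 years"),
]

# Group children under their parent to turn the flat table into a tree index
children_of = defaultdict(list)
for vector, parent, _label in rows:
    children_of[parent].append(vector)

# A vector can be both a child and a parent: v_CA21_11 sits under v_CA21_8
# and has v_CA21_14 beneath it
print(children_of["v_CA21_8"])
print(children_of["v_CA21_11"])
```

With the full vector list, the same grouping would give every parent its list of direct children.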

2. Vector Hierarchy Navigation

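Each vector's parent_vector column links it to its parent, so you can walk the hierarchy with child_census_vectors() and parent_census_vectors(). The example below starts from the household income root and lists its direct children: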

try:
    # Find household income vector (this is our ROOT with real hierarchy)
    income_root = "v_CA21_923"  # Household total income groups in 2020
    
    # Get the vector details for context
    income_info = vectors_ca21[vectors_ca21['vector'] == income_root]
    if not income_info.empty:
        print("Household Income Hierarchy\n")
        print(f"ROOT: {income_root} - {income_info['label'].iloc[0][:50]}...")
        print("\nLEVEL 1 - Income Brackets:")
    
    # Get its direct children (income brackets)
    income_children = child_census_vectors(income_root, 'CA21')
    display(income_children[['vector', 'label', 'parent_vector']].head(8))  # Show first 8 brackets
    
except Exception as e:
    print(f"Error exploring hierarchy: {e}")

Household Income Hierarchy

ROOT: v_CA21_923 - Household total income groups in 2020 for private ...

LEVEL 1 - Income Brackets:

	vector	label	parent_vector
0	v_CA21_924	Under $5,000	v_CA21_923
1	v_CA21_925	$5,000 to $9,999	v_CA21_923
2	v_CA21_926	$10,000 to $14,999	v_CA21_923
3	v_CA21_927	$15,000 to $19,999	v_CA21_923
4	v_CA21_928	$20,000 to $24,999	v_CA21_923
5	v_CA21_929	$25,000 to $29,999	v_CA21_923
6	v_CA21_930	$30,000 to $34,999	v_CA21_923
7	v_CA21_931	$35,000 to $39,999	v_CA21_923

Drilling Down Further

try:
    # Drill down into the high-income bracket (shows grandparent -> parent -> child)
    high_income_bracket = "v_CA21_939"  # $100,000 and over
    print(f"LEVEL 2 - High-income sub-categories for '{high_income_bracket}':")

    # Get the children of the $100,000+ bracket
    high_income_subcats = child_census_vectors(high_income_bracket, 'CA21')
    display(high_income_subcats[['vector', 'label', 'parent_vector']])

    # Show the parent relationship for context
    parent_info = parent_census_vectors(high_income_bracket, 'CA21')
    if not parent_info.empty:
        print(f"\nParent of this bracket: {parent_info['vector'].iloc[0]} - {parent_info['label'].iloc[0][:50]}...")
    
except Exception as e:
    print(f"Error exploring detailed hierarchy: {e}")

LEVEL 2 - High-income sub-categories for 'v_CA21_939':

	vector	label	parent_vector
0	v_CA21_940	$100,000 to $124,999	v_CA21_939
1	v_CA21_941	$125,000 to $149,999	v_CA21_939
2	v_CA21_942	$150,000 to $199,999	v_CA21_939
3	v_CA21_943	$200,000 and over	v_CA21_939

Parent of this bracket: v_CA21_923 - Household total income groups in 2020 for private ...

Finding Parent Vectors

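You can also navigate upward in the hierarchy. In the output below, parent_census_vectors() returns the whole chain of ancestors, nearest first, not just the immediate parent: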

try:
    # Find parent of a specific income bracket
    income_bracket = "v_CA21_942"  # $150,000 to $199,999
    parent = parent_census_vectors(income_bracket, 'CA21')
    print(f"Finding parent of income bracket '{income_bracket}':")
    display(parent[['vector', 'label', 'parent_vector']])
    
except Exception as e:
    print(f"Error finding parent: {e}")

Finding parent of income bracket 'v_CA21_942':

	vector	label	parent_vector
0	v_CA21_939	$100,000 and over	v_CA21_923
1	v_CA21_923	Household total income groups in 2020 for priv...	NaN
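
The upward walk can be pictured as repeatedly following parent_vector links until a vector has no parent. The sketch below reproduces the chain shown above from a hand-copied dictionary. The ancestors() helper and the parent_of mapping are hypothetical names used for illustration, not the pycancensus implementation.

```python
# parent_vector links copied from the outputs in this tutorial
parent_of = {
    "v_CA21_942": "v_CA21_939",  # $150,000 to $199,999
    "v_CA21_939": "v_CA21_923",  # $100,000 and over
    "v_CA21_923": None,          # root: household total income groups
}


def ancestors(vector, parent_of):
    """Return the chain of ancestors, nearest first, stopping at the root."""
    chain = []
    current = parent_of.get(vector)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain


print(ancestors("v_CA21_942", parent_of))
```

The root vector has no parent, so asking for its ancestors returns an empty list.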

3. Enhanced Vector Search

The find_census_vectors() function supports exact, semantic, and keyword search (matching the R cancensus package):

try:
    # Search for income-related vectors
    income_vectors = find_census_vectors('income', 'CA21', query_type='keyword')
    print(f"Found {len(income_vectors)} income-related vectors")
    print("\nTop income vectors:")
    display(income_vectors[['vector', 'label']].head(3))
    
except Exception as e:
    print(f"Error searching vectors: {e}")

Found 649 income-related vectors

Top income vectors:

	vector	label
511	v_CA21_554	Income statistics in 2020 for the population a...
512	v_CA21_555	Income statistics in 2020 for the population a...
513	v_CA21_556	Income statistics in 2020 for the population a...

Semantic search tolerates misspellings and loose phrasing, which helps when you don’t know the exact census terminology. Loose matching can return many vectors, so check the returned labels before using them:

try:
    # "comute" is misspelled on purpose; semantic search still finds it
    commute_vectors = find_census_vectors('comute duration', 'CA16',
                                          query_type='semantic', quiet=True)
    print(f"Found {len(commute_vectors)} vectors despite the typo")
    display(commute_vectors[['vector', 'label']].head(3))

except Exception as e:
    print(f"Error in semantic search: {e}")

Found 552 vectors despite the typo

	vector	label
4928	v_CA16_5051	Total - Highest certificate, diploma or degree...
4929	v_CA16_5052	Total - Highest certificate, diploma or degree...
4930	v_CA16_5053	Total - Highest certificate, diploma or degree...
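
To build intuition for typo-tolerant matching, here is a small sketch that uses fuzzy string matching from Python's standard difflib module. This is a substitute technique chosen for illustration and is not how find_census_vectors() implements semantic search. The labels are illustrative, not API output, and fuzzy_match() is a hypothetical helper.

```python
import difflib

# Illustrative labels, not actual API output
labels = [
    "Total - Commuting duration for the employed labour force",
    "Total - Main mode of commuting for the employed labour force",
    "Total - Highest certificate, diploma or degree",
]


def fuzzy_match(query, labels, cutoff=0.6):
    """Keep labels where every query word closely matches some label word."""
    hits = []
    for label in labels:
        words = label.lower().replace(",", " ").split()
        if all(difflib.get_close_matches(q, words, n=1, cutoff=cutoff)
               for q in query.lower().split()):
            hits.append(label)
    return hits


# "comute" is close enough to "commuting" to match despite the typo
print(fuzzy_match("comute duration", labels))
```

Lowering cutoff accepts looser matches and returns more labels; raising it makes the match stricter.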

You can also visualize where a vector sits in its hierarchy as an ASCII tree:

from pycancensus import visualize_vector_hierarchy

try:
    # Low-income status hierarchy from the 2016 census
    visualize_vector_hierarchy("v_CA16_2510", quiet=True)

except Exception as e:
    print(f"Error visualizing hierarchy: {e}")

v_CA16_2510: Total - Low-income status in 2015 for the population in private households to whom low-income concepts are applicable - 100% data
├── v_CA16_2513: 0 to 17 years
│   └── v_CA16_2516: 0 to 5 years (leaf)
├── v_CA16_2519: 18 to 64 years (leaf)
└── v_CA16_2522: 65 years and over (leaf)
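
A tree like the one above can be drawn with a short recursive function over a parent-to-children mapping. The sketch below rebuilds the same layout from hand-copied labels, with the root label shortened. render_tree() is a hypothetical helper and may differ from what visualize_vector_hierarchy() does internally.

```python
# Labels and parent-child links copied from the tree printed above
labels = {
    "v_CA16_2510": "Total - Low-income status in 2015",  # shortened
    "v_CA16_2513": "0 to 17 years",
    "v_CA16_2516": "0 to 5 years",
    "v_CA16_2519": "18 to 64 years",
    "v_CA16_2522": "65 years and over",
}
children = {
    "v_CA16_2510": ["v_CA16_2513", "v_CA16_2519", "v_CA16_2522"],
    "v_CA16_2513": ["v_CA16_2516"],
}


def render_tree(vector, prefix=""):
    """Return the lines drawn beneath `vector`, one per descendant."""
    lines = []
    kids = children.get(vector, [])
    for i, kid in enumerate(kids):
        last = i == len(kids) - 1
        leaf = "" if kid in children else " (leaf)"
        branch = "└── " if last else "├── "
        lines.append(f"{prefix}{branch}{kid}: {labels[kid]}{leaf}")
        # Descendants of a non-final branch keep the vertical bar going
        lines.extend(render_tree(kid, prefix + ("    " if last else "│   ")))
    return lines


tree = [f"v_CA16_2510: {labels['v_CA16_2510']}"] + render_tree("v_CA16_2510")
print("\n".join(tree))
```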

4. Selecting Regions

A region list is a regular DataFrame, so you can filter it as usual and then convert the selection with as_census_region_list() into the regions argument for get_census(). When municipality names are ambiguous (there are two Langleys in Metro Vancouver), add_unique_names_to_region_list() adds a Name column that tells them apart:

from pycancensus import (
    list_census_regions,
    as_census_region_list,
    add_unique_names_to_region_list,
)

try:
    regions = list_census_regions("CA21", quiet=True)
    metro_van = regions[(regions["level"] == "CSD") &
                        (regions["CMA_UID"] == "59933")]

    named = add_unique_names_to_region_list(metro_van)
    print("De-duplicated names for duplicated municipalities:")
    display(named.loc[named["name"].duplicated(keep=False),
                      ["region", "name", "Name"]])

    # Convert the selection into a get_census() regions argument
    region_arg = as_census_region_list(metro_van.head(5))
    print(f"\nRegions argument for get_census(): {region_arg}")

except Exception as e:
    print(f"Error selecting regions: {e}")

De-duplicated names for duplicated municipalities:

	region	name	Name
400	5915001	Langley	Langley (DM)
425	5915046	North Vancouver	North Vancouver (DM)
452	5915051	North Vancouver	North Vancouver (CY)
517	5915002	Langley	Langley (CY)

Regions argument for get_census(): {'CSD': ['5915022', '5915004', '5915025', '5915015', '5915034']}
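
Both helpers amount to simple bookkeeping, sketched below in plain Python on the four records shown above. The "municipal_status" key is an assumed column name used for illustration, and the logic is a rough approximation rather than the library's code: duplicated names get their status appended, and region IDs are grouped by level.

```python
from collections import Counter, defaultdict

# Region records copied from the output above; "municipal_status" is an
# assumed column name used here for illustration
regions = [
    {"region": "5915001", "name": "Langley", "level": "CSD", "municipal_status": "DM"},
    {"region": "5915002", "name": "Langley", "level": "CSD", "municipal_status": "CY"},
    {"region": "5915046", "name": "North Vancouver", "level": "CSD", "municipal_status": "DM"},
    {"region": "5915051", "name": "North Vancouver", "level": "CSD", "municipal_status": "CY"},
]

# Append the municipal status only where a name occurs more than once
name_counts = Counter(r["name"] for r in regions)
unique_names = {
    r["region"]: (f"{r['name']} ({r['municipal_status']})"
                  if name_counts[r["name"]] > 1 else r["name"])
    for r in regions
}

# Group region IDs by level, the shape shown for the get_census() argument
region_arg = defaultdict(list)
for r in regions:
    region_arg[r["level"]].append(r["region"])

print(unique_names["5915001"])
print(dict(region_arg))
```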

5. Real Data Retrieval

Finally, let’s get actual census data using our hierarchy vectors:

try:
    # Get real data for Toronto CMA using our income hierarchy vectors
    toronto_data = get_census(
        dataset='CA21',
        regions={'CMA': '35535'},  # Toronto CMA
        vectors=['v_CA21_923', 'v_CA21_939', 'v_CA21_942', 'v_CA21_943'],  # Income categories
        level='CMA',
        labels='short',
        use_cache=False
    )
    
    print("Toronto CMA Income Demographics:")
    print("\nHousehold Income Distribution:")
    total_households = toronto_data['v_CA21_923'].iloc[0]
    high_income = toronto_data['v_CA21_939'].iloc[0]  # $100,000+
    very_high_1 = toronto_data['v_CA21_942'].iloc[0]  # $150,000-$199,999
    very_high_2 = toronto_data['v_CA21_943'].iloc[0]  # $200,000+
    
    print(f"• Total households: {total_households:,}")
    print(f"• $100,000+ income: {high_income:,} ({high_income/total_households*100:.1f}%)")
    print(f"  - $150,000-$199,999: {very_high_1:,} ({very_high_1/total_households*100:.1f}%)")
    print(f"  - $200,000+: {very_high_2:,} ({very_high_2/total_households*100:.1f}%)")
    
except Exception as e:
    print(f"Error retrieving data: {e}")
    print("This requires a valid API key and internet connection")

📋 Request Preview:
   Dataset: CA21
   Level: CMA
   Regions: 1 region(s)
   Variables: 4 vector(s)
🔍 Estimated Size: small (5 rows)
⏱️  Expected Time: < 5 seconds
🔄 Querying CensusMapper API for 1 region(s)...
📊 Retrieving 4 variable(s) at CMA level...

✅ Successfully retrieved data for 1 regions
📈 Data includes 4 vector columns
Toronto CMA Income Demographics:

Household Income Distribution:
• Total households: 2,262,475
• $100,000+ income: 1,096,595 (48.5%)
  - $150,000-$199,999: 278,755 (12.3%)
  - $200,000+: 343,145 (15.2%)
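
The hierarchy also lets you derive figures that were not requested directly. As a worked example using only the counts printed above, the sketch below estimates the $100,000 to $149,999 range by subtracting the two upper sub-brackets from their parent. It assumes the sub-brackets add up to the parent; published census counts are randomly rounded, so the result may differ slightly from the values of v_CA21_940 and v_CA21_941 themselves.

```python
# Counts copied from the Toronto CMA output above
total_households = 2_262_475
over_100k = 1_096_595        # v_CA21_939
from_150k_to_200k = 278_755  # v_CA21_942
over_200k = 343_145          # v_CA21_943

# Assumption: the sub-brackets add up to their parent, so the remainder is
# the $100,000 to $149,999 range (v_CA21_940 plus v_CA21_941)
from_100k_to_150k = over_100k - from_150k_to_200k - over_200k
share = from_100k_to_150k / total_households * 100
print(f"$100,000-$149,999: {from_100k_to_150k:,} ({share:.1f}%)")
```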

Summary

This tutorial covered the following pycancensus capabilities:

list_census_vectors() - Browse 7,709+ available variables with explicit parent-child relationships
Hierarchy Navigation - Navigate through income hierarchies from main categories to detailed brackets
parent_census_vectors() & child_census_vectors() - Navigate up and down the hierarchy
find_census_vectors() & visualize_vector_hierarchy() - Exact, semantic, and keyword search; ASCII hierarchy trees
as_census_region_list() & add_unique_names_to_region_list() - Region selection helpers
Real Data - Actual census data retrieved and analyzed

Key Improvement: The hierarchy functions work from explicit parent-child links in the vector list; in CA21, 7,448 of the 7,709 vectors carry a parent_vector.

Next Steps:

Explore other hierarchies (income, education, housing)
Try different geographic levels (province, census division, etc.)
Use geo_format='geopandas' for spatial analysis
Check out the gallery examples for more advanced use cases

Getting Help

Documentation: Explore the API reference and other tutorials
Examples: Browse the example gallery for specific use cases
Issues: Report problems on GitHub
API Key: Get your free key at CensusMapper