Getting Started with pycancensus
This tutorial walks through the core pycancensus workflow: browsing census vectors and their parent-child hierarchies, searching for vectors, selecting regions, and retrieving real census data.
Key Features Demonstrated:
list_census_vectors() - Browse all available data variables
Vector Hierarchies - Navigate parent-child relationships, with ASCII tree visualization
find_census_vectors() - Exact, semantic, and keyword search
Region Selection - Filter region lists and de-duplicate ambiguous names
Real Data Retrieval - Get actual census data
Note
You’ll need a free API key from CensusMapper to run these examples with real data.
Setup and Installation
First, let’s import pycancensus and set up our environment:
import pycancensus
from pycancensus import (
    list_census_datasets,
    list_census_vectors,
    get_census,
    parent_census_vectors,
    child_census_vectors,
    find_census_vectors,
)
import pandas as pd
from IPython.display import display  # built into Jupyter; needed when run as a plain script

# Set your API key (replace with your actual key)
# pycancensus.set_api_key("your_api_key_here")

print("pycancensus imported successfully!")
pycancensus imported successfully!
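Hard-coding a key in a notebook makes it easy to leak by accident. The sketch below shows one way to read it from an environment variable instead. The variable name `CENSUSMAPPER_API_KEY` and the helper `resolve_api_key` are assumptions made up for this example, not part of pycancensus; only `pycancensus.set_api_key()` comes from the setup cell above.

```python
import os

# Hypothetical variable name chosen for this sketch; pick whatever suits your project.
ENV_VAR = "CENSUSMAPPER_API_KEY"


def resolve_api_key(env=None):
    """Return the stripped API key from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    return key or None


api_key = resolve_api_key()
if api_key is None:
    print(f"{ENV_VAR} is not set; API calls will fail until a key is configured.")
else:
    try:
        import pycancensus

        pycancensus.set_api_key(api_key)  # same call as in the setup cell above
    except ImportError:
        print("pycancensus is not installed in this environment.")
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.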
1. Exploring Census Vectors
The list_census_vectors() function shows all available data variables:
# List all vectors for the 2021 Census
try:
    vectors_ca21 = list_census_vectors('CA21')
    print(f"CA21 Census has {len(vectors_ca21):,} vectors available")
    print(f"Columns: {list(vectors_ca21.columns)}")

    # Show how many vectors have parent relationships
    with_parents = vectors_ca21[vectors_ca21['parent_vector'].notna()]
    print(f"Vectors with parent relationships: {len(with_parents):,} out of {len(vectors_ca21):,}")

    print("\nSample hierarchy examples:")
    display(with_parents[['vector', 'parent_vector', 'label']].head())
except Exception as e:
    print(f"Error: {e}")
    print("Make sure you have set your API key!")
Reading vectors from cache...
CA21 Census has 7,709 vectors available
Columns: ['vector', 'label', 'type', 'units', 'aggregation', 'parent_vector', 'details']
Vectors with parent relationships: 7,448 out of 7,709
Sample hierarchy examples:
| | vector | parent_vector | label |
|---|---|---|---|
| 4 | v_CA21_5 | v_CA21_4 | Private dwellings occupied by usual residents |
| 10 | v_CA21_11 | v_CA21_8 | 0 to 14 years |
| 11 | v_CA21_12 | v_CA21_9 | 0 to 14 years |
| 12 | v_CA21_13 | v_CA21_10 | 0 to 14 years |
| 13 | v_CA21_14 | v_CA21_11 | 0 to 4 years |
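The `parent_vector` column is enough to walk the hierarchy in either direction. The sketch below does that with plain Python on a few toy rows that mirror the `vector`, `parent_vector`, and `label` columns from the output above. It is an illustration of the idea, not how `parent_census_vectors()` or `child_census_vectors()` are implemented. The root row and its label are assumptions added so the example has a top-level vector.

```python
# Toy rows mirroring the 'vector', 'parent_vector', and 'label' columns shown above.
# The root row (v_CA21_8) and its label are illustrative assumptions, not API output.
rows = [
    {"vector": "v_CA21_8", "parent_vector": None, "label": "Total - Age (assumed label)"},
    {"vector": "v_CA21_11", "parent_vector": "v_CA21_8", "label": "0 to 14 years"},
    {"vector": "v_CA21_14", "parent_vector": "v_CA21_11", "label": "0 to 4 years"},
]

parent_of = {row["vector"]: row["parent_vector"] for row in rows}


def ancestors(vector):
    """Follow parent links upward, nearest parent first."""
    chain = []
    seen = {vector}  # guards against accidental cycles in the data
    parent = parent_of.get(vector)
    while parent is not None and parent not in seen:
        chain.append(parent)
        seen.add(parent)
        parent = parent_of.get(parent)
    return chain


def descendants(vector):
    """Collect every vector below `vector`, breadth-first."""
    found = []
    seen = {vector}
    frontier = [vector]
    while frontier:
        current = frontier.pop(0)
        for child, parent in parent_of.items():
            if parent == current and child not in seen:
                seen.add(child)
                found.append(child)
                frontier.append(child)
    return found


print(ancestors("v_CA21_14"))
print(descendants("v_CA21_8"))
```

With a real vector DataFrame, missing parents would typically appear as NaN rather than `None`, so they would need converting before building `parent_of`.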
3. Enhanced Vector Search
The find_census_vectors() function supports exact, semantic, and keyword search
(matching the R cancensus package):
try:
    # Search for income-related vectors
    income_vectors = find_census_vectors('income', 'CA21', query_type='keyword')
    print(f"Found {len(income_vectors)} income-related vectors")

    print("\nTop income vectors:")
    display(income_vectors[['vector', 'label']].head(3))
except Exception as e:
    print(f"Error searching vectors: {e}")
Found 649 income-related vectors
Top income vectors:
| | vector | label |
|---|---|---|
| 511 | v_CA21_554 | Income statistics in 2020 for the population a... |
| 512 | v_CA21_555 | Income statistics in 2020 for the population a... |
| 513 | v_CA21_556 | Income statistics in 2020 for the population a... |
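To make the difference between exact and keyword matching concrete, here is a minimal sketch over a handful of made-up labels. It assumes that "exact" means the whole label equals the query and "keyword" means every query word appears somewhere in the label. That is one plausible reading, not a description of how `find_census_vectors()` works internally. The vector IDs and labels are hypothetical.

```python
import re

# Hypothetical vector IDs and labels, invented for this sketch.
labels = {
    "v_demo_1": "Median total income in 2020",
    "v_demo_2": "Total income",
    "v_demo_3": "Average household size",
}


def tokens(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query, labels):
    """Keep vectors whose whole label equals the query (case-insensitive)."""
    return [v for v, label in labels.items() if label.lower() == query.lower()]


def keyword_match(query, labels):
    """Keep vectors whose label contains every word of the query."""
    wanted = tokens(query)
    return [v for v, label in labels.items()
            if all(word in tokens(label) for word in wanted)]


print(exact_match("total income", labels))
print(keyword_match("total income", labels))
```

Keyword matching casts a wider net than exact matching, which is consistent with the large result count above.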
Semantic search tolerates misspellings and loose phrasing, which is useful when you don’t know the exact census terminology:
try:
    # "comute" is misspelled on purpose; semantic search still finds it
    commute_vectors = find_census_vectors('comute duration', 'CA16',
                                          query_type='semantic', quiet=True)
    print(f"Found {len(commute_vectors)} vectors despite the typo")
    display(commute_vectors[['vector', 'label']].head(3))
except Exception as e:
    print(f"Error in semantic search: {e}")
Found 552 vectors despite the typo
| | vector | label |
|---|---|---|
| 4928 | v_CA16_5051 | Total - Highest certificate, diploma or degree... |
| 4929 | v_CA16_5052 | Total - Highest certificate, diploma or degree... |
| 4930 | v_CA16_5053 | Total - Highest certificate, diploma or degree... |
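Typo tolerance can be approximated with fuzzy string matching. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. It is not the algorithm pycancensus uses, and the labels and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical labels, invented for this sketch.
labels = [
    "Commute duration",
    "Median total income",
    "Average household size",
]


def is_close(a, b, cutoff):
    """True when two words are similar enough by SequenceMatcher ratio."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def fuzzy_match_labels(query, labels, cutoff=0.8):
    """Keep labels where every query word has a near match among the label's words."""
    query_words = query.lower().split()
    hits = []
    for label in labels:
        label_words = label.lower().split()
        if all(any(is_close(q, w, cutoff) for w in label_words) for q in query_words):
            hits.append(label)
    return hits


# "comute" is misspelled on purpose, as in the example above.
print(fuzzy_match_labels("comute duration", labels))
```

Raising the cutoff makes matching stricter; lowering it admits more typos but also more false positives.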
You can also visualize where a vector sits in its hierarchy as an ASCII tree:
from pycancensus import visualize_vector_hierarchy

try:
    # Low-income status hierarchy from the 2016 census
    visualize_vector_hierarchy("v_CA16_2510", quiet=True)
except Exception as e:
    print(f"Error visualizing hierarchy: {e}")
v_CA16_2510: Total - Low-income status in 2015 for the population in private households to whom low-income concepts are applicable - 100% data
├── v_CA16_2513: 0 to 17 years
│   └── v_CA16_2516: 0 to 5 years (leaf)
├── v_CA16_2519: 18 to 64 years (leaf)
└── v_CA16_2522: 65 years and over (leaf)
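A tree like the one above can be produced from a simple parent-to-children mapping with a short recursive walk. The sketch below reproduces the layout using the vector IDs and child labels from the output. The function `render_tree` is hypothetical and the root label is shortened, so this illustrates the technique rather than the internals of `visualize_vector_hierarchy()`.

```python
# Parent -> children mapping and labels taken from the tree output above.
children = {
    "v_CA16_2510": ["v_CA16_2513", "v_CA16_2519", "v_CA16_2522"],
    "v_CA16_2513": ["v_CA16_2516"],
}
labels = {
    "v_CA16_2510": "Total - Low-income status in 2015",  # shortened for the sketch
    "v_CA16_2513": "0 to 17 years",
    "v_CA16_2516": "0 to 5 years",
    "v_CA16_2519": "18 to 64 years",
    "v_CA16_2522": "65 years and over",
}


def render_tree(root, children, labels):
    """Return the hierarchy under `root` as a list of ASCII-tree lines."""
    lines = [f"{root}: {labels[root]}"]

    def walk(node, prefix):
        kids = children.get(node, [])
        for index, kid in enumerate(kids):
            is_last = index == len(kids) - 1
            branch = "└── " if is_last else "├── "
            leaf_tag = "" if children.get(kid) else " (leaf)"
            lines.append(f"{prefix}{branch}{kid}: {labels[kid]}{leaf_tag}")
            # Descendants of a non-last child keep a vertical guide line.
            walk(kid, prefix + ("    " if is_last else "│   "))

    walk(root, "")
    return lines


tree_lines = render_tree("v_CA16_2510", children, labels)
print("\n".join(tree_lines))
```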
4. Selecting Regions
Region lists can be filtered like any DataFrame and passed straight to
get_census() with as_census_region_list(). When municipality names are
ambiguous (Metro Vancouver has two Langleys and two North Vancouvers),
add_unique_names_to_region_list() de-duplicates them:
from pycancensus import (
    list_census_regions,
    as_census_region_list,
    add_unique_names_to_region_list,
)

try:
    regions = list_census_regions("CA21", quiet=True)
    metro_van = regions[(regions["level"] == "CSD") &
                        (regions["CMA_UID"] == "59933")]

    named = add_unique_names_to_region_list(metro_van)
    print("De-duplicated names for duplicated municipalities:")
    display(named.loc[named["name"].duplicated(keep=False),
                      ["region", "name", "Name"]])

    # Convert the selection into a get_census() regions argument
    region_arg = as_census_region_list(metro_van.head(5))
    print(f"\nRegions argument for get_census(): {region_arg}")
except Exception as e:
    print(f"Error selecting regions: {e}")
De-duplicated names for duplicated municipalities:
| | region | name | Name |
|---|---|---|---|
| 400 | 5915001 | Langley | Langley (DM) |
| 425 | 5915046 | North Vancouver | North Vancouver (DM) |
| 452 | 5915051 | North Vancouver | North Vancouver (CY) |
| 517 | 5915002 | Langley | Langley (CY) |
Regions argument for get_census(): {'CSD': ['5915022', '5915004', '5915025', '5915015', '5915034']}
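The two helpers above do simple bookkeeping that can be sketched in plain Python. The example below works on records shaped like the region rows in the output. It assumes the disambiguating suffix comes from a column named `municipal_status` and that only duplicated names receive it. Both are assumptions made for this sketch, the function names are hypothetical, and the last row is invented to show a name that needs no suffix.

```python
from collections import Counter, defaultdict

# First four rows copied from the output above; 'municipal_status' is an assumed column name.
regions = [
    {"region": "5915001", "name": "Langley", "level": "CSD", "municipal_status": "DM"},
    {"region": "5915046", "name": "North Vancouver", "level": "CSD", "municipal_status": "DM"},
    {"region": "5915051", "name": "North Vancouver", "level": "CSD", "municipal_status": "CY"},
    {"region": "5915002", "name": "Langley", "level": "CSD", "municipal_status": "CY"},
    # Invented row with a unique name, to show the untouched case.
    {"region": "0000000", "name": "Unique Town", "level": "CSD", "municipal_status": "CY"},
]


def add_unique_names(rows):
    """Add a 'Name' field, appending the status only where 'name' is duplicated."""
    counts = Counter(row["name"] for row in rows)
    result = []
    for row in rows:
        if counts[row["name"]] > 1:
            unique = f'{row["name"]} ({row["municipal_status"]})'
        else:
            unique = row["name"]
        result.append({**row, "Name": unique})
    return result


def to_region_dict(rows):
    """Group region IDs by level, matching the dict shape printed above."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["level"]].append(row["region"])
    return dict(grouped)


named = add_unique_names(regions)
print([row["Name"] for row in named])
print(to_region_dict(regions[:2]))
```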
5. Real Data Retrieval
Finally, let’s get actual census data using our hierarchy vectors:
try:
    # Get real data for Toronto CMA using our income hierarchy vectors
    toronto_data = get_census(
        dataset='CA21',
        regions={'CMA': '35535'},  # Toronto CMA
        vectors=['v_CA21_923', 'v_CA21_939', 'v_CA21_942', 'v_CA21_943'],  # Income categories
        level='CMA',
        labels='short',
        use_cache=False
    )

    print("Toronto CMA Income Demographics:")
    print("\nHousehold Income Distribution:")

    # One row is returned for the single CMA, so take the first value of each column
    total_households = toronto_data['v_CA21_923'].iloc[0]
    high_income = toronto_data['v_CA21_939'].iloc[0]  # $100,000+
    very_high_1 = toronto_data['v_CA21_942'].iloc[0]  # $150,000-$199,999
    very_high_2 = toronto_data['v_CA21_943'].iloc[0]  # $200,000+

    print(f"• Total households: {total_households:,}")
    print(f"• $100,000+ income: {high_income:,} ({high_income/total_households*100:.1f}%)")
    print(f" - $150,000-$199,999: {very_high_1:,} ({very_high_1/total_households*100:.1f}%)")
    print(f" - $200,000+: {very_high_2:,} ({very_high_2/total_households*100:.1f}%)")
except Exception as e:
    print(f"Error retrieving data: {e}")
    print("This requires a valid API key and internet connection")
📋 Request Preview:
Dataset: CA21
Level: CMA
Regions: 1 region(s)
Variables: 4 vector(s)
🔍 Estimated Size: small (5 rows)
⏱️ Expected Time: < 5 seconds
🔄 Querying CensusMapper API for 1 region(s)...
📊 Retrieving 4 variable(s) at CMA level...
✅ Successfully retrieved data for 1 regions
📈 Data includes 4 vector columns
Toronto CMA Income Demographics:
Household Income Distribution:
• Total households: 2,262,475
• $100,000+ income: 1,096,595 (48.5%)
- $150,000-$199,999: 278,755 (12.3%)
- $200,000+: 343,145 (15.2%)
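The percentages above are each bracket's count divided by the total number of households. The sketch below repeats that arithmetic with the figures from the output and guards against a zero or missing total. It also derives the part of the $100,000+ group not covered by the two sub-brackets, assuming those sub-brackets nest under $100,000+ as the code comments indicate. The helper `pct` is a name made up for this example.

```python
# Counts copied from the Toronto CMA output above.
total_households = 2_262_475
income_100k_plus = 1_096_595
income_150k_to_200k = 278_755
income_200k_plus = 343_145


def pct(part, total):
    """Percentage share rounded to one decimal, or None when the total is zero or missing."""
    if not total:
        return None
    return round(part / total * 100, 1)


# Households at $100,000+ that fall outside the two listed sub-brackets
# (assumes the sub-brackets are children of the $100,000+ vector).
remainder = income_100k_plus - income_150k_to_200k - income_200k_plus

print(pct(income_100k_plus, total_households))
print(pct(income_150k_to_200k, total_households))
print(pct(income_200k_plus, total_households))
print(remainder)
```

The guard matters in practice because small or suppressed geographies can return zero or missing totals.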
Summary
This tutorial covered the following pycancensus capabilities:
list_census_vectors() - Browse 7,709+ available variables with explicit parent-child relationships
Hierarchy Navigation - Navigate through income hierarchies from main categories to detailed brackets
parent_census_vectors() & child_census_vectors() - Navigate up and down the hierarchy
find_census_vectors() & visualize_vector_hierarchy() - Exact, semantic, and keyword search; ASCII hierarchy trees
as_census_region_list() & add_unique_names_to_region_list() - Region selection helpers
Real Data - Actual census data retrieved and analyzed
Key Improvement: Unlike previous versions, these hierarchy functions now work with clear, well-defined parent-child relationships in the census data structure.
Next Steps:
Explore other hierarchies (income, education, housing)
Try different geographic levels (province, census division, etc.)
Use geo_format='geopandas' for spatial analysis
Check out the gallery examples for more advanced use cases
Getting Help
Documentation: Explore the API reference and other tutorials
Examples: Browse the example gallery for specific use cases
Issues: Report problems on GitHub
API Key: Get your free key at CensusMapper