Migration Guide: R to Python
=============================

A complete guide for R ``cancensus`` users migrating to the Python package ``pycancensus``.

Quick Start
-----------

Installation
~~~~~~~~~~~~

**R:**

.. code-block:: r

   install.packages("cancensus")
   library(cancensus)

**Python:**

.. code-block:: bash

   pip install pycancensus

.. code-block:: python

   import pycancensus as pc

API Key Setup
~~~~~~~~~~~~~

**R:**

.. code-block:: r

   set_cancensus_api_key("YOUR_API_KEY", install = TRUE)

**Python:**

.. code-block:: python

   pc.set_api_key("YOUR_API_KEY", install=True)
   # Or: export CANCENSUS_API_KEY="YOUR_API_KEY"
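
If you rely on the environment variable rather than ``set_api_key()``, it can help to fail early when the variable is missing. The following is a minimal sketch using only the standard library; ``resolve_api_key`` is a hypothetical helper and not part of ``pycancensus``, and the variable name ``CANCENSUS_API_KEY`` is taken from the comment above.

.. code-block:: python

   import os

   def resolve_api_key(explicit_key=None):
       """Return explicit_key if given, else the CANCENSUS_API_KEY environment variable.

       Hypothetical helper for illustration; not a pycancensus function.
       """
       key = explicit_key or os.environ.get("CANCENSUS_API_KEY")
       if not key:
           raise RuntimeError(
               "No API key found: pass one explicitly or set CANCENSUS_API_KEY."
           )
       return key

   # An explicit key takes precedence over the environment.
   print(resolve_api_key("YOUR_API_KEY"))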

Function Equivalence
--------------------

All Core Functions Available
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 40 40 20

   * - R Function
     - Python Function
     - Equivalence
   * - ``get_census()``
     - ``get_census()``
     - ✅ 100%
   * - ``list_census_datasets()``
     - ``list_census_datasets()``
     - ✅ 100%
   * - ``list_census_vectors()``
     - ``list_census_vectors()``
     - ✅ 100%
   * - ``search_census_vectors()``
     - ``search_census_vectors()``
     - ✅ 100%
   * - ``find_census_vectors()``
     - ``find_census_vectors()``
     - ✅ 100% (argument order differs; see Key Differences to Remember)
   * - ``parent_census_vectors()``
     - ``parent_census_vectors()``
     - ✅ 100%
   * - ``child_census_vectors()``
     - ``child_census_vectors()``
     - ✅ 100%
   * - ``dataset_attribution()``
     - ``dataset_attribution()``
     - ✅ 100%
   * - ``label_vectors()``
     - ``label_vectors()``
     - ✅ 100%
   * - ``list_cancensus_cache()``
     - ``list_cache()``
     - ✅ 100%
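
When porting a script, only the renamed functions need attention. The lookup below is a small sketch of that idea: the two renames are the ones that appear in this guide (the table above and the caching example in Pattern 3), and ``python_name`` is a hypothetical helper introduced here for illustration.

.. code-block:: python

   # R name -> Python name, limited to the renames shown in this guide.
   RENAMED = {
       "list_cancensus_cache": "list_cache",
       "remove_from_cancensus_cache": "clear_cache",
   }

   def python_name(r_name):
       """Return the pycancensus name for an R cancensus function name."""
       # Names not listed in RENAMED are assumed to be unchanged.
       return RENAMED.get(r_name, r_name)

   for r_name in ("get_census", "list_cancensus_cache"):
       print(f"{r_name}() -> pc.{python_name(r_name)}()")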

Syntax Conversion
-----------------

Core Syntax Differences
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - R Syntax
     - Python Syntax
   * - ``list(CMA = "59933")``
     - ``{'CMA': '59933'}``
   * - ``c("v1", "v2", "v3")``
     - ``['v1', 'v2', 'v3']``
   * - ``TRUE`` / ``FALSE``
     - ``True`` / ``False``
   * - ``NULL``
     - ``None``

Side-by-Side Examples
---------------------

Example 1: Basic Data Retrieval
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   library(cancensus)

   census_data <- get_census(
     dataset = 'CA21',
     regions = list(CMA = "59933"),
     vectors = c("v_CA21_906"),
     level = 'CSD'
   )

**Python:**

.. code-block:: python

   import pycancensus as pc

   census_data = pc.get_census(
       dataset='CA21',
       regions={'CMA': '59933'},
       vectors=['v_CA21_906'],
       level='CSD'
   )

Example 2: With Geography
~~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   census_data <- get_census(
     dataset = 'CA21',
     regions = list(CMA = "35535"),
     vectors = c("v_CA21_906"),
     level = 'CSD',
     geo_format = 'sf'
   )

**Python:**

.. code-block:: python

   census_data = pc.get_census(
       dataset='CA21',
       regions={'CMA': '35535'},
       vectors=['v_CA21_906'],
       level='CSD',
       geo_format='sf'
   )

Example 3: Search Vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   income_vectors <- search_census_vectors("income", "CA21")

**Python:**

.. code-block:: python

   income_vectors = pc.search_census_vectors("income", "CA21")

Example 4: List Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   datasets <- list_census_datasets()

**Python:**

.. code-block:: python

   datasets = pc.list_census_datasets()

Return Type Conversions
------------------------

Data Structures
~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 40 40 20

   * - R Type
     - Python Type
     - Notes
   * - ``data.frame`` / ``tibble``
     - ``pandas.DataFrame``
     - Direct equivalent
   * - ``sf`` object
     - ``geopandas.GeoDataFrame``
     - Same spatial data
   * - ``list`` (unnamed)
     - ``list``
     - Named lists map to ``dict``
   * - ``character``
     - ``str``
     - Direct equivalent

Working with Results
~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   library(dplyr)  # provides %>%, filter(), and select()

   # Filter data
   filtered <- census_data %>%
     filter(Population > 50000)

   # Select columns
   selected <- census_data %>%
     select(GeoUID, Population)

**Python:**

.. code-block:: python

   # Filter data
   filtered = census_data[census_data['Population'] > 50000]

   # Select columns
   selected = census_data[['GeoUID', 'Population']]
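
A dplyr chain that filters and then selects can be written as a single ``.loc`` call in pandas. The sketch below uses a small hand-made DataFrame with illustrative values (not real census figures) so it runs without an API key.

.. code-block:: python

   import pandas as pd

   # Illustrative values only, not real census figures.
   census_data = pd.DataFrame({
       "GeoUID": ["A", "B", "C"],
       "Region Name": ["Alpha", "Beta", "Gamma"],
       "Population": [120000, 30000, 75000],
   })

   # dplyr: census_data %>% filter(Population > 50000) %>% select(GeoUID, Population)
   result = census_data.loc[
       census_data["Population"] > 50000,   # row condition
       ["GeoUID", "Population"],            # columns to keep
   ]
   print(result)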

Visualization Migration
-----------------------

Mapping
~~~~~~~

**R (using ggplot2 + sf):**

.. code-block:: r

   library(ggplot2)
   library(sf)

   ggplot(census_data) +
     geom_sf(aes(fill = v_CA21_906)) +
     scale_fill_viridis_c() +
     theme_minimal()

**Python (using matplotlib + geopandas):**

.. code-block:: python

   import matplotlib.pyplot as plt

   census_data.plot(
       column='v_CA21_906',
       cmap='viridis',
       legend=True
   )
   plt.show()

**Python (using plotly for interactive):**

.. code-block:: python

   import plotly.express as px

   fig = px.choropleth_mapbox(
       census_data,
       geojson=census_data.geometry,
       locations=census_data.index,
       color='v_CA21_906',
       mapbox_style='carto-positron'
   )
   fig.show()

Charts
~~~~~~

**R (ggplot2):**

.. code-block:: r

   ggplot(census_data, aes(x = `Region Name`, y = Population)) +
     geom_bar(stat = "identity") +
     theme(axis.text.x = element_text(angle = 45))

**Python (matplotlib):**

.. code-block:: python

   import matplotlib.pyplot as plt

   census_data.plot.bar(x='Region Name', y='Population')
   plt.xticks(rotation=45)
   plt.tight_layout()
   plt.show()

**Python (plotly):**

.. code-block:: python

   import plotly.express as px

   fig = px.bar(census_data, x='Region Name', y='Population')
   fig.show()

Common Migration Patterns
--------------------------

Pattern 1: Data Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   library(dplyr)
   library(cancensus)

   result <- get_census(
     dataset = 'CA21',
     regions = list(CMA = "35535"),
     vectors = c("v_CA21_906"),
     level = 'CSD'
   ) %>%
     filter(Population > 50000) %>%
     arrange(desc(v_CA21_906))

**Python:**

.. code-block:: python

   import pycancensus as pc

   result = (pc.get_census(
       dataset='CA21',
       regions={'CMA': '35535'},
       vectors=['v_CA21_906'],
       level='CSD'
   )
   .query('Population > 50000')
   .sort_values('v_CA21_906', ascending=False)
   )
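
One detail to watch when using ``.query()``: column names that contain spaces, such as ``Region Name``, must be wrapped in backticks inside the query string. The sketch below shows this on a hand-made DataFrame with illustrative values.

.. code-block:: python

   import pandas as pd

   # Illustrative values only, not real census figures.
   df = pd.DataFrame({
       "Region Name": ["Alpha", "Beta", "Gamma", "Delta"],
       "Population": [120000, 30000, 75000, 90000],
       "v_CA21_906": [41000, 52000, 47000, 39000],
   })

   result = (
       df
       # Backticks let the query refer to a column name containing a space.
       .query("Population > 50000 and `Region Name` != 'Delta'")
       .sort_values("v_CA21_906", ascending=False)
       .reset_index(drop=True)
   )
   print(result)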

Pattern 2: Multiple Regions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   regions <- list(
     CMA = c("59933", "35535", "24462")
   )

   data <- get_census(
     dataset = 'CA21',
     regions = regions,
     vectors = c("v_CA21_906"),
     level = 'CSD'
   )

**Python:**

.. code-block:: python

   regions = {
       'CMA': ['59933', '35535', '24462']
   }

   data = pc.get_census(
       dataset='CA21',
       regions=regions,
       vectors=['v_CA21_906'],
       level='CSD'
   )
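
If region IDs arrive as a flat list of pairs (for example from a spreadsheet), they can be grouped into the dictionary shape shown above. This is a sketch using only the standard library; ``build_regions`` is a hypothetical helper, not a ``pycancensus`` function.

.. code-block:: python

   from collections import defaultdict

   def build_regions(pairs):
       """Group (level, region_id) pairs into a regions dict.

       Hypothetical helper for illustration. A level with one ID maps to a
       string and a level with several IDs maps to a list, matching the two
       shapes used in this guide.
       """
       grouped = defaultdict(list)
       for level, region_id in pairs:
           grouped[level].append(str(region_id))
       return {
           level: ids[0] if len(ids) == 1 else ids
           for level, ids in grouped.items()
       }

   regions = build_regions([("CMA", "59933"), ("CMA", "35535"), ("CMA", "24462")])
   print(regions)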

Pattern 3: Caching Control
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**R:**

.. code-block:: r

   # Disable cache for this query
   data <- get_census(
     dataset = 'CA21',
     regions = list(CMA = "59933"),
     vectors = c("v_CA21_906"),
     level = 'CSD',
     use_cache = FALSE
   )

   # Clear all cache (the function takes the cached paths to remove)
   remove_from_cancensus_cache(list_cancensus_cache()$path)

**Python:**

.. code-block:: python

   # Disable cache for this query
   data = pc.get_census(
       dataset='CA21',
       regions={'CMA': '59933'},
       vectors=['v_CA21_906'],
       level='CSD',
       use_cache=False
   )

   # Clear all cache
   pc.clear_cache()

Key Differences to Remember
----------------------------

1. **Dictionary vs Named List**

   R uses named lists: ``list(CMA = "59933")``

   Python uses dictionaries: ``{'CMA': '59933'}``

2. **Vector vs List**

   R uses ``c()``: ``c("v1", "v2")``

   Python uses ``[]``: ``['v1', 'v2']``

3. **Boolean Capitalization**

   R: ``TRUE``, ``FALSE``

   Python: ``True``, ``False``

4. **NULL vs None**

   R: ``NULL``

   Python: ``None``

5. **Function Parameter Order**

   ``find_census_vectors()`` has different parameter order:

   - R: ``find_census_vectors(query, dataset, ...)``
   - Python: ``find_census_vectors(dataset, query, ...)``
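
Passing arguments by keyword sidesteps the ordering difference. The sketch below uses two stand-in functions that only mimic the two orders listed above; they are not the real library functions, and it assumes the real parameters are named ``query`` and ``dataset`` as in those signatures.

.. code-block:: python

   def find_vectors_r_order(query, dataset):
       """Stand-in with the R-style order (query, dataset)."""
       return {"dataset": dataset, "query": query}

   def find_vectors_py_order(dataset, query):
       """Stand-in with the Python-style order (dataset, query)."""
       return {"dataset": dataset, "query": query}

   # A positional call ported unchanged from R silently swaps the arguments.
   swapped = find_vectors_py_order("income", "CA21")
   print(swapped)  # dataset is 'income' and query is 'CA21': not what was meant

   # Keyword calls mean the same thing under either order.
   a = find_vectors_r_order(query="income", dataset="CA21")
   b = find_vectors_py_order(query="income", dataset="CA21")
   print(a == b)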

Performance Comparison
----------------------

Based on validation testing, Python pycancensus is typically **2.7x faster** than R cancensus
for equivalent operations, primarily due to:

- More efficient HTTP connection pooling
- Optimized pandas data operations
- Better caching implementation

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Issue 1: Empty vector list causes API error**

.. code-block:: python

   # ❌ This fails
   data = pc.get_census(dataset='CA21', regions={'CSD': '123'}, vectors=[])

   # ✅ Use None instead
   data = pc.get_census(dataset='CA21', regions={'CSD': '123'}, vectors=None)
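
When the vector list is built dynamically, it may end up empty. A small guard can convert that case to ``None`` before the call. This is a sketch; ``normalize_vectors`` is a hypothetical helper, not part of ``pycancensus``.

.. code-block:: python

   def normalize_vectors(vectors):
       """Return None for empty or missing input, otherwise a list of vector IDs.

       Hypothetical helper for illustration. A lone string is wrapped in a list.
       """
       if vectors is None:
           return None
       if isinstance(vectors, str):
           return [vectors]
       vectors = list(vectors)
       return vectors or None  # an empty list becomes None

   print(normalize_vectors([]))
   print(normalize_vectors(["v_CA21_906"]))

The result can then be passed as ``vectors=normalize_vectors(my_vectors)``.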

**Issue 2: Function not found**

Make sure you've imported pycancensus:

.. code-block:: python

   import pycancensus as pc
   # Then use: pc.get_census(...)

**Issue 3: API key not set**

.. code-block:: python

   # Check if key is set
   pc.show_api_key()

   # Set key
   pc.set_api_key("YOUR_KEY")

Getting Help
~~~~~~~~~~~~

- **Documentation:** https://pycancensus.readthedocs.io/
- **Validation Results:** See :doc:`validation`
- **GitHub Issues:** https://github.com/dshkol/pycancensus/issues
- **R cancensus docs:** https://mountainmath.github.io/cancensus/

Complete Example
----------------

Here's a complete analysis migrated from R to Python:

**R Version:**

.. code-block:: r

   library(cancensus)
   library(dplyr)
   library(ggplot2)
   library(sf)

   # Get data
   toronto <- get_census(
     dataset = 'CA21',
     regions = list(CMA = "35535"),
     vectors = c("v_CA21_906"),
     level = 'CSD',
     geo_format = 'sf'
   )

   # Analyze
   top_income <- toronto %>%
     filter(!is.na(v_CA21_906)) %>%
     top_n(10, v_CA21_906)

   # Visualize
   ggplot(top_income) +
     geom_sf(aes(fill = v_CA21_906)) +
     scale_fill_viridis_c() +
     labs(title = "Top 10 Highest Income Areas - Toronto CMA") +
     theme_minimal()

**Python Version:**

.. code-block:: python

   import pycancensus as pc
   import matplotlib.pyplot as plt

   # Get data
   toronto = pc.get_census(
       dataset='CA21',
       regions={'CMA': '35535'},
       vectors=['v_CA21_906'],
       level='CSD',
       geo_format='sf'
   )

   # Analyze
   top_income = (toronto
       .dropna(subset=['v_CA21_906'])
       .nlargest(10, 'v_CA21_906')
   )

   # Visualize
   top_income.plot(
       column='v_CA21_906',
       cmap='viridis',
       legend=True
   )
   plt.title("Top 10 Highest Income Areas - Toronto CMA")
   plt.axis('off')
   plt.show()

Both versions produce equivalent results.
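
One caveat concerns ties: dplyr's ``top_n()`` keeps every row tied for the last place, while pandas' ``nlargest()`` returns exactly ``n`` rows by default. Passing ``keep='all'`` gives the tie-preserving behavior. The sketch below shows the difference on illustrative values (not real census figures).

.. code-block:: python

   import pandas as pd

   # Illustrative values only; rows B and C are tied.
   scores = pd.DataFrame({
       "GeoUID": ["A", "B", "C", "D"],
       "v_CA21_906": [50, 40, 40, 30],
   })

   # Default keep='first': exactly two rows, the tie is broken by position.
   first_only = scores.nlargest(2, "v_CA21_906")

   # keep='all': every row tied for the last place is retained.
   with_ties = scores.nlargest(2, "v_CA21_906", keep="all")

   print(first_only)
   print(with_ties)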

Further Reading
---------------

- :doc:`validation` - See 96% validation pass rate with 24 examples
- :doc:`../README` - Package overview
- :doc:`tutorials/index` - Step-by-step tutorials
- :doc:`auto_examples/index` - Gallery of examples