ROADMAP: Tools to Know

For Python Data Analytics, Business Intelligence, Machine Learning, and more.

All tools listed are free unless noted. Check the boxes as you add skills.

Core Setup

[ ] Visual Studio Code (VS Code) - Modern IDE with Python, Markdown, and Git support. Use to edit code, run notebooks, and manage projects.
[ ] Python - Primary programming language for scripting, data analysis, and BI automation.
[ ] Markdown - Simple formatting language for README files, documentation, and Jupyter notes.
[ ] Jupyter Notebooks / JupyterLab - Interactive notebooks that combine code, output, and formatted text. Essential for data exploration and storytelling.
[ ] VS Code Jupyter Extension - Rich notebook editing in VS Code, with support for Markdown, plots, and cell output.
[ ] uv (Astral) - Modern Python package and environment manager. Replaces pip, venv, and virtualenv.

Version Control and Project Management

[ ] Git and GitHub - Version control and collaboration platform. Track changes, share code, and publish projects.
[ ] .gitignore / pyproject.toml / virtual environments - Tools to manage dependencies and keep projects clean and reproducible.
[ ] pre-commit – Automates linting, formatting, and static analysis before commits.
[ ] GitHub Actions - Automate checks, deployments, and other tasks from GitHub.
[ ] TortoiseGit - Windows only, integrates Git with Windows File Explorer.

Terminals

[ ] PowerShell - Use on Windows machines. Also available as a cross-platform automation tool on macOS and Linux.
[ ] Windows Subsystem for Linux (WSL/WSL2) - Linux environment on Windows machines. Strongly recommended, and sometimes required, when testing Apache and other open-source data tools that target Linux.
[ ] bash / zsh / Git Bash - Common shells on macOS and Linux. Git Bash provides a bash shell for Windows.
[ ] uvx - Command that ships with uv. Runs Python command-line tools in temporary environments without installing them permanently.

Data Manipulation & In-Memory Processing

[ ] Pandas - Core library for tabular data manipulation and analysis in Python.
[ ] Polars - High-performance DataFrame library (written in Rust). Fast and efficient, especially for large datasets.
[ ] PyArrow - Python bindings for Apache Arrow. Used for reading and writing Parquet and for exchanging data with Pandas, Polars, and DuckDB.
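
To give a feel for the tabular work these libraries handle, here is a minimal Pandas sketch. The sales table, its column names, and its values are invented for illustration; it assumes pandas is already installed in the active environment.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "units": [10, 5, 7, 3],
        "unit_price": [2.0, 4.0, 2.0, 4.0],
    }
)

# Derive a revenue column, then total it per region.
sales["revenue"] = sales["units"] * sales["unit_price"]
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(revenue_by_region.to_dict())
```

The same derive-then-aggregate pattern carries over to Polars, although its method names and expression syntax differ.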

Modern Analytics Data Engineering Tools

[ ] dbt - Transform and test SQL-based analytics workflows in a modular way.
[ ] SQLMesh - Modern data transformation and version control for SQL-based workflows. Allows modular and repeatable ETL development.
[ ] Datafold – Data diffing and validation for analytics CI/CD workflows.

In-Process SQL & Columnar Data

[ ] DuckDB - In-process SQL engine. Run fast queries directly on CSVs and Parquet files.
[ ] Apache Arrow - In-memory columnar data format enabling high-speed data exchange between tools like Polars, DuckDB, and Spark.

Distributed Computing & Big Data Processing

[ ] Apache Spark - Distributed processing engine for big data and analytics. Handles streaming through micro-batching, suits iterative machine learning tasks, and has a mature ecosystem with extensive community support.
[ ] PySpark - Python API for working with Apache Spark.
[ ] Apache Flink - Highly scalable stream processing engine for exactly-once event processing, real-time analytics, and complex event processing.
[ ] Ray – Distributed Python for machine learning, model serving, and parallel computing.

Data Lakehouse & Table Formats

[ ] Delta Lake - Open-source storage layer that adds ACID transactions and table versioning to data lakes.
[ ] Apache Iceberg - High-performance table format for huge analytic datasets, enabling fast data lake queries and ACID transactions.
[ ] Apache Hudi - Real-time data lake platform for incremental data processing and streaming ingestion.
[ ] Apache XTable - Multi-format table abstraction layer, enabling seamless conversion between Iceberg, Delta Lake, and Hudi.
[ ] OneTable - Original name of the project now developed as Apache XTable. Older articles and tutorials may still use this name.

Data Pipelines & Quality

[ ] Apache NiFi - Visual tool to build and manage automated data flows.
[ ] Prefect - Orchestrate and schedule ETL pipelines. Great for managing dependencies and retries in production environments. See also Apache Airflow.
[ ] Pandera - Open-source library for validating DataFrame schemas and values in code.
[ ] Tableau Prep - (Paid, but free for students via Tableau for Students) Visual tool to clean, shape, and join data before loading into Tableau dashboards.
[ ] Great Expectations, GX Cloud - Tool for validating and testing data quality in pipelines or notebooks. Used in enterprise-grade pipelines to ensure data reliability.
[ ] Marquez – Open-source metadata and lineage tracking for pipelines.
[ ] Great Expectations (GX) – Modernized framework for data validation and testing in pipelines.
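
The validation tools in this list (Pandera, Great Expectations) share one idea: declare what valid data looks like, then check every row against that declaration. The sketch below shows the idea in plain Python. It deliberately uses no library API; the SCHEMA layout, the validate_rows helper, and the sample rows are all hypothetical.

```python
from typing import Any

# Hypothetical schema: column name -> (expected type, extra check on the value).
SCHEMA = {
    "order_id": (int, lambda v: v > 0),
    "amount": (float, lambda v: v >= 0.0),
    "status": (str, lambda v: v in {"new", "shipped", "returned"}),
}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means every row passed."""
    problems = []
    for index, row in enumerate(rows):
        for column, (expected_type, check) in SCHEMA.items():
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {index}: '{column}' is not {expected_type.__name__}")
            elif not check(row[column]):
                problems.append(f"row {index}: '{column}' failed its value check")
    return problems


# Invented sample data: one clean row, one bad value, one missing column.
rows = [
    {"order_id": 1, "amount": 19.99, "status": "new"},
    {"order_id": 2, "amount": -5.0, "status": "shipped"},
    {"order_id": 3, "amount": 7.5},
]

problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Dedicated libraries add what this sketch lacks, such as DataFrame-level checks, reusable schemas, and reports, so treat it as a picture of the concept rather than a replacement.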

Streaming & Real-Time Processing

[ ] WebSockets - Real-time communication in dashboards and live apps.
[ ] python-socketio - Python implementation of Socket.IO for real-time, bidirectional, event-based communication between clients (e.g., web browsers) and a server.
[ ] Apache Kafka - High-throughput distributed messaging system for real-time data.
[ ] Apache Pulsar - Cloud-native distributed messaging and event streaming platform for high-throughput, low-latency real-time processing.
[ ] Faust – Python stream processing library compatible with Kafka for event-driven analytics.

Web Scraping & APIs

[ ] requests - Simple HTTP library for making API calls.
[ ] BeautifulSoup - HTML parser for web scraping and structured data extraction.
[ ] Scrapy - Open-source scraping and extraction framework.
[ ] FastAPI - High-performance web framework for building APIs in Python.
[ ] Flask - Lightweight microframework for building web apps and data services.
[ ] httpx – Modern async HTTP client for Python. Often used with FastAPI for high-performance API calls.
[ ] FastAPI (Pydantic v2) – Fully compatible with Pydantic v2 for fast, type-safe API validation.
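
Scraping comes down to parsing HTML and pulling out the pieces you need. The sketch below does that with the standard library's html.parser as a stand-in for BeautifulSoup, so it runs without extra installs. The LinkCollector class and the sample page are hypothetical, and BeautifulSoup's own API looks different and is far more convenient.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Invented page content; a real scraper would download this with an HTTP client.
page = """
<html><body>
  <a href="/docs">Documentation</a>
  <p>No link here.</p>
  <a href="/blog">Latest <b>posts</b></a>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

In practice the page text would come from an HTTP client such as requests or httpx. Check a site's terms of use before scraping it.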

Data Visualization: Core

[ ] Matplotlib - Core plotting library for Python.
[ ] Seaborn - Statistical data visualization built on top of Matplotlib.
[ ] Plotly - Interactive charts and graphs, browser-based.
[ ] Altair – Declarative visualization library built on Vega-Lite.
[ ] Polars Plotting (Experimental) - Built-in DataFrame.plot namespace for quick, lightweight charts directly from Polars DataFrames. The API may change between releases.

Data Visualization: Interactive Web Apps

[ ] Streamlit - Create interactive web apps directly from Python scripts.
[ ] Plotly Dash - Build analytical web applications with no front-end experience required.

Data Visualization: Interactive and Dashboarding Tools

[ ] Shiny for Python (PyShiny) - Reactive Python dashboard framework, the Python counterpart of the popular R Shiny.
[ ] Apache Superset - Open-source BI tool for building interactive dashboards and data exploration.
[ ] Power BI - (Free for students and basic use) - Professional dashboarding tool used widely in business. The Power BI Desktop authoring app is Windows-only.
[ ] Tableau - (Paid, but free for students via Tableau for Students) - Leading BI dashboard platform used in many companies.
[ ] Apache Superset v4 – Includes a semantic layer and embedded dashboard capabilities.

Cloud Tools & Deployment

[ ] GitHub Codespaces - Cloud-based VS Code environment. Tools such as uv and Jupyter can be preinstalled through the project's dev container configuration.
[ ] Google Colab - Free hosted Jupyter notebooks with built-in libraries and GPU support.
[ ] ShinyLive - Run Shiny for Python apps entirely in the browser, so they can be hosted as static files without a Python server.
[ ] Streamlit Community Cloud - Host Streamlit dashboards and apps online with a free account.
[ ] Render - Platform for deploying FastAPI, Flask, and Python web apps easily.
[ ] Railway - Platform for deploying web apps and microservices easily.
[ ] AWS Lambda - Run Python functions in the cloud without managing servers. Free tier available.
[ ] Microsoft Azure Functions - Cloud-based serverless function execution from Microsoft. Free tier available.
[ ] Modal – Run and scale Python functions in the cloud without infrastructure setup.
[ ] Hugging Face Spaces – Free hosting for Streamlit, Gradio, and Shiny apps.
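
Serverless platforms such as AWS Lambda run a single function per request. As a rough sketch, a Python handler takes an event and a context argument and returns a result. The response shape below (statusCode, headers, body) is the one commonly used behind an HTTP front end, and the "name" field in the event is a hypothetical input chosen for this example.

```python
import json


def lambda_handler(event, context):
    """Handler in the (event, context) shape AWS Lambda uses for Python functions."""
    # "name" is a hypothetical field; real events depend on what triggers the function.
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local smoke test: call the handler directly, no cloud account needed.
response = lambda_handler({"name": "analyst"}, None)
print(response["statusCode"], response["body"])
```

Because the handler is an ordinary function, it can be tested locally like this before any deployment.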

AI & Semantic Tools (Advanced or Optional)

[ ] Chroma (ChromaDB) - Open-source vector database for storing and searching embeddings (semantic search). Used in AI pipelines for tasks like document search, RAG apps, and chatbots.
[ ] LangChain – Framework for building LLM-powered applications and workflows.
[ ] LlamaIndex – Framework for connecting structured/unstructured data to LLMs (RAG).
[ ] Ollama – Run and serve local LLMs like Llama 3 and Mistral on macOS, Windows, or Linux.
[ ] OpenAI API / Anthropic Claude API – Cloud APIs for state-of-the-art LLMs used in production applications.
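
Semantic search, the core job of a vector database, ranks stored embeddings by their similarity to a query embedding. The sketch below shows that ranking step in plain Python with cosine similarity. The three-dimensional vectors, document titles, and search helper are invented for illustration; real embeddings come from a model and have far more dimensions, and this is not the Chroma API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" keyed by document title.
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}


def search(query_vector, top_k=2):
    """Return the titles of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]


results = search([0.8, 0.2, 0.1])
print(results)
```

A vector database performs the same kind of ranking at scale, using indexes that avoid comparing the query against every stored vector.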