Professional Data Analytics

This is a set of common pages to help you get started with professional data analytics, including GitHub, Git, Python, Markdown, VS Code, and more.

Our Goal

Our goal is to help you get productive quickly and effectively. We introduce the general landscape and terminology, explain briefly what these tools do, and say why we think they are worth learning. Details are left to outside resources.

Requirements

Recommended choices are nearly always free and open-source. You will need a relatively modern machine and electricity to keep it running. Your curiosity, resourcefulness, and tenacity will be quite valuable.


Start Here

  1. First, sign up for a free account on GitHub.

Then, read about and install the following on your machine.

  1. Git - a system for tracking evolving code files and syncing between your machine and GitHub
  2. VS Code - a lightweight editor great for beginners and professionals alike
  3. Python - a popular, powerful language for working with data

Chapters

We organize our documentation into several main categories. Click the links below to explore each chapter:

  1. Languages
  2. Tools
  3. Terminals
  4. Techniques
  5. Hosting
  6. Data
  7. Other

Or browse the sidebar menu on the left.

Contributing

We welcome contributions to our documentation! If you’re interested in contributing, please see the CONTRIBUTING.md file for guidelines on how to get started.

Feedback

To provide comments or feedback, please use the Issues and Discussions tabs in the GitHub repo.

License

This documentation is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. Please see the LICENSE.md file for more information.


Chapter 1

Languages

This chapter provides an introduction to popular languages for analytics.

Languages in Analytics

Data analysts often work with a variety of languages, which can be broadly categorized into programming languages and markup languages.

Programming languages are further classified into compiled and scripting languages.

Compiled languages, such as Go, Rust, Java, and C# (“C-sharp”) require a separate compilation step to convert source code into machine-readable code, resulting in faster execution times and better performance optimization.

Scripting languages, such as Python, R, and JavaScript, are interpreted at runtime, providing more flexibility and ease of use, making them popular choices for data analysis tasks.

Markup languages, like Markdown, HTML, and CSS, are used to structure and present data, rather than performing computations.

Data analysts often use markup languages to store, exchange, and visualize data, in conjunction with programming languages for data manipulation and analysis.

Familiarity with various languages across these categories enables data analysts to effectively handle diverse data sources, perform complex analyses, and communicate results in a clear, accessible manner.

In alphabetical order, some of the languages you may encounter include the following.

CSS

Markup Language, Web Development

CSS (Cascading Style Sheets) is a stylesheet language used for describing the look and formatting of a document or web page written in HTML. While not directly related to data analytics, it’s essential for creating visually appealing dashboards and reports.

Go

Programming Language, Compiled Language

Go is a statically typed, compiled language with strong support for concurrent programming. While not as popular for data analytics as Python or R, Go is gaining traction for developing high-performance data processing tools.

HTML

Markup Language, Web Development

HTML (Hypertext Markup Language) is the standard markup language used to create web pages. It is useful for structuring and formatting web content, including data visualizations and interactive analytics applications.

JavaScript

Programming Language, Scripting Language, Web Development

JavaScript is a widely-used programming language that enables interactivity and dynamic content on the web. In data analytics, JavaScript is commonly used with libraries like D3.js to create interactive visualizations and web-based applications.

Julia

Programming Language, Scripting Language, Jupyter Support

Julia is a high-level, high-performance programming language for technical computing. It is gaining popularity in data analytics due to its speed, ease of use, and extensive library ecosystem, including packages for data manipulation, statistical analysis, and machine learning. It can be used in Jupyter notebooks along with Python.

LaTeX

Markup Language, Typesetting

LaTeX (“la-TECH”) is a markup language used for creating professional-looking documents, including academic papers, capstone reports, theses, and presentations. It is widely used in the scientific and technical communities due to its ability to handle complex equations and symbols with ease.

Markdown

Markup Language, Jupyter Support

Markdown is a lightweight markup language used to create formatted text documents. While not specific to data analytics, it is commonly used to document code, write README files, and create reports in a simple and human-readable format. It is commonly used in Jupyter notebooks along with Python.

PowerShell

Programming Language, Scripting Language

PowerShell is a powerful scripting language and shell designed for automating tasks and managing configurations in Windows environments. While not commonly used for data analytics, it can be employed for data extraction, transformation, and automation tasks.

Python

Programming Language, Scripting Language, Jupyter Support

Python is a popular programming language for data science and machine learning. It offers extensive libraries and tools for data analysis, visualization, and machine learning, making it an excellent choice for data analytics tasks.

R

Programming Language, Scripting Language, Jupyter Support

R is a programming language and software environment for statistical computing and graphics. It is widely used in data analytics for statistical analysis, data manipulation, and visualization. R can be used in Jupyter notebooks along with Python.

Rust

Programming Language, Compiled Language

Rust is a systems programming language focused on safety, concurrency, and performance. While not as widely used for data analytics, it can be employed for building high-performance data processing tools and libraries.

SQL

Programming Language, Declarative Language

SQL is a domain-specific programming language used to manage and manipulate relational databases.
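As a brief illustration of SQL's declarative style, the following sketch uses Python's built-in sqlite3 module with an in-memory database (the table and values are made up for the example):

```python
import sqlite3

# In-memory database for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
)

# A declarative query: describe the result you want, not how to compute it
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 150.0), ('West', 250.0)]
```

The same SELECT statement works, with minor dialect differences, against most relational databases.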

Typst

Markup Language, Typesetting

Typst is a new typesetting option that aims to simplify the document creation process. It provides an intuitive markup language for formatting text, with support for mathematical equations, tables, and figures. It can be compiled into various document formats, including PDF and HTML.

Subsections of Languages

CSS

CSS is a powerful styling language used to add visual effects to web pages.

Why CSS?

For web developers and designers, Cascading Style Sheets (CSS) is an essential skill for creating attractive and engaging websites.

  • CSS helps to create visually appealing layouts and designs that enhance user experience.
  • It allows for consistent styling across all pages of a website, making it easier to maintain and update.

CSS Syntax

  • CSS uses a set of rules and declarations to style HTML elements.
  • Selectors are used to target specific HTML elements, while properties define the styling rules.

Free Resources for Learning CSS

  • CSS Tricks: A website with a wide range of articles, tutorials, and resources for learning CSS
  • MDN Web Docs - CSS: A comprehensive guide to CSS, with documentation and examples
  • W3Schools CSS Tutorial: A free, interactive tutorial for learning CSS, with practical examples and exercises
  • Codecademy CSS Course: An interactive course that covers the basics of CSS, with hands-on coding exercises
  • CSS Zen Garden: A showcase of creative CSS designs, with source code available for learning

File Extensions

  • .css

Using CSS

There is no installation needed to begin using CSS to style web pages. CSS is understood by web browsers such as Chrome, Firefox, Safari, and Edge. To use CSS, you can define styles in a separate CSS file or in the head section of an HTML file using the <style> element.

To add a .css file to an HTML file,
include a link in the head section of the HTML file.

<head>
  <link rel="stylesheet" href="styles.css">
</head>

An example of styles.css is shown below.

body {
  font-family: Arial, sans-serif;
  background-color: #f0f0f0;
}

h1 {
  color: #333;
  font-size: 2em;
  margin-bottom: 1em;
}

p {
  color: #666;
  font-size: 1.2em;
  line-height: 1.5;
  margin-bottom: 1.5em;
}

Responsive Design

Responsive design is an approach to web design that aims to create websites that adapt to different screen sizes and devices. With responsive design, web developers can ensure that their websites look and function well on desktops, laptops, tablets, and smartphones, and provide a consistent user experience across all devices.

To create responsive websites, CSS is used to define media queries that specify different styles and layouts for different screen sizes. By using media queries, web developers can adjust the design of their websites based on the width of the viewport, the orientation of the device, and other factors.
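As a minimal sketch, a media query can switch a two-column layout to a single column on narrow screens (the class name and breakpoint here are illustrative):

```css
/* Two columns by default */
.dashboard {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1em;
}

/* One column when the viewport is 600px wide or narrower */
@media (max-width: 600px) {
  .dashboard {
    grid-template-columns: 1fr;
  }
}
```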

Design Skills

Good design is an essential aspect of creating effective and engaging websites and dashboards. CSS plays a crucial role in web design, as it allows users to control the visual presentation of their sites and create attractive and user-friendly interfaces.

To create good design with CSS, it’s important to have a solid understanding of typography, color theory, layout principles, and user experience design. Web developers can use CSS to define fonts, colors, spacing, positioning, and other visual elements, and use design principles to create a cohesive and appealing look and feel.

Because getting dashboards and web pages to look good on all possible screen sizes and orientations is difficult, many of us prefer to use professionally created CSS rather than write our own.

CSS frameworks are pre-built libraries of CSS and JavaScript code used to streamline web development and create consistent and responsive displays. These frameworks are designed to be responsive and look good on screens ranging from mobile devices like smartphones to large, wall-mounted displays. They provide a range of pre-designed elements, such as navigation bars, forms, and buttons, that can be easily customized and incorporated into dashboards and web projects.

Popular CSS frameworks include Bootstrap, Material Design Bootstrap (MDB), Foundation, and Bulma. These frameworks offer a wide range of design options, robust documentation, and support from their communities.

CSS in Dashboarding Frameworks

Many popular data analytics dashboarding frameworks allow customization for analysts with a knowledge of CSS.

CSS in Tableau

Tableau provides a range of customization options for dashboard styling, including the ability to use custom CSS code to modify the appearance of dashboards and reports. Users can create custom themes and apply them to their dashboards, or use CSS to modify individual elements such as fonts, colors, and backgrounds.

CSS in Power BI

Power BI allows users to customize the appearance of their dashboards using themes and custom CSS code. Users can modify the styling of individual elements such as charts, tables, and cards, and can apply custom CSS classes to elements for greater control over styling.

CSS in Plotly

Plotly is a web-based data visualization platform that provides a range of customization options for dashboard styling, including the ability to use custom CSS code to modify the appearance of charts and graphs. Users can modify the styling of individual elements such as colors, fonts, and backgrounds, and can apply custom CSS classes to elements for more granular control over styling.

Plotly supports multiple programming languages including Python, R, and JavaScript.

CSS in Metabase

Metabase is an open-source business intelligence and data analytics platform that allows users to create interactive dashboards and reports. It provides a range of customization options for dashboard styling, including the ability to use custom CSS code to modify the appearance of dashboards and reports. Users can modify the styling of individual elements such as fonts, colors, and backgrounds, and can apply custom CSS classes to elements for greater control over styling.

Metabase supports SQL queries and has a web-based interface.

CSS in Redash

Redash is an open-source data visualization and dashboarding platform that allows users to connect to various data sources and create interactive dashboards and reports. It provides a range of customization options for dashboard styling, including the ability to use custom CSS code to modify the appearance of dashboards and reports. Users can modify the styling of individual elements such as fonts, colors, and backgrounds, and can apply custom CSS classes to elements for more granular control over styling.

Redash connects to SQL databases, MongoDB, and APIs, and includes support for Python scripts.

See Also

Read more about some of these important options in:

Go

Powerful and Efficient Programming Language

Golang, also known as Go, is an open-source programming language developed by Google. It is designed for simplicity, efficiency, and strong support for concurrent programming.

Why Go?

For developers, Golang offers several advantages over other programming languages:

  • Go is designed for simplicity, making it easy to learn and write.
  • It has strong support for concurrent programming, allowing for efficient performance in multi-core environments.
  • Go has a garbage collector, which automatically manages memory allocation and deallocation.
  • It has a growing ecosystem and community, with a range of libraries and frameworks available.

Go Syntax

  • Golang has a clean and straightforward syntax, influenced by C but with some improvements.
  • It uses static typing and supports various data types, including integers, floats, strings, and arrays.
  • Go has built-in support for concurrent programming with goroutines and channels.
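The goroutines and channels mentioned above can be sketched as follows; the function name and values are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// doubleAll doubles each value on its own goroutine, collects the
// results over a channel, and sorts them (goroutines finish in any order).
func doubleAll(xs []int) []int {
	out := make(chan int, len(xs))
	for _, x := range xs {
		go func(v int) { out <- v * 2 }(x) // one goroutine per value
	}
	res := make([]int, 0, len(xs))
	for range xs {
		res = append(res, <-out) // receive one result per goroutine
	}
	sort.Ints(res)
	return res
}

func main() {
	fmt.Println(doubleAll([]int{1, 2, 3})) // [2 4 6]
}
```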

Free Resources for Learning Go

  • The Go Programming Language: The official Go website, with documentation, tutorials, and downloads.
  • A Tour of Go: An interactive introduction to Golang, with hands-on coding exercises.
  • Go by Example: A collection of practical examples and snippets for learning Golang.
  • Effective Go: A guide to writing efficient and idiomatic Golang code.
  • The Go Playground: An online environment for writing and testing Golang code.

Golang Frameworks and Libraries

  • Golang has a growing ecosystem of libraries and frameworks, catering to various use cases such as web development, data processing, and networking.
  • Popular Golang frameworks and libraries include Gin, Revel, and Gorilla.

File Extensions

  • .go

HTML

HTML is a markup language used for creating web pages and applications.

Why HTML?

HTML is essential for data analysts and developers who want to create web-based applications and documents.

  • HTML skills allow you to create and publish web content and applications.
  • HTML can be used with other languages like CSS and JavaScript to create dynamic web pages and applications.

HTML Syntax

  • HTML is a markup language that uses tags to define elements on a web page.
  • Tags are used to define headings, paragraphs, links, images, and other elements.
  • HTML documents are typically saved with the .html file extension.
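A minimal HTML document illustrates this tag structure (the content shown is illustrative):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Sales Report</title>
  </head>
  <body>
    <h1>Quarterly Sales</h1>
    <p>Revenue grew this quarter.</p>
    <a href="https://example.com">Full report</a>
  </body>
</html>
```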

Free Resources for Learning HTML

File Extensions

  • .html

JavaScript

JavaScript is a popular programming language used for web development and beyond. In this page, we will cover some basics of JavaScript.

Why JavaScript?

JavaScript is widely used for building web applications, and it’s a vital skill for web developers. Some of the reasons to learn JavaScript include:

  • Interactivity: JavaScript makes websites more interactive and engaging, allowing for features such as animations, user input validation, and dynamic content updates.

  • Front-end web development: JavaScript is used heavily in front-end web development, enabling developers to build user interfaces and dynamic web pages.

  • Back-end web development: JavaScript can also be used for back-end web development, allowing developers to build server-side applications and APIs.

  • Cross-platform development: With tools like Node.js, JavaScript can be used to build cross-platform applications for desktop and mobile devices.

Free Resources for Learning JavaScript

  • JavaScript Tutorial for Beginners: A comprehensive tutorial covering the basics of JavaScript syntax, data types, operators, functions, and more.

  • MDN Web Docs: JavaScript: Mozilla’s guide to JavaScript, including a reference guide, tutorials, and examples.

  • Eloquent JavaScript: A free online book that covers the basics of JavaScript programming, including control structures, functions, objects, and more.

  • JavaScript30: A free 30-day JavaScript coding challenge that covers different aspects of the language and helps build real-world projects.

  • Codecademy: JavaScript: An interactive online course that teaches the basics of JavaScript programming.

Free Resources for Advanced JavaScript

  • You Don’t Know JS: A series of books that covers advanced JavaScript topics, including closures, prototypes, asynchronous programming, and more.

  • JSBooks: A collection of free JavaScript books covering advanced topics such as functional programming, design patterns, and algorithms.

  • Node.js: A JavaScript runtime built on Chrome’s V8 JavaScript engine that allows developers to build scalable network applications.

File Extensions

  • .js

Julia

High-Performance Dynamic Programming Language

Julia is a high-level, high-performance dynamic programming language designed for numerical and scientific computing, data analysis, and machine learning.

Why Julia?

For developers, Julia offers several advantages over other programming languages:

  • Julia has a just-in-time (JIT) compiler, which means that it can run code as fast as statically compiled languages like C and Fortran.
  • It has a simple and expressive syntax, making it easy to learn and write.
  • Julia supports multiple dispatch, which allows for flexible and efficient handling of functions with different argument types.
  • It has a growing ecosystem and community, with a range of libraries and frameworks available.

Julia Syntax

  • Julia has a simple and readable syntax, with support for multiple dispatch and type inference.
  • It supports various data types, including integers, floats, strings, and arrays.
  • Julia has built-in support for parallel and distributed computing.
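The multiple dispatch mentioned above can be sketched as follows: the method that runs is chosen from the types of the arguments (the describe function here is illustrative):

```julia
describe(x::Integer) = "an integer: $x"
describe(x::AbstractFloat) = "a float: $x"
describe(x::String) = "a string: $x"

println(describe(3))     # an integer: 3
println(describe(2.5))   # a float: 2.5
println(describe("hi"))  # a string: hi
```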

Project Management

Project.toml is a configuration file used in Julia projects to specify the project’s dependencies and other metadata. It is part of the Julia package management system, which provides a standardized way to manage packages and their dependencies.

Project.toml is used by the Julia package manager to create and manage project environments. When a Project.toml file is present in a project directory, the package manager can use this file to create a dedicated environment for the project, separate from the user’s global environment or other project environments.

It allows developers to specify the exact version of each dependency required by the project. This helps ensure that the project is compatible with specific versions of each package, and can help avoid conflicts or unexpected behavior caused by incompatible package versions.

Project.toml can also include other metadata about the project, such as its name, version number, and author information. This makes it easy to share and distribute the project with others.
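A minimal, hypothetical Project.toml might look like the following; note that [deps] entries are normally added for you by the package manager (for example, via Pkg.add) rather than written by hand:

```toml
name = "MyAnalysis"
version = "0.1.0"
authors = ["Ada Analyst <ada@example.com>"]

[compat]
julia = "1.10"
```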

Free Resources for Learning Julia

Julia Frameworks and Libraries

  • Julia has a growing ecosystem of libraries and frameworks, catering to various use cases such as data processing, scientific computing, and machine learning.
  • Popular Julia frameworks and libraries include Flux, DifferentialEquations.jl, and JuMP.

File Extensions

  • .jl

LaTeX

LaTeX, pronounced “la-TECH”, is a high-quality typesetting system designed for the production of technical and scientific documents. It is widely used in academia, industry, and publishing, and is known for its ability to produce professional-looking documents with complex mathematical formulas and graphics.

Why LaTeX?

LaTeX offers several advantages over traditional word processors such as Microsoft Word or Google Docs:

  • Precision: LaTeX is designed to produce high-quality, precise documents with consistent formatting, layout, and typesetting.

  • Flexibility: LaTeX allows users to easily create and format complex mathematical equations, symbols, and diagrams.

  • Portability: LaTeX documents can be easily converted to a variety of formats, including PDF, HTML, and other document types.

LaTeX for Scientific Writing

  • LaTeX is a preferred tool for writing scientific documents such as research papers, technical reports, capstone project reports, and theses.

  • LaTeX provides powerful tools for creating and formatting complex equations and symbols, making it ideal for scientific writing.

LaTeX for Presentations

  • LaTeX can be used to create professional-looking presentations using the Beamer class.

  • Beamer provides a variety of presentation templates and themes, and allows users to easily incorporate mathematical equations and graphics.

Basic LaTeX Syntax

LaTeX uses markup syntax to create formatted text, equations, and graphics.

Here are some basic syntax elements of LaTeX.

Math Mode

Math mode is used to create mathematical equations and symbols. To enter math mode, use the $ symbol to enclose your equation or symbol.

$f(x) = x^2$

Commands

LaTeX uses commands to perform various formatting and typesetting tasks. Commands are preceded by a backslash (\).

\section{Introduction}

Environments

Environments are used to apply formatting or styles to a block of text or content. Environments are enclosed by the \begin{environment} and \end{environment} commands.

\begin{itemize}
\item Item 1
\item Item 2
\item Item 3
\end{itemize}

Integration

LaTeX can be used in combination with other tools, such as BibTeX for managing bibliographic references and citations.
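For example, a reference stored in a .bib file is written as a BibTeX entry like the one below (the citation key and fields shown are illustrative), and cited in the LaTeX document with \cite{mckinney2022}:

```bibtex
@book{mckinney2022,
  author    = {Wes McKinney},
  title     = {Python for Data Analysis},
  edition   = {3},
  publisher = {O'Reilly Media},
  year      = {2022}
}
```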

Free Resources for Learning LaTeX

  • LaTeX Project: The official website for LaTeX, with documentation, tutorials, and resources.

  • Overleaf: A cloud-based LaTeX editor with templates, tutorials, and collaboration tools.

  • ShareLaTeX: A cloud-based LaTeX editor that merged into Overleaf in 2017.

  • LaTeX Wikibook: A community-driven LaTeX guide with tutorials, examples, and reference materials.

  • LaTeX Tutorial by Overleaf: A beginner-friendly LaTeX tutorial by Overleaf, with examples and exercises.

File Extensions

Here are some common file extensions used in LaTeX.

  • .tex: The main file extension for LaTeX documents.

  • .bib: The file extension for bibliographic data files, used with BibTeX to manage references and citations in reports and documents.

See Also

Markdown

Markdown is a lightweight markup language for formatting text on the web.

Why Markdown?

Markdown is an essential tool for data analysts and developers. With its simple syntax and powerful features, Markdown is easy to learn, widely used, and perfect for creating structured documents and web content.

For data analysts and developers:

  • Markdown is an invaluable skill for creating clear and concise documentation of our work.
  • Markdown skills help communicate our findings more effectively to colleagues and stakeholders, and make our work more accessible and engaging to others.

Markdown for READMEs

  • Markdown can be used to create professional README.md files to introduce our project repositories on GitHub.
  • README.md files help others understand the purpose of our project, its features, and how to use it.

Markdown for Jupyter Notebooks

  • Markdown is widely used in Jupyter Notebooks, a popular tool for data analysis and scientific computing.
  • With Markdown, we can create rich and informative narratives alongside our code and visualizations.

Basic Markdown Syntax

Markdown uses plain text formatting to create headers, lists, links, and other formatting elements. Here are some basic syntax elements of Markdown:

Headers

Headers are used to create headings or subheadings in your document. To create a header, use the # symbol followed by a space and the text for your heading. Markdown supports up to six levels of headers.

# This is a level one header
## This is a level two header
### This is a level three header
#### This is a level four header
##### This is a level five header
###### This is a level six header

Lists

Lists are used to create ordered and unordered lists in your document. To create a list, use either the * symbol or the - symbol for an unordered list, or use numbers for an ordered list.

An unordered list in Markdown is created by using the "- " syntax ("dash space"), followed by the list item.

- Item 1
- Item 2
- Item 3

An ordered list in Markdown is created by using the "1. " syntax ("one dot space"), followed by the list item. Markdown automatically increments the item numbers as the page is rendered, so you can write "1." for every item without manually adjusting the numbers.

1. Item 1
1. Item 2
1. Item 3

Links

Links are used to create hyperlinks in your document. To create a link, use square brackets to enclose the link text, followed by the link URL in parentheses.

[Markdown: Getting Started](https://www.markdownguide.org/getting-started/)

Images

Images are used to display images in your document. To add an image, use an exclamation point, followed by square brackets to enclose the alt text, and the image URL in parentheses.

![Alt Text](image.url)

Advanced Markdown Syntax

Markdown also supports more advanced syntax, such as tables, code blocks, and inline code. Here are some examples of advanced Markdown syntax.

Tables

Tables are used to display data in rows and columns. To create a table, use hyphens (-) for the headers and pipes or vertical bars (|) to separate the columns.
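A small example (the values are illustrative): a header row, a divider row of hyphens, and data rows, with pipes separating the columns.

```markdown
| Region | Sales |
| ------ | ----- |
| East   | 150   |
| West   | 250   |
```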

Code Blocks

Code blocks are used to display code in your document. To create a code block, use triple backticks followed by the language name, and then your code. End your code block with triple backticks.


```python
print("Hello, world!")
```

Inline Code

Inline code is used to display code within a paragraph. To create inline code, use single backticks (`) to enclose your code.


Use the `print()` function to print a message to the console.

Free Resources for Learning Markdown

Free Resources for Learning GitHub-Flavored Markdown

File Extensions

  • .md
  • .markdown

PowerShell

Cross-Platform Automation and Configuration

PowerShell Core is an open-source automation and configuration programming language for Windows, Linux, and macOS. It provides a powerful command-line interface for managing and automating systems and processes.

Why PowerShell Core?

For developers and system administrators, PowerShell Core offers several advantages over other automation and scripting tools:

  • PowerShell Core is cross-platform.
  • It’s a powerful and flexible scripting language.
  • It has a large and active community of users and contributors, with many resources and tutorials available.

PowerShell Core Syntax

  • PowerShell Core uses a command-line interface and scripting language that is similar to Unix shell scripting.
  • It supports various data types, including strings, numbers, arrays, and objects.
  • PowerShell Core has built-in support for remote management and automation.
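A small sketch of the Verb-Noun cmdlet style and the object pipeline (the values are illustrative):

```powershell
# Variables start with $; cmdlets follow a Verb-Noun naming convention.
$numbers = 1..5

# The pipeline passes objects, not text; Measure-Object computes the sum.
$total = ($numbers | Measure-Object -Sum).Sum
Write-Output "Total: $total"   # Total: 15
```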

Free Resources for Learning PowerShell Core

PowerShell Core Modules and Libraries

  • PowerShell Core has a large and growing collection of modules and libraries, catering to various automation and system management use cases.

Popular PowerShell modules include Pester and PSReadLine; many more are available from the PowerShell Gallery.

File Extensions

  • .ps1

See Also

There is more information about PowerShell in the Terminals Chapter.

Python

Python is a high-level programming language used for a wide range of applications, from data analysis to web development.

Why Python?

Python is an essential tool for data analysts and developers. With its easy-to-learn syntax, vast library of modules, and robust community support, Python is perfect for:

  • Data analysis, including statistical analysis, data visualization, and machine learning.
  • Web development, including server-side programming, web scraping, and automation.
  • Scripting, including system administration, text processing, and task automation.
  • Scientific computing, including simulations, modeling, and optimization.

Learning Python can be a valuable investment in your career.

Installation

The installation process for Python depends on your operating system. Follow the instructions below based on your platform:

Python Resources

  • Python.org - The official website of the Python programming language. Includes documentation, tutorials, and downloads for the latest versions of Python
  • Python for Data Analysis, 3E Open Edition or 2E Print - A comprehensive guide to using Python for data analysis, written by Wes McKinney, the creator of pandas
  • Python Data Science Handbook - A free online book that covers the fundamentals of data science using Python
  • Real Python - A collection of tutorials, courses, and articles on Python programming, web development, and data science
  • Python Crash Course - A beginner-friendly guide to Python programming, with examples and exercises covering key topics such as variables, functions, and control flow
  • Python Lingo from Luciano Ramalho, author of the advanced book Fluent Python.

File Extensions

  • .py - Python source code files
  • .ipynb - Jupyter Notebook files (“interactive Python notebook”)
  • .pyc - Compiled Python files
  • .pyd - Python extension modules
  • .pyo - Optimized Python files
  • .whl - Python package distribution files (“wheels”)

Subsections of Python

Python: Basics

Python is a popular high-level programming language that is easy to learn and widely used in data analysis, machine learning, web development, and many other fields.

Defining Variables

In Python, we can define a variable and assign a value to it using the “=” operator. For example:

x = 10

Here, we’ve defined a variable x and assigned it the value of 10.

Performing Operations

We can also perform mathematical operations on variables:

y = 5
z = x + y

Here, we’ve defined a variable y and added it to x to create a new variable z.

Expressions

Expressions are combinations of operators and operands that can be evaluated to produce a value.

Python allows us to use expressions to perform operations on variables. For example:

a = 2
b = 3
c = a * b + 1

Here, we’ve defined three variables: a, b, and c. We’ve used the * operator to multiply a and b, and then added 1 to the result.

Expressions can also include functions:

import math
d = math.sqrt(a**2 + b**2)

Here, we’ve imported the math module and used the sqrt() function to calculate the square root of a^2 + b^2.

We can also compute several results from x and y and display them with print(). Note that each result must be assigned to a variable before it can be printed:

total = x + y
difference = x - y
product = x * y
quotient = x / y

# Print statements
print("x =", x)
print("y =", y)
print("x + y =", total)
print("x - y =", difference)
print("x * y =", product)
print("x / y =", quotient)

Putting it all together:

x = 10
y = 5
z = x + y
print(z)

This code will create two variables, x and y, assign them the values 10 and 5, respectively, and then add them together to create a new variable z with the value 15. Finally, the code prints the value of z.

Statements

In Python, a statement is a line of code that performs an action or task.

Statements are the smallest unit of code that can be executed and they represent an action or command. Each statement performs a specific task, such as defining a variable, calling a function, or creating a loop.

x = 10
print("Hello, world!")
def add_numbers(a, b):
    return a + b

In the above example, the first line (x = 10) is a statement that assigns the value 10 to the variable x. The second line (print(“Hello, world!”)) is a statement that prints the message “Hello, world!” to the console. The third line defines a function add_numbers that takes two arguments and returns their sum.

Statements vs Expressions

Some expressions can be statements, such as an assignment expression, which assigns a value to a variable.

However, not all statements are expressions. For example, a print statement does not evaluate to a value and cannot be used as part of an expression.

Script

A Python script is simply a collection of statements executed in order to achieve a desired outcome.
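For example, a complete (if tiny) script might look like this; the file name and values are just an illustration:

```python
# greet.py - a tiny script: statements run from top to bottom
name = "Ada"                       # assignment statement
greeting = "Hello, " + name + "!"  # expression used in an assignment
print(greeting)                    # function-call statement; prints: Hello, Ada!
```

Saving this as greet.py and running `python greet.py` executes each statement in order.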

Python: Installation

Python is a high-level programming language used for a wide range of applications, from data analysis to web development.

Mac/Linux Users

  • Option 1: Official installation instructions. Follow instructions on the official Python website. This is the most up-to-date and comprehensive guide to installing Python on your system.

  • Option 2: Step-by-step installation guide. Check out our installation instructions for a step-by-step guide.

Windows Users

  • Option 1: Official installation instructions. Follow instructions on the official Python website. This is the most up-to-date and comprehensive guide to installing Python on your system.

  • Option 2: Step-by-step installation guide. Check out our detailed installation instructions for a step-by-step guide.

Subsections of Python: Installation

Python: Mac/Linux

Task 1 - Install Python

  1. Open a terminal window
  2. Run the following command to install Python:
    1. sudo apt-get install python3
    2. (for Debian/Ubuntu-based systems) or
    3. brew install python3
    4. (for macOS)

Task 2 - Install pip

pip is the default package manager for Python used to install, update, and manage Python packages and dependencies.

  1. Open a terminal window
  2. Run the following command to install pip:
    1. sudo apt-get install python3-pip
    2. (for Debian/Ubuntu-based systems) or
    3. python3 -m ensurepip --upgrade
    4. (for macOS; pip is normally bundled with Python 3 installed from python.org or Homebrew)

Task 3 - Verify

  1. Open a terminal window
  2. Run the following commands to verify installation:
python3 --version
pip3 --version

or

python --version
pip --version

If you see version information, the installation was successful.

You may need multiple Python versions available on your machine, depending on the requirements of your project and the external tools and libraries required.

Python: Windows

Task 1 - Install Python (includes pip)

  1. Go to the Python download page at https://www.python.org/downloads/windows/
  2. Click the “Download Python” button for the latest version of Python
  3. Read and follow the official instructions here (things change; adapting is key!): https://docs.python.org/3/using/windows.html
  4. Run the installer file that you downloaded as an Administrator, checking both options: 
    1. Keep the first checkbox checked (it is checked by default).
    2. Also check “Add Python to PATH” during the installation process
  5. Click “Install Now” to install Python (which will include pip)

Task 2 - Activate the New Environment

Close and reopen the command prompt or PowerShell window to activate the new environment.

Task 3 - Verify Installation

  1. Open a command prompt or PowerShell window
  2. Run the following commands to verify installation:
python3 --version
pip3 --version

or

python --version
pip --version

If you see version information, the installation was successful.

Python Libraries

Python libraries are collections of pre-written code that can be imported and used in your own programs, saving time and effort when developing complex applications.

Python Standard Library

Any installation of Python will include the standard library which includes a rich set of modules providing access to various system functionalities such as operating system interfaces, file I/O, network programming, data manipulation, and much more.

External Libraries

In addition, Python has a vast ecosystem of external libraries for various purposes, including data analysis, scientific computing, web development, machine learning, artificial intelligence, and more.

Subsections of Python Libraries

Python Standard Library

Python comes with a vast library of modules that are included in any installation of Python, known as the Python Standard Library.

These modules offer a wide range of functionality that can be used for various tasks such as working with data, networking, file handling, and much more.

Here is a brief introduction to some of the commonly used modules in the Python Standard Library:

os

This module provides a way of interacting with the operating system, allowing you to access system files and directories, work with environment variables, and much more.

sys

This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It allows you to manipulate the Python runtime environment and perform system-specific operations.

datetime

This module provides classes for working with dates and times. It allows you to create, manipulate, and format dates and times and perform calculations with them.
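A quick sketch of date arithmetic with datetime (the dates here are arbitrary examples):

```python
from datetime import date, timedelta

today = date(2024, 1, 31)             # construct a specific date
next_week = today + timedelta(days=7) # date arithmetic with timedelta
print(next_week.isoformat())          # 2024-02-07
print((next_week - today).days)       # 7
```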

math

This module provides mathematical functions such as trigonometric functions, logarithmic functions, and many others. It also includes constants such as pi and e.

random

This module provides functions for generating pseudo-random numbers. It can be used for simulating random events, creating games, and much more.

re

This module provides support for regular expressions, a powerful tool for text processing. It allows you to search for patterns in text, extract specific parts of text, and perform various operations on text.
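A small example of searching and replacing with re (the pattern and sample text are illustrative):

```python
import re

text = "Orders: A-101, B-202, C-303"
ids = re.findall(r"[A-Z]-\d+", text)    # every match of letter-dash-digits
print(ids)                              # ['A-101', 'B-202', 'C-303']
print(re.sub(r"\d+", "#", "room 42"))   # replace runs of digits: room #
```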

urllib

This module provides a high-level interface for working with URLs and URIs. It allows you to retrieve data from web pages, download files, and much more.

json

This module provides support for working with JSON (JavaScript Object Notation), a lightweight data interchange format. It allows you to encode and decode JSON data, convert JSON data to Python objects, and vice versa.
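Encoding and decoding with json looks like this (the record is a made-up example):

```python
import json

record = {"name": "Ada", "skills": ["python", "sql"], "active": True}
encoded = json.dumps(record)   # Python dict -> JSON string
decoded = json.loads(encoded)  # JSON string -> Python dict
print(decoded["skills"][1])    # sql
```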

argparse

This module provides a way of creating command-line interfaces. It allows you to specify arguments and options for your program and provides help messages and error handling.
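A minimal argparse sketch (the argument names are invented for illustration; a real script would call parser.parse_args() with no arguments to read sys.argv):

```python
import argparse

parser = argparse.ArgumentParser(description="Summarize a CSV column")
parser.add_argument("path", help="input file")
parser.add_argument("--column", default="value", help="column to summarize")

# Parse an explicit list here just to demonstrate the behavior
args = parser.parse_args(["data.csv", "--column", "price"])
print(args.path, args.column)   # data.csv price
```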

Years of Experience

For the most part, teams assume analysts can master basic Python syntax in a matter of weeks.

It’s learning and using the vast array of libraries available that can take many years of experience.

Learning how to use this freely available code can be very valuable.

Official Documentation

Python External Libraries

Python has a vast ecosystem of external libraries for data analytics, visualization, and statistical processing. Here are some of the most popular and widely used libraries:

NumPy

NumPy is a powerful library for numerical computing in Python. It provides a high-performance multidimensional array object, along with tools for working with these arrays. NumPy is widely used in scientific computing and data analysis, and is the foundation for many other Python libraries.
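A taste of NumPy's array operations (assuming NumPy is installed; the values are arbitrary):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a.mean())           # 2.5 - mean of all elements
print(a.sum(axis=0))      # [4 6] - column sums
print((a * 10).tolist())  # [[10, 20], [30, 40]] - vectorized arithmetic
```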

pandas

pandas is a library for data manipulation and analysis. It provides a high-performance DataFrame object for working with structured data, and includes tools for data cleaning, merging, and reshaping. pandas is widely used in data science and machine learning, and is a key component of the PyData ecosystem.

Matplotlib

Matplotlib is a library for creating static, animated, and interactive visualizations in Python. It provides a wide range of plotting tools and options, and can create a variety of charts, plots, and graphs. Matplotlib is widely used in scientific computing, data analysis, and machine learning.

Seaborn

Seaborn is a library for creating statistical visualizations in Python. It provides a high-level interface for creating a variety of statistical charts, plots, and graphs, including heatmaps, bar plots, and scatter plots. Seaborn is built on top of Matplotlib and integrates well with pandas data structures.

Scikit-learn

Scikit-learn is a library for machine learning in Python. It provides tools for data preprocessing, feature selection, model selection, and evaluation, and includes a wide range of supervised and unsupervised learning algorithms. Scikit-learn is widely used in data science and machine learning, and is the foundation for many other Python machine learning libraries.

TensorFlow

TensorFlow is a library for machine learning and deep learning in Python. It provides tools for building and training deep neural networks, and includes a wide range of pre-built models for image recognition, natural language processing, and more. TensorFlow is widely used in artificial intelligence, data science, and machine learning.

PyTorch

PyTorch is a library for machine learning and deep learning in Python. It provides tools for building and training deep neural networks, and includes a wide range of pre-built models for image recognition, natural language processing, and more. PyTorch is known for its dynamic computational graph, which enables flexible and efficient model building.

More

These are just a few of the many external libraries available for data analytics, visualization, and statistical processing in Python.

Each library has its own strengths and use cases, so it’s important to know enough about the major options to be able to choose the right tool for the job.

Python Tools

When you first install Python, you have access to the Standard Library.

However, to expand your capabilities and work with various Python projects, you want to install additional packages and dependencies.

Python offers a range of tools that make it easy to install, manage, and maintain these packages and dependencies.

We introduce just some of the popular tools along with recommendations for new personal projects.

Recommended Approach

Since our goal is to get you started quickly, here’s the recommended way to help maximize these benefits from the beginning.

In each new project, create a pyproject.toml file that’s configured to use the following tools. We provide an example file that you can customize.

Don’t get too attached to preferences - each workplace will likely have their own standard set of preferred tools and processes.

⭐ Configure with pyproject.toml ⭐

Use pyproject.toml for configuration. It can help remind us to set up our virtual environment, install our dependencies, and format and lint our files for correctness and ease of use.

The pyproject.toml file can be used to configure these recommended tools.

  • build-system: Hatch is a dependency management tool that can be used to publish Python packages to PyPI. Hatchling is a build backend for Hatch that is used to build packages. Both tools use the pyproject.toml file to configure the package’s metadata and dependencies.

  • tool.black: Black is a Python code formatter that uses a pyproject.toml file to configure its behavior. You can specify options such as line length and whether to use single or double quotes in the pyproject.toml file.

  • tool.pyright: Pyright is a static type checker for Python that can use a pyproject.toml file to configure its behavior. You can specify options such as which files to include or exclude from type checking, and which Python version to use in the pyproject.toml file.

  • tool.ruff: Ruff is a fast Python linter (written in Rust) that reads its configuration from the tool.ruff section of the pyproject.toml file. You can specify options such as which lint rules to enable, the line length, and which files to exclude.
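Taken together, a minimal pyproject.toml wiring up these tools might look like the sketch below. The project name, version, and option values are illustrative, and the exact configuration keys can vary between tool versions, so check each tool's documentation:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-analytics-project"   # illustrative name
version = "0.1.0"
dependencies = ["pandas"]

[tool.black]
line-length = 88

[tool.ruff]
line-length = 88
select = ["E", "F", "I"]        # pycodestyle, pyflakes, import sorting

[tool.pyright]
include = ["src"]
```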

Package Managers

Package managers allow you to fetch and install packages from the internet into your Python environment. Two widely used package managers in Python are pip and conda.

⭐ pip ⭐

pip is the default package manager for Python and makes it easy to install, update, and manage Python packages and dependencies. It is an essential tool for working with Python projects.

conda

conda is another popular package manager for Python, often used with the Anaconda or Miniconda distributions. It can be used alongside or as an alternative to pip.

Environment Managers

Python projects often require different versions of Python and different packages, making it essential to maintain and activate different environments as we work. Two widely used environment managers are venv and conda.

⭐ venv ⭐

venv is the default environment manager for Python and allows you to create and manage virtual environments within a Python project.
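The basic venv workflow looks like this (macOS/Linux paths shown; on Windows the activation script lives under .venv\Scripts instead, and the directory name .venv is just a common convention):

```shell
# Create, activate, and leave a virtual environment
python3 -m venv .venv                      # create the environment in .venv/
. .venv/bin/activate                       # activate it (Windows: .venv\Scripts\activate)
python -c "import sys; print(sys.prefix)"  # now points inside .venv
deactivate                                 # return to the system environment
```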

conda

conda can also be used as an environment manager in addition to its role as a package manager. It automatically activates a base environment upon installation and allows you to create and manage other environments as needed.

Python Formatters

Formatters are tools that help ensure consistent and readable code by automatically formatting Python code according to predefined styles and standards.

Many work environments will specify the tools and formats they prefer. Some may automatically apply formatting rules when code is pushed to a repository. The following recommendations are for personal projects.

⭐ Format with black ⭐

Black is a popular and highly-regarded Python formatter that aims to provide a simple and opinionated approach to code formatting. It reformats entire files in place, making it easy to integrate into automated workflows.

isort

isort is a Python library and command-line tool that helps ensure Python imports are properly sorted and formatted. It can automatically group imports by type and optimize the order of imports to reduce conflicts and improve readability. The Ruff linter includes isort functionality.

Python Linters

Linters are tools that analyze code and report on potential errors, style violations, and other issues. These tools help ensure that code is well-written, maintainable, and conforms to best practices.

⭐ Lint with Ruff ⭐

A new Python linter, Ruff, is gaining popularity. Ruff is a Rust-powered linter that aims to be dramatically faster than traditional linters like Pylint and Flake8, reimplementing many of their rules in a single fast tool.

Ruff offers several features that make it a promising option for Python developers, including integration with editors like VS Code, support for custom rule sets, and an easy-to-use command-line interface.

Ruff is configured using the standard pyproject.toml file and includes isort functionality.

Python Type Checkers

Type checkers are tools that analyze your code and attempt to find type-related errors before code runs. This helps catch errors earlier in the development process and can improve the overall quality of your code.

⭐ Typecheck with Pyright ⭐

Pyright is a popular type-checking tool for Python that uses static analysis to identify type-related errors in your code. Pyright supports Python 3.6 and above, and can be used in a variety of development environments, including VS Code and other editors.

When used in VS Code, Pyright provides real-time feedback and suggestions as you code, helping you catch errors and improve the overall quality of your code. Pyright also supports type annotations, allowing you to provide additional information about the types of variables and function arguments in your code.

Package Development and Distribution

⭐ Hatch and Hatchling ⭐

Hatch is a command-line tool for managing dependencies and environment isolation for Python developers. It allows developers to easily configure, version, specify dependencies for, and publish packages to PyPI. Hatch can be used to create new Python packages, add dependencies, and manage virtual environments.

Hatchling is a build backend for Hatch that is used to build Python packages. It provides a simple, declarative configuration file format that allows developers to specify the dependencies, entry points, and other package metadata. Hatchling can be used to build packages in different formats, including source distributions and wheels, and to upload them to PyPI or other package repositories.

Setuptools

Setuptools is a package development and distribution tool for Python that provides features such as package metadata, package installation, and dependency management. Setuptools is widely used and integrates with many other Python tools and frameworks, making it a popular choice for package development and distribution.

Flit

Flit is a lightweight tool for building and publishing Python packages. Flit provides features such as dependency management, virtual environments, and metadata management, and is designed to be simple and easy to use. Flit also supports building wheels for distribution, making it a good choice for creating packages that can be easily installed on different systems.

Python Build Tools

Using a build tool or command line runner in Python can help new analysts and developers automate repetitive tasks, streamline their workflow, and avoid having to retype complex commands.

There are several build tools available for Python projects that help automate the build process and manage dependencies.

Make

One such tool is Make, an older and widely used build tool that automates the building and testing of software. It is a powerful and flexible tool that can be used to manage complex build processes with many dependencies.

⭐ Build with Just ⭐

Another popular tool is Just, a newer command runner written in Rust. It uses a simple configuration file called a justfile, with a make-like syntax, to define tasks, and is designed to be fast and easy to use. Just is particularly useful for smaller projects that don’t require a full-fledged build system.

Python: AI and ML

Python is a popular programming language that has gained a lot of traction in the fields of artificial intelligence (AI) and machine learning (ML).

Python offers a range of libraries and frameworks that make it easier to develop and deploy AI and ML applications, including:

  • NumPy: A library for numerical computing in Python, NumPy provides support for large, multi-dimensional arrays and matrices, as well as a range of mathematical functions for working with this data.

  • pandas: A library for data manipulation and analysis in Python, pandas provides support for working with structured data in a variety of formats, including CSV, Excel, SQL databases, and more.

  • Scikit-learn: A library for machine learning in Python, Scikit-learn provides a range of algorithms for classification, regression, and clustering, as well as tools for model selection and evaluation.

  • TensorFlow: A popular library for machine learning and deep learning in Python, TensorFlow provides support for building and training neural networks, as well as tools for deploying models on a variety of platforms.

  • Keras: A high-level neural networks API in Python, Keras provides a simple and intuitive interface for building and training deep learning models, as well as support for a range of backends, including TensorFlow.

Python: Environments

Python environments can be confusing at first, but they are essential for developing and deploying Python applications.

Overview

At a high level, you can think of Python environments as isolated “containers” that provide a controlled environment for your code to run in.

They are similar in some ways to operating systems, in that they provide a layer of abstraction between the code and the underlying system, and allow multiple applications to run independently without interfering with each other.

Python Environments

In the case of Python environments, the “container” is a self-contained installation of the Python interpreter and associated packages and dependencies.

Environments allow you to install and manage different versions of Python and packages without affecting other environments or your system Python installation.

By creating separate environments for each project, you can ensure that each project has access to the correct versions of Python and packages, and that packages do not conflict with each other.

This can help ensure that your code works consistently across different machines and environments, and can make it easier to manage and deploy your code.

Importance

There are several reasons why Python environments are important:

Version management

Different projects may require different versions of Python or packages. By creating separate environments for each project, you can ensure that each project has access to the correct versions of Python and packages.

Dependency management

Python packages often have complex dependencies on other packages. By isolating each project in its own environment, you can avoid conflicts between different packages and ensure that each project has the correct dependencies installed.

Reproducibility

By using environments, you can ensure that your code works consistently across different machines and environments. This is important when collaborating with others or when deploying your code to a production environment.

Tools

There are several tools available for managing Python environments, including:

  • virtualenv
  • conda
  • pipenv

These tools make it easy to create, manage, and switch between environments, and can be integrated with development tools like IDEs and text editors.

Create / Activate / Install

In practice, creating a new environment involves using a tool like virtualenv or conda to create a new environment directory, activating the environment, and then installing the required packages and dependencies using pip or conda.

Using The Active Environment

Once the environment is set up, you can run your code within that environment, and any packages you install will be isolated to that environment.

Python: Fundamentals

Here is a quick summary of some basic concepts to get started programming with Python.

Human Languages

Python introductions are available in many human languages. See https://wiki.python.org/moin/Languages for more.

Syntax

Python has a simple and consistent syntax which makes it easy to learn and read.

Indentation is used to indicate a block of code, as opposed to curly braces or keywords like ‘begin’ and ’end’ in other languages.

Indentation matters! (a tab is not the same as spaces)

Comments

Comments are denoted by the hashtag or pound sign (#). Any text that follows the hashtag on the same line is considered a comment and is ignored by the Python interpreter. Comments can be used to provide additional information about the code or to temporarily disable parts of the code during development or testing.

Variables

Variables are used to store values, like numbers or text strings. In Python, you can create a variable by assigning a value to it, like this:

x = 5

Data Types

Python has several built-in data types, including integers (whole numbers), floating point numbers, and strings (text). For example:

x = 5         # an integer
y = 3.14      # a floating-point number
z = "hello"   # a string

Basic Operations

Python supports basic mathematical operations like addition, subtraction, multiplication, and division, using the operators +, -, *, and / respectively:

x = 5
y = 3
print(x + y)   # prints 8

Conditional Statements

Conditional statements allow you to check if certain conditions are true, and then run different code depending on the result. In Python, elif is used as the keyword for “else if”. For example:

x = 5
if x > 0:
    print("x is positive")
elif x == 0:
    print("x is zero")
else:
    print("x is negative")

Functions

Functions are blocks of code that can be reused throughout your program. They can take input values called parameters, and return one or more output values. For example:

def double(x):
    return x * 2

result = double(5)
print(result)  # prints 10

Loops

Loops are used to repeat a block of code multiple times. Python has two types of loops: for loops and while loops. For example:

for i in range(5):
    print(i) # will print 0, 1, 2, 3, 4
x = 0
while x < 5:
    print(x)
    x += 1

How To Learn

The best way to learn is by doing - experiment, type code, and build personal projects to gain skills. This is the only course in the program where we work through all the foundational topics. Other courses will jump right in to Python programming by example. It’s best to take this course early in the program and master these basics early.

Big Wins

Reddit comment on suggested “big wins in Python” - when you see these, know they’re considered pretty useful skills.

https://www.reddit.com/r/learnpython/comments/10ka2dm/comment/j5pciik/

  • extended libraries (e.g. pandas, NumPy)
  • context managers (with open() as file:)
  • lambda functions, zip(), map(), filter(), enumerate()
  • comprehensions (concise and very impressive/useful)
  • regular expressions
  • sorting (faker is pretty useful, too)
  • type-hinting - easy and recommended! No more wondering if x is a string or an int - make it so! This is pretty new and valuable skill. It looks a lot like Swift.
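A few of these “big wins” in one short sketch (the data is invented for illustration):

```python
# type hints: no more wondering what prices holds
prices: list[float] = [19.99, 5.25, 3.50]
names = ["book", "pen", "tape"]

# comprehension + zip + enumerate in one line
labeled = [f"{i}: {n} ${p}" for i, (n, p) in enumerate(zip(names, prices))]
print(labeled[0])   # 0: book $19.99

# lambda + sorted: order items by price
cheapest = sorted(zip(names, prices), key=lambda pair: pair[1])[0]
print(cheapest)     # ('tape', 3.5)

# context manager: the file is closed automatically
with open("items.txt", "w") as f:
    f.write("\n".join(names))
```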

Python: Organization

This page provides an overview of the fundamental building blocks of Python code organization.

Variable

A variable is a named memory location that holds a value.

Expression

An expression is a combination of operators (e.g. +) and operands (e.g. 1 or age) that resolves to a value.

Statement

A statement is the smallest unit of a Python script. All scripts are made up of statements. Some statements are expressions, while others are not (e.g. print(“hello”)).

Function

A function is a reusable block of code that performs a specific task. Functions help to break up large programs into smaller, more manageable pieces, and can be reused across different parts of a program or across different programs.

Class

A class is a blueprint for creating objects in Python. Classes define the attributes (data fields) and methods (functions) that all objects of a certain type will have. Using classes, we can create multiple instances (objects) of a certain type, each with their own unique attributes.

In Python, classes are optional, but many modules use an object-oriented approach to organizing code.

Object

An object is a specific instance of a class that holds real data. For example, if we have a Dog class, we can create two objects, sam = Dog("Sam", 3) and fido = Dog("Fido", 4), each with their own name and age attributes.
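A minimal Dog class along these lines might look like this sketch:

```python
class Dog:
    """A simple class: each instance gets its own name and age."""

    def __init__(self, name, age):
        self.name = name   # attribute (data field)
        self.age = age

    def describe(self):    # method (function attached to the class)
        return f"{self.name} is {self.age} years old"

sam = Dog("Sam", 3)        # two distinct objects (instances) of Dog
fido = Dog("Fido", 4)
print(sam.describe())      # Sam is 3 years old
print(fido.name)           # Fido
```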

File / Module

In Python, a module is a file containing Python definitions and statements. Each .py file is a Python script and, by definition, a Python file is also a module. The name of the module is the same as the name of the file, without the .py extension.

Package

A package is a way of organizing related modules together. Packages allow us to group together related functionality in a way that is easy to import and use. A package is simply a directory that contains one or more Python modules.

Library

A library is a collection of packages and modules that provides a set of pre-written code for specific tasks. For example, the Python Standard Library is a large collection of libraries that are included with Python and provide a wide range of functionality, from file input/output to regular expressions to networking.

Python Distribution Methods

Python also has some special entities related to distributing Python code to users.

Python Distributions

A distribution is a bundle of Python software, which typically includes the Python interpreter, the Python standard library, and various additional packages and tools.

There are several popular Python distributions available, such as Anaconda, which includes many data science packages and tools, and Python(x,y), which is geared towards scientific computing.

Python distributions can make it easier to set up and manage a Python environment, especially for beginners, and come with many pre-installed packages and tools.

Python Wheels

A wheel is a built distribution format: a self-contained archive that can be used to easily distribute and install Python packages across different systems.

Wheels are different from source distributions or packages, which are typically distributed as source code and must be compiled or built before they can be installed.

A wheel is essentially a ZIP archive that contains the files and dependencies necessary for a Python package to be installed on a system. It makes installation faster and easier, as the package does not need to be built from source code each time it is installed on a new system.

Wheels can be platform-specific: a wheel containing compiled extensions built for one operating system or architecture may not work on a system with a different operating system or architecture, while pure-Python wheels run anywhere. To address this, Python has a system of tags to identify which platforms a wheel is compatible with, so the correct version of the wheel can be downloaded and installed on each system.

Understanding Organization

Understanding the fundamental building blocks of Python organization and distribution is essential for employing available Python tools and writing clean, well-structured code that is easy to read, maintain, and reuse.

Python: pandas

pandas is a popular open-source library for data analytics in Python. It provides powerful tools for working with tabular data, such as data frames and series. With pandas, you can easily read, manipulate, and analyze data in a variety of formats, including CSV, Excel, SQL databases, and more.

One of the key features of pandas is its ability to handle missing data. pandas provides a number of methods for filling in missing data, interpolating values, and dropping missing data altogether. This is a critical feature for data analytics, as real-world data is often incomplete or inconsistent.

pandas performs complex data transformations and aggregations. With pandas, you can group data by one or more columns, apply functions to subsets of data, and pivot data to reshape it in different ways.

pandas provides tools for merging and joining data from multiple sources, making it easy to combine data from different sources into a single data set.
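A minimal sketch of the missing-data handling and grouping described above, assuming pandas is installed (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10.0, None, 7.0, 9.0],   # one missing value
})

# fill the missing score with the column mean, then aggregate per team
df["score"] = df["score"].fillna(df["score"].mean())
totals = df.groupby("team")["score"].sum()
print(totals["b"])   # 16.0
```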

Being good with pandas is a valuable skill.

Faster Options

pandas can be a bit slow. Options include:

  • Moving to the faster pandas 2.0
  • Trying Polars

New! Read More about this important 2.0 update

pandas 2.0

pandas 2.0 is a significant update to the beloved pandas library.

Learn more at:

Polars

Polars is a data manipulation library written in Rust that aims to provide a fast, memory-efficient alternative to pandas for large-scale data processing. It’s still a relatively new library, having been first released in 2019, and its user base and ecosystem are still growing.

Polars has a lot of potential as a fast and memory-efficient data manipulation library for large datasets, but it may not yet have the same level of maturity and ecosystem as pandas.

Python: Project Management

There are several ways to manage dependencies and project metadata in Python. While they differ in their syntax and capabilities, they can all be used to specify the dependencies required for a Python project.

pyproject.toml

pyproject.toml is a configuration file used in modern Python projects to specify various aspects of the project, including its dependencies, build settings, and package metadata. It is part of the Python Packaging ecosystem, which provides a standardized way to manage Python packages and their distribution.

pyproject.toml is similar to Project.toml, used in Julia projects.

pyproject.toml is used by the Poetry package manager, a popular tool for managing dependencies and building Python projects. Poetry relies on the pyproject.toml file to define the project’s dependencies, and uses this information to create a virtual environment for the project and install the necessary dependencies.

It provides a simple, declarative way to manage project dependencies, without the need for separate requirements.txt or setup.py files. It also allows developers to specify other project metadata, such as its version number, author, and license.
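A minimal pyproject.toml sketch in Poetry’s format (the project name, versions, and dependencies here are hypothetical):

```toml
[tool.poetry]
name = "my-analytics-project"
version = "0.1.0"
description = "An example analytics project"
authors = ["Your Name <you@example.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```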

With PEP 621, pyproject.toml is the standard for managing Python project metadata, and it is increasingly used by many popular Python packages and tools.

By adopting pyproject.toml and the Python Packaging ecosystem, developers ensure that their projects are well-organized, maintainable, and easily sharable with others in the Python community.

Poetry

Poetry is a modern Python packaging and dependency management tool that helps simplify the process of managing dependencies and building projects. It allows developers to define their project dependencies in a declarative way using a simple pyproject.toml file, rather than relying on separate requirements.txt or setup.py files.

One of the key advantages of using Poetry is that it provides a streamlined workflow for managing dependencies and virtual environments. It can automatically create and manage virtual environments for each project, isolating project dependencies and avoiding conflicts with system-level packages. Poetry also provides powerful tools for managing dependencies, including automatic dependency resolution, dependency locking, and the ability to publish and install packages from both PyPI and private repositories.

Another advantage of using Poetry is that it provides a simple, consistent interface for managing all aspects of a Python project, from dependency management to building and publishing packages. This makes it easier for developers to focus on writing code and building their projects, without getting bogged down in the details of project management.

Legacy Project Management

Although pyproject.toml is the new standard for managing dependencies and metadata in modern Python projects, you may still encounter older projects that use requirements.txt and setup.py.

requirements.txt is a file used to specify a project’s dependencies in a simple, text-based format. Each line in the file lists a package name, usually followed by a version specifier such as ==. This format is easy to read and edit, and is supported by many Python tools and frameworks. However, it lacks some of the advanced features provided by pyproject.toml, such as the ability to specify package metadata and build settings.
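A small requirements.txt sketch (the package pins are illustrative):

```text
pandas==2.0.3
requests==2.31.0
```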

setup.py is a script used to build and distribute Python packages. It includes metadata about the package, such as its name, version, and author information, as well as instructions for building and installing the package. Although setup.py is still used in many projects, it has some limitations, such as the inability to specify dependencies with the same level of detail as pyproject.toml.

Recommendations

If you’re starting a new Python project from scratch, it’s generally not recommended to use requirements.txt or setup.py as the primary method for managing dependencies and metadata. Instead, you should use pyproject.toml, which is the modern standard for these tasks.

However, if you’re working with an existing project that uses requirements.txt or setup.py, it’s often necessary to keep these files around for compatibility reasons. For example, if you’re working on a project that is already deployed to production and relies on requirements.txt to specify its dependencies, you may not want to switch to pyproject.toml right away, since this could cause compatibility issues or require a significant amount of testing.

Python: Scripts

A Python script is a file containing Python code that can be executed by the Python interpreter.

Scripts can be used to automate tasks, perform calculations, or interact with other software systems.

Run A Script

To run a Python script, you need to have the Python interpreter installed on your system. Once you have installed Python, you can run a script by opening a terminal or command prompt, navigating to the directory containing the script, and typing python myscript.py (replacing myscript with the name of your script).

For example, if you have a script named myscript.py in a directory called myproject, you can run it by opening a terminal or command prompt, navigating to the myproject directory, and typing python myscript.py.

If your script requires any command-line arguments, you can pass them to the script by including them after the script name. For example, if your script requires a filename as an argument, you could run it like this: python myscript.py myfile.txt.

When you run a Python script, the interpreter reads the code in the file and executes it. Any output produced by the script is printed to the console.
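A minimal sketch of such a script (assuming it is saved as myscript.py; the line-counting task is just an example):

```python
# myscript.py - print the number of lines in a file given on the command line
import sys


def count_lines(text):
    """Return the number of lines in a block of text."""
    return len(text.splitlines())


if __name__ == "__main__" and len(sys.argv) > 1:
    filename = sys.argv[1]  # first command-line argument, e.g. myfile.txt
    with open(filename) as f:
        print(count_lines(f.read()))
```

You would run it as python myscript.py myfile.txt, and the line count would be printed to the console.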

Python: Try/Except

Code Might Fail

It’s important to use try/except/finally whenever your application could fail through no fault of your own.

Why Plan for Errors?

People ask:

  • Why plan for errors?
  • Shouldn’t we fix all errors in our code before we release it?
  • Why do we need try/except/finally?

Perfect Code Can Still Have Exceptions

We should always strive to fix all coding and logic errors. However, sometimes our code can be perfect - but exceptions can still happen. try/except/finally is a way to gracefully handle unexpected errors and prevent our program from crashing.

Example

Suppose you write a script to read baseball_game_results.csv each night at midnight.

It runs fine until someone changes the filename to rslts.csv.

Now, your code terminates with an ugly error because the necessary file can’t be found.

To code professionally, we can use try/except to handle this error gracefully.

try:
    # Attempt to open the file
    with open('baseball_game_results.csv', 'r') as f:
        results = f.read()  # do something with the file
except FileNotFoundError:
    # Handle the case where the file is not found
    print('ERROR: File not found. Please rename the file to baseball_game_results.csv')
finally:
    # Clean up any resources used by the code
    # (the with statement already closes the file handle)
    print('Nightly processing attempt finished.')

Other Programming Languages

Other programming languages use something very similar, but might use the keywords try/catch/finally. As in “try this, and if you catch an exception, do this.”

Throwing Exceptions

Exceptions are thrown by nested functions, up, up, up, until some level “catches” the exception and deals with it, or the program terminates with an ugly error.

It’s important to handle exceptions gracefully and prevent our programs from crashing.
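A small sketch of how an exception raised deep in a call stack can be caught at a higher level (the function names are made up):

```python
def parse_record(text):
    # Deep in the call stack: this may raise ValueError
    return int(text)


def load_record(text):
    # Middle layer: does not handle the error, so it propagates upward
    return parse_record(text)


def main(text):
    # Top level: "catches" the exception and deals with it
    try:
        return load_record(text)
    except ValueError:
        return None


result = main("not a number")  # the error is handled; the program keeps running
```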

Python: Uninstalling

Python seems to install a bit like a virus and traces can get everywhere.

At times, removing an old version of Python can be challenging.

Cleaning up unneeded Python installations can help avoid conflicts between different Python versions and packages.

Using package managers and virtual environments can help.

Uninstalling

Installations can leave traces on your system that may no longer be needed. Here are some recommendations for cleaning up old Python installations:

  1. Uninstall Python from the Control Panel: If you have installed Python using the official installer on a Windows machine, you can uninstall it from the Control Panel. Simply search for “Add or Remove Programs” in the Start menu, then find the Python installation you want to remove and click “Uninstall”.

  2. Delete Python folders: Python installations typically create folders on your system that can be deleted to remove the installation. The main folders are typically located in C:\Python or C:\Users\{user}\AppData\Local\Programs\Python. Be careful when deleting folders to ensure you are only deleting the correct installation.

  3. Clean up environment variables: Python installations can add environment variables to your system that may no longer be needed. You can clean these up by going to the System Properties window, selecting “Advanced System Settings”, then clicking the “Environment Variables” button. Here you can remove any Python-related environment variables that are no longer needed.

Management Tools

Managing Python well can help avoid issues. The following recommendations can help.

  1. Use a package manager: Using a package manager like conda or pipenv can help keep track of Python installations and dependencies. These package managers allow you to create isolated environments for specific projects, so you can avoid installing unnecessary packages and versions of Python.

  2. Use virtual environments: Another way to manage multiple Python installations is to use virtual environments. Virtual environments allow you to create isolated environments for specific projects, so you can avoid conflicts between different Python versions and packages. You can create virtual environments using the venv module or third-party tools like virtualenv or conda.

R

Data Analysis and Statistics

R is a programming language designed for data analysis and statistical computing. It is widely used by data scientists, statisticians, and researchers for various purposes.

Why R?

For developers, R offers several advantages over other programming languages:

  • R has a focus on data analysis and statistical computing, with a range of built-in functions and libraries for these tasks.
  • It provides a high-level interface for data manipulation and visualization, making it easy to explore and analyze complex data sets.
  • R has a large and active community of users and contributors, with many resources and tutorials available.

R Syntax

  • R has a simple and intuitive syntax, with a focus on data manipulation and analysis.
  • It supports various data types, including vectors, matrices, data frames, and lists.
  • R has built-in support for statistical functions and libraries.

Free Resources for Learning R

  • R Project: The official R website, with downloads, documentation, and resources.
  • R Tutorial: A comprehensive tutorial for learning R, covering the basics of data analysis and visualization.
  • R for Data Science: A book by Hadley Wickham and Garrett Grolemund, covering the fundamentals of data science with R.
  • Coursera: Various online courses on R programming and data science.

R Frameworks and Libraries

  • R has a large and diverse ecosystem of libraries and packages, catering to various data science use cases such as data manipulation, visualization, machine learning, and more.
  • Popular R libraries include ggplot2, dplyr, tidyr

File Extensions

  • .R

Rust

Powerful and Safe Programming Language

Rust is an open-source programming language developed by Mozilla. It aims to provide a fast and safe alternative to C and C++, with a focus on memory safety and concurrency.

Why Rust?

For developers, Rust offers several advantages over other programming languages:

  • Rust has a focus on safety, with memory and thread-safety enforced at compile-time.
  • It provides low-level control like C and C++ but without the risk of memory errors and vulnerabilities.
  • Rust’s borrow checker prevents data races and other concurrency issues.
  • It has a growing ecosystem and community, with a range of libraries and frameworks available.

Rust Syntax

  • Rust has a clean and modern syntax, influenced by C and other systems programming languages.
  • It uses static typing and supports various data types, including integers, floats, strings, and arrays.
  • Rust has built-in support for concurrent programming with threads and channels.

Free Resources for Learning Rust

  • The Rust Programming Language: The official Rust website, with documentation, tutorials, and downloads.
  • Rust By Example: An interactive introduction to Rust, with hands-on coding exercises.
  • Rustlings: A collection of small exercises to get started with Rust.
  • Rust Cookbook: A collection of practical examples and snippets for learning Rust.
  • Rust Playground: An online environment for writing and testing Rust code.

Rust Frameworks and Libraries

  • Rust has a growing ecosystem of libraries and frameworks, catering to various use cases such as web development, game development, and systems programming.
  • Popular Rust frameworks and libraries include Actix, Rocket, and Serde.

File Extensions

  • .rs

Typst

Typst is a modern typesetting system designed for creating professional-looking documents, with a focus on simplicity and ease of use.

It offers several advantages over traditional word processors and other typesetting systems, such as:

  • Ease of Use: Typst is designed to be easy to learn and use, even for beginners.

  • Flexibility: Typst provides powerful tools for creating and formatting complex mathematical equations, symbols, tables, and figures.

  • Portability: Typst documents can be easily converted to a variety of formats, including PDF, HTML, and other document types.

Typst for Scientific Writing

Typst is a preferred tool for writing scientific documents such as research papers, technical reports, capstone project reports, and theses. It provides powerful tools for creating and formatting complex equations and symbols, making it ideal for scientific writing.

Typst Syntax

Typst uses dollar signs ($) for math mode, like LaTeX. For more on syntax, see:

Integration

Typst can be used in combination with other tools, such as BibTeX for managing bibliographic references and citations.

Free Resources for Learning Typst

File Extensions

Here are some common file extensions used with Typst.

  • .typ: The main file extension for Typst documents.

  • .bib: The file extension for bibliographic data files, used with BibTeX to manage references and citations in reports and documents.

See Also

Chapter 2

Tools

This chapter introduces some popular tools.

Chocolatey

Chocolatey is a popular package manager for Windows that makes it easy to install, update, and manage software packages. It offers a large selection of packages and advanced features. See also Winget.

Docker

Docker is a platform for building, shipping, and running applications in containers.

Git

Git is a popular version control system that allows developers to track changes to their code and collaborate with others on a project. It provides a way to manage and organize code, and allows for easy branching and merging. Git is widely used in software development, and is an essential tool for any developer’s toolkit.

GitHub

GitHub is a web-based platform that provides a range of features for managing Git repositories. It allows developers to host their code online, collaborate with others on a project, and track issues and bugs. GitHub is widely used in the open-source community and is a popular tool for managing software development projects.

Homebrew

Homebrew is a package manager for macOS that makes it easy to install, update, and manage software packages.

Jupyter

Jupyter is a popular web-based interactive computing environment that allows data analysts to create and share documents containing live code, visualizations, and narrative text.

PowerShell

PowerShell is a command line shell and scripting language developed by Microsoft. It is designed to automate system administration tasks and provide an extensible platform for developers to write their own scripts and tools. PowerShell is widely used on Windows systems, and is becoming increasingly popular as a cross-platform tool for managing and automating IT infrastructure.

VS Code

Visual Studio Code, often referred to as VS Code, is a lightweight but powerful source code editor that is popular among developers. It is highly customizable and supports a wide range of programming languages, making it a versatile tool for developers of all skill levels. VS Code also has a large ecosystem of extensions that can be used to extend its functionality.

Winget

Winget is a newer lightweight package manager for Windows 10 developed by Microsoft that makes it easy to install, update, and manage software packages. See also Chocolatey.

Language-Specific Tools

In addition, there are several important language-specific tools.

Languages / Python / Tools / conda

Conda is a popular package manager for Python often used with the Anaconda or Miniconda distributions. See also pip.

Languages / Python / Tools / pip

pip is the default and widely-used package manager for Python that makes it easy to install, update, and manage Python packages and dependencies. It is an essential tool for working with Python projects. See also conda.

Subsections of Tools

Chocolatey

Chocolatey is a package manager for Windows, similar to Homebrew for macOS. It simplifies the installation, updating, and management of Windows software, including command-line tools, applications, and libraries. Chocolatey uses NuGet infrastructure and PowerShell to manage packages, making it a powerful tool for Windows users.

Alternatives

Microsoft has been developing Winget, an official package manager designed to be the native package manager for Windows. It is gaining new features and improvements over time.

To choose the best package manager for your needs, consider the following.

Community adoption

  • Both Chocolatey and Winget have growing communities.
  • Chocolatey has been around for longer and has a larger repository of packages.
  • As Winget gains traction, its community and package offerings will likely grow.

Official support

  • As an official Microsoft product, Winget may receive better long-term support and integration with the Windows ecosystem.
  • This could make it a more future-proof choice.

Features and functionality

  • Chocolatey has more mature features and a comprehensive set of tools.
  • However, Winget is expected to gain more features and improvements over time.

Docker

Docker is an open-source platform that automates the deployment, scaling, and management of applications by using containerization technology. It allows developers to package an application and its dependencies (libraries, configuration files, etc.) into a single, lightweight, and portable container. These containers can run consistently across different environments, simplifying application development, testing, and deployment.

Docker provides the following features.

Containerization

Docker uses containerization to isolate applications and their dependencies into separate, self-contained units. This approach ensures that each application runs in a consistent environment, reducing conflicts and improving security.

Image Management

Docker images are templates used to create containers. They are lightweight and can be easily shared, stored, and versioned. Docker Hub, the official public registry, hosts thousands of pre-built images for various programming languages, frameworks, and tools.

Portability

Docker containers can run on any system that supports Docker, regardless of the underlying infrastructure or platform. This makes it easy to deploy and migrate applications across different environments, such as development, testing, and production.

Scalability

Docker enables horizontal scaling of applications by allowing you to deploy multiple instances of the same container. This approach can help distribute the load across multiple resources and improve application performance.

Version Control

Docker images can be versioned and stored in registries, making it easy to rollback, upgrade, or downgrade applications as needed. This also facilitates collaboration among team members, as they can share and use the same image versions.

Ecosystem

Docker has a rich ecosystem of tools and services and many third-party tools and plugins integrate with Docker to extend its functionality.

Managing Containers

Docker containers can be managed with Kubernetes, a popular open-source container orchestration platform. Kubernetes is designed to automate the deployment, scaling, and management of containerized applications, including Docker containers.

Kubernetes provides features such as automatic scaling, self-healing, and load balancing. Kubernetes can manage Docker containers running on a single host or across a cluster of hosts, abstracting away the underlying infrastructure and providing a consistent and scalable platform for running containerized workloads.

Technologies such as Docker Swarm, Apache Mesos, Nomad, and OpenShift perform similar functions to Kubernetes.

Installation

The installation process for Docker depends on your operating system. Follow the instructions below based on your platform.

Common Files

When working with Docker, you’ll encounter several common files.

Dockerfile

File used to define the steps required to build a Docker image.

Dockerfile contains instructions such as

  • FROM - specifies the base image to use
  • RUN - runs commands to install dependencies and set up the environment
  • COPY - copies files from the host machine into the image
  • CMD - specifies the command to run when the container is started
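A minimal Dockerfile sketch using these instructions (the base image and file names are hypothetical):

```dockerfile
# Start from an official Python base image
FROM python:3.11-slim

# Install dependencies listed in requirements.txt
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the application code into the image
COPY app.py .

# Run the application when the container starts
CMD ["python", "app.py"]
```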

docker-compose.yml

Defines and runs multi-container Docker applications.

docker-compose.yml allows developers to define the services that make up the application, their dependencies, and how they are connected. This file can be used to start, stop, and manage containers in a multi-container application.
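A small docker-compose.yml sketch defining two connected services (the service names and images are hypothetical):

```yaml
services:
  web:
    build: .         # build this service from the Dockerfile in the current directory
    ports:
      - "8000:8000"
    depends_on:
      - db           # declares how the services are connected
  db:
    image: postgres:15
```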

.dockerignore

Like .gitignore in Git repositories, .dockerignore is used to specify files and directories that should be excluded from the Docker build context.

By excluding unnecessary files and directories, the Docker build process is faster and more efficient.

Dockerfile.dev

Dockerfile.dev is a Dockerfile variant for development environments.

It contains additional instructions for setting up a development environment, such as installing development tools and enabling debugging.

See Also

Learn more about Docker and the associated tools.

Subsections of Docker

Docker: Installation

Docker is an open-source platform that automates the deployment, scaling, and management of applications by using containerization technology.

Use Docker to create, manage, and deploy containerized applications.

Mac/Linux Users

  • Option 1: Official installation instructions. Follow instructions on the official Docker website. This is the most up-to-date and comprehensive guide to installing Docker on your system.

  • Option 2: Step-by-step installation guide. Check out our installation instructions for a step-by-step guide.

Windows Users

  • Option 1: Official installation instructions. Follow the instructions on the official Docker website. This is the most up-to-date and comprehensive guide to installing Docker Desktop on your Windows system.

  • Option 2: Step-by-step installation guide. Check out our installation instructions for a step-by-step guide.

Subsections of Docker: Installation

Docker: Mac/Linux

The best way to install Docker for Mac and Linux is by using Docker Desktop (for Mac) and Docker Engine (for Linux). Docker provides a complete development environment for containerized applications.

Warning: Docker is a resource-intensive application that may consume a significant amount of disk space, memory, and CPU resources. Installing and running Docker on your system may slow down your machine, especially if it has limited resources. Make sure your system meets the minimum requirements before installing Docker, and consider monitoring resource usage to ensure optimal performance.

Follow these steps to install Docker on Mac and Linux.

For Mac:

  1. Ensure your system meets the requirements:

    • macOS 10.14 (Mojave) or later
  2. Download Docker Desktop for Mac from the official Docker website.

  3. Run the installer:

    • Double-click the downloaded Docker Desktop Installer.dmg file and follow the on-screen instructions.
  4. Start Docker Desktop:

    • After the installation is complete, Docker Desktop should start automatically. If it doesn’t, you can launch it from the Applications folder.
    • You will see the Docker icon in the menu bar, indicating that Docker is running.
  5. Verify the installation:

    • Open a Terminal window.
    • Run the following command to check the Docker version.

    docker --version

  6. Run a test container to ensure that Docker is working correctly.

    docker run hello-world

For Linux:

  1. Choose the appropriate installation instructions for your Linux distribution from the official Docker Engine documentation.

  2. Follow the provided instructions to install Docker Engine on your system.

  3. Verify the installation.

    • Open a Terminal window.
    • Run the following command to check the Docker version.

    docker --version

  4. Run a test container to ensure that Docker is working correctly.

    docker run hello-world

Save Resources

Docker can consume significant system resources. You may want to stop Docker when you are not using it.

For Mac

  1. Locate the Docker icon in the menu bar, which is typically located in the upper-right corner of the screen.
  2. Click on the Docker icon to open the dropdown menu.
  3. Click on “Quit Docker Desktop” or “Exit” to stop Docker Desktop.

For Linux

  1. Open a Terminal window.
  2. Run the following command to stop the Docker daemon.

sudo systemctl stop docker

To start Docker again, simply launch the application from the Applications folder (Mac) or run the following command in a Terminal window (Linux):

sudo systemctl start docker

Docker: Windows

The best way to install Docker for Windows is by using Docker Desktop. Docker Desktop is an easy-to-use application that allows you to run containers on your Windows machine. It includes both Docker Engine and Docker Compose, providing a complete development environment for containerized applications.

Warning: Docker is a resource-intensive application that may consume a significant amount of disk space, memory, and CPU resources. Installing and running Docker on your system may slow down your machine, especially if it has limited resources. Make sure your system meets the minimum requirements before installing Docker, and consider monitoring resource usage to ensure optimal performance.

Follow these steps to install Docker Desktop for Windows.

  1. Ensure your system meets the requirements:

    • Windows 10 64-bit: Pro, Enterprise, or Education (Build 16299 or later) or Windows 11.
    • Virtualization must be enabled in the BIOS. You can usually find this setting under “CPU Configuration,” “Virtualization,” or “VT-x” settings.
  2. Download Docker Desktop for Windows from the official Docker website (600+ MB).

  3. Run the installer:

    • Double-click on the downloaded Docker Desktop Installer.exe file to start the installation process.
    • Follow the on-screen instructions, accepting the default settings or customizing them as needed.
  4. Start Docker Desktop:

    • After the installation is complete, Docker Desktop should start automatically. If it doesn’t, you can launch it from the Start menu.
    • You will see the Docker icon in the system tray, indicating that Docker is running.
    • Right-click on the icon and select “Dashboard” to open the Docker Desktop dashboard.
  5. Verify the installation:

    • Open a command prompt or PowerShell window.
    • Run the following command to check the Docker version.

    docker --version

  6. Run a test container to ensure that Docker is working correctly.

    docker run hello-world

Save Resources

To stop Docker Desktop when you are not using it:

  1. Locate the Docker icon in the system tray, which is typically located in the lower-right corner of the screen.

  2. Right-click on the Docker icon to open the context menu.

  3. Click on “Quit Docker Desktop” or “Exit” to stop Docker Desktop.

Git

Git is a popular tool used to help collaborate with others and keep track of code changes over time.

At a high level, Git is a version control system for tracking changes in evolving code projects. Using Git allows you to easily revert to an earlier version of code if you make a mistake or if a change causes unexpected problems.

Git makes it easy to collaborate with others on code. You can use Git to share your code with others, track changes that they make, and merge their changes back into your codebase. This makes it a great tool for open source development, where many people may be working on the same codebase at the same time.

In this Git introduction, we’ll start with the basics of using Git, including setting up your Git environment, creating a repository, and making commits. We’ll also cover more advanced topics like branching, merging, and collaborating with others.

Installation

The installation process for Git depends on your operating system. Follow the instructions below based on your platform:

Configuration

After installing, configure Git with your name and email address.

Using Git

When it comes to using Git, you have a few options for how to interact with it. One option is to use Git in the terminal, which involves typing out commands and working with the Git command line interface. Another option is to use a Git integration in your Integrated Development Environment (IDE), such as Visual Studio Code (VS Code).

Using Git in the terminal can be a bit intimidating, as it requires memorizing and typing out specific commands. However, it can be a useful skill to have, especially if you work on projects that require using Git outside of an IDE.

On the other hand, using a Git integration in your IDE can make the process of working with Git more user-friendly and intuitive, as you can often perform Git actions with a few clicks or keystrokes. For example, VS Code has built-in Git support and provides a visual interface for common Git actions such as committing changes, creating branches, and merging changes.

Git Crash Course (Video)

Check out the recommended Git Crash Course (Video).

Free ProGit (Book)

Check out the free ProGit book for a comprehensive guide to using Git.

See Also

Subsections of Git

Git: Installation

Git is a widely-used version control system that helps data analysts and developers track changes to their code and collaborate with others.

Mac/Linux Users

  • Option 1: Official installation instructions. Follow instructions on the official Git website. This is the most up-to-date and comprehensive guide to installing Git on your system.

  • Option 2: Step-by-step installation guide. Check out our installation instructions for a step-by-step guide.

Windows Users

  • Option 1: Official installation instructions. Follow instructions on the official Git website. This is the most up-to-date and comprehensive guide to installing Git on your system.

  • Option 2: Step-by-step installation guide. Check out our detailed installation instructions for a step-by-step guide.

Use Git to manage your code and collaborate with others.

Subsections of Git: Installation

Git: Mac/Linux

Task 1 - Download and install Git

  1. Open a terminal window
  2. Run the appropriate command to install Git:
    • sudo apt-get install git (for Debian/Ubuntu-based systems), or
    • brew install git (for macOS)

Task 2 - Configure Git

  1. Open a terminal window
  2. Run the following commands to configure Git with your name (your real name, e.g. “Denise Case”) and the email address you used for GitHub.
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
  3. Important: Replace “Your Name” with your name and “your.email@example.com” with the email address associated with your GitHub account
  4. This configuration will be used for all of your Git repositories

Task 3 - Verify

  1. Run the following command to verify your Git configuration:
git config --list
  2. You should see your name and email address listed under the “user” section
  3. If the information is not correct, you can run the git config command again to update it

Git: Windows

Task 1 - Download and install Git

  1. Go to the Git download page at https://git-scm.com/download/win
  2. Click the “Download” button to download the Git installer
  3. Run the installer file that you downloaded
  4. Accept the default installation options and click “Install”
  5. Choose the appropriate options for line ending conversion and terminal emulator during the installation process

Task 2 - Configure Git

  1. Open a command prompt or PowerShell window
  2. Run the following commands to configure Git with your name (your real name, e.g. “Denise Case”) and the email address you used for GitHub:
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
  3. Important: Replace “Your Name” with your name and “your.email@example.com” with the email address associated with your GitHub account
  4. This configuration will be used for all of your Git repositories

Task 3 - Verify

  1. Run the following command to verify your Git configuration:
git config --list
  2. You should see your name and email address listed under the “user” section
  3. If the information is not correct, run the git config command again to update it

Git: Basics

Git is a widely-used version control system that helps you track changes to your code and collaborate with others. With Git, you can create a complete history of your work, from the initial commit to the latest changes. This makes it easy to work on a project with others, keep track of your progress, and recover from mistakes.

Creating a Repository

To get started with Git, you need to create a repository. This is where you’ll store your code and track changes to it. There are several ways to create a repository:

  • Clone an existing repository: If you want to work on code that’s already been created and shared by someone else, you can clone their repository to your local machine. To do this, you’ll need the repository’s URL and you can use the git clone command to create a local copy of the code.

  • Fork an existing repository: If you want to make changes to someone else’s code and contribute those changes back to their repository, you can fork their repository. This creates a copy of their repository in your GitHub account, which you can then clone to your local machine and work on.

  • Create a new repository: If you want to start a new project from scratch, you can create a new repository by clicking the “+” sign in the top right corner of your GitHub account.

Getting Code onto Your Machine

Once you have a repository set up, you’ll want to get the code onto your local machine so you can work on it. To do this, clone the repository using the git clone sourceurl command. Change sourceurl to the address shown in the browser when viewing the root folder of the repository. This will create a local copy of the repository on your machine that you can work with.
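For example, cloning this site's sample repository would look like git clone https://github.com/denisecase/datafun-01-getting-started. The sketch below clones a small throwaway local repository instead, so you can try the command offline; the source-repo and my-copy names are just placeholders:

```shell
# Work in a scratch directory and create a tiny repo to clone from.
cd "$(mktemp -d)"
git init -q -b main source-repo
git -C source-repo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Clone it. With a real project you would paste a GitHub URL here.
git clone -q source-repo my-copy   # creates a my-copy/ folder

# The clone's "origin" remote points back at what you cloned from.
git -C my-copy remote -v
```

With a GitHub URL, the folder created on your machine is named after the repository, just as described above.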

Saving Changes with Git

Once you have a copy of the repository on your machine, you can make changes to the code and save those changes to the repository using Git. The basic workflow for this is:

  1. Add changes: Use the git add . command (pronounced “git add dot”) to add your changes to the staging area. The dot at the end tells Git to stage all new and modified files in the current directory and below.

  2. Commit changes: Use the git commit -m "add feature n" command to save the changes to the local repository with a descriptive commit message.

  3. Push changes: Use the git push origin main command to push the changes from your local repository back up to the remote repository on GitHub.

This sequence of commands is very common:

git add .
git commit -m "tell us what you did"
git push origin main

Editing on your Machine

We typically like to edit files on our machine using editors like VS Code or IDEs like PyCharm and Spyder. These local tools provide advanced features including syntax highlighting, code completion, and debugging, which can make our work more efficient.

Editing in the Cloud

However, the power of our local editors and IDEs is increasingly becoming available in the cloud, and we can make many updates to our repositories right from the GitHub web interface. For example, you can use the github.dev web-based editor to edit files and commit your changes.

It’s important to note that if we edit files both on our machine and in the cloud, we can end up with conflicts when trying to merge our changes. Therefore, it’s important to ensure that we always pull down the latest changes from the cloud before making any local edits, and that we push our changes back up to the cloud as soon as we’re finished with them.

Using the git pull command will bring any changes made directly in your GitHub (or other cloud) repository down to your machine.

git pull

Read more about the github.dev editor at:

Remotes

In Git, “origin” is a shorthand name that refers to the remote repository where your code is stored. When you clone a repository, Git automatically creates an “origin” remote that points to the original repository on the server. You can use this remote to pull changes from the server or push your local changes back to it.

You can add more than one remote to a repository.

Branches

Git branches are separate lines of development that allow multiple contributors to work on different features or versions of a project simultaneously.

The default branch in Git is now called “main”, but “master” was previously used, so you may still see it.

Pull and Push

When we use git pull, Git already knows the source and destination of the changes (i.e., the remote and local repositories) because it’s been configured using the git clone command.

When we use git push, we need to specify both the remote repository (the source) and the branch we want to push the changes to (the destination). The origin in git push origin main refers to the remote repository we want to push the changes to, and main refers to the branch on the remote repository that we want to update with our changes.

Git: Branches

In Git, we are always working on a branch of code, which is like a separate “timeline” for the code.

Default Branch

The default branch is created automatically when we first create a repository, and it is typically and by default named main. On older repos, you may see a master branch instead, but the old terminology is discouraged and easy to update.
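Updating is typically a one-line rename. As a sketch, the commands below create a throwaway repository with an old-style master branch and rename it (the demo identity is a placeholder):

```shell
# Throwaway demo repo so the rename can be tried safely anywhere.
cd "$(mktemp -d)"
git init -q -b master .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Rename the branch: -m moves (renames) a branch.
git branch -m master main

# Confirm the current branch name.
git branch --show-current
```

For a repository hosted on GitHub, you would also push the renamed branch (git push -u origin main) and update the default branch in the repository's settings on the website.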

Working Alone

For independent projects, we may work directly on the main default branch.

Individual developers may choose to use branches to work on new features or fixes without affecting their main codebase.

Working Together

In a professional environment, it’s generally recommended to create new branches for new features or changes to avoid conflicts with other developers and to make it easier to manage and review changes.

Individual developers can also use branches to experiment with new features or make changes without affecting the main codebase.

We can make changes, commit them to our branch, and then merge our branch back into the default branch when appropriate. Multiple branches allow a team to work on different features or changes at once without worrying about conflicts or breaking the main codebase.

Once we’re satisfied with our changes on a branch, we can create a pull request to request that the changes be reviewed and merged into the default branch. Team leads can then review and merge the changes as needed. The default branch is typically set to “main” and is the primary branch for the project.

You can create a new branch with the git branch command, and switch to that branch with the git checkout command. Once you’re on the new branch, any changes you make and commit will only affect that branch.

To merge a branch back into the main codebase, you can use the git merge command. This will bring any changes from the branch into the main codebase, and you can resolve any conflicts that arise during the merge.
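The branch-and-merge workflow described above can be sketched as follows; the repository, the branch name my-feature, and the notes.txt file are throwaway placeholders so the commands can be run anywhere:

```shell
# Throwaway demo repo with an initial commit on main.
cd "$(mktemp -d)"
git init -q -b main .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Create a branch and switch to it in one step.
# (Newer Git also supports: git switch -c my-feature)
git checkout -b my-feature

# ...edit files, then stage and commit on the branch:
echo "new idea" > notes.txt
git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add notes"

# Return to the default branch and merge the feature in.
git checkout main
git merge -q my-feature
```

Commits made on my-feature only affect that branch until the merge brings them into main.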

Git branches are an important tool for managing complex projects with multiple contributors, and they allow for efficient collaboration and code review.

Git: Configuration

After installing, configure Git with your name and email.

Use your GitHub email for best results.

Open Git Bash on Windows

To open Git Bash on Windows:

  1. Press the Windows key on your keyboard to open the Start menu.
  2. Type “Git Bash” into the search bar and select it from the list of results.
  3. Git Bash should now open in a new window.

Open Terminal on Mac or Linux

On Mac or Linux, open Terminal app.

Check Git Configuration

Type the following command to display your Git configuration:

git config --list

Look for the following lines in the output:

user.name=Your Name
user.email=your.email@example.com

If you see your name and email listed, then they are set in Git.

Set Git Configuration

If you don’t see your name and email listed, set them using the following commands:

git config --global user.name "Your Name"
git config --global user.email your.email@example.com

Replace “Your Name” and “your.email@example.com” with your actual name and email address.

The --global flag ensures the settings are applied globally across all your Git repositories.

Git: Conflicts

We can edit project files in at least two places:

  • locally, on our machine
  • in the cloud, e.g., by using the editing features in GitHub

Bad Practices

We want to keep our local version and cloud version in sync at all times.

Some of the worst things we can do are:

  1. Forget to pull before we start our work.
  2. Pull code and leave it for a long time, then start working on old, stale code.
  3. Make huge, expansive contributions that take a long time (unless we know how to branch - an intermediate Git skill.)
  4. Wait to push our completed changes to the cloud.

Good Practices

To minimize the chance of conflicts:

  1. Always pull code before you start working locally. Never work on stale code!
  2. Make small, incremental changes.
  3. As soon as you finish a useful contribution, git add, commit, and push up to the cloud.

Keep your local and cloud repositories synchronized. Use these for each session.

Before you start:

git pull

After you finish a set of edits:

git add .
git commit -m "add title"
git push

When working collaboratively, communicate with team members and establish a clear workflow. Ensure the team knows who is working on which files and when changes are being made. You might create different small, focused branches that don’t overlap much in terms of the files they modify.

Merge Conflicts

Merge conflicts can occur when:

  • two people edit the same file simultaneously
  • changes are made to a file both locally and in the cloud at the same time.
  • two branches with different changes are merged.

For example, we might use the GitHub cloud editor to make a quick fix to our README.md - forgetting that we’re also in the process of updating installation instructions on the local README.md.

Merge conflicts can be frustrating, but they are an inevitable part of collaborative work.

If you do run into a merge conflict, don’t worry - it’s not the end of the world. Git provides tools to help you resolve conflicts and merge changes together. The first step is to understand which files have conflicts by running git status. The files with conflicts will be marked as “unmerged”.

To resolve the conflict, open the conflicted file and look for the conflicting sections marked with <<<<<<< HEAD, =======, and >>>>>>>. Manually edit the file to remove the conflict markers and keep the changes you want. Once you’ve resolved the conflict, stage the changes with git add and commit them with git commit.
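As a concrete sketch, the commands below manufacture a conflict in a throwaway repository and then resolve it; the file contents and the gitc helper (which just supplies a placeholder identity) are for demonstration only:

```shell
# Throwaway demo repo; gitc is a small helper that adds a demo identity.
cd "$(mktemp -d)"
git init -q -b main .
gitc() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo "original line" > README.md
git add README.md
gitc commit -q -m "initial commit"

# Change the same line on a branch...
git checkout -q -b feature
echo "feature version" > README.md
git add README.md
gitc commit -q -m "feature edit"

# ...and differently on main.
git checkout -q main
echo "main version" > README.md
git add README.md
gitc commit -q -m "main edit"

# The merge fails; README.md now contains the conflict markers.
gitc merge feature || true
grep "<<<<<<<" README.md

# Resolve: edit the file to keep what you want (markers removed),
# then stage and commit the resolution.
echo "merged version" > README.md
git add README.md
gitc commit -q -m "resolve merge conflict in README.md"
```

After the final commit, git status reports a clean tree and the merge is complete.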

If you’re still unsure how to resolve the conflict, ask for help from your team members or consult Git documentation. Stay calm and take your time to carefully resolve the conflict.

Experience managing merge conflicts can be very valuable.

Git: Crash Course

A student-recommended video on Git - definitely worth sharing! It covers the material in a similar way, and you can jump right to the parts you need.

Note: Watch it when you have time, or use it when you’re ready to learn more about Git. Many students find it very helpful; it’s hard to imagine presenting more information more efficiently.

https://www.youtube.com/watch?v=RGOj5yH7evk

Git and GitHub for Beginners - Crash Course

Over 2 million views.

From the video description:

Learn about Git and GitHub in this tutorial. These are important tools for all developers to understand. Git and GitHub make it easier to manage different software versions and make it easier for multiple people to work on the same software project. This course was developed by Gwen Faraday.

Git: Remotes

In Git, the term “origin” refers to the default remote repository that a local repository is linked to. When you clone a repository from a remote server to your local machine, Git automatically sets up the “origin” remote for you. This allows you to push changes from your local repository to the remote repository, and pull changes from the remote repository to your local repository.

When you clone a repository, Git sets up the origin remote by default, pointing to the repository you cloned from. This means that when you push changes to the remote repository, they will be added to the branch on the remote repository that you cloned from.

Using the “origin” remote allows you to collaborate with others by sharing changes to the same repository. When someone else pushes changes to the remote repository, you can pull those changes down to your local repository and merge them with your own changes.

However, if you edit the same file in both your local repository and the remote repository, conflicts can arise. To avoid conflicts, it’s important to always pull down changes from the remote repository before making your own changes, and to carefully review any merge conflicts that arise.

Working with Remote Repositories

Git provides a set of commands that allow you to work with remote repositories. Here are some commonly used commands:

  • git remote - List the remote repositories that are connected to your local repository.

  • git remote -v - List the remote repositories along with their URLs.

  • git remote add <name> <url> - Add a new remote repository to your local repository. The name parameter is the name you want to give the remote, and url is the URL of the remote repository.

  • git remote rm <name> - Remove a remote repository from your local repository.

  • git push <remote> <branch> - Push your local changes to a remote repository. The remote parameter is the name of the remote repository, and branch is the branch you want to push to.

  • git pull <remote> <branch> - Pull changes from a remote repository into your local repository. The remote parameter is the name of the remote repository, and branch is the branch you want to pull from.

  • git fetch <remote> - Fetch the changes from a remote repository, but don’t apply them to your local repository.

  • git clone <url> - Clone a remote repository to your local machine.
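Here is a short throwaway session using several of these commands, with a local folder standing in for a GitHub URL (shared-repo, local-repo, and backup are placeholder names):

```shell
# Scratch directory; shared-repo stands in for a repo hosted on GitHub.
cd "$(mktemp -d)"
git init -q -b main shared-repo
git -C shared-repo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Cloning sets up the "origin" remote automatically.
git clone -q shared-repo local-repo
cd local-repo
git remote -v                        # lists origin with its fetch/push URLs

# Add a second remote, fetch from it without applying changes,
# then remove it again.
git remote add backup ../shared-repo
git fetch -q backup
git remote rm backup
```

With a real project, the URL passed to git remote add would be an HTTPS or SSH address from GitHub rather than a local path.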

Git Learning: Concepts Over Memorization

Learning every Git command by heart is neither necessary nor efficient. Instead, focus on understanding the concepts and workflows of Git, and how the commands fit into those workflows. The vast amount of online resources available will serve as reliable references when you need them.

As you work with Git more frequently, the most common commands will become second nature. However, for the rest, don’t hesitate to look them up. Remember that the value of Git lies not in memorizing commands but in leveraging its powerful version control capabilities to manage your projects effectively.

GitHub

GitHub is a popular, web-based platform that allows data analysts and developers to store and manage their code and collaborate with others.

GitHub is built on Git, which is a distributed version control system that allows developers to track changes to their code over time and collaborate with others on the same codebase.

With GitHub, developers can create their own repositories, which are essentially folders that contain their code, documentation, and other files related to a specific project. They can also fork other people’s repositories to create their own copies, which they can then modify and contribute back to the original repository. This allows for easy collaboration and code sharing among developers.

GitHub provides tools for developers to manage their code, such as the ability to track and resolve issues, review and merge pull requests, and create and manage branches. It also provides a web-based interface for viewing and editing code, as well as a built-in code editor. Additionally, it has a wide range of integrations and APIs that allow developers to automate various development tasks and integrate with other tools and services.

Sign Up For A Free Account

Sign up for a free account with GitHub.com, a code hosting platform that manages a vast number of programming projects. Follow their website instructions to get started.  See the recommendations on GitHub email and username below.

GitHub Email

You’ll need an email. I use a permanent personal email for most GitHub work, rather than a work or school account (which may be temporary). Your email will not be made public.

GitHub Username

You’ll create a GitHub username. Your username will be public. Your username can be anonymous (e.g., ‘analystextraordinaire’) or publicly associated with you. For example, I use ‘denisecase’. Your username will be a part of the URL to all of your projects.

Students New to GitHub

  • Recruiters may look at GitHub and LinkedIn profiles - it can be helpful to show your skills using modern tools. 
  • Be courageous. The best way to learn is by doing, and don’t be too concerned about making mistakes.
  • Git mistakes and do-overs are common when getting started.
  • Learning to fix issues is a key skill in data analytics.
  • Keep and share your latest, most useful, and best work in GitHub. 

GitHub Repositories

Each coding project lives in a GitHub repository (called ‘repo’ for short) in the ‘cloud’ (a distributed group of machines).

Git (the system) keeps track of committed changes to an evolving project.

  • The GitHub repo can be kept in sync with a Git repo on your local machine.
  • For example, if a GitHub repo is named datafun-01-getting-started, then on my machine it lives in my Documents/datafun-01-getting-started directory.

Quick Quiz

Go to: https://github.com/denisecase/datafun-01-getting-started

Q: What is the username? 

Q: What is the repo name in the URL? 

Get Started 

After you have an account, you can use the Get Started Guide that the GitHub team has created to help you understand the platform.

For more information on getting started on GitHub, view the “Getting Started with GitHub” video below from the GitHub Training & Guides YouTube channel.

GitHub Training & Guides

More About GitHub

The following definition of GitHub comes from Kinsta.com

At a high level, GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. To understand exactly what GitHub is, you need to know two connected principles: Version control, which helps developers track and manage changes to a software project’s code, and Git, which is a specific open-source version control system.

Learn more about GitHub in the following video from the GitHub YouTube channel.

GitHub Video

Free Stuff For Students

For more fun stuff, check these out. 

See Also

There is more information about GitHub in the Hosting Chapter.

Homebrew

Homebrew is a package manager for macOS and Linux that simplifies the installation, updating, and management of software on your system. Homebrew allows you to install various command-line tools, applications, and libraries with ease. It is designed to work seamlessly with macOS and Linux, providing a user-friendly interface for managing software packages.

Jupyter

Jupyter is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. It is a popular tool for data analysis, scientific computing, and machine learning, and is widely used in academic research, industry, and data science education.

Jupyter gets its name from Julia, Python, and R, some of the original programming languages supported.

Jupyter provides the following features.

Interactive Computing

Jupyter notebooks allow users to write and execute code interactively, providing an interactive computing environment. This allows users to explore data, prototype algorithms, and create visualizations in a single, cohesive environment.

Multiple Language Support

Jupyter supports multiple programming languages, including Python, R, and Julia. This makes it easy to integrate different tools and frameworks and collaborate with colleagues who use different programming languages.

Collaboration

Jupyter notebooks can be shared with others, allowing for easy collaboration and reproducibility of analyses. This also facilitates communication and knowledge sharing among team members and stakeholders.

Visualization

Jupyter notebooks support interactive visualization libraries such as Matplotlib, Bokeh, and Plotly, making it easy to create and share data visualizations.

Integration

Jupyter notebooks can be integrated with other tools and frameworks such as Git, GitHub, and Docker. This makes it easy to manage version control, share code and data, and deploy projects.

Ecosystem

Jupyter has a rich ecosystem of tools and services, such as JupyterLab, JupyterHub, and Binder, that can help streamline the development and deployment process. Many third-party tools and plugins also integrate with Jupyter to extend its functionality.

Jupyter Installation

The installation process for Jupyter depends on your operating system and your preferred installation method. Follow the instructions below based on your platform.

Jupyter Ecosystem

Here’s a short guide to clarify some of the terms used with Jupyter.

  • JupyterLab: An interactive development environment (IDE) for working with Jupyter notebooks, code, and data. It provides a flexible and powerful user interface that can be customized to suit the needs of individual users.

  • Jupyter Notebook: A web-based interactive computational environment for creating and sharing Jupyter notebooks, which allow you to create and share documents that contain live code, equations, visualizations, and narrative text.

  • JupyterHub: A multi-user server that allows multiple users to access Jupyter notebooks and other resources from a shared server. It is commonly used in educational settings or for collaborative research projects.

  • Jupyter Book: A tool for building beautiful, publication-quality books and documents from computational material, such as Jupyter notebooks. It provides a simple way to create interactive documents with executable code and visualizations.

  • nbconvert: A command-line tool that converts Jupyter notebooks to other formats, such as HTML, PDF, or Markdown. This allows you to share your work with others who may not have Jupyter installed.

  • ipywidgets: A library for creating interactive widgets in Jupyter notebooks. Widgets are user interface elements, such as buttons and sliders, that allow you to interact with and visualize data in real time.

  • nbviewer: A web application that allows you to view Jupyter notebooks without having to install Jupyter yourself. You can simply paste the URL of a notebook and view it in your browser.

Get Started with Jupyter Notebooks

There are excellent resources available for getting started with Jupyter Notebooks.

See:

VS Code

Visual Studio Code (VS Code) is a free and open-source code editor developed by Microsoft. It is available on Windows, Linux, and macOS and offers features such as debugging, syntax highlighting, and intelligent code completion.

Some of the key features of VS Code include:

  • Built-in Git integration
  • Support for multiple languages and frameworks
  • Extensions for customizing the editor and adding new functionality
  • Debugging capabilities for Node.js, Python, and other languages
  • Integrated terminal for running commands and scripts

Using a modern editor or IDE can make your coding experience more efficient and productive.

Installation

The installation process depends on your operating system. Follow the instructions below based on your platform:

VS Code Extensions

VS Code extensions are add-ons that allow users to customize and enhance the functionality of the VS Code.

For example, IntelliSense, which provides intelligent code suggestions, auto-completion, and parameter hints while you write code, is built into VS Code and enabled by default; many extensions add IntelliSense support for additional languages.

To learn more about extensions, visit the official documentation at https://code.visualstudio.com/docs/introvideos/extend.

Why VS Code

One reason we teach VS Code over other IDEs (e.g., Spyder, PyCharm, IDLE) is that VS Code is a more general-purpose code editor that supports multiple languages and workflows, and works on Windows, Mac, and Linux machines. VS Code is capable of handling a wide range of tasks and can be used for web development, data analysis, scripting, and more.

VS Code has a lot of built-in functionality for working with other languages, including Markdown, SQL, PowerShell, Julia, and more. Learning VS Code is a great skill for anyone getting started with programming, data analysis, and/or automation who wants a versatile environment that will accommodate growing skills.

VS Code is widely used and well-supported, with many resources for learning how to use it effectively. In addition to the comprehensive official documentation, there are articles and videos available for beginners through experts.

Subsections of VS Code

VS Code: Installation

Visual Studio Code (VS Code) is a free, cross-platform code editor. Here are some options for installing it on your system:

Windows Users

  • Option 1: Download the installer. Download the Windows installer from the official VS Code website at https://code.visualstudio.com and follow the installation wizard. This is the most common method and ensures you have the latest version.

  • Option 2: Install via winget. If you use the Windows Package Manager, run winget install -e --id Microsoft.VisualStudioCode in a terminal.

macOS Users

  • Option 1: Download from the website. Download the macOS build from https://code.visualstudio.com, unzip it, and drag Visual Studio Code into your Applications folder.

  • Option 2: Install via Homebrew. If you use Homebrew, run brew install --cask visual-studio-code in your terminal.

Linux Users

  • Option 1: Package installation. Download the .deb or .rpm package for your distribution from https://code.visualstudio.com and install it with your package manager.

  • Option 2: Snap installation. On distributions with Snap support, run sudo snap install code --classic.

Once you have VS Code installed, you can use it to edit code, manage Git repositories, and much more. Happy coding!

Winget

Winget (Windows Package Manager) is an official package manager for Windows systems, developed by Microsoft. It simplifies the process of discovering, installing, upgrading, and removing software on Windows machines. Winget provides command-line access to manage software packages, similar to package managers on Linux and macOS systems.

With Winget, you can search for, install, update, and uninstall software packages without having to manually navigate to a website, download installers, or follow installation wizards. Winget automates these tasks and makes it easy to manage software on your Windows system.

Alternatives

Chocolatey is a popular, well-established alternative. It has been around longer, offering a mature set of features, a large repository of packages, and extensive documentation and support. Chocolatey is known for its versatility and integration with various Windows tools, such as PowerShell and the NuGet infrastructure, making it a reliable and comprehensive package management solution for many Windows users.

Chapter 3

Terminals

A terminal, or command line interface, is a text-based way to interact with your computer. Terminals can be faster and more powerful than graphical user interfaces (GUIs) for many tasks, especially tasks that involve file management, development, or automation.

This chapter provides an introduction to some widely-used terminals.

Mac and Linux

Mac and Linux systems offer the Terminal app, which provides a command-line interface for navigating the file system, running commands, and executing scripts. The Terminal supports a wide range of commands and utilities, and can be customized with various themes and configurations to suit individual preferences.

PowerShell

PowerShell is a powerful terminal and scripting language offered by Microsoft for all platforms, including Windows, macOS, and Linux. PowerShell provides a more powerful and flexible command-line environment than the Command Prompt, with support for features like object-oriented pipelines, remote management, and scripting with .NET objects.

VS Code Terminal

The terminal in Visual Studio Code offers a built-in command-line interface for developers. It provides access to a range of commands and utilities, including those specific to development tasks like running build scripts, testing code, and debugging applications.

Windows Terminals

Windows offers a variety of terminals depending on the user’s needs and preferences. These terminals include:

  • Command Prompt: A basic terminal emulator that has been included in Windows since the early days.
  • Git Bash: A terminal emulator that is bundled with Git for Windows. It provides a Unix-like command-line environment for Windows, including support for common Unix utilities and shell scripting.
  • PowerShell: A command-line shell and scripting language that is designed for system administration and automation in Windows.
  • Windows Subsystem for Linux (WSL): A feature of Windows 10 and later that allows users to run a Linux environment directly on Windows, without the need for a virtual machine or container.

Each of these terminals provides a unique set of features and capabilities, allowing users to choose the terminal that best fits their needs and workflow.

Subsections of Terminals

Mac and Linux

On Mac and Linux machines, the Terminal app is widely used.

Terminal is a command-line interface that allows users to interact with their computer using text commands. It provides a way to execute commands, manage files, and run scripts without the need for a graphical user interface.

The Terminal can be accessed by opening the Terminal application on a Mac or by opening a terminal emulator on a Linux distribution.

Once in the Terminal, users can navigate the file system, run commands, and install and manage software packages. The Terminal is a powerful tool for advanced users and developers, allowing for efficient and precise control over the computer. However, it does require some knowledge of command-line interfaces and syntax, which can be intimidating when getting started.
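To give a feel for this, here is a short example using only standard Unix utilities; the directory and file names are invented for the demonstration.

```shell
# Create a scratch folder, add a file, then list and read it back.
set -e                           # stop on the first error
workdir=$(mktemp -d)             # temporary directory, so nothing real is touched
cd "$workdir"
mkdir projects                   # make a new directory
echo "hello, terminal" > projects/notes.txt
ls projects                      # list its contents: notes.txt
cat projects/notes.txt           # print the file: hello, terminal
```

The same commands work in any Unix-like shell, including the Terminal on macOS and most Linux distributions.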

Homebrew is a useful command-line tool for managing software packages from the Terminal.

PowerShell

PowerShell is a cross-platform terminal and scripting language used for a wide range of purposes including automation, administration, analytics, and digital forensics.

PowerShell includes IntelliSense-style command completion. To accept a suggestion, use the right arrow key (instead of Tab).

PowerShell Core

This section provides an overview of the cross-platform PowerShell Core, including some use cases and capabilities.

PowerShell vs Windows PowerShell

When we say PowerShell, we generally mean PowerShell Core, the cross-platform version. There is also Windows PowerShell. They are different.

Installation

This section helps with installing PowerShell on various operating systems, including Windows, Linux, and macOS.

Multiple Versions

This section covers how to manage multiple versions of PowerShell on the same machine, and remove extra versions when no longer needed.

PowerShell in VS Code

This section covers the use of PowerShell in the popular code editor, Visual Studio Code (VS Code).

Subsections of PowerShell

PowerShell Core

PowerShell Core is the cross-platform version of PowerShell. It was developed by Microsoft as an open-source project and is designed to run on Windows, Linux, and macOS, making it a truly cross-platform tool. PowerShell Core has a similar syntax and features to the traditional Windows PowerShell, but with additional cross-platform capabilities and support for running on a variety of operating systems.

PowerShell Core also includes many improvements and new features compared to the older Windows PowerShell, including improved performance, support for Docker containers, and enhanced security features. PowerShell Core also has its own module repository, called the PowerShell Gallery, where users can download and install community-built PowerShell modules.

Here are some common use cases:

Analytics

  • Data collection and analysis: PowerShell can be used to collect and process data from various sources, including databases, logs, and web services. It can also be used to perform data analysis tasks such as data cleansing, transformation, and aggregation.

  • Reporting: PowerShell can be used to generate reports and visualizations based on data collected and analyzed from various sources. This can include dashboards, charts, and other types of visualizations.

  • Machine learning: PowerShell can be used to develop and train machine learning models using libraries such as Microsoft’s Cognitive Services and TensorFlow.

Digital forensics

  • Data acquisition: PowerShell can be used to acquire data from disk images, memory dumps, and other sources for forensic analysis.

  • Evidence examination: PowerShell can be used to search for specific file types, keywords, and other indicators of compromise within acquired data.

  • Data recovery: PowerShell can be used to recover data from damaged or corrupted drives, and to reconstruct deleted or lost files.

  • Network analysis: PowerShell can be used to analyze network traffic and investigate potential security incidents or threats.

System administration

PowerShell provides a powerful command-line interface for managing Windows operating systems, including managing users and groups, configuring network settings, installing software, and performing other administrative tasks.

Automation

PowerShell can be used to automate repetitive tasks, such as backups, file transfers, and system maintenance tasks. PowerShell scripts can be scheduled to run automatically or triggered by events such as system startup or user logon.

Development

PowerShell includes a full-featured scripting language with access to the .NET Framework and other APIs, making it a powerful tool for developing and testing applications and scripts.

Cloud computing

PowerShell can be used to manage cloud services and resources on platforms such as Microsoft Azure and Amazon Web Services (AWS). PowerShell modules and cmdlets are available for managing virtual machines, storage, networking, and other cloud services.

Security

PowerShell includes a range of security features, including execution policies, code signing, and encryption, making it a valuable tool for securing systems and managing user access to resources.

PowerShell: Core vs Windows

Windows PowerShell and PowerShell Core are two different versions of PowerShell that differ in their supported operating systems and features.

Windows PowerShell (older, Windows-specific)

Windows PowerShell is the original version of PowerShell, released in 2006 and included by default in Windows operating systems. It runs on the .NET Framework and is designed to work specifically on Windows. Windows PowerShell has a wide range of built-in cmdlets and supports scripting in the PowerShell language, with access to the .NET Framework and COM objects. The typical installation path for Windows PowerShell is C:\Windows\System32\WindowsPowerShell\v1.0.

PowerShell Core (newer, cross-platform)

PowerShell Core, on the other hand, is an open-source version of PowerShell that was released in 2016 and is designed to be cross-platform. It runs on .NET Core and supports Windows, Linux, and macOS operating systems. PowerShell Core has a smaller footprint than Windows PowerShell and is designed to be more lightweight and modular. It includes many of the same built-in cmdlets as Windows PowerShell, but also has some additional features like support for SSH remoting and improved performance. The typical installation path for PowerShell Core is C:\Program Files\PowerShell\7.

Using Both

The two versions differ in their operating system support and feature set. Windows PowerShell is designed specifically for Windows operating systems and is included by default, while PowerShell Core is designed to be cross-platform and is a separate install.

PowerShell: Installation

PowerShell is a powerful cross-platform terminal and scripting language used for a wide range of purposes. Here are some options for installing PowerShell on your system:

Mac/Linux Users

  • Official installation instructions. Follow the instructions on the official PowerShell website. On macOS, PowerShell can also be installed with Homebrew using brew install --cask powershell; on Linux, use your distribution’s package manager as described in the official instructions.

Windows Users

  • Official installation instructions. Follow the instructions on the official PowerShell website to download and install PowerShell on your system.

Verify Version

Open PowerShell and run the following command to verify installation:

$PSVersionTable.PSVersion

PowerShell: Multiple Versions

There may be multiple versions of PowerShell on your computer, especially on Windows. This can sometimes cause confusion.

It’s perfectly fine to have multiple installations, but you can remove some if they are unneeded.

On Windows machines, it’s common to have both:

  • PowerShell Core (the newer, cross-platform version)
  • Windows PowerShell (the original, Windows-specific version)

Read more about Windows PowerShell vs PowerShell Core.

Uninstall

To uninstall older versions of PowerShell on Windows, follow these steps:

  1. Open the Start menu and type “Add or remove programs” in the search box. Click on the “Add or remove programs” option that appears in the search results.

  2. In the list of installed programs, locate the PowerShell versions to uninstall. Search for “PowerShell” or sort the list by name or date to find the relevant entries.

  3. Click on each PowerShell version that you want to uninstall, then click “Uninstall”. Follow the prompts to complete the process.

  4. Once uninstalled, ensure the version you want to keep is still installed and functional.

  5. Check the version of PowerShell by opening a PowerShell window and running the command $PSVersionTable.PSVersion. This will display the version number of the PowerShell installation that is currently in use.

PowerShell in VS Code

Visual Studio Code (VS Code) is a popular code editor that supports many programming languages, including PowerShell.

Terminals in VS Code

  1. Open VS Code and open the “Terminal” panel from the View menu (View / Terminal), or use the keyboard shortcut Ctrl+` (backtick).

  2. In the Terminal panel, click on the plus (+) icon to open a new terminal.

Default Shell

By default, the new terminal will use the default shell associated with your system, e.g., Windows Command Prompt (cmd).

Open a PowerShell Terminal from Panel

To open a PowerShell terminal in VS Code, click on the drop-down arrow in the terminal panel and select “PowerShell” from the list of available shells.

Set Default Terminal

There are several options for setting the default terminal to PowerShell in VS Code:

  1. Use the Command Palette (recommended)
  2. Use the Terminal Dropdown
  3. Use the VS Code Settings

1. Set Default using Command Palette

This method is helpful when you have multiple PowerShell installations and want a specific one to open by default.

In VS Code, from the menu, select View / Command Palette / Terminal: Select Default Profile.

Look carefully at the PowerShell options. For example, you may have:

  • PowerShell
  • Windows PowerShell

The typical installation path for PowerShell Core is something like C:\Program Files\PowerShell\7.

The typical installation path for Windows PowerShell is something like C:\Windows\System32\WindowsPowerShell\v1.0.

Read more about Windows PowerShell vs PowerShell Core.

2. Set Default from Terminal Dropdown

Alternatively, to make PowerShell your default terminal in VS Code, click on the drop-down arrow in the terminal panel and select “Select Default Profile”. Then choose “PowerShell” from the list.

3. Set Default using Settings

Or to configure the Settings directly, follow these steps:

  1. Open the VS Code settings editor by clicking on the gear icon in the lower-left corner of the window and selecting “Settings” from the menu.

  2. In the search bar at the top of the settings editor, type “terminal.integrated.shell.windows”. This will filter the settings to show the terminal shell settings for Windows.

  3. Click on the edit icon (pencil icon) next to “Terminal > Integrated > Shell: Windows” to open the editing interface.

  4. Enter the path to the PowerShell executable that you want to use as the default shell in VS Code. For example, if you want to use PowerShell Core (the cross-platform version), you might enter “C:\Program Files\PowerShell\7\pwsh.exe” (assuming that PowerShell Core is installed in the default location).

  5. Save the changes to the “Settings” editor by clicking on the “Save” button or using the keyboard shortcut (Ctrl+S on Windows/Linux or Command+S on macOS).

Run PowerShell as Administrator

Sometimes you’ll need to run PowerShell as an Administrator (Admin), for example, when installing packages with Chocolatey.

Outside VS Code, Start / Windows PowerShell / Run as Administrator.

Use as needed and then return to VS Code for non-admin commands.

VS Code Terminal

The Terminal in Visual Studio Code is a powerful tool.

It provides a command-line interface within the editor, allowing you to perform various tasks without leaving the editor.

Here are some of the key features of the Terminal in VS Code.

Multiple Integrated Terminals

You can have multiple terminal instances in VS Code, each running a different shell or command. This allows you to work with multiple environments at the same time.

Customizable Shell Environment

You can customize the shell environment by specifying shell-specific settings like environment variables, shell arguments, and shell location. This is useful for working with specific development environments or tools.

Keyboard Shortcuts

The Terminal in VS Code has several built-in keyboard shortcuts that can help you navigate and interact with the terminal more efficiently.

Integrated Tasks

VS Code allows you to define custom tasks that can be executed within the terminal. This is useful for automating repetitive tasks or building and testing your code.

Debugging

You can also use the Terminal in VS Code for debugging your code. You can set breakpoints and step through your code within the terminal, making it easier to identify and fix issues.

To open the Terminal in VS Code, press Ctrl+` or navigate to the View menu and select Terminal. From there, you can customize the terminal settings and start working with the command line directly within the editor.

Changing the Default Terminal in VS Code

To change the default terminal in VS Code, follow these steps:

  1. Open VS Code and go to “File” -> “Preferences” -> “Settings”.

  2. In the search bar, type “terminal.integrated.shell”.

  3. Click on “Edit in settings.json” on the right-hand side.

  4. In the “settings.json” file, add the following line:

    "terminal.integrated.shell.windows": "C:\\Path\\To\\Your\\Terminal\\Executable"

    Replace C:\\Path\\To\\Your\\Terminal\\Executable with the path to the executable file for the terminal you want to use. Note that backslashes must be doubled (escaped) in JSON.

  5. Save the “settings.json” file and close it.

  6. Restart VS Code for the changes to take effect.

After following these steps, your chosen terminal will be set as the default terminal in VS Code.

Windows

Windows offers multiple terminal options for developers and users, and many are widely used. Skills in any or all of these can be very valuable.

Command Prompt

The Command Prompt is a basic terminal emulator that has been included in Windows since the early days. It supports running basic commands, navigating the file system, and running batch scripts. However, it lacks some of the advanced features found in modern terminal emulators.

Git Bash

Git Bash is a terminal emulator that is bundled with Git for Windows. It provides a Unix-like command-line environment for Windows, including support for common Unix utilities and shell scripting. It also includes Git-specific features, such as auto-completion for Git commands and syntax highlighting for diffs.

PowerShell

PowerShell is a command-line shell and scripting language designed for system administration and automation. It provides a more powerful and flexible command-line environment than the Command Prompt, with support for features like object-oriented pipelines, remote management, and scripting with .NET objects.

PowerShell includes IntelliSense-style command completion. To accept a suggestion, use the right arrow key (instead of Tab).

Windows Subsystem for Linux

Windows Subsystem for Linux (WSL) is a feature of Windows 10 and later that allows users to run a Linux environment directly on Windows, without the need for a virtual machine or container. It provides access to a full-fledged Linux system, including a terminal emulator and support for running Linux applications and scripts. WSL can be used for development, testing, and running Linux-based tools and utilities on Windows.

Knowledge of Linux commands can be especially helpful in security and digital forensic investigations.

Subsections of Windows

Windows: Command Prompt

The Command Prompt is a basic terminal emulator that has been included in Windows since the early days. It provides a command-line interface for interacting with the file system, running basic commands, and executing batch scripts.

The Command Prompt supports a range of commands, including:

  • dir: List the contents of a directory.
  • cd: Change the current directory.
  • md: Create a new directory.
  • del: Delete files.
  • copy: Copy files.
  • move: Move files.

In addition to these basic commands, the Command Prompt supports a range of advanced features, such as redirection of input and output, piping of commands, and batch scripting.

It’s a simple and lightweight tool well-suited for basic tasks and for users who prefer a minimalist interface.

However, it lacks some of the advanced features and flexibility found in more modern terminal emulators like PowerShell and Git Bash.

Windows: Git Bash

Git Bash is a terminal emulator that is bundled with Git for Windows. It provides a Unix-like command-line environment for Windows, including support for common Unix utilities and shell scripting. Git Bash includes a range of features and commands that are useful for developers working with Git repositories, including:

  • git: A command-line interface for interacting with Git repositories, including tasks like cloning, committing, pushing, and merging changes.
  • ssh: A command-line interface for managing secure shell connections to remote servers and devices.
  • curl: A command-line tool for transferring data over various protocols, including HTTP and FTP.

In addition to these Git-specific features, Git Bash supports a range of general-purpose commands and utilities, including:

  • ls: List the contents of a directory.
  • cd: Change the current directory.
  • mkdir: Create a new directory.
  • rm: Delete files.
  • cp: Copy files.
  • mv: Move files.
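These general-purpose commands compose into everyday file-management sequences. A small sketch (the file and directory names are invented for the example):

```shell
# Copy, rename, and delete files using the commands listed above.
set -e
d=$(mktemp -d)                   # scratch directory for the demo
cd "$d"
mkdir src backup
echo "data" > src/a.txt
cp src/a.txt backup/             # cp: copy the file into backup/
mv src/a.txt src/b.txt           # mv: rename (move) the original
ls src backup                    # show the results
rm src/b.txt                     # rm: delete the renamed file
```

These commands behave the same in Git Bash on Windows as they do in the macOS or Linux Terminal.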

Windows: PowerShell

PowerShell is a command-line shell and scripting language designed for system administration and automation in Windows. PowerShell provides a more powerful and flexible command-line environment than the Command Prompt, with support for features like object-oriented pipelines, remote management, and scripting with .NET objects.

PowerShell includes a wide range of built-in commands, called cmdlets, that provide access to various Windows management features, such as:

  • Get-Process: Display information about running processes, including process ID (PID), CPU and memory usage, and parent process ID.
  • Get-Service: Display information about system services, including status, startup type, and dependencies.
  • Get-ChildItem: List the contents of a directory and display file information, including timestamps and permissions.
  • Set-Content: Write text to a file.
  • Invoke-WebRequest: Retrieve content from a web page or API.

In addition to these built-in cmdlets, PowerShell supports a range of scripting and automation features, including:

  • Variables and data types: PowerShell supports a range of data types, including strings, numbers, and arrays, as well as variables for storing and manipulating data.
  • Control flow statements: PowerShell supports a range of control flow statements, including loops, conditionals, and switch statements.
  • Functions: PowerShell supports the creation of reusable functions, allowing for modular and organized scripts.
  • Remote management: PowerShell can be used to manage remote Windows machines, allowing for automation and management of distributed systems.

PowerShell is a powerful and flexible tool for system administration and automation in Windows, providing access to a wide range of management features and automation capabilities.

See Also

There is more information about PowerShell in the Language Chapter and the Terminals Chapter.

Windows: WSL2

Windows Subsystem for Linux (WSL) is a feature of Windows 10 and later that allows users to run a Linux environment directly on Windows, without the need for a virtual machine or container. WSL provides access to a full-fledged Linux system, including a terminal emulator and support for running Linux applications and scripts.

WSL supports two different versions:

  • WSL 1: Uses a translation layer to provide compatibility between Linux system calls and Windows kernel system calls. WSL 1 provides access to a full Linux environment, but can be slower than running Linux natively.

  • WSL 2: Uses a lightweight virtual machine to provide a complete Linux kernel running directly on Windows. WSL 2 provides improved performance and compatibility with Linux applications, but requires more system resources.

WSL includes a range of Linux commands and utilities, allowing users to perform tasks like navigating the file system, managing packages, and running scripts. Users can also install and use Linux applications and development tools directly within WSL, including:

  • Python, Ruby, and other programming languages.
  • Git, Subversion, and other version control systems.
  • Apache, NGINX, and other web servers.
  • Docker, Kubernetes, and other containerization tools.

Subsections of Techniques

Data Cleaning

This page provides an overview of different techniques for cleaning data, a key step in the data analysis process. Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in data to ensure that it is accurate, complete, and usable for analysis.

Why Data Cleaning is Important

Data cleaning is essential for ensuring the accuracy and reliability of data analysis results. Dirty data, or data that has not been properly cleaned, can lead to inaccurate conclusions, wasted time and resources, and missed opportunities for insights and discoveries. By cleaning data before analysis, data analysts can ensure that their findings are based on accurate and high-quality data.

Techniques for Data Cleaning

There are many different techniques for data cleaning, depending on the type and quality of the data.

Some common techniques include:

  • Handling missing or incomplete data
  • Removing duplicate or irrelevant data
  • Correcting data errors and inconsistencies
  • Handling outliers and anomalies
  • Standardizing data formats and units
  • Validating and verifying data quality

Data analysts may use a combination of these techniques to clean data and prepare it for analysis.
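As a tiny illustration of two of these techniques — removing duplicates and blank lines — here is a sketch using standard Unix tools; the file and its contents are invented for the example.

```shell
# Build a small CSV with a duplicate row and a blank line, then clean it.
set -e
cd "$(mktemp -d)"                                # work in a scratch directory
printf 'name,age\nalice,34\nalice,34\n\nbob,29\n' > people.csv
# NF skips blank lines; seen[$0]++ keeps only the first copy of each row.
awk 'NF && !seen[$0]++' people.csv > clean.csv
cat clean.csv                                    # header plus two unique rows
```

In practice, larger cleaning jobs are usually done with a scripting language or a dedicated tool such as those described below.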

Tools for Data Cleaning

There are many tools and software applications available for data cleaning, ranging from basic spreadsheet programs to advanced data cleaning and analysis platforms. Some popular tools for data cleaning include:

Spreadsheets and Workbooks

Microsoft Excel and Google Sheets are widely-used spreadsheet programs with basic data cleaning and manipulation features.

Scripting Languages

Python and R are popular programming languages for data analysis and cleaning, with a range of libraries and packages available for data cleaning tasks.

Great Expectations

Great Expectations is an open-source platform for data quality management. It provides a suite of tools and libraries for validating, profiling, and documenting data, and helps data analysts and engineers ensure their data is accurate, consistent, and reliable.

It allows users to define expectations about data and use those expectations to validate and clean the data. It supports a wide range of data sources, including CSV files, databases, and cloud storage, and provides a user-friendly interface for defining and managing data expectations.

Great Expectations integrates into existing data workflows, and is widely used in various industries, including finance, healthcare, and e-commerce. It is free and can be customized.

OpenRefine

OpenRefine is a free and open-source data cleaning and transformation tool that allows users to explore, clean, and transform large datasets. It provides a user-friendly interface for performing a wide range of data cleaning tasks, such as removing duplicates, transforming data formats, and handling missing or inconsistent data. OpenRefine supports a variety of data sources, including CSV files, Excel spreadsheets, and JSON data, and provides powerful tools for filtering, faceting, and clustering data. With OpenRefine, data analysts can easily manipulate and reshape their data, and prepare it for further analysis or visualization.

OpenRefine is widely used in the data science and analytics community, and is a popular choice for cleaning and preparing data for machine learning and other advanced analytics tasks. OpenRefine is free and can be customized with plugins and extensions.

Data Visualization

Data visualization is the process of representing data visually, using charts, graphs, and other graphical elements to help people understand and interpret data more effectively. Effective data visualization is critical for communicating complex data insights and findings, and can help businesses make more informed decisions based on data.

Why Data Visualization is Important

Data visualization is important because it allows people to understand complex data more quickly and effectively than with tables or raw data alone. By representing data visually, data analysts and business users can identify patterns, trends, and outliers more easily, and gain insights that may not be apparent with raw data alone. Data visualization is also an important tool for communicating data findings and insights to non-technical stakeholders, such as executives, investors, or customers.

There are many libraries and tools available for creating data visualizations in a variety of programming languages.

Here are some popular data visualization libraries and tools:

Matplotlib

Matplotlib is a popular data visualization library for Python that provides a wide range of 2D and 3D plotting capabilities. It is a flexible and versatile library that can be used for creating a variety of charts and graphs, including line charts, bar charts, scatter plots, and histograms.

Seaborn

Seaborn is another popular data visualization library for Python that is built on top of Matplotlib. It provides a high-level interface for creating statistical graphics, such as heatmaps, violin plots, and box plots, and makes it easy to create complex visualizations with just a few lines of code.

Plotly

Plotly is a web-based data visualization platform that allows users to create interactive charts and graphs in a variety of programming languages, including Python, R, and JavaScript. Plotly provides a wide range of chart types and customization options, and allows users to create and share interactive dashboards and reports.

Visualizations in dashboards and Jupyter notebooks are often web-based, so native web tools can be helpful for analysts to understand and use.

D3.js

D3.js is a JavaScript library for creating dynamic, interactive data visualizations on the web. D3.js provides a low-level interface for creating custom visualizations, allowing users to control every aspect of their visualizations. This flexibility comes with a steeper learning curve, but also allows for greater control and customization options.

Tableau

Tableau is a powerful data visualization tool that provides a drag-and-drop interface for creating a wide range of visualizations, including maps, charts, and dashboards. Tableau is known for its ease of use and accessibility, and is a popular choice for data analysts and business users who need to create visualizations quickly and efficiently.

Tableau offers a range of pricing plans, including a free trial, and also provides a robust community of users and resources.

Microsoft Power BI

Power BI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards. Power BI allows users to connect to a wide range of data sources, including Excel spreadsheets, cloud-based and on-premises data sources, and more.

Power BI offers a range of pricing plans, including a free trial, and provides integration with other Microsoft tools and services.

Git

This page provides an overview of different techniques and skill levels related to Git, including basic, intermediate, and advanced techniques.

Basic

These are the basic skills, helpful even for beginning courses and activities.

Intermediate

These would be considered intermediate skills, applied in higher-level courses and activities.

Advanced

These are advanced skills, useful for more experienced users and advanced projects.

Subsections of Git

Basics

The following Git skills and techniques may be considered basic level.

Basic Git

  • Creating and cloning repositories: Know how to create new repositories on Git and how to clone existing repositories to your local machine so that you can work on them.

  • Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.

  • Pushing and pulling changes: Know how to push your local changes to your Git repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes.
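A minimal sketch of the first two skills, assuming Git is installed. It uses a throwaway local repository; the file name and identity are invented for the example.

```shell
# Create a repository, stage a change, and commit it.
set -e
cd "$(mktemp -d)"                          # throwaway directory
git init -q                                # new empty repository
git config user.email "demo@example.com"   # identity for this repo only
git config user.name "Demo User"
echo "# Notes" > README.md
git add README.md                          # stage the new file
git commit -q -m "Add README"              # record it in the history
git log --oneline                          # the commit now appears in the log
```

Pushing and pulling work the same way once a remote (such as a GitHub repository) has been added with git remote add.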

Intermediate

The following GitHub skills and techniques may be considered intermediate level.

Intermediate Git

  • Resolving merge conflicts: Learn how to handle conflicts that arise when merging branches or changes from multiple contributors.

  • Creating and managing branches: Know how to create and switch between branches, and how to merge changes between branches.

  • Using Git tags: Learn how to use Git tags to mark important points in your repository’s history, such as release versions or major milestones.

  • Reverting and resetting changes: Know how to revert changes to a previous commit, or reset your repository to a specific point in its history.

  • Understanding Git workflows: Gain a deeper understanding of Git workflows, such as Gitflow or GitHub flow, to better manage changes and collaboration in your projects.

  • Working with remote repositories: Know how to add and remove remote repositories, and how to push and pull changes between them and your local repository.
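A minimal branching-and-merging sketch, assuming a reasonably recent Git (git switch requires Git 2.23 or later); names and file contents are invented for the example.

```shell
# Create a feature branch, commit on it, and merge it back.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo User"
echo "one" > file.txt
git add file.txt
git commit -q -m "initial commit"
main=$(git branch --show-current)   # default branch name (main or master)
git switch -q -c feature            # create and switch to a new branch
echo "two" >> file.txt
git commit -q -am "feature change"
git switch -q "$main"               # back to the default branch
git merge -q feature                # fast-forward merge of the feature branch
git log --oneline                   # both commits are now on the default branch
```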

Advanced

The following Git skills and techniques may be considered advanced level.

Advanced Git

  • Rebasing: Know how to rebase a branch to update it with changes from another branch, while maintaining a clean history.

  • Cherry-picking: Know how to apply specific changes from one branch to another, without merging the entire branch.

  • Squashing commits: Know how to combine multiple commits into a single commit, to create a more coherent commit history.

  • Stashing: Know how to temporarily save changes that are not yet ready to be committed, so that you can work on other tasks without losing your progress.

  • Git hooks: Know how to create custom scripts that are automatically run by Git at specific times, such as before a commit or push.

  • Git submodules: Know how to use Git submodules to include one repository as a subdirectory of another repository, to better manage complex projects.

  • Git bisect: Know how to use Git bisect to find the commit that introduced a bug, by systematically testing different commits until the bug is found.
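Of these, stashing is the easiest to try safely. A quick sketch in a throwaway repository (file name and contents invented for the example):

```shell
# Shelve uncommitted work with git stash, then restore it.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo User"
echo "base" > f.txt
git add f.txt
git commit -q -m "base"
echo "work in progress" >> f.txt   # an uncommitted change
git stash -q                       # shelve it; the working tree is clean again
cat f.txt                          # only "base" remains for now
git stash pop -q                   # bring the shelved change back
cat f.txt                          # "base" plus the restored line
```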

GitHub

This page provides an overview of different techniques and skill levels related to GitHub, including basic, intermediate, and advanced techniques.

Basic

These are the basic skills, helpful even for beginning courses and activities.

Intermediate

These would be considered intermediate skills, applied in higher-level courses and activities.

Advanced

These are advanced skills, useful for more experienced users and advanced projects.

Subsections of GitHub

Basics

The following GitHub skills and techniques may be considered basic level.

Basic GitHub

[ ] Creating and cloning repositories: Know how to create new repositories on GitHub and how to clone existing repositories to your local machine so that you can work on them.

[ ] Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.

[ ] Pushing and pulling changes: Know how to push your local changes to your GitHub repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes.

Intermediate

The following GitHub skills and techniques may be considered intermediate level.

Intermediate GitHub

[ ] Working with branches: Know how to create and switch between branches, and how to merge changes between branches.

[ ] Using issues and pull requests: Know how to create and manage issues and pull requests, which are useful for tracking tasks, discussing changes, and requesting code reviews.

[ ] Collaboration: Know how to work collaboratively with others on a project using Git, including resolving merge conflicts and managing team workflows.

[ ] Rebasing: Know how to use the git rebase command to reapply changes from one branch onto another and resolve conflicts.

Advanced

The following GitHub skills and techniques may be considered advanced level.

Advanced GitHub

[ ] Git hooks: Know how to write and use Git hooks to automate tasks and enforce standards.

[ ] Git workflows: Know how to use Git workflows like GitFlow or GitHub Flow to manage complex projects and team collaboration.

[ ] Advanced Git commands: Be familiar with advanced Git commands like git cherry-pick, git bisect, and git stash.

[ ] Git submodules: Know how to use Git submodules to include and manage external dependencies in your projects.

[ ] Git LFS: Know how to use Git LFS (Large File Storage) to manage large binary files in your repositories.

[ ] CI/CD: Know how to integrate Git with Continuous Integration/Continuous Deployment (CI/CD) tools to automate testing, building, and deployment of your projects.

Machine Learning

Machine learning is a branch of artificial intelligence that involves the use of algorithms and statistical models to enable computer systems to improve their performance on a specific task over time. Machine learning is used in a wide range of applications, from natural language processing and computer vision to recommendation systems and fraud detection.

Why Machine Learning is Important

Machine learning is important because it allows computer systems to learn from data and improve their performance on a specific task without being explicitly programmed to do so. This enables systems to adapt and improve over time, and to make more accurate predictions or decisions based on data. Machine learning is also a powerful tool for automating complex tasks and processes, such as image recognition or natural language processing, and can help businesses make more informed decisions based on data.

There are many libraries and tools available for machine learning in a variety of programming languages. Some of the most popular machine learning libraries include:

Scikit-Learn

Scikit-Learn is a popular machine learning library for Python that provides a range of tools and algorithms for data mining, analysis, and modeling. It includes tools for classification, regression, clustering, and dimensionality reduction, and supports a wide range of data types and formats.
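As a quick taste of the library, a linear model can be fit in a few lines. This is a minimal sketch, not a full workflow; the toy data is invented for illustration, and it assumes scikit-learn and NumPy are installed.

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data following y = 2x + 1 exactly
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression()
model.fit(X, y)

print(model.coef_[0])        # slope, approximately 2.0
print(model.intercept_)      # intercept, approximately 1.0
print(model.predict([[5]]))  # approximately [11.]
```

The same fit/predict pattern carries over to Scikit-Learn's classifiers and clustering estimators, which is a large part of the library's appeal.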

TensorFlow

TensorFlow is an open-source machine learning library developed by Google that provides a range of tools and algorithms for building and training neural networks. It supports a wide range of platforms and devices, and includes tools for distributed computing, model optimization, and deployment.

Keras

Keras is a high-level machine learning library for Python that provides a user-friendly interface for building and training neural networks. It includes a wide range of pre-built models and layers, and supports both CPU and GPU acceleration for faster training and inference.

Other Machine Learning Tools

In addition to these popular machine learning libraries, there are many other tools and platforms available for machine learning, including PyTorch, Caffe, and Microsoft Azure Machine Learning. The choice of tool or library will depend on the specific needs and requirements of the machine learning project, as well as the programming language and skill set of the data analyst or team.

Microservices

Microservices are a software architecture pattern that breaks down a large monolithic application into smaller, independently deployable services that communicate with each other using APIs.

As a data analyst, understanding the concept of microservices can be useful when working with data-driven applications. Microservices make it easier to build, deploy, and maintain data-driven applications by isolating parts of the application into smaller, manageable services.

In a microservices architecture, each service has its own codebase, data storage, and dependencies. This makes it easier to update and deploy individual services without affecting the rest of the application. It allows flexibility in choosing different technologies for different services, depending on their requirements.

Microservices can be particularly useful for real-time processing and analysis of large volumes of data. By breaking down an application into smaller services, developers can optimize each service for its specific task, allowing more efficient processing and analysis.

Working with microservices requires additional skills and knowledge, including understanding APIs, containerization, and service discovery.

A solid foundation in programming and software development is required to work effectively with microservices-based applications.

Sentiment Analysis - Flask

Create a new Flask application.

python3 -m venv env
source env/bin/activate
pip install flask
pip install textblob

Create a new route in app.py.

from flask import Flask, request
from textblob import TextBlob

app = Flask(__name__)

@app.route('/sentiment', methods=['POST'])
def sentiment():
    # Expect a JSON body like {"text": "..."}
    text = request.json['text']
    blob = TextBlob(text)
    # polarity ranges from -1.0 (negative) to 1.0 (positive);
    # subjectivity ranges from 0.0 (objective) to 1.0 (subjective)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    return {'polarity': polarity, 'subjectivity': subjectivity}

if __name__ == '__main__':
    app.run(debug=True)

Run the app with the following command.

python app.py

Test the API with curl (or Postman).

curl --header "Content-Type: application/json" --request POST --data '{"text":"This is a positive sentence."}' http://localhost:5000/sentiment

Sentiment Analysis - AWS Lambda

Alternatively, we could create a simple function and host it on Amazon Web Services (AWS) Lambda. AWS offers a free tier that allows up to one million requests per month.

Python

This page provides an overview of different techniques and skill levels related to Python, including basic, intermediate, and advanced techniques.

Basic

These are the basic skills, helpful even for beginning courses and activities.

Intermediate

These would be considered intermediate skills, applied in higher-level courses and activities.

Advanced

These are advanced skills, useful for more experienced users and advanced projects.

Subsections of Python

Basics

The following Python skills and techniques may be considered basic level in the context of data analysis.

Data Structures

  • Lists: Know how to create and manipulate lists, and use them to store and organize data.

  • Dictionaries: Know how to create and manipulate dictionaries, and use them to store and organize data in key-value pairs.

Control Structures

  • Conditional Statements: Know how to use if-else statements to conditionally execute code.

  • Loops: Know how to use for and while loops to iterate over data.

Functions

  • Defining Functions: Know how to define functions to organize and reuse code.

  • Lambda Functions: Know how to define and use lambda functions for short and simple functions.

File I/O

  • Reading and Writing Files: Know how to read and write data from files using Python.

External Libraries

  • NumPy: Know how to use NumPy to perform numerical operations and calculations.

  • pandas: Know how to use pandas to work with structured data and perform data analysis tasks.

  • Matplotlib: Know how to use Matplotlib to create basic plots and visualizations.

These skills and the associated techniques provide a strong foundation for data analysis in Python, and can be built upon with more advanced topics and libraries as needed.
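A short sketch of how these libraries work together in practice. The dataset and column names here are purely illustrative, and Matplotlib is omitted only because a plot produces no text output; it would slot in with `df.plot()`.

```python
import numpy as np
import pandas as pd

# A tiny, made-up dataset
df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "sales": [100, 150, 200],
})

# NumPy for numerical operations and calculations
print(np.mean(df["sales"]))  # 150.0

# pandas for structured-data analysis
print(df["sales"].describe())
```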

Intermediate

This page provides an overview of intermediate skills for working with Python in the context of data analysis.

External Libraries

  • NumPy: Know how to work with arrays, manipulate data, and perform mathematical operations.

  • pandas: Know how to work with data frames and manipulate data for exploratory data analysis.

  • Matplotlib: Know how to create customized visualizations for data analysis.

Data Cleaning

  • Merging and joining data frames: Know how to combine data from multiple sources.

  • Handling missing data: Know how to identify missing data and impute it using various methods.

  • Data normalization and scaling: Know how to standardize data and scale it to compare across different variables.
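These cleaning steps can be sketched with pandas; the tables, key names, and values below are invented for the example, not drawn from any real dataset.

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
orders = pd.DataFrame({"id": [1, 2], "total": [50.0, 75.0]})

# Merge two sources on a shared key; customers without orders get NaN
merged = customers.merge(orders, on="id", how="left")

# Impute missing totals with 0, then min-max scale to [0, 1]
merged["total"] = merged["total"].fillna(0)
merged["total_scaled"] = (merged["total"] - merged["total"].min()) / (
    merged["total"].max() - merged["total"].min()
)
print(merged)
```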

Data Analysis

  • Descriptive statistics: Know how to calculate basic summary statistics like mean, median, and standard deviation.

  • Inferential statistics: Know how to perform hypothesis testing and confidence intervals.

  • Regression analysis: Know how to perform linear regression and interpret regression coefficients.
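NumPy alone can illustrate the descriptive-statistics and regression pieces; all of the numbers below are made up for the example.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Descriptive statistics
print(np.mean(data))    # 5.0
print(np.median(data))  # 4.5
print(np.std(data))     # 2.0 (population standard deviation)

# Simple linear regression via least squares
x = np.array([1, 2, 3, 4])
y = np.array([2.1, 3.9, 6.1, 8.0])
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # slope is close to 2
```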

Workflow and Collaboration

  • Version control with Git: Know how to use Git for version control and collaborate with others on code.

  • Unit testing and debugging: Know how to write and run unit tests and debug code.

  • Code organization and project structure: Know how to structure a Python project for scalability and reproducibility.

Type Hints

  • Type hints: Know how to use type hints in Python to specify function argument types, return types, and class attributes.

Employing important new features such as type hints shows a deeper understanding of Python and a commitment to writing clean, maintainable, and efficient code.

By using type hints, developers improve the documentation of their code, catch errors more easily, and help other developers understand how to use their code.

With the increasing adoption of type hints in the Python community, it is becoming an essential intermediate to advanced skill for those working on larger projects or collaborating with other developers.

def add_numbers(x: int, y: int) -> int:
    return x + y

The type hints are specified using the : syntax, where x: int means that x is of type int. The -> int syntax after the function arguments specifies the return type of the function as int.

Type hints are not enforced by the Python interpreter, but are used by static analysis tools and linters to catch type-related errors early in the development process.
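Hints also cover container and optional types, and as noted they are not enforced at runtime. A small illustrative sketch (the built-in `list[float]` syntax requires Python 3.9 or newer):

```python
from typing import Optional

def mean(values: list[float]) -> Optional[float]:
    """Return the mean of values, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # 2.0
print(mean([]))               # None

# Hints are not enforced at runtime: passing ints here still works,
# but a static checker such as mypy would flag the mismatch.
result = mean([1, 2, 3])
print(result)  # 2.0
```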

Advanced

Advanced Python Skills

These skills are considered advanced and will be useful for more advanced data analysis tasks.

Object-Oriented Programming

  • Understand the basics of object-oriented programming (OOP) and how to apply it in Python.
  • Create and use classes to encapsulate related data and functionality.
  • Use inheritance and polymorphism to extend existing classes and create new ones.
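A compact sketch of these ideas, with hypothetical class names chosen for illustration:

```python
class Dataset:
    """Encapsulate a named collection of numeric values."""

    def __init__(self, name, values):
        self.name = name
        self.values = values

    def summary(self):
        return f"{self.name}: n={len(self.values)}"


class LabeledDataset(Dataset):
    """Extend Dataset via inheritance, overriding summary()."""

    def __init__(self, name, values, label):
        super().__init__(name, values)
        self.label = label

    def summary(self):  # polymorphism: same interface, new behavior
        return f"[{self.label}] {super().summary()}"


ds = LabeledDataset("sales", [1, 2, 3], "Q1")
print(ds.summary())  # [Q1] sales: n=3
```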

Functional Programming

  • Understand the principles of functional programming and how to use functional programming concepts in Python.
  • Use lambda functions and higher-order functions to create more expressive and powerful code.
  • Apply functional programming techniques to data processing and analysis tasks.
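A brief sketch of this style applied to a small data-processing task; the prices and tax rate are invented for the example.

```python
from functools import reduce

prices = [19.99, 5.49, 3.50]

# Lambda functions with the higher-order functions map, filter, and reduce
with_tax = list(map(lambda p: round(p * 1.08, 2), prices))
expensive = list(filter(lambda p: p > 5, with_tax))
total = reduce(lambda a, b: a + b, with_tax, 0)

print(with_tax)         # [21.59, 5.93, 3.78]
print(expensive)        # [21.59, 5.93]
print(round(total, 2))  # 31.3
```

In practice, list comprehensions are often preferred over `map` and `filter` for readability, but the functional forms compose well in pipelines.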

Decorators

  • Understand what decorators are and how to use them to modify the behavior of functions and methods.
  • Use built-in Python decorators like @property, @staticmethod, and @classmethod.
  • Create custom decorators to add functionality to your code.
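As an illustration, here is one possible custom decorator that times function calls; the `timed` name and behavior are invented for this example, not a standard utility.

```python
import functools
import time

def timed(func):
    """A custom decorator that reports how long each call takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timed
def total(values):
    return sum(values)

print(total(range(1_000_000)))  # prints a timing line, then 499999500000
```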

Generators and Iterators

  • Understand the difference between generators and iterators and how to use them in Python.
  • Use generators to lazily generate and process data without creating large in-memory data structures.
  • Implement custom iterators to provide custom ways of iterating over data.
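A minimal sketch of a custom generator; the `running_total` function is invented for illustration.

```python
def running_total(values):
    """Lazily yield cumulative sums without building a list in memory."""
    total = 0
    for v in values:
        total += v
        yield total

gen = running_total([1, 2, 3, 4])
print(next(gen))  # 1
print(list(gen))  # [3, 6, 10] -- the generator resumes where it left off
```

Because values are produced one at a time, the same function works unchanged on inputs far too large to hold in memory.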

Concurrency and Parallelism

  • Understand the difference between concurrency and parallelism and how to achieve both in Python.
  • Use threads and processes to perform multiple tasks simultaneously.
  • Use asynchronous programming techniques to handle I/O-bound tasks efficiently.
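As a small sketch, a thread pool can run several simulated I/O-bound calls at once; the `slow_square` function is illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    time.sleep(0.01)  # simulate an I/O-bound wait
    return n * n

# map preserves input order even though calls overlap in time
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_square, range(5)))

print(results)  # [0, 1, 4, 9, 16]
```

For CPU-bound work, processes (`ProcessPoolExecutor`) are usually the better fit, since threads in CPython share one interpreter lock.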

Performance Optimization

  • Understand how to optimize Python code for performance.
  • Use profiling tools to identify performance bottlenecks in your code.
  • Apply performance optimization techniques like caching, memoization, and vectorization to speed up your code.
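Memoization, for instance, can be added with the standard library alone; this sketch caches a deliberately naive recursive function.

```python
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    """Naive recursion made fast by caching previously computed results."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040 -- near-instant, versus ~1.6 million calls uncached
```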

Independent Study

Books remain a surprisingly cost-effective investment.

When you’re ready to truly master this powerful language, consider investing in a top-rated book like “Fluent Python” by Luciano Ramalho. The second edition, published in March 2022, covers features through Python 3.10.

Another strong choice is “High Performance Python: Practical Performant Programming for Humans” by Micha Gorelick and Ian Ozsvald, which covers high-performance options for processing big data, multiprocessing, and more.

GitHub Resources

Participate in Open Source

Chapter 5

Hosting

This chapter provides an introduction to popular hosting providers.

Binder

Binder is a cloud-based platform that allows users to share and execute Jupyter Notebooks as interactive environments. It utilizes Docker containers to create custom environments for notebooks and provides public and private hosting options. Binder runs on a public infrastructure and is provided as a free service by the Binder team.

BinderHub

BinderHub is the technology that powers Binder. Built on top of Kubernetes and Docker, it allows organizations to deploy their own Binder-style service for creating and running custom Jupyter environments. Unlike the public Binder service, BinderHub is aimed at teams that need their own scalable, customizable deployment rather than just sharing and executing notebooks.

BitBucket

BitBucket is a popular hosting provider for Git repositories. It provides users with a cloud-based platform to store and manage their Git projects, and includes features such as code collaboration and continuous integration.

DockerHub

DockerHub is a cloud-based repository for Docker images and containers. It provides users with a centralized location to store, manage and share their Docker images, and allows for easy distribution and deployment of containerized applications. DockerHub also provides features such as automated builds, which allows users to automatically build Docker images from source code repositories.

GitHub

GitHub is a widely used platform for hosting Git repositories, collaborating on code, and managing projects. In addition to hosting Git repositories, GitHub provides users with a variety of tools for project management, such as issue tracking, code reviews, and continuous integration.

GitHub Pages

GitHub Pages is a free service for hosting static websites directly from a GitHub repository. Users can create websites using static site generators or by manually writing HTML, CSS, and JavaScript. GitHub Pages also provides custom domain support, SSL encryption, and GitHub Actions for automating website builds and deployments.

JupyterHub

JupyterHub is a multi-user server for Jupyter Notebooks that allows multiple users to access and use the same Jupyter Notebook environment simultaneously.

Choosing Hosting for Jupyter Notebooks

The choice of platform to share a Jupyter Notebook depends on the intended use case, audience, and level of collaboration required.

Here are some guidelines for when to use each platform.

GitHub

Use GitHub to share Jupyter Notebooks that are part of a larger project, and to collaborate on code with others. GitHub provides version control, code reviews, and issue tracking, making it an ideal platform for collaborating on code. However, GitHub is not optimized for executing Jupyter Notebooks in the cloud, and users may need to download and run the notebooks locally to interact with them.

Binder

Use Binder to share Jupyter Notebooks that require specific software dependencies, or to share a single notebook with a small group of users. Binder provides a cloud-based environment that allows users to interact with Jupyter Notebooks directly in their web browser, without having to install any software. However, Binder is not intended for large-scale collaboration, and may not be suitable for running notebooks with high computational requirements.

BinderHub

Use BinderHub to create a custom, scalable cloud-based environment for running Jupyter Notebooks. BinderHub is built on top of Kubernetes and Docker, providing automatic scaling and load balancing for your Jupyter Notebook environments. BinderHub is suitable for sharing Jupyter Notebooks with a large audience, and provides customizable environments and version control integration. However, setting up and managing a BinderHub instance requires technical expertise and resources.

JupyterHub

Use JupyterHub to provide a shared computing environment for a group of users. JupyterHub allows multiple users to access and use the same Jupyter Notebook environment simultaneously, making it suitable for classroom settings or collaborative research projects. JupyterHub provides user authentication and access control, and can be customized to provide specific software dependencies and resources for different groups of users. However, setting up and managing a JupyterHub instance requires technical expertise and resources.

Subsections of Hosting

Binder

Binder is a cloud-based platform that allows you to share your Jupyter Notebooks as interactive, executable environments. It’s built on top of Docker, and provides a way to easily share and launch Jupyter Notebooks in the cloud.

Here are some key features of Binder.

Customizable environments

Binder allows you to create custom environments for your Jupyter Notebooks using Docker containers. This makes it easy to package and distribute your code and dependencies, and ensures that your environment is consistent across different machines.

Public and private hosting

Binder allows you to host your Jupyter Notebooks on either a public or private server. Public hosting allows anyone to access and run your notebooks, while private hosting restricts access to authorized users.

Version control

Binder integrates with GitHub and other version control platforms, allowing you to easily share and update your notebooks as you develop and improve them.

Scalability

Binder provides automatic scaling and load balancing for your Jupyter Notebook environments. This ensures that your notebooks are always available and responsive, no matter how many users are accessing them.

Accessibility

Because Binder runs in the cloud, it provides a way to make your Jupyter Notebooks accessible from anywhere, on any device with an internet connection. This makes it easy to collaborate on projects and share your work with others.

Get Started

To get started with Binder, you can either use the public hosting service provided by the Binder team, or you can set up your own instance of Binder on your own servers. The Binder documentation provides detailed instructions for both options.

BinderHub

BinderHub is a cloud-based platform that allows you to share your Jupyter Notebooks as interactive, executable environments. It’s built on top of Kubernetes and Docker, and provides a way to easily share and launch Jupyter Notebooks in the cloud.

Here are some key features of BinderHub.

Customizable environments

BinderHub allows you to create custom environments for your Jupyter Notebooks using Docker containers. This makes it easy to package and distribute your code and dependencies, and ensures that your environment is consistent across different machines.

Public and private hosting

BinderHub allows you to host your Jupyter Notebooks on either a public or private server. Public hosting allows anyone to access and run your notebooks, while private hosting restricts access to authorized users.

Version control

BinderHub integrates with GitHub and other version control platforms, allowing you to easily share and update your notebooks as you develop and improve them.

Scalability

BinderHub is built on top of Kubernetes, which provides automatic scaling and load balancing for your Jupyter Notebook environments. This ensures that your notebooks are always available and responsive, no matter how many users are accessing them.

Accessibility

Because BinderHub runs in the cloud, it provides a way to make your Jupyter Notebooks accessible from anywhere, on any device with an internet connection. This makes it easy to collaborate on projects and share your work with others.

Get Started

To get started with BinderHub, you can either use the public hosting service provided by the Binder team, or you can set up your own instance of BinderHub on your own servers. The BinderHub documentation provides detailed instructions for both options.

Bitbucket

Bitbucket is a web-based version control repository hosting service owned by Atlassian. It is designed for team collaboration and primarily used for source code and development projects that use the Git revision control system (Bitbucket also supported Mercurial until that support was retired in 2020). Bitbucket provides a platform for teams to plan projects, collaborate on code, test, and deploy applications efficiently.

Atlassian is the company behind Jira, a popular project management and issue tracking software. Jira is designed to help teams plan, track, and manage their work, including software development, agile project management, bug tracking, and task management. It offers a flexible and customizable platform that allows teams to organize their work, prioritize tasks, and monitor progress efficiently.

Dockerhub

Docker Hub is a cloud-based repository service provided by Docker, similar to GitHub, but specifically designed for sharing and managing Docker images. Just as GitHub is a platform for storing and collaborating on code repositories, Docker Hub allows you to store, share, and collaborate on Docker images. Docker Hub simplifies the process of distributing and deploying containerized applications and streamlines collaboration with other developers.

Here’s a brief introduction to some key features.

Public and private repositories

Docker Hub allows you to create both public and private repositories for your Docker images. Public repositories are accessible to everyone, while private repositories can only be accessed by authorized users.

Image versioning

Docker Hub supports versioning of Docker images using tags, which allows you to maintain multiple versions of an image in the same repository. This is similar to using branches in a Git repository to manage different versions of your code.

Automated builds

You can link your Docker Hub repository to a GitHub or Bitbucket repository, enabling automated builds of Docker images whenever code is pushed to the linked repository. This feature ensures that your Docker images are always up-to-date with your source code.

Webhooks

Docker Hub supports webhooks, which can be used to trigger events or notifications when a new image is pushed to a repository. This can help automate deployment workflows and keep your applications up-to-date.

Official images

Docker Hub hosts a wide range of official images for popular programming languages, frameworks, and tools. These images are maintained by their respective organizations or developers and can be used as a base for building your own custom images.

Community-contributed images

In addition to official images, Docker Hub also hosts thousands of community-contributed images. These images are created and maintained by Docker users and can be a valuable resource when you’re looking for a pre-built solution or starting point for your own images.

Get Started

To get started with Docker Hub, sign up for a free account. Once you have an account, you can create your own repositories, browse and search for images, and collaborate with other users.

If you’re familiar with GitHub, you’ll find many similarities in the way Docker Hub organizes and manages repositories, making it easy to transition between the two platforms.

Github

GitHub is a web-based platform that allows developers to store and manage their code and collaborate with others. It is built on top of Git, which is a distributed version control system that allows developers to track changes to their code over time and collaborate with others on the same codebase.

With GitHub, developers can create their own repositories, which are essentially folders that contain their code, documentation, and other files related to a specific project. They can also fork other people’s repositories to create their own copies, which they can then modify and contribute back to the original repository. This allows for easy collaboration and code sharing among developers.

GitHub also provides a range of tools for developers to manage their code, such as the ability to track and resolve issues, review and merge pull requests, and create and manage branches. It also provides a web-based interface for viewing and editing code, as well as a built-in code editor. Additionally, it has a wide range of integrations and APIs that allow developers to automate various development tasks and integrate with other tools and services.

Integrated Editing - New!

You can use the github.dev web-based editor to edit files and commit your changes from your browser.

To try it, when viewing a repository in your browser:

  1. Change the URL from “github.com” to “github.dev”.
  2. When viewing a file, use the dropdown menu next to the edit (pencil) icon and select github.dev.

Read more at https://docs.github.com/en/codespaces/the-githubdev-web-based-editor.

GitHub Renders Executed Jupyter Notebooks

GitHub has the ability to render Jupyter notebooks directly within its platform, allowing users to view and interact with notebooks without having to download or install any software. However, this feature only works if the Jupyter notebook has been executed first and saved as an .ipynb file.

When a Jupyter notebook is executed, the output cells are stored within the .ipynb file along with the code cells. This means that when the notebook is opened on GitHub, the output cells can be rendered and displayed alongside the code.

If a Jupyter notebook has not been executed before being uploaded to GitHub, then the output cells will not be present in the file, and therefore will not be displayed when the notebook is opened on GitHub. In this case, users will need to download the notebook and execute it locally in order to view the output cells.

GitHub Alternatives

Read More

Read more about GitHub at the links below.

Github Pages

GitHub Pages is a free and easy way to host static websites directly from your GitHub repositories. It is ideal for hosting personal blogs, project documentation, or simple websites. With GitHub Pages, you can quickly create a site using Markdown, HTML, CSS, and JavaScript, and have it automatically generated and hosted by GitHub.

Some key features of GitHub Pages include:

Easy setup

Setting up a GitHub Pages site is simple and requires only a few steps. You can create a new repository or use an existing one, add your content, and enable GitHub Pages in the repository settings.

Custom domains

By default, GitHub Pages sites are hosted under a github.io subdomain, but you can also configure a custom domain for your site.

Jekyll integration

GitHub Pages integrates seamlessly with Jekyll, a popular static site generator. This allows you to write your content in Markdown, use templates and themes, and have Jekyll automatically generate your site’s HTML, CSS, and JavaScript.

Hugo support

GitHub Pages also supports hosting websites generated with Hugo. Hugo is an open-source static site generator written in Go, known for its speed and flexibility. It enables you to create websites using Markdown, HTML, CSS, and JavaScript, and comes with a rich set of features, including templates, themes, shortcodes, and built-in support for taxonomies such as categories and tags. Hugo is designed to handle large sites efficiently, making it a popular choice for blogs, portfolios, and documentation sites.

Version control

Since GitHub Pages sites are hosted directly from your GitHub repositories, you get all the benefits of version control. This makes it easy to track changes, collaborate with others, and revert to previous versions of your site if needed.

SSL support

GitHub Pages provides free SSL support for both github.io subdomains and custom domains, ensuring secure connections for your site visitors.

JupyterHub

JupyterHub is a multi-user server for Jupyter Notebooks that allows multiple users to access and use the same Jupyter Notebook environment simultaneously. This is useful for teaching or collaborative work where multiple people need to work on the same code or data simultaneously.

How JupyterHub Works

JupyterHub allows users to access their own private Jupyter Notebook servers from within a shared environment. This means that each user can have their own Jupyter Notebook environment, complete with their own set of packages and dependencies, while still being able to share data and collaborate with others in the same project.

JupyterHub runs on a server and can be accessed through a web browser. Once logged in, users can create new notebooks, access existing notebooks, and share their work with others in the same project.

Key Features of JupyterHub

Multi-user support

JupyterHub is designed to support multiple users accessing the same Jupyter Notebook environment simultaneously. This makes it ideal for teaching environments or collaborative projects where multiple people need to work on the same code or data at the same time.

Customizable environments

Each user in JupyterHub can have their own customized Jupyter Notebook environment, complete with their own set of packages and dependencies. This means that users can work with the tools and libraries they are most comfortable with, while still collaborating with others in the same project.

Centralized control

JupyterHub provides centralized control over user access and permissions, making it easy to manage access to Jupyter Notebook servers and data. This makes it easy to control who has access to what data and to ensure that users are only able to access the data and tools they need for their work.

Scalability

JupyterHub is designed to be scalable and can support large numbers of users simultaneously. This makes it ideal for use in teaching environments or for collaborative projects with a large number of participants.

Getting Started with JupyterHub

To get started with JupyterHub, you will need to set up a server that meets the system requirements for running JupyterHub. Once you have set up your server, you can install JupyterHub using a package manager like pip or conda.

Once JupyterHub is installed, you can configure it to meet your specific needs, including setting up user accounts, creating custom environments, and managing access and permissions.

JupyterHub is a powerful tool for collaborative work and teaching environments, and provides a flexible and customizable environment for working with Jupyter Notebooks.

Chapter 6

Data

This chapter provides an introduction to key aspects of data management.

Data At Rest

This section covers data at rest, storage technologies, and architectures, including data lakes and lake house architectures.

Data In Motion

This section covers data in motion, pipeline technologies, and architectures, including stream processing and ETL.

Data Formats

This section covers common data formats used in modern analytics, including CSV, JSON, and Parquet.

Data Stores

This section covers popular data storage systems, including SQL and NoSQL databases, graph databases, and key-value stores. Examples include PostgreSQL, MongoDB, Neo4j, and Redis.

Message Queues

This section covers popular message queue systems used to route information from decoupled applications and services. It introduces popular options including RabbitMQ and Apache Kafka.

Data Processing Engines

This section covers popular data processing engines, including Apache Spark, Apache Flink, and Apache Beam.

Subsections of Data

Data at Rest

Data at rest refers to data that is stored and not actively being processed or transmitted. This includes data stored in databases, data warehouses, data lakes, and other storage systems.

Data Lakes

A data lake is a centralized repository that allows organizations to store and manage large amounts of structured and unstructured data at scale. Data lakes provide a cost-effective and flexible way to store data from various sources and formats, and can support a wide range of data processing and analytics tools.

Delta Lake

Delta Lake is an open-source data storage and management system that is designed for large-scale data lakes. It was developed by Databricks, and it provides a number of advanced features, including ACID transactions, version control, and schema enforcement, that make it easier to manage large and complex data sets.

Delta Lake is built on top of Apache Spark, and it provides a unified platform for managing structured, semi-structured, and unstructured data in a single system. With Delta Lake, data analysts and engineers can build and manage data lakes and ensure their data is accurate, consistent, and reliable.

Lake House Architecture

Lake house architecture is an emerging approach to data storage and management that combines the benefits of data lakes and traditional data warehouses. It provides a scalable and flexible platform for storing and processing data, while also providing strong data governance and security controls.

SQL Databases

SQL databases are a popular type of relational database that use Structured Query Language (SQL) to manage and retrieve data. SQL databases are widely used for a range of applications, including transaction processing, data warehousing, and analytics.

Some popular SQL databases include:

  • MySQL
  • PostgreSQL
  • Microsoft SQL Server
  • Oracle Database

NoSQL Databases

NoSQL databases are non-relational databases that can handle large volumes of unstructured and semi-structured data. They are often used for applications that require high scalability and performance, such as real-time data processing and analytics.

Some popular NoSQL databases include:

  • MongoDB
  • Cassandra
  • Couchbase
  • Amazon DynamoDB

Graph Databases

Graph databases are specialized databases designed to store and manage graph data structures. They are often used for applications that involve complex relationships and dependencies, such as social networks and recommendation systems.

Some popular graph databases include:

  • ArangoDB
  • Neo4j
  • OrientDB

Data Warehouses

Data warehouses are specialized databases designed for storing and analyzing large volumes of structured data. They are often used for business intelligence and analytics applications, and typically support complex queries and reporting.

Some popular data warehouses include:

  • Snowflake
  • Amazon Redshift
  • Google BigQuery
  • Microsoft Azure Synapse Analytics (formerly Azure SQL Data Warehouse)

Snowflake (Commercial)

Snowflake is a cloud-based data warehousing platform that allows organizations to store, manage, and analyze large volumes of structured and semi-structured data. Its architecture separates storage and compute, allowing users to scale each independently and pay only for what they use, which makes it a popular choice for data-intensive workloads that require fast and flexible access to data. Snowflake also provides a range of tools and features for data management, such as data sharing, secure data exchange, and automated data governance, which can help organizations simplify their data operations and reduce costs. It is gaining popularity among organizations of all sizes and industries.

Open-Source Options

There are several open source alternatives to Snowflake that offer similar functionality, including:

Apache Druid

A high-performance, column-oriented, distributed data store designed for real-time analytics. Druid is optimized for OLAP queries on large-scale datasets and offers sub-second query response times.

Apache Pinot

Another high-performance, distributed data store designed for OLAP queries. Pinot is optimized for handling large-scale datasets and supports real-time ingestion and querying of data.

ClickHouse

A high-performance column-oriented database management system designed for OLAP queries. ClickHouse is optimized for handling large-scale datasets and provides sub-second query response times. It also supports real-time data ingestion and offers a variety of storage formats and compression algorithms.

Presto

A distributed SQL query engine designed for querying large-scale datasets stored in a variety of data sources, including Hadoop, HDFS, Amazon S3, and more. Presto is optimized for ad hoc querying and provides fast query response times.

Apache Arrow

A columnar in-memory data structure and computation framework designed for high-performance data processing. Arrow is optimized for efficient data sharing and serialization across different programming languages and systems.

Data in Motion

Data in motion refers to data that is actively being transmitted or processed in a system, often through pipelines that move data between different stages of processing.

Data Pipelines

Data pipelines are a series of connected processing elements that move data between different stages of processing. A typical data pipeline involves several stages of processing, such as ingestion, transformation, enrichment, and analysis, with data moving from one stage to the next as it is processed.

Pipelines can be used to move data between different applications or systems, or to transform and enrich data as it moves through a system. They are often used in real-time or near-real-time applications, such as stream processing or real-time analytics.
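The stage-by-stage flow can be sketched with chained Python generators, where each stage lazily consumes the previous stage's output. The stage names and records below are invented for illustration:

```python
# A toy pipeline: ingest -> transform -> analyze, each stage a generator.

def ingest(records):
    """Ingestion stage: yield raw records one at a time."""
    for record in records:
        yield record

def transform(records):
    """Transformation stage: normalize names and scale values."""
    for record in records:
        yield {"name": record["name"].strip().lower(), "value": record["value"] * 2}

def analyze(records):
    """Analysis stage: reduce the stream to a single summary value."""
    total = 0
    for record in records:
        total += record["value"]
    return total

raw = [{"name": " Alice ", "value": 10}, {"name": "Bob", "value": 5}]
result = analyze(transform(ingest(raw)))
print(result)  # 30
```

Because each stage is a generator, records flow through one at a time rather than being materialized in full between stages, which is the same principle real pipeline frameworks apply at scale.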

Stream Processing

Stream processing is a type of data processing that involves processing data as it is generated or ingested, rather than processing it after it has been stored. Stream processing can be used to analyze or filter data in real-time, and is often used in applications such as fraud detection, sensor data processing, and real-time analytics.

Popular stream processing platforms include Apache Kafka, Apache Flink, and Apache Storm.

Batch Processing

Batch processing involves processing large volumes of data in a batch or offline mode, often in a scheduled or periodic manner. Batch processing can be used to perform complex data transformations, such as ETL (extract, transform, load) operations, and is often used in applications such as data warehousing and business intelligence.

Popular batch processing platforms include Apache Spark and Apache Hadoop.
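The contrast between the two styles can be shown in a few lines of plain Python: batch processing waits for the full dataset, while stream processing updates its result as each item arrives. The running-average example is a toy illustration:

```python
# Batch: compute the average only after all readings are collected.
readings = [3.0, 5.0, 4.0, 8.0]
batch_average = sum(readings) / len(readings)

# Stream: maintain a running average, updated as each reading arrives.
count, running_average = 0, 0.0
for reading in readings:
    count += 1
    # Incremental mean update: no need to store the whole stream.
    running_average += (reading - running_average) / count

print(batch_average, running_average)  # both 5.0
```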

Data Integration

Data integration involves combining data from multiple sources and making it available for analysis or processing. This can involve moving data between different systems or applications, or transforming and merging data to create a unified view of the data.

Popular data integration platforms include Apache NiFi, Talend, and Microsoft Azure Data Factory.

Data Governance

Data governance involves managing the availability, usability, integrity, and security of data used in an organization. Data governance can be applied to data in motion as well as data at rest, and involves establishing policies and procedures for managing data throughout its lifecycle.

Popular data governance platforms include Collibra and Informatica.

Summary

Data in motion is a critical component of modern data architectures, and involves moving and processing data in real-time or near-real-time. Data pipelines, stream processing, batch processing, data integration, and data governance are all important aspects of managing and analyzing data in motion.

Data Formats

Modern analytics involves working with a variety of data formats that can be structured or unstructured, batch or streaming, and of varying sizes. Understanding these different data formats is essential for data analysts to effectively work with and derive insights from data.

CSV

Comma-separated values (CSV) is a widely-used file format for storing and exchanging tabular data. It consists of rows of data, where each row represents a single record, and columns of data, where each column represents a specific attribute of the record. CSV files are easy to create and read, and can be easily imported and exported by most software applications.
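Python's standard library can read CSV directly; the data below is an in-memory stand-in for a file on disk:

```python
import csv
import io

# An in-memory CSV document with a header row (stands in for a file).
text = "name,score\nalice,90\nbob,85\n"

# DictReader maps each row to a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"], rows[0]["score"])  # alice 90
```

Note that CSV carries no type information: every value arrives as a string, so numeric columns must be converted explicitly.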

JSON

JavaScript Object Notation (JSON) is a lightweight and flexible file format for storing and exchanging data. It consists of key-value pairs, where each key represents a specific attribute of the data, and each value represents the value of that attribute. JSON files are easy to read and write, and can be easily parsed by most programming languages.
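In Python, the standard-library json module handles both directions; the document below is invented for illustration:

```python
import json

# A JSON document as a string (could equally come from a file or an API).
text = '{"name": "alice", "scores": [90, 85], "active": true}'

record = json.loads(text)   # parse JSON text into Python objects
print(record["scores"][0])  # 90

# Serialize back to indented JSON text.
print(json.dumps(record, indent=2))
```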

YAML

YAML is a human-readable data serialization language that is commonly used for configuration files, data exchange, and data storage.

The name “YAML” stands for “YAML Ain’t Markup Language,” highlighting its focus on being a data-oriented language rather than a markup language like XML or HTML.

YAML syntax is designed to be simple and easy to understand, using indentation and a minimal set of punctuation marks.

It supports a wide range of data types, including strings, integers, lists, and dictionaries. YAML is often used in software development to define configuration settings for applications and services, and in data analysis and machine learning workflows to define and exchange data. Its simplicity, readability, and support for hierarchical data structures and lists make it a popular choice for configuration files.
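A small configuration file shows the indentation-based syntax; the keys and values below are invented for illustration:

```yaml
# Hypothetical application configuration
app:
  name: report-generator
  debug: false
database:
  host: localhost
  port: 5432
allowed_users:
  - alice
  - bob
```

Nesting is expressed purely by indentation, and the leading `-` marks list items, which is why YAML files stay readable without any closing brackets or tags.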

Python pyproject.toml

In Python, the pyproject.toml file is used to specify project metadata, dependencies, and build configurations in a standard format that can be easily read and understood by humans.

The pyproject.toml file is often used with pip and Poetry package managers to manage Python projects and their dependencies.

The pyproject.toml file was introduced in PEP 518
as a way to specify build system requirements for Python projects. The file is used with build tools like flit, poetry, and setuptools to manage build configuration, dependencies, and metadata.

It is becoming increasingly popular in the Python ecosystem as it provides a unified way to manage and specify various aspects of a Python project’s build process. The pyproject.toml file typically includes information such as:

  • project name,
  • version,
  • author,
  • license,
  • dependencies, and
  • build tool configuration.

By using this file, developers ensure their projects are built in a consistent and reproducible manner across different systems and environments.
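A minimal pyproject.toml using the standard [project] table might look like this; the project name, author, and dependencies are placeholders:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "example-analytics"
version = "0.1.0"
authors = [{ name = "Jane Analyst" }]
license = { text = "MIT" }
dependencies = [
    "pandas>=2.0",
]
```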

Julia Project.toml

In Julia, the Project.toml file serves a similar purpose, providing a standard format for specifying project metadata and dependencies. The Manifest.toml file is used to keep track of the exact versions of packages that have been installed in a Julia environment.

Parquet

Apache Parquet is a columnar storage format for Hadoop that is optimized for performance and efficiency. It stores data in a compressed and binary format, and is designed to work with big data processing frameworks like Apache Spark and Apache Hadoop. Parquet files can be easily read and written by most data processing tools, and provide fast access to large amounts of data.
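The columnar idea behind Parquet can be illustrated without Parquet itself: instead of storing one record per row, each column's values are stored together, so a query that touches one column never reads the others. A toy sketch in plain Python:

```python
# Row-oriented layout: one dict per record (how CSV/JSON data usually arrives).
rows = [
    {"city": "Oslo", "temp": 4},
    {"city": "Lima", "temp": 21},
    {"city": "Pune", "temp": 28},
]

# Column-oriented layout: one list per attribute, as a columnar format stores it.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A scan over a single column ignores everything else.
print(max(columns["temp"]))  # 28
```

Storing each column contiguously is also what makes the aggressive compression in formats like Parquet possible, since values of the same type and similar range sit next to each other.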

Avro

Apache Avro is a binary data serialization system that is designed for high-performance and efficient data processing. It provides a compact binary format for data storage and exchange, and includes features for schema evolution and data compression. Avro files can be easily read and written by most programming languages, and are widely used in big data processing and messaging systems.

Other Data Formats

There are many other data formats used in modern analytics, including XML and Apache Arrow.

The choice of data format depends on the specific needs and requirements of the data analysis project, including factors such as data size, performance, and compatibility with existing data processing tools and systems.

Data Stores

Data stores are used to store, manage and retrieve data.

There are many types of data stores available, each with their own strengths and weaknesses. Some of the most popular data stores include the following.

Relational Databases

Relational databases store data in tables with columns and rows. They use Structured Query Language (SQL) to query and manipulate data.

Examples of relational databases include:

  • MySQL
  • PostgreSQL
  • Oracle Database
  • Microsoft SQL Server

Relational databases are popular for their ability to manage large volumes of structured data and their support for transaction processing.
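SQLite, which ships with Python's standard library, is a convenient way to try SQL locally; the table and data below are illustrative:

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# A typical analytical query: total spend per customer.
result = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(result)  # [('alice', 150.0), ('bob', 75.5)]
conn.close()
```

The same SQL (with minor dialect differences) runs against the larger servers listed above, so SQLite is a low-friction place to practice.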

NoSQL Databases

NoSQL databases are designed to handle large volumes of unstructured or semi-structured data.

They are not based on the traditional relational data model, and use a variety of data models such as key-value, document, column-family and graph.

Examples of NoSQL databases include:

  • MongoDB
  • Cassandra
  • Couchbase
  • Neo4j

NoSQL databases are popular for their ability to handle big data and their support for horizontal scaling.

Graph Databases

Graph databases are designed to store and manage data in the form of nodes and edges, which represent entities and relationships between entities.

They are used for applications that require deep relationships between data, such as social networks or recommendation engines.

Examples of graph databases include:

  • ArangoDB
  • Neo4j
  • Amazon Neptune
  • OrientDB

Graph databases are popular for their ability to manage complex relationships between data and their support for fast traversal of graph data.
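The node-and-edge model can be sketched with an adjacency list in plain Python; real graph databases add persistence, indexing, and a query language on top. The "follows" relationships below are invented:

```python
from collections import deque

# "follows" relationships as an adjacency list: node -> neighboring nodes.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph, start):
    """Breadth-first traversal: every node reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(reachable(follows, "alice")))  # ['alice', 'bob', 'carol', 'dave']
```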

Column-Family Databases

Column-family databases store data in column families, which are groups of related columns. They are designed to handle large amounts of structured and semi-structured data, such as log data or time-series data. Examples of column-family databases include:

  • Apache Cassandra
  • HBase
  • ScyllaDB
  • Amazon DynamoDB

Column-family databases are popular for their ability to handle large volumes of data and their support for fast read and write operations.

In-Memory Databases

In-memory databases store data in memory instead of on disk.

They are designed to handle large volumes of data with fast read and write operations.

Examples of in-memory databases include:

  • Redis
  • Memcached
  • VoltDB
  • Apache Ignite

In-memory databases are popular for their ability to handle high-throughput workloads and their support for real-time data processing.

File-based Databases

File-based databases store data in files on disk. They are designed to be simple and easy to use, with minimal setup and configuration required.

Examples of file-based databases include:

  • SQLite
  • Microsoft Access
  • Berkeley DB
  • LevelDB

File-based databases are popular for their ease of use and low cost of entry.

Choosing

Choosing the right data store for your application or system depends on many factors, including the type of data you are storing, the volume of data, the performance requirements, and the need for scalability and flexibility.

Message Queues

Message queues are a key component of modern distributed systems and are used to manage the flow of data between applications, services, and processes.

They provide a way to decouple different parts of a system and to ensure reliable delivery of messages.
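The decoupling idea can be sketched in-process with Python's standard library: the producer only knows about the queue, not about who consumes from it. This is a toy stand-in for a real broker such as RabbitMQ or Kafka, and the event names are invented:

```python
import queue

# The queue is the only thing the producer and consumer share.
messages = queue.Queue()

def producer(q):
    """Publish a few messages, then a sentinel to signal completion."""
    for event in ["order_created", "order_paid", "order_shipped"]:
        q.put(event)
    q.put(None)  # sentinel: no more messages

def consumer(q):
    """Drain messages until the sentinel arrives."""
    received = []
    while (event := q.get()) is not None:
        received.append(event)
    return received

producer(messages)
print(consumer(messages))  # ['order_created', 'order_paid', 'order_shipped']
```

A real broker plays the same role across processes and machines, adding durability, acknowledgements, and routing that an in-process queue cannot provide.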

Some popular message queue systems include RabbitMQ and Apache Kafka.

RabbitMQ

RabbitMQ is a widely-used open-source message broker that implements the Advanced Message Queuing Protocol (AMQP). It is written in Erlang and provides a scalable and reliable platform for distributing messages between applications and services. RabbitMQ supports a wide range of messaging patterns, including point-to-point, publish-subscribe, and request-reply.

Apache Kafka

Apache Kafka is a distributed streaming platform that is used for building real-time data pipelines and streaming applications. It provides a scalable and fault-tolerant platform for processing and storing streams of data in real-time. Kafka is designed to handle high-throughput and low-latency data streams and provides a wide range of tools and APIs for building real-time data pipelines.

Redis

Redis is an open-source in-memory data structure store that is often used as a message broker or as a database for caching and real-time applications. Redis supports a wide range of data structures, including strings, hashes, lists, and sets, and provides a number of features that make it well-suited for real-time applications, such as pub/sub messaging and transactions.

Other Message Queue Systems

There are many other message queue systems available, including Apache ActiveMQ, ZeroMQ, and Amazon Simple Queue Service (SQS).

The choice of message queue system depends on the specific needs and requirements of the application or system, including factors such as performance, scalability, reliability, and ease of use.

Processing Engines

Big data processing engines are designed to process and analyze large amounts of data in distributed environments.

These engines provide a scalable and fault-tolerant platform for processing data and can be used for a wide range of use cases, including batch processing, stream processing, and machine learning.

Apache Spark

Apache Spark is a widely-used big data processing engine that provides a unified analytics engine for large-scale data processing.

It supports a wide range of data sources and provides APIs for batch processing, stream processing, and machine learning.

Spark provides a scalable and fault-tolerant platform for processing data and can be used in a wide range of industries, including finance, healthcare, and e-commerce.

Apache Flink

Apache Flink is a distributed processing engine for batch processing, stream processing, and machine learning.

It provides a unified programming model for batch and stream processing, and supports a wide range of data sources and sinks.

Flink provides a scalable and fault-tolerant platform for processing data and can be used for a wide range of use cases, including fraud detection, predictive maintenance, and real-time analytics.

Apache Beam

Apache Beam is an open-source unified programming model for batch and stream processing.

It provides a set of SDKs for Java, Python, and Go that can be used to build batch and stream processing pipelines.

Beam provides a portable and extensible platform for processing data and can be used with a wide range of data sources and sinks, including Apache Kafka, Google Cloud Pub/Sub, and Amazon S3.

Other Big Data Processing Engines

Other big data processing engines are available, including Apache Hadoop, Apache Storm, and Apache Samza.

The choice of big data processing engine depends on the specific needs and requirements of the use case, including factors such as performance, scalability, reliability, and ease of use.

Chapter 7

Other

This chapter covers a few more things, not related to analytics directly.

About

Learn more about this site and the tools used.

Subsections of Other

About

We hope you enjoy this centralized information for data fundamentals and getting started with professional data analytics.

These powerful, user-friendly, industry-standard languages, tools, and techniques form core foundations of a productive environment.

We introduce and use them for many purposes, including

  • data analysis,
  • data science,
  • computer science,
  • application development,
  • digital forensics analysis,
  • streaming data and data in motion,
  • data lakes and data at rest,
  • continuous intelligence,
  • and more.

This Site

We use Hugo to generate this site. You can find the source code for Hugo at GitHub: gohugoio/hugo.

See the source code for the amazing Relearn theme at GitHub: McShelby/hugo-theme-relearn.

Host your site for free with GitHub Pages. For more information, see the GitHub Pages documentation.

This site was developed in collaboration with ChatGPT and other resources. For more information, please visit the OpenAI website.