Techniques
This chapter covers various techniques and skills used in data analytics, investigations, and development.
This page provides an overview of different techniques for cleaning data, a key step in the data analysis process. Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in data to ensure that it is accurate, complete, and usable for analysis.
Data cleaning is essential for ensuring the accuracy and reliability of data analysis results. Dirty data, or data that has not been properly cleaned, can lead to inaccurate conclusions, wasted time and resources, and missed opportunities for insights and discoveries. By cleaning data before analysis, data analysts can ensure that their findings are based on accurate and high-quality data.
There are many different techniques for data cleaning, depending on the type and quality of the data.
Common techniques include removing duplicate records, handling missing or inconsistent values, standardizing formats, and correcting data entry errors.
Data analysts may use a combination of these techniques to clean data and prepare it for analysis.
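As a rough sketch of how a few of these steps can look in pandas, assuming a small, hypothetical table with duplicate rows, inconsistent text formatting, and missing values:

import pandas as pd

# hypothetical raw data with duplicates, inconsistent casing, and missing values
raw = pd.DataFrame({
    "name": ["Ann", "ann ", "Bob", "Bob", None],
    "age": ["29", "29", None, "35", "41"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()        # standardize text formatting
clean = clean.drop_duplicates()                               # remove exact duplicate rows
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")   # coerce bad entries to NaN
clean["age"] = clean["age"].fillna(clean["age"].median())     # impute missing ages
print(clean)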
There are many tools and software applications available for data cleaning, ranging from basic spreadsheet programs to advanced data cleaning and analysis platforms. Some popular tools for data cleaning include:
Microsoft Excel and Google Sheets are widely-used spreadsheet programs with basic data cleaning and manipulation features.
Python and R are popular programming languages for data analysis and cleaning, with a range of libraries and packages available for data cleaning tasks.
Great Expectations is an open-source platform for data quality management. It provides a suite of tools and libraries for validating, profiling, and documenting data, and helps data analysts and engineers ensure their data is accurate, consistent, and reliable.
It allows users to define expectations about data and use those expectations to validate and clean the data. It supports a wide range of data sources, including CSV files, databases, and cloud storage, and provides a user-friendly interface for defining and managing data expectations.
Great Expectations integrates into existing data workflows, and is widely used in various industries, including finance, healthcare, and e-commerce. It is free and can be customized.
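As a rough illustration, the sketch below uses the older pandas-style Great Expectations API to declare a couple of expectations against a hypothetical CSV file; exact imports and the shape of the validation results vary by version, so treat this as an outline rather than a reference.

import great_expectations as ge

# hypothetical input file; ge.read_csv returns a pandas DataFrame
# augmented with expect_* methods (legacy pandas-style API)
df = ge.read_csv("customers.csv")

# declare expectations about the data
not_null = df.expect_column_values_to_not_be_null("customer_id")
in_range = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

print(not_null)   # each result reports whether the expectation was met
print(in_range)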
OpenRefine is a free and open-source data cleaning and transformation tool that allows users to explore, clean, and transform large datasets. It provides a user-friendly interface for performing a wide range of data cleaning tasks, such as removing duplicates, transforming data formats, and handling missing or inconsistent data. OpenRefine supports a variety of data sources, including CSV files, Excel spreadsheets, and JSON data, and provides powerful tools for filtering, faceting, and clustering data. With OpenRefine, data analysts can easily manipulate and reshape their data, and prepare it for further analysis or visualization.
OpenRefine is widely used in the data science and analytics community, and is a popular choice for cleaning and preparing data for machine learning and other advanced analytics tasks. OpenRefine is free and can be customized with plugins and extensions.
Data visualization is the process of representing data visually, using charts, graphs, and other graphical elements to help people understand and interpret data more effectively. Effective data visualization is critical for communicating complex data insights and findings, and can help businesses make more informed decisions based on data.
Data visualization is important because it allows people to understand complex data more quickly and effectively than with tables or raw data alone. By representing data visually, data analysts and business users can identify patterns, trends, and outliers more easily, and gain insights that may not be apparent with raw data alone. Data visualization is also an important tool for communicating data findings and insights to non-technical stakeholders, such as executives, investors, or customers.
There are many libraries and tools available for creating data visualizations in a variety of programming languages.
Some popular data visualization libraries are described below.
Matplotlib is a popular data visualization library for Python that provides a wide range of 2D and 3D plotting capabilities. It is a flexible and versatile library that can be used for creating a variety of charts and graphs, including line charts, bar charts, scatter plots, and histograms.
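For example, a minimal Matplotlib script for a simple line chart might look like this (the data is made up for illustration):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]          # hypothetical values

plt.plot(months, revenue, marker="o")   # line chart with point markers
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (k$)")
plt.show()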
Seaborn is another popular data visualization library for Python that is built on top of Matplotlib. It provides a high-level interface for creating statistical graphics, such as heatmaps, violin plots, and box plots, and makes it easy to create complex visualizations with just a few lines of code.
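A short Seaborn sketch, using the library's built-in tips example dataset (downloaded on first use), might look like this:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                   # bundled example dataset
sns.boxplot(data=tips, x="day", y="total_bill")   # distribution of bills per day
plt.show()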
Plotly is a web-based data visualization platform that allows users to create interactive charts and graphs in a variety of programming languages, including Python, R, and JavaScript. Plotly provides a wide range of chart types and customization options, and allows users to create and share interactive dashboards and reports.
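As a small example, Plotly Express can build an interactive scatter plot from its bundled iris sample data in a few lines:

import plotly.express as px

df = px.data.iris()     # bundled sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()              # opens an interactive chart in the browser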
Visualizations in dashboards and Jupyter notebooks are often web-based, so familiarity with native web tools can be helpful for analysts.
D3.js is a JavaScript library for creating dynamic, interactive data visualizations on the web. D3.js provides a low-level interface for creating custom visualizations, allowing users to control every aspect of their visualizations. This flexibility comes with a steeper learning curve, but also allows for greater control and customization options.
Tableau is a powerful data visualization tool that provides a drag-and-drop interface for creating a wide range of visualizations, including maps, charts, and dashboards. Tableau is known for its ease of use and accessibility, and is a popular choice for data analysts and business users who need to create visualizations quickly and efficiently.
Tableau offers a range of pricing plans, including a free trial, and also provides a robust community of users and resources.
Power BI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards. Power BI allows users to connect to a wide range of data sources, including Excel spreadsheets, cloud-based and on-premises data sources, and more.
Power BI offers a range of pricing plans, including a free trial, and provides integration with other Microsoft tools and services.
This page provides an overview of different techniques and skill levels related to Git, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following Git skills and techniques may be considered basic level.
Creating and cloning repositories: Know how to create new repositories with Git and how to clone existing repositories to your local machine so that you can work on them.
Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.
Pushing and pulling changes: Know how to push your local changes to your Git repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes (see the example commands after this list).
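As a rough illustration of that basic cycle on the command line (the repository URL and file name below are placeholders):

git clone https://github.com/example/project.git   # copy an existing repository to your machine
cd project
git add analysis.py                                 # stage a new or modified file
git commit -m "Add initial analysis script"         # save the change to the repository history
git push origin main                                # share local commits with the remote
git pull origin main                                # fetch and merge the latest remote changes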
The following Git skills and techniques may be considered intermediate level.
Resolving merge conflicts: Learn how to handle conflicts that arise when merging branches or changes from multiple contributors.
Creating and managing branches: Know how to create and switch between branches, and how to merge changes between branches.
Using Git tags: Learn how to use Git tags to mark important points in your repository’s history, such as release versions or major milestones.
Reverting and resetting changes: Know how to revert changes to a previous commit, or reset your repository to a specific point in its history.
Understanding Git workflows: Gain a deeper understanding of Git workflows, such as Gitflow or GitHub flow, to better manage changes and collaboration in your projects.
Working with remote repositories: Know how to add and remove remote repositories, and how to push and pull changes between them and your local repository.
The following Git skills and techniques may be considered advanced level.
Rebasing: Know how to rebase a branch to update it with changes from another branch, while maintaining a clean history.
Cherry-picking: Know how to apply specific changes from one branch to another, without merging the entire branch.
Squashing commits: Know how to combine multiple commits into a single commit, to create a more coherent commit history.
Stashing: Know how to temporarily save changes that are not yet ready to be committed, so that you can work on other tasks without losing your progress.
Git hooks: Know how to create custom scripts that are automatically run by Git at specific times, such as before a commit or push.
Git submodules: Know how to use Git submodules to include one repository as a subdirectory of another repository, to better manage complex projects.
Git bisect: Know how to use Git bisect to find the commit that introduced a bug, by systematically testing different commits until the bug is found (see the example commands after this list).
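A few of these advanced techniques look like this on the command line; the branch name, tag, and commit hash below are placeholders:

git rebase main                 # replay the current branch on top of main for a clean history
git cherry-pick <commit-hash>   # apply one specific commit from another branch
git stash                       # shelve uncommitted work (restore later with: git stash pop)
git bisect start                # begin a binary search for the commit that introduced a bug
git bisect bad                  # mark the current commit as broken
git bisect good v1.0            # mark a known-good tag, then test each commit Git checks out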
This page provides an overview of different techniques and skill levels related to GitHub, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following GitHub skills and techniques may be considered basic level.
[ ] Creating and cloning repositories: Know how to create new repositories on GitHub and how to clone existing repositories to your local machine so that you can work on them.
[ ] Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.
[ ] Pushing and pulling changes: Know how to push your local changes to your GitHub repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes.
The following GitHub skills and techniques may be considered intermediate level.
[ ] Working with branches: Know how to create and switch between branches, and how to merge changes between branches.
[ ] Using issues and pull requests: Know how to create and manage issues and pull requests, which are useful for tracking tasks, discussing changes, and requesting code reviews.
[ ] Collaboration: Know how to work collaboratively with others on a project using Git, including resolving merge conflicts and managing team workflows.
[ ] Rebasing: Know how to use the git rebase command to reapply changes from one branch onto another and resolve conflicts.
The following GitHub skills and techniques may be considered advanced level.
[ ] Git hooks: Know how to write and use Git hooks to automate tasks and enforce standards.
[ ] Git workflows: Know how to use Git workflows like GitFlow or GitHub Flow to manage complex projects and team collaboration.
[ ] Advanced Git commands: Be familiar with advanced Git commands like git cherry-pick, git bisect, and git stash.
[ ] Git submodules: Know how to use Git submodules to include and manage external dependencies in your projects.
[ ] Git LFS: Know how to use Git LFS (Large File Storage) to manage large binary files in your repositories.
[ ] CI/CD: Know how to integrate Git with Continuous Integration/Continuous Deployment (CI/CD) tools to automate testing, building, and deployment of your projects.
Machine learning is a branch of artificial intelligence that involves the use of algorithms and statistical models to enable computer systems to improve their performance on a specific task over time. Machine learning is used in a wide range of applications, from natural language processing and computer vision to recommendation systems and fraud detection.
Machine learning is important because it allows computer systems to learn from data and improve their performance on a specific task without being explicitly programmed to do so. This enables systems to adapt and improve over time, and to make more accurate predictions or decisions based on data. Machine learning is also a powerful tool for automating complex tasks and processes, such as image recognition or natural language processing, and can help businesses make more informed decisions based on data.
There are many libraries and tools available for machine learning in a variety of programming languages. Some of the most popular machine learning libraries include:
Scikit-Learn is a popular machine learning library for Python that provides a range of tools and algorithms for data mining, analysis, and modeling. It includes tools for classification, regression, clustering, and dimensionality reduction, and supports a wide range of data types and formats.
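For instance, a minimal classification sketch with Scikit-Learn's bundled iris dataset might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # bundled example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=200)               # simple baseline classifier
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # accuracy on held-out data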
TensorFlow is an open-source machine learning library developed by Google that provides a range of tools and algorithms for building and training neural networks. It supports a wide range of platforms and devices, and includes tools for distributed computing, model optimization, and deployment.
Keras is a high-level machine learning library for Python that provides a user-friendly interface for building and training neural networks. It includes a wide range of pre-built models and layers, and supports both CPU and GPU acceleration for faster training and inference.
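As a sketch, a tiny Keras model defined through TensorFlow might look like the following; the input shape, layer sizes, and class count are arbitrary placeholders, and real use would also call model.fit on training data.

import tensorflow as tf

# a small fully connected network for a hypothetical 10-class problem
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # training would follow with model.fit(features, labels)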
In addition to these popular machine learning libraries, there are many other tools and platforms available for machine learning, including PyTorch, Caffe, and Microsoft Azure Machine Learning. The choice of tool or library will depend on the specific needs and requirements of the machine learning project, as well as the programming language and skill set of the data analyst or team.
Microservices are a software architecture pattern that breaks down a large monolithic application into smaller, independently deployable services that communicate with each other using APIs.
As a data analyst, understanding the concept of microservices can be useful when working with data-driven applications. Microservices make it easier to build, deploy, and maintain data-driven applications by isolating parts of the application into smaller, manageable services.
In a microservices architecture, each service has its own codebase, data storage, and dependencies. This makes it easier to update and deploy individual services without affecting the rest of the application. It allows flexibility in choosing different technologies for different services, depending on their requirements.
Microservices can be particularly useful for real-time processing and analysis of large volumes of data. By breaking down an application into smaller services, developers can optimize each service for its specific task, allowing more efficient processing and analysis.
Working with microservices requires additional skills and knowledge, including understanding APIs, containerization, and service discovery.
A solid foundation in programming and software development is required to work effectively with microservices-based applications.
Create a new Flask application.
python3 -m venv env
source env/bin/activate
pip install flask
pip install textblob
Create a new route in app.py.
from flask import Flask, request
from textblob import TextBlob

app = Flask(__name__)

@app.route('/sentiment', methods=['POST'])
def sentiment():
    text = request.json['text']
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    return {'polarity': polarity, 'subjectivity': subjectivity}

if __name__ == '__main__':
    app.run(debug=True)
Run the app with the following command.
python app.py
Test the API with curl (or Postman).
curl --header "Content-Type: application/json" --request POST --data '{"text":"This is a positive sentence."}' http://localhost:5000/sentiment
Alternatively, we could create a simple function and host it on Amazon Web Services (AWS) Lambda. AWS offers a free tier that allows up to one million requests per month.
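A rough sketch of the equivalent Lambda handler is shown below; the module and field names are placeholders, and TextBlob would have to be packaged with the deployment (for example in a Lambda layer) since it is not available in the Lambda runtime by default.

from textblob import TextBlob   # must be bundled with the deployment package

def lambda_handler(event, context):
    # assumes the event carries a "text" field, e.g. {"text": "This is a positive sentence."}
    text = event.get("text", "")
    sentiment = TextBlob(text).sentiment
    return {"polarity": sentiment.polarity, "subjectivity": sentiment.subjectivity}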
This page provides an overview of different techniques and skill levels related to Python, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following Python skills and techniques may be considered basic level in the context of data analysis.
Lists: Know how to create and manipulate lists, and use them to store and organize data.
Dictionaries: Know how to create and manipulate dictionaries, and use them to store and organize data in key-value pairs.
Conditional Statements: Know how to use if-else statements to conditionally execute code.
Loops: Know how to use for and while loops to iterate over data.
Defining Functions: Know how to define functions to organize and reuse code.
Lambda Functions: Know how to define and use lambda functions for short and simple functions.
NumPy: Know how to use NumPy to perform numerical operations and calculations.
pandas: Know how to use pandas to work with structured data and perform data analysis tasks.
Matplotlib: Know how to use Matplotlib to create basic plots and visualizations.
These skills and the associated techniques provide a strong foundation for data analysis in Python, and can be built upon with more advanced topics and libraries as needed.
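The short sketch below touches several of these basics at once: a list, a dictionary, a loop with a conditional, a named function, and a lambda function.

scores = [72, 88, 95, 61]        # list of values
cutoffs = {"pass": 70}           # dictionary with a key-value pair

def passed(score, cutoff):
    # return True when a score meets the cutoff
    return score >= cutoff

for s in scores:                 # loop with a conditional
    if passed(s, cutoffs["pass"]):
        print(s, "pass")
    else:
        print(s, "fail")

squared = list(map(lambda s: s ** 2, scores))   # lambda function
print(squared)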
This page provides an overview of intermediate skills for working with Python in the context of data analysis.
NumPy: Know how to work with arrays, manipulate data, and perform mathematical operations.
pandas: Know how to work with data frames and manipulate data for exploratory data analysis.
Matplotlib: Know how to create customized visualizations for data analysis.
Merging and joining data frames: Know how to combine data from multiple sources.
Handling missing data: Know how to identify missing data and impute it using various methods.
Data normalization and scaling: Know how to standardize data and scale it to compare across different variables.
Descriptive statistics: Know how to calculate basic summary statistics like mean, median, and standard deviation.
Inferential statistics: Know how to perform hypothesis testing and confidence intervals.
Regression analysis: Know how to perform linear regression and interpret regression coefficients.
Version control with Git: Know how to use Git for version control and collaborate with others on code.
Unit testing and debugging: Know how to write and run unit tests and debug code.
Code organization and project structure: Know how to structure a Python project for scalability and reproducibility.
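A brief sketch combining a few of the intermediate tasks above, using two small hypothetical data frames: merging them, imputing a missing value, scaling a column, and printing summary statistics.

import pandas as pd

# hypothetical tables to merge, with a missing value to impute
sales = pd.DataFrame({"store": ["A", "B", "C"], "revenue": [100.0, None, 80.0]})
regions = pd.DataFrame({"store": ["A", "B", "C"], "region": ["North", "South", "North"]})

df = sales.merge(regions, on="store", how="left")             # join the two frames
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())    # impute missing revenue
df["revenue_z"] = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()   # scale
print(df.describe())                                          # descriptive statistics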
Employing important new features such as type hints shows a deeper understanding of Python and a commitment to writing clean, maintainable, and efficient code.
By using type hints, developers improve the documentation of their code, catch errors more easily, and help other developers understand how to use their code.
With the increasing adoption of type hints in the Python community, it is becoming an essential intermediate to advanced skill for those working on larger projects or collaborating with other developers.
def add_numbers(x: int, y: int) -> int:
    return x + y
The type hints are specified using the : syntax, where x: int means that x is of type int. The -> int syntax after the function arguments specifies the return type of the function as int.
Type hints are not enforced by the Python interpreter, but are used by static analysis tools and linters to catch type-related errors early in the development process.
These skills are considered advanced and will be useful for more advanced data analysis tasks. Examples include decorators such as @property, @staticmethod, and @classmethod (see the sketch below).
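The sketch below shows the three decorators in a small, hypothetical class:

class Dataset:
    def __init__(self, values):
        self._values = list(values)

    @property
    def size(self):
        # computed attribute, accessed without parentheses
        return len(self._values)

    @staticmethod
    def is_numeric(value):
        # utility that needs no access to the instance
        return isinstance(value, (int, float))

    @classmethod
    def from_csv_row(cls, row):
        # alternative constructor
        return cls(row.split(","))

d = Dataset.from_csv_row("1,2,3")
print(d.size, Dataset.is_numeric(3.5))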
Books remain a surprisingly cost-effective investment.
When you’re ready to truly master this powerful language, consider investing in a top-rated book like “Fluent Python” by Luciano Ramalho. The second edition is current; it was published in March 2022 and covers features up to Python 3.10.
Another option is High Performance Python: Practical Performant Programming for Humans by Micha Gorelick and Ian Ozsvald, which covers high-performance options for processing big data, multiprocessing, and more.