Techniques
This chapter covers various techniques and skills used in data analytics, investigations, and development.
This page provides an overview of different techniques for cleaning data, a key step in the data analysis process. Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in data to ensure that it is accurate, complete, and usable for analysis.
Data cleaning is essential for ensuring the accuracy and reliability of data analysis results. Dirty data, or data that has not been properly cleaned, can lead to inaccurate conclusions, wasted time and resources, and missed opportunities for insights and discoveries. By cleaning data before analysis, data analysts can ensure that their findings are based on accurate and high-quality data.
There are many different techniques for data cleaning, depending on the type and quality of the data.
Common techniques include removing duplicate records, handling missing or inconsistent values, standardizing formats, and correcting data entry errors.
Data analysts may use a combination of these techniques to clean data and prepare it for analysis.
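As a rough sketch of how a few of these steps can look in pandas, assuming a small, hypothetical table with duplicate rows, inconsistent text formatting, and missing values:

import pandas as pd

# hypothetical raw data with duplicates, inconsistent casing, and missing values
raw = pd.DataFrame({
    "name": ["Ann", "ann ", "Bob", "Bob", None],
    "age": ["29", "29", None, "35", "41"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()        # standardize text formatting
clean = clean.drop_duplicates()                               # remove exact duplicate rows
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")   # coerce bad entries to NaN
clean["age"] = clean["age"].fillna(clean["age"].median())     # impute missing ages
print(clean)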
There are many tools and software applications available for data cleaning, ranging from basic spreadsheet programs to advanced data cleaning and analysis platforms. Some popular tools for data cleaning include:
Microsoft Excel and Google Sheets are widely-used spreadsheet programs with basic data cleaning and manipulation features.
Python and R are popular programming languages for data analysis and cleaning, with a range of libraries and packages available for data cleaning tasks.
Great Expectations is an open-source platform for data quality management. It provides a suite of tools and libraries for validating, profiling, and documenting data, and helps data analysts and engineers ensure their data is accurate, consistent, and reliable.
It allows users to define expectations about data and use those expectations to validate and clean the data. It supports a wide range of data sources, including CSV files, databases, and cloud storage, and provides a user-friendly interface for defining and managing data expectations.
Great Expectations integrates into existing data workflows, and is widely used in various industries, including finance, healthcare, and e-commerce. It is free and can be customized.
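As a rough illustration, the sketch below uses the older pandas-style Great Expectations API to declare a couple of expectations against a hypothetical CSV file; exact imports and the shape of the validation results vary by version, so treat this as an outline rather than a reference.

import great_expectations as ge

# hypothetical input file; ge.read_csv returns a pandas DataFrame
# augmented with expect_* methods (legacy pandas-style API)
df = ge.read_csv("customers.csv")

# declare expectations about the data
not_null = df.expect_column_values_to_not_be_null("customer_id")
in_range = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

print(not_null)   # each result reports whether the expectation was met
print(in_range)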
OpenRefine is a free and open-source data cleaning and transformation tool that allows users to explore, clean, and transform large datasets. It provides a user-friendly interface for performing a wide range of data cleaning tasks, such as removing duplicates, transforming data formats, and handling missing or inconsistent data. OpenRefine supports a variety of data sources, including CSV files, Excel spreadsheets, and JSON data, and provides powerful tools for filtering, faceting, and clustering data. With OpenRefine, data analysts can easily manipulate and reshape their data, and prepare it for further analysis or visualization.
OpenRefine is widely used in the data science and analytics community, and is a popular choice for cleaning and preparing data for machine learning and other advanced analytics tasks. OpenRefine is free and can be customized with plugins and extensions.
Data visualization is the process of representing data visually, using charts, graphs, and other graphical elements to help people understand and interpret data more effectively. Effective data visualization is critical for communicating complex data insights and findings, and can help businesses make more informed decisions based on data.
Data visualization is important because it allows people to understand complex data more quickly and effectively than with tables or raw data alone. By representing data visually, data analysts and business users can identify patterns, trends, and outliers more easily, and gain insights that may not be apparent with raw data alone. Data visualization is also an important tool for communicating data findings and insights to non-technical stakeholders, such as executives, investors, or customers.
There are many libraries and tools available for creating data visualizations in a variety of programming languages.
Some popular data visualization libraries are described below.
Matplotlib is a popular data visualization library for Python that provides a wide range of 2D and 3D plotting capabilities. It is a flexible and versatile library that can be used for creating a variety of charts and graphs, including line charts, bar charts, scatter plots, and histograms.
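For example, a minimal Matplotlib script for a simple line chart might look like this (the data is made up for illustration):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]          # hypothetical values

plt.plot(months, revenue, marker="o")   # line chart with point markers
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (k$)")
plt.show()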
Seaborn is another popular data visualization library for Python that is built on top of Matplotlib. It provides a high-level interface for creating statistical graphics, such as heatmaps, violin plots, and box plots, and makes it easy to create complex visualizations with just a few lines of code.
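A short Seaborn sketch, using the library's built-in tips example dataset (downloaded on first use), might look like this:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                   # bundled example dataset
sns.boxplot(data=tips, x="day", y="total_bill")   # distribution of bills per day
plt.show()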
Plotly is a web-based data visualization platform that allows users to create interactive charts and graphs in a variety of programming languages, including Python, R, and JavaScript. Plotly provides a wide range of chart types and customization options, and allows users to create and share interactive dashboards and reports.
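As a small example, Plotly Express can build an interactive scatter plot from its bundled iris sample data in a few lines:

import plotly.express as px

df = px.data.iris()     # bundled sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()              # opens an interactive chart in the browser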
Visualizations in dashboards and Jupyter notebooks are often web-based, so familiarity with native web tools can be helpful for analysts.
D3.js is a JavaScript library for creating dynamic, interactive data visualizations on the web. D3.js provides a low-level interface for creating custom visualizations, allowing users to control every aspect of their visualizations. This flexibility comes with a steeper learning curve, but also allows for greater control and customization options.
Tableau is a powerful data visualization tool that provides a drag-and-drop interface for creating a wide range of visualizations, including maps, charts, and dashboards. Tableau is known for its ease of use and accessibility, and is a popular choice for data analysts and business users who need to create visualizations quickly and efficiently.
Tableau offers a range of pricing plans, including a free trial, and also provides a robust community of users and resources.
Power BI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards. Power BI allows users to connect to a wide range of data sources, including Excel spreadsheets, cloud-based and on-premises data sources, and more.
Power BI offers a range of pricing plans, including a free trial, and provides integration with other Microsoft tools and services.
This page provides an overview of different techniques and skill levels related to Git, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following Git skills and techniques may be considered basic level.
Creating and cloning repositories: Know how to create new repositories with Git and how to clone existing repositories to your local machine so that you can work on them.
Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.
Pushing and pulling changes: Know how to push your local changes to your Git repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes (see the example commands after this list).
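As a rough illustration of that basic cycle on the command line (the repository URL and file name below are placeholders):

git clone https://github.com/example/project.git   # copy an existing repository to your machine
cd project
git add analysis.py                                 # stage a new or modified file
git commit -m "Add initial analysis script"         # save the change to the repository history
git push origin main                                # share local commits with the remote
git pull origin main                                # fetch and merge the latest remote changes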
The following Git skills and techniques may be considered intermediate level.
Resolving merge conflicts: Learn how to handle conflicts that arise when merging branches or changes from multiple contributors.
Creating and managing branches: Know how to create and switch between branches, and how to merge changes between branches.
Using Git tags: Learn how to use Git tags to mark important points in your repository’s history, such as release versions or major milestones.
Reverting and resetting changes: Know how to revert changes to a previous commit, or reset your repository to a specific point in its history.
Understanding Git workflows: Gain a deeper understanding of Git workflows, such as Gitflow or GitHub flow, to better manage changes and collaboration in your projects.
Working with remote repositories: Know how to add and remove remote repositories, and how to push and pull changes between them and your local repository.
The following Git skills and techniques may be considered advanced level.
Rebasing: Know how to rebase a branch to update it with changes from another branch, while maintaining a clean history.
Cherry-picking: Know how to apply specific changes from one branch to another, without merging the entire branch.
Squashing commits: Know how to combine multiple commits into a single commit, to create a more coherent commit history.
Stashing: Know how to temporarily save changes that are not yet ready to be committed, so that you can work on other tasks without losing your progress.
Git hooks: Know how to create custom scripts that are automatically run by Git at specific times, such as before a commit or push.
Git submodules: Know how to use Git submodules to include one repository as a subdirectory of another repository, to better manage complex projects.
Git bisect: Know how to use Git bisect to find the commit that introduced a bug, by systematically testing different commits until the bug is found (see the example commands after this list).
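A few of these advanced techniques look like this on the command line; the branch name, tag, and commit hash below are placeholders:

git rebase main                 # replay the current branch on top of main for a clean history
git cherry-pick <commit-hash>   # apply one specific commit from another branch
git stash                       # shelve uncommitted work (restore later with: git stash pop)
git bisect start                # begin a binary search for the commit that introduced a bug
git bisect bad                  # mark the current commit as broken
git bisect good v1.0            # mark a known-good tag, then test each commit Git checks out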
This page provides an overview of different techniques and skill levels related to GitHub, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following GitHub skills and techniques may be considered basic level.
[ ] Creating and cloning repositories: Know how to create new repositories on GitHub and how to clone existing repositories to your local machine so that you can work on them.
[ ] Adding and committing changes: Know how to add changes to your local repository and commit them so that they are saved to your repository’s history.
[ ] Pushing and pulling changes: Know how to push your local changes to your GitHub repository so that others can see them, and how to pull changes from the remote repository to your local machine to get the latest changes.
The following GitHub skills and techniques may be considered intermediate level.
[ ] Working with branches: Know how to create and switch between branches, and how to merge changes between branches.
[ ] Using issues and pull requests: Know how to create and manage issues and pull requests, which are useful for tracking tasks, discussing changes, and requesting code reviews.
[ ] Collaboration: Know how to work collaboratively with others on a project using Git, including resolving merge conflicts and managing team workflows.
[ ] Rebasing: Know how to use the git rebase command to reapply changes from one branch onto another and resolve conflicts.
The following GitHub skills and techniques may be considered advanced level.
[ ] Git hooks: Know how to write and use Git hooks to automate tasks and enforce standards.
[ ] Git workflows: Know how to use Git workflows like GitFlow or GitHub Flow to manage complex projects and team collaboration.
[ ] Advanced Git commands: Be familiar with advanced Git commands like git cherry-pick, git bisect, and git stash.
[ ] Git submodules: Know how to use Git submodules to include and manage external dependencies in your projects.
[ ] Git LFS: Know how to use Git LFS (Large File Storage) to manage large binary files in your repositories.
[ ] CI/CD: Know how to integrate Git with Continuous Integration/Continuous Deployment (CI/CD) tools to automate testing, building, and deployment of your projects.
Machine learning is a branch of artificial intelligence that involves the use of algorithms and statistical models to enable computer systems to improve their performance on a specific task over time. Machine learning is used in a wide range of applications, from natural language processing and computer vision to recommendation systems and fraud detection.
Machine learning is important because it allows computer systems to learn from data and improve their performance on a specific task without being explicitly programmed to do so. This enables systems to adapt and improve over time, and to make more accurate predictions or decisions based on data. Machine learning is also a powerful tool for automating complex tasks and processes, such as image recognition or natural language processing, and can help businesses make more informed decisions based on data.
There are many libraries and tools available for machine learning in a variety of programming languages. Some of the most popular machine learning libraries include:
Scikit-Learn is a popular machine learning library for Python that provides a range of tools and algorithms for data mining, analysis, and modeling. It includes tools for classification, regression, clustering, and dimensionality reduction, and supports a wide range of data types and formats.
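For instance, a minimal classification sketch with Scikit-Learn's bundled iris dataset might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # bundled example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=200)               # simple baseline classifier
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # accuracy on held-out data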
TensorFlow is an open-source machine learning library developed by Google that provides a range of tools and algorithms for building and training neural networks. It supports a wide range of platforms and devices, and includes tools for distributed computing, model optimization, and deployment.
Keras is a high-level machine learning library for Python that provides a user-friendly interface for building and training neural networks. It includes a wide range of pre-built models and layers, and supports both CPU and GPU acceleration for faster training and inference.
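As a sketch, a tiny Keras model defined through TensorFlow might look like the following; the input shape, layer sizes, and class count are arbitrary placeholders, and real use would also call model.fit on training data.

import tensorflow as tf

# a small fully connected network for a hypothetical 10-class problem
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # training would follow with model.fit(features, labels)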
In addition to these popular machine learning libraries, there are many other tools and platforms available for machine learning, including PyTorch, Caffe, and Microsoft Azure Machine Learning. The choice of tool or library will depend on the specific needs and requirements of the machine learning project, as well as the programming language and skill set of the data analyst or team.
Microservices are a software architecture pattern that breaks down a large monolithic application into smaller, independently deployable services that communicate with each other using APIs.
As a data analyst, understanding the concept of microservices can be useful when working with data-driven applications. Microservices make it easier to build, deploy, and maintain data-driven applications by isolating parts of the application into smaller, manageable services.
In a microservices architecture, each service has its own codebase, data storage, and dependencies. This makes it easier to update and deploy individual services without affecting the rest of the application. It allows flexibility in choosing different technologies for different services, depending on their requirements.
Microservices can be particularly useful for real-time processing and analysis of large volumes of data. By breaking down an application into smaller services, developers can optimize each service for its specific task, allowing more efficient processing and analysis.
Working with microservices requires additional skills and knowledge, including understanding APIs, containerization, and service discovery.
A solid foundation in programming and software development is required to work effectively with microservices-based applications.
Create a new Flask application.
python3 -m venv env
source env/bin/activate
pip install flask
pip install textblob
Create a new route in app.py.
from flask import Flask, request
from textblob import TextBlob

app = Flask(__name__)

@app.route('/sentiment', methods=['POST'])
def sentiment():
    text = request.json['text']
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    return {'polarity': polarity, 'subjectivity': subjectivity}

if __name__ == '__main__':
    app.run(debug=True)
Run the app with the following command.
python app.py
Test the API with curl (or Postman).
curl --header "Content-Type: application/json" --request POST --data '{"text":"This is a positive sentence."}' http://localhost:5000/sentiment
Alternatively, we could create a simple function and host it on Amazon Web Services (AWS) Lambda. AWS offers a free tier that allows up to one million requests per month.
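A rough sketch of the equivalent Lambda handler is shown below; the module and field names are placeholders, and TextBlob would have to be packaged with the deployment (for example in a Lambda layer) since it is not available in the Lambda runtime by default.

from textblob import TextBlob   # must be bundled with the deployment package

def lambda_handler(event, context):
    # assumes the event carries a "text" field, e.g. {"text": "This is a positive sentence."}
    text = event.get("text", "")
    sentiment = TextBlob(text).sentiment
    return {"polarity": sentiment.polarity, "subjectivity": sentiment.subjectivity}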
This page provides an overview of different techniques and skill levels related to Python, including basic, intermediate, and advanced techniques.
These are the basic skills, helpful even for beginning courses and activities.
These would be considered intermediate skills, applied in higher-level courses and activities.
These are advanced skills, useful for more experienced users and advanced projects.
The following Python skills and techniques may be considered basic level in the context of data analysis.
Lists: Know how to create and manipulate lists, and use them to store and organize data.
Dictionaries: Know how to create and manipulate dictionaries, and use them to store and organize data in key-value pairs.
Conditional Statements: Know how to use if-else statements to conditionally execute code.
Loops: Know how to use for and while loops to iterate over data.
Defining Functions: Know how to define functions to organize and reuse code.
Lambda Functions: Know how to define and use lambda functions for short and simple functions.
NumPy: Know how to use NumPy to perform numerical operations and calculations.
pandas: Know how to use pandas to work with structured data and perform data analysis tasks.
Matplotlib: Know how to use Matplotlib to create basic plots and visualizations.
These skills and the associated techniques provide a strong foundation for data analysis in Python, and can be built upon with more advanced topics and libraries as needed.
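The short sketch below touches several of these basics at once: a list, a dictionary, a loop with a conditional, a named function, and a lambda function.

scores = [72, 88, 95, 61]        # list of values
cutoffs = {"pass": 70}           # dictionary with a key-value pair

def passed(score, cutoff):
    # return True when a score meets the cutoff
    return score >= cutoff

for s in scores:                 # loop with a conditional
    if passed(s, cutoffs["pass"]):
        print(s, "pass")
    else:
        print(s, "fail")

squared = list(map(lambda s: s ** 2, scores))   # lambda function
print(squared)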
This page provides an overview of intermediate skills for working with Python in the context of data analysis.
NumPy: Know how to work with arrays, manipulate data, and perform mathematical operations.
pandas: Know how to work with data frames and manipulate data for exploratory data analysis.
Matplotlib: Know how to create customized visualizations for data analysis.
Merging and joining data frames: Know how to combine data from multiple sources.
Handling missing data: Know how to identify missing data and impute it using various methods.
Data normalization and scaling: Know how to standardize data and scale it to compare across different variables.
Descriptive statistics: Know how to calculate basic summary statistics like mean, median, and standard deviation.
Inferential statistics: Know how to perform hypothesis testing and confidence intervals.
Regression analysis: Know how to perform linear regression and interpret regression coefficients.
Version control with Git: Know how to use Git for version control and collaborate with others on code.
Unit testing and debugging: Know how to write and run unit tests and debug code.
Code organization and project structure: Know how to structure a Python project for scalability and reproducibility.
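A brief sketch combining a few of the intermediate tasks above, using two small hypothetical data frames: merging them, imputing a missing value, scaling a column, and printing summary statistics.

import pandas as pd

# hypothetical tables to merge, with a missing value to impute
sales = pd.DataFrame({"store": ["A", "B", "C"], "revenue": [100.0, None, 80.0]})
regions = pd.DataFrame({"store": ["A", "B", "C"], "region": ["North", "South", "North"]})

df = sales.merge(regions, on="store", how="left")             # join the two frames
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())    # impute missing revenue
df["revenue_z"] = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()   # scale
print(df.describe())                                          # descriptive statistics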
Employing important new features such as type hints shows a deeper understanding of Python and a commitment to writing clean, maintainable, and efficient code.
By using type hints, developers improve the documentation of their code, catch errors more easily, and help other developers understand how to use their code.
With the increasing adoption of type hints in the Python community, it is becoming an essential intermediate to advanced skill for those working on larger projects or collaborating with other developers.
def add_numbers(x: int, y: int) -> int:
    return x + y
The type hints are specified using the : syntax, where x: int means that x is of type int. The -> int syntax after the function arguments specifies the return type of the function as int.
Type hints are not enforced by the Python interpreter, but are used by static analysis tools and linters to catch type-related errors early in the development process.
These skills are considered advanced and will be useful for more advanced data analysis tasks. Examples include decorators such as @property, @staticmethod, and @classmethod (see the sketch below).
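The sketch below shows the three decorators in a small, hypothetical class:

class Dataset:
    def __init__(self, values):
        self._values = list(values)

    @property
    def size(self):
        # computed attribute, accessed without parentheses
        return len(self._values)

    @staticmethod
    def is_numeric(value):
        # utility that needs no access to the instance
        return isinstance(value, (int, float))

    @classmethod
    def from_csv_row(cls, row):
        # alternative constructor
        return cls(row.split(","))

d = Dataset.from_csv_row("1,2,3")
print(d.size, Dataset.is_numeric(3.5))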
Books remain a surprisingly cost-effective investment.
When you’re ready to truly master this powerful language, consider investing in a top-rated book like “Fluent Python” by Luciano Ramalho. The second edition is current; it was published in March 2022 and covers features up to Python 3.10.
Another option is High Performance Python: Practical Performant Programming for Humans by Micha Gorelick and Ian Ozsvald, which covers high-performance options for processing big data, multiprocessing, and more.