Data

Data is an incredibly broad topic but it can be broken down into many subsections, including (in no particular order):

data processing / wrangling
machine learning
data analysis
visualization
geospatial mapping
persistence via relational databases and NoSQL data stores
object-relational mappers
natural language processing (NLP)
indexing, search and retrieval

The Python community has built and continues to create open source libraries and tutorials for all of the above topics.

Why is Python a great language choice for data tasks?

Python has a wide array of open source code libraries available and a diverse community of people with different backgrounds who contribute to make those libraries better each day.

In addition, Python data manipulation code can be combined with web frameworks and web APIs to build software that would be difficult to create in any other single language. For example, Ruby is a fantastic language for building web applications but its data analysis and visualization libraries are very limited compared to what is currently available in the Python ecosystem.
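To make the combination of data code and web code concrete, here is a minimal sketch that uses only the Python standard library. The function names `summarize` and `app` and the sample numbers are made up for illustration. A real project would more likely pair a library such as pandas with a web framework.

```python
import json
from statistics import mean, median


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
    }


def app(environ, start_response):
    """WSGI application that returns the summary statistics as JSON."""
    # Hypothetical sample data; a real application would load it from
    # a file, a database or an external web API.
    body = json.dumps(summarize([3, 5, 8, 13])).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

Because `app` follows the WSGI interface, it can be served locally with `wsgiref.simple_server.make_server` from the standard library or by any other WSGI server.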

How did Python become so widely used for working with data?

Python is a general purpose programming language and can be applied to many problem areas. Over the past couple of decades, Python has become increasingly popular in the scientific and financial communities. Projects such as pandas grew out of a hedge fund, while NumPy and SciPy were created in academic environments and then improved by the broader open source community.

The question is: why was Python used to create these projects? The answer is a mix of luck, the growth of the open source community as Python was maturing, and wide adoption by people not formally trained as computer scientists. The pragmatic syntax and explicit style helped very intelligent people without programming backgrounds pick up the language and get their work done with less fuss than other programming languages required. Over time, the code used in the financial world and scientific community was shared just as global open source communities were developing, which further spread its usage among a broader base of software developers.

There's no doubt some of the momentum behind Python's wide adoption for all types of data manipulation was that it happened to be the right language in the right place at the right time. Nevertheless, it was ultimately the hard work of a massive number of engineers and scientists around the world who created the incredible mix of data code libraries available today.

Data inspiration

Sometimes you just need to see it to understand how data analysis, visualization and storytelling can intersect in a meaningful way. The following resources do a great job of telling stories with data. There are more links to stories listed on the data analysis and data visualization pages.

Data — from objects to assets covers the history of data collection and usage, from 150 years ago to today. The article covers how initial steps by individual scientists sponsored by wealthy patrons in the 1800s gave way to systematic collection by governments and businesses in the 20th century. A significant amount of personal data is now held by a few dozen large corporations worldwide such as Google, Amazon and Facebook. The article covers some of the implications of data as a valuable asset and in general is a great read as a high-level overview of this topic.
Metadata Investigation : Inside Hacking Team presents what metadata is and how it can be used to track people even though it is often thought of as less of a problem than typical stored data.
A visual introduction to machine learning is a spectacular example of data visualization to explain what a machine learning model does on a San Francisco and New York housing data set.
Earthquake recurrence and survival analysis: How long should we wait for an overdue earthquake? combines earthquake data with questions around earthquake recurrence probabilities to tell its story.
Data Science Project: Profitable App Profiles for App Store and Google Play is a tutorial that shows you how to use iOS and Android app store data for business analysis. This post is part of a larger series on how to get your first job as a data scientist, all of which is worth reading to understand how working with data intersects with figuring out its value to companies and organizations.

Example data sets

Looking for freely available data to use in your projects but not sure where to get it? The following links have large, free, open data sets.

Check out the awesome public datasets project repository for data in many different categories ranging from finance to museums.
Kickstarter datasets are scraped JSON and CSV structured monthly data from Kickstarter projects.
Data is Plural is a weekly newsletter that highlights open data that you can use for your projects. I have been a subscriber to the newsletter for a couple of years now and love seeing the wide variety of data sources that are freely available.
Data analysis and machine learning projects provides more than just the data. It also includes instructions and code for working with the data in your own development environment.
Discovering millions of datasets on the web introduces Google's dataset search and explains what they learned from iterating on earlier versions of it before they released this one.

General Python data resources

PyData is a community for developers and users of Python data tools. They put on fantastic conferences around the world and fund the continued development of open source data-related libraries.
Anaconda is one of the leading Python companies that pours a tremendous amount of time and funding into the data community.
A crash course in Python for scientists provides an overview of the Python language with IPython Notebook for those in scientific fields.
The videos of Travis Oliphant on Python's Role in Big Data Analytics: Past, Present, and Future and Building the PyData Community give historical perspective on how the Python data tools have evolved over the past 20ish years based on his first-hand experience as a leader and member in that community.
The State of Python Speech Recognition in 2021 is a practical overview of a specific area in data: extracting text from voice recording data. Looking at verticals like this one can make it easier to understand changes that are occurring in some parts of data and programming that could be applied to other areas.
Automated Data Wrangling covers cleaning, labeling, and automating the many activities that are typically necessary before analysis and data usage can begin for a project.
The Open Source Data Science Masters is a well-crafted free curriculum and set of resources for students who want to learn both the theory and technologies for working with data.
Reproducible research: Stripe’s approach to data science goes through the workflow and tools, such as Jupyter Notebook, that Stripe uses for data analysis across the company.
The Definitive Data Scientist Environment Setup explains how to set up both a hardware and software configuration that is conducive to data science research and analysis.

What else would you like to learn about Python and data?

Tell me about standard relational databases.

What're these NoSQL data stores hipster developers keep talking about?

Why is Python a good programming language to use?

Full Stack Python

Full Stack Python is an open book that explains concepts in plain language and provides helpful resources for those topics.
