Data Repositories

More web resources for data science projects


Five Thirty Eight is an excellent site that publishes data-driven articles. Many of the data sets used in its reporting are available on GitHub. View the Five Thirty Eight datasets here:

Socrata Open Data contains numerous clean data sets that you can explore in your browser using the site's visualization tools. Caution: many of these data sets are out of date, but you can use Socrata to get topic ideas and to trace back to more current versions of the data.

Amazon makes large data sets available on its Amazon Web Services platform. The AWS public data sets are available for free for one year when you sign up with AWS. You can download the data to your computer or analyze it in the cloud using services such as Amazon EC2 instances or Hadoop.

Google BigQuery Public Datasets let you query up to 1 TB of data per month for free. The collection of public data sets made available by Google is varied and eclectic. It includes personal names registered via the Social Security Administration after 1879 (how many people share your name?) and the GDELT book corpus, which contains data on all of the public domain books available on the Internet Archive and the HathiTrust (about 4 million books. Awesome!).
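To make the "how many people share your name?" question concrete, here is a minimal Python sketch of the kind of tally you could run once you have exported a names table to CSV. The column names (name, year, number), the sample rows, and the helper functions load_rows and count_people_named are all made up for illustration; check the schema of the data set you actually download and adjust accordingly.

```python
import csv
import io

# Made-up sample rows for illustration only, NOT real figures.
# The column names (name, year, number) are an assumption about the export.
SAMPLE_CSV = """name,year,number
Ada,1910,120
Ada,1911,135
Grace,1910,300
ada,1912,98
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_people_named(rows, name):
    """Sum the 'number' column over every row whose name matches, ignoring case."""
    target = name.strip().lower()
    return sum(
        int(row["number"])
        for row in rows
        if row["name"].strip().lower() == target
    )


rows = load_rows(SAMPLE_CSV)
print(count_people_named(rows, "Ada"))
```

To use this on a real export, read the file's text with open(...) and pass it to load_rows in place of SAMPLE_CSV. For very large tables it is usually cheaper to do the aggregation in the hosting service itself and download only the result.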

The City of Chicago data portal contains current and historical data collected by the City government. Included are data sets on doing business in Chicago, crime statistics, health data, environmental information, and much more.

Or just use Kaggle and go nuts. Kaggle is like social media for data nerds. It hosts a continuously growing collection of public data sets and includes analysis and visualization tools called Kaggle Kernels, and you can publish your analyses of data sets on the site. Explore other users' Kernels for ideas, comment on their work, and learn data science and machine learning using the Kaggle resources. You can even enter data science competitions. Get your machine learning game on! Sign up for free. Look for attribution and rights information on the site so you can give credit to the researchers and data compilers.

Fun stuff:
The Simpsons by the data, MTG Cards, Amazon Fine Food Reviews.