Finding and Using Data Sets

Introduction to this research guide


Librarians:
Denise Cote, Electronic Resources Librarian: [email protected]
Daniel Blewett, Social & Behavioral Sciences Librarian: [email protected]
Let us know if you need help!

Finding Data Sets using Library Resources

Use the Library's subscription resources to find statistical information and to browse topic areas for statistics on sociological topics. Each description below lists the database's features, including the types of data covered, export formats, and other tools.

How to access: You must use your COD Library card to access these databases from home. If you are a student in an online class, you can register for a card here, and your barcode will be emailed to you.
Encyclopedia Britannica: World Data Analyst
The World Data Analyst includes basic country-level demographic and government data from around the world. This resource has tools to perform country comparisons on multiple data points using both current and chronological data. It also includes a compiled data feature called "Ranked Statistics" that allows you to view the countries with the highest and lowest ranks on multiple data points. You can use the Country Snapshots feature to read about the countries you are analyzing to help explain your data.
Features
Export data sets: Yes (.txt, .pdf)
Custom reports/tables: Yes
Charts/Graphs: Yes
Citations: Yes
How-to: Guided Tour (.pdf)
Historical Statistics of the United States
The Historical Statistics of the U.S. (HSUS) collects quantitative facts about the United States dating from the early history of our country to 1995. Data is gathered from a variety of sources, including the U.S. Census. Table documentation and commentary are included to help you explain your data.
Features
Export data sets: Yes (.csv, .txt, .sav)
Custom reports/tables: Yes
Charts/Graphs: Yes
Citations: Yes
How-to: How to search HSUS; How to download tables
Special Note: To create custom tables (combining data sets), you'll have to make a personal account on HSUS. Just make sure you are in HSUS via the Library website, then create a free account. This account will also allow you to store your custom tables.

Statistical Abstract of the United States
The Statistical Abstract of the United States (StatAb) is a comprehensive summary of statistics on the social, political, and economic conditions of the United States, 2010-Present. The data comes directly from U.S. Government departments such as the Census Bureau, the Bureau of Labor Statistics, the National Center for Education Statistics, the Department of Health and Human Services, and many more. Table documentation is included to help you explain your data. Using StatAb makes exploring government data a bit easier. Note that you can further explore the data sources by referring to the government agency that provided the data.
Features
Export data sets: Yes (.csv, .xls, .pdf)
Custom reports: No
Charts/Graphs: No
Citations: Yes
How-to: StatAb User Guide
Exploring Statistics
Sage Stats
Sage Stats includes United States data measures across the social sciences for all 50 states from 1970-Present, with state, county, city, and metropolitan-level data. Sage Stats also allows for data visualization. Note: Sage Stats is typically used to explore statistics. In other words, it is awesome for finding facts about the U.S., visualizing data, and performing basic cross-tabulations. When selecting data tables to compare, be sure to choose those with an adequate number of cases and variables. You can use Sage Stats to explore data and then identify/trace back the original sources.
Features
Export data sets: Yes, but limited (.csv, .txt)
Custom reports/tables: Yes
Charts/Graphs: Yes. Also includes visualizations in .jpeg and .pdf
Citations: Yes
How-to: Sage Stats Overview (.pdf); Exporting data, step-by-step (.pdf)
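
If you export a table as .csv, you can also try a basic cross-tabulation on your own. Here is a minimal sketch in Python using only the standard library. The column names (region, income_level) and the sample rows are made up for illustration and are not real Sage Stats data; swap in the column names from your own export.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for an exported .csv file. The column names and
# values are invented for illustration; they are NOT real Sage Stats data.
SAMPLE_EXPORT = """region,income_level
Midwest,High
Midwest,Low
South,Low
South,Low
West,High
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per case (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def cross_tabulate(rows, row_var, col_var):
    """Count how many cases fall into each (row_var, col_var) combination."""
    return Counter((row[row_var], row[col_var]) for row in rows)


rows = load_rows(SAMPLE_EXPORT)
table = cross_tabulate(rows, "region", "income_level")

# Print one line per combination that actually occurs in the data.
for (region, level), count in sorted(table.items()):
    print(f"{region:8} {level:5} {count}")
```

A combination with very few cases is a sign that the table may not have enough cases to support the comparison you want to make.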

Statista
Statista is a repository of facts and demographics covering the U.S. and other countries that utilizes infographics rather than tables to display data. Statista focuses heavily on business and industry but it also includes social and health data. Statista also includes additional information and "reading support" that can help you explain the data. Statista does not include data sets but it is a good resource for finding a topic and will help you determine/trace back original data sources.

Other databases that include statistical data are here: http://www.codlrc.org/databases/statistics

Web Resources

There are many sources of social data available on the Internet. The following are the librarians' two favorite repositories for finding data sets. Both of these resources compile data from a variety of organizations, research institutes, governmental agencies, and other sources. More data repositories are available here.

**It is important to consult your instructor when selecting data sets to use for class assignments.**

Data.gov: U.S. Government Open Data Repository
Browse Data.gov by topic or use the search feature to find data on your subject. Use the dataset catalog to search and refine your results. Most data sets on Data.gov are downloadable in a variety of formats, including .csv, while others link to web sites or apps that help you access and/or use the data. Each data set includes information on the creator and publisher of the data. Many of the data sets include a description of the data, references to published articles that utilized the data set, and other important information to help you describe and document the data. Here's a quick tip on searching Data.gov
ICPSR
The Inter-university Consortium for Political and Social Research (ICPSR) is part of the Institute for Social Research at the University of Michigan. The ICPSR is a very rich resource that allows searching by subject, theme, geographic area, and specific study. It also includes replication data sets and analysis tools.

Using the ICPSR:
Here is an in-depth tutorial on how to use the ICPSR database (well worth watching!):

Ethical use of Data

The following is a mini-lecture in two parts about ethics in data science research.

Write it down!
Quick notes from the first program:
Informed Consent is:
The Basic Ethical Principles are:
1.
2.
3.
The Common Rule:
An IRB applies to:
Data Science Ethics: The Basics. Part One.



Data Science Ethics: The Basics. Part Two.

[Correction to this video (my bad!): please note that the infographic from the Pew Research Center is a PREDICTIVE analysis based on longitudinal data.]


Citing your sources


As with all other information resources, documenting and giving credit to the authors of the information is always required. Using data sets requires particular care in determining the original source of the data and the permissions/licensing under which the data can be used. For student researchers, permissions are a little more liberal since you are using data for personal education purposes and are not publishing the outcome of your work. It is still extremely important that you find out as much as you can about the data set you are using and document it correctly. Use the Library's Citing Sources page for guidance with citing textual and visual information.


How to Cite Data Sets in APA Style

Typical format for website:

Researcher or Research Sponsor. (Date). Complete title of data set [Data file (and/or corresponding materials)]. Retrieved from URL

For example:

Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures: A survey of Latinos on the news media [Data file and code book]. Retrieved from http://pewhispanic.org

The in-text citation would be: Pew Hispanic Center (2004) or (Pew Hispanic Center, 2004).

Typical format for data set with DOI (Digital object identifier):

Researcher or Research Sponsor. (Date). Complete title of data set [Data file (and/or corresponding materials)]. DOI of source

For example:

U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration, Office of Applied Studies. (2013). Treatment episode data set -- discharges (TEDS-D)-concatenated, 2006 to 2009 [Data set]. doi:10.3886/ICPSR30122.v2

The in-text citation would be: U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration, Office of Applied Studies (2013) or (U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration, Office of Applied Studies, 2013).

From: http://blog.apastyle.org
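
If you are citing several data sets, the two formats above can be assembled with a small helper. This is only a sketch of the patterns shown in this guide, not an official APA tool. The function names format_apa_dataset and format_in_text are made up for illustration, and plain text cannot show italics, so apply any title formatting yourself.

```python
def format_apa_dataset(author, year, title, materials="Data file",
                       url=None, doi=None):
    """Build a reference line for a data set from its parts.

    Pattern: Author. (Year). Title [Materials]. Retrieved from URL
         or: Author. (Year). Title [Materials]. doi:DOI
    """
    if doi:
        locator = f"doi:{doi}"
    elif url:
        locator = f"Retrieved from {url}"
    else:
        raise ValueError("Provide either a URL or a DOI.")
    # Drop a trailing period (e.g. after initials) so we never print "..".
    author = author.rstrip(".")
    return f"{author}. ({year}). {title} [{materials}]. {locator}"


def format_in_text(author, year, narrative=False):
    """Build the in-text citation: Author (Year) or (Author, Year)."""
    if narrative:
        return f"{author} ({year})"
    return f"({author}, {year})"


reference = format_apa_dataset(
    author="Pew Hispanic Center",
    year=2004,
    title=("Changing channels and crisscrossing cultures: "
           "A survey of Latinos on the news media"),
    materials="Data file and code book",
    url="http://pewhispanic.org",
)
print(reference)
print(format_in_text("Pew Hispanic Center", 2004))
```

Always compare the result against the examples above and your instructor's requirements before turning in your work.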

A real life example of citing a data set

This is a tough one. This data set found on Kaggle was derived from a published scholarly paper. The paper, "The Use of Multiple Measurements in Taxonomic Problems," was published in 1936 in the journal Annals of Eugenics by R.A. Fisher. The Center for Machine Learning and Intelligent Systems at UC Irvine recreated the data set in machine-readable format and published the data set on Kaggle. So, we have the original paper with the data and we have this data set created by UCI. The paper is original work and the UCI data set is derivative work. Which one do we cite? [I did some background digging because I'm a librarian. I found that 1) this article is in the public domain, and 2) the publisher of the article made it freely available for scholarly use.]

If I were going to use this data set in an assignment, I would cite both the data set on Kaggle AND the article. I'm going to have to mention the original article in my assignment anyway because I need to explain what my data is about and define the variables. My text might look something like this: The Iris Species data set (UC Irvine, 2016) was derived from a paper published by Fisher (1936), which included data on three iris species, with 50 samples of each and some descriptive measurements of each flower.
So, here are my citations:

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188. doi:10.1111/j.1469-1809.1936.tb02137.x

University of California Irvine, Center for Machine Learning and Intelligent Systems. (2016). Iris species [Data file]. Retrieved from https://www.kaggle.com/uciml/iris

It is also important to note that UCI published their work on Kaggle with a Creative Commons public domain license (CC0) so we are free to use this data set in just about any way we'd like.
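
Before you describe a data set in your assignment, it is worth checking that the file you downloaded matches the description you plan to cite. Here is a minimal sketch in Python that counts the samples for each species. It assumes the file has a column named Species, which is an assumption about the download rather than something stated in this guide, and the four sample rows are placeholders, not the real data.

```python
import csv
import io
from collections import Counter


def count_samples_per_species(csv_file, species_column="Species"):
    """Count how many rows (samples) belong to each species."""
    reader = csv.DictReader(csv_file)
    return Counter(row[species_column] for row in reader)


# Placeholder rows so the sketch runs on its own. The column name and
# labels are assumptions for illustration; this is NOT the real data set.
SAMPLE = """Id,Species
1,Iris-setosa
2,Iris-setosa
3,Iris-versicolor
4,Iris-virginica
"""

counts = count_samples_per_species(io.StringIO(SAMPLE))
for species, n in sorted(counts.items()):
    print(species, n)
```

To check your own download, pass an opened file instead of the placeholder text, for example open("Iris.csv", newline="") where Iris.csv is whatever you named the file. If the counts do not match the description in the paper (three species, 50 samples each), find out why before you cite it.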