
Understanding Research Data: Types and How to Differentiate Them

Types of Research Data

Data may be grouped into four main types based on methods of collection: observational, experimental, simulation, and derived.

The type of research data you collect may affect the way you manage that data. For example, data that is hard or impossible to replace (e.g. the recording of an event at a specific time and place) requires extra backup procedures to reduce the risk of data loss. Also, if your research requires combining data points from different sources, you will need to follow best practices to prevent data corruption.

Observational Data

Observational data are captured through observation of a behavior or activity. They are collected using methods such as human observation, open-ended surveys, or an instrument or sensor that monitors and records information, such as sensors that observe noise levels at the Mpls/St Paul airport. Because observational data are captured in real time, they would be very difficult or impossible to re-create if lost. An example of observational data is a researcher stopping random people on the street to ask how many children they have, then using this data to decide whether there should be more schools in that area.

Experimental Data

Experimental data are collected through active intervention by the researcher to produce and measure change or to create a difference when a variable is altered. Experimental data typically allow the researcher to determine a causal relationship and are typically projectable to a larger population. This type of data is often reproducible, but reproducing it can be expensive. An example of experimental data is a researcher feeding cinnamon extract daily to a group of diabetics while a control group of diabetics is fed none. After a month, the control group would probably show no change but the group fed the cinnamon extract would show lesser risks of heart

Simulation Data

Simulation data are generated by imitating the operation of a real-world process or system over time using computer test models. Simulations are used, for example, to predict weather conditions, economic outcomes, chemical reactions, or seismic activity. This method is used to try to determine what would, or could, happen under certain conditions. The test model used is often as important as, or even more important than, the data generated from the simulation. Another example of simulation data can be seen in the crash tests automobile companies carry out with dummies on their vehicles to ascertain the level of damage a crash would inflict on an occupant and how best to avoid that damage.
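As a rough illustration of how a test model generates simulation data, here is a minimal sketch. The cooling scenario, the function name `simulate_cooling`, and every parameter value are assumptions chosen for illustration; they do not come from any real study.

```python
def simulate_cooling(start_temp, room_temp, rate, minutes):
    """Step a simple cooling model forward one minute at a time.

    Each step moves the temperature toward room temperature by a fixed
    fraction (rate) of the remaining gap. Returns one reading per minute,
    including the starting value.
    """
    temps = [start_temp]
    for _ in range(minutes):
        current = temps[-1]
        temps.append(current - rate * (current - room_temp))
    return temps


# Hypothetical conditions: a 90-degree drink in a 20-degree room.
run = simulate_cooling(start_temp=90.0, room_temp=20.0, rate=0.1, minutes=30)
```

Changing `rate` or `room_temp` produces a different dataset from the same code, which is one way to see why the model and its settings need to be kept alongside the simulated output.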

Derived/Compiled Data

Derived data are created from existing data points, often from different data sources, through some sort of transformation, such as an arithmetic formula or aggregation. For example, combining area and population data from the Twin Cities metro area creates population density data. While this type of data can usually be replaced if lost, it may be very time-consuming (and possibly expensive) to do so.

Several processes create derived data:

  • Extracting: taking portions of the database.
  • Restructuring: altering the layout of the database.
  • Annotation: augmenting an existing dataset to include new fields.
  • Summarizing or analyzing: generating statistical summaries of fields in a dataset.
  • Correcting: validating and correcting dataset A against dataset B.
  • Inferencing: generating new data based on one or more datasets using reasoning.
  • Model generation: training neural networks based on prior datasets.

However, certain processes do not create derived data. These include using a dataset, copying, changing the format, packaging, and validation.
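The population density example above can be sketched in a few lines. The region names and figures below are hypothetical placeholders, not real Twin Cities statistics, and `population_density` is a name made up for this sketch.

```python
# Two separate "sources": one holds area, the other holds population.
# All values are invented for illustration.
area_sq_km = {"County A": 500.0, "County B": 1200.0}
population = {"County A": 250000, "County B": 300000}


def population_density(area, pop):
    """Derive people per square km for regions present in both sources."""
    shared = area.keys() & pop.keys()
    return {name: pop[name] / area[name] for name in sorted(shared)}


density = population_density(area_sq_km, population)
```

Only regions that appear in both sources get a derived value, which is a simple guard against mismatched records when combining data.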

Quantitative data vs Qualitative data

Studies can use quantitative data, qualitative data, or both types of data. Each approach has advantages and disadvantages.

Quantitative Data

Also known as numerical data.

Quantitative variables can be continuous or discrete.

  • Continuous: the variable can, in theory, be any value within a certain range. It can be measured. Examples: height, weight, blood pressure, cholesterol.
  • Discrete: the variable can only have certain values, usually whole numbers. It can be counted. Examples: number of visits to a doctor in the last year, number of fractures, number of children.

Qualitative Data

Also known as non-numerical data. Qualitative variables can be nominal or ordinal.

  • Nominal: the variable does not have a specific order. Examples: eye color, blood type, ethnicity.
  • Ordinal: the variable has a specific order. Examples: stages of cancer, class letter grade, position in a race.
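The practical difference between nominal and ordinal variables shows up in which summaries make sense. The sketch below uses made-up values; `GRADE_ORDER` is an assumed ranking supplied by hand, since the order of an ordinal variable is not something the software can infer.

```python
from collections import Counter

# Nominal: categories have no order, so counting is the natural summary.
blood_types = ["A", "O", "B", "O", "AB", "O"]
blood_type_counts = Counter(blood_types)

# Ordinal: categories have an order, which must be stated explicitly.
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # lowest to highest
grades = ["B", "A", "C", "B", "F"]

# Sort by position in the stated order, not alphabetically.
ranked = sorted(grades, key=GRADE_ORDER.index)
median_grade = ranked[len(ranked) // 2]
```

A median grade is meaningful because grades can be ranked; a "median blood type" would not be.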

Paired Data vs Independent Data

When documenting research, it is reasonable to justify the choice of analysis to prevent the reader from believing that the analysis that best supported the hypothesis was chosen rather than the one most appropriate to the data. The most important thing when making this decision is not to make unsupported assumptions about the data and apply methods that assume “better” data than you have. Instead, you need to ask questions like: Are your data paired?

Paired data are often the result of before and after situations, e.g. before and after treatment. In such a scenario each research subject would have a pair of measurements and it might be that you look for a difference in these measurements to show improvement due to the treatment.

In most data analysis tools the data would be coded into two columns, with each row holding the before and the after measurement for the same individual. We might, for example, measure the balance performance of 10 subjects with a Balance Performance Monitor (BPM) before and after a month-long course of exercise designed to improve balance. Each subject would have a pair of balance readings. This would be paired data. In this simple form, we could do several things with the data: we could find the average balance reading (means or medians), or we could graph the data on a box plot, which would show both levels and spread, give us a feel for the data, and reveal any outliers.
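A minimal sketch of that two-column layout, using only the Python standard library. The readings are invented for illustration and are not real BPM measurements; the box plot is left out because it would need a plotting library.

```python
from statistics import mean, median

# Hypothetical balance readings for 10 subjects.
# Position i in both lists belongs to the same subject.
before = [52, 48, 60, 55, 47, 63, 58, 50, 49, 61]
after = [56, 50, 63, 55, 52, 66, 57, 54, 53, 65]

# Per-subject change: only meaningful because the rows are paired.
differences = [a - b for b, a in zip(before, after)]

summary = {
    "mean_before": mean(before),
    "mean_after": mean(after),
    "median_before": median(before),
    "median_after": median(after),
    "mean_difference": mean(differences),
}
```

The mean of the per-subject differences equals the difference between the two column means, but the individual differences also show how consistently subjects changed.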

In the example above, the data are paired and each subject has a pair of numbers. What if you made your subjects do another month of exercise and measured their balance again? Each subject would then have three numbers. The data would still be paired, but rather than stretch the English language by talking about a pair of three, we call this repeated measures.

A word of warning: sometimes you might gather paired data but end up with independent groups.

Say, for example, you decided that the design above was flawed (which it is) because it doesn’t take into account the fact that people might simply get better at balancing on the balance performance monitor after having had their first go a month before. That is, we might see an increase in their balance simply due to using the balance monitor. To counter this possible effect, we could recruit another group of similar subjects, who would be assessed on the BPM but not undertake the exercise sessions. This lets us assess the effect of measurement without exercise on this control group. We then have a dilemma about how to treat the two sets of data. We could analyze them separately and hope to find a significant increase in balance in our treatment group but not in the non-exercise group. A better method would be to calculate the change in balance for each individual and see if there is a significant difference in that change between the groups. This latter method ends with the analysis actually being carried out on non-paired data.
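The change-score approach can be sketched as follows. All readings are made up, and the choice of Welch's t statistic for comparing the two independent sets of change scores is an assumption of this sketch; the text does not name a specific test. Only the statistic is computed here, with no p-value.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical (before, after) balance readings; every number is invented.
exercise_group = [(52, 58), (48, 53), (60, 66), (55, 59), (47, 54)]
control_group = [(51, 52), (49, 51), (59, 60), (56, 56), (46, 48)]


def change_scores(pairs):
    """Collapse each subject's paired readings into a single change value."""
    return [after - before for before, after in pairs]


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


exercise_change = change_scores(exercise_group)
control_change = change_scores(control_group)
t_statistic = welch_t(exercise_change, control_change)
```

Once each subject is reduced to one change score, nothing links a row in one group to a row in the other, so the comparison is between independent samples.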

If you are not sure whether two columns of data are paired or not, consider whether rearranging the order of one of the columns would affect your data. If it would, they are paired. Paired data often occur in ‘before and after’ situations. They are also known as ‘related samples’. Non-paired data can also be referred to as ‘independent samples’.
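The rearranging check can be tried directly. In this sketch, with invented readings, reversing one column leaves that column's own total unchanged but alters every row-wise comparison, which is the sign that the rows were paired.

```python
# Hypothetical before/after readings for five subjects.
before = [52, 48, 60, 55, 47]
after = [58, 53, 66, 59, 54]


def row_differences(first, second):
    """Difference between the two columns, row by row."""
    return [b - a for a, b in zip(first, second)]


original = row_differences(before, after)
reordered = row_differences(before, list(reversed(after)))

# The column on its own is unaffected by reordering...
same_column_total = sum(after) == sum(reversed(after))
# ...but the per-row results change, so the pairing carries information.
pairing_matters = original != reordered
```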

Scatterplots (also called scattergrams) are only meaningful for paired data.