Microdata and tabular data are two common products of a data collection effort.
Microdata files are the actual electronic records of individual youth
and include personal information such as name, social security number,
age, race, gender, and offense, along with other demographic factors.
When released, these data files, while rich in information, present
unacceptable risks of disclosing confidential youth information. Tabular
data include the numbers, percents, and rates within a table and the
discussion of these data within the text.
Just as microdata files threaten the confidentiality requirements of 28 CFR
Part 22, tabular data files present additional risks. Confidentiality issues
arise when cells within a table include only a few youth or when characteristics,
such as ethnicity, are uniquely distinguishing. Under these conditions,
researchers may be able to identify an individual youth and, in combination
with other tables, identify additional information such as ‘most
serious offense’. Disclosure risk also arises when a table
cell includes all youth within a field, thus disclosing information
about each of them. For example, a frequency table that shows the fifteen
learning disabled youth in a school district who are aged 13-15 and
are all under the supervision of the juvenile justice system would
constitute disclosure. To protect these data and to ensure that the
risk for disclosure is minimal, the information in microdata and tabular
data files is restricted through statistical disclosure limitation
techniques. Once made available for public use, the files are considered
restricted data products.
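The two risk conditions described above (cells with only a few youth, and cells that contain every youth in a group) can be checked mechanically before a table is released. The following is a minimal sketch, not an established procedure: the function name `flag_risky_cells`, the nested-dictionary table layout, and the default threshold of 3 (meaning counts of one or two are flagged) are all assumptions made for illustration.

```python
def flag_risky_cells(table, small_threshold=3):
    """Return (group, category, reason) tuples for cells that may disclose.

    `table` is a hypothetical layout: {group: {category: count}}.
    """
    flags = []
    for group, cells in table.items():
        total = sum(cells.values())
        for category, count in cells.items():
            if 0 < count < small_threshold:
                # Only a few youth fall in this cell.
                flags.append((group, category, "small cell"))
            elif total > 0 and count == total:
                # Every youth in the group falls in this one cell.
                flags.append((group, category, "all youth in one cell"))
    return flags


# Illustrative counts only, modeled on the example in the text.
table = {
    "learning disabled, aged 13-15": {"under supervision": 15, "not under supervision": 0},
    "aged 16-17": {"under supervision": 2, "not under supervision": 40},
}
flags = flag_risky_cells(table)
```

Here the first group is flagged because all fifteen youth share one cell, and the second because one cell holds only two youth.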
Statistical disclosure limitation techniques should be carried out by a
professional who has appropriate statistical knowledge and is familiar with the
microdata file and the tabular data under consideration. Although this implies
significant expertise and skill, juvenile justice professionals without
such experience and training will nonetheless be able to recognize
what these techniques intend to achieve. The examples of statistical
disclosure limitation techniques below are not intended to be comprehensive
or technical. Numerous ‘how-to’ manuals are available
for researchers who wish to learn how to apply these statistical methods.
STATISTICAL DISCLOSURE LIMITATION METHODS: Microdata Files
- Remove direct identifiers: name, social security number, and date of birth.
- Collapse information into larger categories: rare offenses, such as
‘weapons in school’, should be re-categorized as ‘weapons’.
- Mask subject identification: create a new subject identification variable
and drop all other identifying information from the data set.
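The three microdata steps listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than a production routine: the field names (`name`, `ssn`, `date_of_birth`, `offense`, `subject_id`), the recoding table, and the function `limit_disclosure` are hypothetical, and a real file would need its own list of identifying fields.

```python
import secrets

# Assumed field names for direct identifiers; adjust to the actual file.
IDENTIFYING_FIELDS = {"name", "ssn", "date_of_birth"}
# Assumed recoding table that collapses rare offenses into broader categories.
OFFENSE_RECODE = {"weapons in school": "weapons"}


def limit_disclosure(records):
    """Return de-identified copies of the input records (list of dicts)."""
    released = []
    for record in records:
        # Step 1: remove direct identifiers.
        clean = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        # Step 2: collapse rare offense categories into larger ones.
        if "offense" in clean:
            clean["offense"] = OFFENSE_RECODE.get(clean["offense"], clean["offense"])
        # Step 3: mask subject identification with a new random ID
        # that carries no information about the youth.
        clean["subject_id"] = secrets.token_hex(8)
        released.append(clean)
    return released
```

The input records are left unmodified, so the original file can be kept under restricted access while only the returned copies are considered for release.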
Reference pseudonym: coded identifiers replace personally identifiable
youth data. Breaking confidentiality requires the reference list that
links codes with youth.
Reversible encryption: the encrypted
format is created by a mathematical algorithm and contains identifiable
youth data in a hidden form that can be unhidden given access to the
encryption algorithm.
Irreversible (one-way) encryption: the
encrypted format is created by a secure algorithm and produces
a unique code for each youth that cannot be converted back
to personally identifiable data. It is computationally infeasible to
determine personally identifiable information from the encrypted format.
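To make the difference between a reference pseudonym and a one-way code concrete, here is a small sketch using only the Python standard library. The function names are hypothetical, and HMAC-SHA-256 is one possible choice of one-way function, not a method named by the source. Reversible encryption is left out because it would require a third-party cryptography library.

```python
import hashlib
import hmac
import secrets


def reference_pseudonyms(youth_ids):
    """Assign a random code to each youth; return (codes, reference_list).

    The reference list maps each code back to the youth, so whoever
    holds it can re-identify records.
    """
    reference_list = {}
    for youth_id in youth_ids:
        reference_list[secrets.token_hex(8)] = youth_id
    return list(reference_list), reference_list


def one_way_code(youth_id, secret_key):
    """Keyed one-way hash: the same ID and key always yield the same code,
    but the code cannot be converted back to the ID."""
    return hmac.new(secret_key, youth_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

With the reference pseudonym, re-identification is a simple lookup in `reference_list`. With the one-way code, no such list exists, although records for the same youth can still be linked because the code is stable for a given key.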
Before an organization releases microdata files, staff should be knowledgeable
about Federal and agency-level confidentiality regulations along with
related statistical disclosure limitation methods. Each microdata file
intended for public use should be reviewed and analyzed for
the application of statistical disclosure limitation processes. In addition,
disclosure limitation practices should be consistent within and among agencies
with overlapping microdata files to prevent disclosure among linked
data sets.
STATISTICAL DISCLOSURE LIMITATION METHODS: Tabular Data
- Suppress (do not publish) sensitive cells: table cells with a count
of one or two.
- Round (adjust) values in all cells to a specified base: all rounded values
(other than zero) are multiples of 3 (base 3) or 5 (base 5). Bases
3 and 5 are the most common choices for rounding.
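Cell suppression as described in the first item can be sketched as follows. The nested-dictionary table layout, the function name `suppress_small_cells`, and the `*` marker are assumptions for illustration only.

```python
def suppress_small_cells(table, marker="*"):
    """Replace counts of one or two with a marker; other cells are unchanged.

    `table` is a hypothetical layout: {row_label: {column_label: count}}.
    """
    return {
        row: {col: (marker if count in (1, 2) else count) for col, count in cells.items()}
        for row, cells in table.items()
    }
```

Note that if row or column totals are published alongside the table, a single suppressed cell can be recovered by subtraction, so this sketch on its own is not a complete treatment.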
Statistical disclosure limitation methods are not needed for reports
of tabular data that represent a single variable, such as gender. However,
when tables display two or more variables (for example, age by gender by race),
a method such as rounding is applied to the table cells. Raw data are
used to produce the table; the rounding procedure is the final step
in tabular presentation. For example, when rounding raw data with base
3, the count in each table cell is rounded to the nearest multiple of 3,
so each cell has one added, has one subtracted, or remains the same.
The procedures applied to cells are not shared with the public when the
tabular data are released. When cell counts are small or when youth characteristics
are unique, rounding protects against disclosure of youth identity and confidential
information.
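Rounding to the nearest multiple of a base, as described above, can be written as a short function. This is a minimal sketch assuming non-negative integer counts; the name `round_to_base` is hypothetical.

```python
def round_to_base(count, base=3):
    """Round a non-negative integer count to the nearest multiple of `base`."""
    remainder = count % base
    if remainder * 2 < base:
        return count - remainder           # closer to the multiple below
    return count + (base - remainder)      # closer to the multiple above


# Counts 0..6 rounded with base 3: each moves by at most one.
rounded = [round_to_base(n) for n in range(7)]  # [0, 0, 3, 3, 3, 6, 6]
```

With base 3, a count of one becomes zero and a count of two becomes three, so the smallest cells no longer show their true values.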
Rounded numbers are also used when calculating percentages and rates.
This practice prevents reconstruction of raw data and unintended disclosure
of youth data. Because raw data and rounded tabular data are not the
same, there is a risk of error in secondary analysis of tabular data.
When cell counts are greater than 25, numbers, percentages, and
rates calculated from rounded data differ only minimally from those calculated
from actual data; however, cells with smaller counts are less accurate. Researchers conducting
secondary analyses must determine whether rounded data are suitable
for analysis and for drawing accurate conclusions.
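The loss of accuracy in small cells follows from simple arithmetic: base-3 rounding changes a count by at most one, so the relative error is at most one divided by the count. The sketch below illustrates this under the assumption of base-3 rounding to the nearest multiple; the function names are hypothetical and the counts are illustrative, not taken from any data set.

```python
def round_to_base(count, base=3):
    """Round a non-negative integer count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


def relative_error(count, base=3):
    """Relative difference between the rounded count and the actual count."""
    return abs(round_to_base(count, base) - count) / count


# A count of 4 is published as 3 (25% off); a count of 25 is published
# as 24 (4% off); a count of 100 is published as 99 (1% off).
errors = {n: relative_error(n) for n in (4, 25, 100)}
```

A secondary analyst could use a check like this to judge whether the cells of interest are large enough for rounded values to support the intended conclusions.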