The website contains both historic and current datasets, and covers a wide range of critical topics. Researchers need to take the complex survey design into account in their analyses by appropriately specifying the sampling design parameters. This module addresses why weights are created and how they are calculated, and the importance of weights in making estimates that are representative of the U.S. population. Analysts must evaluate the statistical reliability of estimates to determine whether the results are appropriate for their intended research objective.

This module describes a number of measures that can be used to evaluate the reliability of an estimate, including the effective sample size, the design effect, the width and relative width of its confidence interval, the degrees of freedom, and the relative standard error.
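As a rough illustration of how several of these measures relate to one another, here is a minimal Python sketch. Python is used only for illustration; it is not one of the tutorial's software packages. The function name, the input numbers, and the default critical value of 1.96 are all assumptions made for the example, and the interval is a simple Wald-type interval rather than any method prescribed by NCHS.

```python
import math

def reliability_measures(estimate, se_design, se_srs, n, crit=1.96):
    """Summarize the reliability of one survey estimate (illustrative only).

    estimate  : point estimate (for example, a prevalence of 0.10)
    se_design : standard error that accounts for the complex design
    se_srs    : standard error under simple random sampling with the same n
    n         : nominal (unweighted) sample size
    crit      : critical value for the interval; 1.96 is an assumption here,
                a t-quantile based on the design degrees of freedom is more usual
    """
    deff = (se_design / se_srs) ** 2        # design effect: variance ratio
    n_eff = n / deff                        # effective sample size
    rse = 100.0 * se_design / estimate      # relative standard error, in percent
    ci_width = 2.0 * crit * se_design       # width of a Wald-type interval
    rel_ci_width = 100.0 * ci_width / estimate
    return {
        "deff": deff,
        "n_eff": n_eff,
        "rse_percent": rse,
        "ci_width": ci_width,
        "rel_ci_width_percent": rel_ci_width,
    }

# Hypothetical inputs: a prevalence of 10% from 2,000 sampled persons.
p, n = 0.10, 2000
se_srs = math.sqrt(p * (1 - p) / n)
result = reliability_measures(p, se_design=0.009, se_srs=se_srs, n=n)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

A design effect above 1 means the complex design yields a larger variance than a simple random sample of the same size, so the effective sample size is smaller than the nominal one.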

Programs are available as SAS programs. This page also contains code examples that demonstrate the application of tutorial concepts by replicating the estimates in selected NCHS publications.

Modules 6 - 10 are currently under construction.

National Center for Health Statistics.

Changes include:

- adding a new module 5, which describes measures that can be used to assess the reliability of estimates
- providing sample code to replicate estimates from an NCHS data brief, using SUDAAN, SAS survey, Stata, and R software
- expanding coverage of some topics such as identifying skip patterns, combining multiple survey cycles, and analyzing NHANES data with Stata or R software
- updating the content to describe more recent survey cycles and the current NHANES website

Modules are under construction and are subject to change.


The SSANA datasets have been temporarily removed to review and possibly update the weighting variable. Data documentation has been added for the Limited Access Data files below. No changes or corrections have been made to the data. The original pooled-sample weight created for the following laboratory files released in July did not accurately take into account the new sample design for this NHANES survey cycle, and it was not correctly stratified to the U.S. population.

The corrected sample weight was created so that analyses using race and Hispanic origin would be comparable to the three groups used in NHANES (non-Hispanic white, non-Hispanic black, and Mexican-American). No changes or corrections were made to the lab analyte data in this new release.

Any analyses of the data using the old public use data file should be repeated using the corrected sample weight on this new file. The following laboratory files have been withdrawn from public release because an error was discovered in the creation of sampling weights.

New files with corrected weights will be released as soon as feasible. These datasets have been withdrawn due to misspecification of the pooled-sample weights, and will be corrected and reposted. NHANES subsample weights have been revised to correct for the misspecification of domains used for calculating the sub-sample weights.

The following datasets have been updated with these revised weights:

### Module 6: Descriptive Statistics

The dietary tutorial is a major supplement to the existing NHANES Web Tutorial and provides information and instructions specific to dietary data and analyses. The first course, Dietary Data Orientation, gives an overview of the dietary data, provides roadmaps to the complex data structure and contents, and orients users to the NHANES website and its resources for dietary data analysis.

The second course, Preparing a Dietary Analytic Dataset, provides step-by-step instruction from locating and downloading variables, to merging and appending, formatting, and saving datasets. The latest addition, Advanced Dietary Analyses, describes techniques for estimating usual dietary and supplement intake, how dietary intakes vary among individuals, and how individual intakes relate to other factors.

These data files have been temporarily removed for further review and will be reposted at a later date.

Additional modules on advanced dietary analyses will be released later in the year. Documentation for more limited access datasets will be released in upcoming months. To locate the limited access dataset documentation, go to a survey data page, and click the Limited Access link.

Direct links to the pages have also been provided below.

## Downloading and analyzing NHANES datasets with Stata in a single .do file

The video demonstrates how body measurements are made during the examination. It is in eight captioned segments and uses Windows Media Player.

The first course, Dietary Data Survey Orientation, gives an overview of the dietary data, provides roadmaps to the complex data structure and contents, and orients users to the NHANES website and its resources for dietary data analysis. Courses covering basic and advanced data analyses are currently under development. Additional instructions on how to prepare an analytic dataset using Stata software will be forthcoming.

This module introduces how to generate the descriptive statistics for NHANES data that are most often used to obtain these estimates.

Topics covered in this module include checking frequency distribution and normality, generating percentiles, generating means, and generating proportions.

It is highly recommended that you examine the frequency distribution and normality of the data before starting any analysis. These descriptive statistics are useful in determining whether parametric or non-parametric methods are appropriate to use, and whether you need to recode or transform data to account for extreme values and outliers. A frequency distribution shows the number of individuals located in each category of a categorical variable.

For continuous variables, frequencies are displayed for values that appear at least one time in the dataset. Frequency distributions provide an organized picture of the data, and allow you to see how individual scores are distributed on a specified scale of measurement. For instance, a frequency distribution shows whether the data values are generally high or low, and whether they are concentrated in one area or spread out across the entire measurement scale.

A frequency distribution not only presents an organized picture of how individual scores are distributed on a measurement scale, but also reveals extreme values and outliers. Researchers can make decisions on whether and how to recode or perform data transformation based on the distribution statistics.

Frequency distributions can be structured as tables or graphs, but either should show the original measurement scale and the frequencies associated with each category. Because NHANES data have very large sample sizes with a potentially long list of different values for continuous variables, it is recommended that you use a graphic format to check the distribution for continuous variables, and either frequency tables or graphic forms for nominal or interval variables.
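A frequency table for a categorical variable can be sketched in a few lines of Python (used here only for illustration, not as one of the tutorial's software packages). The function name, the category labels, and the weights below are hypothetical, and the sketch produces point estimates only, with no standard errors.

```python
from collections import Counter, defaultdict

def frequency_table(values, weights=None):
    """Return {category: (count, weighted_total, weighted_percent)}.

    Records whose value is None are treated as missing and skipped.
    If no weights are given, every record counts as 1.
    """
    if weights is None:
        weights = [1.0] * len(values)
    counts = Counter()
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        if value is None:
            continue
        counts[value] += 1
        totals[value] += weight
    grand_total = sum(totals.values())
    return {
        category: (counts[category], totals[category],
                   100.0 * totals[category] / grand_total)
        for category in sorted(counts)
    }

# Hypothetical smoking-status responses with hypothetical sample weights.
status = ["never", "former", "current", "never", None, "never"]
weight = [1000.0, 2000.0, 1500.0, 500.0, 800.0, 1000.0]
table = frequency_table(status, weight)
for category, (count, total, percent) in table.items():
    print(f"{category:8s} n={count}  weighted total={total:.0f}  {percent:.1f}%")
```

Comparing the unweighted counts with the weighted percentages shows how much the weights shift the apparent distribution.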

Statistics of normality reveal whether a data distribution is normal and symmetrically bell-shaped or highly skewed. It is important to use these statistics to check the normality of a distribution because they will determine whether you will use parametric tests (which assume a normal distribution) or non-parametric tests, or whether you need to use a transformation in your analysis. Note: Before you analyze the data, it is important to check the distribution of the variables to identify outliers and determine whether parametric (for a normal distribution) or non-parametric tests are appropriate to use.

If you conduct tests for normality, results on most variables would be significant (that is, they would indicate a departure from normality). Therefore, users are discouraged from depending solely on these tests for normality. Instead, you can request a Q-Q plot to examine normality. A Q-Q plot, or quantile-quantile plot, is a graphical data analysis technique for assessing whether the distribution of the data follows a particular distribution. In a Q-Q plot, the distribution of the variable in question is plotted against a normal distribution.
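The points behind a normal Q-Q plot can be computed with the Python standard library, as in the minimal sketch below (Python is used for illustration only). The plotting positions `(i - 0.5) / n` are one common convention and are an assumption here, the data values are hypothetical, and the sketch ignores sample weights.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot.

    The reference distribution is a normal with the sample mean and
    sample standard deviation of the data.
    """
    observed = sorted(values)
    n = len(observed)
    reference = NormalDist(mean(observed), stdev(observed))
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Hypothetical measurements.
data = [4.1, 5.0, 5.2, 5.9, 6.3, 7.0, 7.4, 8.8]
points = qq_points(data)
for theo, obs in points:
    print(f"theoretical={theo:6.2f}  observed={obs:6.2f}")
```

If the observed quantiles track the theoretical ones closely, the plotted points lie near a straight line.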

The variable of interest is approximately normally distributed if the points fall along a straight line that intersects the y-axis at a 45-degree angle. The standard deviation is a measure of the variability of the distribution of a random variable. Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0. Kurtosis is a measure of the peakedness of the distribution.
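These quantities can be computed from sample moments, as in the following Python sketch (for illustration only). It is unweighted and uses simple moment-ratio formulas, which are an assumption: statistical packages often apply small-sample corrections, so their output may differ slightly. Under the moment-ratio formula a normal distribution has kurtosis 3, while the "excess" version subtracts 3 so that a normal distribution scores 0.

```python
def shape_statistics(values):
    """Return standard deviation, skewness, kurtosis, and excess kurtosis.

    Skewness and kurtosis are moment ratios (m3 / m2**1.5 and m4 / m2**2);
    the standard deviation uses the usual n - 1 denominator.
    """
    n = len(values)
    center = sum(values) / n
    m2 = sum((x - center) ** 2 for x in values) / n
    m3 = sum((x - center) ** 3 for x in values) / n
    m4 = sum((x - center) ** 4 for x in values) / n
    return {
        "sd": (m2 * n / (n - 1)) ** 0.5,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,             # equals 3 for a normal distribution
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # equals 0 for a normal distribution
    }

# Hypothetical, perfectly symmetric data, so the skewness is zero.
stats = shape_statistics([1, 2, 3, 4, 5])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```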

The kurtosis of a normally distributed random variable depends on the formula used.

The NNYFS collected data on physical activity and fitness levels through interviews and fitness tests. The NHANES I Epidemiologic Follow-up Study (NHEFS) was a national longitudinal study designed to investigate the relationships between clinical, nutritional, and behavioral factors assessed in the NHANES I and subsequent morbidity, mortality, and hospital utilization, as well as changes in risk factors, functional limitation, and institutionalization.

These oversampled groups included children aged 2 months to 5 years, persons over age 60, Mexican-American persons, and non-Hispanic black persons. This survey also concentrated on health and nutrition but additionally began to collect environmental exposure and infectious disease data.

The maximum age remained 74 years. There was also an augmentation to the survey on an additional national sample. This augmentation sample included only adults and did not oversample any population groups or include nutrition data.

A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death.

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and the requirement for weighting.

Plus, all of the files are in SAS XPT format so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on their website.

Files may be updated for errors or removed entirely without you knowing about it. Re-downloading all of the many files every time you want to do a project is a big headache. I love that Stata will download datasets for you with just a URL.

If the source files are updated by the CDC, no worry! Every time you run this .do file, the data are downloaded again from the source. For example, the Folate lab results were withdrawn in February.

The remaining steps are to save the data in Stata format, merge the files, review basic coding issues, run an analysis using weighting, and display data.
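The merge step boils down to matching records from different files on the respondent sequence number. Here is a minimal sketch of that idea in Python, shown only to illustrate the logic and not as a replacement for the Stata workflow. The function name and the data values are hypothetical, and the sketch assumes each file has at most one row per `seqn`.

```python
def merge_by_seqn(demo_rows, exam_rows):
    """Merge two lists of records on 'seqn', keeping every demographic row.

    Respondents with no examination record keep only their demographic fields.
    """
    exam_by_id = {row["seqn"]: row for row in exam_rows}
    merged = []
    for row in demo_rows:
        combined = dict(row)
        combined.update(exam_by_id.get(row["seqn"], {}))
        merged.append(combined)
    return merged

# Hypothetical records; riagendr and bmxbmi mimic NHANES variable names.
demo = [
    {"seqn": 1, "riagendr": 1},
    {"seqn": 2, "riagendr": 2},
    {"seqn": 3, "riagendr": 1},
]
exam = [
    {"seqn": 1, "bmxbmi": 27.3},
    {"seqn": 3, "bmxbmi": 22.1},
]
merged = merge_by_seqn(demo, exam)
for row in merged:
    print(row)
```

Keeping every demographic row, including respondents without an examination record, matters later because the variance estimation needs the full sample.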

### Software Tips

Set the working directory to be the same folder as your .do file. Finally, if you are trying to combine analyses from multiple NHANES cycles, things get a bit more complicated.

The survey is unique in that it combines interviews and physical examinations.

The NHANES program began in the early 1960s and has been conducted as a series of surveys focusing on different population groups or health topics. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.

The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases.

Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.

As in past health examination surveys, data will be collected on the prevalence of chronic conditions in the population. Estimates for previously undiagnosed conditions, as well as those known to and reported by respondents, are produced through the survey.

Smoking, alcohol consumption, sexual practices, drug use, physical fitness and activity, weight, and dietary intake will be studied. Data on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices, will also be collected.

The sample for the survey is selected to represent the U.S. population. Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities.

NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above.

To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest.

For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 years and over, the DESCRIPT procedure needs to read in the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.
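To see why the full dataset is needed, here is a minimal Python sketch of a subpopulation mean with a linearized standard error (Python is used for illustration only; it is not how SUDAAN is implemented). It assumes a stratified design with PSUs treated as sampled with replacement and no finite population correction. The function name, the field names, and all data values are hypothetical.

```python
from collections import defaultdict

def domain_mean(rows, in_domain):
    """Weighted mean of 'y' over a subpopulation, with a linearized SE.

    rows      : dicts with 'stratum', 'psu', 'w' (weight), and 'y'
    in_domain : function that returns True for subpopulation members
    Records outside the subpopulation contribute zeros, so every stratum
    and PSU in the design still enters the variance calculation.
    """
    flags = [1.0 if in_domain(r) else 0.0 for r in rows]
    w_sum = sum(r["w"] * d for r, d in zip(rows, flags))
    mean = sum(r["w"] * d * r["y"] for r, d in zip(rows, flags)) / w_sum

    # Linearized scores for the ratio estimator, totalled within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r, d in zip(rows, flags):
        psu_totals[r["stratum"]][r["psu"]] += r["w"] * d * (r["y"] - mean) / w_sum

    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, variance ** 0.5

# Hypothetical design: 2 strata with 2 PSUs each; y stands in for BMI.
rows = [
    {"stratum": 1, "psu": 1, "w": 100, "y": 28.0, "male": True, "age": 45},
    {"stratum": 1, "psu": 1, "w": 150, "y": 24.0, "male": False, "age": 30},
    {"stratum": 1, "psu": 2, "w": 120, "y": 31.0, "male": True, "age": 52},
    {"stratum": 1, "psu": 2, "w": 80, "y": 19.0, "male": True, "age": 12},
    {"stratum": 2, "psu": 1, "w": 200, "y": 26.0, "male": True, "age": 33},
    {"stratum": 2, "psu": 1, "w": 100, "y": 22.0, "male": False, "age": 61},
    {"stratum": 2, "psu": 2, "w": 90, "y": 23.0, "male": False, "age": 27},
    {"stratum": 2, "psu": 2, "w": 110, "y": 18.0, "male": True, "age": 15},
]

def is_adult_male(r):
    return r["male"] and r["age"] >= 20

mean_bmi, se_bmi = domain_mean(rows, is_adult_male)
print(f"mean = {mean_bmi:.2f}, linearized SE = {se_bmi:.2f}")
```

In this toy design, one PSU in stratum 2 has no men aged 20 or over. Dropping the other records before the calculation would leave that stratum with a single PSU, and the sketch would raise an error instead of returning a standard error.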

For more details on analyzing subgroups, see Module 4: Variance Estimation. Alternatively, the notsorted option on the SUDAAN procedure call will request that the SUDAAN procedure make a temporary copy of the input dataset and sort it by the required variables prior to conducting any calculations. However, since an analysis session typically includes multiple SUDAAN procedure calls with the same input dataset, it is generally more computationally efficient to sort the dataset once, prior to running any SUDAAN procedures.

A few differences are highlighted below. When you specify the NOMCAR option, the procedure computes variance estimates by analyzing the non-missing analysis values as a domain (subpopulation), where the entire population includes both non-missing and missing domains.

Use of the NOMCAR option is recommended, as the default assumptions — that the group of non-respondents do not differ in any relevant respect from the group of respondents and so may be treated as missing completely at random — are often not appropriate.

See the SAS documentation for more details. The degrees of freedom associated with an estimated statistic are needed to perform hypothesis tests and to compute confidence intervals.

SAS Survey procedures generally calculate the degrees of freedom based on the number of strata and PSUs represented in the overall dataset (especially if the NOMCAR option is used, as is recommended to estimate the standard errors correctly).

Estimates for some subgroups of interest will have fewer degrees of freedom than are available in the overall analytic dataset. See Module 4: Variance Estimation for more information. See the code example about diabetes prevalence, which replicates a portion of a National Health Statistics Report, for code to compute the degrees of freedom for subgroups and then calculate the Korn and Graubard confidence intervals.

The model statement will specify only one effect, which is either a classification variable or the crossed effect (interaction) of multiple classification variables.

The parameter estimates will represent the mean value of the dependent variable at each level of the effect.

To calculate the means and standard errors, you would use Stata survey (svy) commands because they account for the complex survey design of NHANES data when determining variance estimates. These commands can also be used for simple random samples.

Whenever you want to use SVY commands, you need to set up Stata by defining the survey design variables using the svyset command. This command declares the sampling weight, PSU, and strata variables for the dataset. Once you do this, Stata remembers these variables and applies them to every subsequent SVY command. If you save the dataset, Stata will remember these variables and apply them automatically when you reopen the dataset.

Standard commands are regular Stata commands that can incorporate sampling weights.

For example, if standard errors are not needed, you can simply use regular Stata commands with the weight variable. You only need to use these commands when there is no corresponding SVY command.

For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 and over, the svy:mean command needs to access the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.

Instead, you should use the subpop option available on the svy commands to specify your subpopulations of interest. Stata is case-sensitive, so you must refer to NHANES variables using the lowercase names provided on the data files.

For example, you must refer to the respondent sequence number (the key variable) as seqn with all lowercase letters, not as SEQN in uppercase letters. When you generate your own derived variables, you may choose to name them using uppercase characters, lowercase characters, or a mix of the two.

However, you must type the variable name consistently in all of your code. Stata commands are also case-sensitive. Stata represents missing numeric values as a period (.), which it treats as larger than any non-missing number. For example, a test of whether the fasting sample weight wtsaf2yr is non-missing and has a positive value has to check both conditions.

Stata procedures generally calculate the degrees of freedom based on the number of strata and PSUs represented in the overall dataset. In particular, although the svy:prop command as of Stata 15 has an option citype exact to compute Clopper-Pearson "exact" confidence limits for proportions, these confidence intervals are not based on the correct degrees of freedom for subgroups where not all strata and PSUs are represented.
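One common convention takes the design degrees of freedom as the number of PSUs minus the number of strata, counting for a subgroup only those strata and PSUs that actually contain subgroup members. Here is a minimal Python sketch of that count (for illustration only). The function name, field names, and data are hypothetical, and the convention should be checked against Module 4 before it is relied upon.

```python
def design_df(rows, in_subgroup=None):
    """Degrees of freedom as (number of PSUs) - (number of strata).

    When in_subgroup is given, only strata and PSUs that contain at least
    one subgroup member are counted.
    """
    strata = set()
    psus = set()
    for row in rows:
        if in_subgroup is None or in_subgroup(row):
            strata.add(row["stratum"])
            psus.add((row["stratum"], row["psu"]))
    return len(psus) - len(strata)

# Hypothetical design: 3 strata with 2 PSUs each, one record per PSU.
rows = [
    {"stratum": 1, "psu": 1, "group": "A"},
    {"stratum": 1, "psu": 2, "group": "A"},
    {"stratum": 2, "psu": 1, "group": "A"},
    {"stratum": 2, "psu": 2, "group": "B"},
    {"stratum": 3, "psu": 1, "group": "B"},
    {"stratum": 3, "psu": 2, "group": "B"},
]
overall_df = design_df(rows)                               # 6 PSUs - 3 strata
group_a_df = design_df(rows, lambda r: r["group"] == "A")  # 3 PSUs - 2 strata
print(overall_df, group_a_df)
```

The subgroup count is smaller than the overall count here because group A appears in only three of the six PSUs.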

**National Health and Nutrition Examination Survey (NHANES)**

For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 years and over, the svymean function needs information about all examined individuals who have an exam weight, including females and those younger than 20 years. As a rule of thumb, it is recommended that you NEVER subset your data frame prior to using the svydesign function to define a survey design object.