Recommendations to follow:
 
There is no better solution than to implement a data capture mechanism that encourages, coaxes, or incentivizes the end users of the application to provide the data, so that missing values do not arise in the first place.

One of the slides or tabulations in your study should explicitly report the percentage of missing values for every variable used in the study.
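
As a minimal sketch of such a tabulation, the function below computes the percentage of missing values per variable. The variable names, the sample records, and the set of markers treated as "missing" are all assumptions made for illustration; your data dictionary should define the real ones.

```python
import math

# Assumption: these markers all mean "missing" in the delivered data.
MISSING_MARKERS = {"", "NA", "N/A", "."}


def is_missing(value):
    """Return True if a value should be counted as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_MARKERS:
        return True
    return False


def percent_missing(rows, variables):
    """Return {variable: percent of rows in which the value is missing}."""
    rows = list(rows)
    n = len(rows)
    return {
        var: (100.0 * sum(is_missing(row.get(var)) for row in rows) / n) if n else 0.0
        for var in variables
    }


# Hypothetical study records.
records = [
    {"age": 34, "income": 52000, "region": "East"},
    {"age": None, "income": 61000, "region": "West"},
    {"age": 45, "income": "", "region": "NA"},
    {"age": 29, "income": None, "region": "South"},
]

missing_table = percent_missing(records, ["age", "income", "region"])
for variable, pct in missing_table.items():
    print(f"{variable:<8} {pct:5.1f}% missing")
```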

Explain why missing values occur for each of the study variables.

Sometimes, missing values are confused with values that naturally do not exist because of how the variable is constructed or defined. The explanation of the % missing in the study variables should distinguish between the two. For example, in surveys that use hierarchical, rule-based questions, some values will be absent simply because of the structure of the questions.
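
The distinction can be sketched in code. The example below assumes a hypothetical survey in which `car_brand` is asked only when `owns_car` is "yes"; the question names and responses are made up. A blank follow-up is counted as truly missing only when the respondent was eligible to answer it.

```python
from collections import Counter


def classify_response(row, question, gate=None, gate_value=None):
    """Label a value as 'answered', 'not_applicable', or 'missing'.

    `gate` is the earlier question that controls whether `question`
    is asked; `gate_value` is the answer that triggers the follow-up.
    """
    value = row.get(question)
    if value not in (None, ""):
        return "answered"
    gate_answer = row.get(gate) if gate is not None else None
    # Skipped by design: the gate was answered and routed past this question.
    if gate_answer not in (None, "") and gate_answer != gate_value:
        return "not_applicable"
    return "missing"


# Hypothetical survey responses.
responses = [
    {"owns_car": "yes", "car_brand": "Toyota"},
    {"owns_car": "no", "car_brand": None},
    {"owns_car": "yes", "car_brand": None},
    {"owns_car": "no", "car_brand": None},
]

labels = Counter(
    classify_response(r, "car_brand", gate="owns_car", gate_value="yes")
    for r in responses
)

naive_pct = 100.0 * (labels["missing"] + labels["not_applicable"]) / len(responses)
eligible = labels["answered"] + labels["missing"]
true_pct = 100.0 * labels["missing"] / eligible

print(f"Naive % missing: {naive_pct:.0f}%")
print(f"% missing among eligible respondents: {true_pct:.0f}%")
```

In this made-up sample the naive count treats three of four values as missing, while only one of the two eligible respondents actually failed to answer.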

For best results in applied settings, use multiple imputation methods.
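
Multiple imputation produces several completed datasets, analyzes each one, and then combines the results. The sketch below shows only the combining step, using the standard pooling formulas known as Rubin's rules. The three estimates and their variances are made-up numbers standing in for the output of analyses on three imputed datasets; the imputation itself is not shown.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance). Needs at least two
    imputed datasets so the between-imputation variance is defined.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)       # average of the m estimates
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed datasets.
pooled_estimate, total_variance = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(f"Pooled estimate: {pooled_estimate:.2f}")
print(f"Total variance:  {total_variance:.2f}")
```

The total variance is larger than the average within-imputation variance because it also carries the disagreement between the imputed datasets, which is the uncertainty a single imputation hides.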

Both the “Delete Strategy” and the “Mean Strategy” can be heavily biased, the former more so than the latter, depending on how much data is missing.
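
A small made-up example shows how both strategies can mislead. It assumes the two highest incomes are the ones that went unreported; the numbers are invented purely for illustration.

```python
import statistics

# Hypothetical incomes (in thousands); the two largest were not reported.
true_incomes = [20, 30, 40, 50, 60, 70, 80, 90]
observed = [20, 30, 40, 50, 60, 70, None, None]

# Delete strategy: keep only the records with a value.
complete_cases = [v for v in observed if v is not None]

# Mean strategy: fill each gap with the mean of the observed values.
fill_value = statistics.mean(complete_cases)
mean_imputed = [fill_value if v is None else v for v in observed]

true_mean = statistics.mean(true_incomes)
delete_mean = statistics.mean(complete_cases)
imputed_mean = statistics.mean(mean_imputed)

true_sd = statistics.pstdev(true_incomes)
delete_sd = statistics.pstdev(complete_cases)
imputed_sd = statistics.pstdev(mean_imputed)

print(f"True mean {true_mean}, delete {delete_mean}, mean-filled {imputed_mean}")
print(f"True sd {true_sd:.1f}, delete {delete_sd:.1f}, mean-filled {imputed_sd:.1f}")
```

In this sample both strategies miss the true mean, and filling with the mean also shrinks the spread of the variable. Deleting records additionally discards whatever other variables those records carried.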

A Statistician’s Ten Steps for Data Quality Management

Every time you receive data, identify and agree on the metadata as implemented in the system versus the metadata that supports the business logic. Always ask for the data dictionary, which is managed by the IT department. Also ask for the first and last 10 records of the data being delivered.

  1. Ask for the data to be delivered in a format that you are comfortable handling (CSV, TXT with a special separator character, Excel, or other database and statistical formats such as SAS, SPSS, DB2, …). In my long experience, data delivered as fixed-format text is the easiest to work with. It is easier still if an automated process creates what is called a ‘Data Audit Report’, so that analysts can take a quick look at the delivered data and communicate with the data delivery team about its quality.
  2. Make sure you can read the data and output the top 10 and bottom 10 records. Visually inspect the sample values for each variable and make sure they match what the IT department promised to deliver.
  3. Check that the total number of observations sent by the provider equals the total number of observations received.
  4. How are the numeric elements coded? Numeric or character?
  5. For each numeric field, find out (1) whether it is an integer, (2) the minimum, (3) the maximum, and (4) the number of missing values. For each alpha (character) variable, check the equivalent: the full list of distinct values along with the number of missing values.
  6. Run the consistency checks that should hold among variables. For example, if the data contains a total revenue and also revenue by product group, make sure the product group revenues sum to the total revenue, after confirming with the business/IT managers that such a relationship is supposed to hold. This is the tricky part, because there are so many possible consistency checks. Identify the major ones that are quick to run and check them.
  7. The Data Audit Report should also show the distribution of each variable. For a numeric variable, use quintiles or deciles to see the distribution. For a character variable, use the frequency of each distinct value.
  8. Make sure weights are provided if the data comes from a sample survey or from a sample drawn from a population. If weights are not provided, create a weighting system using an auxiliary variable that is available for the full population.
  9. If the data is provided for a predictive model, make sure you select the right reference population when modeling the target population. Whether the application is B2B or B2C, the reference population is not the entire US population.
  10. Missing value distributions (missed or not) should also be covered in any communication with the IT department, so that the processes can be re-oriented to capture the data better.
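
Several of the steps above (row counts, first and last records, numeric versus character coding, min/max, missing counts, distributions, and a revenue consistency check) can be sketched as a small Data Audit Report. This is only an illustration using the Python standard library. The CSV content, the column names, and the choice to treat a blank product revenue as zero in the consistency check are all assumptions, not part of any real delivery.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical delivery; column names and values are made up.
SAMPLE_CSV = """customer_id,segment,revenue_a,revenue_b,total_revenue
1,B2B,100,50,150
2,B2C,80,,80
3,B2B,40,60,100
4,,70,30,110
5,B2C,20,20,40
6,B2C,10,15,25
"""


def to_number(text):
    """Return a float if the text parses as a number, else None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def audit_column(values):
    """Profile one column: missing count plus a numeric or character summary."""
    present = [v for v in values if v is not None and v.strip() != ""]
    summary = {"n_missing": len(values) - len(present)}
    numbers = [to_number(v) for v in present]
    if present and all(n is not None for n in numbers):
        summary.update(
            kind="numeric",
            all_integer=all(n.is_integer() for n in numbers),
            min=min(numbers),
            max=max(numbers),
            # Quintile cut points; statistics.quantiles needs two or more values.
            quintiles=statistics.quantiles(numbers, n=5) if len(numbers) > 1 else [],
        )
    else:
        summary.update(kind="character", frequencies=dict(Counter(present)))
    return summary


def audit(csv_text, expected_rows, parts, total):
    """Build the audit report for a delivered CSV file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0]) if rows else []
    inconsistent = []
    for number, row in enumerate(rows, start=1):
        # Assumption: a blank product revenue counts as zero.
        part_sum = sum(to_number(row[p]) or 0.0 for p in parts)
        if abs(part_sum - (to_number(row[total]) or 0.0)) > 1e-9:
            inconsistent.append(number)
    return {
        "row_count_matches": len(rows) == expected_rows,
        "first_rows": rows[:10],
        "last_rows": rows[-10:],
        "columns": {c: audit_column([r[c] for r in rows]) for c in columns},
        "inconsistent_records": inconsistent,
    }


report = audit(SAMPLE_CSV, expected_rows=6,
               parts=["revenue_a", "revenue_b"], total="total_revenue")

print("Row count matches:", report["row_count_matches"])
for name, summary in report["columns"].items():
    print(name, summary)
print("Records failing the revenue check:", report["inconsistent_records"])
```

A report like this is meant to be sent back to the data delivery team along with the questions it raises, for instance why record 4 in the made-up sample does not add up.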
From Data Monster & Insight Monster