Intraday Dataset Details

Gaps

Unlike end-of-day data, the quality of intraday data varies enormously, and the data should always be cleaned and tested before it is used to develop trading systems.

The first test should always be for gaps. Because of the volume of data, vendors do not always catch gaps, so users need to test for them. If the data includes zero-volume bars, testing for missing minutes is trivial: the number of regular-trading-hours bars should be the same on every full trading day. However, many vendors omit zero-volume bars to keep file sizes down, especially when out-of-hours trades are included. In that case, the first test should be for missing days, as most gaps are actually very large (usually 3-10 days).
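
The missing-days test can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `missing_weekdays` and its `min_run` parameter are hypothetical, and the sketch assumes every weekday is a trading day, so exchange holidays will show up as short runs unless `min_run` is raised or a real trading calendar is substituted.

```python
from datetime import datetime, timedelta


def missing_weekdays(timestamps, min_run=1):
    """Find runs of consecutive weekdays that have no bars at all.

    timestamps: iterable of datetime objects, one per bar.
    Returns a list of (first_missing_day, last_missing_day, n_weekdays).
    Assumption: every weekday is a trading day (holidays are not excluded).
    """
    days = sorted({ts.date() for ts in timestamps})
    runs = []
    for prev, nxt in zip(days, days[1:]):
        # Collect the weekdays strictly between two days that do have data.
        missing = []
        d = prev + timedelta(days=1)
        while d < nxt:
            if d.weekday() < 5:  # Monday=0 ... Friday=4
                missing.append(d)
            d += timedelta(days=1)
        if len(missing) >= min_run:
            runs.append((missing[0], missing[-1], len(missing)))
    return runs
```

Calling `missing_weekdays(stamps, min_run=3)` would report only runs of three or more missing weekdays, which matches the typical gap size mentioned above and skips single-day holidays.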
Testing for missing bars within the day is more challenging. The best approach is to calculate distributions for the number of bars in three trading periods: the first hour after the open, the last hour before the close, and the interim period in between. The number of bars on each day can then be compared to these distributions. The number of bars during the day is close to normally distributed, so a 3-standard-deviation test can be used.
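
A rough sketch of the three-period test is shown below. The session boundaries (09:30 to 16:00, as for US equities), the names `SESSIONS`, `bars_per_session` and `flag_unusual_days`, and the choice to flag deviations in both directions are all assumptions made for illustration; adjust them to the market and data in question.

```python
from datetime import datetime, time
from statistics import mean, pstdev

# Assumed regular-hours sessions; change these for other markets.
SESSIONS = {
    "open_hour": (time(9, 30), time(10, 30)),
    "midday": (time(10, 30), time(15, 0)),
    "close_hour": (time(15, 0), time(16, 0)),
}


def bars_per_session(timestamps):
    """Count bars per session per day; out-of-hours bars are ignored."""
    timestamps = list(timestamps)
    days = sorted({ts.date() for ts in timestamps})
    # Start every day at zero so a session with no bars is still counted.
    counts = {name: dict.fromkeys(days, 0) for name in SESSIONS}
    for ts in timestamps:
        t = ts.time()
        for name, (start, end) in SESSIONS.items():
            if start <= t < end:
                counts[name][ts.date()] += 1
    return counts


def flag_unusual_days(counts, n_sd=3.0):
    """Return (day, session, bar_count, z_score) where |z| exceeds n_sd."""
    flagged = []
    for name, per_day in counts.items():
        values = list(per_day.values())
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # every day has the same count, nothing to flag
        for day, n in per_day.items():
            z = (n - mu) / sd
            if abs(z) > n_sd:
                flagged.append((day, name, n, round(z, 2)))
    return flagged
```

A negative z-score points to missing bars; a large positive one is less likely to be a gap but may still be worth a look. The test needs a reasonably long history, since with only a handful of days no single day can sit 3 standard deviations from the mean.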
A final test is to correlate volume with the number of bars. Volume typically correlates with the number of bars, so a high-volume day with a low number of bars is a red flag.
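
One way to sketch this check is shown below. The helper names `pearson` and `high_volume_low_bars`, the input layout (a dict mapping each day to its total volume and bar count), and the one-standard-deviation cut-off `z_cut` are illustrative assumptions rather than a prescribed method.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def high_volume_low_bars(daily, z_cut=1.0):
    """Flag days whose volume is unusually high while the bar count is low.

    daily: dict mapping day -> (total_volume, bar_count).
    A day is flagged when its volume z-score is above z_cut and its
    bar-count z-score is below -z_cut (z_cut is an assumed threshold).
    """
    vols = [v for v, _ in daily.values()]
    bars = [b for _, b in daily.values()]
    mv, sv = mean(vols), pstdev(vols)
    mb, sb = mean(bars), pstdev(bars)
    if sv == 0 or sb == 0:
        return []
    return [
        day
        for day, (v, b) in daily.items()
        if (v - mv) / sv > z_cut and (b - mb) / sb < -z_cut
    ]
```

Computing `pearson` over the daily volumes and bar counts gives a quick sanity check on the dataset as a whole, while `high_volume_low_bars` picks out the individual days that break the expected relationship.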
The gap test should always be performed before the other tests: gaps cannot be fixed, and there is little point in spending time on further tests if the dataset is unusable.

Erroneous Datapoints

For various reasons, incorrect data can sometimes be recorded in the intraday bars.
A useful first test is to ensure that each bar's high is at or above, and its low at or below, both the open and the close. It is surprisingly common for the close to be below the bar low, which is obviously an error.
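
A minimal sketch of this consistency check follows. The `Bar` record and the function name `ohlc_violations` are hypothetical; real data will usually carry a timestamp and volume as well.

```python
from typing import NamedTuple


class Bar(NamedTuple):
    """Hypothetical minimal bar record used for illustration."""
    open: float
    high: float
    low: float
    close: float


def ohlc_violations(bars):
    """Return the indices of bars whose high/low do not bracket open and close."""
    bad = []
    for i, b in enumerate(bars):
        # high must be >= both open and close; low must be <= both.
        if b.high < max(b.open, b.close) or b.low > min(b.open, b.close):
            bad.append(i)
    return bad
```

If both conditions hold, the high is necessarily at or above the low, so no separate high-versus-low check is needed.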
The next test should be for outliers: bad datapoints that are too far from their neighbouring datapoints to be considered accurate. There are various ways to test for this, but we usually advocate creating a distribution of the bar ranges (bar high minus bar low) during the day and then screening for bars outside this distribution (more than 3 SDs on the right of the distribution).
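
The range screen could look roughly like this. The function name `range_outliers` and the use of the population standard deviation over a single day's bars are assumptions made for the sketch.

```python
from statistics import mean, pstdev


def range_outliers(highs, lows, n_sd=3.0):
    """Return indices of bars whose range (high - low) is unusually wide.

    highs, lows: equal-length sequences for the bars of one day.
    A bar is flagged when its range exceeds mean + n_sd * standard deviation,
    i.e. only the right-hand tail of the distribution is screened.
    """
    ranges = [h - l for h, l in zip(highs, lows)]
    mu, sd = mean(ranges), pstdev(ranges)
    if sd == 0:
        return []  # all ranges identical, nothing stands out
    cutoff = mu + n_sd * sd
    return [i for i, r in enumerate(ranges) if r > cutoff]
```

Because a bad print inflates both the mean and the standard deviation, a very noisy day can mask smaller errors; rerunning the screen after removing flagged bars is one possible refinement.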

Correcting the Data

Gaps, as noted above, cannot be corrected. However, erroneous datapoints can be. When correcting a datapoint, as much information as possible should be preserved (some aggressive correction protocols replace all of the bar's OHLC datapoints). Typically only one of the open, high, low or close is corrupted and needs correcting, and replacing that datapoint with the midpoint between the two closest bars is an appropriate fix.
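
The single-field fix can be sketched as below. This reads "the midpoint between the two closest bars" as the average of the same field in the immediately preceding and following bars, which is an interpretation rather than something the text spells out; the `Bar` record and `repair_field` helper are likewise hypothetical.

```python
from typing import NamedTuple


class Bar(NamedTuple):
    """Hypothetical minimal bar record used for illustration."""
    open: float
    high: float
    low: float
    close: float


def repair_field(bars, i, field):
    """Return a copy of bars with one field of bar i replaced.

    The replacement is the midpoint of the same field in the previous and
    next bars (an assumed reading of "the two closest bars"). The other
    three fields of bar i are left untouched to preserve information.
    """
    if not 0 < i < len(bars) - 1:
        raise IndexError("need a bar on each side of the one being repaired")
    mid = (getattr(bars[i - 1], field) + getattr(bars[i + 1], field)) / 2
    fixed = list(bars)
    fixed[i] = bars[i]._replace(**{field: mid})
    return fixed
```

For example, `repair_field(bars, 1, "high")` replaces only the high of the second bar. The repaired bar is not guaranteed to satisfy the high/low ordering against its own open and close, so it is worth rerunning the consistency test after any correction.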
