Problem Solving with InfoNet 2: The problem of wrong and missing data


Courtesy of Innovyze

The problem

Do you know the validity of the data in your various databases?
Do you know how accurate it is?
Do you know how much data is missing?
Have you ever checked it?
All of it?

When we talk to our clients about InfoNet and the benefits it brings to the storage and retrieval of their asset data, we sometimes have the opportunity to run some of their data through InfoNet's validation processes, for both their collection system assets and their water distribution assets. The client is often surprised by the extent of the errors. So the problem seems to have two aspects. First, data is inaccurate or missing. Second, the client may be unaware of the extent of these errors.

An example of error messages on data validated by InfoNet

We are less surprised, because data held in files and databases that have no error checking, or only simplistic checks, is bound to contain errors. The data will probably have been input over a number of years and will probably be of varying quality, with the good survey data indistinguishable from the less good. The problem is compounded because most data storage systems are not specific to hydraulics or networks. There are firm rules of logic that can be used to check hydraulic data (for example, water runs downhill in unpressurized pipes, and pipes in water distribution networks are usually connected to something at each end), but a general-purpose system does not contain such logic and cannot use it in validation routines.

The use of InfoNet

InfoNet is purpose-built for water companies, and can impose on its data a number of validation checks, some of which would apply to any data and some of which are specific to the assets of water companies. The validation checks can be described in four categories:

  • Low-level checks
  • Required data checks
  • Range checks
  • SQL and logic checks

Low-level checks – InfoNet contains some very simple checks associated with the different assets it covers. For example, if data refers to a pipe, is that pipe below ground? If data refers to a wet well pump, is the level at which the pump switches on higher than the level at which it switches off? These are the obvious sanity checks, with each asset type having its own.
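The two sanity checks mentioned above can be sketched in a few lines of Python. This is an illustration only, not InfoNet's implementation: the class names, the field names (`invert_level`, `ground_level`, `switch_on_level`, `switch_off_level`) and the message wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pipe:
    pipe_id: str
    invert_level: float   # hypothetical field: level of the pipe invert
    ground_level: float   # hypothetical field: ground level at the same point


@dataclass
class WetWellPump:
    pump_id: str
    switch_on_level: float
    switch_off_level: float


def check_pipe(pipe):
    """Return error messages for a pipe that is not below ground."""
    errors = []
    if pipe.invert_level >= pipe.ground_level:
        errors.append(f"{pipe.pipe_id}: pipe is not below ground")
    return errors


def check_pump(pump):
    """Return error messages for a pump that switches on below its off level."""
    errors = []
    if pump.switch_on_level <= pump.switch_off_level:
        errors.append(
            f"{pump.pump_id}: switch-on level is not higher than switch-off level"
        )
    return errors


print(check_pipe(Pipe("P2", invert_level=13.0, ground_level=12.5)))
print(check_pump(WetWellPump("W1", switch_on_level=3.0, switch_off_level=1.5)))
```

Each asset type gets its own small function, which mirrors the idea that every asset type has its own sanity checks.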

Required data checks – to address the problem of missing data, InfoNet can check that a particular data element is present across the subset of data being checked. The requirement can be specified for each dataset, so that when data from a particular survey is being entered, for example, the data is checked for completeness against the specification of that survey. Different surveys will have different data specifications, and the user can redefine the requirement each time.
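A required data check of this kind might look like the sketch below. It assumes records are plain dictionaries and that a survey specification is simply a list of field names; both are conventions invented for the example and say nothing about how InfoNet stores its data.

```python
def find_missing_fields(records, required_fields):
    """Map each record id to the required fields that are absent or empty."""
    missing = {}
    for record in records:
        absent = [
            field for field in required_fields
            if record.get(field) in (None, "")
        ]
        if absent:
            missing[record["id"]] = absent
    return missing


# Hypothetical specification for one survey; another survey could use another list.
survey_spec = ["diameter", "material", "upstream_invert"]

records = [
    {"id": "P1", "diameter": 225, "material": "VC", "upstream_invert": 10.2},
    {"id": "P2", "diameter": 300, "material": "", "upstream_invert": None},
]

print(find_missing_fields(records, survey_spec))
# {'P2': ['material', 'upstream_invert']}
```

Because the specification is passed in as an argument, it can be redefined for each dataset without changing the check itself.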

Range checks – these are numerical checks to make sure the numbers make sense. They can be simple checks that refer to no other data entities: does the pipe diameter fall within the range of pipe sizes used by the company? Or they can check a number in the context of other numbers: for example, is an upstream sewer pipe larger than the downstream pipe to which it is connected? If so, this highly unlikely item of data needs reporting.
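Both kinds of range check can be illustrated with a short Python sketch. The dictionary layout, the `diameter_mm` and `downstream` keys and the 100 to 1200 mm company range are assumptions chosen for the example, not values taken from InfoNet.

```python
def check_diameter_range(pipes, min_mm, max_mm):
    """Return ids of pipes whose diameter lies outside [min_mm, max_mm]."""
    return [
        pid for pid, pipe in pipes.items()
        if not (min_mm <= pipe["diameter_mm"] <= max_mm)
    ]


def check_downstream_size(pipes):
    """Return (upstream, downstream) id pairs where the upstream pipe is larger."""
    suspect = []
    for pid, pipe in pipes.items():
        ds = pipe.get("downstream")
        if ds in pipes and pipe["diameter_mm"] > pipes[ds]["diameter_mm"]:
            suspect.append((pid, ds))
    return suspect


pipes = {
    "P1": {"diameter_mm": 300, "downstream": "P2"},
    "P2": {"diameter_mm": 225, "downstream": "P3"},
    "P3": {"diameter_mm": 450, "downstream": None},
    "P4": {"diameter_mm": 5000, "downstream": "P3"},
}

print(check_diameter_range(pipes, min_mm=100, max_mm=1200))  # ['P4']
print(check_downstream_size(pipes))  # [('P1', 'P2'), ('P4', 'P3')]
```

The first function looks at each value in isolation; the second needs the network context, because it compares a pipe with its downstream neighbour.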

SQL and logic checks – finally, there are logic checks. Users can specify their own in SQL, but the common checks are built into the system. These are:
  • Connectivity checks – does the data truly specify a network, or does it actually represent a number of sub-networks with key links unknown because of missing connectivity data? Checks on real data often reveal the latter case.
  • Tracing checks – InfoNet can show an upstream or downstream trace from any point in the network to highlight flow paths.
  • Intermediate path checks – InfoNet can highlight every path through the network that connects two given points.
  • Proximity checks – InfoNet can find assets that lie within a user-specified distance of another asset. For example, checking for all assets that are within a yard (metre) of another will probably identify where connections exist but have been hidden by slight GIS data errors. InfoNet can make these connections if required.
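To make the connectivity and proximity checks concrete, here is a minimal Python sketch. It assumes nodes are held as a dictionary of (x, y) coordinates and links as pairs of node ids, and it uses a union-find structure to group nodes into sub-networks. This is one common way to do it; the source does not say which algorithm InfoNet uses.

```python
import math
from itertools import combinations


def find_subnetworks(nodes, links):
    """Group node ids into connected sub-networks using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps the trees shallow
            n = parent[n]
        return n

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def find_near_misses(nodes, links, tolerance):
    """Return pairs of unlinked nodes that lie within `tolerance` of each other."""
    linked = {frozenset(link) for link in links}
    return [
        (a, b) for a, b in combinations(sorted(nodes), 2)
        if frozenset((a, b)) not in linked
        and math.dist(nodes[a], nodes[b]) <= tolerance
    ]


# Hypothetical data: B and C are about 0.5 units apart but not linked.
nodes = {"A": (0.0, 0.0), "B": (50.0, 0.0), "C": (50.4, 0.3), "D": (100.0, 0.0)}
links = [("A", "B"), ("C", "D")]

print(len(find_subnetworks(nodes, links)))          # 2
print(find_near_misses(nodes, links, tolerance=1.0))  # [('B', 'C')]
print(len(find_subnetworks(nodes, links + [("B", "C")])))  # 1
```

In this example the connectivity check reports two sub-networks, the proximity check suggests that B and C are probably the same point, and adding that link joins the data into a single network. The pairwise comparison is fine for a sketch; a large network would need a spatial index.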

Most of these checks can be user-specified or taken from the default checks within InfoNet.

Checks can be run at two points. First, data that is already within InfoNet can be checked. Second, data can be checked on entry, whether it is entered in bulk or keyed in.

Once a validation is completed, a report is produced. If a number of checks are being run in parallel, they can be prioritised so that the report is sorted in the required priority order. As well as reports, InfoNet can export files, for example to provide input to Excel reports that include charts and graphs.
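The idea of a prioritised report with an export for spreadsheet use can be sketched as follows. The finding records, the priority table and the column names are hypothetical; the export uses Python's standard `csv` module to produce text that Excel can open, which is only one possible export format.

```python
import csv
import io


def build_report(findings, priority):
    """Sort findings by the user-assigned priority of their check (lowest first)."""
    return sorted(findings, key=lambda f: (priority[f["check"]], f["asset_id"]))


def export_csv(report):
    """Return the report as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["check", "asset_id", "message"])
    writer.writeheader()
    writer.writerows(report)
    return buffer.getvalue()


findings = [
    {"check": "range", "asset_id": "P4", "message": "diameter outside company range"},
    {"check": "connectivity", "asset_id": "C", "message": "node not joined to main network"},
    {"check": "required", "asset_id": "P2", "message": "material missing"},
]
priority = {"connectivity": 1, "required": 2, "range": 3}  # user-defined order

report = build_report(findings, priority)
print(export_csv(report))
```

Keeping the priority table separate from the findings means the same validation results can be re-sorted for different audiences.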

Every data item in InfoNet can be flagged. Usually this is used to log the source of the data. However, in the event that a specific improbable data value is checked and found to be correct, this fact can be recorded.
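One way to picture a flagged data item is as a value paired with a short flag code, as in the sketch below. The flag codes ("SV", "AS", "VF") are invented for illustration, and the choice to skip verified values in later range checks is an assumption about how such a flag might be used, not a description of InfoNet's behaviour.

```python
from dataclasses import dataclass


@dataclass
class FlaggedValue:
    value: float
    flag: str  # hypothetical codes: "SV" survey, "AS" assumed, "VF" verified


def out_of_range(values, low, high):
    """Return names of out-of-range values, skipping those flagged as verified."""
    return [
        name for name, item in values.items()
        if item.flag != "VF" and not (low <= item.value <= high)
    ]


diameters = {
    "P1": FlaggedValue(225, "SV"),
    "P4": FlaggedValue(5000, "AS"),  # improbable and unverified: reported
    "P9": FlaggedValue(3000, "VF"),  # improbable, but checked and found correct
}

print(out_of_range(diameters, low=100, high=1200))  # ['P4']
```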

Finally, just as data can be validated, corrections can be calculated by interpolation or inference. Missing data, or data that fails logic tests, can be inferred by reference to rules, under complete user specification and user control. For example, if a pipe is missing from a network between two existing pipes of the same diameter, the connecting pipe is likely to have that diameter. Other such inferences can also be applied.
