The rise 📈 of data multi-sourcing and using Juxt.io to improve data quality
In recent years, individuals and companies have sought to use various data strategies and tools to improve the data quality of key business processes. Today, we will talk about a strategy called data multi-sourcing, which can be a powerful driver of higher-quality data. We will also talk about Juxt.io, a tool that helps facilitate the data multi-sourcing process.
What is Multi-sourcing?
Multi-sourcing is the concept of using multiple outside providers in a beneficial and collaborative way. The idea has been used in the manufacturing industry when a buyer wants to ensure there are multiple suppliers for an item. In the software industry, this concept can also be used so that a process is not dependent on a single data provider.
A software process can use multiple data providers to increase both data availability and data quality. Availability improves because, if Provider A has a system outage while Provider B is still operational, your system can rely on Provider B alone until Provider A is back online. Quality improves because you can compare the sources against each other and detect significant deviations in the values being sent. Ideally the data from the different sources would be nearly identical; in practice, providers gather and process data differently, and those variations can produce meaningful differences.
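The availability half of this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `provider_a` and `provider_b` are hypothetical stand-ins for real provider clients, and the returned values are made up.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Provider = Tuple[str, Callable[[str], Record]]


def fetch_with_failover(ticker: str, providers: List[Provider]) -> Tuple[str, Record]:
    """Try each provider in order and return (provider_name, record)
    from the first one that answers without raising."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(ticker)
        except Exception as exc:  # an outage, timeout, bad response, ...
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical providers, used only to exercise the function.
def provider_a(ticker: str) -> Record:
    raise ConnectionError("system outage")  # simulate Provider A being down


def provider_b(ticker: str) -> Record:
    return {"Open": 10.0, "Close": 10.5}  # made-up values


source, record = fetch_with_failover("XYZ", [("provider_a", provider_a),
                                             ("provider_b", provider_b)])
print(source, record)
```

Because the function reports which provider answered, a caller can log how often the fallback is used and notice when the primary source becomes unreliable.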
Using Juxt.io to detect data quality issues
To illustrate the variations that can be seen when multi-sourcing data, we will compare data from two financial data providers: IEX Cloud and Yahoo Finance. You can run this example yourself with the following Python script. When we upload the IEX and Yahoo Finance data to Juxt.io, we can specify how we want to compare each field. Our files contain the fields Ticker, Sector, Open, Close, Volume, and SharesOutstanding. In this example we run a strict match on the Open and Close prices and apply a relative tolerance to Volume and SharesOutstanding.
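To make the difference between a strict match and a relative tolerance concrete, here is a minimal standalone sketch in plain Python. It does not use Juxt.io or its API, and it is not the script referred to above. The tolerance values (5% and 1%), the `RULES` structure, and the sample records are all assumptions chosen for illustration, not the configuration or data used in this post.

```python
import math

# Assumed comparison rules: field -> (mode, relative tolerance).
RULES = {
    "Open": ("strict", None),
    "Close": ("strict", None),
    "Volume": ("relative", 0.05),             # assumed 5% tolerance
    "SharesOutstanding": ("relative", 0.01),  # assumed 1% tolerance
}


def compare_records(a: dict, b: dict, rules: dict) -> dict:
    """Return {field: (value_a, value_b)} for every field that fails its rule."""
    mismatches = {}
    for field, (mode, tol) in rules.items():
        x, y = a[field], b[field]
        if mode == "strict":
            ok = x == y
        else:
            # True when |x - y| <= tol * max(|x|, |y|)
            ok = math.isclose(x, y, rel_tol=tol)
        if not ok:
            mismatches[field] = (x, y)
    return mismatches


# Made-up records for a fictional ticker; not real provider data.
iex = {"Ticker": "XYZ", "Open": 10.0, "Close": 10.5,
       "Volume": 1_000_000, "SharesOutstanding": 50_000_000}
yahoo = {"Ticker": "XYZ", "Open": 10.0, "Close": 10.5,
         "Volume": 1_200_000, "SharesOutstanding": 50_100_000}

mismatches = compare_records(iex, yahoo, RULES)
print(mismatches)  # {'Volume': (1000000, 1200000)}
```

In this made-up case the SharesOutstanding values differ by about 0.2%, which is inside the assumed 1% tolerance, while the Volume values differ by roughly 17% and are flagged.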
This is the configuration used:
Here is the final result:
As the results show, the multi-sourcing strategy and Juxt.io have highlighted several cases where the Volume and SharesOutstanding data differ by more than we expected. When a difference is detected, a process can be designed to alert a person, who then investigates and determines the true value.
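One possible shape for such an alerting step is sketched below. It assumes the detected differences are already available as a mapping from ticker to `{field: (value_a, value_b)}`; that structure, the function names, and the sample values are hypothetical. The sketch orders alerts by relative gap so that the largest discrepancies are reviewed first.

```python
def relative_gap(a: float, b: float) -> float:
    """Gap between two values as a fraction of the larger magnitude."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def build_alerts(differences: dict) -> list:
    """Turn detected differences into review messages, largest gap first."""
    rows = []
    for ticker, fields in differences.items():
        for field, (a, b) in fields.items():
            rows.append((relative_gap(a, b), ticker, field, a, b))
    rows.sort(key=lambda row: row[0], reverse=True)
    return [f"{ticker} {field}: {a} vs {b} ({gap:.1%} apart)"
            for gap, ticker, field, a, b in rows]


# Made-up differences for two fictional tickers.
differences = {
    "AAA": {"Volume": (100, 120)},
    "BBB": {"SharesOutstanding": (1000, 2000)},
}
alerts = build_alerts(differences)
for message in alerts:
    print(message)
```

The messages could then be sent by email or posted to a chat channel; how they are delivered is left open here.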
We’d love to hear about your experience multi-sourcing data and using Juxt.io. You can reach us at email@example.com. Thanks for reading!