Architectural design for data consistency on distributed analytic system
I am refactoring an analytic system that will do a lot of calculation, and I need some ideas on possible architectural designs for a data consistency issue I am facing.
I have a queue based system, in which different requesting applications create messages that are eventually consumed by workers.
Each “Requesting App” breaks down a large calculation into smaller pieces that will be sent to the queue and processed by the workers.
When all the pieces are finished, the originating “Requesting app” will consolidate the results.
Also, the workers consume information from a centralized database (SQL Server) in order to process the requests (Important: the workers do not change any data on the database, only consume it).
Ok. So far, so good. The problem arises when we include a web service that updates the information on the database. This can happen at any time, but it is critical that each “large calculation” originating from the same “Requesting App” sees the same data on the database. For example:
- App A generates messages A1 and A2 and sends them to the queue.
- Worker W1 picks up message A1 for processing.
- The web server updates the database, changing from state S0 to S1.
- Worker W2 picks up message A2 for processing.
I just can't have worker W2 using state S1 of the database; for the whole calculation to be consistent, it should use the previous state S0.
A lock pattern that prevents the web service from changing the database while any worker is consuming information from it.
- cons: The lock might be held for a long time, since the calculations from different “Requesting Apps” might overlap (A1, B1, A2, B2, C1, B3, etc.).
Create a new layer between the database and the workers (a server that controls DB caching per requesting app).
- cons: Adding another layer might impose significant overhead (maybe?), and it is a lot of work, since I would have to rewrite the persistence layer of the workers (a lot of code).
I am leaning toward the second solution, but I am not very confident about it.
Any brilliant ideas? Am I designing it wrong, or missing something?
- This is a HUGE 2-tier legacy system (in C#) that we are trying to evolve into a more scalable solution with as little effort as possible.
- Each worker is potentially running on a different server.
3 Answers
Can you version your DB?
Let's say the requesting application stamps the start of the calculation with ct1. Now every message this calculation generates carries that same timestamp.
Each DB update likewise stamps the new DB state with the time of the update, so state S0 is at time t0, state S1 at t1, etc.
Now when a worker gets a message, it needs to fetch the DB state whose update time is the largest one that is less than or equal to the message timestamp. In your example, if A1 and A2 are stamped with ct1 and t1 > ct1, both workers will retrieve S0 and not S1.
This means, of course, that you need to hold several versions in your DB. You can clean up old versions after a certain time if you know that your computations must finish within some time window.
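The lookup rule above fits in a few lines. A minimal sketch (Python for illustration; `versions` is assumed to be a list of `(update_time, state)` pairs kept by the versioned DB):

```python
def pick_version(versions, message_ts):
    """Return the state whose update time is the largest one
    that is less than or equal to the message timestamp."""
    eligible = [(ts, state) for ts, state in versions if ts <= message_ts]
    if not eligible:
        raise LookupError("no DB state exists at or before the message timestamp")
    # Tuples compare by their first element, so max() picks the latest eligible state.
    return max(eligible)[1]

# The example from the question: S0 written at t0, S1 at t1,
# and both A1 and A2 stamped with ct1 where t0 <= ct1 < t1.
versions = [(100, "S0"), (200, "S1")]
print(pick_version(versions, 150))  # both workers see S0, not S1
```

In SQL Server this kind of lookup maps naturally onto snapshot isolation or temporal tables, which keep the older row versions for you.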
I like option 2, especially if the amount of data needed for the full set of calculations isn’t unreasonably large. I assume there is a way to correlate (via id) calculations that belong to the same overall job?
When the first message of a set of calculations comes in, the worker that picks it up queries the database for all the data necessary to do all the calculations and creates a temporary data store. What this data store looks like would depend on a lot of factors (size, structure, etc.), but it could be a blob/document, a set of data in a relational schema (isolated by correlationId), an entry in an enterprise cache, etc.
You’d need to be careful of the case when Worker 1 and Worker 2 are both working on the same set of calculations, since only one of them should create the data store, and both would need to wait until the store is fully populated before proceeding.
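That create-once-then-wait coordination can be sketched like this (Python for illustration; `load_from_db` is a hypothetical caller-supplied hook that reads everything the job's calculations need, and the in-process lock stands in for whatever shared guard the distributed workers would actually use):

```python
import threading

_snapshots = {}            # job_id -> temporary data store
_ready = {}                # job_id -> Event set once the store is populated
_guard = threading.Lock()

def get_snapshot(job_id, load_from_db):
    """Return the job's data store, creating it exactly once."""
    with _guard:
        event = _ready.get(job_id)
        creator = event is None
        if creator:
            # First worker for this job: it will populate the store.
            event = _ready[job_id] = threading.Event()
    if creator:
        _snapshots[job_id] = load_from_db()
        event.set()
    else:
        event.wait()       # other workers block until the store is populated
    return _snapshots[job_id]
```

Since the workers run on different servers, the real guard would have to live somewhere shared (a database row, a distributed lock), but the control flow is the same.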
Thanks everybody for the help.
Since I believe this problem might be common in other scenarios, I would like to share the solution we chose.
Thinking more thoroughly about the problem, I understood it for what it really is.
- I needed some sort of session control for each job
- There was already an in-process cache that served as session control for each job
Now that the calculation has evolved to be distributed, I just needed to evolve my cache to be distributed as well.
In order to do that, we chose an in-memory key-value store, deployed as a separate server (in this case, Redis).
Now, every time I start a job, I create an ID for it and pass that ID along in its messages.
When a worker wants some information from the database, it will:
- Look for the data in Redis (keyed by the job ID)
- If the data is in Redis, use it
- If it is not, load it from SQL and save it in Redis (keyed by the job ID)
At the end of the job, I clear all hashes associated with the job ID.
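The worker-side logic is a plain cache-aside pattern scoped by job ID. A minimal sketch (Python for illustration; the `cache` dict stands in for Redis, and `load_from_sql` is a hypothetical stand-in for the worker's SQL query):

```python
cache = {}   # stands in for Redis; keys are (job_id, data_key) pairs

def get_data(job_id, key, load_from_sql):
    """Cache-aside read scoped to a single job."""
    cache_key = (job_id, key)
    if cache_key in cache:              # 1. look for the data under the job ID
        return cache[cache_key]         # 2. if it is cached, use it
    value = load_from_sql(key)          # 3. otherwise load it from SQL...
    cache[cache_key] = value            #    ...and save it under the job ID
    return value

def end_job(job_id):
    """Clear all entries associated with the job ID."""
    for k in [k for k in cache if k[0] == job_id]:
        del cache[k]
```

Because every read after the first comes from the job-scoped cache, later updates to the SQL database cannot leak into a job that has already started, which is exactly the S0/S1 guarantee the question asks for.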