Queue processing best practice

  1. Let's say I have a db of 100k records (users).
  2. Every hour I need to pull records from that db that match specific criteria. At some hours this will be 5000 records at a time.
  3. Then I need to process those 5000 records in under an hour's time so that it doesn't overlap with the next batch of data.
  4. Processing each entry takes about a second (a couple of external API calls). That's 5000 seconds, or roughly 83 minutes, which is already more than an hour.
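
A quick back-of-envelope check (a Python sketch using the numbers above) shows that even modest parallelism would get under the hour:

```python
import math

# Numbers from the scenario above: ~5000 records, ~1 s each, 1-hour budget.
records = 5000
seconds_per_record = 1.0
budget_seconds = 3600

# Minimum number of parallel workers needed to finish within the budget.
min_workers = math.ceil(records * seconds_per_record / budget_seconds)
print(min_workers)  # 2

# Wall-clock time with that many workers: 5000 / 2 = 2500 s, ~42 minutes.
print(records * seconds_per_record / min_workers / 60)
```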

What would be the most efficient way to speed up the process?

What I am doing right now is faking multi-threading:
1. I am running a task every 30 seconds that gets 3-4 pages of the records, 60 entries per page.
2. I've created an endpoint that calls the processing function.
3. I hit that endpoint from a lambda function with a timeout of a few seconds, so the task keeps running and kicks off other "threads" (a fire-and-forget pattern; see the sketch below).
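
For anyone outside Xano, the pattern being faked here is roughly a fire-and-forget HTTP kick. A minimal Python sketch (PROCESS_URL and the page parameter are made-up placeholders, not the actual API):

```python
import requests

# Hypothetical internal endpoint that processes one page of records.
PROCESS_URL = "https://example.com/api/process-batch"

def kick(page: int) -> None:
    """Fire the processing endpoint and abandon the request after a short
    timeout, so the caller isn't blocked while the work happens server-side."""
    try:
        requests.post(PROCESS_URL, json={"page": page}, timeout=2)
    except requests.exceptions.Timeout:
        pass  # expected: we only wanted to start the work, not wait for it

for page in range(4):  # the scheduled task grabs 3-4 pages per run
    kick(page)
```

Whether the server keeps working after the client disconnects depends on the server, which is part of why this is a hack rather than real concurrency.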

This works, but the result is degraded performance across all endpoints and functions, even when db/API load is below 50%.

Any suggestions? Thanks.

Comments

  • Ray Deck (Trusted Xano Expert ✭✭✭)

    What kind of instance are you running? The smaller instances can be surprisingly weak in this kind of situation.

    Another thought: are these worker endpoints all executing writes? You might improve performance by finding where they write and asking whether each write needs to happen at all (e.g. updating a price: if the price hasn't changed, there's no need to update).

    Another thought: for situations where we need to make a lot of updates fast for basically key-value lookups, the Redis system is an interesting alternative on Scale plans. It can do the job faster both on retrieval and on writes in situations like this.

    Depending on how you handle the requests, you might have an easier time setting up a pooling pattern for launching your requests: have the task load up your 5000, then run 50 at a time in somewhat parallel fashion (say, 2 processes firing off) and wait for them to come back. You can also do this with Redis: launch the first process with a short timeout, launch the second process, and then set up a while loop that sleeps until it sees the Redis values change as each process completes. This is more complicated, but could provide a lot of efficiency.
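
    A minimal sketch of that pooling pattern in Python (process_record and the batch size are placeholders; in Xano this would be expressed with tasks and internal endpoints rather than threads):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def process_record(record_id: int) -> None:
        """Placeholder for the ~1 s of work per record (external API calls)."""
        ...

    def run_pool(record_ids: list[int], batch_size: int = 50) -> None:
        # Load all 5000 up front, then run 50 at a time and wait for each
        # batch to come back before launching the next one.
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            for start in range(0, len(record_ids), batch_size):
                batch = record_ids[start:start + batch_size]
                # map() blocks until the whole batch has completed
                list(pool.map(process_record, batch))
    ```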

    Managing these kinds of data processing issues is often part of the hardest 5% we work on at State Change daily office hours and our Loom-based forums.

  • Valentin R (Member)

    I am on the Scale plan, @Ray Deck.
    There are multiple writes: to the same db I'm pulling records from, plus to a couple of other ones too.

    The "come back" feature sounds interesting and might limit the potential excessive overlap of tasks.
    "50 at a time in parallel" would have to be achieved the same way (hitting internal endpoints) I assume?
    Caching part.. I thought about it but not sure I can apply it. I am working with the data that can change any second and the functions I am running need to act based on the latest states/data.

  • Ray Deck (Trusted Xano Expert ✭✭✭)

    The fact that you have so many writes to your system is probably the fundamental issue. 5000 records per hour, each getting multiple writes (call it 3), is 15,000 journaled database changes per hour. That's over 4 per second. We should expect a relational database to feel that weight a bit and to look for breathing room to clean up indices, etc.

    Redis is an in-memory key-value store whose reads and writes are orders of magnitude faster than the PostgreSQL database. Depending on your use case, it could solve a lot of your problems. It's not about caching the endpoint, it's about using this alternative db for your highly mutable datasets. The limitations are scale (it can't go to gigabytes), transaction history (it doesn't journal), and query complexity (it's a straightforward key-value store). If you need to make this many changes this quickly for automated needs, I'd consider it as an approach to unlock a lot of performance.
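
    For illustration, a minimal redis-py sketch of that "alternative db for highly mutable data" idea (the key names are made up; assumes a reachable Redis instance):

    ```python
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Writes are plain key-value sets: no journaling, no index maintenance.
    r.set("user:42:price", "19.99")

    # Reads come straight from memory.
    price = r.get("user:42:price")

    # Batching many updates in a pipeline cuts round trips even further.
    with r.pipeline() as pipe:
        for user_id, new_price in [(1, "9.99"), (2, "14.99"), (3, "19.99")]:
            pipe.set(f"user:{user_id}:price", new_price)
        pipe.execute()
    ```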

  • Guillaume Maison (Member ✭✭), edited June 2023

    I would:
    1. create a temp table
    2. insert the result of the user select into it
    3. use 3 tasks, each doing its own select: id modulo 3 = 0 for task 1, id modulo 3 = 1 for task 2, and id modulo 3 = 2 for task 3 … then run the corresponding algorithm (see the sketch below) …
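
    A sketch of that modulo split (Python, over a hypothetical temp table; table and column names are placeholders):

    ```python
    NUM_TASKS = 3

    def records_for_task(task_index: int, all_ids: list[int]) -> list[int]:
        """Each task claims a disjoint slice of the temp table by id modulo,
        so no two tasks ever touch the same record."""
        return [i for i in all_ids if i % NUM_TASKS == task_index]

    # The equivalent per-task SQL against the temp table would look like:
    #   SELECT * FROM temp_users WHERE id % 3 = 0;  -- task 1
    #   SELECT * FROM temp_users WHERE id % 3 = 1;  -- task 2
    #   SELECT * FROM temp_users WHERE id % 3 = 2;  -- task 3
    ```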

    But still, if you have that many db writes, it'll be your bottleneck …
    I/Os are always the bottleneck …

  • Valentin R (Member), edited June 2023

    I wonder if the query stream type would help with this. As far as I remember, it processes in the background with lower priority.

    Also, I don't understand whether running one function that processes, say, 100 entries at a time is better than running 100 tasks that each process one entry.

  • Ray Deck (Trusted Xano Expert ✭✭✭)

    The function that processes 100 entries does them in series, not in parallel. You'll find that performance as we scale up is nonlinear: at first it's quite fast, and then as you add more and more actions in a row, things slow down remarkably. Giving the system just a little room to breathe lets maintenance actions happen that allow future steps to be fast again. This is a tricky dynamic, worthy of being in the hard 5% of no-code!
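
    To make the series-versus-parallel point concrete, a toy Python sketch (the 1 s sleep stands in for the per-record API calls; timings are illustrative):

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    def process(record_id: int) -> None:
        time.sleep(1)  # stand-in for ~1 s of external API calls

    ids = list(range(100))

    # In series: ~100 s of wall-clock time, one record after another.
    for i in ids:
        process(i)

    # With 10 workers: ~10 s of wall-clock time for the same 100 records.
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(process, ids))
    ```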

  • Valentin R (Member)

    I tried a different approach today: I created multiple tasks and a "tasks_tracker" table that keeps track of running tasks and the id ranges each one owns. When a task runs, it checks how many tasks are already running and either engages or not, based on the total task limit. If it engages, it checks which ranges of records to ignore when querying the next batch of data… It worked well until the tasks ran too close to each other. When they start 0-2 seconds apart (5 tasks running in total), they query the same batches of data and process the same entries in parallel, which is obviously bad for my use case.
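
    One way to close that race (a hedged sketch, not Xano-specific): have each task atomically claim a batch before touching it, for example with a Redis SET NX lock. The batch ids, key names, and process_batch are made up for illustration:

    ```python
    import redis

    r = redis.Redis()

    def try_claim(batch_id: int, ttl_seconds: int = 3600) -> bool:
        """Atomically claim a batch: SET with nx=True succeeds for exactly
        one caller, and ex= auto-expires the claim so a crashed task
        doesn't hold it forever."""
        return bool(r.set(f"batch:{batch_id}:claimed", "1", nx=True, ex=ttl_seconds))

    def process_batch(batch_id: int) -> None:
        """Placeholder for the actual per-batch work."""
        ...

    def run_task(batch_ids: list[int]) -> None:
        for batch_id in batch_ids:
            if not try_claim(batch_id):
                continue  # another task got there first; skip, don't reprocess
            process_batch(batch_id)
    ```

    The same claim could live in a regular db table with a unique constraint on the batch id; the important part is that the claim is atomic, unlike checking a tracker table and then acting on what was read.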

    I wanted to avoid calling the internal API endpoint with this approach, but I guess I'll have to go back to just one task plus internal API calls.