Remove duplicates based on two fields


I have a table that looks like this:
id / status_uuid / timestamp / json_data

I need to query and process in bulk (loop) only the newest (by timestamp) entries for each status_uuid. How do I get those or how do I dedupe to get rid of the rest?

Thanks

Answers

  • MattLaro
    MattLaro Member ✭✭
    edited March 2023

    Without any possibility to subquery, I admit this kind of task is a tad more difficult to do, but some ways of doing it are more efficient than others.

    Let's take each question individually.

    How do I get those?

    You could simulate a subquery: get the distinct list of status_uuid values first, then iterate over that list, fetching only the first row for each status_uuid sorted by timestamp desc.


    Let's say I have a table with columns id, value_1, value_2, and timestamp:

    My value_1 would be your status_uuid

    My value_2 would be your json_data.

    Grouping by value_1, the expected result would be:
    value_1 a → value_2 o
    value_1 b → value_2 d
    value_1 c → value_2 p
    value_1 d → value_2 f

    Query records and get the distinct values of value_1

    Then, in a For Each loop over each value_1, get only the first row of a query filtered on that value_1 and sorted by timestamp desc.


    Here's the result: one row per value_1, matching the expected output above.
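    If it helps to see that logic spelled out, here's a minimal sketch of the same two steps in Python over an in-memory list (the rows and field names are placeholders, not the platform's API):

    ```python
    rows = [
        {"id": 1, "value_1": "a", "timestamp": 10, "value_2": "x"},
        {"id": 2, "value_1": "a", "timestamp": 20, "value_2": "o"},
        {"id": 3, "value_1": "b", "timestamp": 5,  "value_2": "d"},
    ]

    # Step 1: distinct list of value_1.
    distinct_keys = {row["value_1"] for row in rows}

    # Step 2: for each key, take the first row of the matches
    # sorted by timestamp desc (i.e. the newest one).
    newest = []
    for key in sorted(distinct_keys):
        matching = [r for r in rows if r["value_1"] == key]
        matching.sort(key=lambda r: r["timestamp"], reverse=True)
        newest.append(matching[0])

    print(newest)  # newest row per value_1: ids 2 and 3
    ```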

    How do I dedupe to get rid of the rest?

    There would be many ways to do that. Taking the previous example: after you're done doing whatever you needed to do with that row of data, you can run a second Query All on the table to get all rows that have the same value_1 but not the same ID, then iterate over those and delete them one by one.
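    As a sketch, that cleanup pass would look something like this in Python (fetch_rows and delete_row are hypothetical stand-ins for the Query All and delete steps):

    ```python
    # Cleanup pass: after processing the newest row `keep`, fetch every
    # row that shares its value_1 but has a different id, and delete
    # them one by one. fetch_rows() and delete_row() are hypothetical
    # stand-ins for the platform's Query All and delete operations.
    def cleanup_older(keep, fetch_rows, delete_row):
        duplicates = [
            r
            for r in fetch_rows()
            if r["value_1"] == keep["value_1"] and r["id"] != keep["id"]
        ]
        for dup in duplicates:
            delete_row(dup["id"])
    ```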

  • Valentin R
    Valentin R Member

    Thanks for taking the time to answer, @MattLaro!

    I thought about this approach too, and since I am dealing with timestamps, I figured out a way to query value_1 and the corresponding max timestamp in one go (hard to tell if it's right, but I checked and the timestamps returned are indeed the “freshest”).
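    In case it's useful, this is roughly the one-pass idea in Python: track the max timestamp per key while scanning, so there's no second query per key (the rows and field names here are just placeholders for my actual table):

    ```python
    # One pass over the table: keep only the row with the max timestamp
    # per value_1, instead of running one extra query per distinct key.
    rows = [
        {"id": 1, "value_1": "a", "timestamp": 10, "json_data": "{}"},
        {"id": 2, "value_1": "a", "timestamp": 20, "json_data": "{}"},
        {"id": 3, "value_1": "b", "timestamp": 5,  "json_data": "{}"},
    ]

    newest_by_key = {}
    for row in rows:
        key = row["value_1"]
        best = newest_by_key.get(key)
        if best is None or row["timestamp"] > best["timestamp"]:
            newest_by_key[key] = row

    latest_rows = list(newest_by_key.values())
    print(latest_rows)  # ids 2 and 3, the freshest per value_1
    ```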


    My only question is about the efficiency of this: whether the cleanup actually saves time versus just processing those older records. I am not sure of the best way to measure it.
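    The only thing I can think of is timing both variants end-to-end and comparing, something like this (process_all and process_newest_only are stand-ins for my two loops):

    ```python
    import time

    # Time a whole run of either strategy and print the wall-clock cost.
    def timed(label, fn):
        start = time.perf_counter()
        fn()
        print(f"{label}: {time.perf_counter() - start:.3f}s")

    # timed("process every row", process_all)
    # timed("process newest per status_uuid", process_newest_only)
    ```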

    Thanks for the hints!

  • MattLaro
    MattLaro Member ✭✭

    I think at this point it's more a matter of “What's the impact of executing only the very last request vs. executing every request sequentially?”.

    If executing only the most recent request for the same status is fine in all situations, and your table is not referenced by any foreign keys, then I'm rather positive that making data rows go “poof” will rarely be less efficient than processing each and every line. And by keeping the row count on that table in check, it will make aggregations lighter in the future.