Problem:

I have many different URLs in a database, from many sites.

I don't know how these sites work or how their URLs are structured.

So I need to take 500 URLs from each site, then compare them and group them by their common static parts.

The dynamic URL parts should be merged automatically by replacing them with {var} (e.g. /item?id=1 and /item?id=2 become /item?id={var}).

And then end up with ~10 URL patterns as the result.

Final goal: reduce the database size.

 

Solution:

Here is a rough proof of concept 🙂

Example that splits each URL at «?» (a code sketch follows these steps):

- Parse the query parameters.

- Calculate the frequency of each parameter, i.e. how many unique values it takes.

- Get the Nth percentile of those frequencies.

- Rebuild the URLs, replacing every parameter whose frequency is above the Nth percentile with {var}.
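
A minimal sketch of these steps in Python, assuming the frequency counts and the percentile threshold are computed per base URL; the function name group_urls and the nearest-rank percentile method are my own choices, not fixed by the POC:

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def group_urls(urls, percentile=50):
    """Merge URLs whose dynamic query parameters differ only in value.

    A parameter's "frequency" is the number of unique values it takes
    for a given base URL; parameters above the Nth percentile are
    considered dynamic and replaced with {var}.
    """
    # Collect every value seen for each (base URL, parameter name) pair.
    seen = defaultdict(set)
    parsed = []
    for url in urls:
        parts = urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        params = parse_qsl(parts.query, keep_blank_values=True)
        parsed.append((base, params))
        for name, value in params:
            seen[(base, name)].add(value)

    if not seen:
        return set(urls)

    # Frequencies = unique-value counts; threshold = Nth percentile
    # of those counts (simple nearest-rank method).
    freqs = sorted(len(v) for v in seen.values())
    rank = min(len(freqs) - 1, max(0, round(percentile / 100 * len(freqs)) - 1))
    threshold = freqs[rank]

    # Rebuild the URLs, masking dynamic parameter values with {var}.
    result = set()
    for base, params in parsed:
        if not params:
            result.add(base)
            continue
        query = "&".join(
            f"{name}={{var}}" if len(seen[(base, name)]) > threshold
            else f"{name}={value}"
            for name, value in params
        )
        result.add(f"{base}?{query}")
    return result
```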

For small data, like here in the sandbox, the 50th percentile is enough to group some URLs.

For «big real data», use the 90th-95th percentile.
For example: I used the 90th percentile on 5000 links -> the result was ~200 links.
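
For instance, on tiny hypothetical data (example.com URLs made up for illustration), the 50th percentile already collapses three links into one pattern:

```python
urls = [
    "https://example.com/item?id=1&page=news",
    "https://example.com/item?id=2&page=news",
    "https://example.com/item?id=3&page=news",
]
print(group_urls(urls, percentile=50))
# {'https://example.com/item?id={var}&page=news'}
```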