Problem:
I have many different URLs in database.
From many sites.
I dont know how these sites work and url structure.
So I need to get 500 URLs from each site then compare and group it by common static part.
Which should be automatically merged via replacing with {var} any dynamic URL parts.
And then get ~10 urls as result.
Final result: reduce database size
Solution:
Here is some kind Proof of Concept 🙂
Example with splitting URL by «?»
— Parse parameters.
— Calculate frequency for unique parameter values.
— Get Nth percentile.
— Build URLs and replace parameters which frequency is more than Nth percentile
For small data like here in
For «big real data» 90-95 percentile.
For example: I use 90 percentile for 5000 links ->
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
<?php $stats = []; $pages = [ (object)['page' => 'http://example.com/?page=123'], (object)['page' => 'http://example.com/?page=123'], (object)['page' => 'http://example.com/?page=123'], (object)['page' => 'http://example.com/?page=321'], (object)['page' => 'http://example.com/?page=321'], (object)['page' => 'http://example.com/?page=321'], (object)['page' => 'http://example.com/?page=qwas'], (object)['page' => 'http://example.com/?page=safa15'], ]; // array of objects with page property = URL $params_counter = []; foreach ($pages as $page) { $components = explode('?', $page->page); if (!empty($components[1])) { parse_str($components[1], $params); foreach ($params as $key => $val) { if (!isset($params_counter[$key][$val])) { $params_counter[$key][$val] = 0; } $params_counter[$key][$val]++; } } } function procentile($percentile, $array) { sort($array); $index = ($percentile/100) * count($array); if (floor($index) == $index) { $result = ($array[$index-1] + $array[$index])/2; } else { $result = $array[floor($index)]; } return $result; } $some_data = []; foreach ($params_counter as $key => $val) { $some_data[$key] = count($val); } $procentile = procentile(90, $some_data); foreach ($pages as $page) { $components = explode('?', $page->page); if (!empty($components[1])) { parse_str($components[1], $params); arsort($params); foreach ($params as $key => $val) { if ($some_data[$key] > $procentile) { $params[$key] = '$var'; } } arsort($params); $pattern = http_build_query($params); $new_url = urldecode('?'.$pattern); if (!isset($stats[$new_url])) { $stats[$new_url] = 0; } $stats[$new_url]++; } } arsort($stats); var_dump($stats); |