Today I am trying to solve a very interesting problem, I would like to call it the “data migration problem”.
Background
Let me illustrate to problem in short.
- We are working on data migration of a PMS system to a newer version.
- This system contains 2 major entities: USER & HOTEL
- Each USER could operate 1 or more HOTELS
- Each HOTELS could be operated by 1 or more USERS
- Migrations are operated by batch of hotels. Each batch we could migrate 1 or more hotels.
Requirement
- After each batch of migration, all hotels operated by the same user should at the same status (migrated / non-migrated).
- In the migration process, all hotels in the same batch could not be operate (downtime happens). So we are trying to minimize size of each batch with above requirement.
Examples
Each input line implies a relation of operation / management1
2
3
4
5
6
7
8
9input:
user_1 hotel_1
user_2 hotel_2
user_3 hotel_3
output:
[hotel_1]
[hotel_2]
[hotel_3]
Each user operate individual hotels, so each hotel could be migrated in separate batch.
1 | input: |
All hotels should be migrate in the same batch. Once hotel_1 is migrated, all other hotels user_2 operates (here, hotel_2) should be migrated as well because user_2 also operate hotel_1. Similarly, once hotel_2 migrated, hotel_3 should also be migrated (in the same batch) because user_3 operate these two hotels.
Solutions
At the first glance, The data structure are very similar to BIPARTITE GRAPH. The nodes consist of two part are the hotels and the users, edges are relations between hotels and users. There no edges between hotels, neither do users.
However, we are not trying to apply capable users to manage hotel as much as possible. So it’s not a classical bipartite graph MATCH PROBLEM. Instead, we are trying to judge whether nodes are connected via edges, which could be convert to another problem – “Finding all CONNECTED COMPONENT“.
Of course we have BFS/DFS to answer such questions. However, some information is not important to our current problem – the structure. We don’t need to know which manager make ‘Ritz Carlton’ and ‘Hilton’ connected, we just need to make sure both hotel are in the same batch. So we are able to make our algorithm more effective using DISJOINT SET.
Explanations
Let’s see the Elem
class. Actually it’s an implementation of the Disjoint Set.
- When initialize, Each
Elem
could be seen as an single element set. parent
is a pointer to its parentElem
.- To judge whether two element is in the same set, check whether their
root
(utmost parent / ancestor) are the same element - To
union
two set, we find theroot
of each set, set parent of one root to the other root
1 | def root(self): |
And the iteration algorithm is:
- Each user / hotels are initialized as a individual set.
- Iterate all hotels, for each hotel:
- iterate each user it belongs to:
- If the user has not been union into any hotel, then union with current hotel
- Else, the user has been union into an existing hotel set. In this situation, current hotel should also be merged into the same hotel.
- iterate each user it belongs to:
1 | for h in hotels: |
The Loop Invariant of this algorithm guarantee that:
- After each hotel looping, the root of any hotel Elem is always another hotel or itself. This means it’s correctly unitied with all those hotels known and supposed to be migrated together, if any, in the same batch.
- After each user looping for any hotel, the root of the user Elem is always a hotel.
That’s it.
Code
Here I feed this python script with a comma separated csv file, marking users and hotels.
1 | import sys |