'''Part II: Distinguish Internal Links'''
* If the href is not a full (absolute) URL, as in the example above, it is an internal link.
* If the href is a full URL and contains the site's domain name, it is also an internal link.
* If the href is a full URL but does not contain the site's domain name, it is an external link (see the example below, assuming the site's domain name is not facebook.com).
<a href = https://www.facebook.com/...></a>
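The rules above can be sketched in Python as follows. This is only a minimal illustration, not the project's actual code: the function name <code>is_internal_link</code> and the domain <code>example.com</code> are hypothetical. It also assumes a slightly stricter test than "contains the domain name": the host must equal the domain or be a subdomain of it, so that a host such as notexample.com is not mistaken for an internal one.

```python
from urllib.parse import urlparse


def is_internal_link(href, domain):
    """Hypothetical helper: True if href points to a page on `domain`."""
    parsed = urlparse(href)
    if not parsed.netloc:
        # No host part: a relative href such as "/about" or "page.html".
        # Schemes like "mailto:" also have no host, so they are excluded here.
        return parsed.scheme in ("", "http", "https")
    # Full URL: compare the host with the site's domain (assumed rule:
    # exact match or subdomain, e.g. "www.example.com" for "example.com").
    host = parsed.hostname or ""
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


print(is_internal_link("/about", "example.com"))
print(is_internal_link("https://www.example.com/contact", "example.com"))
print(is_internal_link("https://www.facebook.com/somepage", "example.com"))
```

The three calls cover one case per rule: a relative href, a full URL on the site's own domain, and a full URL on another domain.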
We examine all pages (nodes) at the same depth before moving down to the next depth.
Python file saved in
E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
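The level-by-level traversal described above can be sketched as below. This is an illustrative sketch only and does not reproduce the contents of Internal_url_BFS.py. The names <code>crawl_bfs</code>, <code>get_links</code> and the sample <code>site</code> dictionary are assumptions; a dictionary stands in for fetching a page and extracting its internal links.

```python
from collections import deque


def crawl_bfs(start, get_links, max_depth):
    """Visit pages level by level, down to max_depth (the start page is depth 0)."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()  # oldest entry first, so shallowest first
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links from pages at the maximum depth
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure used in place of real HTTP requests.
site = {
    "home": ["page1", "page2"],
    "page1": ["page1a", "page1b"],
    "page2": ["page2a"],
    "page1a": ["too_deep"],
}

print(crawl_bfs("home", lambda page: site.get(page, []), max_depth=2))
# prints ['home', 'page1', 'page2', 'page1a', 'page1b', 'page2a']
```

Both children of the homepage are visited before any grandchild, and "too_deep" is never reached because it lies at depth 3.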
'''''Depth-First Search (DFS) approach''''':
We visit a page (node) "A", and then all of A's children on the current path are visited before we visit A's neighbor node "B".
For example, assuming the maximum depth a user wants to explore is 2, we start with the homepage and examine its first child node, "page 1", then visit page 1's children until we reach the maximum depth. We then move on to the homepage's second child, "page 2", and visit page 2's children until we reach the maximum depth. Next we visit the homepage's next child, and so on.
Python file saved in
E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
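The depth-first order described above can be sketched as below. As with any illustration of this kind, it is not the contents of Internal_url_DFS.py. The names <code>crawl_dfs</code>, <code>get_links</code> and the sample <code>site</code> dictionary are assumptions, and a dictionary again stands in for fetching pages.

```python
def crawl_dfs(start, get_links, max_depth):
    """Follow each branch down to max_depth before moving to the next sibling."""
    visited = set()
    order = []

    def visit(page, depth):
        visited.add(page)
        order.append(page)
        if depth == max_depth:
            return  # maximum depth reached: back up to the parent page
        for link in get_links(page):
            if link not in visited:
                visit(link, depth + 1)

    visit(start, 0)
    return order


# Hypothetical site structure used in place of real HTTP requests.
site = {
    "home": ["page1", "page2"],
    "page1": ["page1a", "page1b"],
    "page2": ["page2a"],
    "page1a": ["too_deep"],
}

print(crawl_dfs("home", lambda page: site.get(page, []), max_depth=2))
# prints ['home', 'page1', 'page1a', 'page1b', 'page2', 'page2a']
```

With a maximum depth of 2, all of page 1's children are visited before page 2, matching the example in the text. The recursion depth is bounded by <code>max_depth</code>, so a small limit keeps it well within Python's default recursion limit.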