Extract All Urls In A String With Python3
I am trying to find a clean way to extract all URLs in a text string. After an extensive search, I have found many posts suggesting regular expressions for the task.
Solution 1:
If you want a regex, you can use this:
import re
string = "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org😀. Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore qualisque."
result = re.findall(r"\w+://\w+\.\w+\.\w+/?[\w\.\?=#]*", string)
print(result)
Output:
['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
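For readability, the same pattern can be rewritten with `re.VERBOSE` so each piece carries a comment. The sample string here is shortened; the pattern itself is unchanged:

```python
import re

string = "Visit https://www.lorem.com/ipsum.php?q=suas and http://news.bbc.co.uk today."

# Same pattern as above, written in verbose mode with per-part comments.
URL_PATTERN = re.compile(r"""
    \w+://           # scheme, e.g. http:// or https://
    \w+\.\w+\.\w+    # three dot-separated host labels, e.g. www.lorem.com
    /?               # optional slash after the host
    [\w.?=#]*        # optional path/query characters
""", re.VERBOSE)

print(URL_PATTERN.findall(string))
# ['https://www.lorem.com/ipsum.php?q=suas', 'http://news.bbc.co.uk']
```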
Solution 2:
import re
import string
text = """
Lorem ipsum dolor sit amet https://www.lore-m.com/ipsum.php?q=suas,
nusquam tincidunt ex per, ftp://link.com ius modus integre no, quando utroque placerat qui no.
Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex.
Elit ftp://link.work.in pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org😀.
Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore
qualisque.
"""
URL_REGEX = r"""((?:(?:https|ftp|http)?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|org|uk)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|uk|ac)\b/?(?!@)))"""
urls = re.findall(URL_REGEX, text)
print([''.join(x for x in url if x in string.printable) for url in urls])
Now if you want to keep only URLs with valid domains, you can filter the result as follows:
VALID_DOMAINS = ['lorem.org', 'bbc.co.uk', 'sample.com', 'link.net']
valid_urls = []
for url in urls:
    for val_domain in VALID_DOMAINS:
        if val_domain in url:
            valid_urls.append(url)
            break
print(valid_urls)
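A substring test like the one above will also accept look-alike hosts such as `bbc.co.uk.attacker.com`. If that matters, here is a sketch that compares the parsed hostname instead, using the standard `urllib.parse` and a hypothetical whitelist:

```python
from urllib.parse import urlparse

# Hypothetical whitelist; we compare against the parsed hostname rather than
# doing a substring check, so look-alike hosts are not accepted by accident.
VALID_DOMAINS = {'lorem.org', 'bbc.co.uk'}

urls = ['https://www.lorem.org', 'http://news.bbc.co.uk', 'http://evil.com']

valid_urls = []
for url in urls:
    host = urlparse(url).hostname or ''
    # Accept the domain itself or any subdomain of it.
    if any(host == d or host.endswith('.' + d) for d in VALID_DOMAINS):
        valid_urls.append(url)

print(valid_urls)
# ['https://www.lorem.org', 'http://news.bbc.co.uk']
```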
Solution 3:
output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
print(output)
your example: http://ideone.com/wys57x
You can also trim trailing characters from each element of the list if the last character is not a letter.
EDIT:
output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
newOutput = []
for link in output:
copy = link
    while copy and not copy[-1].isalpha():
copy = copy[:-1]
newOutput.append(copy)
print(newOutput)
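If only trailing punctuation needs trimming, `str.rstrip` can replace the `while` loop in one call. This is a sketch with hypothetical sample links; note that unlike the `isalpha()` loop it keeps trailing digits:

```python
import string

links = ['https://www.lorem.org.', 'http://news.bbc.co.uk,']

# rstrip removes any trailing run of the given characters in a single call.
cleaned = [link.rstrip(string.punctuation) for link in links]
print(cleaned)
# ['https://www.lorem.org', 'http://news.bbc.co.uk']
```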
Your example: http://ideone.com/gHRQ8w
Solution 4:
Using an existing library is probably the best solution.
But it was too much for my tiny script, so, inspired by @piotr-wasilewicz's answer, I came up with:
from string import ascii_letters
links = [x for x in line.split() if x.strip(''.join(set(x) - set(ascii_letters))).startswith(('http', 'https', 'www'))]
- for each word in the line,
- strip (from the beginning and the end) the characters of the word that are not ASCII letters,
- and keep the words starting with one of https, http, www.
A bit too dense for my taste and I have no clue how fast it is, but it should detect most "sane" urls in a string.
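As a quick sanity check, the comprehension can be run on a sample line (here with `''.join(...)` rather than `str(...)` on the set, so the strip characters are exactly the word's own non-letters; the sample line is hypothetical):

```python
from string import ascii_letters

line = "Lorem ipsum https://www.lorem.com/ipsum.php?q=suas and www.example.com here"

# Strip each word's own non-letter characters from both ends, then test the prefix.
links = [x for x in line.split()
         if x.strip(''.join(set(x) - set(ascii_letters))).startswith(('http', 'www'))]
print(links)
# ['https://www.lorem.com/ipsum.php?q=suas', 'www.example.com']
```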