
Pulling Html From A Webpage In Java

I want to pull the entire HTML source code file from a website in Java (or Python or PHP if it is easier in those languages to display). I wish only to view the HTML and scan throu

Solution 1:

In Java:

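import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://stackoverflow.com");
URLConnection connection = url.openConnection();  // URLConnection is abstract; obtain it from the URL
InputStream stream = connection.getInputStream();
// ... read stream like any file stream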

This code is fine for scripting and internal use, but I would argue against using it in production: it sets no timeouts and does not handle failed connections.

For production use I would recommend the HttpClient library. It supports authentication, redirect handling, threading, pooling, etc.
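As a middle ground, the timeout and failure handling mentioned above can be sketched with the standard library alone. This is only a minimal sketch, not a replacement for a full HTTP client: the class name PageFetcher, the helper names readAll and fetch, the 5-second timeout, and the assumption that the page is UTF-8 are all illustrative choices, not part of the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class PageFetcher {

    // Reads an entire stream into a String, assuming the bytes are UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetches a page, giving up after timeoutMillis and rejecting non-200 responses.
    static String fetch(String address, int timeoutMillis) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        connection.setConnectTimeout(timeoutMillis);  // limit on establishing the connection
        connection.setReadTimeout(timeoutMillis);     // limit on waiting for data
        try {
            int status = connection.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + address);
            }
            try (InputStream in = connection.getInputStream()) {
                return readAll(in);
            }
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) {
        try {
            String html = fetch("http://stackoverflow.com", 5000);
            System.out.println(html.length() + " characters fetched");
        } catch (IOException e) {
            // Unknown hosts, refused connections, and timeouts all arrive as IOException.
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Note that HttpURLConnection does not follow a redirect that switches from http to https, so such a response shows up here as a non-200 status. Redirect handling is one of the things a dedicated HTTP client library takes care of.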


Solution 2:

In Python:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
# Get a file-like object for the Python Web site's home page.
f = urlopen("http://www.python.org")
# Read from the object, storing the page's contents (as bytes) in 's'.
s = f.read()
f.close()

Please see Python and HTML Processing for more details.


Solution 3:

You could also consider running a standard utility such as wget or curl from the command line to fetch the site tree into a local directory tree, then doing your scanning (in Java, Python, whatever) against the local copy. That should be simpler than implementing all of the boring stuff like error handling, argument parsing, etc. yourself.
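Scanning the local copy could look roughly like the sketch below, which walks a directory tree and lists the HTML files containing a given string. The class name LocalCopyScanner, the method name, the "site-copy" directory, and the search term are hypothetical, and the files are assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class for illustration only.
final class LocalCopyScanner {

    private static boolean isHtml(Path p) {
        String name = p.getFileName().toString().toLowerCase();
        return name.endsWith(".html") || name.endsWith(".htm");
    }

    // Returns every .html/.htm file under 'root' whose text contains 'needle'.
    static List<Path> findPagesContaining(Path root, String needle) throws IOException {
        List<Path> pages;
        try (Stream<Path> walk = Files.walk(root)) {
            pages = walk.filter(Files::isRegularFile)
                        .filter(LocalCopyScanner::isHtml)
                        .sorted()
                        .collect(Collectors.toList());
        }
        List<Path> matches = new ArrayList<>();
        for (Path page : pages) {
            // Assumes UTF-8; malformed bytes are replaced rather than rejected.
            String html = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
            if (html.contains(needle)) {
                matches.add(page);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // "site-copy" stands in for wherever the downloaded tree was saved.
        for (Path hit : findPagesContaining(Paths.get("site-copy"), "<form")) {
            System.out.println(hit);
        }
    }
}
```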

If you want to fetch all pages in a site, note that curl does not harvest links from HTML pages, and wget only does so in its recursive mode (-r). An alternative is to use an open source web crawler.
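Harvesting links, the step a crawler repeats for every page it downloads, can be sketched as follows. This is a deliberately simplified illustration: it finds href attributes with a regular expression rather than a real HTML parser, so it will miss or misread unusual markup, and the class name LinkHarvester and method name harvest are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class LinkHarvester {

    // Matches href="..." or href='...'; a simplification, not a full HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the distinct links found in 'html', resolved against 'baseUrl',
    // in the order they first appear.
    static Set<String> harvest(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(2).trim();
            if (target.isEmpty() || target.startsWith("#")
                    || target.startsWith("javascript:") || target.startsWith("mailto:")) {
                continue;  // not a page a crawler could fetch
            }
            try {
                links.add(base.resolve(target).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not syntactically valid URIs.
            }
        }
        return links;
    }
}
```

A crawler would call harvest on each fetched page, keep a set of URLs it has already visited, and queue the ones it has not seen yet.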

