data: URLs
Have you ever downloaded a webpage only to find out that it becomes a shadow of its former self when viewed later? In order for downloaded webpages to work properly, all of the associated content, such as images, stylesheets and javascripts, must be downloaded as well or you end up with an unstyled mess of text. What if you could download a complete single webpage, with all associated content, all in one file?
Many people know that in Opera, Mozilla/Firefox, Safari and Konqueror browsers, images can be embedded into your HTML pages and stylesheets using data: URLs.
Yet this ability can be extended to include virtually any type of file, from stylesheet to external javascript.
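To make the idea concrete, here is a minimal sketch of how any file's bytes can be turned into a base64 data: URL. It is written in Python purely for illustration and is not taken from this tool's PHP source; the function name `to_data_url` and the sample stylesheet are assumptions.

```python
import base64


def to_data_url(content: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URL (hypothetical helper)."""
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# A tiny stylesheet stands in for any external file (image, script, font...).
css = b"body { color: red; }"
print(to_data_url(css, "text/css"))
```

The same function works for an image or a script; only the MIME type passed in changes.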
This tool will download a webpage you specify and grab all associated content, such as images, stylesheets and javascripts.
External content will be encoded in data: URLs and substituted for the original URLs.
The result is an HTML document which is completely self-contained; all styling, scripting and image data is included in a single file, like a PDF.
The only problem is, it won't work in MSIE; you must load the completed file in a newer browser.
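The substitution step described above could look roughly like the following sketch. It is Python for illustration only, not the tool's PHP code, and it makes several assumptions: the names `inline_sources` and `fetch` are made up, the downloader is passed in by the caller, and only `src` attributes on `img` and `script` tags are handled, using a simple regular expression rather than a real HTML parser.

```python
import base64
import re
from urllib.parse import urljoin

# Matches src="..." or src='...' inside <img> and <script> tags.
SRC_ATTR = re.compile(
    r'(<(?:img|script)\b[^>]*?\bsrc=)(["\'])(.*?)\2',
    re.IGNORECASE,
)


def inline_sources(html, base_url, fetch):
    """Replace each src URL with a data: URL.

    `fetch` is a hypothetical caller-supplied function that takes an
    absolute URL and returns a (content_bytes, mime_type) tuple.
    """
    def replace(match):
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        if url.startswith("data:"):
            return match.group(0)  # already embedded, leave untouched
        content, mime = fetch(urljoin(base_url, url))
        encoded = base64.b64encode(content).decode("ascii")
        return f"{prefix}{quote}data:{mime};base64,{encoded}{quote}"

    return SRC_ATTR.sub(replace, html)


# Usage with a stand-in downloader so the example needs no network access.
def fake_fetch(url):
    return b"GIF89a", "image/gif"


page = '<img src="logo.gif" alt="logo">'
print(inline_sources(page, "http://example.com/", fake_fetch))
```

Passing the downloader in as a parameter keeps the rewriting logic separate from the network code, so a real fetcher can be swapped in without touching the substitution.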
I call this resulting document a PHF because you can save it, move it and email it, just like a PDF. There are no folders or linked files necessary; the HTML page you download should appear and (if the javascript isn't too complex) behave completely like the original. Try it for yourself using the form above.
The principal advantage other "all-in-one" formats (.mhtml etc.) have over PHF shows up when a page happens to include the same large asset in several locations.
For example, you could include the same large font file in multiple .css files, and your browser will know to use its cached version instead of requesting the same file again from the server.
Likewise, in the .mhtml format, the data for each asset is stored only once, as its own MIME part, and the other parts refer to it by its location or content ID, much like attachments in an email message.
With PHF, however, the same lo-o-o-o-o-ong data: URL must be included for every instance of the asset used, wherever it is encountered.
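A rough way to see the cost is to count characters. The sketch below (Python, for illustration; the name `embedded_size` and the 30 kB stand-in font are assumptions) relies on the fact that base64 turns every 3 bytes into 4 characters, and that the whole data: URL is repeated for every use.

```python
import base64


def embedded_size(asset: bytes, mime_type: str, uses: int) -> int:
    """Characters added to the document when `asset` is embedded `uses` times."""
    url = "data:" + mime_type + ";base64," + base64.b64encode(asset).decode("ascii")
    return len(url) * uses


font = bytes(30_000)  # stand-in for a 30 kB font file
print(embedded_size(font, "font/woff", 1))  # one reference
print(embedded_size(font, "font/woff", 3))  # same font referenced three times
```

A cached or single-copy format pays for the font once; here the cost grows linearly with the number of references.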
In addition, a browser exporting an .mhtml file has the benefit of its own internal page logic, which gives it an exhaustive list of all assets the current page is using.
This tool has no such list, so it also grabs assets the current page is not using, such as url( ... ) linked images in your .css file in style rules that don't apply to the current page.
The above can lead to seemingly small-sized pages resulting in absolutely gigantic PHF files!
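To illustrate why this happens, here is a small sketch of collecting every url( ... ) reference from a stylesheet. It is Python for illustration, not the tool's actual code; the name `css_urls` and the sample rules are assumptions. A text scan like this has no idea which selectors match anything on the page, so every reference it finds is a candidate for embedding.

```python
import re

# Matches url(foo.png), url("foo.png") and url('foo.png').
CSS_URL = re.compile(r"url\(\s*(['\"]?)(.*?)\1\s*\)", re.IGNORECASE)


def css_urls(css: str) -> list:
    """Return every URL referenced via url( ... ), in order of appearance."""
    return [match.group(2) for match in CSS_URL.finditer(css)]


stylesheet = """
body   { background: url("bg.png"); }
.promo { background: url(promo-banner.jpg); }  /* class may be unused on this page */
"""
print(css_urls(stylesheet))
```

Both images are reported even if no element on the page carries the `promo` class, which is how a small page can balloon once everything is embedded.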
Get the PHP source for this tool! This script is licenced under the BSD licence.
This tool essentially downloads all the files associated with the supplied webpage URL. It costs me bandwidth to grab it, CPU time to process it, and more bandwidth to serve it. If you have your own webspace with PHP + cURL, you can do me a favour and host your own copy using the source code!