PHF: Portable Hypertext Format

Save an entire webpage in a single HTML file like a PDF, using data: URLs

PHF_Get version 0.4.0 (beta)
  • Note: Valid http:// or https:// URLs only. This operation may take a minute or two for large or complicated pages, so please be patient. The script will time out after two (2) minutes.


What's Happening?

Have you ever downloaded a webpage, only to find it a shadow of its former self when viewed later? For a downloaded webpage to work properly, all of its associated content, such as images, stylesheets and scripts, must be downloaded as well, or you end up with an unstyled mess of text. What if you could download a complete webpage, with all associated content, in one file?

Many people know that in Opera, Mozilla/Firefox, Safari and Konqueror, images can be embedded in HTML pages and stylesheets using data: URLs. Yet this ability extends to virtually any type of file, from stylesheets to external JavaScript files.
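To make the idea concrete, here is a minimal Python sketch of how a file's bytes can be turned into a data: URL. It is an illustration only, not the tool's PHP source; the function name make_data_url is my own, and base64 encoding is an assumption about how the data is packed.

```python
import base64


def make_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URL (the scheme from RFC 2397)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# A tiny stylesheet, inlined so it can be used directly as an href value.
css = b"body { color: red; }"
print(make_data_url(css, "text/css"))
```

The resulting string can stand in wherever the original URL appeared, for example as the href of a stylesheet link or the src of an image.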

This tool will download a webpage you specify and grab all associated content, such as images, stylesheets and scripts. External content will be encoded in data: URLs, which are substituted for the original URLs. The result is an HTML document which is completely self-contained; all styling, scripting and image data is included in a single file, like a PDF. The only problem is, it won't work in MSIE; you must load the completed file in a newer browser.
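The substitution step described above might look roughly like the following Python sketch. It is a simplification under stated assumptions, not the actual PHP script: inline_assets and the fetch callback are hypothetical names, the regular expressions only handle double-quoted attributes on img, script and stylesheet link tags, and a real tool would also need to handle url( ... ) references inside stylesheets.

```python
import base64
import re
from typing import Callable, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher: absolute URL -> (raw bytes, MIME type).
Fetcher = Callable[[str], Tuple[bytes, str]]

TAG = re.compile(r"<(img|script|link)\b[^>]*>", re.IGNORECASE)
URL_ATTR = {"img": "src", "script": "src", "link": "href"}


def inline_assets(html: str, base_url: str, fetch: Fetcher) -> str:
    """Replace asset URLs in img, script and stylesheet link tags with data: URLs."""

    def rewrite_tag(tag_match):
        tag = tag_match.group(0)
        name = tag_match.group(1).lower()
        if name == "link" and "stylesheet" not in tag.lower():
            return tag  # leave non-stylesheet links (icons, canonical, ...) alone
        attr = URL_ATTR[name]
        # The lookbehind avoids matching attributes such as data-src.
        attr_re = re.compile(rf'(?<![\w-]){attr}="([^"]+)"', re.IGNORECASE)

        def rewrite_attr(attr_match):
            url = attr_match.group(1)
            if url.startswith("data:"):
                return attr_match.group(0)  # already inlined
            data, mime = fetch(urljoin(base_url, url))
            encoded = base64.b64encode(data).decode("ascii")
            return f'{attr}="data:{mime};base64,{encoded}"'

        return attr_re.sub(rewrite_attr, tag, count=1)

    return TAG.sub(rewrite_tag, html)


# Demo with a fake fetcher, so nothing is actually downloaded.
def fake_fetch(url):
    return b"hi", "image/png"


page = '<p><img src="logo.png" alt="logo"></p>'
print(inline_assets(page, "http://example.com/dir/page.html", fake_fetch))
```

Passing the fetcher in as a parameter keeps the rewriting logic separate from the downloading, so it can be tried out without touching the network.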

I call this resulting document a PHF because you can save it, move it and email it, just like a PDF. There are no folders or linked files necessary; the HTML page you download should appear and (if the javascript isn't too complex) behave completely like the original. Try it for yourself using the form above.

Where This Method Falls Short

  1. The principal advantage other "all-in-one" formats (.mhtml etc.) have over PHF shows up when a page includes the same large asset in several places. For example, the same large font file may be referenced from multiple .css files, and your browser will know to use its cached copy instead of requesting the file from the server again. Likewise, in the .mhtml format, the data for each asset is stored only once and then referenced by name from the other parts, much as a multipart email refers to its attachments. With PHF, however, the same lo-o-o-o-o-ong data: URL must be repeated for every instance of the asset, wherever it is encountered.

    In addition, a browser exporting an .mhtml file has the benefit of its own internal page logic, which gives it an exhaustive list of all assets the current page is using. Assets the current page is not using can be ignored. In contrast, the PHF tool will download and save all referenced assets, regardless of whether they are used or not. These could be, for example, url( ... ) linked images in your .css file, in style rules that don't apply to the current page.

    The above can lead to seemingly small-sized pages resulting in absolutely gigantic PHF files!

  2. Another advantage of browser-based .mhtml export formats is that your browser also has access to all of your cookies for the domain you are downloading the file from. This means it will account for your login state and other interactive modifications the site may make on your specific behalf. The PHF tool can only get pages that aren't behind a login or other cookie-based access controls.
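To put rough numbers on the duplication problem from point 1, here is a small Python sketch. It assumes assets are base64-encoded (which makes every copy about a third larger than the raw file) and uses a made-up 300 kB "font" of zero bytes; the function name inlined_size is hypothetical, and the short data: prefix is ignored.

```python
import base64


def inlined_size(asset: bytes, references: int) -> int:
    """Bytes needed when every reference carries its own base64 copy."""
    return len(base64.b64encode(asset)) * references


font = bytes(300_000)  # stand-in for a 300 kB font file

print(len(font))              # 300000: raw size, stored once
print(inlined_size(font, 1))  # 400000: one base64 copy (4/3 of the raw size)
print(inlined_size(font, 3))  # 1200000: three stylesheets each embedding it
```

A format that stores the asset once pays for it once, while the inlined approach pays the encoded size again for every reference.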

Version 0.4.0 (beta)

Get the PHP source for this tool! This script is licensed under the BSD licence.

Important to note

  1. The script includes external javascript files as-is; it will not parse them for URLs.
  2. I am developing this script right here, using the very file referenced by the form, so this service may be unavailable or broken from time to time.
  3. The internal script time limit has been set to two (2) minutes. If the URL you requested (and its associated assets) takes longer than this to download and parse, the script will time out and generate no output.
  4. This copy of the script is configured so that any single retrieved file can be at most 512 kB (the total returned PHF may be larger than this). Any individual file larger than this will be ignored.
  5. All PHP errors and weird things like unusual mime-types encountered while running this script are logged for me to review. So feel free to use it on all kinds of URLs in order to break it!
  6. You can also helpfully report bugs using my contact form.
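The per-file size limit from point 4 can be enforced while a file is still downloading, rather than after the fact. The Python sketch below shows one way to do that; it is not how the PHP script necessarily does it. The name read_capped is hypothetical, and treating 512 kB as 512 × 1024 bytes is my assumption.

```python
import io
from typing import BinaryIO, Optional

# Assumption: "512 kB" is read here as 512 * 1024 bytes.
MAX_FILE_BYTES = 512 * 1024


def read_capped(stream: BinaryIO, limit: int = MAX_FILE_BYTES) -> Optional[bytes]:
    """Read a stream in chunks; return None as soon as it exceeds `limit` bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > limit:
            return None  # oversized file: ignore it, as point 4 describes
        chunks.append(chunk)
    return b"".join(chunks)


# In-memory streams stand in for downloaded files.
small = read_capped(io.BytesIO(b"x" * 1000))
large = read_capped(io.BytesIO(b"x" * (MAX_FILE_BYTES + 1)))
print(small is not None, large is None)
```

Stopping at the limit means an oversized asset costs at most a little over 512 kB of bandwidth before it is dropped.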

Host it yourself!

This tool essentially downloads all the files associated with the supplied webpage URL. It costs me bandwidth to grab it, CPU time to process it, and more bandwidth to serve it. If you have your own webspace with PHP + cURL, you can do me a favour and host your own copy using the source code!