Web Dehydrator: Any web-page to JSON!

Web Dehydrator is a tool that transforms web-pages into JSON. It uses Zend Framework 2, Symfony DomCrawler and PhantomJS.

Posted on July 07, 2013

WARNING: Please note that this article was published a long time ago. The information contained might be outdated.

My new project is online. It's called Web Dehydrator, and it can be described as a tool that transforms any web-page into JSON. Web Dehydrator is built with a mix of Zend Framework 2, Symfony DomCrawler and PhantomJS. This is what each component does:

  • PhantomJS is used to retrieve the content of a web-page
  • a plugin manager runs a set of plugins built to extract data (via Symfony DomCrawler) from the content of the web-page
  • the extracted data is used to create the JSON result
  • Zend Framework 2 ties everything together via its service manager, event manager, caching and MVC layer.
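
To make the plugin idea more concrete, here is a minimal sketch of the same flow. It is not the Web Dehydrator code (which is unpublished and written in PHP): it uses Python's standard library in place of Symfony DomCrawler, it skips the PhantomJS step by taking the HTML as a string, and every name in it (HeadParser, title_plugin, dehydrate, the sample page) is made up for illustration.

```python
import json
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collects the <title> text and the <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use "property", classic meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical plugins: each one receives the parsed page and returns
# the piece of the final result it is responsible for.
def title_plugin(head):
    title = head.title.strip()
    return {"title": title} if title else {}


def description_plugin(head):
    if "description" in head.meta:
        return {"description": head.meta["description"]}
    return {}


def open_graph_plugin(head):
    mapping = {"og:title": "ogTitle", "og:image": "ogImage"}
    return {out: head.meta[src] for src, out in mapping.items() if src in head.meta}


PLUGINS = [title_plugin, description_plugin, open_graph_plugin]


def dehydrate(html, url):
    """Run every plugin on the page and merge their output into one dict."""
    head = HeadParser()
    head.feed(html)
    result = {"url": url}
    for plugin in PLUGINS:
        result.update(plugin(head))
    return result


# Made-up page, standing in for the content a headless browser would return.
SAMPLE_HTML = """<html><head>
<title>Example page</title>
<meta name="description" content="A made-up page used for illustration.">
<meta property="og:title" content="Example page (Open Graph)">
</head><body></body></html>"""

result = dehydrate(SAMPLE_HTML, "http://example.com/")
print(json.dumps(result, indent=4))
```

Keeping each plugin as a small function that returns a partial result means a new kind of data can be supported by adding one more plugin to the list, without touching the others.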

I haven't published the code behind the Web Dehydrator service, but I could share it if someone is interested in helping.

The following is a sample of the JSON produced from the data extracted from the http://www.dilbert.com/ website:

{
    "description":"The Official Dilbert Website featuring Scott Adams Dilbert strips, animation, mashups and more starring Dilbert, Dogbert, Wally, The Pointy Haired Boss, Alice, Asok, Dogbert\u0027s New Ruling Class and more.",
    "title":"The official Dilbert website with Scott Adams\u0027 color comic strips, animation, mashups and more!",
    "ogTitle":"The official Dilbert website with Scott Adams\u0027 color comic strips, animation, mashups and more!",
    "ogImage":"http://dilbert.com/img/v1/fb_image.jpg",
    "lastUpdated":1373206898,
    "url":"http://dilbert.com/",
    "content":{
        "title":"The Dilbert Strip for July 7, 2013",
        "imgBig":"http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/80000/5000/800/185868/185868.strip.zoom.gif",
        "imgSmall":"http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/80000/5000/800/185868/185868.strip.gif"
    },
    "pagination":{
        "prev":{
            "url":"http://dilbert.com/2013-07-06/"
        }
    }
}
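
A client could consume a result like the one above along these lines. This is only a sketch: the payload is a trimmed copy of the sample, and treating lastUpdated as a Unix timestamp in seconds is my assumption here, since the format is not documented.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample output shown above.
payload = """{
    "lastUpdated": 1373206898,
    "url": "http://dilbert.com/",
    "content": {
        "title": "The Dilbert Strip for July 7, 2013"
    },
    "pagination": {
        "prev": {
            "url": "http://dilbert.com/2013-07-06/"
        }
    }
}"""

data = json.loads(payload)

# Assumption: lastUpdated is a Unix timestamp expressed in seconds.
updated = datetime.fromtimestamp(data["lastUpdated"], tz=timezone.utc)

# Use .get() with defaults so a page without pagination does not raise.
prev_url = data.get("pagination", {}).get("prev", {}).get("url")
next_url = data.get("pagination", {}).get("next", {}).get("url")

print(data["content"]["title"])
print(updated.isoformat())
print(prev_url, next_url)
```

The sample has a "prev" link but no "next" one, so a client walking the pagination should treat both keys as optional.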
![Web Dehydrator: Any web-page to JSON!](/blog/web-dehydrator-any-web-page-to-json-9325375/img/web-dehydrator-website-screenshot-e1373208226731.png)