Monday, May 26, 2008
LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

 

The Html Agility Pack brings flexible, versatile HTML parsing to the .NET platform. It's a great toolkit for scraping websites that don't offer formal, documented APIs.

Under the hood, the agility pack has its own parser that's deftly designed to parse malformed HTML in all its nasty real-life forms.

For querying and traversing a parsed document, the pack offers an in-memory object graph, XPath and XSLT. Today I'd like to add a LINQ to XML converter for extra HTML agility in this new .NET 3.5 post-MIX '08 world.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;

namespace HtmlAgilityPack
{
    public static class HtmlDocumentExtensions
    {
        public static XDocument ToXDocument(this HtmlDocument document)
        {
            using (StringWriter sw = new StringWriter())
            {
                document.OptionOutputAsXml = true;
                document.Save(sw);
                return XDocument.Parse(sw.GetStringBuilder().ToString());
            }
        }
    }
}


For a demonstration of how to use this, I'm going to partially solve a scraping problem I came across a few days ago. Basically, I needed to find all the images on a webpage, order them by descending size, and list them for the user to choose from. Their choice would be used to generate a thumbnail. Social content sites like Digg and Mixx do something similar, but seem to download the 10 biggest images, thumbnail them anyway and let the user choose. In the following example, I'm going to simply find all the images in the right order and let the reader do the thumbnailing :)

The first step is to download a page and convert its HtmlDocument object graph into an XDocument.



HtmlWeb hw = new HtmlWeb();
string url = @"http://www.nytimes.com/2008/05/18/us/politics/18memoirs.html?hp";
Uri uri = new Uri(url);

HtmlDocument doc = hw.Load(url);
var xdoc = doc.ToXDocument();


 

Now we can find all the image tags with a simple LINQ expression.



var imgs =
        from el in xdoc.Descendants()
        where el.Name.LocalName == "img"
        select 
        new { 
            Src = el.Attribute(XName.Get("src")).Value,
        };
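The query above reads .Value directly, so a single img tag without a src attribute would throw. Here is a minimal sketch of a more forgiving version in method syntax. The class and method names (ImageSources.Find) are my own for illustration and are not part of the agility pack. It relies on the explicit XAttribute-to-string conversion in LINQ to XML, which yields null for a missing attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ImageSources
{
    // Hypothetical helper: returns the src of every <img> element,
    // skipping tags that have no src attribute (or an empty one).
    public static List<string> Find(XDocument xdoc)
    {
        return xdoc.Descendants()
            .Where(el => el.Name.LocalName == "img")
            // The explicit string conversion gives null for a missing attribute
            // instead of throwing, unlike reading .Value on a null reference.
            .Select(el => (string)el.Attribute("src"))
            .Where(src => !String.IsNullOrEmpty(src))
            .ToList();
    }
}
```

Called as ImageSources.Find(xdoc), this should return the same sources as the query above for pages where every image has a src.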


The last step is to order them by descending size, but there are a few things to consider when reading the "width" and "height" from an image XElement.

  1. "width" and "height" may not exist. These attributes aren't mandatory.
  2. "width" and "height" may be expressed in percentages.
  3. XElement.Attribute(XName.Get("width")).Value will throw a NullReferenceException if the attribute doesn't exist.

Every raster image has a width and height, even if the image tag doesn't. If the size cannot be seen on the tag, we could download and open the image to find out its true height/width. A fast way to do this would be to read the image header, but this post isn't about the Image class or reading images - just LINQ to HTML. For now, I'm going to create a default width and height of 0.



XName widthAttribute = XName.Get("width");
XName heightAttribute = XName.Get("height");
XName srcAttribute = XName.Get("src");
XAttribute defaultHeight = new XAttribute(heightAttribute, "0");
XAttribute defaultWidth = new XAttribute(widthAttribute, "0");


Now we can read the width and height in the LINQ expression allowing for defaults, and adding a hack for sizes expressed in percentages.



var imgs =
        from el in xdoc.Descendants()
        // filter first, so width and height are only parsed for image tags
        where el.Name.LocalName == "img"
        let width = Int32.Parse((el.Attribute(widthAttribute) ?? defaultWidth).Value.TrimEnd('%'))
        let height = Int32.Parse((el.Attribute(heightAttribute) ?? defaultHeight).Value.TrimEnd('%'))
        let metric = Math.Sqrt(width * height)
        orderby metric descending
        select
        new {
            Src = el.Attribute(srcAttribute).Value,
            Width = width,
            Height = height
        };

foreach (var image in imgs)
{
    Console.WriteLine("{0} Size: {1}x{2}", 
        image.Src, 
        image.Width, 
        image.Height);
}


Hey presto! The image URLs get written to the console in the right order. The output for the New York Times article looks like this:

Found 26 images

http://graphics7.nytimes.com/images/2008/05/18/us/18mem.span.jpg Size: 600x350
http://graphics7.nytimes.com/adx/images/ADS/14/60/ad.146024/sf.gif Size: 300x250
http://graphics7.nytimes.com/images/2008/05/17/us/18memoir_190.jpg Size: 190x285
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_033108.jpg Size: 334x105
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_052608.jpg Size: 334x105
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_glanville.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/weekinreview/25moth_wong.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/23/fashion/25moth-brain1.ready.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_classic.gif Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/nyregion/thecity/25moth_crime.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/travel/25moth-trail.jpg Size: 151x151
http://graphics7.nytimes.com/images/blogs/thecaucus/thecaucus75.jpg Size: 75x75
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://up.nytimes.com/?d=0//&t=2&s=0&ui=&r=&u= Size: 3x1
/adx/bin/clientside/7e2e2078Q2FZkQ3F3VQ7BpQ27!l6c6L7cQ7BQ7Bpp82Q27Vp Size: 3x1
http://wt.o.nytimes.com/dcsym57yw10000s1s8g0boozt_9t1x/njs.gif?js=No&WT.tv=1.0.7 Size: 1x1
http://graphics7.nytimes.com/ads/ameriprise/AMP_logo_88x31.gif Size: 0x0
http://graphics7.nytimes.com/images/misc/nytlogo153x23.gif Size: 0x0
http://politics.nytimes.com/images/section/politics/elections/2008.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/69/ad.166980/728x90_T_logo.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/95/ad.169529/youngheart_88x31_8.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/01/ad.160192/TMAG_86x60.gif Size: 0x0
http://graphics7.nytimes.com/ads/blank.gif Size: 0x0
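One of the sources in the output above is a relative path, which would need resolving against the page's URL before the image could be downloaded. A minimal sketch of that step using System.Uri follows. The helper name (ImageUrls.ToAbsolute) and the example path in the usage note are hypothetical.

```csharp
using System;

public static class ImageUrls
{
    // Hypothetical helper: resolves an img src against the URL of the page
    // it was found on. Absolute sources come back unchanged, relative ones
    // are combined with the page URL. Returns null if the src cannot be
    // turned into a valid URI.
    public static string ToAbsolute(Uri pageUri, string src)
    {
        Uri resolved;
        if (Uri.TryCreate(pageUri, src, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        return null;
    }
}
```

With the article's URL as pageUri, a made-up source such as "/images/example.gif" resolves to "http://www.nytimes.com/images/example.gif", while a source that is already absolute is returned as-is.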

 

Some of my assumptions above may seem less than ideal, but remember that it's good practice to include the width and height in the image tag so the browser knows how much visual space to leave while the page loads. For any well-designed page this approach works nicely.
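For pages that are less well behaved, Int32.Parse will throw on a value like "100px" or "auto". Here is a rough sketch of a more defensive reader, assuming it's acceptable to treat anything that isn't a plain pixel count as size 0. That includes percentages, which is a different choice from the TrimEnd hack used earlier. The helper name (ImageDimensions.Read) is my own.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

public static class ImageDimensions
{
    // Hypothetical helper: reads a width/height attribute as whole pixels.
    // Returns 0 when the attribute is missing, is a percentage,
    // or is not a plain run of digits.
    public static int Read(XElement img, string attributeName)
    {
        // The explicit string conversion yields null for a missing attribute.
        string raw = (string)img.Attribute(attributeName);
        if (raw == null) return 0;

        raw = raw.Trim();

        // Assumption: a percentage says nothing about the pixel size.
        if (raw.EndsWith("%", StringComparison.Ordinal)) return 0;

        int pixels;
        // NumberStyles.None accepts digits only, so "100px" or "-5" fall back to 0.
        return Int32.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out pixels)
            ? pixels
            : 0;
    }
}
```

In the query, `let width = ImageDimensions.Read(el, "width")` could then stand in for the Int32.Parse expression.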

Another neat use for LINQ :)

UPDATE: you can download the code with my modifications and this sample from here.


posted @ Monday, May 26, 2008 4:03 PM | Feedback (3)
ActionDisposable

I borrowed this class from Wes Dyer's LINQ to ASCII Art post. It's a great way to return IDisposable from a method without creating a specialized class.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 

namespace System {

    public class ActionDisposable: IDisposable {
        Action action;
        public ActionDisposable(Action action) {
            this.action = action;
        }

        public void Dispose() {
            action();
        }
    }
}
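As a quick illustration of the pattern, here is a minimal sketch that temporarily changes the console's foreground color and restores it when the using block ends. ConsoleScope.Foreground is a hypothetical helper of my own, not code from Wes's post. The block repeats a copy of ActionDisposable (outside the System namespace) so that it compiles on its own.

```csharp
using System;

// Stand-alone copy of the class above, so this sketch compiles by itself.
public sealed class ActionDisposable : IDisposable
{
    private readonly Action action;

    public ActionDisposable(Action action)
    {
        this.action = action;
    }

    public void Dispose()
    {
        action();
    }
}

public static class ConsoleScope
{
    // Hypothetical helper: switches the foreground color and returns
    // a handle that puts the previous color back when disposed.
    public static IDisposable Foreground(ConsoleColor color)
    {
        ConsoleColor previous = Console.ForegroundColor;
        Console.ForegroundColor = color;
        return new ActionDisposable(() => Console.ForegroundColor = previous);
    }
}
```

Usage would look like `using (ConsoleScope.Foreground(ConsoleColor.Yellow)) { Console.WriteLine("warning"); }`, with the original color back in place after the closing brace.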
  

 

Wes uses ActionDisposable casually to reset Console properties. My example to demonstrate this class will be similar to Eilon Lipton's solution to locking a collection, except I'll use ActionDisposable instead of specialized ReadLockDisposable and WriteLockDisposable classes.



using System;
using System.Collections.ObjectModel;
using System.Threading;

public class SingletonSafeCollection<T> : Collection<T>
{
    private static readonly SingletonSafeCollection<T> instance = new SingletonSafeCollection<T>();

    public static SingletonSafeCollection<T> Instance
    {
        get { return instance; }
    }

    private ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

    /// <summary>Acquires a read lock for reading the collection</summary>
    /// <returns>Disposable object that releases the read lock</returns>
    public IDisposable GetReadLock()
    {
        rwLock.EnterReadLock();
        return new ActionDisposable(rwLock.ExitReadLock);
    }

    /// <summary>Acquires a write lock for writing the collection</summary>
    /// <returns>Disposable object that releases the write lock</returns>
    public IDisposable GetWriteLock()
    {
        rwLock.EnterWriteLock();
        return new ActionDisposable(rwLock.ExitWriteLock);
    }
}


 

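Now I can use SingletonSafeCollection<T> anywhere in my code with the guarantee that while one thread holds a read lock, no one is writing to the collection, and while one thread holds a write lock, no one else is reading or writing it. The guarantee only holds if every caller takes the appropriate lock first; the collection itself doesn't enforce it.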

 




static void Main(string[] args)
{
    SingletonSafeCollection<int> numbers = SingletonSafeCollection<int>.Instance;

    using (numbers.GetWriteLock())
    {
        // safe to write with a guarantee no one is reading or modifying the collection (except this thread)
        numbers.Add(1212);
    }

    using (numbers.GetReadLock())
    {
        // safe to read the list with a guarantee no one is modifying the collection
        numbers.ToList().ForEach(i => Console.WriteLine(i));
    }

    Console.ReadLine();
}
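One caveat worth keeping in mind: ActionDisposable runs its action every time Dispose is called, and exiting a ReaderWriterLockSlim lock that isn't held throws. The using statement only disposes once, so the code above is fine, but if the handle is passed around, a variant that runs its action at most once may be safer. The following is a sketch under that assumption. OnceDisposable is my own name and is not from Wes's or Eilon's posts.

```csharp
using System;
using System.Threading;

// Hypothetical variant of ActionDisposable whose action runs at most once,
// no matter how many times (or from how many threads) Dispose is called.
public sealed class OnceDisposable : IDisposable
{
    private Action action;

    public OnceDisposable(Action action)
    {
        if (action == null) throw new ArgumentNullException("action");
        this.action = action;
    }

    public void Dispose()
    {
        // Atomically take the action and leave null behind,
        // so only the first caller gets a non-null delegate to run.
        Action toRun = Interlocked.Exchange(ref action, null);
        if (toRun != null)
        {
            toRun();
        }
    }
}
```

GetReadLock and GetWriteLock could return `new OnceDisposable(rwLock.ExitReadLock)` and `new OnceDisposable(rwLock.ExitWriteLock)` instead, with no change for callers.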




posted @ Monday, May 26, 2008 2:32 PM | Feedback (0)