May 2008 Entries
LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

 

The Html Agility Pack brings flexible, versatile HTML parsing to the .NET platform. It's a great toolkit for scraping websites that don't offer formal, documented APIs.

Under the hood, the agility pack has its own parser that's deftly designed to parse malformed HTML in all its nasty real-life forms.

For querying and traversing a parsed document, the pack offers an in-memory object graph, XPath and XSLT. Today I'd like to add a LINQ to XML converter for extra HTML agility in this new .NET 3.5 post-MIX '08 world.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;

namespace HtmlAgilityPack
{
    public static class HtmlDocumentExtensions
    {
        public static XDocument ToXDocument(this HtmlDocument document)
        {
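            using (StringWriter sw = new StringWriter())
            {
                // Note: this switches the document itself into XML output mode.
                document.OptionOutputAsXml = true;
                // Serialize the (cleaned-up) HTML as XML, then re-parse it with LINQ to XML.
                document.Save(sw);
                return XDocument.Parse(sw.ToString());
            }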
        }
    }
}


For a demonstration of how to use this, I'm going to partially solve a scraping problem I came across a few days ago. Basically, I needed to find all the images on a webpage, order them by descending size, and list them to the user for their selection. Their choice would be used to generate a thumbnail. Social content sites like Digg and Mixx do something similar, but seem to download the top 10 biggest images, thumbnail them anyway and let the user choose. In the following example, I'm going to simply find all the images in the right order and let the reader do the thumbnailing :)

The first step is to download a page and convert its HtmlDocument object graph into an XDocument.



HtmlWeb hw = new HtmlWeb();
string url = @"http://www.nytimes.com/2008/05/18/us/politics/18memoirs.html?hp";
Uri uri = new Uri(url);

HtmlDocument doc = hw.Load(url);
var xdoc = doc.ToXDocument();


 

Now we can find all the image tags with a simple LINQ expression.



var imgs =
        from el in xdoc.Descendants()
        where el.Name.LocalName == "img"
        select 
        new { 
            Src = el.Attribute(XName.Get("src")).Value,
        };


The last step is to order them by descending size, but there are a few things to consider when reading the "width" and "height" from an image XElement.

  1. "width" and "height" may not exist. These attributes aren't mandatory.
  2. "width" and "height" may be expressed in percentages.
  3. XElement.Attribute(XName.Get("width")).Value will throw a NullReferenceException if the attribute doesn't exist.

Every raster image has a width and height, even if the image tag doesn't. If the size cannot be seen on the tag, we could download and open the image to find out its true height/width. A fast way to do this would be to read the image header, but this post isn't about the Image class or reading images - just LINQ to HTML. For now, I'm going to create a default width and height of 0.



XName widthAttribute = XName.Get("width");
XName heightAttribute = XName.Get("height");
XName srcAttribute = XName.Get("src");
XAttribute defaultHeight = new XAttribute(heightAttribute, "0");
XAttribute defaultWidth = new XAttribute(widthAttribute, "0");


Now we can read the width and height in the LINQ expression allowing for defaults, and adding a hack for sizes expressed in percentages.



var imgs =
        from el in xdoc.Descendants()
        // filter first, so sizes are only parsed for img elements
        where el.Name.LocalName == "img"
        let width = Int32.Parse((el.Attribute(widthAttribute) ?? defaultWidth).Value.TrimEnd('%'))
        let height = Int32.Parse((el.Attribute(heightAttribute) ?? defaultHeight).Value.TrimEnd('%'))
        let metric = Math.Sqrt(width * height)
        orderby metric descending
        select
        new {
            Src = el.Attribute(srcAttribute).Value,
            Width = width,
            Height = height
        };

foreach (var image in imgs)
{
    Console.WriteLine("{0} Size: {1}x{2}", 
        image.Src, 
        image.Width, 
        image.Height);
}


Hey presto! The image URLs get written to the console in the right order. The output for the New York Times article looks like this:

Found 26 images

http://graphics7.nytimes.com/images/2008/05/18/us/18mem.span.jpg Size: 600x350
http://graphics7.nytimes.com/adx/images/ADS/14/60/ad.146024/sf.gif Size: 300x250
http://graphics7.nytimes.com/images/2008/05/17/us/18memoir_190.jpg Size: 190x285
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_033108.jpg Size: 334x105
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_052608.jpg Size: 334x105
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_glanville.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/weekinreview/25moth_wong.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/23/fashion/25moth-brain1.ready.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_classic.gif Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/nyregion/thecity/25moth_crime.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/travel/25moth-trail.jpg Size: 151x151
http://graphics7.nytimes.com/images/blogs/thecaucus/thecaucus75.jpg Size: 75x75
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://up.nytimes.com/?d=0//&t=2&s=0&ui=&r=&u= Size: 3x1
/adx/bin/clientside/7e2e2078Q2FZkQ3F3VQ7BpQ27!l6c6L7cQ7BQ7Bpp82Q27Vp Size: 3x1
http://wt.o.nytimes.com/dcsym57yw10000s1s8g0boozt_9t1x/njs.gif?js=No&WT.tv=1.0.7 Size: 1x1
http://graphics7.nytimes.com/ads/ameriprise/AMP_logo_88x31.gif Size: 0x0
http://graphics7.nytimes.com/images/misc/nytlogo153x23.gif Size: 0x0
http://politics.nytimes.com/images/section/politics/elections/2008.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/69/ad.166980/728x90_T_logo.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/95/ad.169529/youngheart_88x31_8.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/01/ad.160192/TMAG_86x60.gif Size: 0x0
http://graphics7.nytimes.com/ads/blank.gif Size: 0x0

 

Some of my assumptions above may seem less than ideal, but remember that it's good practice to include the width and height in the image tag so the browser knows how much visual space to leave while the page loads. For any well-designed page this approach works nicely.
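If those assumptions feel too fragile, the size parsing can be made a little more forgiving. What follows is only a sketch, not part of the original sample: ImageSizeSketch and ReadPixels are names invented here, the markup is made up, and it treats percentages (and anything else that isn't a plain whole number) as "unknown" rather than stripping the % sign as the query above does. It leans on the explicit string conversion of XAttribute, which returns null for a missing attribute instead of throwing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ImageSizeSketch
{
    // Returns the attribute as a whole number of pixels, or 0 when it is
    // missing, empty, a percentage, or otherwise not a positive integer.
    public static int ReadPixels(XElement el, string attributeName)
    {
        // The explicit string conversion yields null for a missing attribute.
        string raw = (string)el.Attribute(attributeName);
        int pixels;
        return int.TryParse(raw, out pixels) && pixels > 0 ? pixels : 0;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a converted page.
        XElement body = XElement.Parse(
            "<body>" +
            "<img src='a.jpg' width='600' height='350' />" +
            "<img src='b.gif' width='50%' height='10' />" +
            "<img src='c.png' />" +
            "</body>");

        var imgs =
            from el in body.Descendants()
            where el.Name.LocalName == "img"
            let width = ReadPixels(el, "width")
            let height = ReadPixels(el, "height")
            orderby (long)width * height descending
            select new { Src = (string)el.Attribute("src"), Width = width, Height = height };

        foreach (var image in imgs)
        {
            Console.WriteLine("{0} Size: {1}x{2}", image.Src, image.Width, image.Height);
        }
    }
}
```

Images whose size can't be read still sort to the bottom with 0x0, so the overall behaviour stays close to the original; the difference is that odd attribute values no longer throw.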

Another neat use for LINQ :)

UPDATE: you can download the code with my modifications and this sample from here.


posted @ Monday, May 26, 2008 4:03 PM | Feedback (3)
ActionDisposable

I borrowed this class from Wes Dyer's LINQ to ASCII Art post. It's a great way to return IDisposable from a method without creating a specialized class.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 

namespace System {

    public class ActionDisposable: IDisposable {
        Action action;
        public ActionDisposable(Action action) {
            this.action = action;
        }

        public void Dispose() {
            action();
        }
    }
}
  

 

Wes uses ActionDisposable casually to reset Console properties. My example to demonstrate this class will be similar to Eilon Lipton's solution to locking a collection, except I'll use ActionDisposable instead of specialized ReadLockDisposable and WriteLockDisposable classes.



public class SingletonSafeCollection<T> : Collection<T>
{
    private static readonly SingletonSafeCollection<T> instance = new SingletonSafeCollection<T>();

    public static SingletonSafeCollection<T> Instance {
        get
        {
            return instance;
        }
    }

    private ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

    /// <summary>Acquires a read lock for reading the collection</summary>
    /// <returns>Disposable object that releases the read lock</returns>
    public IDisposable GetReadLock()
    {
        rwLock.EnterReadLock();
        return new ActionDisposable(rwLock.ExitReadLock);
    }

    /// <summary>Acquires a write lock for writing the collection</summary>
    /// <returns>Disposable object that releases the write lock</returns>
    public IDisposable GetWriteLock()
    {
        rwLock.EnterWriteLock();
        return new ActionDisposable(rwLock.ExitWriteLock);
    }
}


 

Now I can use the SingletonSafeCollection<T> anywhere in my code with the guarantee that, as long as every caller takes the appropriate lock, no one will be writing to the collection while one holds a read lock, and no one else will be reading or writing to the collection while one holds a write lock.

 




static void Main(string[] args)
{
    SingletonSafeCollection<int> numbers = SingletonSafeCollection<int>.Instance;

    using (numbers.GetWriteLock())
    {
        // safe to write with a guarantee no one is reading or modifying the collection (except this thread)
        numbers.Add(1212);
    }

    using (numbers.GetReadLock())
    {
        // safe to read the list with a guarantee no one is modifying the collection
        numbers.ToList().ForEach(i => Console.WriteLine(i));
    }

    Console.ReadLine();
}
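For comparison, here is roughly what the "reset Console properties" style of use mentioned earlier could look like. This is a sketch written for this post rather than Wes's actual code: ConsoleColorSketch and UseForeground are invented names, and ActionDisposable is repeated only so the snippet compiles on its own.

```csharp
using System;

public class ActionDisposable : IDisposable
{
    private readonly Action action;

    public ActionDisposable(Action action)
    {
        this.action = action;
    }

    // Runs the stored action; note that calling Dispose twice runs it twice.
    public void Dispose()
    {
        action();
    }
}

public static class ConsoleColorSketch
{
    // Hypothetical helper: switches the foreground colour and returns a
    // disposable that puts the previous colour back.
    public static IDisposable UseForeground(ConsoleColor color)
    {
        ConsoleColor previous = Console.ForegroundColor;
        Console.ForegroundColor = color;
        return new ActionDisposable(() => Console.ForegroundColor = previous);
    }

    public static void Main()
    {
        using (UseForeground(ConsoleColor.Red))
        {
            Console.WriteLine("written in red");
        }
        Console.WriteLine("written in the original colour");
    }
}
```

Because the using statement calls Dispose even when the body throws, the same pattern is what makes the read and write locks above get released on every path.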




posted @ Monday, May 26, 2008 2:32 PM | Feedback (0)
LINQ & Lambda, Part 1: iTunes XML Querying

Today I was curious what my friend's favourite artists were. Rather than just ask him and give away any surprise of this year's birthday present, I decided to sneakily scrape his iTunes library. He has a wide taste in music and buys a lot of it, so he's a great source of new music. His iTunes library is filled with interesting artists, songs and compilations. On the other hand, he's in the industry and collects a lot of amateur music. When looking for an ideal birthday gift idea in his iTunes library, I'll need to consider how much he plays any song.

iTunes keeps its music metadata in an XML file, which lets it decorate music files with metadata that ID3 tags don't support.

Using LINQ, there are two ways I could query this data:

  1. Use LINQ to XML.
  2. Convert the DTD to XSD, generate proxy classes using LINQ to XSD and then query the loaded file with LINQ to Objects.

I *need* my LINQ & Lambda fix - now - so I'm going to take the fast option numero uno. If anyone is interested in seeing LINQ to XSD in action, let me know in the comments.

Niel Bornstein's article on hacking the iTunes XML describes the DTD clearly. Even though his article is old, the key-value pair structure allows for the addition of new properties in newer iTunes versions.

If you've at least seen some XML before, simply perusing the file is enough to understand the simple format.

  • On Windows, you can find the library XML file in My Documents\My Music\iTunes\iTunes Music Library.XML
  • On OS X, it's in ~/Music/iTunes/iTunes Music Library.XML

Basically, the document root is a plist tag split into three sections - header, track info and playlist info. Each section is within a dict tag, and each entity within the Track and Playlist sections is contained within its own dict tag.

The XML element for a track looks like this:


  
<key>839</key>
  <dict>
    <key>Track ID</key><integer>839</integer>
    <key>Name</key><string>Sweet Georgia Brown</string>
    <key>Artist</key><string>Count Basie &amp; His Orchestra</string>
    <key>Composer</key><string>Bernie/Pinkard/Casey</string>
    <key>Album</key><string>Prime Time</string>
    <key>Genre</key><string>Jazz</string>
    <key>Kind</key><string>Protected AAC audio file</string>
    <key>Size</key><integer>3771502</integer>
    <key>Total Time</key><integer>219173</integer>
    <key>Disc Number</key><integer>1</integer>
    <key>Disc Count</key><integer>1</integer>
    <key>Track Number</key><integer>3</integer>
    <key>Track Count</key><integer>8</integer>
    <key>Year</key><integer>1977</integer>
    <key>Date Modified</key><date>2004-06-16T18:10:55Z</date>
    <key>Date Added</key><date>2004-06-16T18:08:31Z</date>
    <key>Bit Rate</key><integer>128</integer>
    <key>Sample Rate</key><integer>44100</integer>
    <key>Play Count</key><integer>3</integer>
    <key>Play Date</key><integer>-1119376103</integer>
    <key>Play Date UTC</key><date>2004-08-17T16:39:53Z</date>
    <key>Rating</key><integer>100</integer>
    <key>Artwork Count</key><integer>1</integer>
    <key>File Type</key><integer>1295274016</integer>
    <key>File Creator</key><integer>1752133483</integer>
    <key>Location</key><string>file://localhost/Users/niel/Music/music.mp4</string>
    <key>File Folder Count</key><integer>4</integer>
    <key>Library Folder Count</key><integer>1</integer>
  </dict>


To construct my query, I broke my goal into four tasks:

  1. Find the collection of tracks.
  2. For each song, find the play count and artist name.
  3. Sort the song list by play count, in descending order.
  4. Select any artist only once.

And here's the code to write out my friend's favourite artists in descending order of popularity.


using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace iTunesXmlParser {
   class Program {
       static void Main(string[] args) {
               string xmlFile = args[0]; // the first command-line argument is the library path
               using (var sr = new StreamReader(xmlFile)) {
                   var doc = XDocument.Load(sr);
                   // find the tracks dictionary
                   XElement tracksDictionary = (from element in doc.Descendants()
                                                where element.Value.Equals("Tracks")
                                                select element).First().NextNode as XElement;

                   // get all distinct track artist names in descending order of track play count
                   // NOTE: if no play count exists for a track, the track is assumed to have 0 plays
                   List<string> artistNames = (from el in tracksDictionary.Descendants()
                                  let playCountValues = (from trackField in el.Descendants()
                                                         where trackField.Value == "Play Count"
                                                         select (trackField.NextNode as XElement))
                                  let artistValues = (from trackField in el.Descendants()
                                                         where trackField.Value == "Artist"
                                                         select (trackField.NextNode as XElement))
                                  where 
                                  el.Name.LocalName == "dict" // find the track key-value pair property set
                                  && artistValues.Count() > 0 // an artist property should exist
                                  orderby playCountValues.Count() == 0? 0 : Int32.Parse(playCountValues.First().Value) descending
                                  select artistValues.First().Value).Distinct().ToList();

                   Console.WriteLine("Number of artists:" + artistNames.Count());

                   // write out the artist names in descending order of popularity
                   foreach (var artistName in artistNames) {
                       Console.WriteLine(artistName);
                   }
               }
               Console.WriteLine("hit enter to exit");
               Console.ReadLine();
       }
   }

}
  

The let keyword allows you to define a variable within a query. In my case, playCountValues and artistValues are each an IEnumerable<XElement>, holding the play count and artist name elements respectively. let is great for keeping values that are used more than once elsewhere in the query.
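To see let on its own, away from the XML plumbing, here is a tiny LINQ to Objects sketch. LetSketch and LongWords are names invented for the illustration and have nothing to do with the iTunes code; the point is only that the trimmed length is computed once per item and then reused by both where and orderby.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetSketch
{
    // Returns the trimmed words longer than minLength, longest first.
    public static List<string> LongWords(IEnumerable<string> words, int minLength)
    {
        return (from word in words
                let trimmed = word.Trim()
                let length = trimmed.Length      // computed once per word
                where length > minLength         // reused here...
                orderby length descending        // ...and here
                select trimmed).ToList();
    }

    public static void Main()
    {
        var words = new[] { " Zero 7 ", "DJ Shadow", "Basie" };
        foreach (var word in LongWords(words, 5))
        {
            Console.WriteLine(word);
        }
    }
}
```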

This query is a bit complex and unusually long because the iTunes XML doesn't store key-values like <Name>DJ Shadow</Name> (the typical format). Notice the need to get the value element with (element.NextNode as XElement).
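One way to tame that key-then-value layout is to flatten each track dict into a dictionary first. The following is a minimal sketch under a few assumptions: every key is immediately followed by its value element, no key repeats within a dict, and the track entry shown is a heavily trimmed, made-up example. PlistSketch and ToDictionary are names invented here. It uses ElementsAfterSelf rather than NextNode so that any whitespace text nodes between elements are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PlistSketch
{
    // Pairs each <key> child of a plist <dict> with the element that follows it.
    public static Dictionary<string, string> ToDictionary(XElement dict)
    {
        return dict.Elements("key")
                   .Where(k => k.ElementsAfterSelf().Any())
                   .ToDictionary(k => k.Value,
                                 k => k.ElementsAfterSelf().First().Value);
    }

    public static void Main()
    {
        // Hypothetical, heavily trimmed track entry.
        XElement track = XElement.Parse(
            "<dict>" +
            "<key>Name</key><string>Sweet Georgia Brown</string>" +
            "<key>Artist</key><string>Count Basie &amp; His Orchestra</string>" +
            "<key>Play Count</key><integer>3</integer>" +
            "</dict>");

        Dictionary<string, string> fields = ToDictionary(track);

        // A missing play count is treated as 0 plays, as in the query above.
        string playCount;
        int plays = fields.TryGetValue("Play Count", out playCount)
            ? int.Parse(playCount)
            : 0;

        Console.WriteLine("{0}: {1} plays", fields["Artist"], plays);
    }
}
```

With the fields in a dictionary, the artist and play count lookups become plain key lookups, at the cost of an extra pass over each track.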

After running this on my friend's library, the console spat out over 5000 lines of artist names. Success! Top of the list is "Zero 7". Second on the list is "DJ Shadow" - who I've never heard before. I'll definitely check these artists out.

The next thing I'll do is find out what albums of "Zero 7" he's got and perhaps cross reference the list with an Amazon page to find an album he doesn't own. This year's birthday present decision will be completely automated :)

Till next time on LINQ & Lambda, if there's any particular LINQ provider you want me to cover, please leave a message in the comments.


posted @ Monday, May 05, 2008 1:37 AM | Feedback (1)
LINQ & Lambda

I think I have a problem. I'm addicted to LINQ and Lambdas...

Every opportunity I get to play with a new LINQ provider I take it. LINQ to XSD, LINQ to Lucene, LINQ to Flickr, LINQ to SQL, LINQ to * - you name it - I've used it.

And it seems I'm not the only one, with others confessing their uncontrollable attraction toward them too.

Last week I spent a few days on an old C# 2 project. Within minutes, I was clutching for the sweet, sweet 3.0 compiler. I couldn't stop sweating when writing event handler receivers like delegate(object sender, EventArgs e) {}.

Calm down Vijay... they're just withdrawal symptoms...

Oh the inner angst and turmoil! What do I do?! How do I expel the Demons from within?!

I call these demons LINQ & Lambda, and they've irrecoverably changed the way I think - reprogrammed my brain.

Last night, I dreamt of an anthropomorphism of these Demons.

LINQ and Lambda

Coincidentally, they resemble the evil, innocent looking, white-nationalist pop duo Prussian Blue.

Perhaps the best way to deal with them is to get the word out. Warn others of my folly and misadventure - in blog-series format.

Stay tuned for Part 1 of LINQ & Lambda...


posted @ Sunday, May 04, 2008 6:32 PM | Feedback (5)