Jim’s Random Notes

Musings on technology and life

July 24th, 2008

Is that code really from Sun?

I updated my Java runtime the other day, and now every time I open a new tab in Internet Explorer, I get this message box:

It looks like somebody at Sun forgot to sign their update agent.  At least, I think this control came from Sun.  But there’s no way to be sure, is there?  Do I blindly assume that this really is from Sun and that they made a mistake in generating the build, or do I do the prudent thing and permanently disallow it?

In a security conscious world, there’s no excuse for a major player like Sun to have released something with this error.  One wonders, if an obvious bug like this makes it through their quality control, what other less obvious nasties are lurking in the code.

To heck with it.  If Sun wants to push their software on me, they’ll have to get it right.  I’m going to disallow the update agent.  If I ever need to update my Java runtime, I guess I’ll just have to do it manually.

July 22nd, 2008

Charlie versus the Wildlife. Again.

Every time I get to thinking that maybe Charlie’s learned not to mess with the local wildlife, he does something incredibly stupid to set me straight. Last night I let him out just before going to bed. He stood there by the door for a minute and then took off around the corner after something. 30 seconds later he was running across the yard with his face in the grass, and the unmistakable aroma of skunk assaulted my olfactory system.

Yes, Charlie got another skunk. More correctly, the skunk got him. Not only does the dog stink (he’s at the vet now, getting a skunk bath), but the skunk let loose around the side of the house–right next to the air conditioning unit. The house reeks. I’m at home today with the windows open and the whole-house fan pulling in the 95-degree air, hoping to get rid of that smell.

This is Charlie’s second skunk. I had hoped that after the last time he would have learned that the stinky black kitty with the white stripe is strictly hands-off. Sadly, he seems to be a slow learner.

July 21st, 2008

C# and .NET: What’s Next?

About 10 days ago, MSDN’s Channel 9 site released an hour-long video entitled Meet the Design Team, that talks in very vague terms about uncoming features in C# 4.0.  You’ll learn that the language will include more dynamic constructs and built-in support for multiple cores.  Honestly, that’s about all you’ll learn from watching the video.  Granted, either one of those broad features implies many changes to the language and to the underlying runtime.

Improvements to the language are all well and good, but given the choice I’d rather have them address some fundamental runtime issues:  the two-gigabyte limit, and garbage collection.  Both of these issues have caused me no end of grief over the past year.

All things considered, the .NET garbage collector is a definite win.  It handles the majority of memory management tasks much better than most programmers.  It’s not impossible to create a memory leak in a .NET program, but you really have to try.  Unfortunately, garbage collection is not free.  You’ll find that out pretty quickly if you write a long-running program that does a lot of string manipulation.  For example, take a look at this clip, which shows bandwidth usage from a Web crawler written in .NET:

Those times of zero bandwidth usage you see coincide with the garbage collector pausing all the threads to clean things up.  We lose somewhere around 10% of our potential bandwidth usage due to garbage collection.  This particular graph is from a dual-core machine.  The graph looks the same on a quad-core processor.

Obviously, they’ll have to do something about the garbage collector if they’re going to support multiple cores.  No amount of multi-core support in the language or in the runtime will do me a bit of good if every core stops whenever the garbage collector kicks in.

I’ve mentioned the .NET two-gigabyte limit before.  The 64-bit runtime has access to as much memory as you can put in a machine, but no single object can be larger than two gigabytes.  When you’re working with data sets that contain hundreds of millions of items, that’s just not acceptable.  When $2,000 will buy you a machine with 16 gigabytes of memory, it’s time that the .NET runtime give me the ability to allocate an object that makes use of that capacity.

I’m happy to see the team continue improving the C# language.  I’ll undoubtedly find many of their improvements useful.  But no amount of language improvement will increase my productivity if I’m hamstrung by the absurd limit on individual object size and the garbage collector continues to eat my processor cycles.

Unfortunately, we’ll have to wait a bit longer before we know what all will be included in the next versions of C# and .NET.  Microsoft is keeping pretty quiet, apparently in an attempt to make a big splash at the Professional Developer’s Conference in October.

Anybody care to pay my way to the conference?

July 19th, 2008

More URL Filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components–URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely.  Or until our self-imposed URL length limit kicks in.  Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating http://www.example.com/mp3/song.mp3) is the right thing to do.

A single repeated component is by far the most common, but we frequently see two or three repeated components:

http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3
http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3

It’s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial.  But it’s not a good general solution.  For example this blog (and many others) uses URLs of the form blog.mischel.com/yyyy/mm/dd/post-name/, so the entry for July 7 is blog.mischel.com/2008/07/07/post-name/.  Globally applying the repeated component removal rules would break a very large number of URLs.

This is one of the many URL filtering problems for which there is no good global solution.  Sometimes, repeated path components are legitimate.  We can use some heuristics based on the crawl history (i.e. if /mp3/song.mp3 generates /mp3/mp3/song.mp3) to identify problem sites, but in the end we end up having to write domain-specific filtering rules.  Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.

Another per-domain problem is that of session IDs encoded within the path, or with uncommon parameter names.  For example, we can easily identify and remove common ids like PHPSESSID= and sessionid=, but these URLs will escape the filter unscathed:

http://www.example.com/file.html?exSession=123456xyzzy
http://www.example.com/file.html?exSession=845038plugh
http://www.example.com/coolstuff/123456xyzzy/index.html

http://www.example.com/coolstuff/845038plugh/index.html

It’s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn’t quite that smart, though, and making it that smart is very difficult.

Developing a system that automatically identifies problem URLs and generates filtering rules is a “big-R” research project–something that we don’t have time to work on at the moment.  Even if we were to develop such a thing, it’d be pretty fragile and would require constant monitoring and tweaking.  If a site’s URL format changes (something that happens with distressing frequency), the filtering rules become invalid.  Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.

When I started this project, I knew that crawling the Web was non-trivial.  But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.

July 18th, 2008

Odds ‘n Ends

  • Tom’s Hardware is running a review of solid state drives that compares the latest generation of SSDs against current mechanical drive technology.  It’s little surprise that SSDs are in general faster than hard drives.  What I found surprising is that some SSDs actually require more power than hard drives.  Not the newer crop, though.  Even the least efficient SSD has better performance-per-watt numbers than the most efficient hard drive.  And the OCZ SATA II is very impressive.
  • Solid state drives are still very expensive, though.  The 64 gigabyte OCZ SATA II will cost you about $17 per gigabyte.  That’s the high end.  Typical SSD prices are in the $10 per gigabyte range.  That’s a whole lot more than you’ll pay for a mechanical hard drive.  You can pick up a 320 Gb notebook drive for $110–less than 30 cents per gigabyte.  It’s nice to know that SSD is coming along, but it’ll be a year or two before I can justify replacing my notebook’s hard drive.
  • If you’re interested in using Windows Server 2008 as a workstation operating system, you should visit win2008workstation.com.  But be careful.  The site has a lot of good information, but there’s a large hacker/cracker component that sees nothing wrong with sharing component files.  I wouldn’t trust downloading anything pointed to by forum posts.
  • If you’re in the market for a “dual core” laptop, be careful.  Intel made a “Core Duo” line of processors which is in effect two Pentium M processors on one die.  These are 32-bit processors.  You probably want a machine that has a “Core 2 Duo” processor–a 64-bit part.  I can’t see any reason why a typical user would want to buy a machine with a 32-bit processor.
  • Also on the subject of laptop computers, don’t assume that you’re getting the best price by buying on eBay.  I compared prices for Dell laptops on eBay and at Dell Outlet.  The outlet prices compare quite favorably with eBay, the only drawback being that you’ll have to pay sales tax if you buy from Dell.  Still, I found plenty of eBay sales where the buyer paid more than what he would have paid at the outlet–including tax.  Do your research.

 

July 16th, 2008

Exceeding the Limits

We generate a lot of data here, some of which we want to keep around. Yesterday I noticed that I was running out of space on one of my 750 GB archive drives and figured it was time to start compressing some of the data. The data in question is reasonably compressible. A quick test with Windows’ .zip file creator indicated that I’d get a 30% or better reduction in size.

The data is generated on a continuous basis by a program that is always running.  The program rotates its log once per hour, and the hourly log files can be anywhere from 75 to 200 megabytes in size.  Figuring I’d reduce the number of files while also compressing the data, I wrote a script that uses INFO-ZIP’s Zip utility to create one .zip file for each day’s data.

And then I hit a wall.  It seems that the largest archive that Zip can create is 2 gigabytes.  As their FAQ entry about Limits says:

While the only theoretical limit on the size of an archive is given by (65,536 files x 4 GB each), realistically UnZip’s random-access operation and (partial) dependence on the stored compressed-size values limits the total size to something in the neighborhood of 2 to 4 GB. This restriction may be relaxed in a future release.

With 24 files ranging in size from 75 to 200 megabytes, it’s inevitable that some days will generate more than 3 gigabytes of data.  At about 30% compression, that’s not going to fit into the 2 GB file.

My immediate solution will be to compress the files individually.  It’s less than ideal, but at least it’ll give me some breathing room while I look for a new archive utility.

I’m surprised that in today’s world of cheap terabyte-sized hard drives, the most popular compression tools have the same limitations they had 20 years ago.  Every modern operating system has supported files larger than 4 gigabytes for at least 10 years.  It’s time our tools let us use that functionality.

I’m in the market for a good command-line compression/archiver utility that has true 64-bit file support.  Any suggestions?

July 14th, 2008

Going Too Far Back

The other day I intended to close a Remote Desktop window and instead hit the Close button (the X on the right of the window’s caption bar) on the console window running our data broker. Nothing like an abnormal exit to bring the whole house of cards tumbling down.

So I went looking for a way to prevent that particular problem from occurring again. Disabling the Close button is pretty easy. In fact, there are at least two ways to do it. Neither is ideal.

The Close button is on the window’s system menu. You can get a handle to the system menu by calling the GetSystemMenu Windows API function. In addition to the buttons on the window’s caption bar, this menu also contains the menu items you see if you click on the box at the left of the window:

Given a handle to the system menu, you have (at least) two choices:

  1. Call EnableMenuItem to disable the caption bar’s Close button.
  2. Call DeleteMenu to remove the Close item from the menu. Doing so will also disable the Close button on the caption bar.

The second option looks like the best, because it prevents me from hitting the Close button, and also prevents me from inadvertently clicking the Close menu item when I’m going for Edit. The C# code for the second option looks like this:

[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr GetConsoleWindow();

[DllImport("user32")]
private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

[DllImport("user32")]
private static extern bool DeleteMenu(IntPtr hMenu, uint uPosition, uint uFlags);

private const int MF_BYPOSITION = 0x0400;

static void Main(string[] args)
{
    // Get the console window handle
    IntPtr winHandle = GetConsoleWindow();

    // Get the system menu
    IntPtr hmenu = GetSystemMenu(winHandle, false);

    // Delete the Close item from the menu
    DeleteMenu(hmenu, 6, MF_BYPOSITION);

    // rest of program follows
}

That works well, as you can see from this screen shot:

But there’s a problem. To restore the menu when your program is done, you’re supposed to call GetSystemMenu and pass true for the second parameter, telling it to restore the menu, like this:

GetSystemMenu(winHandle, true);

The result is probably not what you expect:

The system didn’t revert to the previous menu, but rather to the default system menu–the one created for every window. The Edit, Defaults, and Properties items that cmd.exe adds to the menu are gone.

Since I can’t reliably restore the menu after deleting an item, I figured I’d call EnableMenuItem to disable the Close item. Unfortunately, that doesn’t appear to be possible. At least, I haven’t been able to make it work. Since I often need the Edit menu item even after the program exits, I’m going with the first option and hoping that I don’t hit the Close menu item by mistake when going for the Edit menu while the program is running.

An aside: we have the term “fat finger” to describe hitting the wrong key on the keyboard. Is there a similar expression for making a mistake with the mouse? I suppose “mis-mouse” would do, but it doesn’t have quite the same ring to it as “fat finger.”

July 10th, 2008

Proxy fits

Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to slashdot.org.

You can often go around such blocks (not that I’m advocating such behavior) by using services such as SureProxy.com. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/. If SureProxy isn’t blocked by your IT department, then you end up seeing the slashdot page. (Along with whatever advertisements SureProxy adds to the page.)

I’m sure this kind of thing gives corporate IT departments headaches. Their headaches are nothing compared to the problems proxies pose for Web crawlers.

The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it will just build more and more, all of which go through the proxy. And since the proxy URL doesn’t look anything like the real URL (at least, not to the crawler), the crawler will end up viewing the same page many times: once through the real link, and once through every proxy that the link appears in.

Fortunately, it’s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I’ve encountered use CGIProxy–a free proxy script. The script itself is usually called nph-proxy.cgi or nph-proxy.pl, although I’ve also seen nph-go and nph-proy, among others. It’s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I’ll have to find and block manually.

I’ve also seen proxies (Invisible Surfing is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=. I’m sure that with some effort I could decode the URLs hidden in the query string, once I determined that the URL was a proxy URL. That turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I do some manual spot-checking the URLs myself and manually blocking the domains. It’s like playing Whac-A-Mole, though, because new proxies appear all the time.

The other problem with crawling through proxies is that it makes the crawler ignore the robots.txt file on the target Web site. Since the crawler thinks it’s accessing the proxy site, it checks the proxy’s robots.txt. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.

Perhaps most surprising is that proxy sites don’t have robots.txt files that disallow all crawlers. I can see no benefit for the proxy site to allow crawling. The crawlers aren’t viewing the Web pages, so the proxy site doesn’t get the benefit of people clicking on their ads. All the crawler does is waste the proxy site’s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don’t take the simple step of writing a simple robots.txt, please explain that to me in the comments, or by email. I’m very curious.

July 9th, 2008

Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you’ll start to wonder how the Web manages to work at all.  Nowhere is this more evident than when working with URLs.

It’s a pleasant fantasy to believe that a document on the Web can be reached through one and only URL.  That is, our training as programmers pushes us into the belief that the URL http://www.example.com/docs/resume.html is the way to reference that particular document.  It might be the preferred way, but it’s certainly not the only way.  On most servers, for example, you can drop the “www”, so that http://example.com/docs/resume.html will get you to the same place. We call this “the www problem.”

That’s just the simplest example.  Did you know that multiple slashes are irrelevant?  That is, http://www.example.com/////docs////resume.html will go to the same place as the two URLs above. You can also do some path navigation within the URL so that http://www.example.com/docs/../docs/resume.html goes to the same place as all the other examples I’ve shown.

You can also “escape” any character within a URL. For example, you can replace a slash (/) with the character string %2F, turning the original URL above into this: http://www.example.com%2Fdocs%2Fresume.html. Most often, escaping is used to remove embedded spaces and special characters that have particular meanings in URLs. Sometimes escaping is done automatically when a user copies a link from a browser and pastes it into an HTML authoring program.

Above are just some of the simplest examples. I haven’t even started on query strings–parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the “canonical” form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee), and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.

If you haven’t written a crawler, you might think I’m just contriving examples. I’m not. The www problem in particular is a very real issue that if not addressed can cause a crawler to read a very large number of pages twice: once with the www and once without the www. The other issues are not nearly as prevalent, but they are significant–so significant that every crawler author spends a huge amount of time trying to develop heuristics for URL canonicalization. Simply following the specification in RFC 3986 will get you most of the way there, but there are ambiguities that simply cannot be resolved.  So we do the best we can.

You might also wonder where these weird URLs come from.  The answer is, “everywhere.”  Scripts are high on the lists of culprits.  They can mangle URLs beyond belief.  For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string.  The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.

So imagine a script  reached from the URL http://www.example.com/script.php?page=1&username=Jim%20Mischel. The script appends the username variable to the query string for all links when it generates the page, but it escapes the string. So links harvested from the page have this form: http://www.example.com/script.php?page=2&username=Jim%2520Mischel. “%25″ is the escape code for the percent sign. Now imagine following a chain of 10 links all generated by that script. You end up with http://www.example.com/script.php?page=10&username=Jim%2525252525252525252520Mischel.

What’s a poor crawler to do?

We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it’s a never-ending battle. Whenever we think we’ve seen it all, we run into another surprise.

July 7th, 2008

Computer Notes

  • One thing I haven’t figured out yet with my new Dell 490 system running Windows Server 2008 is how to burn a CD. I have a LITE-ON DVD RW in it–the same drive that was in my old Windows XP system–but for some reason Windows Server reports it as a DVD-ROM. This one has me stumped, but I don’t have the time to really track it down. Although I am getting tired of going to some other machine for burning CDs.
  • Several years ago I bought a Shuttle SK41G computer that served me quite well, first as a Linux test machine, then as a development platform, and finally as a small DNS server. About a year ago it lost its mind. I thought the problem was the battery for the CMOS RAM, but after replacing the battery the machine still loses the time whenever I shut it off. I hate to throw out a perfectly good (if somewhat aging) computer, but have a hard time justifying the time I’d spend puzzling this out.
  • I’ve been considering buying a clone of my Dell Latitude D610 laptop. Dell doesn’t sell that machine anymore, but they’re plentiful on eBay: at about $400 for a fully loaded machine, shipping included. That’s about 20% of the new price from three years ago. It’s a very serviceable machine, with a 2 GHz processor, 2 GB of RAM, hard drives from 30 to 160 GB, and a nice display that’ll do 1400×1050 pixels. The only possible drawback is that it’s a single core 32-bit processor.
  • Dual core laptops are pretty resonable. You can pick up a new Dell Inspiron 1525 on eBay for $500 or $600. For $700, you can get one fully loaded with lots of RAM and a big hard drive. I wonder about battery life, though. Can I get five hours out of it with the optional battery in the expansion bay? And honestly: do I really need multiple cores in a laptop?
June 23rd, 2008

Odds ‘n Ends

A few notes after a day of knocking things off the “to do” list.

  • I’ve used QUIKRETE before, but never for setting a post. Just pour the dry concrete mix into the hole (after placing the post), and add one gallon of water for every 50 lbs of mix. The stuff sets in about 45 minutes, and you can apply stress to the post after only four hours. No mixing required. Ain’t technology wonderful?
  • Seeing as how I had only one hole to dig, I did it the old-fashioned way: with a post hole digger and a Texas toothpick. Note to self: wear gloves next time.
  • From the hammer’s point of view, a thumb looks just like a fence staple.
  • It’s always a good idea to remove the old part and take it to the auto parts store when you go shopping for its replacement. It’ll save you from having to make another trip when you realize that the part you got isn’t the part you need.
  • I shouldn’t have to remove a dozen screws with three different tools in order to replace a relay.
  • A PVC union is an ingenious device. But remember to put thread compound on the threads of the device itself, in addition to the threads of the two pipes you’re attaching it to.
June 19th, 2008

Major search engines support robots.txt standard

Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives.  This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers.  The common directives are: Disallow, Allow, and Sitemaps.  In addition, all three support the use of wildcards (* and $) in specifying paths for Allow and Disallow.  It’s interesting to note that Yahoo says they support “$ Wildcards,” whereas Google and Microsoft say that they support “* Wildcards” as well as “$ Wildcards.”  From reading Yahoo’s documentation, though, I’d say that they also support “* Wildcards.”

All three also support several HTML META tags, such as NOINDEX and NOFOLLOW, that give content authors much tighter control over crawlers than can be accomplished with robots.txt. 

This isn’t exactly a new step.  The three major search engines have been collaborating for the last few years, trying to make Webmasters’ jobs easier with respect to the major search engines.  For example, back in February they announced common support for cross-submission of Sitemaps.

Unfortunately, all three also support individual extensions to the Robots Exclusion Protocol.  For example, Yahoo and Microsoft support the Crawl-Delay directive, which Google does not support. Both Google and Yahoo support some unique META tags that the others don’t support.

Even with the incompatibilities, this is a big step in the right direction. With unified support of the major robots.txt directives among the three major search engine crawlers, we can expect to see more support by smaller crawlers. I know that many authors of smaller-scale crawlers look to the majors to see what they should support. Having all three support the same directives in the same way, makes other developers’ jobs (including mine!) easier.

But ultimately it’s the Webmasters who benefit the most by giving them a standard way to control crawlers’ access to their sites.

June 16th, 2008

One more time: the Internet is public

[Note:  As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.]

I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on the Internet.  If you’ve uploaded something to your Web site, it’s highly likely that Google, MSN, Yahoo, or any (or all) of the many other search engines out there has found it.  Even our Web crawler–a small-scale operation–finds things in hidden nooks and crannies of the Web that most people with browsers would never stumble upon.

For example, the other day a coworker was spot-checking some of the crawler’s latest finds and stumbled upon a site where the owner had uploaded what looks like (from examining the file names) a bunch of very private stuff.  This all in an unprotected directory.  A person with a browser could go to that URL, get a listing of all files, and then browse to his heart’s content.  Although it’s unlikely that a person browsing would stumble upon the directory, a crawler almost certainly will.  Eventually.

When we run across something like that, we don’t actually browse, but rather find out how to contact the site owner and send him a very nice email suggesting that he either protect the directory or not upload that information.

The day after discovering the site I mentioned above, we ran across the story of Alex Kozinski, a judge in the 9th Circuit whose personal porn stash was found publicly accessible online:

Kozinski, 57, said that he thought the site was for his private storage and that he was not aware the images could be seen by the public, although he also said he had shared some material on the site with friends. After the interview Tuesday evening, he blocked public access to the site.

Of particular interest in this case is that the judge was presiding over an obscenity trial (now postponed) that involves material that’s apparently similar to some of the material on the judge’s site.  The judge also had some copyrighted music on the site, opening up the possibility of copyright violation.

No matter how far out in the country you live, if you stand naked in front of an uncovered window, somebody will eventually see you.  Similarly, if you upload something to your Web site and don’t take active measures to prevent access, it will be found.  Do not assume that it can’t be found because you never told anybody about it.  That’s like putting a key under the doormat and figuring it’s safe because only you know it’s there.

June 11th, 2008

Can’t Configure Windows DNS Resolver Cache

In experimenting with the program I described yesterday, I got to fiddling with the DNS resolver cache, called dnscache. Briefly, dnscache saves the results from recent DNS queries so that it doesn’t have to keep querying the DNS server. Considering that a DNS query can take 100 milliseconds or more to resolve, this can save considerable time. For example, for your browser to load this Web page, it has to make many different requests to my server: one for the base page, one for the stylesheet, one for each image, etc. It wouldn’t be uncommon to require a dozen separate requests to get all the resources that make up the page. If each resource required a separate DNS request, it would take more than a second just for DNS!

I got to wondering just how large the DNS cache is. A little bit of searching brings up any number of pages claiming that you can “speed up your connection” by tweaking the DNS resolver cache parameters. Specifically, they talk about changing registry keys for the cache hash table size, maximum time to live, etc. There’s even a Microsoft TechNet article describing these parameters for Windows Server 2003 (and, by extension, Windows XP). It’s interesting to note that the information on most of the pages claiming to speed things up conflicts rather badly with the information in the TechNet article.

After reading the tweaks and the TechNet article, I figured I’d give it a shot. I fired up the Registry Editor, made the changes, and … is it working? How can I tell? I tried browsing a few Web sites, but I couldn’t see any difference.

A little more searching and I found the command ipconfig /displaydns. This writes the contents of the DNS resolver cache to the console. A little work with the FIND utility, and I was able to count the number of entries in the cache. 34 on my Windows XP box. Interesting, considering that I set the CacheHashTableSize registry entry to over 7,000. I fiddled and tweaked, restarted the DNS Client service, flushed the cache, rebooted my computer, faced Redmond and cursed, and generally tried everything I could think of. No matter what settings I used, I always ended up with between 30 and 40 entries in my DNS cache.

On my Windows Server 2008 machine at the office, I always got between 270 and 300 entries, no matter what I tried.

So that leaves me with the following possibilities:

  1. It’s not possible to change the size of the DNS resolver cache in Windows XP or Windows Server 2008.
  2. It is possible, but the documentation is wrong.
  3. The documentation is correct as far as it goes, but it’s incomplete.
  4. The documentation is correct and complete, but I’m too dumb to make sense of it.
  5. The documented registry entries actually changed the size of the cache, but ipconfig isn’t showing me all the entries that are in the cache.

At this point, all possibilities seem almost equally likely. I could do some indirect testing based on the amount of time it takes to resolve a series of DNS requests, but even that would be inconclusive. There are no documented API calls that allow me to examine the DNS cache or its size. (And the undocumented ones aren’t described well enough to be worth checking out.) My only means of seeing what’s in the cache is the ipconfig tool.

So I ask: does anybody know how to change the size of the Windows DNS resolver cache and prove that those changes actually work? Do I have to restart the DNS Client service? Reboot the machine? Set some super magic registry entry?

Any information greatly appreciated.

June 10th, 2008

Is this really asynchronous?

I’ve been working on a relatively simple program whose purpose is to see just how fast I can issue Web requests. The idea is to get one machine hooked directly to an Internet connection and see how many concurrent connections it can maintain and how much bandwidth it can consume. A straight bandwidth test is easy: just start three or four Linux distribution downloads from different sites. That’ll usually max out a cable modem connection.

But determining the sustained concurrent connection rate is a bit more difficult. It requires that you issue a lot of requests, very quickly, for an extended period of time. By slowly increasing the number of concurrent connections and monitoring the bandwidth used, I should be able to find an optimum range of request rates: one that makes maximum use of bandwidth, but doesn’t cause requests to timeout.

My Web crawler does something similar, but it also does a whole lot of other things that make it impractical for use as a diagnostic tool.

I got the program up and limping today, and was somewhat surprised to find that it couldn’t maintain more than 15 concurrent connections for any length of time. Considering that my crawler can maintain 200 or more connections without a problem, I found that quite curious. It had to be something about the different way I was issuing requests.

Because this is a simple tool, I figured I’d use the .NET Framework’s WebClient component to issue the requests. In order to avoid the overhead of constructing a new WebClient for every request, I initialized 100 WebClient instances to be served from a queue, and then issued the requests in a loop, kind of like this:

while (!shutdown)
{
    if (currentConnections < MaxConnections)
    {
        WebClient cli = GetClientFromQueue();
        ++currentConnections;
        cli.DownloadStringAsync(GetNextUrlFromQueue());
    }
}

The actual code is a bit more involved, of course, but that’s the gist of it. The currentConnections counter gets decremented in the download completed event handler.

The important thing to note here is that I’m issuing asynchronous requests. The call to DownloadStringAsync executes on a thread pool thread. This code should issue requests at a blindingly fast rate, and keep the number of concurrent connections right near the maximum. Even with MaxConnections set to 50, the best I could do was 20 concurrent–and that for only a very short time. Most often I had somewhere between 10 and 15 concurrent connections.

After eliminating everything else, I finally got around to timing just how long it takes to issue that asynchronous request. The result was pretty surprising: in my brief tests, it took anywhere from 0 to 300 milliseconds to issue those requests. The average seemed to be around 100 or 150 ms. That would explain why I could only keep 10 or 15 connections open. If it takes 100 ms to issue a request, then I can only make 10 requests per second. Since it takes about 2 seconds (on average) to complete a request, the absolute best I’ll be able to do is 20 concurrent requests.

So I got to thinking, why would it take 100 milliseconds or more to issue an asynchronous Web request? And the only reasonable answer I could come up with was DNS: resolving the domain name. And it turns out I was right. I flushed the DNS cache and ran my test by requesting a small number of URLs from different domains. Sure enough, it averaged about 150 ms per request. I then ran the program again and it took almost no time at all to issue the requests. Why? Because the DNS cache already had those domain names resolved. Just to make sure, I flushed the DNS cache again and re-ran the test.

By the way, the HttpWebRequest.BeginGetResponse method (the low-level counterpart to WebClient.DownloadStringAsync) exhibits the same behavior. That’s not surprising, considering that WebClient uses HttpWebRequest to do its thing.

This is a fatal flaw in the design of the .NET Framework’s support for asynchronous Web requests. The whole idea of supplying asynchronous methods for I/O requests is to push the waiting off on to background threads so that the main thread can continue processing. What’s the use of providing an asynchronous method if you have to wait for a high latency task like DNS resolution to complete before the asynchronous request is issued? Why can’t the DNS resolution be done on the thread pool thread, just like the actual Web request is?

There is a way around the problem: queue a background thread to issue the asynchronous request. Yes, I know it sounds crazy, but it works. And it’s incredibly easy to do with anonymous delegates:

ThreadPool.QueueUserWorkItem(delegate(object state)
    {
        cli.DownloadStringAsync((Uri)state);
    }, GetNextUrlFromQueue());

That spawns a thread, which then issues the asynchronous Web request. The time waiting for DNS lookup is spent in the background thread rather than on the main processing thread. It looks pretty goofy, and unless it’s commented well somebody six months from now will wonder what I was smoking when I wrote it.

June 9th, 2008

The perfect ground cover?

A few years back, Debra and I started adding large mulch areas around the trees in the yard. This was an effort to make things look a little better, as well as to reduce lawn maintenance. More mulch means less grass to mow. And mulch around the trees means that I don’t have to run the weed eater to knock down the grass that normally would grow around the trunks. The problem is that weeds and grass grow in the mulch, and if you don’t keep up with pulling them and adding a new layer of mulch every year or two, the grass will take over again.

Another option is to plant a good thick ground cover that will prevent grass and weeds from growing. One of the best such plants for this area is asian jasmine. Getting it established might be a challenge, but once established it’s very drought tolerant and requires little maintenance. Just trim it with the weed eater, or run the mower over it on the highest setting once or twice a year. The only thing that concerns me is the stated requirement of “moist, well-drained, well-prepared soil” for establishment. Such soil is in short supply in our yard.

As far as I’m concerned, the perfect ground cover would be grass that never grows higher than an inch or two. Why can’t some of these genetic engineering whizzes get to work on such a thing? Forget Frankenfoods. I imagine just about any homeowner would kill for a lush green lawn that he never had to mow.

June 6th, 2008

Internet Explorer clipboard protection is broken

This morning I copied a URL from the browser to the clipboard and then tried to paste it into the email message I was writing in another browser window. Internet Explorer popped up this confirmation box:

I wouldn’t mind so much if it showed this box one time. But it shows the box for every new email I try to paste stuff to.

There are two things that annoy me about this confirmation box. The first is that the default button is “Don’t allow”. Obviously, somebody has a much higher opinion of the threat posed by indiscriminate clipboard pasting than I do. I just don’t agree that IE should be holding my hand here and trying to dissuade me from pasting data into an email. The default should be “Allow access”. For dang sure, I should be able to change the default. Better yet, I’d like to just turn the silly notification off. Does Windows have a, “Yes, I know what I’m doing” mode?

Worse, this confirmation box is broken for keyboard users. I’m pretty keyboard-centric, especially when I’m writing. I don’t need to remove my fingers from the keyboard in order to copy a URL from one browser window (or tab) to another. Alt+Tab, Ctrl+D, Ctrl+C, Alt+Tab, Ctrl+V. Done. When this confirmation box pops up, it changes “Done” into:

  1. “What the heck?”
  2. Press Enter before fully realizing that I just prevented myself from pasting into the email.
  3. Copy the draft email to the clipboard.
  4. Open Notepad.
  5. Paste the draft into Notepad.
  6. Close the draft email.
  7. Open a new email message or reply.
  8. Paste the draft back into the new email.
  9. Go find the URL I wanted to paste, and copy it to the clipboard.
  10. Attempt to paste the URL into the email.
  11. Read confirmation box and press the left arrow button to highlight the “Allow access” button.
  12. Nothing happens.
  13. Press the right arrow.
  14. Press Enter.

Whoever coded up this particular confirmation box got his arrow keys backwards.

I guess I am more secure with this new setup. It’s so painful that I’ll stop trying to paste things into my emails.

I understand that security is an issue, and to some extent IE has to protect users from themselves. But this is broken. Horribly. At minimum, the confirmation should have a link or checkbox that lets me turn the message off for pages that I identify. Like the “new email” page that I use dozens of times a day.

June 5th, 2008

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different than a screen scraper or a special-purpose spider that’s written to gather information from a small set of domains or very narrowly-defined pages.

The first six chapters explain why Web bots are useful, and walk you through the basics: downloading Web pages, parsing the contents, automating log in and form submission, and many other tasks that are involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more. Again, with many code examples.

The first two sections cover about three-fifths of the book. If you read and follow along by trying the code examples, you’ll have a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.

The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded, bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

June 4th, 2008

.NET regular expressions are slow?

One of the benefits–or curses, depending on my mood and how urgently I need a solution–of programming computers is that I often start working on one thing and end up getting sidetracked by a piece of the problem.

Take today’s distraction, for example. I’m writing a program to experiment with some text classification using the downloaded Wikipedia database. A major part of the program is extracting terms from the individual articles. Since I don’t need anything too fancy (at least not quite yet), I figured that this would be a perfect application for regular expressions. So I coded up my term extractor and let it loose on the Wikipedia data, figuring it’d take an hour or two to process the 13 gigabytes of text.

It’s a good thing I’ve gotten into the habit of instrumenting my code and periodically outputting timing information. It was taking an average of 10 milliseconds to process each Wikipedia article. The file has about 5.8 million articles. A quick back of the envelope calculation says that’ll take 58,000 seconds, or 16 hours. I’ve been wrong before, but not often by an order of magnitude.

After removing some unnecessary code and postponing some processing that can be done against the aggregate, I cut the required time per page to about 4 milliseconds. Better, but still too much. Through the process of elimination, I finally narrowed it down to the loop that extracts terms from the document text using a regular expression. Stripping everything but the critical loop, it looks like this:

static Regex reTerm = new Regex("\\p{L}+", RegexOptions.Compiled);
static Stopwatch reElapsed = new Stopwatch();

static void DoReParse(string pageText)
{
    reElapsed.Start();

    Match m = reTerm.Match(pageText);
    while (m.Success)
    {
        m = m.NextMatch();
    }

    reElapsed.Stop();
}

The regular expression, \p{L}+, says, “match a string of one or more contiguous Unicode Letter characters.” reElapsed is a Stopwatch object that accumulates the time taken between the Start and Stop calls. When my program is done, I divide the accumulated time by the number of documents processed to get the average time per page. For the first 100,000 documents in the Wikipedia download, it averages about 1.5 milliseconds per page, which is pretty close to what I thought all of the processing would take–parsing included.

That seemed unreasonable, so I wrote my own parser that reads each character of the document text and pulls out the substrings of contiguous Unicode letter characters. It’s more code than the regular expression version, but it’s still pretty simple:

static void DoJmParse(string pageText)
{
    // Time extraction with direct parsing
    jmElapsed.Start();

    int i = 0;
    int len = pageText.Length;
    bool inWord = false;
    int start = 0;
    for (i = 0; i < len; i++)
    {
        if (!inWord)
        {
            if (char.IsLetter(pageText[i]))
            {
                start = i;
                inWord = true;
            }
        }
        else if (!char.IsLetter(pageText[i]))
        {
            string term = pageText.Substring(start, i - start);
            inWord = false;
        }
    }
    if (inWord)
    {
        string term = pageText.Substring(start);
    }

    jmElapsed.Stop();
}

Both of the methods shown above are stripped-down versions of the code. The complete code adds the extracted terms to a list so that I can compare what’s extracted by each method. In all cases (the first 100,000 pages in the Wikipedia download), both methods extract the same terms. But the hand-coded parser takes an average of 0.25 milliseconds per page. The regular expression parser is six times slower.

I could understand my hand-coded routine being somewhat faster because it avoids the overhead of constructing Match objects and some of the other regular expression overhead. But six times? Something smells wrong. I have to think that I’m missing something.

I tried the usual things: fiddling with the regular expression options, calling Matches to get all of the matches in a single call, etc. All to no avail. Everything I tried had either no effect or increased the running time.

Then I thought that the difference had to do with Unicode combining characters or surrogate pairs. That is, characters that take up two code units rather than just one. My parser treats each code unit as a character, whereas the regular expression parser might be taking those multi-code unit characters into account. But a simple test doesn’t bear that out.

Consider this character string defined in C#:

const string testo = "\u0061\u0300xyz";

The first two characters, “\0061\0300″, define the character à, a lower-case “a” with a grave accent. From my understanding of Unicode, this should be treated as a single character. But when I run that string through the regular expression term extractor, I get two strings: “a”, and “xyz”. If the regular expression engine supports combining characters, I should get one string: “àxyz”. The documentation is suspiciously silent on the matter, but a close reading of Jeffrey Friedl’s Mastering Regular Expressions indicates that I shouldn’t expect the .NET regular expression engine to give me just the one string.

So I’m at a loss. I have no idea why the regular expression version of the term extractor is so much slower than my hand-coded version. For now, I’m going to abandon the regular expression, but I’d sure like to hear from anybody who can shed some light on this for me.

May 29th, 2008

What’s that house worth?

It’s property appraisal time again. We got our Notice of Appraised Value in the mail last month, and were shocked to learn that our property value increased by 16% last year. That’s on top of the 33% increase from the year before. At least, that’s what the local appraisal district would have you believe. Last year I missed the deadline for filing a protest. You can bet I won’t be missing this year’s.

I called a Realtor friend of mine to get comparable sales information, and then compared that with the proposed appraisal. The difference is quite remarkable. If I’m extremely generous, I can make the house’s market value almost equal to last year’s appraised value. When you take into account the comparable sales and subtract the cost of the many repairs we need to make, the house’s market value is about 1/3 less than the proposed appraisal.

One of the tools I tried to use for research is Zillow. This is a pretty cool mashup that shows satellite pictures with property lines and home prices, along with pertinent information about the houses. Zillow also gives a “Zestimate” of home values. I’m sure there’s some complicated formula for these estimates, but at least in my area I noticed that the estimates are much closer to the tax appraisals than to the sales of comparable homes.

In my experience, tax appraisals are trailing indicators: they continue to rise after home prices have leveled off following a boom, and they continue to fall (although not quite as much as they rise) after falling home prices have leveled off. The result is that sources like Zillow and others end up over- or under-reporting on market swings. For my area, Zillow is reporting values that are quite a bit higher than are justified by actual sales, indicating to me that it relies too heavily on tax appraisals.

As far as I’m concerned–especially in today’s market–the value Zillow reports is the “if everything goes exactly right and you find the perfect buyer” price. It’s a useful tool for comparison, but even then I’d look on it with a large dose of skepticism. As far as absolute values are concerned, though, Zillow’s numbers bear little resemblance to reality.