Jim’s Random Notes

Musings on technology and life

August 13th, 2008

The Government Rant

The best thing about our government is that it never ceases to amuse me. It’s also continuously annoying, but I guess you have to take the bad with the good. It’s not the government itself that amuses me so much, but rather the absurd things that our illustrious Congresscritters do and say in an attempt to garner votes. The most amusing (and also the most frustrating) thing is that constituents continue to be taken in. Rather than making an effort to come up with a solution ourselves, we argue over which totally unworkable plan our elected representatives should vote on. This gives Congress incredible leeway to do anything, and then spin their positions to best advantage.

Examples abound. Let’s look at some of the more recent.

Dependence on foreign oil

Our country’s dependence on foreign oil has been a major problem since the Arab oil embargo of 1973. In the 35 ensuing years, Congress has put forth all manner of proposals to “fix” the problem. We’ve funded research into solar, geothermal, tidal, and other natural energy sources, provided incentives and subsidies for domestic oil exploration, coal, ethanol, and all manner of questionable energy saving technologies. Today our government has much more control over energy policy than it did in 1973 and yet we’re more dependent on foreign oil than we were back then.

Seven administrations and countless members of Congress have been “doing something about the problem” for 35 years, and the problem has gotten worse. And yet the vast majority of Americans look to Congress and the President for a solution to high gas prices, all the while cheering for or ridiculing the laughably simple minded, short term proposals that are put forth. Our representatives, of course, couldn’t care less. All they have to do is make themselves look good to their own constituents. As long as they can keep the voting public believing that government is the solution, their jobs are secure.

Every thinking American (and, sadly, I’m beginning to believe that the number is falling fast) knows that the solution to our energy problems requires conservation, domestic oil and gas production, development of nuclear plants, exploitation of wind, thermal, solar, and other natural sources, and research into more energy efficient transportation and buildings. We won’t solve anything unless we address all of those areas. And it’s going to take time. Government has proven that it’s incapable of formulating and implementing a workable energy policy. It’s time to get government out of the picture. No more subsidies, incentives, or preferential treatment. Let the market decide.

Tax Rebates

This is one of the dumber things I’ve seen Congress do. And, yes, I realize that both the 2001 and the 2008 rebates were initially proposed by President Bush. That doesn’t relieve Congress of their complicity and their ultimate responsibility. The 2001 rebate was “justified” by a “budget surplus”–a surplus that anybody with a fifth grade education knew was an illusion. This year’s rebate was “justified” by the current economic situation. Congress would have you believe that a windfall of a few hundred dollars (up to $1,200, as I recall) would “stimulate the economy” and soften the recession. Any thinking person could have told you that the result would be a short term spike in consumer spending, followed by a quick return to normal. I can’t prove this yet, but I suspect that it also resulted in people putting down payments on things they can’t afford, figuring they’d find a way to make the monthly payments.

Congress, of course, knew that the tax rebates wouldn’t have an effect on the economy other than to increase the size of the federal debt. But that’s okay. What’s a few billion more dollars compared to the time honored tradition of buying votes? It is an election year, after all. Besides, it made for good press coverage and retail store managers drooled over the prospect of Christmas in July. The rebates seem so popular that Senator Obama proposed a $1,000 rebate to fight energy costs.

The reaction of those receiving the rebates was predictable. Most squandered it like drunken sailors on leave. Those few who know the names of their Congressmen or Senators might have lifted a glass in salute, but most just thanked the government for the handout. That’s what surprises me the most. It’s like having somebody cut your arm off at the shoulder and then thanking him when he returns the forearm and hand. Idiots.

The “mortgage crisis”

This one is fun because there are so many levels of idiocy. Lenders made high-risk loans to people who were demonstrably incapable of paying them back, then sold those loans to a government sponsored enterprise, which ultimately will be bailed out by taxpayers when the original borrowers default.

When borrowing money in good faith, both the lender and the borrower are responsible for ensuring that the money can be paid back. But when the lender is just a middleman who gets paid for making the loan and selling it to somebody else, there is little incentive for him to vigorously check the borrower’s documentation. On the contrary, there is ample incentive for him to be very creative in putting together a loan package, both by making the terms of the loan appear attractive to the borrower and by making the borrower look attractive to the third party who’s buying the loan. Sure, the middleman will eventually be found out, but the short term rewards are incredible.

And when the ultimate buyer is a government sponsored enterprise like Fannie Mae or Freddie Mac, there is almost no oversight. When you have, with government’s blessing, a virtual monopoly on the secondary mortgage market, you know that you’ll get bailed out if things go bad. So where’s the incentive to insist on real documentation for the loans that you buy?

I’m not an economist by any stretch of the imagination. I’m not even a financial analyst. But I’m not an idiot, either. I and many others saw this coming three years ago. Congress ignored the problem at the time, or discounted it as scare mongering. I’ll go out on a limb here and say that most of them probably knew what was coming. But they also knew that there wasn’t anything they could do about it and that bringing it up would be very unpopular. Our elected representatives are many things, but stupid is not one of them.

Now that the real extent of the problem has become apparent, Congress is all over it with one proposal after another. They’re “doing something about the problem.” They know that there are only two possible solutions: either pump money into Fannie Mae and Freddie Mac to keep them afloat, or cut them loose and let people finally endure the consequences of their actions. We know, just by the nature of elected officials, what their solution will be: another hundred billion dollars or more shelled out to fix a problem that Congress created in the first place. And We the Sheeple just nod our heads and thank Congress for taking care of us once again.

More is better?

All three of the above examples demonstrate extreme incompetence on the part of government. The Congress-proposed solution to those problems, as with all others, is more government regulation. As if making even more and larger bureaus, agencies, and departments will somehow transform government into an intelligent and effective organization. And we let them do it! When will people learn that the cure for a headache is to stop beating your head against the wall?

I used to get upset when I’d think about this stuff. I used to rant and carry on about the proper function of government, and how intrusive government is in our daily lives. But nobody listens. Nobody seems to care. I learned a while back to stop bashing my head against that particular pile of bricks. Now I just laugh and hope that the coming violent overthrow (which will almost certainly happen if government continues on its current path) doesn’t occur until after I’m gone.

August 10th, 2008

Paranoia versus productivity

We had an interesting discussion at the office about how much validation a collection type should do in its constructor. The key question, I think, came down to this:

If the constructor can determine that using the instantiated object will throw an exception, should the constructor fail rather than returning the instantiated object?

In other words, if I know that the instantiated object won’t work, shouldn’t I just throw the exception now, rather than let you be surprised later?

There are two extremes here: 1) the constructor should go to heroic efforts; and 2) let the buyer beware. I tend to lean towards putting the onus on the caller, figuring that whoever is instantiating the object knows what he’s doing. Let me provide an example.

Consider the .NET SortedList generic collection type. To do its job (that is, keep a collection of items sorted), it requires a comparison function. If you don’t specify a comparison function when you call the constructor, the collection uses the default comparison function for whatever type you specify as the key. This sounds simple enough, right? A list of employees that’s sorted by employee number, for example, would be defined like this:

SortedList<int, Employee> Employees =
    new SortedList<int, Employee>();

Since the System.Int32 type (which the C# int type resolves to) implements IComparable, everything works.

But imagine you have an EmployeeNumber type:

class EmployeeNumber
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public EmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }
}

Now, if you create a SortedList that’s keyed on that type, you’ll have:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

Allow me to show the entire program here, so we don’t get confused.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace genericsTest
{
    class EmployeeNumber
    {
        public string Division { get; private set; }
        public int EmpNo { get; private set; }
        public EmployeeNumber(string d, int no)
        {
            Division = d;
            EmpNo = no;
        }
    }

    class Employee
    {
        public string Name { get; set; }
        public Employee(string nm)
        {
            Name = nm;
        }
    }

    class Program
    {
        static SortedList<EmployeeNumber, Employee> Employees =
            new SortedList<EmployeeNumber, Employee>();

        static void Main(string[] args)
        {
            Employees.Add(new EmployeeNumber("Accounting", 1),
                new Employee("Sue"));
            Employees.Add(new EmployeeNumber("Dev", 2),
                new Employee("Jim"));
        }
    }
}

If you compile and run that program, you’ll see that it throws an exception when it tries to add the second employee to the list. The program fails because it can’t compare the two keys: the EmployeeNumber type doesn’t implement IComparable.
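For reference, the usual way to make that program work is to hand the constructor a comparison function explicitly. The following is only a sketch: EmployeeNumberComparer is a hypothetical class that isn’t part of the program above, and ordering by Division (ordinal) and then by EmpNo is an assumption about what the sort order should be.

```csharp
using System;
using System.Collections.Generic;

class EmployeeNumber
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public EmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }
}

// Hypothetical comparer (an assumption, not from the original program):
// orders keys by Division using an ordinal comparison, then by EmpNo.
class EmployeeNumberComparer : IComparer<EmployeeNumber>
{
    public int Compare(EmployeeNumber x, EmployeeNumber y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;   // null sorts before everything else
        if (y == null) return 1;
        int rslt = string.CompareOrdinal(x.Division, y.Division);
        return rslt != 0 ? rslt : x.EmpNo.CompareTo(y.EmpNo);
    }
}
```

With that class in place, the list would be created as new SortedList<EmployeeNumber, Employee>(new EmployeeNumberComparer()), and both Add calls should succeed.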

Those who lean towards the first extreme above will argue that the SortedList constructor should determine that the key type doesn’t implement IComparable, and should prevent you from instantiating the collection. It should throw an exception because it knows that trying to add items to the collection will fail.

The constructor could do this. It’s possible for the constructor to get the default comparer and call it. If the comparison function returns a value, then all is well. If it fails, then the constructor throws an exception saying, “Sorry, but you didn’t supply a comparison function.”
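Here is a rough sketch of what such a paranoid check might look like. To be clear, this is not how SortedList works: ParanoidCheck, KeyLooksComparable, and Create are all hypothetical names, and the sketch inspects the key type with reflection rather than actually invoking the default comparer.

```csharp
using System;
using System.Collections.Generic;

static class ParanoidCheck
{
    // Hypothetical helper: true if TKey advertises one of the interfaces
    // that the default comparer relies on.
    public static bool KeyLooksComparable<TKey>()
    {
        Type t = typeof(TKey);
        return typeof(IComparable<TKey>).IsAssignableFrom(t)
            || typeof(IComparable).IsAssignableFrom(t);
    }

    // Hypothetical factory that refuses to build the list when the
    // key type fails the check above.
    public static SortedList<TKey, TValue> Create<TKey, TValue>()
    {
        if (!KeyLooksComparable<TKey>())
        {
            throw new InvalidOperationException(
                "Sorry, but you didn't supply a comparison function.");
        }
        return new SortedList<TKey, TValue>();
    }
}
```

Under those assumptions, ParanoidCheck.Create<int, Employee>() returns a list, while the same call keyed on the EmployeeNumber class shown earlier throws before any item is ever added.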

The only problem with that scenario is that it’s wrong. Not wrong philosophically, but wrong in a very concrete sense. Extending my example will illustrate why.

Suppose you have two different types of employee numbers. Maybe an OldEmployeeNumber that looks like the one I defined above, and a NewEmployeeNumber that has different fields. Because you want to keep both employee number types in the same list, you define an abstract base class, EmployeeNumber (replacing the class of the same name above), from which they both can inherit. The definitions would look like this:

abstract class EmployeeNumber : IComparable
{
    // Some common employee number functionality goes here.

    public int CompareTo(object obj)
    {
        throw new NotImplementedException();
    }
}

class OldEmployeeNumber : EmployeeNumber, IComparable
{
    public string Division { get; private set; }
    public int EmpNo { get; private set; }
    public OldEmployeeNumber(string d, int no)
    {
        Division = d;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is OldEmployeeNumber)
        {
            var o2 = obj as OldEmployeeNumber;
            rslt = Division.CompareTo(o2.Division);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is NewEmployeeNumber)
        {
            // OldEmployeeNumber sorts before NewEmployeeNumber
            rslt = -1;
        }
        return rslt;
    }
}

class NewEmployeeNumber : EmployeeNumber, IComparable
{
    public string Country { get; private set; }
    public decimal EmpNo { get; private set; }
    public NewEmployeeNumber(string c, decimal no)
    {
        Country = c;
        EmpNo = no;
    }

    int IComparable.CompareTo(object obj)
    {
        int rslt = 0;
        if (obj is NewEmployeeNumber)
        {
            var o2 = obj as NewEmployeeNumber;
            rslt = Country.CompareTo(o2.Country);
            if (rslt == 0)
                rslt = EmpNo.CompareTo(o2.EmpNo);
        }
        else if (obj is OldEmployeeNumber)
        {
            // NewEmployeeNumber sorts after OldEmployeeNumber
            rslt = 1;
        }
        return rslt;
    }
}

Yeah, I know. That’s quite a mouthful.

The EmployeeNumber base class implements the IComparable interface, but its implementation just throws NotImplementedException. Furthermore, the class is marked as abstract to prevent it from being instantiated. Only derived classes can be instantiated.

The derived classes each explicitly implement the IComparable interface. The company-defined sorting rules are that old employee numbers always sort in the list before new employee numbers. Within the same type, the numbers are sorted using their own rules. [Note here that my CompareTo implementations aren't terribly robust. They'll return zero (equal) if the object passed is not of a known type, and they'll also return zero if the passed object is null, where the IComparable convention is that any object compares greater than null. But those details aren't terribly relevant to the example.]

Now, the Employees list is created in exactly the same way:

SortedList<EmployeeNumber, Employee> Employees =
    new SortedList<EmployeeNumber, Employee>();

We can then add items to the list:

Employees.Add(new NewEmployeeNumber("USA", 2.002m),
    new Employee("Jim"));
Employees.Add(new OldEmployeeNumber("Accounting", 1),
    new Employee("Sue"));
Employees.Add(new OldEmployeeNumber("HR", 3),
    new Employee("Dana"));

If you make those changes and run the program, you’ll see that it does indeed run, and work as expected, and I didn’t change the comparison function that the constructor sees.

If SortedList had attempted to protect me from myself–that is, call the default comparison function and throw an exception because the comparison function had failed–then this final code would not work. By trying to protect me from myself, it would have prevented me from doing what I wanted to do.

Understand, the above is something of a contrived example. I certainly can’t imagine implementing the employee list that way, even if I did have different employee number types. But somebody else might think it’s a perfectly reasonable thing to do. The point is that there could be very good reasons for instantiating a keyed collection with a key type that does not have a valid comparison function. The constructor cannot know if comparisons will fail.

Which brings us back to the original question: how hard should a collection class (or any library object) try to prevent you from instantiating an object that will fail? In my opinion, the constructor should instantiate the object if the immediate parameters look reasonable. My reasoning is that it’s extremely difficult, if not impossible, to know how the caller will be using the class. As you saw above, making broad assumptions about types in a polymorphic environment can be fatal.

This reasoning extends far beyond the question of how a collection class’s constructor should behave. As programmers, we have to strike a balance between paranoia and productivity. We have to decide daily how much trust to put in the code that calls our methods, and how much we can depend on the code we call. Do we write classes that hold the programmer’s hand to help him across the street, or do we provide a “walk” signal and a warning that says, in effect, “If you cross on red, all bets are off”?

August 8th, 2008

Hey, you deleted my files!

We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site.  The news that our crawler is capable of deleting files was quite a surprise to us.  Like other crawlers, ours just downloads HTML files, extracts links, and then visits those links.  There is no “delete a file” logic in there.  But if the crawler stumbles upon a link whose action is to delete a file, then visiting that link will indeed delete the file.

Further investigation in this particular case revealed a file management page that includes, among other things, links that have the form:  www.example.com/files/?delete=filename.txt.  Surprisingly enough, clicking on that link deletes the file.  The file management page is not protected by a password, nor is there any kind of confirmation displayed before the file is permanently deleted.
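To see why a crawler can’t tell that link apart from any other, here is a deliberately naive sketch of the link-extraction step. It is not the actual crawler’s code: LinkExtractor and its regular expression are hypothetical, and a real crawler would use a proper HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Naive pattern (an assumption for illustration): matches only
    // double-quoted href attributes.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns every link on the page, resolved against the page's own URL.
    public static List<Uri> ExtractLinks(Uri pageUrl, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in Href.Matches(html))
        {
            Uri link;
            if (Uri.TryCreate(pageUrl, m.Groups[1].Value, out link))
            {
                links.Add(link);
            }
        }
        return links;
    }
}
```

Nothing in that loop knows, or could know, what the server will do when a given URL is requested. A link that deletes a file comes out of it looking exactly like a link to the About page.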

Examining the logs, we saw accesses from other search engine crawlers.  We also learned from the Webmaster that some time back, a kid had “hacked in” to the site and deleted a bunch of files.

I’m a little surprised that anybody would create such a page and not provide any protection.  I’m very surprised to find out that a supposedly professional Web developer would do such a thing and not learn the lesson when a random surfer came in and deleted files.  And I’m shocked that, even after we explained this to the Webmaster, he insists that we can take this as an opportunity to learn from our “mistake” and “fix” the crawler so that it doesn’t happen again.

It’s unfortunate that our crawler visited those links, causing the files to be deleted.  But the mistake was on the part of the person who posted those destructive links.  The crawler was operating exactly as it should.  Exactly, in fact, as every major search engine crawler acts.  It’d be nice if we could imbue the crawler with enough intelligence to “understand” Web pages and know in advance what the effects of clicking a link will be.  But that kind of machine intelligence is far, far in the future.

If you post something on the Web, it will be found, unless you take active measures to protect it.  Posting a destructive link on an unprotected page and then blaming somebody else when the link is clicked by an “unauthorized” person is akin to running out into a busy street and then blaming your injuries on the driver of the bus that hits you.

August 4th, 2008

Multicore Crisis?

There’s been some talk recently of the next “programming crisis”: multicore computing. I’ll agree that we should be concerned, but I don’t think we’re anywhere near the crisis point. Before I address that specifically, I think it’s instructive to review the background: why multicore processors exist, how they affect existing software, and the issues involved in writing code to make use of multiple cores.

Moore’s Law has been quoted and misquoted so often that it’s almost a cliché. His original statement was simply an observation on the rate at which transistor counts on integrated circuits were increasing, along with a prediction that the trend would continue for at least 10 years. That was 1965. The trend has continued, and there’s no indication that it will slow.

Some people think Moore’s Law has become something of a self-fulfilling prophecy: because we believe that it’s possible, somehow we strive to make it so. One wonders what would have happened if Moore had said that he expected the rate of growth to increase. Would transistor densities have grown even faster than the exponential rate we’ve actually seen?

Self-fulfilling prophecy or not, it’s almost certain that the trend in increasing transistor densities will continue (it has through 2007) and that as a result we’ll get ever more powerful CPUs as well as faster, higher-capacity RAM. Absolute processor speed as measured by clock rate will continue to increase, but not at the astounding rates that we saw up to 2005 or so. Quantum effects and current leakage have put a little damper on the rate of growth there. Better materials will solve the problem–are solving the problem–but absent a fundamental breakthrough by the chemists working on the problem, clock speeds won’t be doubling every 18 months like they had been in the recent past. The Clock Speed Timeline graph makes this quite evident.

Today’s trend is towards multiple cores on a single processor, running at a somewhat slower clock rate. The machine I’m writing this on, for example, has a quad-core Intel Xeon processor running at 2 GHz. The clock speed is somewhat slower than you can get in a high-end Pentium, but the multiple cores provide more total computing power. Quad core processors today are quite common. Intel demonstrated an 80-core chip in February of 2007, and promised to deliver it within five years. I fully expect to have a 256-core processor in my desktop computer ten years from now.

The trend towards multiple cores and very slowly increasing clock rates has some interesting ramifications for software developers. In the past, we have depended on more RAM and faster processors to give us some very nice performance boosts. All indications are that the amount of available RAM and the size of on-chip caches will continue to grow, but we can’t count on processor speed doubling every 18 months or so. Unless we learn to write programs that use multiple cores, we will soon reach a very real performance ceiling.

Not all applications can benefit from multiple cores, but you’d be surprised at how many can. And even in those cases when a single program can’t make use of multiple cores, users still benefit from having a multicore processor because the machine is better at multi-tasking. Imagine running four virtual machines on one computer, for example. If the computer has a single processor core, all four virtual machines and all of the operating system services share that one core. On a quad-core processor, the work load is spread out over all four cores. The result is more processor cycles per virtual machine, meaning that all four virtual machines should run faster.

Software systems that consist of multiple mostly-independent processes can make good use of multicore processors without any modification. Consider a system consisting of two services that are constantly running. On a single-core computer, only one can actually be working at a time. You could almost double performance simply by upgrading to a dual-core processor. Such software systems are quite common, and they require no code changes in order to benefit immediately from the new multicore processor designs.

Contrary to popular belief, writing code that is explicitly multi-threaded–designed to take advantage of multiple cores–isn’t necessarily a huge step up in complexity. Such code can be much more complex than single-threaded code, but it doesn’t have to be. Some programs are more multi-threaded than others. I’ve found it useful to think of programs in terms of the following four levels of complexity:

  1. No explicit multi-threading.
  2. Infrequent, mostly independent asynchronous tasks.
  3. Loosely coupled cooperating tasks.
  4. Tightly coupled cooperating tasks.

Obviously, it’s impossible to draw exact boundaries between the levels, and many programs will use features found in two or more of the levels. In general, I would classify a program by the highest level of multi-threading features that it uses.

Level 1 requires little in the way of explanation. This is the most common type of application in use today. In a batch mode program, execution proceeds sequentially from start to finish. In a GUI program, user interface events and processing execute on the same thread. This type of application has served us well over the years.

Most Windows programmers have some experience with the next level of complexity. A GUI application that performs background processing and periodically updates the display is an example of this type of program. Typically, the program starts the background process, which from time to time raises events that the GUI thread handles by updating the display. Data and process synchronization between tasks is limited to the event handlers that respond to asynchronous events. Modern development environments make it very easy to create such programs. These programs can benefit from multiple processor cores because the background thread can operate independently of the GUI thread, making the GUI thread much more responsive.

I have found the third level of complexity–loosely coupled cooperating tasks–to be a very useful and relatively simple way to make use of multiple cores. The idea is to construct a program that operates in an assembly line fashion. For example, consider a program that gathers input, does some complex processing of the input data, and then generates some output. Many such programs are processor bound. If you structure the program such that it maintains an input queue, a pool of independent worker threads, and an output queue, then there is little danger of running into the problems that often plague more complex programs. You have to supply synchronization (mutual exclusion locks, or similar) on the input and output queues, but the worker threads operate independently. Using this technique on a quad-core processor, it’s possible to get an almost 4x increase in throughput over a single-core processor, with very little danger of running into resource contention issues.
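A minimal sketch of that assembly-line structure follows. The names (Pipeline, Run) are hypothetical, and it makes a simplifying assumption: all of the input is queued before the workers start, so a worker can simply exit when it finds the queue empty instead of waiting for more work to arrive.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class Pipeline
{
    // Applies 'work' to every input item using 'workerCount' threads.
    // Results come back in completion order, not input order.
    public static List<TOut> Run<TIn, TOut>(
        IEnumerable<TIn> inputs, Func<TIn, TOut> work, int workerCount)
    {
        var input = new Queue<TIn>(inputs);   // shared input queue
        var output = new List<TOut>();        // shared output collection
        var workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                while (true)
                {
                    TIn item;
                    lock (input)              // lock only long enough to dequeue
                    {
                        if (input.Count == 0) return;
                        item = input.Dequeue();
                    }

                    TOut result = work(item); // the expensive part holds no lock

                    lock (output)             // lock only long enough to store
                    {
                        output.Add(result);
                    }
                }
            });
            workers.Add(worker);
            worker.Start();
        }

        foreach (Thread worker in workers) worker.Join();
        return output;
    }
}
```

A call such as Pipeline.Run(items, ProcessItem, 4), where ProcessItem stands in for your own processing function, would spread the work over four threads. The only shared state is the two collections, and each lock is held just long enough to move one item in or out.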

Written correctly, programs that have multiple tightly-coupled cooperating tasks make the best possible use of processor resources. However, explicitly coding thread synchronization is perhaps the most difficult type of programming imaginable. Forgetting to lock a resource before accessing it can lead to unexplained crashes or data corruption. Holding a lock for too long can create a performance bottleneck. Locks that are too granular increase complexity and also the chance for deadlock situations. Locks that are not granular enough will stall worker threads. Race conditions are endemic. Assuming you get such a program working, even a small change will often cause new, unanticipated problems. Writing this kind of code is hard. You’re much better off re-thinking your approach to the problem and casting it as a Level 3 problem. Whatever price you pay in performance will be returned many fold in increased reliability and reduced development time.
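Even the simplest shared resource shows the problem. The sketch below uses a hypothetical SharedCounter class (the names are assumptions made for illustration) in which several threads increment one integer. With the lock in place the final total should always equal the number of threads times the number of increments per thread. Take the lock out of Increment and the total can come up short, because count++ is a read, an add, and a write rather than a single atomic step.

```csharp
using System.Collections.Generic;
using System.Threading;

class SharedCounter
{
    private readonly object gate = new object();
    private int count;

    public void Increment()
    {
        lock (gate) { count++; }   // remove this lock to see lost updates
    }

    public int Count
    {
        get { lock (gate) { return count; } }
    }

    // Starts 'threadCount' threads that each call Increment
    // 'incrementsPerThread' times, then returns the final total.
    public static int Hammer(int threadCount, int incrementsPerThread)
    {
        var counter = new SharedCounter();
        var workers = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var worker = new Thread(() =>
            {
                for (int n = 0; n < incrementsPerThread; n++)
                    counter.Increment();
            });
            workers.Add(worker);
            worker.Start();
        }
        foreach (Thread worker in workers) worker.Join();
        return counter.Count;
    }
}
```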

If you’re writing a Level 3 or Level 4 program, you should very seriously consider using an existing multi-tasking library if at all possible. Doing so will require that you think about your problem differently, but you leverage a lot of known-working code that is almost certainly more robust in all ways than what you’re likely to write yourself in the time allotted. Two good examples of such libraries are the Parallel Extensions to .NET 3.5 and the Java Parallel Processing Framework. Such libraries exist for many other programming environments. Although still in their infancy, these libraries promise to greatly simplify the move to multicore. If you’re contemplating development of a program that makes good use of multiple cores, you definitely should learn about any parallel computing libraries that support your platform.
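To give a feel for what such a library buys you, here is a small sketch of a parallel loop. It assumes the Parallel.For method as it appears in the System.Threading.Tasks namespace of later .NET releases; the namespace and details in the 3.5 preview may differ. ParallelDemo and Squares are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelDemo
{
    // Fills an array with the squares of 0..n-1. The library decides how
    // to divide the iterations among the available cores.
    public static long[] Squares(int n)
    {
        var results = new long[n];
        Parallel.For(0, n, i =>
        {
            // Each iteration writes to its own slot, so no lock is needed.
            results[i] = (long)i * i;
        });
        return results;
    }
}
```

There is no thread creation, no queue, and no explicit lock in that code. The catch is the one mentioned above: you have to arrange the problem so that the iterations really are independent of one another.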

So, back to the crisis. Bob Warfield over at SmoothSpan Blog has had and continues to have quite a lot to say about it, and many others share his sentiments. I, on the other hand, don’t think we’re anywhere near the crisis point. Nor do I think we’re likely to get there. Whereas it’s true that most current software isn’t multicore ready, software developers have understood for several years now that they need to begin writing applications that take advantage of multiple processor cores. It’s likely that some shops have taken an ad hoc approach to the problem, and they’re probably suffering with the issues I pointed out above. It’s also likely that many (and I would hope, most) development shops have done the prudent thing and adopted a parallel computing library that takes care of the difficult areas, leaving the programmers to worry about their specific applications. Doing so is no different than adopting an operating system, development environment, GUI library, report generator, or any other third party component–something that development shops have long experience with.

In short, the multicore “crisis” that the doomsayers are warning us about is almost a non-issue. It’s going to require a small amount of programmer retraining and there will undoubtedly be a temporary plateau in the rate at which our processing of data increases, but in a very short time we’ll again have mainstream applications that push all this fancy hardware to its limits.

July 29th, 2008

The Ultimate Development Machine?

In Understanding the Hardware, Jeff Atwood describes his “best bang for the buck developer x86 box,” at a cost of about $1,100.  The system he describes is quite a nice development machine, although it’s probably overkill for a lot of developers.  Seriously.  How many developers do you know who really need a 10,000 RPM drive and a screaming video card?

Surprisingly, he doesn’t mention what case he’s going to put all that fancy hardware in.  I’d really like to know.  I’ve mentioned before that I like the Antec Sonata cases because they’re very quiet.  But with their fans, they almost certainly create more noise than whatever Jeff’s using for passive cooling.

My development machine these days is quite a bit different from what he describes, but I realize that I have somewhat different needs.  I’ll give you a quick rundown.

Start with a Dell Precision 490 case, with power supply and motherboard.  These can be had for under $200 on eBay, or from Dell surplus suppliers.  They’re starting to become a bit scarce on the surplus market now, because most have gone off lease and Dell doesn’t make that model anymore.  One drawback to this system is that it creates a bit more noise than the Antec case, but I’ve found that I can accept a certain amount of noise.  And it’s hard to beat the price.

Add a quad-core Xeon E5335 processor running at 2 GHz.  Granted, 2 GHz isn’t exactly blindingly fast, but it’s quite well suited to the work that I do.  Unlike the code that most developers write, the code I’m working on does benefit from multiple cores.  The motherboard in this 490 has two processor slots, so I could potentially run two of those quad-core Xeons.  And I can make good use of all eight cores.  The Xeon is pretty pricey if you buy it new.  You might consider picking one up on eBay.  We’ve purchased dozens of these processors on eBay and haven’t had a problem with any of them.

I would have been shocked a year ago if somebody told me that I’d have a need for more than 8 gigabytes of RAM.  But the stuff I’m doing is memory hungry in the extreme.  This is another reason we go for the Dell 490 motherboard:  it was one of very few that supported 16 gigabytes a year ago, and I use every bit of it.  At about $80 for four gigabytes, memory is still a bit expensive.  But the stuff we’re working on really does need all the memory it can get.

I also use a lot of disk space.  Hard disk speed is important, but capacity is way more important to me.  I’ve loaded the box with two 7,200 RPM 750-gigabyte drives.  Terabyte drives are available, but at a premium.  The 750 GB drives go for about $120, or 16 cents per gigabyte.  A terabyte drive will run about $220, or 22 cents per gigabyte.  If I need more storage, I’ll find a way to shoehorn a third drive into this Dell box.

I’m not writing computer games, and I’ve turned off all the fancy Windows Aero features that do nothing but annoy me and chew up system resources.  My video card is a low-end ATI Sapphire 1650 for which we paid less than $50.  It drives my 24″ LCD at 1920 x 1200 resolution just fine.  I have no need for really high end video performance.

When you add everything up and throw in the DVD burner, we can put together one of these machines for under $1,500, which isn’t very much more than Jeff’s system once he adds the case and DVD.

I realize that I’m somewhat out of the ordinary, working with programs that require multiple cores as well as enormous amounts of memory and disk space.  I suspect that my ultimate development machine would be complete overkill for most developers.  But I find it interesting to compare what other developers need against what I’m using.

Do you have an ultimate developer machine?  Drop me a note.

An aside:
Jeff also uses the word commodification, as in, “This industry was built on the commodification of hardware. If you can snap together a Lego kit, you can build a computer.”  I had to read that twice before I realized that he wasn’t talking about turning hardware into toilets.  Commodification?  Please stop.

July 24th, 2008

Is that code really from Sun?

I updated my Java runtime the other day, and now every time I open a new tab in Internet Explorer, I get this message box:

It looks like somebody at Sun forgot to sign their update agent.  At least, I think this control came from Sun.  But there’s no way to be sure, is there?  Do I blindly assume that this really is from Sun and that they made a mistake in generating the build, or do I do the prudent thing and permanently disallow it?

In a security conscious world, there’s no excuse for a major player like Sun to have released something with this error.  One wonders, if an obvious bug like this makes it through their quality control, what other less obvious nasties are lurking in the code.

To heck with it.  If Sun wants to push their software on me, they’ll have to get it right.  I’m going to disallow the update agent.  If I ever need to update my Java runtime, I guess I’ll just have to do it manually.

July 22nd, 2008

Charlie versus the Wildlife. Again.

Every time I get to thinking that maybe Charlie’s learned not to mess with the local wildlife, he does something incredibly stupid to set me straight. Last night I let him out just before going to bed. He stood there by the door for a minute and then took off around the corner after something. 30 seconds later he was running across the yard with his face in the grass, and the unmistakable aroma of skunk assaulted my olfactory system.

Yes, Charlie got another skunk. More correctly, the skunk got him. Not only does the dog stink (he’s at the vet now, getting a skunk bath), but the skunk let loose around the side of the house–right next to the air conditioning unit. The house reeks. I’m at home today with the windows open and the whole-house fan pulling in the 95-degree air, hoping to get rid of that smell.

This is Charlie’s second skunk. I had hoped that after the last time he would have learned that the stinky black kitty with the white stripe is strictly hands-off. Sadly, he seems to be a slow learner.

July 21st, 2008

C# and .NET: What’s Next?

About 10 days ago, MSDN’s Channel 9 site released an hour-long video entitled Meet the Design Team, which talks in very vague terms about upcoming features in C# 4.0.  You’ll learn that the language will include more dynamic constructs and built-in support for multiple cores.  Honestly, that’s about all you’ll learn from watching the video.  Granted, either one of those broad features implies many changes to the language and to the underlying runtime.

Improvements to the language are all well and good, but given the choice I’d rather have them address some fundamental runtime issues:  the two-gigabyte limit, and garbage collection.  Both of these issues have caused me no end of grief over the past year.

All things considered, the .NET garbage collector is a definite win.  It handles the majority of memory management tasks much better than most programmers.  It’s not impossible to create a memory leak in a .NET program, but you really have to try.  Unfortunately, garbage collection is not free.  You’ll find that out pretty quickly if you write a long-running program that does a lot of string manipulation.  For example, take a look at this clip, which shows bandwidth usage from a Web crawler written in .NET:

Those times of zero bandwidth usage you see coincide with the garbage collector pausing all the threads to clean things up.  We lose somewhere around 10% of our potential bandwidth usage due to garbage collection.  This particular graph is from a dual-core machine.  The graph looks the same on a quad-core processor.

Obviously, they’ll have to do something about the garbage collector if they’re going to support multiple cores.  No amount of multi-core support in the language or in the runtime will do me a bit of good if every core stops whenever the garbage collector kicks in.

I’ve mentioned the .NET two-gigabyte limit before.  The 64-bit runtime has access to as much memory as you can put in a machine, but no single object can be larger than two gigabytes.  When you’re working with data sets that contain hundreds of millions of items, that’s just not acceptable.  When $2,000 will buy you a machine with 16 gigabytes of memory, it’s time that the .NET runtime give me the ability to allocate an object that makes use of that capacity.

I’m happy to see the team continue improving the C# language.  I’ll undoubtedly find many of their improvements useful.  But no amount of language improvement will increase my productivity if I’m hamstrung by the absurd limit on individual object size and the garbage collector continues to eat my processor cycles.

Unfortunately, we’ll have to wait a bit longer before we know everything that will be included in the next versions of C# and .NET.  Microsoft is keeping pretty quiet, apparently in an attempt to make a big splash at the Professional Developers Conference in October.

Anybody care to pay my way to the conference?

July 19th, 2008

More URL Filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components–URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely.  Or until our self-imposed URL length limit kicks in.  Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating http://www.example.com/mp3/song.mp3) is the right thing to do.

A single repeated component is by far the most common, but we frequently see two or three repeated components:

http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3
http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3

It’s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial.  But it’s not a good general solution.  For example, this blog (and many others) uses URLs of the form blog.mischel.com/yyyy/mm/dd/post-name/, so the entry for July 7 is blog.mischel.com/2008/07/07/post-name/.  Globally applying the repeated component removal rules would break a very large number of URLs.
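Here’s a minimal sketch of that kind of regular expression, not the crawler’s actual code. The class and method names (RepeatedPathFilter, CollapsePath, CollapseUrl) are made up for illustration, and the limit of three segments per repeated group is an assumption based on the examples above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RepeatedPathFilter
{
    // One to three path segments followed immediately by one or more copies
    // of themselves. The lookahead keeps "/mp3/mp3x" from counting as a repeat.
    private static readonly Regex Repeats =
        new Regex(@"((?:/[^/]+){1,3}?)(?:\1(?=/|$))+", RegexOptions.Compiled);

    // Replaces each run of repeated segments with a single copy.
    public static string CollapsePath(string path)
    {
        return Repeats.Replace(path, "$1");
    }

    // Applies CollapsePath to the path part of an absolute URL.
    // The fragment, if any, is dropped.
    public static string CollapseUrl(string url)
    {
        Uri uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Authority)
            + CollapsePath(uri.AbsolutePath)
            + uri.Query;
    }
}
```

With this sketch, all of the mp3 examples above collapse to a single copy of the repeated group. It also shows the problem: CollapsePath("/2008/07/07/post-name/") returns "/2008/07/post-name/", which is a different (and probably nonexistent) page.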

This is one of the many URL filtering problems for which there is no good global solution.  Sometimes, repeated path components are legitimate.  We can use some heuristics based on the crawl history (e.g. if /mp3/song.mp3 generates /mp3/mp3/song.mp3) to identify problem sites, but in the end we have to write domain-specific filtering rules.  Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.

Another per-domain problem is that of session IDs encoded within the path, or with uncommon parameter names.  For example, we can easily identify and remove common ids like PHPSESSID= and sessionid=, but these URLs will escape the filter unscathed:

http://www.example.com/file.html?exSession=123456xyzzy
http://www.example.com/file.html?exSession=845038plugh

http://www.example.com/coolstuff/123456xyzzy/index.html
http://www.example.com/coolstuff/845038plugh/index.html

It’s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn’t quite that smart, though, and making it that smart is very difficult.
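To make the easy half concrete, here’s a rough sketch of a filter that strips well-known session parameters. The names SessionIdFilter and Strip are hypothetical, and the two-entry list of parameter names is only an assumption taken from the examples above. A real list would be much longer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class SessionIdFilter
{
    // Assumed list of common session parameter names.
    private static readonly string[] KnownNames = { "PHPSESSID", "sessionid" };

    // Removes known session parameters from the query string.
    public static string Strip(string url)
    {
        int q = url.IndexOf('?');
        if (q < 0)
            return url;   // no query string, nothing to do

        List<string> kept = new List<string>();
        foreach (string pair in url.Substring(q + 1).Split('&'))
        {
            string name = pair.Split('=')[0];
            bool known = Array.FindIndex(KnownNames,
                n => string.Equals(n, name, StringComparison.OrdinalIgnoreCase)) >= 0;
            if (!known)
                kept.Add(pair);
        }

        string basePart = url.Substring(0, q);
        return kept.Count == 0
            ? basePart
            : basePart + "?" + string.Join("&", kept.ToArray());
    }
}
```

A URL with ?PHPSESSID=... comes back clean, but the exSession URLs pass through untouched, and nothing here even looks at an ID embedded in the path. That’s the part that needs per-domain rules.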

Developing a system that automatically identifies problem URLs and generates filtering rules is a “big-R” research project–something that we don’t have time to work on at the moment.  Even if we were to develop such a thing, it’d be pretty fragile and would require constant monitoring and tweaking.  If a site’s URL format changes (something that happens with distressing frequency), the filtering rules become invalid.  Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.

When I started this project, I knew that crawling the Web was non-trivial.  But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.

July 18th, 2008

Odds ‘n Ends

  • Tom’s Hardware is running a review of solid state drives that compares the latest generation of SSDs against current mechanical drive technology.  It’s little surprise that SSDs are in general faster than hard drives.  What I found surprising is that some SSDs actually require more power than hard drives.  Not the newer crop, though.  Even the least efficient SSD has better performance-per-watt numbers than the most efficient hard drive.  And the OCZ SATA II is very impressive.
  • Solid state drives are still very expensive, though.  The 64 gigabyte OCZ SATA II will cost you about $17 per gigabyte.  That’s the high end.  Typical SSD prices are in the $10 per gigabyte range.  That’s a whole lot more than you’ll pay for a mechanical hard drive.  You can pick up a 320 GB notebook drive for $110, or about 34 cents per gigabyte.  It’s nice to know that SSD is coming along, but it’ll be a year or two before I can justify replacing my notebook’s hard drive.
  • If you’re interested in using Windows Server 2008 as a workstation operating system, you should visit win2008workstation.com.  But be careful.  The site has a lot of good information, but there’s a large hacker/cracker component that sees nothing wrong with sharing component files.  I wouldn’t trust downloading anything pointed to by forum posts.
  • If you’re in the market for a “dual core” laptop, be careful.  Intel made a “Core Duo” line of processors which is in effect two Pentium M processors on one die.  These are 32-bit processors.  You probably want a machine that has a “Core 2 Duo” processor–a 64-bit part.  I can’t see any reason why a typical user would want to buy a machine with a 32-bit processor.
  • Also on the subject of laptop computers, don’t assume that you’re getting the best price by buying on eBay.  I compared prices for Dell laptops on eBay and at Dell Outlet.  The outlet prices compare quite favorably with eBay, the only drawback being that you’ll have to pay sales tax if you buy from Dell.  Still, I found plenty of eBay sales where the buyer paid more than what he would have paid at the outlet–including tax.  Do your research.

 

July 16th, 2008

Exceeding the Limits

We generate a lot of data here, some of which we want to keep around. Yesterday I noticed that I was running out of space on one of my 750 GB archive drives and figured it was time to start compressing some of the data. The data in question is reasonably compressible. A quick test with Windows’ .zip file creator indicated that I’d get a 30% or better reduction in size.

The data is generated on a continuous basis by a program that is always running.  The program rotates its log once per hour, and the hourly log files can be anywhere from 75 to 200 megabytes in size.  Figuring I’d reduce the number of files while also compressing the data, I wrote a script that uses Info-ZIP’s Zip utility to create one .zip file for each day’s data.

And then I hit a wall.  It seems that the largest archive that Zip can create is 2 gigabytes.  As their FAQ entry about Limits says:

While the only theoretical limit on the size of an archive is given by (65,536 files x 4 GB each), realistically UnZip’s random-access operation and (partial) dependence on the stored compressed-size values limits the total size to something in the neighborhood of 2 to 4 GB. This restriction may be relaxed in a future release.

With 24 files ranging in size from 75 to 200 megabytes, it’s inevitable that some days will generate more than 3 gigabytes of data.  At about 30% compression, that’s not going to fit into the 2 GB file.

My immediate solution will be to compress the files individually.  It’s less than ideal, but at least it’ll give me some breathing room while I look for a new archive utility.
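One way to compress the files individually is sketched below. It uses .NET’s GZipStream rather than Zip, which is a substitution on my part, and the names LogCompressor and CompressFile are made up for illustration. Each hourly file is at most a couple hundred megabytes, so none of them comes anywhere near the archive limit.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class LogCompressor
{
    // Writes a gzip-compressed copy of the file next to the original
    // (path + ".gz") and returns the new file's name.
    // The original file is left in place.
    public static string CompressFile(string path)
    {
        string target = path + ".gz";
        using (FileStream input = File.OpenRead(path))
        using (FileStream output = File.Create(target))
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }
        return target;
    }
}
```

A script would call CompressFile for each hourly log and delete the original only after checking that the .gz file was written.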

I’m surprised that in today’s world of cheap terabyte-sized hard drives, the most popular compression tools have the same limitations they had 20 years ago.  Every modern operating system has supported files larger than 4 gigabytes for at least 10 years.  It’s time our tools let us use that functionality.

I’m in the market for a good command-line compression/archiver utility that has true 64-bit file support.  Any suggestions?

July 14th, 2008

Going Too Far Back

The other day I intended to close a Remote Desktop window and instead hit the Close button (the X on the right of the window’s caption bar) on the console window running our data broker. Nothing like an abnormal exit to bring the whole house of cards tumbling down.

So I went looking for a way to prevent that particular problem from occurring again. Disabling the Close button is pretty easy. In fact, there are at least two ways to do it. Neither is ideal.

The Close button is on the window’s system menu. You can get a handle to the system menu by calling the GetSystemMenu Windows API function. In addition to the buttons on the window’s caption bar, this menu also contains the menu items you see if you click on the box at the left of the window:

Given a handle to the system menu, you have (at least) two choices:

  1. Call EnableMenuItem to disable the caption bar’s Close button.
  2. Call DeleteMenu to remove the Close item from the menu. Doing so will also disable the Close button on the caption bar.

The second option looks like the best, because it prevents me from hitting the Close button, and also prevents me from inadvertently clicking the Close menu item when I’m going for Edit. The C# code for the second option looks like this:

using System;
using System.Runtime.InteropServices;

class Program
{
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern IntPtr GetConsoleWindow();

    [DllImport("user32")]
    private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

    [DllImport("user32")]
    private static extern bool DeleteMenu(IntPtr hMenu, uint uPosition, uint uFlags);

    // Tells DeleteMenu that uPosition is a zero-based item position
    private const uint MF_BYPOSITION = 0x0400;

    static void Main(string[] args)
    {
        // Get the console window handle
        IntPtr winHandle = GetConsoleWindow();

        // Get the system menu (false = return the current menu, don't reset it)
        IntPtr hmenu = GetSystemMenu(winHandle, false);

        // Delete the Close item, which is at position 6 in the menu
        DeleteMenu(hmenu, 6, MF_BYPOSITION);

        // rest of program follows
    }
}

That works well, as you can see from this screen shot:

But there’s a problem. To restore the menu when your program is done, you’re supposed to call GetSystemMenu and pass true for the second parameter, telling it to restore the menu, like this:

GetSystemMenu(winHandle, true);

The result is probably not what you expect:

The system didn’t revert to the previous menu, but rather to the default system menu–the one created for every window. The Edit, Defaults, and Properties items that cmd.exe adds to the menu are gone.

Since I can’t reliably restore the menu after deleting an item, I figured I’d call EnableMenuItem to disable the Close item. Unfortunately, that doesn’t appear to be possible. At least, I haven’t been able to make it work. Since I often need the Edit menu item even after the program exits, I’m going with the first option and hoping that I don’t hit the Close menu item by mistake when going for the Edit menu while the program is running.
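For reference, the first option might look something like the sketch below. The class and method names are hypothetical. The constants are the standard Win32 values, but how a console window reacts is another matter: graying SC_CLOSE normally disables the caption bar’s Close button, and I can’t promise it does anything for the Close menu item on a console window.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only. Windows-specific.
public static class ConsoleCloseButton
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr GetConsoleWindow();

    [DllImport("user32.dll")]
    private static extern IntPtr GetSystemMenu(IntPtr hWnd, bool bRevert);

    // Returns the item's previous state, or -1 if the item does not exist.
    [DllImport("user32.dll")]
    private static extern int EnableMenuItem(IntPtr hMenu, uint uIDEnableItem, uint uEnable);

    public const uint SC_CLOSE = 0xF060;      // command ID of the Close item
    public const uint MF_BYCOMMAND = 0x0000;  // identify the item by command ID
    public const uint MF_ENABLED = 0x0000;
    public const uint MF_GRAYED = 0x0001;

    // Grays out Close (enabled == false) or enables it again (enabled == true).
    // Returns false if there is no console window or no Close item.
    public static bool SetEnabled(bool enabled)
    {
        IntPtr hmenu = GetSystemMenu(GetConsoleWindow(), false);
        if (hmenu == IntPtr.Zero)
            return false;

        uint flags = MF_BYCOMMAND | (enabled ? MF_ENABLED : MF_GRAYED);
        return EnableMenuItem(hmenu, SC_CLOSE, flags) != -1;
    }
}
```

The idea is to call SetEnabled(false) at startup and SetEnabled(true) on the way out. Because nothing is deleted, there’s no need to call GetSystemMenu with true, so the items that cmd.exe adds to the menu should survive.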

An aside: we have the term “fat finger” to describe hitting the wrong key on the keyboard. Is there a similar expression for making a mistake with the mouse? I suppose “mis-mouse” would do, but it doesn’t have quite the same ring to it as “fat finger.”

July 10th, 2008

Proxy fits

Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to slashdot.org.

You can often go around such blocks (not that I’m advocating such behavior) by using services such as SureProxy.com. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/. If SureProxy isn’t blocked by your IT department, then you end up seeing the slashdot page. (Along with whatever advertisements SureProxy adds to the page.)

I’m sure this kind of thing gives corporate IT departments headaches. Their headaches are nothing compared to the problems proxies pose for Web crawlers.

The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it will just build more and more, all of which go through the proxy. And since the proxy URL doesn’t look anything like the real URL (at least, not to the crawler), the crawler will end up viewing the same page many times: once through the real link, and once through every proxy that the link appears in.

Fortunately, it’s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I’ve encountered use CGIProxy–a free proxy script. The script itself is usually called nph-proxy.cgi or nph-proxy.pl, although I’ve also seen nph-go and nph-proy, among others. It’s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I’ll have to find and block manually.
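A rough sketch of such a regular expression follows. It isn’t the crawler’s actual code. The URL layout it expects (script name, one flags segment, the scheme, then the host and path) is an assumption based on the single SureProxy example above, and the names CgiProxyFilter and TryExtractTarget are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CgiProxyFilter
{
    // Assumed layout: http://proxy-host/[dirs/]nph-<name>/<flags>/<scheme>/<host and path>
    private static readonly Regex ProxyUrl = new Regex(
        @"^https?://[^/]+/(?:[^/]+/)*?nph-[^/]+/[^/]+/(https?)/(.+)$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // If url looks like a CGIProxy URL, rebuilds the real URL in target
    // and returns true. Otherwise target is null and the result is false.
    public static bool TryExtractTarget(string url, out string target)
    {
        Match m = ProxyUrl.Match(url);
        target = m.Success
            ? m.Groups[1].Value.ToLowerInvariant() + "://" + m.Groups[2].Value
            : null;
        return m.Success;
    }
}
```

Given the SureProxy URL shown earlier, this produces http://slashdot.org/. The crawler can then queue the real URL and throw the proxy URL away.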

I’ve also seen proxies (Invisible Surfing is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=. I’m sure that with some effort I could decode the URLs hidden in the query string, once I determined that the URL was a proxy URL. That turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I’m spot-checking the URLs myself and manually blocking the domains. It’s like playing Whac-A-Mole, though, because new proxies appear all the time.
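The decoding half is the easy part, at least for this example: the q value above happens to be plain Base64. Here’s a minimal sketch, assuming other proxies of this type encode their targets the same way (which I haven’t verified). The names EncodedProxyTarget and TryDecode are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodedProxyTarget
{
    // Returns true if value is Base64 text that decodes to an http(s) URL.
    public static bool TryDecode(string value, out string target)
    {
        target = null;
        try
        {
            string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(value));
            if (decoded.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                decoded.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                target = decoded;
                return true;
            }
        }
        catch (FormatException)
        {
            // Not valid Base64, so not this kind of parameter.
        }
        return false;
    }
}
```

Passing the q value from the example yields http://www.mischel.com/index.htm. Running a check like this against every query parameter is only a heuristic. It won’t catch proxies that use some other encoding, and it does nothing for the harder problem of deciding which sites are proxies in the first place.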

The other problem with crawling through proxies is that it makes the crawler ignore the robots.txt file on the target Web site. Since the crawler thinks it’s accessing the proxy site, it checks the proxy’s robots.txt. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.

Perhaps most surprising is that proxy sites don’t have robots.txt files that disallow all crawlers. I can see no benefit for the proxy site to allow crawling. The crawlers aren’t viewing the Web pages, so the proxy site doesn’t get the benefit of people clicking on their ads. All the crawler does is waste the proxy site’s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don’t take the simple step of writing a simple robots.txt, please explain that to me in the comments, or by email. I’m very curious.

July 9th, 2008

Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you’ll start to wonder how the Web manages to work at all.  Nowhere is this more evident than when working with URLs.

It’s a pleasant fantasy to believe that a document on the Web can be reached through one and only one URL.  That is, our training as programmers pushes us into the belief that the URL http://www.example.com/docs/resume.html is the way to reference that particular document.  It might be the preferred way, but it’s certainly not the only way.  On most servers, for example, you can drop the “www”, so that http://example.com/docs/resume.html will get you to the same place. We call this “the www problem.”

That’s just the simplest example.  Did you know that multiple slashes are irrelevant?  That is, http://www.example.com/////docs////resume.html will go to the same place as the two URLs above. You can also do some path navigation within the URL so that http://www.example.com/docs/../docs/resume.html goes to the same place as all the other examples I’ve shown.

You can also “escape” characters within a URL. For example, you can replace the letter e with the character string %65, turning the original URL above into this: http://www.example.com/docs/r%65sume.html. (Escaping a delimiter such as the slash is a different matter: RFC 3986 doesn’t treat %2F as equivalent to a literal slash.) Most often, escaping is used to remove embedded spaces and special characters that have particular meanings in URLs. Sometimes escaping is done automatically when a user copies a link from a browser and pastes it into an HTML authoring program.

Above are just some of the simplest examples. I haven’t even started on query strings–parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the “canonical” form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee), and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.

If you haven’t written a crawler, you might think I’m just contriving examples. I’m not. The www problem in particular is a very real issue that if not addressed can cause a crawler to read a very large number of pages twice: once with the www and once without the www. The other issues are not nearly as prevalent, but they are significant–so significant that every crawler author spends a huge amount of time trying to develop heuristics for URL canonicalization. Simply following the specification in RFC 3986 will get you most of the way there, but there are ambiguities that simply cannot be resolved.  So we do the best we can.
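To give a feel for what the simple cases involve, here’s a bare-bones canonicalizer. It’s a sketch under several assumptions, not our production code: the names are made up, stripping “www.” is a heuristic that is wrong for some sites, and it leans on .NET’s Uri class to lowercase the host and resolve “.” and “..” segments. It deliberately leaves escaped delimiters such as %2F alone, because RFC 3986 doesn’t consider them equivalent to the literal character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UrlCanonicalizer
{
    private static readonly Regex Escapes = new Regex("%([0-9A-Fa-f]{2})");
    private static readonly Regex SlashRuns = new Regex("/{2,}");

    public static string Canonicalize(string url)
    {
        // Uri lowercases the host and resolves "." and ".." path segments.
        Uri uri = new Uri(url);

        // The "www problem": treat www.example.com and example.com as one site.
        string host = uri.Host;
        if (host.StartsWith("www.", StringComparison.Ordinal))
            host = host.Substring(4);

        // Collapse runs of slashes, then decode escapes that don't need escaping.
        string path = SlashRuns.Replace(uri.AbsolutePath, "/");
        path = Escapes.Replace(path, DecodeIfUnreserved);

        string port = uri.IsDefaultPort ? "" : ":" + uri.Port;
        return uri.Scheme + "://" + host + port + path + uri.Query;
    }

    // Decodes %XX only when it stands for an RFC 3986 "unreserved" character.
    // Any other escape is kept, with its hex digits uppercased.
    private static string DecodeIfUnreserved(Match m)
    {
        char c = (char)Convert.ToInt32(m.Groups[1].Value, 16);
        bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        return unreserved ? c.ToString() : m.Value.ToUpperInvariant();
    }
}
```

With this, the www, multiple-slash, and path-navigation variants shown above all come out as http://example.com/docs/resume.html. Query strings pass through untouched, which is where most of the real trouble starts.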

You might also wonder where these weird URLs come from.  The answer is, “everywhere.”  Scripts are high on the list of culprits.  They can mangle URLs beyond belief.  For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string.  The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.

So imagine a script reached from the URL http://www.example.com/script.php?page=1&username=Jim%20Mischel. The script appends the username variable to the query string for all links when it generates the page, but it escapes the already-escaped string a second time. So links harvested from the page have this form: http://www.example.com/script.php?page=2&username=Jim%2520Mischel. “%25” is the escape code for the percent sign. Now imagine following a chain of 10 links all generated by that script. You end up with http://www.example.com/script.php?page=10&username=Jim%2525252525252525252520Mischel.
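One possible countermeasure is to collapse those chains of %25 back down to a single escape. This is only a sketch, with made-up names, and it’s a heuristic: it will also mangle a URL that legitimately contains an escaped percent sign followed by two hex digits.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class EscapeChains
{
    // "%25" repeated one or more times, followed by two hex digits:
    // the signature of a percent sign that has been escaped over and over.
    private static readonly Regex Chain = new Regex("%(?:25)+(?=[0-9A-Fa-f]{2})");

    // Turns %2520, %252520, and so on back into %20.
    public static string Collapse(string url)
    {
        return Chain.Replace(url, "%");
    }
}
```

No matter how many times the script has re-escaped the name, Collapse brings the parameter back to Jim%20Mischel, so every link in the chain differs only in its page number.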

What’s a poor crawler to do?

We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it’s a never-ending battle. Whenever we think we’ve seen it all, we run into another surprise.

July 7th, 2008

Computer Notes

  • One thing I haven’t figured out yet with my new Dell 490 system running Windows Server 2008 is how to burn a CD. I have a LITE-ON DVD RW in it–the same drive that was in my old Windows XP system–but for some reason Windows Server reports it as a DVD-ROM. This one has me stumped, but I don’t have the time to really track it down. Although I am getting tired of going to some other machine for burning CDs.
  • Several years ago I bought a Shuttle SK41G computer that served me quite well, first as a Linux test machine, then as a development platform, and finally as a small DNS server. About a year ago it lost its mind. I thought the problem was the battery for the CMOS RAM, but after replacing the battery the machine still loses the time whenever I shut it off. I hate to throw out a perfectly good (if somewhat aging) computer, but have a hard time justifying the time I’d spend puzzling this out.
  • I’ve been considering buying a clone of my Dell Latitude D610 laptop. Dell doesn’t sell that machine anymore, but it’s plentiful on eBay at about $400 for a fully loaded machine, shipping included. That’s about 20% of the new price from three years ago. It’s a very serviceable machine, with a 2 GHz processor, 2 GB of RAM, hard drives from 30 to 160 GB, and a nice display that’ll do 1400×1050 pixels. The only possible drawback is that it’s a single core 32-bit processor.
  • Dual core laptops are pretty reasonable. You can pick up a new Dell Inspiron 1525 on eBay for $500 or $600. For $700, you can get one fully loaded with lots of RAM and a big hard drive. I wonder about battery life, though. Can I get five hours out of it with the optional battery in the expansion bay? And honestly: do I really need multiple cores in a laptop?

June 23rd, 2008

Odds ‘n Ends

A few notes after a day of knocking things off the “to do” list.

  • I’ve used QUIKRETE before, but never for setting a post. Just pour the dry concrete mix into the hole (after placing the post), and add one gallon of water for every 50 lbs of mix. The stuff sets in about 45 minutes, and you can apply stress to the post after only four hours. No mixing required. Ain’t technology wonderful?
  • Seeing as how I had only one hole to dig, I did it the old-fashioned way: with a post hole digger and a Texas toothpick. Note to self: wear gloves next time.
  • From the hammer’s point of view, a thumb looks just like a fence staple.
  • It’s always a good idea to remove the old part and take it to the auto parts store when you go shopping for its replacement. It’ll save you from having to make another trip when you realize that the part you got isn’t the part you need.
  • I shouldn’t have to remove a dozen screws with three different tools in order to replace a relay.
  • A PVC union is an ingenious device. But remember to put thread compound on the threads of the device itself, in addition to the threads of the two pipes you’re attaching it to.

June 19th, 2008

Major search engines support robots.txt standard

Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives.  This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers.  The common directives are: Disallow, Allow, and Sitemap.  In addition, all three support the use of wildcards (* and $) in specifying paths for Allow and Disallow.  It’s interesting to note that Yahoo says they support “$ Wildcards,” whereas Google and Microsoft say that they support “* Wildcards” as well as “$ Wildcards.”  From reading Yahoo’s documentation, though, I’d say that they also support “* Wildcards.”
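For a crawler author, supporting the wildcards is mostly a matter of translating each path pattern into a regular expression. The sketch below shows my reading of the rules: * matches any run of characters, a trailing $ anchors the pattern at the end of the path, and anything else is a prefix match. It isn’t taken from any search engine’s code, and the names RobotsPattern and Matches are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RobotsPattern
{
    // Returns true if path is covered by an Allow or Disallow pattern.
    public static bool Matches(string pattern, string path)
    {
        // A trailing "$" means the pattern must match through the end of the path.
        bool anchored = pattern.EndsWith("$", StringComparison.Ordinal);
        if (anchored)
            pattern = pattern.Substring(0, pattern.Length - 1);

        // Escape everything, then turn the escaped "*" back into "match anything".
        string regex = "^" + Regex.Escape(pattern).Replace(@"\*", ".*")
            + (anchored ? "$" : "");
        return Regex.IsMatch(path, regex);
    }
}
```

So “Disallow: /*.pdf$” covers /docs/file.pdf but not /docs/file.pdf?x=1, and a plain “Disallow: /private/” still covers everything under that directory.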

All three also support several HTML META tags, such as NOINDEX and NOFOLLOW, that give content authors much tighter control over crawlers than can be accomplished with robots.txt. 

This isn’t exactly a new step.  The three major search engines have been collaborating for the last few years, trying to make Webmasters’ jobs easier with respect to the major search engines.  For example, back in February they announced common support for cross-submission of Sitemaps.

Unfortunately, all three also support individual extensions to the Robots Exclusion Protocol.  For example, Yahoo and Microsoft support the Crawl-Delay directive, which Google does not support. Both Google and Yahoo support some unique META tags that the others don’t support.

Even with the incompatibilities, this is a big step in the right direction. With unified support of the major robots.txt directives among the three major search engine crawlers, we can expect to see more support by smaller crawlers. I know that many authors of smaller-scale crawlers look to the majors to see what they should support. Having all three support the same directives in the same way makes other developers’ jobs (including mine!) easier.

But ultimately it’s the Webmasters who benefit the most, because they get a standard way to control crawlers’ access to their sites.

June 16th, 2008

One more time: the Internet is public

[Note:  As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.]

I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on the Internet.  If you’ve uploaded something to your Web site, it’s highly likely that Google, MSN, Yahoo, or any (or all) of the many other search engines out there has found it.  Even our Web crawler–a small-scale operation–finds things in hidden nooks and crannies of the Web that most people with browsers would never stumble upon.

For example, the other day a coworker was spot-checking some of the crawler’s latest finds and stumbled upon a site where the owner had uploaded what looks like (from examining the file names) a bunch of very private stuff.  This all in an unprotected directory.  A person with a browser could go to that URL, get a listing of all files, and then browse to his heart’s content.  Although it’s unlikely that a person browsing would stumble upon the directory, a crawler almost certainly will.  Eventually.

When we run across something like that, we don’t actually browse, but rather find out how to contact the site owner and send him a very nice email suggesting that he either protect the directory or not upload that information.

The day after discovering the site I mentioned above, we ran across the story of Alex Kozinski, a judge in the 9th Circuit whose personal porn stash was found publicly accessible online:

Kozinski, 57, said that he thought the site was for his private storage and that he was not aware the images could be seen by the public, although he also said he had shared some material on the site with friends. After the interview Tuesday evening, he blocked public access to the site.

Of particular interest in this case is that the judge was presiding over an obscenity trial (now postponed) that involves material that’s apparently similar to some of the material on the judge’s site.  The judge also had some copyrighted music on the site, opening up the possibility of copyright violation.

No matter how far out in the country you live, if you stand naked in front of an uncovered window, somebody will eventually see you.  Similarly, if you upload something to your Web site and don’t take active measures to prevent access, it will be found.  Do not assume that it can’t be found because you never told anybody about it.  That’s like putting a key under the doormat and figuring it’s safe because only you know it’s there.

June 11th, 2008

Can’t Configure Windows DNS Resolver Cache

In experimenting with the program I described yesterday, I got to fiddling with the DNS resolver cache, called dnscache. Briefly, dnscache saves the results from recent DNS queries so that it doesn’t have to keep querying the DNS server. Considering that a DNS query can take 100 milliseconds or more to resolve, this can save considerable time. For example, for your browser to load this Web page, it has to make many different requests to my server: one for the base page, one for the stylesheet, one for each image, etc. It wouldn’t be uncommon to require a dozen separate requests to get all the resources that make up the page. If each resource required a separate DNS request, it would take more than a second just for DNS!
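The arithmetic behind that last sentence can be written down as a small sketch. It assumes every resource lives on the same host and that a lookup costs a flat 100 milliseconds; the class and method names are made up for illustration.

```csharp
using System;

public static class DnsCost
{
    // Total time spent on DNS for one page load, in milliseconds.
    // Assumption: all requests go to the same host, so with a cache
    // only the first request pays for a lookup.
    public static int TotalLookupMs(int requests, int lookupMs, bool cached)
    {
        if (requests <= 0)
            return 0;

        int lookups = cached ? 1 : requests;
        return lookups * lookupMs;
    }
}
```

With a dozen requests at 100 milliseconds each, DnsCost.TotalLookupMs(12, 100, false) gives 1200 milliseconds, while the cached case gives 100.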

I got to wondering just how large the DNS cache is. A little bit of searching brings up any number of pages claiming that you can “speed up your connection” by tweaking the DNS resolver cache parameters. Specifically, they talk about changing registry keys for the cache hash table size, maximum time to live, etc. There’s even a Microsoft TechNet article describing these parameters for Windows Server 2003 (and, by extension, Windows XP). It’s interesting to note that the information on most of the pages claiming to speed things up conflicts rather badly with the information in the TechNet article.

After reading the tweaks and the TechNet article, I figured I’d give it a shot. I fired up the Registry Editor, made the changes, and … is it working? How can I tell? I tried browsing a few Web sites, but I couldn’t see any difference.

A little more searching and I found the command ipconfig /displaydns. This writes the contents of the DNS resolver cache to the console. A little work with the FIND utility, and I was able to count the number of entries in the cache. 34 on my Windows XP box. Interesting, considering that I set the CacheHashTableSize registry entry to over 7,000. I fiddled and tweaked, restarted the DNS Client service, flushed the cache, rebooted my computer, faced Redmond and cursed, and generally tried everything I could think of. No matter what settings I used, I always ended up with between 30 and 40 entries in my DNS cache.
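For anyone who would rather count from a program than from FIND, here is a rough sketch of the same idea. It assumes that every record in the ipconfig /displaydns output prints a line containing "Record Name", which is what English-language Windows shows; which string was handed to FIND above isn't recorded, so treat the match text as an assumption. The class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class DnsCacheCounter
{
    // Counts the lines that contain "Record Name" in text captured from
    // ipconfig /displaydns. Assumption: one such line per cached record.
    // A host with several records (say a CNAME plus an A record) is
    // counted once per record, not once per host name.
    public static int CountEntries(string ipconfigOutput)
    {
        int count = 0;
        foreach (string line in ipconfigOutput.Split('\n'))
        {
            if (line.IndexOf("Record Name", StringComparison.OrdinalIgnoreCase) >= 0)
                ++count;
        }
        return count;
    }

    // Runs ipconfig /displaydns (Windows only) and returns its console output.
    public static string ReadCache()
    {
        var psi = new ProcessStartInfo("ipconfig", "/displaydns")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return output;
        }
    }
}
```

Calling DnsCacheCounter.CountEntries(DnsCacheCounter.ReadCache()) before and after a registry change gives a number to compare, subject to the same doubt as ipconfig itself: it only counts what the tool chooses to display.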

On my Windows Server 2008 machine at the office, I always got between 270 and 300 entries, no matter what I tried.

So that leaves me with the following possibilities:

  1. It’s not possible to change the size of the DNS resolver cache in Windows XP or Windows Server 2008.
  2. It is possible, but the documentation is wrong.
  3. The documentation is correct as far as it goes, but it’s incomplete.
  4. The documentation is correct and complete, but I’m too dumb to make sense of it.
  5. The documented registry entries actually changed the size of the cache, but ipconfig isn’t showing me all the entries that are in the cache.

At this point, all possibilities seem almost equally likely. I could do some indirect testing based on the amount of time it takes to resolve a series of DNS requests, but even that would be inconclusive. There are no documented API calls that allow me to examine the DNS cache or its size. (And the undocumented ones aren’t described well enough to be worth checking out.) My only means of seeing what’s in the cache is the ipconfig tool.

So I ask: does anybody know how to change the size of the Windows DNS resolver cache and prove that those changes actually work? Do I have to restart the DNS Client service? Reboot the machine? Set some super magic registry entry?

Any information greatly appreciated.

June 10th, 2008

Is this really asynchronous?

I’ve been working on a relatively simple program whose purpose is to see just how fast I can issue Web requests. The idea is to get one machine hooked directly to an Internet connection and see how many concurrent connections it can maintain and how much bandwidth it can consume. A straight bandwidth test is easy: just start three or four Linux distribution downloads from different sites. That’ll usually max out a cable modem connection.

But determining the sustained concurrent connection rate is a bit more difficult. It requires that you issue a lot of requests, very quickly, for an extended period of time. By slowly increasing the number of concurrent connections and monitoring the bandwidth used, I should be able to find an optimum range of request rates: one that makes maximum use of bandwidth, but doesn’t cause requests to time out.

My Web crawler does something similar, but it also does a whole lot of other things that make it impractical for use as a diagnostic tool.

I got the program up and limping today, and was somewhat surprised to find that it couldn’t maintain more than 15 concurrent connections for any length of time. Considering that my crawler can maintain 200 or more connections without a problem, I found that quite curious. It had to be something about the different way I was issuing requests.

Because this is a simple tool, I figured I’d use the .NET Framework’s WebClient component to issue the requests. In order to avoid the overhead of constructing a new WebClient for every request, I initialized 100 WebClient instances to be served from a queue, and then issued the requests in a loop, kind of like this:

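while (!shutdown)
{
    // Start another request only while we're below the connection limit.
    if (currentConnections < MaxConnections)
    {
        // Reuse a pre-built WebClient rather than constructing one per request.
        WebClient cli = GetClientFromQueue();
        ++currentConnections;
        // Start the download; the completed event handler decrements currentConnections.
        cli.DownloadStringAsync(GetNextUrlFromQueue());
    }
}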

The actual code is a bit more involved, of course, but that’s the gist of it. The currentConnections counter gets decremented in the download completed event handler.
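To make the bookkeeping concrete, here is one way the pool and the completed handler could fit together. This is a sketch, not the actual program: RequestPump, TryStart, and the two properties are hypothetical names, and it leans on ConcurrentQueue and Interlocked because the completed event may be raised on a different thread than the one running the loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

public class RequestPump
{
    // Pool of idle clients, filled once so no WebClient is built per request.
    private readonly ConcurrentQueue<WebClient> clients = new ConcurrentQueue<WebClient>();
    private int currentConnections;

    public int CurrentConnections { get { return Volatile.Read(ref currentConnections); } }
    public int IdleClients { get { return clients.Count; } }

    public RequestPump(int poolSize)
    {
        for (int i = 0; i < poolSize; ++i)
        {
            var cli = new WebClient();
            cli.DownloadStringCompleted += OnDownloadCompleted;
            clients.Enqueue(cli);
        }
    }

    // Starts one download if a client is idle; returns false otherwise.
    public bool TryStart(Uri url)
    {
        WebClient cli;
        if (!clients.TryDequeue(out cli))
            return false;

        Interlocked.Increment(ref currentConnections);
        cli.DownloadStringAsync(url);
        return true;
    }

    // Raised when a download finishes, fails, or is cancelled.
    private void OnDownloadCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // A real tool would look at e.Error, e.Cancelled, and e.Result here.
        Interlocked.Decrement(ref currentConnections);
        clients.Enqueue((WebClient)sender);
    }
}
```

The loop then reduces to calling TryStart with the next URL whenever CurrentConnections is below the limit. The pool size doubles as a hard cap, since TryStart refuses to start anything when no client is idle.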

The important thing to