   Revision #54 - 7/10/2008 12:42 PM     

Podcast 011

[Incomplete]

Intro, advertising

[1:07]

Atwood: So, um, let's see, what are we going to talk about today?

Spolsky: I have no agenda.

Atwood: I can...  No agenda?

Spolsky: No agenda.

Atwood: How many questions do we have?

Spolsky: Oh we have questions, let's see, I've got four!

Atwood: Ok, we'll try to make sure we have time for those.

Spolsky: Some of them, we can do some of them.

Atwood: Yeah, so, let me start with some complaints about the previous show.

Spolsky: OK.

Atwood: So there were complaints that we were talking over each other quite a bit.  Um, that is hard to not do. *laughs*  Not because we're both liking to talk, although that's probably true as well, but because there's always a little tiny bit of latency, and I find that any time there's latency it really aggravates the "Who's On First"  problem [http://en.wikipedia.org/wiki/Who's_on_First%3F or perhaps he meant "who goes first" - ed], so that it can be a challenge to figure out who's talking.

Atwood: So we will try to do a better job on that.

Spolsky: Yeah.

Atwood: Because I did take that to heart a little bit.

Spolsky:  The trouble with Skype is that it doesn't multiplex, so if we both talk at the same time, it just sort of garbles... gumbles... you wind up not being able to hear either person.

Atwood:  That is true. You can kind of hear the other person, but their volume level goes way way down.  So, you have to really strain to hear them.  So I empathize, that's not a good thing to do, so we are going to try not to do it. So one thing I'd like to talk about today is a little bit of database stuff.  Now Joel did provide me with a drop of the discuss.joelonsoftware.com, the .NET forums.

Spolsky: I like the way you say "drop" as if this is something that I'm going to do on a regular basis.

Atwood: Well, we have to do it again.

Spolsky: [simultaneously] Something.

Atwood: Yeah, the reason that I needed this is on a lot of the previous projects I've worked on we did these postmortems, and I've always felt like postmortems are sort of underutilized in software development where you... Where everybody has a meeting.  Because nobody likes to talk about the things that didn't work.  Everybody wants to, particularly in corporate America, it's like everybody's a winner.  There's never any failed projects.

Spolsky: You must live on the west coast.

Atwood: [laughs] Actually, this was on the east coast that I'm thinking of, but I've seen it all over where...

Spolsky: Postmortems I've been to everybody just loves to whine and whinge and all I can remember is the bad things.  And it's usually just the last three weeks of bad things that they remember.  They don't remember any of the good things from the beginning.

Atwood: Well, I think you have to have somebody driving it that's...  You have to have a moderator, big-time for something like that.  Somebody who's not biased, like maybe somebody who wasn't intimately involved in the project who can sort of keep things on track.  But it's important because what I've found is if you don't do the postmortems, you're not really addressing the systemic problems in your development.  Like the things that for future projects that you could actually fix before you even start.  And one of the big things I've seen personally in the projects that I've worked on is we never had good test data, so we always ended up just keying... Developers would just randomly key in data, testers would just randomly key in data, and you just sort of hoped that things worked, and then we'd deploy to production and we'd find that, wow, once you have a thousand records performance tanks, or there's all these cases about nulls that we didn't look at, and yeah.  So having a really good test corpus, and there's really two schools of thought on this.  One is data generation, where you just synthesize a whole bunch of fake data, sort of looks kind of random, but that can be good.  You can have like Unicode in it, things like that that you would not normally have to test.  And the other is just to get a giant body of data.  So in this project, we actually had both options because we're using the Team Suite version of Visual Studio [http://msdn.microsoft.com/en-us/vsts2008/default.aspx] which I got through some friends at Microsoft.  It's a very expensive version of Visual Studio.  It's like, I think the license is like $7000, it's a lot.  But it includes, part of it, one small fragment of it is the database edition and one of the things, one of the many things it will do, is let you generate these data generation plans.  It's kind of cool, it actually ends up being reverse regular expressions in a lot of cases.  That's the really powerful generator.
In a normal regular expression, you're matching, right?  Like, oh I want a number here, a letter here.

Spolsky: Yeah, that's just how you use it, I guess.  Yeah, yeah.

Atwood: But it's kind of cool because I had...  It's not necessarily a unique idea, but when we had this problem I was thinking, wow, wouldn't it be cool if you entered a regular expression, rather than matching those characters, it actually generated those characters. That would be perfect, because you could generate almost anything using regular expressions.
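
[Editor's sketch of the "reverse regular expression" idea described above. This is not the Visual Studio generator; it is a minimal Python illustration that assumes a tiny pattern subset (literals, \d, bracketed character classes, and {n} or {m,n} repeat counts). The function names are hypothetical. - ed]

```python
import random
import re
import string


def expand_class(body):
    """Expand the inside of a class such as 'a-z0-9' into its characters."""
    chars = []
    j = 0
    while j < len(body):
        if j + 2 < len(body) and body[j + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[j]), ord(body[j + 2]) + 1))
            j += 3
        else:
            chars.append(body[j])
            j += 1
    return "".join(chars)


def generate_from_pattern(pattern, rng):
    """Generate one string that matches a tiny regex subset (see note above)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] == "d":
            choices = string.digits
            i += 2
        elif ch == "[":
            end = pattern.index("]", i)
            choices = expand_class(pattern[i + 1:end])
            i = end + 1
        else:
            choices = ch  # anything else is treated as a literal character
            i += 1
        low = high = 1
        if i < len(pattern) and pattern[i] == "{":
            end = pattern.index("}", i)
            parts = pattern[i + 1:end].split(",")
            low, high = int(parts[0]), int(parts[-1])
            i = end + 1
        for _ in range(rng.randint(low, high)):
            out.append(rng.choice(choices))
    return "".join(out)


pattern = r"[A-Z]{2}\d{4}-[a-z]{1,3}"
sample = generate_from_pattern(pattern, random.Random(42))
# The generated value matches the same expression it was generated from.
assert re.fullmatch(pattern, sample)
print(sample)
```

Spolsky's objection in the next turn applies to this sketch as well: data generated from a validation pattern always passes that pattern, so it exercises the happy path only.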

Spolsky: But but but but but but...  You see the bug here, right?

Atwood: Well, no.  What's the bug?

Spolsky: Well, for your very...  In order to generate the test data that's going to test your regular expressions, you're going to copy those regular expressions and you're going to tell some app to generate a bunch of things that match all those regular expressions, and then you're going to get this nice, clean, perfectly conformed data that doesn't really test anything because, you know, it's not...

Atwood: Some of the things you can do, for example, you can set a percentage of how many of the values will be null. So you can have like 50% null on one field, and 0% null in another field.  The other interesting thing is you can set a random number seed.  So you're actually generating repeatably random data.  Which I know sounds weird, and it is a little weird actually, but you can actually write unit tests based on your data generation plan that would say, "oh, well, I know this record is always going to exist because that's the random number seed we used".  So you can actually unit test your database in some form with these data generation plans.  You can also write custom code.  You can also, for example, generate from another table.  Like say you're generating names of cities.  Well, I'm not going to write a regular expression, although you could, it would be weird, that had like fifty cities in it.  You're going to have a table of fifty cities, right? You want the generator to pick at random one of those cities, like Cleveland, Des Moines, place like that. 
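
[Editor's sketch of the three knobs mentioned above: a null percentage per field, a fixed random seed, and values drawn from a lookup table. It is a plain Python illustration, not the Visual Studio data generation plan format; the column names, the city list, and the function name are assumptions. - ed]

```python
import random

# Stand-in for the lookup table of cities mentioned in the conversation.
CITIES = ["Cleveland", "Des Moines", "Cincinnati"]


def generate_rows(count, seed, null_fraction):
    """Build `count` rows; roughly `null_fraction` of them get a null city."""
    rng = random.Random(seed)  # fixed seed means identical data on every run
    rows = []
    for row_id in range(1, count + 1):
        if rng.random() < null_fraction:
            city = None
        else:
            city = rng.choice(CITIES)
        rows.append({"id": row_id, "city": city})
    return rows


first = generate_rows(1000, seed=2008, null_fraction=0.5)
second = generate_rows(1000, seed=2008, null_fraction=0.5)

# "Repeatably random": a test can rely on a specific record always existing.
assert first == second
print(first[0])
```

Because the seed fixes the sequence, a unit test can assert on a particular generated row, which is the property Atwood describes.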

Spolsky: How about Cincinnati?

Atwood: Cincinnati perhaps, yes.

Spolsky: Yeah, okay.

Atwood: So it's really pretty flexible.  I really like it 'cause I found very little tooling around this the last time I looked (which admittedly was a while ago... this was around 2005).  But it's really essential to test with a pretty large corpus of data.

Spolsky:  Yep.  There's a new product also, if you don't want to use the Team System there our friends at Redgate have a product called SQL Data Generator [http://www.red-gate.com/products/sql_data_generator/index.htm].  It's $295, and it just generates some realistic test data.  And it does things like actually generate real cities in the city column that actually exist, and then when it generates those it'll get the state right, instead of getting the state wrong, or putting states into German cities and stuff like... you know, it generates pretty realistic looking data.  I'm not really sure what all the features are.

Atwood: Cool.  Well, that's one of the problems that Microsoft has, I think, is that... this particular tool ends up being really expensive and hard to license because...

Spolsky: Yeah.

Atwood: You can get it outside the Team Suite but then it's: you're floating out there with this unusual add-in for Visual Studio.

Spolsky: I don't really get what Team Suite... I mean, my feeling is that Microsoft just noticed companies like IBM with their Rational Suite [http://www-306.ibm.com/software/awdtools/suite/] charging somebody (I don't know who) $6000, $7000 a desktop for, pretty much, nothing.  Or, well wait, not nothing, sorry.  An IDE, and some features for the IDE, and some hard to use bug tracking, and a whole bunch of other stuff and Microsoft said "hey, why can't we get in on this $7000 a seat market?".  But it is a very, very niche market, so except for the people that are the partners who get it for free, and so forth, I think we're really talking about a few hundred institutions around the world who are really going to use these gigantic packages.

Atwood:  Yeah, it's frustrating because this data problem, I view as, you know, core mainstream software development, so I want this tool to be in the hands of as many people as possible, and the licensing, and just the understanding of how to even get it gets in the way. Which is why, you know, companies like Redgate, I think, have a great niche because one, their product is only two hundred dollars, you can understand how to get it, like you just download it right?

Spolsky:  Yeah...

Atwood:  ... I assume there's like an eval version that people can get.

Spolsky:  Mm hmm.

Atwood:  And when I was working with, mostly we're using it because we have it.

Spolsky:  Yeah, sure.

Atwood:  It's in our default tool set.  It actually works really well.  I'm sure the Redgate tool is better because another advantage that these vendors have is that their entire lifeblood is this product, whereas for Microsoft it's just a checkbox on a feature list. I'm sure the team that works on this is very, you know, gung-ho about the feature, but it's not the entire company.

Spolsky: Exactly, they can't get any attention from anybody above them. And, and, nobody is going to think of... nobody is even going to know that they necessarily have this feature.

Atwood: Yeah, it's a challenge. It's a challenge. I don't really know what the answer is. But my point to all of this is, please look at data generators. They're a really great tool to have in your toolset. I found many shops had no idea what I was talking about, when I went to talk to them about data generation and what it did. But, it was always one of my favorite things to demonstrate, because I felt it was a big quality of life improvement for mainstream development. I mean, 'cause who doesn't write an app that talks to a database at this point?

Spolsky: You still get the weirdest bugs, even when you have data generation.

Atwood: Well but, yeah, sure, this is like unit tests. I mean, you are just climbing the mountain of quality. It takes a long way to get to the top. We are just talking about working out the base camp now. So another way, if we don't have data generation. You gave me a drop of the database, so if you have an actual corpus, that works really well.  I was surprised to see that you only posted three times in your own forums. That surprised me.

Spolsky: Umm, oops. Wait, that's not the total in my own forums, that's just that particular .NET questions forum.

Atwood: That's true, that's fair.  I was giving you a special like ID and looking up your identity and everything and I was like, "I did all that work for three records?"

Spolsky: [laughs]

Atwood: Yeah, so there's like 14,000 records so it's a really really good size corpus, so--

Spolsky: I don't necessarily think we should launch with it 'though.  I mean, we've been talking about launching with it and I have this fear that it will give us a ridiculously strong .NET flavour from day one, which may drive away people in any other technology.  We might be better off launching empty to the beta testers.

Atwood: Okay, I'm open to that.  I mean, I... right now, for development I want a large corpus so I can be confident we're not making any error in the database and so forth.

Spolsky:  Sure.  Yeah.  Hey, the other... speaking of postmortems. You had emailed me that you're in charge of programming the search feature.

Atwood:  Oh good I'm glad you brought that up, yeah I wanna talk about that.

Spolsky:  Yeah, definitely Lucene.NET. We have been trying to make SQL Server full-text search work for the last 8 years... with very little success.  So...

Atwood:  You know we're gonna get...  we're gonna get emails from people that work on the SQL Server team, you know that right?

Spolsky:  Yeah, and you know what they're gonna say, they're gonna say: [whiny] "Hey, I'd be glad to listen to whatever problems you've been having and help you solve them.  And I'm the program manager for the next version, and I want to make this better".  And we'll be like yeah, ok, go read the 800 posts that I put on your friggin' forum in 2001 that you haven't answered yet. [giggles] Because I do get actually quite a bit of that from Microsoft when I complain about anything - I complained about MSDN on the web being not webby and changing the URLs all the time and got another email from someone who claims to be the program manager and they're revamping the whole thing and it's going to be completely different and what would I like to see changed, and I'm like "no, not revamp, that's the whole problem, you guys keep revamping the whole thing".

Atwood:  Right, well. Can I drill down a little bit? (Yeah, let's talk about...) I mean 2005... What is... my expectation was in 2005 that full text would be pretty good.

Spolsky:  It's not.

Atwood:  Why isn't it?

Spolsky:  Don't know.  Well, first of all...

Atwood:  [laughs] Don't know?  Okay, right.

Spolsky:  I'll tell you the two biggest problems that we've always had with it.  One is that it is grafted on using the stand-alone Microsoft Index Server, it is not very native to SQL Server.  And in particular what that means is, for example, instead of being a part of the database, that gets backed up with the database, and treated as a part of the database, and gets detached when you detach the database.  Instead of just being a real part of the database, it's actually its own thing, over there in Index Server land, and it has its own unique identifying numbers that don't match the SQL Server unique identifying numbers, and they just put a million records in the registry to map these things, to connect them.  

Atwood:  What? That sounds insane, what you described!

Spolsky:  [laughs]  One of the reasons why this turned out to be extremely--now for somebody who just has one database, and they're just sort of plinking around, they may not mind this situation, but we host hundreds or thousands of databases on our servers, and this just doesn't survive.  Having the index server being separate from the database itself.  So the index is like its own thing that's not a part of the database.  When you update a record, it doesn't know that it's supposed to update the index.  Instead, the index has to come along and spider it or something later.  And in particular, the problem that we found was that a lot of times, we would detach databases that weren't in use, and we would try to reattach them later, and that would cause the index server to confuse something, and basically full-text search would then be all messed up for the next three weeks.  And we finally learned a fairly complicated and tedious procedure that involved destroying all the full-text indexes, and then detaching the database, and later reattaching it and recreating the full-text indexes, which was the only way to solve this particular problem.  And to be completely fair, between SQL Server 2000 and 2005, this got about 50% better, but it was never really completely solved.  Is that right, Michael?  It's kinda working now, Michael says.  Michael says [garbled] problems with 2005.

Atwood:  Well you know what you guys need, though?  2008.

Spolsky:  Woohoo! 2008! [laughs]  I do.  Are they gonna ship in 2008, do you think?

Atwood:  Yeah, I heard after the summer.  So this year, but not real soon.  I was asking because I was wondering if we should use it on Stack Overflow, but the feedback I got was definitely no.  It's not that ready yet.

Spolsky:  The second problem we have with it is, it's just slow.  Like, a lot of times, a query will just take fifteen seconds, thirty seconds, for the first query, and sometimes it'll speed up later for the next query after everything gets all paged in, but it is just slow.  And when we switched to Lucene in FogBugz, our search became usable. People have an expectation of search in terms of finding things and being usable that they learn from Google, and unfortunately Microsoft Index Server is just not anything like Google-quality search.  It's sort of like 1993 electronic database search if you went down to the library at school and you were doing a paper on psychology and you needed to find something in some corpus and you ran some kind of search, and it thought for a while and gave you back a bunch of wrong results based on, the highest result would be the page that mentions that word the most times in the document.  And it's just one of those things where it doesn't seem to be finding anything, a lot of times it would find things where the word that you're searching for is just not in there and you can't figure out why it's bringing this back as a result, and you suddenly realize that it's conjugating something in some funny way that's causing a match that's incorrect, because it's trying to do stemming.  It's just old-fashioned search, it's just like before-Google search.  And people's standards have risen, and they expect to be able to find things, and people--we discovered when we were using SQL Server built-in full-text search, that people just didn't expect search to work in FogBugz, and weren't using that feature.  They were going to great extents to try to find things using the filters.  And then just scanning.  And once we switched over, it was like, "Hey, the search box works!  You can type things in there and actually find results!"
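
[Editor's sketch of the two symptoms described above: ranking by raw word count, and false matches caused by over-eager stemming. This is a toy and makes no claim about how Microsoft Index Server or Lucene actually rank results; the stemmer, function names, and sample documents are all invented for illustration. - ed]

```python
import re
from collections import Counter


def crude_stem(word):
    """Deliberately naive suffix stripping (not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def rank_by_count(query, documents):
    """Order documents by how many times the stemmed query term occurs."""
    target = crude_stem(query.lower())
    scored = []
    for name, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        stems = Counter(crude_stem(w) for w in words)
        if stems[target]:
            scored.append((stems[target], name))
    return [name for _, name in sorted(scored, reverse=True)]


documents = {
    "cars.txt": "The car dealer sold a car and then another car.",
    "caring.txt": "Caring for patients takes patience.",
}

# "caring" is stemmed to "car", so the page about cars matches the query
# and outranks the page that is actually about caring.
print(rank_by_count("caring", documents))
```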

Atwood:  Well, you guys must have a pretty big set of databases now, because of all the hosting you're doing.  Because, when did you guys start doing hosted FogBugz?

Spolsky:  When did we start doing hosted FogBugz?  Maybe ten months ago, I'm guessing?  I'm not remembering.

Atwood:  So it wasn't that long ago.  So you guys are becoming like a little datacenter.  You're starting to have, like, real major size problems.  Not that--I wouldn't necessarily have, but I would empathize with.

Spolsky:  We have an unusual problem in that we want to give every customer, every hosted customer, their own database.  So in particular, that means having, literally, thousands of databases on our servers.  Which SQL Server was never really intended for.  And actually, SQL Server 2000 was just ghastly if you tried to attach more than about a thousand databases.  Things fell apart.  Suddenly basic operations would take ten minutes.  Things like sp_helpdb to get a list of your databases.  SQL Server 2005, to their credit--and I didn't really blame this on Microsoft, it sort of felt like what we were doing was unusual, and that to design a SQL Server thing to have that many attached databases is a very different project than to design a SQL Server to have a normal number of attached databases.  But the thing about FogBugz is, a lot of these customers use their databases kind of rarely.  Like, they might go in there two or three times a day.  It's not a helluva lotta transaction processing.  So in terms of CPU and disk storage and all that kind of stuff, we can easily put hundreds and hundreds, or even, like I said,  thousands of these databases on a single machine, and it's fine.  Except for the fact that some things weren't scaling in SQL Server 2000.  And years ago, we switched to 2005 and completely solved these problems.  

Atwood: Right, this brings up an interesting... well I think two interesting points, one is one of my favorite bloggers Reginald Braithwaite has a great post on how people who work in corporations are trying to compare your app with what they use on the web, right? Because now, you know, in most corporate situations everything is locked down, you can't exactly install applications. But, there's this emerging class of web applications that everybody can get to right next to your app. And Google is one of those things, right? So, you're right. So, when somebody searches in Google, they see it return instantly. They see it return highly relevant information. And they also see that they can just type stuff in, it doesn't, you know. Your apps usually compare very unfavorably. And, it's almost like an unfair comparison because, I mean, think of all the work Google has put into this massive server farm, you know, and your dinky app. Is it really even fair to compare them? On one level 'no' but on another level 'yes.' So, it's a real challenge. Every app is kinda competing with the web now and there's certain things that it does very very well. And then the other point is that your use case is different from my use case. That doesn't make either of us wrong. And certainly I totally respect where you're coming from with the things you're doing with FogBugz. But, I'm only ever going to have one database, ever. Right? Stack Overflow...

Spolsky: Yeah okay, that's true. That problem is not going to probably happen for you. On the other hand, I think performance-wise, and just in terms of the relevance of results.

Atwood: Oh, right. No, I'm totally [???] at Lucene based on the [???].

Spolsky: Illustrate the [???]. Yeah.

Atwood: I was really surprised. I thought it would be better in 2005 but-- no, no, I'm totally going to take your advice. I just want to point out that a lot of times when people are discussing things, they don't talk about their implicit assumptions in their use cases, and they just end up talking. Not that we're doing this, but I see it a lot on discussion forums, and it's like everybody has their pet use case and that's the most important thing in the world to them, but they just don't get that other people are using it in like sometimes radically different ways which would change all the rules for what they're doing.

Spolsky: Yeah, or they're imagining something completely different. They're imagining a different story. You know they have-- I think that's how a lot of political arguments, where you're arguing, you know "Should we allow-- Should there be a tax, an R&D tax break, for research and development?" Well, you know, you can imagine R&D tax break being a way for companies like Microsoft to just not pay their taxes because they spend so much on it. And these are highly profitable companies and they should pay their fair share. Or you can imagine like little startups with two guys in a garage, trying to save a few bucks. And it just depends on whether you are for or against that political statement or whatever. It often depends on what story you're telling yourself as you're having the argument. And if two sides are imagining a different story, they're gonna maybe come to different results as to what should be the policy.

Atwood: Right, but I wish more people would dig down to assumptions when they're talking about stuff. And you kind of touched on this a little bit in the 5 Why's when you guys had that datacenter problem. It's like "Why did this happen? Why did this..." You keep asking "Why?" I mean, there's a similar logic you want to apply to understanding use cases, like "Well, why is this important?", you know, and just digging all the way down to your assumptions that you're-- the assumptions are there because you don't know they're there. You know, that's why they're called assumptions, so.

Spolsky: [chuckles]

Atwood: It's kind of nice to have somebody help you air those out and understand what assumptions you're making. So, Lucene? You guys have had really just blanket great experiences with Lucene? I mean...

Spolsky: Well, to be fair, we started out for some reason trying to use CLucene, which is the C port.

Atwood: [Laughing] Nice.

Spolsky: I'm trying to remember what-- oh I know why, cuz we didn't want a dependency on .NET. We were trying to avoid a dependency on .NET...

Atwood: I see.

Spolsky: ...for FogBugz, which we eventually gave up on. And it was just-- it was just like there was threading code in there, let's put it that way, so it just was not stable. And, [we] eventually gave up and switched to Lucene.NET and we've been really happy with it. I've been using-- you know, what made me think Lucene would be good enough is this Lookout for Outlook thing, which I've talked about on my website, which is a plugin for Outlook written... let's see when... uh, help... let's see if it has an About... okay, I just hung it. You know how-- like Outlook uses again the Microsoft Indexing Service, it's never been fast, it's never been good. And these guys, about... let's see when it was... 2003. So, about 5 years ago, basically 1 or 2 guys started this little company to take the Lucene engine and make it available as a plugin to Outlook, so that you could use it to search your email. And it is astonishing how much better it is than any of the search that is built into Outlook, to this date. They then got-- well, the main guy at that company-- the company got quote unquote "bought" by Microsoft, I don't know for how much money, and the main developer on Lookout went to work for Microsoft, and everybody sort of thought that Lookout would then be incorporated into Outlook. Meanwhile, the Outlook team was going in their own directions with search, obviously to be better integrated with the operating system and the search service that Microsoft already had. And Lookout is obviously open source so it's not something they can-- sorry, Lucene is open source so this is not something that Microsoft can really just use. And so, they continued to make Lookout available for a while, and then for some reason when Outlook 2007 came out, they implemented a feature to check if Lookout was installed, and if it was, to disable it. 
[laughs] And I think, I don't know if they were just being lunatics or if this was just incompetent, rampant incompetence, or if there was some genuine technical reason why they did this. But what they were telling people is "Oh, you don't need Lookout anymore because Outlook now has search built in!"

Atwood: Mm-hmm.

Spolsky: Or "a better search". And it is better, but it's not as good as Lookout. It really isn't. It takes, you know, minutes and it's just tedious. And it's just not as good, it just really is not as good. And so, fortunately, the Lookout programmer has since left Microsoft and he has posted instructions on his website for how to get the old version of Lookout from 2003 to work with Outlook 2007, which is what I use. And it's great, it's fast, and it returns relevant results, and it's just really really reliable. And so, I always had real good experience with Lucene, and that's why I was enthusiastic when the FogBugz team wanted to use it.

 

 

Atwood:  So what have we learned, kids? We've learned that if you want to make change you can't do it from inside the company.

 

Spolsky:  Maybe, yeah, you have to...

Atwood:  You have to be outside the machine to make the change happen which is really kind of depressing. Because that has implications for the American system of government and things like that -- any large system, it seems like you can make change more effectively from the outside than the inside, it's a little depressing.  This has happened to Google too, like Dodgeball, they bought a bunch of stuff -- well not a bunch, but -- there's some really high profile things that all large companies buy that just seem to disappear.  It's like [muffled name?] said, that you buy them and think "oh this is going to be integrated and it's going to come with the product, it's going to be wonderful" and you know it just falls apart, it just gets absorbed into the machine and just dies.

Spolsky:  Yeah I was going to think of all the Yahoo acquisitions: del.icio.us, Flickr, uhhh... what was the other big Yahoo acquisition? Where the founders are now gone, yeah, and then you know, none of those things ever shipped another version after they got acquired. Now part of that maybe...

Atwood:  Kinda makes you wonder if there should be large companies, maybe there should just be a whole lot of small companies.  But on a previous podcast you said that large companies just, that it's a function of like money I guess, or size, or, I don't know, I didn't really...

Spolsky:  You weren't really listening. [Laughing]

Atwood:  No no, I was listening but I just, it's hard to believe that's why that happens, it doesn't seem very sane.

Spolsky:  Large companies, well, I don't remember what I said then, but I will point out that at large companies you start to develop these local maxima as I call them.  So, local pockets where people maximize for the success of the pocket where they work.  Their team, their division, their P&L, you know their profit and loss statement that they're responsible for, what their bonus is gonna be based on, and people will optimize for that instead of for the good of the whole company, by doing things that are just retarded for the whole company. Or the opposite happens at Microsoft, and I think that's what happens with Lookout, is, Microsoft has this thing that they call The Strategy Tax, which is, basically...

Atwood:  Oh right, that's what you had talked about before, The Strategy Tax.

Spolsky:  It's all the work that anybody has to do to support Microsoft's strategy of Windows Everywhere, and whatever Microsoft's strategy du jour is, and so the Internet Explorer team is told that they can put in some editing, but if it's as good as Word then they're gonna be threatening the Word monopoly and therefore they have to stop making editing in a web browser be as good as Word.

Atwood:  Yeah, yeah, that's too bad, it does happen a lot.  So changing topics a little bit, so one thing that came up last week that I spent at least a day and a half working on is, so, in Stack Overflow, there's wiki-like aspects to Stack Overflow, so you can actually enter something known as Markdown markup.  Have you had a chance to look at any of the Markdown controls or anything?

Spolsky:  Is it the same, are you actually using Markdown there?  I thought it was some slight...

Atwood:  We're using Markdown, we're using a control called the WMD control, again, very unfortunately named, that's a great control.

Spolsky:  [inaudible] Markdown?

Atwood:  Yeah, it uses Markdown.  So, one of the interesting things about Markdown that seems like a plus but quickly becomes a negative as you start writing the code, is that it allows... the spec, not the control, I'm talking about the Markdown spec, allows you to mix HTML tags and Markdown tags.

So, the reason, like, Wikipedia and a lot of other sites that allow user input use a separate markup language is just because it's so much easier to do it in a safe way.  Because if you can allow arbitrary HTML to be inserted into your database, and then rendered to the page, this opens you up to this class of cross-site scripting vulnerabilities, the abbreviation is XSS. And cross-site scripting is really disturbing, and actually I did some research on this in 2007 and since I wrote about it, cross-site scripting is now the most common security vulnerability in the world, for software.

Spolsky:  Wow.

Atwood: So it's a really really big deal. So if you allow input from users to go into your database you have to sanitize it. I know it sounds really simple but...

Spolsky: I just did it right now [laughs] I just did a cross-site scripting vulnerability in dev.stackoverflow.com.

Atwood: Yes, you might have.

Spolsky: ...right this very second, [I put] "onclick=" and I wrote some Javascript and it ran my Javascript...

Atwood: Well no no no, you have to submit it to the database. That actually won't go into the database, that's just for rendering.

But if you click "save" and then the page renders that way, absolutely. So I want to be clear, so it's gotta be written to the database. There's nothing I can do to prevent the preview from showing it 'cause the preview is... so you strip it out.

It's a lot more complicated than developers think and I talked about this on the blog but there's this page of just... It's a hacker's page of like all the ways you can type your HTML that are just, you know, obfuscated and weird and broken in a lot of ways...

Spolsky: yeah yeah yeah, to get past the filters...

Atwood: It's really disturbing.

I want anybody listening to this who is a software developer who does anything on the web. Please, go to that site and just scroll, it's got a huge... You'll scroll for like ten minutes from all the exploits that have been typed in there, it's really disturbing. It'll really open your eyes.

Spolsky: Yeah, I mean FogBugz works on the assumption that everything is invalid except for those things which are valid. So it's basically just going to discard everything until it finds something that it is absolutely confident is OK.

Atwood: Right, no. Whitelist. You have to use a whitelist-based approach and for some reason a lot of developers even today... I posted a code snippet so I have a snippet based on a whitelist 'cause our use case is very narrow. I'm only supporting the tags that Markdown emits really.

So you have two options. You can either do it the Markdown way. Like let me give you an example. So italic is asterisk-word-asterisk. That's italic in Markdown. That gets converted to, you know, <i></i> or <em></em>, but you can also enter the <i></i> if you want to.

So I'm only supporting that subset of tags.
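
[Editor's note: a minimal Python sketch of the whitelist idea discussed above. It is not the routine Atwood posted on refactormycode.com; the tag list, the `sanitize` name and the choice to HTML-encode rather than delete rejected tags are all assumptions made for illustration. Note that it also encodes `&`, so input that already contains entities would be double-encoded. - ed]

```python
import html
import re

# Hypothetical whitelist: a few tags Markdown commonly emits, attributes not allowed.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "pre", "code",
                "blockquote", "ul", "ol", "li"}

# Anything that looks like a tag: "<" up to the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")
# Accepted form: a bare opening or closing tag with no attributes at all.
BARE_TAG = re.compile(r"</?([a-z][a-z0-9]*)>", re.IGNORECASE)


def sanitize(fragment: str) -> str:
    """Keep whitelisted bare tags; HTML-encode everything else."""
    out = []
    pos = 0
    for match in TAG_PATTERN.finditer(fragment):
        # Text between tags is always encoded.
        out.append(html.escape(fragment[pos:match.start()]))
        tag = match.group(0)
        bare = BARE_TAG.fullmatch(tag)
        if bare and bare.group(1).lower() in ALLOWED_TAGS:
            out.append(tag)               # on the list: pass through unchanged
        else:
            out.append(html.escape(tag))  # not on the list: rendered as inert text
        pos = match.end()
    out.append(html.escape(fragment[pos:]))
    return "".join(out)


print(sanitize('<em>fine</em> <i onclick="evil()">click</i> <script>alert(1)</script>'))
```

[The point of the design is the default: a tag is rejected unless it matches the narrow accepted form, so obfuscated or malformed markup falls through to the encoding branch instead of having to be recognised as dangerous. - ed]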

Spolsky: If you're only supporting that subset of tags, why don't you just store the Markdown in the database ?

Atwood: Well we are, we're storing both. Because everything is editable so I have to go back to the editable state and I don't want to get into a situation where I have to convert from HTML to Markdown. I don't know if I even have code to do that frankly.

Spolsky: Wait wait, there is. Why do you have to store both ? You can always run the Markdown thing again in order to produce a page.

Atwood: There's a problem. Because there's two sets of code because we don't execute Javascript on the server. The thing actually interpreting the Markdown and converting it to HTML is Javascript at this point. So if I want to get a pure, you know, one-to-one, in goes back to out conversion, I'd have to run Javascript on the server.

So I'd be using a .NET Markdown library...

Spolsky: Yeah, there is a .NET library.

Atwood: Markdown is complicated enough that not everybody does it the same way. There's these subtle differences in the implementation OK ?

Spolsky: So wait, what's wrong with the .NET Markdown library ?

Atwood: There's nothing wrong with it, my point is that it's a totally different code path so the input and output would be on two different code paths which makes me nervous and...

Spolsky: No no no! wait wait! Stop!

Why don't you do this: Just accept input in Markdown, don't accept any tags. Store the Markdown in the database and when you need to display it, you either run through the Javascript on the client in order to display it and if they don't have Javascript, then they're looking at plain text. Or you run through a different code path, the Markdown .NET on the server in order to send HTML back to the client.

Atwood: Well, I have one problem with that, which is I feel like that gives a really bad experience to people who for whatever reason don't have Javascript running. They're not even going to get basic formatting...

Spolsky: Yeah but that's the whole point about Markdown is that it looks nice. It looks completely legible in its raw format.

Atwood: Oh I suppose. I dunno, I thought like we had plenty of database space. I mean our server has like, 300 gigs of space.

Spolsky: I'm not concerned about that, I'm just wondering why half... Because the way I'm describing it you don't need any filtering.

Atwood: Well you're doing the work on every single page though, that's... I kind of object to that a little. That means that every single page we render, we have to go through and render Markdown again which seems...

Spolsky: You can render Markdown so fast that you can do it on every key-down on the page...

Atwood: On the client sure, but...

Spolsky: ...and that's Javascript. First of all on the server you can do it much faster because it's compiled code, but if you can do that on the client, why not continue to do that on the client. Just keep that Javascript in there and use it to convert the Markdown to HTML on the client.

Atwood: We could. It's something we thought about doing but since I have both versions I figured I have flexibility. I can just send down the pre-rendered version and not have to worry about differences in implementation.

I mean we have all the options so the point you're bringing up we could do that because I'm storing both representations.

Spolsky: It saves you from writing any code to sanity-check HTML.

Atwood: Oh, I see what you're saying. OK.

Yeah that's true because if I'm never writing HTML to the database... But then I have to disallow... wait a minute, that's not true though because Markdown itself allows HTML so you're saying strip out HTML from the Markdown and just...

Spolsky: Change all the <'s to &lt;
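
[Editor's note: a toy Python sketch of the approach Spolsky is describing: encode every angle bracket the user typed first, then let the renderer introduce only the tags it generates itself. The `render` function and its single emphasis rule are hypothetical and stand in for a real Markdown implementation. - ed]

```python
import html
import re

# Toy rule covering one Markdown feature: *word* becomes <em>word</em>.
EMPHASIS = re.compile(r"\*([^*\n]+)\*")


def render(markdown_source: str) -> str:
    # Encode first, so any tag the user typed becomes inert text.
    escaped = html.escape(markdown_source, quote=False)
    # Only now introduce tags, and only ones this renderer produces itself.
    return EMPHASIS.sub(r"<em>\1</em>", escaped)


print(render("*hello* <script>alert(1)</script>"))
```

[Because no user-supplied tag survives the first step, there is nothing left to filter afterwards, which is the saving Spolsky refers to. - ed]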

Atwood: It also gets a little swirly because we can have code blocks inside the Markdown. Like say you have a Markdown issue, oh we're having this problem in HTML right. You have a code block of actual HTML. It could be a cross-site scripting vulnerability, I mean let's bring this full-circle. You're like wow, I've found this cross-site scripting vulnerability. Let me paste it in and show it to you right?  Well that's actually safe inside a code block. So if I strip out blindly, I mean I have to have logic to avoid the <pre> blocks, the things that are actually supposed to render that way.

Spolsky: Ah, <pre> blocks you don't just get an &lt; in there ?

Atwood: Um...

Spolsky: I think that's probably what you have.
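
[Editor's note: in Gruber's Markdown, the contents of a code block are emitted with `<`, `>` and `&` converted to entities, which is why pasted exploit code displays as text instead of running. A minimal Python sketch of that behaviour, handling only four-space-indented lines; `render_code_block` is a hypothetical helper, not a function from any Markdown library. - ed]

```python
import html


def render_code_block(indented: str) -> str:
    """Render a four-space-indented Markdown block as escaped <pre><code>."""
    # Strip the four-space indent that marks each line as code.
    lines = [line[4:] for line in indented.splitlines() if line.startswith("    ")]
    # Encode &, < and > so the browser shows the markup instead of running it.
    body = html.escape("\n".join(lines), quote=False)
    return "<pre><code>" + body + "\n</code></pre>"


print(render_code_block("    <script>alert('xss')</script>"))
```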

Atwood: The whole stack gets really confusing because you can actually, like I said, be pasting in script vulnerabilities that actually render safely.

Spolsky: Here's my point. Uhh, in general, my design philosophy, which I have learned over many years, is to try and keep the highest fidelity and most original document in the database, and anything that can be generated from that, just regenerate it from that. Every time I've tried to build some kind of content management system or anything that has to generate HTML or anything like that. Or, for example, I try not to have any kind of encoding in the database because the database should be the most fidelitous, (fidelitous?) highest fidelity representation of the thing-a-majiggy, and if it needs to be encoded, so that it can be safely put in a web page then you run that encoding later, rather than earlier because if you run it before you put the thing in the database, now you've got data that is tied to HTML. Does that make sense? So for example, if you just have a field that's just their name, and you're storing it in the database, they can type HTML in the name field, right? They could put a less than in there. So, the question is what do you store in the database, if they put a less than as their name. It should probably just be a less than character, and it's somebody else's job, whoever tries to render an HTML page, it's their job to make sure that that HTML page is safe, and so they take that string, and that's when you convert it to HTML. And the reason I say that is because, if you try to convert the name to HTML by changing the less than to &lt; before you even put it in the database. If you ever need to generate any other format with that name, other than HTML. For example you get to dump it in HTML to an Excel file, or convert it to Access, or send it to a telephone using SMS, or anything else you might have to do with that, or send them an email, for example, where you're putting their name on the "to" line, and it's not HTML. In all those cases, you'd rather have the true name. You don't want to have to unconvert it from HTML.
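
[Editor's note: a small Python sketch of "store the true value, encode at the point of output". The stored name, the address `ann@example.com` and the helper names `as_html` and `as_email_to` are all made up for illustration. - ed]

```python
import html
from email.utils import formataddr

# What goes in the database: exactly what the user typed, no HTML encoding.
stored_name = "Ann <3 Code"


def as_html(name: str) -> str:
    # HTML encoding happens here, at the moment an HTML page is generated.
    return '<span class="user">' + html.escape(name) + "</span>"


def as_email_to(name: str, address: str) -> str:
    # The "To" line is not HTML, so it needs the true name, not "&lt;".
    return formataddr((name, address))


print(as_html(stored_name))
print(as_email_to(stored_name, "ann@example.com"))
```

[Had the name been stored as `Ann &lt;3 Code`, the email path would have to undo the HTML encoding before it could use the value. - ed]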

Atwood: No, I agree with that, and that's why we decided to store both representations, so we have all the options at that point. It's really a lot more complicated than you'd think after actually dealing with it. I mean, that's really what I learned from this, you have to be... and that's why cross-site scripting is so dangerous, because there are so many ways to get it wrong, that are really kind of subtle, so I think that's the only lesson I have here. And I'm open to whatever rendering strategies we want to use. I just kind of like the elegance of not having to do anything markdown related on the server. My XSS routine, I posted on refactormycode.com, so people could look at it and make sure my whitelist-based approach was correct. And I got some really nice feedback on refactormycode and I plugged a few holes, and as far as I can tell by the people looking at it, it's actually valid because it's really draconian because I only allow a very very specific set of tags, and if you're not on that list you're just, you're gone. Either HTML encoded, or actually deleted. So there's two schools of thought on that. So I feel pretty confident that the XSS...

Spolsky: [laughs] You've grown attached to this code, even though you don't need it.

Atwood: Well, no, I just like not having to execute markdown on the server. I think that's really nice, because I have one code path.

Spolsky: No, you don't have to. When you're sending a page to the client you send it with a bunch of markdown and the same javascript that you use for interpreting that markdown that you use on the post page, and it runs that javascript real quick if it can, and then they get their page converted to a bunch of HTML.

Atwood: But the control doesn't really arbitrarily render markdown, it's not designed to do that. I mean, I could repurpose it.

Spolsky: That's, that's, that's easy. There's all kind of javascript versions of markdown floating around. There's one, they probably just copied all of the code from that. There's one that's one of the top three hits for markdown on Google. Where the heck is it? That uhh [sigh]

Atwood: I don't know, I just philosophically I don't like that the page is going to look, granted markdown doesn't look bad per se, but it's not going to look great.

Spolsky: Here we go attacklab.net/showdown

Atwood: Well right, that's the control we are using. AttackLab is the WMD author.

Spolsky: Oh, okay.

Atwood: Yeah, no, we'll look at it. I mean, this is the kind of stuff during the beta and stuff I want to get really good feedback on this. And our performance is really incredible now. I mean, we are returning in milliseconds. Granted we are not doing everything we should be doing, but even already with the beta, we're astonishingly fast. I'm really a stickler about speed. It's actually appalling, like a lot of sites I go to, like I've been doing a lot of searches. 'Cause for examples my SQL has gotten very rusty, because I haven't used it in a while. So I was doing a lot of SQL related searches, and a lot of the sites I would land on would take just forever to load.

Spolsky: And then they would try to sign you up?

Atwood: Yeah, and just the layout is confusing.

Spolsky: And it's like experts exchange, you have to scroll down past all of the advertisements to get to the... [laughs]

Atwood: I'm in the situation now where I really wish StackOverflow was up and running 'cause I would literally take a lot of the stuff I'm looking up about SQL, just really simple things, frankly, and I would post them as questions on StackOverflow and I would personally refer to them because it's gonna be, to me, a better system. So, one of the things I like about building StackOverflow is even the partial version of it we have now, I think is actually better than a lot of the sites that are out there, and it's so low friction. You actually enter something on it, you don't have to sign up, there's not a lot of, you know, extra stuff on the page you have to think about. There's pretty much just the question and the answer, a basic, you know, markdown editor. And it's amazing what you can do in markdown. Um, and also mixing in HTML as well. If you get confused, you're like, "I don't understand markdown." You can just type in the HTML, and it'll work.

Spolsky: Hey, where did that...uh...that's an advantage. Where did that...uh...syntax. I saw some kind of syntax coloring when you did code.

Atwood: It's in there. You just have to, so to make a code block in markdown, you indent four spaces. You can click the toolbar button which is Ctrl-K.
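
[Editor's note: the four-space rule is easy to apply mechanically. A minimal Python sketch; `as_code_block` is a hypothetical helper and says nothing about how the WMD toolbar button is actually implemented. - ed]

```python
def as_code_block(snippet: str) -> str:
    """Indent every line by four spaces so Markdown treats it as a code block."""
    return "\n".join("    " + line for line in snippet.splitlines())


print(as_code_block("SELECT name\nFROM users"))
```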

Spolsky: Well, how does it know how to syntax, wah wah, it looked to me it was doing syntax coloring unless I'm dreaming it.

Atwood: It is. Okay, so that comes from, that's a project some Google engineer, I think, wrote it--it's called "Prettify." And it's a little interesting in that it actually infers all the syntax highlighting, which sounds like it couldn't possibly work--it sounds actually insane, if you think about it. But it actually kind of works. Now, he only supports it for, there's certain dialects that just don't really work well with it, but for all the dialects that sort of, you'd find on Google. I think it comes from Google's Google Code. It's the actual code, it's the actual javascript which is on Google Code that highlights the code that comes back when you're hosting projects on Google Code. And you, and you, um, 'cause I think they use Subversion so you can actually click through...

Spolsky: How do they know, how do they even know what language you're writing in? And therefore, what a comment is and...

Atwood: I don't know. It's crazy. It's prettify.js, so if anyone's interested in looking at this, just do a web search for "prettify.js," and you'll find it.

Spolsky: Somebody, somebody call in with a next week, please. Send us, send us an mp3 with an explanation of what the heck this thing does, what languages it understands, why it works. I just typed in some random thing in a language I made up, and it actually syntax highlighted it reasonably well.

Atwood: Right. It's kind of amazing. It's pretty cool. It's really neat. So yeah, no, if anyone else has any other suggestion for syntax highlighting, please let us know. 'Cause traditionally what you do, like on Wikipedia, for example, when you put a code block in, you do have to specify the language very explicitly. So it is unusual to have a highlighting engine that just infers everything and most of the time actually does get it right, so it is kind of cool. So, yeah, no, I'm really excited about StackOverflow. I mean, like I said, just doing the searches that I've been doing to build it. It's sort of a recursive thing, where I wish I had it. So I could use it as my research notebook. And that's actually how I'm going to use it. So I figure, even if nobody uses StackOverflow, I'm going to use the crap out of it. So, that's my own logic.

Spolsky: Yeah, so will I. I asked a question yesterday and answered it myself.

Atwood: Yeah, yeah, there you go. So I was going to talk a little bit about Steve Yegge, but I don't think we have time. I'd rather get questions at this point.

Spolsky: Um, yeah. That takes Steve Yegge, I think Steve Yegge takes a really long time.

Atwood: [laughs] So let's queue up some questions here.

Spolsky: Yeah, here's one from Stephen Hill.

Stephen Hill: Hi Jeff and Joel, my name is Stephen. I'm from Blackpool in the UK. What do you think of Microsoft Silverlight? Do you think it will catch on, and will you be using it in the future? Thank you.

Spolsky: Silverlight.

Atwood: Well, okay, first of all, we had some complaints that you were talking over the people asking questions, so I just want to be clear about that.

Spolsky: Yeah, I'm doing that on purpose, so there.

Atwood: Okay, I--The audience is not liking that.

Spolsky: Just the people that emailed you are, don't like that.

Atwood: Well, yeah, it was more like comments, but yeah. And also Stephen Hill, thank you. That was a very succinct question, and I love that. I just love quick, and "here's my question", awesome.

Spolsky: I edited like crazy.

Atwood: Did you? No you didn't.

Spolsky: Yeah, do you want to hear the original?

Atwood: Did you?

Spolsky: Yeah.

Atwood: No. No I don't.

Spolsky: I did. I got it down from ninety seconds to eleven seconds. He had, actually, eight questions in a row, and he was very kind and left big pauses between each of the questions and said just go ahead and pick the ones you like. And that's what I did.

Atwood: Yeah, if you can ask short questions that'd be awesome. So Joel, take it away.

Spolsky: I don't know anything about Silverlight. Didn't you used to work for some kind of Microsoft Solutions Provider or something?

Atwood: Yes. I know far more than anybody should know about Silverlight. So OK. So, I'm of two minds on Silverlight. From a programming geek perspective, it's actually very impressive. So, there's Silverlight 2.0, which is like, the real version of Silverlight that's not quite out yet, still in beta. There's also Silverlight 1.0, which is Javascript based, which is like the fake version of Silverlight, that's kind of going to die off once 2.0 is out. So Microsoft already has this problem of, they have these two sort of radically different versions of Silverlight that are named the same thing, which is annoying and very Microsofty. I guess they felt like they had to get something out there, 'cause you know. So  we're really competing with Flash; let's be clear on the messaging here. So Silverlight competes with Flash, period. It does more than Flash, though. And by that I mean, this is where the programming geek stuff comes in. It's another version of the .NET runtime, and it's a really cool common-language runtime that supports Ruby, Python, VB .NET, C# and I think even Javascript. So you got a really like hardcore, like Anders level language platform running in the browser now. And a lot of geeks saw that and were like "Wow, this is awesome." And it is awesome, right? I mean, you can do real programming. Not like this ActionScript stuff which is still to this day crazy--I mean people build fantastic stuff and it's all Flash, but it's not exactly a programmer's paradise. You know, if you're used to like C# and what I call real programming.

Spolsky: Let me ask you a question. Let's say you were a--you didn't care about any of this Flash competition or anything like that. You weren't even making Flash controls or anything. You just had a web page, like a nice .NET 2.0 thing with like Markdown engines and stuff on your webpage and you've got large volumes of Javascript code to run your site. Now let's say Gmail, for example. There's just a lot of Javascript there running everything. And there's actually so much Javascript there that the performance is starting to be an issue. So my understanding is that you could take this Javascript, compile it into a Silverlight thingamajiggy, without changing it even, and get this bytecode that would run really really fast on any kind of web browser that had Silverlight installed, and then for the people who don't have Silverlight installed, you'd just fall back to the old slow Javascript.

Atwood: Well, not quite. 'Cause with Silverlight you're writing to a different DOM. You're writing to the Silverlight display surface.

Spolsky: But, but but--Ahh!

Atwood: You can get to the browser stuff, but then you have to go through COM, and you're paying a lot of performance penalties at that point. So in Silverlight, it's like Flash, in that you have your own drawing surface that does primarily vector-based stuff. And you can put, you know, user controls on it, and they're building this whole set of dropdowns and radio buttons and stuff. But the use case of talking to the web browser object model is not really what Silverlight is about. It's more like Flash in that regard, where you're actually writing to the Silverlight surface. And you can do almost like 3D type stuff, I mean you can do, it's much more powerful than your typical browser display elements. But it's a totally--the downside is, as you pointed out, it's kind of a separate world as well. So anyway, that's architecturally what it is. I am excited about Silverlight 2.

Spolsky: It's ActiveX controls.

Atwood: Yeah, it is kind of like that. And it runs on the Mac, and it runs on Firefox, and Microsoft has done a very good job with Silverlight, of being open and saying "Hey, we're going to run on all these different platforms, we're not going to discriminate, it's not a windows-only type experience." But on the whole, and I think you brought up one of these points, I think the problem that Silverlight's going to have is--like WebKit, for example, that team is working on massively improving Javascript performance. They have some technique, and I'd have to look it up, but it's some sort of hybrid, not compilation, but really fast interpretation they're doing, and they're getting some amazing results on benchmarks now. Much faster... like Firefox3 was very fast with Javascript, this is, I think, faster still by  like 2x.

[48:08]

Spolsky: Wow.

Atwood: So really the competition for Silverlight is that you can still do vector stuff aside, with vectors there's usually this... But talking to the browser like say to the markdown control that we have, if you had really fast Javascript, why not just do it in Javascript? I mean people understand the browser DOM, divs and CSS and stuff, so outside of vector-based stuff I don't see... It may not have enough compelling strength in the face of really fast Javascript interpreters that are starting to emerge. I think... I think that's the...

Spolsky: Yeah, I think the truth is I've never seen an application - there's been a lot of programming environments that are basically ... the idea being that you get a rectangle inside the web-page somewhere, and it being desktopy inside that rectangle.

Atwood: Yes, that's what it is.

Spolsky: And that... It started with Java applets, and there was Flash, for a while there was ActiveX controls which had huge security problems, so forget that for a moment, but even if they didn't have the serious security problems that they had, just the fact that you're stuck in this little rectangle and you're navigating inside the rectangle, and the back button blows away the entire world that you were just in, and it was just like the un-webiness of these desktop apps running in the rectangle in the web-page...

Atwood: Right.

Spolsky: You know, they were useful for like little controls, so you might use it like the WMD thing that you have, where it is a little WYSIWYG editing box effectively. And you know, it's basically being a control on a form, you know, then it's a decent control development environment, but the idea that you are going to build a whole app on this... And people have built like entire Flash apps where the whole app is Flash, but you know, they are always sort of upsetting in some ways. You can’t select text and cut and paste it, you can’t bookmark things that are inside there, because they are not real web pages with real URLs. A lot of times… Have you ever got an email sent to you by your secretary, or something, saying: “Oh, go look at this”, and you’re like “I don’t get it. It’s the top level web page of some furniture distributor.” And you go there and then you realize that it is a gigantic Flash thing, and they spent six months and they navigated to something interesting inside this gigantic Flash control. And then they wanted to send you a link to that, and they did what they had learned to do, which was to send you what was on the address bar, which hasn’t changed for the last twenty five minutes, while they were browsing around in this gigantic Java… oh, sorry… Flash based… you know… universe unto itself.

So I think that… Honestly I really do feel like many, many times the lesson has been learned that when you try to build things in the web that aren’t webby, because they try to have their own rectangle, you are going against the grain of the web in a way that makes it extremely, extremely unlikely that people will go for it, that it will take off.

 

Atwood: Right… No I completely agree with everything that you said, because Flash still has this problem. I tell people that are really excited about Silverlight… I’m like “Yes, from a geeky perspective it is absolutely cool”. You’ve got this really kick-butt runtime. You can write Ruby code that runs super fast in the browser, which is incredible. So that is really cool, but you are still playing in that rectangle, right? How much adoption has Flash really gotten? And Flash has been out for like ten years, a long time. How many websites do you go to that are built entirely in Flash, and how many people rightfully complain about those apps? This is not a solved problem, I mean you can have all the same problems that Flash has, just different flavours of them. So you’re not embracing the web.

 

Spolsky: And you don’t have the ubiquity. At least Flash is ubiquitous, except for the iPhone.

 

Atwood: Yes, and even on Microsoft’s site… They have Silverlighted up some areas of their site in what I consider very inappropriate ways that make them actually worse and more painful to use. I think Microsoft downloads does this now, and the download site is just painful to look at because the font rendering is wonky. Because the operating system’s font rendering is actually outstanding, I mean this is like how many years of computer science they spent on this. Whereas in Silverlight, it is like “we have our own font rendering engine”. It’s not as good.

 

Spolsky: Really?

 

Atwood: Oh, yeah. It looks really bad. I’m really sensitive… I’m kind of a wonk when it comes to how fonts render. It’s really kind of annoying, actually.

 

Spolsky: I’m surprised that they don’t let the operating system render the fonts.

 

Atwood: Well, there is something about the way that they are doing it that is wrong, that is really wrong, that to my eye was like “Wow, that looks really, really bad”. It was really obvious. And Flash does this too. The way Flash renders fonts: bad, horrible.

 

Spolsky: They have to be scalable.

 

Atwood: Yeah, but there is a font rendering technique that is called sIFR that is really cool, where you take a div, or something, and you can dynamically replace it with Flash to use whatever font you like. It sort of solves the web font problem, but it has a fallback because it is a div with another font. But it does not really look right, because the fonts are rendered through Flash and not the browser. So I posted about this on my blog, and people were like “Wow, that looks really bad” and I was like “Wow, I didn’t see it”, but once they told me I totally saw it, and then I could not un-see it. Right? I totally... They ruined it for me. It was like “Here is this cool thing”, and they were like “It’s horrible”, and I was like “Oh, you’re right, it’s horrible”. So I agree with all those criticisms.

 

Spolsky: So, ok. We officially poo-poo on Silverlight.

[53:20]

{Dave Roberts Question}

[53:40]

{Learning C - what, again?!?}

[54:10]

{Jeff would have been fired a long time ago}

[54:33]

{The Yegge thing.  Don't let it be just you}

[55:03]

{Programming is a social activity}

[55:53]

{Done and gets things smart}

[56:37]

{A 14-year old named Jason}

[57:00]

{Would Joel hire Jeff?}

[57:35]

{We feel we know each other}

[58:30]

{This is normal back-and-forth}

[59:10]

{WWDC party - that band Paul McCartney was in before Wings}

[1:00:08]

{Would Jeff hire Dave Roberts?}

[1:00:54]

{Hiring at FogCreek - phone interviews, physical interviews, internships}

[1:01:51]

{Hiring is tough}

[1:02:30]

{it's better to have hiring false negatives than false positives}

[1:03:28]

{most people rejected by Fog Creek would probably be fine}

[1:03:52]

{It's just like intimate relationship}

[1:04:16]

{the marketing intern}

[1:04:52]

{thanks for the transcripts}

[1:05:09]

{more questions please}

[1:05:24]

{Wave file?}

[1:06:01 ends]

Outro, advertising

[1:06:55]

Last Modified: 10/10/2008 2:09 AM
