View  Info
Compare   Restore  
   Revision #8 - 1/11/2010 5:37 AM     

Podcast 079

Spolsky: I was just in a meeting last second with this guy he made a StackExchange called ClimateDeal. Have you heard about that?

Atwood: I haven't heard of it.

Spolsky: It's a climate change StackExchange and I guess there's lots of money in the NGO business.

Atwood: Really? I didn't know anything about that. And another thing I don't know anything about that somebody mailed me about and I want to mention since we're on the topic of StackExchanges is and astronomy StackExchange. Not just astronomy but...

Spolsky: Like I'm a libra and you're a leo therefore...

Atwood: Actually Joel, I'm a capricorn though, and capricorns are very stubborn as you know.

Spolsky: And I'm a leo. I know that you're stubborn I don't know about capricorns in general.

Atwood: That's the crazy thing about astrological science. Everybody is born in a specific month has the same personality traits. That doesn't even make any sense at all. It doesn't even pass the sniff test of like: Is this sensible? No. This is like turning lead into gold ridiculous.

Spolsky: What's real is biorhythms.

Atwood: This is a real thing it's actually astro-tech, I guess it's technical astronomy: interfacing with telescopes and astronomical instrumentation. It's I guess they had an existing site.

Spolsky: Terrible URL...

Atwood: So if you're into astro-tech...

Spolsky: Woah, look at all these people, look at all these questions. Is ascom a thing? I think it might kind of be a thing because there are people asking about ascom on here.

Atwood: Yeah, I think it is a thing. This is, when we created StackExchange we were looking at niches and I'm a big fan of these little niches on the internet, I think it's wonderful.

Spolsky: We got to tell the listeners. Ascom is a many-to-many and language-independent architecture supported by most astronomy devices that connect to computers. So it sounds like it's like MIDI of telescopes.

Atwood: Yes, that makes sense.

Spolsky: That's what is sounds like.

Atwood: But only if it can play music Joel.

Spolsky: Except that it doesn't play music, it plays telescopes. And some of the names here I recognize from StackOverflow so I think that there's some overlap. Maybe not, maybe it's just because there are people named Chris and Bob and stuff. It is sort of interesting how these StackExchanges, like the first level and the second level of spreading I asked Jose who's here from ClimateDeal how he heard about StackExchange and why he decided to start using it. He said they're building a whole organization around StackExchange kind of and they're going to promote to other climate change type of organizations and other "green" organizations that they know. And I asked him how he found out about it and he's like, "We were working with these programmers and they suggested that we look at this and they use StackOverflow." Like all the programmers on StackOverflow have an obligation to tell other industries and get them excited about the StackOverflow vision for the future.

Atwood: Anything to keep people off of PHP. It's like keeping people off drugs, it's the right thing to do.

Spolsky: Right...

Atwood: And when I say PHP I mean PHPBB. I'm specifically talking about PHPBB.

Spolsky: You know that there is 99.999% of PHP is not PHPBB.

Atwood: I know. There's a ton.


Spolsky: Let's keep people off that too.

Atwood: Well, not necessarily, I've kind of resigned myself to a world of PHPBB at this point. So, one thing I want to talk about is over the holidays I was able to contact the person who created the Markdown... there's 2 Markdowns: Markdown is the markup language that we use on StackOverflow. There's 2 implementations that we have: one is for the client side preview, which is the wmd control which we had to reverse-engineer... the whole story is in a previous podcast. And then there's the server-side implementation. So one of the difficulties we ran into was these are subtly different.

Spolsky: So the previews are not matching when it shows up on the site?

Atwood: Right. Over the holiday I did improve the preview, like the main areas that were just kind of oversights really. Like we changed some of the rules about bold and italic and for the most part they match now except for really weird edge-cases. I got rid of all the obvious mismatches.

Spolsky: Yeah.

Atwood: And actually I got help from Jacob, I'm going to mispronounce his name I'm just going to call him Jacob, who runs the MathOverflow StackExchange. He was very helpful. He was helpful in sort of troubleshooting that. They used a lot of weird syntax on MathOverflow.

Spolsky: They do amazing things with math notation basically. They use LaTeX.

Atwood: Yes, but, we've talked about that before, but in addition to that they just have ASCII notation as well, and the ASCII notation can be problematic because you're putting characters in sequences that are just really, really uncommon in any kind of normal text at all. So they were running into a lot of edge-conditions as well so he was very helpful so I do want to give a shout out to Jacob in that regard.

Spolsky: Have you looked at MathOverflow lately? It's absolutely insane. Look at all these tags, they've got tags with little dots in them. Why is that? Oh, I think that the dot is like the...

Atwood: Don't you understand Joel? I'm kind of like allergic to math so it's not really good for me to be around math.

Spolsky: Look at the site and look at their tag cloud over there. What they're doing is, they've got like 2 letter abbreviations for tags. So it's like a 2-letter abbreviation, a dot, and then the name. So it's fa.functional-analysis. Or ra.rings-and-algebras.

Atwood: Look what you've done Joel, you've made me go to a math site.

Spolsky: Are you listening to a word I'm saying?

Atwood: I am listening! I'm just trying to tell you...

Spolsky: Look at the tags, this is a general idea that they seem to have invented here. So now if you want to look at probability stuff you don't have to type 'probability' you just type 'pr.' and then it only has one match. You see? Get it?

Atwood: The Hawaiian Earring?

Spolsky: Look at the tags!

Atwood: I am looking at the tags, I see what you're talking about. I've processed that.

Spolsky: You see what they've done? They have this cool feature, that you can just type like 3 letters and it'll only have a unique match.

Atwood: Very rapidly yeah. Although we do match anywhere.

Spolsky: I know but you'll have multiple matches because those 2 letters... because they put that little dot in there, this means that if you just know the 2-letter code for something, you're just going to hit the 2-letter code and you're done.

Atwood: Yep. Cool. MathOverflow's great, it's been hugely popular. There's definitely been demand for it from way back.

Spolsky: I don't understand anything. Like nothing. Hawaiian Earrings, I know what those are.

Atwood: I'm glad there's people smart enough to do this advanced math because I really, really suck at it.

Spolsky: I'm voting that up.

Atwood: Wow, you can vote on MathOverflow?

Spolsky: No, it didn't let me. I need to talk to Aaron, I want to be able to vote on MathOverflow.

Atwood: So anyway, MathOverflow is fantastic and Jacob's the guy who's been helping us out on that. The server-side implementation was where I wanted to do some additional work and it wasn't actually an open-source project. I don't think it was intentional, but the original author did not present it under an open-source license, which means, as you know: if it's not open-source it's copyrighted by default. So I contacted him and he was totally cool about it and he granted the copyright to me. So I was then able to turn around and open-source that and put it up as Markdown Sharp on Google Code and I'll link that in the show notes. I was able to make quite a bit of progress. You know, we're a little bit down on unit-testing, but this is like a textbook example of where you want unit-tests. One of the first things I did was put in unit-tests. Unit-tests for Markdown are pretty simple, they're basically just input and output. You have an input file which contains Markdown and you pass it through the processor and the output should match the reference.

Spolsky: This is an awesome example of where it's straight text transformation, it's so easy to do automated tests, unit-tests, TDD and all that kind of stuff.


Atwood: It's brilliant, because I found just an unbelievable number of bugs... oh my gosh I found a lot of bugs. Bugs, like, in our port, just accidental bugs. Literally just like an extra space in the regex in the wrong place.

Spolsky: Right.

Atwood: And it was causing it, it wasn't causing it to break, but it was causing like failure-to-match and that was causing the output to be subtly wrong. Not in a way that really broke anything per se but it was wrong. I fixed that, and there's a lot of bugs from the actual implementation, the Perl implementation. The original implementation of Markdown is Perl.

Spolsky: Wasn't it done with like 8 millions regexps?

Atwood: Yes, so I sent you a link. You should click on that link now and look at that.

Spolsky: Yeah, I saw that, it was uh... the most beautiful bad code I've ever seen in my life.

Atwood: Yes, there's somewhat of a tradition, unfortunately, of writing Markdown parsers in regexes. That definitely starts to have a downside. I'm a huge fan of regular expressions, but there's a point where it becomes extremely complicated code. I haven't been able to get anyone to really help me. Now, to be fair, this gets into issues of like running an open-source project. Now I am "running an open-source project." It's a very small one. And I solicited help and a lot of people have contributed patches and stuff and I really appreciate that, but one thing I've noticed is there's a lot of "painting the bike shed" that goes on versus the core problem of when you have this dense mass of code that's just a bunch of really complicated regular expressions; although, some of them aren't too complicated, but the flow of the program is very regex based. People are not really able to help you very much. That's what I've seen.

Spolsky: You mean they just can't parse it, figure it out.

Atwood: They can't or they don't want to, but the really hard part of the code I'm not getting a ton of help with.

Spolsky: I think that's always true with language. But, here's some things- I have to say a couple of things: one is this code is, it says copyright... based on John Gruber.

Atwood: Let me clarify, you're looking at the PHP Markdown. Now one of the problems- let me give you a little background: when I mentioned we have a reference Markdown standard, that's kind of the problem with Markdown. It is kind of a standard, John Gruber laid out the specification, but there's a lot of edge-conditions he didn't cover.

Spolsky: That always happens.

Atwood: There's just a lot of bugs. I mean a ton of bugs.

Spolsky: This makes me kind of wonder: is John Gruber a computer scientist in any way? Or did he just sort of...

Atwood: I don't know. I don't think you need to be a computer scientist to write code.

Spolsky: Wellll, this code you do. No I'm sorry, here's what I what I want to say, I feel like I may be talking out of school, but Markdown is not a regular language. And when you have a non-regular language, you can't use regular expressions to parse it. You can try, but there's gonna be not just edge-cases, but you're gonna have sort of infinite bugs basically. And anybody that has taken a compiler course, I think, looking at this problem of Markdown, would have said "I don't need a bunch of regular expressions, I need a lexer and a parser." Lexers and parsers are easy to write and there are utilities that generate them for you. So, unless I'm- actually I'm looking at this code and it has something called a parser...

Atwood: It's not really a parser.

Spolsky: Yeah. It's not a parser.

Atwood: No. And you definitely- as I said, this is the PHP implementation. What I found is that the PHP implementation is actually much better than the Perl implementation. Even the- there's some secret unreleased versions of the Perl implementation.

Spolsky: So the original one was Perl. Well, it was probably all the same regexps.

Atwood: The thing about the Perl implementation is it's really close. But it had edge-conditions that are super-super-hard to get rid of without writing a lot of complicated code. I think it's the classic example of Perl code in that it worked for the 95% case, but once you start looking at the unit tests that fail, to fix them is this rabbit hole of like-

Spolsky: If you start out by using Yacc and Bison and lex and those tools, that do lexical analysis and make a parser for you, using state transitions like a normal compiler writer does, you wind up with code that never has edge-conditions, and when it does, they're very easy to fix. It forces you to go back and fix your spec because then you go back and look at your spec which you did not write in BNF form because you don't know what BNF form is, because you're not a computer scientist, and you say "oh, there are going to be a million things I forgot to define because I'm not defining this rigorously, and I'm not defining this rigorously because I'm just writing a bunch of text in English. I should be writing-" (Joel started coughing)


Spolsky: I'm sorry, I didn't mean to criticize anyone in particular, it's just that the choice of- you know a lot people see a problem with Markdown, and they say "Ah, I need to search for certain things and replace them with other things." And I think that that's kind of- that the real way to look at that is- I mean you can do- you can go down that path, of regexps and I am searching for things and replacing them with other things, but when you do that, you're not really keeping track of what state you're in as you go through the tree and you make mistakes and there are edge-conditions and there are things that people can insert that will cause you to output things that are very, very invalid. And I think that somebody who has taken a compilers course would say "Oh, I have text that I have to translate into a different form, I need to lex it and parse it, and then I need to create an abstract syntax tree, and then I need write out that other form." This is not a lot of code and you wouldn't get a lot of code if you did it that way actually.

Atwood: There was a funny post on Reddit, a reaction to the blog post that I put up and he said "It became a tradition to have crappy implementations of Markdown." Because the reference implementation was a certain way so it kicked off a lot of clones, because people all just copy this. It really does work for the 95% case. The edge-conditions are not terribly bad.

Spolsky: Yeah.

Atwood: But fixing them is just unbelievably difficult and that's where you get into "If you want to do this the right way," then it is difficult to do with regular expressions.

Spolsky: I am speaking out of school here, somebody will come up with a proof of this, but I don't even think it's possible because the language is not regular and regular expressions are for regular languages.

Atwood: It's possible, it's just that the code becomes very, very, difficult to work with in my opinion. I'm certainly seeing that with the PHP implementation where they fixed a lot of the problems with the Perl 1.01 and the 1.02 the unreleased version. He had a different parser there, and it's really complicated.

Spolsky: Hey, if you're a computer science student and you're looking for a senior project, or just a project, do an implementation of Markdown as a compiler.

Atwood: I think there are actually, but the problem is I just did a cursory look. My goal was really simple, I sort of fell down the rabbit hole as I got- okay I'm just porting code, I'm not trying to write new code, that's not really my goal here. I just want to make sure I match the reference implementation. You have 2 problems: one is the reference implementation kind of sucks, it's not really right.

Spolsky: It's not "referency."

Atwood: It's not "referency" at all. So then you look at the alternative implementation which is PHP Markdown, honestly the most mature one, the one that's maintained the best, the most accessible, the one that I could find, and it follows the lead of the original implementation.

Spolsky: It is gorgeous code in the sense that it's very well structured, it's clean and there's no reason to be embarassed by this code.

Atwood: For PHP it's quite good.

Spolsky: The regexps are all broken out on multiple lines with millions of comments, stuff like that.

Atwood: Well, I sent Joel a link and I'll put this link in the show notes but that's the link to the HTML detection regex which is like, I would say on an average large programmers monitor, it's a regular expression that's probably 2 to 3 pages long. And it's used with whitespace, I mean it's broken up, it's probably the most complicated regular expression I've ever seen that's actually a real thing and not a joke.

Spolsky: Well they have those ones that match email addresses.

Atwood: I know, but I conside that one kind of joke. Nobody hopefully really uses that. But this was written by a human being and it's commented and uses whitespace and all the right things, just to show you how complicated it is, if you specify compiled on that regex it does not help it actually hurts in this case because the regex is so complicated. .NET freaks out on my machine for about 5 seconds, like trying to compile this thing.

Spolsky: (makes the sounds of a dying computer)

Atwood: It works, it does compile it, but it takes like- it literally just freezes; your CPU usage goes way up, and it kind of drives the regex compiler a little bit crazy I think. So it's quite a sight to see. It really highlights to me one of the big weaknesses of regular expressions which is matching pairs.

Spolsky: Yes! That's what it means to be regular or not regular.

Atwood: Yeah, that's really a pain in the butt. And that's what a lot of the hairiest code is balanced matching.

Spolsky: Right. Precisely what regular expressions, by being called regular expressions and not irregular expressions, were meant to signal to the people who knew what the word regular meant.




Last Modified: 1/12/2010 3:57 AM

You can subscribe to this wiki article using an RSS feed reader.