Spolsky: I was just in a meeting a second ago with this guy who made a StackExchange called ClimateDeal. Have you heard about that?
Atwood: I haven't heard of it.
Spolsky: climatedeal.org. It's a climate change StackExchange and I guess there's lots of money in the NGO business.
Atwood: Really? I didn't know anything about that. And another thing I don't know anything about, that somebody mailed me about and that I want to mention since we're on the topic of StackExchanges, is an astronomy StackExchange. Not just astronomy but...
Spolsky: Like I'm a libra and you're a leo therefore...
Atwood: Actually Joel, I'm a capricorn though, and capricorns are very stubborn as you know.
Spolsky: And I'm a leo. I know that you're stubborn I don't know about capricorns in general.
Atwood: That's the crazy thing about astrological science. Everybody who is born in a specific month has the same personality traits. That doesn't even make any sense at all. It doesn't even pass the sniff test of like: Is this sensible? No. This is like turning-lead-into-gold ridiculous.
Spolsky: What's real is biorhythms.
Atwood: This is a real thing; it's actually astro-tech. I guess it's technical astronomy: interfacing with telescopes and astronomical instrumentation. It's answers.ascom-standards.org. I guess they had an existing site.
Spolsky: Terrible URL...
Atwood: So if you're into astro-tech...
Spolsky: Woah, look at all these people, look at all these questions. Is ascom a thing? I think it might kind of be a thing because there are people asking about ascom on here.
Atwood: Yeah, I think it is a thing. This is, when we created StackExchange we were looking at niches and I'm a big fan of these little niches on the internet, I think it's wonderful.
Spolsky: We got to tell the listeners. Ascom is a many-to-many and language-independent architecture supported by most astronomy devices that connect to computers. So it sounds like it's like MIDI of telescopes.
Atwood: Yes, that makes sense.
Spolsky: That's what it sounds like.
Atwood: But only if it can play music Joel.
Spolsky: Except that it doesn't play music, it plays telescopes. And some of the names here I recognize from StackOverflow, so I think that there's some overlap. Maybe not, maybe it's just because there are people named Chris and Bob and stuff. It is sort of interesting how these StackExchanges spread, like the first level and the second level of spreading. I asked Jose, who's here from ClimateDeal, how he heard about StackExchange and why he decided to start using it. He said they're kind of building a whole organization around StackExchange, and they're going to promote it to other climate change type organizations and other "green" organizations that they know. And I asked him how he found out about it and he's like, "We were working with these programmers and they suggested that we look at this and they use StackOverflow." Like, all the programmers on StackOverflow have an obligation to tell other industries and get them excited about the StackOverflow vision for the future.
Atwood: Anything to keep people off of PHP. It's like keeping people off drugs, it's the right thing to do.
Atwood: And when I say PHP I mean PHPBB. I'm specifically talking about PHPBB.
Spolsky: You know that 99.999% of PHP is not PHPBB.
Atwood: I know. There's a ton.
Spolsky: Let's keep people off that too.
Atwood: Well, not necessarily, I've kind of resigned myself to a world of PHPBB at this point. So, one thing I want to talk about is over the holidays I was able to contact the person who created the Markdown... there's 2 Markdowns: Markdown is the markup language that we use on StackOverflow. There's 2 implementations that we have: one is for the client side preview, which is the wmd control which we had to reverse-engineer... the whole story is in a previous podcast. And then there's the server-side implementation. So one of the difficulties we ran into was these are subtly different.
Spolsky: So the previews are not matching when it shows up on the site?
Atwood: Right. Over the holiday I did improve the preview, like the main areas that were just kind of oversights really. Like we changed some of the rules about bold and italic and for the most part they match now except for really weird edge-cases. I got rid of all the obvious mismatches.
Atwood: And actually I got help from Jacob (I'm going to mispronounce his name, so I'm just going to call him Jacob), who runs the MathOverflow StackExchange. He was very helpful in sort of troubleshooting that. They use a lot of weird syntax on MathOverflow.
Spolsky: They do amazing things with math notation basically. They use LaTeX.
Atwood: Yes, but, we've talked about that before, but in addition to that they just have ASCII notation as well, and the ASCII notation can be problematic because you're putting characters in sequences that are just really, really uncommon in any kind of normal text at all. So they were running into a lot of edge-conditions as well so he was very helpful so I do want to give a shout out to Jacob in that regard.
Spolsky: Have you looked at MathOverflow lately? It's absolutely insane. Look at all these tags, they've got tags with little dots in them. Why is that? Oh, I think that the dot is like the...
Atwood: Don't you understand Joel? I'm kind of like allergic to math so it's not really good for me to be around math.
Spolsky: Look at the site and look at their tag cloud over there. What they're doing is, they've got like 2 letter abbreviations for tags. So it's like a 2-letter abbreviation, a dot, and then the name. So it's fa.functional-analysis. Or ra.rings-and-algebras.
Atwood: Look what you've done Joel, you've made me go to a math site.
Spolsky: Are you listening to a word I'm saying?
Atwood: I am listening! I'm just trying to tell you...
Spolsky: Look at the tags, this is a general idea that they seem to have invented here. So now if you want to look at probability stuff you don't have to type 'probability' you just type 'pr.' and then it only has one match. You see? Get it?
Atwood: The Hawaiian Earring?
Spolsky: Look at the tags!
Atwood: I am looking at the tags, I see what you're talking about. I've processed that.
Spolsky: You see what they've done? They have this cool feature, that you can just type like 3 letters and it'll only have a unique match.
Atwood: Very rapidly yeah. Although we do match anywhere.
Spolsky: I know but you'll have multiple matches because those 2 letters... because they put that little dot in there, this means that if you just know the 2-letter code for something, you're just going to hit the 2-letter code and you're done.
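The tag trick discussed here can be sketched in a few lines of Python. This is only an illustration: the tag list is made up for the example (the first two entries are the ones quoted above), and `match_anywhere` is a hypothetical stand-in for the "match anywhere" lookup Atwood mentions, not the site's actual code.

```python
# Illustrative tag list; "probability-distributions" is a made-up tag added
# to show what an ambiguous match looks like.
TAGS = [
    "fa.functional-analysis",
    "ra.rings-and-algebras",
    "pr.probability",
    "probability-distributions",
    "st.statistics",
]


def match_anywhere(query, tags):
    """Substring lookup: a tag matches if the query appears anywhere in it."""
    return [tag for tag in tags if query in tag]


# Two letters alone are ambiguous, but the two-letter code plus the dot is not.
print(match_anywhere("pr", TAGS))
print(match_anywhere("pr.", TAGS))
```

Because the dot almost never appears elsewhere in a tag name, the short code plus the dot acts as a unique key even when the lookup matches substrings.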
Atwood: Yep. Cool. MathOverflow's great, it's been hugely popular. There's definitely been demand for it from way back.
Spolsky: I don't understand anything. Like nothing. Hawaiian Earrings, I know what those are.
Atwood: I'm glad there's people smart enough to do this advanced math because I really, really suck at it.
Spolsky: I'm voting that up.
Atwood: Wow, you can vote on MathOverflow?
Spolsky: No, it didn't let me. I need to talk to Aaron, I want to be able to vote on MathOverflow.
Atwood: So anyway, MathOverflow is fantastic and Jacob's the guy who's been helping us out on that. The server-side implementation was where I wanted to do some additional work and it wasn't actually an open-source project. I don't think it was intentional, but the original author did not present it under an open-source license, which means, as you know: if it's not open-source it's copyrighted by default. So I contacted him and he was totally cool about it and he granted the copyright to me. So I was then able to turn around and open-source that and put it up as Markdown Sharp on Google Code and I'll link that in the show notes. I was able to make quite a bit of progress. You know, we're a little bit down on unit-testing, but this is like a textbook example of where you want unit-tests. One of the first things I did was put in unit-tests. Unit-tests for Markdown are pretty simple, they're basically just input and output. You have an input file which contains Markdown and you pass it through the processor and the output should match the reference.
Spolsky: This is an awesome example of where it's straight text transformation, it's so easy to do automated tests, unit-tests, TDD and all that kind of stuff.
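The input-and-reference style of test described here might look roughly like the following Python sketch. Everything in it is an assumption for illustration: `toy_convert` is a hypothetical stand-in that handles only `*emphasis*` (it is not MarkdownSharp or any real Markdown processor), and the cases are held in memory rather than in input and reference files.

```python
import re


def toy_convert(markdown_text):
    """Hypothetical stand-in for a Markdown processor: handles *emphasis* only."""
    body = re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", markdown_text.strip())
    return "<p>" + body + "</p>"


def run_cases(convert, cases):
    """Return the names of cases whose converted output differs from the reference."""
    failures = []
    for name, source, expected in cases:
        if convert(source) != expected:
            failures.append(name)
    return failures


# Each case is (name, Markdown input, reference HTML output).
CASES = [
    ("plain", "hello", "<p>hello</p>"),
    ("emphasis", "a *b* c", "<p>a <em>b</em> c</p>"),
    # The toy converter does not implement **strong**, so this case fails.
    ("strong", "a **b** c", "<p>a <strong>b</strong> c</p>"),
]

print(run_cases(toy_convert, CASES))
```

The appeal of this shape is that adding a test means adding one more input and reference pair, with no new test code.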
Atwood: It's brilliant, because I found just an unbelievable number of bugs... oh my gosh I found a lot of bugs. Bugs, like, in our port, just accidental bugs. Literally just like an extra space in the regex in the wrong place.
Atwood: And it was causing it, it wasn't causing it to break, but it was causing like failure-to-match and that was causing the output to be subtly wrong. Not in a way that really broke anything per se but it was wrong. I fixed that, and there's a lot of bugs from the actual implementation, the Perl implementation. The original implementation of Markdown is Perl.
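As a rough illustration of how one stray space in a regex can cause a quiet failure-to-match, consider the Python sketch below. Both patterns are invented for this example and are not taken from the actual port; the only point is that a literal space in a pattern is a required character unless the engine is told to ignore pattern whitespace.

```python
import re

# Illustrative heading patterns only; neither comes from the real code.
INTENDED = re.compile(r"^(#{1,6})[ \t]*(.+?)[ \t]*#*$")
PORTED = re.compile(r"^(#{1,6}) [ \t]*(.+?)[ \t]*#*$")  # stray literal space


def heading_text(pattern, line):
    """Return the heading text if the line matches, otherwise None."""
    match = pattern.match(line)
    return match.group(2) if match else None


for line in ("# Title", "#Title"):
    print(repr(line), heading_text(INTENDED, line), heading_text(PORTED, line))
```

Nothing crashes with the second pattern. A heading written without a space after the `#` simply stops being recognized, which is the kind of subtly wrong output that a reference-output test catches.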
Atwood: Yes, so I sent you a link. You should click on that link now and look at that.
Atwood: Yes, there's somewhat of a tradition, unfortunately, of writing Markdown parsers in regexes. That definitely starts to have a downside. I'm a huge fan of regular expressions, but there's a point where it becomes extremely complicated code. I haven't been able to get anyone to really help me. Now, to be fair, this gets into issues of like running an open-source project. Now I am "running an open-source project." It's a very small one. And I solicited help and a lot of people have contributed patches and stuff and I really appreciate that, but one thing I've noticed is there's a lot of "painting the bike shed" that goes on versus the core problem of when you have this dense mass of code that's just a bunch of really complicated regular expressions; although, some of them aren't too complicated, but the flow of the program is very regex based. People are not really able to help you very much. That's what I've seen.
Atwood: They can't or they don't want to, but the really hard part of the code I'm not getting a ton of help with.
Atwood: Let me clarify, you're looking at the PHP Markdown. Now one of the problems- let me give you a little background: when I mentioned we have a reference Markdown standard, that's kind of the problem with Markdown. It is kind of a standard, John Gruber laid out the specification, but there's a lot of edge-conditions he didn't cover.
Atwood: There's just a lot of bugs. I mean a ton of bugs.
Atwood: I don't know. I don't think you need to be a computer scientist to write code.
Atwood: It's not really a parser.
Atwood: No. And you definitely- as I said, this is the PHP implementation. What I found is that the PHP implementation is actually much better than the Perl implementation. Even the- there's some secret unreleased versions of the Perl implementation.
Atwood: The thing about the Perl implementation is it's really close. But it had edge-conditions that are super-super-hard to get rid of without writing a lot of complicated code. I think it's the classic example of Perl code in that it worked for the 95% case, but once you start looking at the unit tests that fail, to fix them is this rabbit hole of like-
Atwood: There was a funny post on Reddit, a reaction to the blog post that I put up and he said "It became a tradition to have crappy implementations of Markdown." Because the reference implementation was a certain way so it kicked off a lot of clones, because people all just copy this. It really does work for the 95% case. The edge-conditions are not terribly bad.
Atwood: But fixing them is just unbelievably difficult and that's where you get into "If you want to do this the right way," then it is difficult to do with regular expressions.
Atwood: It's possible, it's just that the code becomes very, very, difficult to work with in my opinion. I'm certainly seeing that with the PHP implementation where they fixed a lot of the problems with the Perl 1.01 and the 1.02 the unreleased version. He had a different parser there, and it's really complicated.
Atwood: I think there are actually, but the problem is I just did a cursory look. My goal was really simple, I sort of fell down the rabbit hole as I got- okay I'm just porting code, I'm not trying to write new code, that's not really my goal here. I just want to make sure I match the reference implementation. You have 2 problems: one is the reference implementation kind of sucks, it's not really right.
Atwood: It's not "referency" at all. So then you look at the alternative implementation which is PHP Markdown, honestly the most mature one, the one that's maintained the best, the most accessible, the one that I could find, and it follows the lead of the original implementation.
Atwood: For PHP it's quite good.
Atwood: Well, I sent Joel a link, and I'll put this link in the show notes, but that's the link to the HTML detection regex which, I would say, on an average large programmer's monitor is a regular expression that's probably 2 to 3 pages long. And it uses whitespace, I mean it's broken up; it's probably the most complicated regular expression I've ever seen that's actually a real thing and not a joke.
Atwood: I know, but I consider that one kind of a joke. Hopefully nobody really uses that. But this was written by a human being, and it's commented and uses whitespace and all the right things. Just to show you how complicated it is: if you specify compiled on that regex, it does not help; it actually hurts in this case because the regex is so complicated. .NET freaks out on my machine for about 5 seconds trying to compile this thing.
Atwood: It works, it does compile it, but it takes like- it literally just freezes; your CPU usage goes way up, and it kind of drives the regex compiler a little bit crazy I think. So it's quite a sight to see. It really highlights to me one of the big weaknesses of regular expressions which is matching pairs.
Atwood: Yeah, that's really a pain in the butt. And that's what a lot of the hairiest code is: balanced matching.
Atwood: In fact, there's some tricky way in Perl to do this that doesn't really work in .NET or PHP's regular expression engine. So, what you have is essentially you loop through N times, and you can only match N deep. So there's actually that repeat string call you can see in there, which is repeating match ( or match [. It will only match N deep, where N is in this case 6, because it's literally concatenating 6 strings together to get a recursive match. There's some super-weird way to do it in .NET though, some variant of the regex syntax, because regex implementations are not all the same, and .NET does one special thing: it has a form of balanced matching, but I've never really been able to use it. It's kind of weird. It's just not an area of strength for regular expressions. Any time you have to match things that are balanced, it's just really awkward, and there's tons of that in the Markdown parser.
Spolsky: The people that try to do things to HTML with regular expressions...
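The depth-limited trick Atwood describes, repeating a string N times to fake recursion, can be sketched in Python as follows. This is written in the spirit of that description rather than copied from any Markdown implementation; the function names and the exact pattern fragments are assumptions made for the example.

```python
import re


def nested_brackets_pattern(depth):
    """Build a pattern for text whose [ ] pairs nest at most `depth` levels deep."""
    opening = r"(?:[^\[\]]|\["   # either a non-bracket character, or "[" ...
    closing = r"\])*"            # ... followed by the inner pattern and "]"
    innermost = r"[^\[\]]*"      # the deepest level allows no brackets at all
    # Plain string repetition stands in for the recursion regexes lack here.
    return opening * depth + innermost + closing * depth


def is_balanced_up_to(text, depth):
    """True if the whole text is balanced and nests no deeper than `depth`."""
    return re.fullmatch(nested_brackets_pattern(depth), text) is not None


print(is_balanced_up_to("a[b[c]d]e", 2))   # two levels fit in depth 2
print(is_balanced_up_to("a[b[c[d]]]", 2))  # three levels do not
```

The limit is baked into the pattern text: anything nested deeper than the chosen depth just fails to match, which is why a fixed value such as 6 is a compromise rather than a real solution to balanced matching.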
Atwood: Yeah, and that's what we're doing. That giant regex, that 3-screen regex...
Spolsky: Have we already talked about that very, very, very famous StackOverflow question and answer?
Atwood: Of course.
Spolsky: Yeah, we talked about that one. Madness! It's like, when you try to do this, at some point, you're the computer in Star Trek and you've just been asked to calculate pi to the last digit. You're like "Okay! I can do this! Look I'm getting more digits! The more I work on it the more digits I'm going to get, this is bound to terminate at some point."
Atwood: Right, well let me be clear: we did improve it. When I started the Markdown C#, it was passing 15 out of 23 unit tests. I got it, without taking on this really hard part (this is the really hard part, to be clear, this block parser; it's the last piece of the puzzle), to the point where it was passing 21 of 23 tests. There's only 2 tests that are failing, and it's basically an issue of nesting with tags, where it can't quite figure out the right order, because of the whole balanced matching thing.
Spolsky: Yeah, you could have- think of HTML, you have <"...>... that doesn't close the angle-bracket. Right? You've got <img src="...>">. That > was inside quotes, it doesn't close the < because it's quoted, it's part of the quotes. And people will do that and then they'll expect that that > is part of the quotes, it's not closing the tag. In a regular expression, there is literally no way, maybe there is some way in the Microsoft version, there is literally no way to be in a different state inside those quotes than you are outside of them because of the whole problem with matching.
Atwood: Absolutely. This gets to the heart of the problem.
Spolsky: And that stuff is sooo trivial with a state machine. Soo easy. So, so easy to know what state you're in. It just happens automatically when you write the code the right way. This is, I think, one of those few, few instances where the things that are part of the standard computer science curriculum are really an important part of the working knowledge of working programmers.
Spolsky: That's my official opinion on this matter.
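Spolsky's point about being "in a different state" inside quotes can be illustrated with a very small state machine. The Python sketch below is a minimal example under stated assumptions: `find_tag_end` is a hypothetical helper that only tracks whether it is inside a quoted attribute value, and it is nowhere near a full HTML parser.

```python
def find_tag_end(html, start):
    """Return the index of the '>' closing the tag that opens at html[start], or -1."""
    if html[start] != "<":
        raise ValueError("start must point at '<'")
    quote = None  # None means outside quotes; otherwise the active quote character
    for index in range(start + 1, len(html)):
        char = html[index]
        if quote is None:
            if char in "\"'":
                quote = char   # enter the quoted state
            elif char == ">":
                return index   # a '>' outside quotes ends the tag
        elif char == quote:
            quote = None       # the matching quote returns to the unquoted state
    return -1


html = '<img src="a>b.png" alt="x"> tail'
end = find_tag_end(html, 0)
print(html[: end + 1])
```

A naive search for the first `>` would stop inside the `src` value; the single `quote` variable is the only state needed to skip it.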
Atwood: Right. No, I agree, you're essentially creating a language so, I think you're at the state where you need to do that. So, anyway, enough about that. It is an open-source project, we have improved it, we fixed a ton of bugs in it.
Spolsky: Maybe I should make a prize for the computer science student that re-implements Markdown as a proper...
Atwood: There are some out there. Somebody referenced one in Haskell, but I don't even know Haskell, like, even a little, so...
Spolsky: That should be really short because Haskell is super-optimized for that particular task.
Atwood: And that doesn't really help me because I need to convert it to C# so...
Spolsky: Yeah, you would have to use... what's the- you could convert it fairly easily to F#.
Atwood: I see. Well maybe someone listening will take an interest in this and will translate that. Like I said, it's open-source so you can contribute, I do take patches. Running an open-source project has been interesting. I haven't really done that before actually, the whole concept of: you have to take patches. I was a little disappointed that there were not very many people willing to help me with these really hard regex problems, but a lot of people will comment on, like, the way I structured the project, the way I stored my files. You know these aren't really the problems that I really need help with. I appreciate, don't get me wrong, there was some good feedback about how to lay out the project, and I agreed with a lot of it, but it wasn't really helping me fix the failing unit tests. You know what I mean?