
unless you ever need to process data in a format for which a decent library doesn't already exist


Still no pedagogical value. Any smart undergrad can pick up how to write a parser with a little background reading and a couple of hours. The trick is that you spend the rest of your life fixing maddening corner-case bugs in the thing.

Parsers, from a practical perspective, are child's play for people who seek to eventually master a compiler.


Do you have an example in mind of such data?


New file formats of all kinds are invented every day, often with some proprietary tool attached. Writing your own parser allows you to add value and/or interoperate with the proprietary system. Even in the cases where an open source parser already exists it may have performance issues, or not support the latest version of the format. Being able to roll your own in those cases is empowering.

Of course you asked for examples... let me give that a try:

1) data from your favorite application that's been end-of-lifed and that you're thinking of replacing with a competitor's tool

2) data from a later version of your favorite application that you want to use with an earlier version, because you don't want to upgrade

3) configuration information from some part of your IT infrastructure that you need to refer to as you restructure and upgrade

4) a big config file for some software that contains an error somewhere, and "grep" won't find it. Maybe it's a semantic error, for example.

5) a config file for some ancient, crufty software you're replacing, where the file is huge and contains a lot of institutional knowledge, so you want to automatically translate it to the new system's setup

Being able to generate even simple parsers gives you a lot of power. It's not as uncommon as you might imagine.
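To make that concrete, here is a minimal hand-written parser in Python for a hypothetical INI-like config format (the section/key=value grammar is invented for illustration). Note that it reports the line number of the first syntax error, which is exactly the kind of thing the "grep won't find it" scenario above needs:

```python
import re

def parse_config(text):
    """Parse a minimal INI-style config (hypothetical format) into nested dicts."""
    config, section = {}, None
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        m = re.match(r"\[(\w+)\]$", line)     # section header: [name]
        if m:
            section = config.setdefault(m.group(1), {})
            continue
        m = re.match(r"(\w+)\s*=\s*(.*)$", line)  # assignment: key = value
        if m and section is not None:
            section[m.group(1)] = m.group(2)
            continue
        raise ValueError(f"syntax error on line {lineno}: {line!r}")
    return config
```

A call like `parse_config("[db]\nhost = localhost")` yields `{"db": {"host": "localhost"}}`, and a malformed line raises with its line number.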


Having been in the business of reversing data formats in two different real-world contexts, I feel comfortable saying that just about the last thing I would do is write a parser. One context was building a system to pull live financial data feeds. The other is in the software security business. In the former, often CSVs were what you would get, or fielded data. We built an engine that could easily inhale these, including the Bloomberg feed, which was unusually complex.

In the security business, one is often asked to assess some not-very-well-specified protocol, or some protocol for which there is no documentation. So to deal with it you 1) fuzz the hell out of it to make the endpoint fall over, or 2) hexdump the protocol and write pieces of it in ruby or python to get messages through, so that you can fuzz the hell out of it in a structured way.
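The quick-and-dirty tooling described here might look like the following Python sketch. The framing (a 2-byte type field, a 4-byte length, then a payload) is invented for illustration; in practice you'd infer it from hexdumps:

```python
import struct

def encode_message(msg_type: int, payload: bytes) -> bytes:
    """Reassemble a message in the (hypothetical) wire format: big-endian
    2-byte type, 4-byte length, then `length` bytes of payload."""
    return struct.pack(">HI", msg_type, len(payload)) + payload

def decode_message(raw: bytes):
    """Pull the type and payload back out of a captured message."""
    msg_type, length = struct.unpack_from(">HI", raw, 0)
    payload = raw[6:6 + length]
    return msg_type, payload
```

With helpers like these you can replay captured messages with mutated fields, which is the "fuzz it in a structured way" step.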

And if there was some need to write a parser, you can bet it ain't gonna be LALR, it will be hand-crafted, likely recursive descent.
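For a sense of what that hand-crafted recursive-descent style looks like, here is a minimal sketch for a toy arithmetic grammar (the grammar itself is invented, not anything from the thread):

```python
def parse_expr(tokens):
    """Recursive-descent evaluator for a toy grammar:
       expr -> term (('+'|'-') term)* ; term -> NUMBER | '(' expr ')'
    One function per nonterminal, each returning (value, next_index)."""
    def expr(i):
        value, i = term(i)
        while i < len(tokens) and tokens[i] in "+-":
            op = tokens[i]
            rhs, i = term(i + 1)
            value = value + rhs if op == "+" else value - rhs
        return value, i

    def term(i):
        if tokens[i] == "(":
            value, i = expr(i + 1)
            assert tokens[i] == ")", "expected ')'"
            return value, i + 1
        return int(tokens[i]), i + 1

    value, i = expr(0)
    assert i == len(tokens), "trailing input"
    return value

parse_expr(["1", "+", "(", "2", "-", "3", ")"])  # → 0
```

The appeal of this style over a generated LALR table is that the code mirrors the grammar directly, so it is easy to extend and to debug with a stack trace.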

To reply to each of your points:

1) If you are lucky, this is XML. I don't need to know how to write a parser if the data is XML. If it is some sort of Java serialization, dejad is your friend--no parser required. If it is binary, you are going to use the protocol-reversing route mentioned above.

2) See #1

3) Maybe just insert parentheses around the whole bit of data, and insert more strategically, and you are all but done.

4) See #3 or #1.

5) See #4.
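Point 1 in practice: when the data is XML, the standard-library parser does all the structural work. A Python sketch with an invented export format (the tag names here are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical export from an end-of-lifed application.
doc = ET.fromstring("""
<contacts>
  <contact><name>Ada</name><email>ada@example.com</email></contact>
  <contact><name>Alan</name><email>alan@example.com</email></contact>
</contacts>
""")

# No parser written; just walk the tree and extract what the new tool needs.
rows = [(c.findtext("name"), c.findtext("email"))
        for c in doc.iter("contact")]
```

From `rows` it is a short step to a CSV or an import script for the competitor's tool.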

If I were working on a team and I saw someone writing a parser for a data-related problem, I would seriously question what they were doing.


It's nice that things have gone in the direction of XML and JSON lately, and many people devise formats that build upon those. I was thinking more of arbitrary text formats. Even if it's XML or JSON though, the existing parsers only handle comprehending the structure of the data itself. You have to write some semantic analysis on top of those, because the standard "parser" will just give you the input data as a tree - but the semi-standard format certainly helps a lot.
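That division of labor, structural parsing handled by the library, semantic analysis written by you, can be sketched in Python. The "job" schema below is invented for illustration:

```python
import json

def load_job(text):
    """json.loads hands back raw nested structures; the semantic checks
    (required keys, types, ranges) are ours to write on top."""
    data = json.loads(text)  # structural parsing: handled for us
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    if not isinstance(data.get("retries"), int) or data["retries"] < 0:
        raise ValueError("'retries' must be a non-negative integer")
    return data
```

The JSON parser never complains about a missing "name" or a negative retry count; that layer is the semantic analysis the comment describes.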

I do think you overestimate the work required to write a parser for simpler formats. For someone familiar with one of the popular parser generators this can be a handful of hours, and the quality of the result should be much higher than with an ad hoc method. This can be a good design decision.


No, I don't overestimate it, because I have done it; I know how long it takes, and what one gets from it.

There are simpler ways, as I point out above.



