Unix Text Processing (1987) (oreilly.com)
108 points by rhollos on Dec 11, 2012 | hide | past | favorite | 63 comments


I sought permission from Tim O'Reilly, who co-wrote this back when he was an author, to re-enter the text of the book as troff source, the original having been lost. He kindly gave it, and volunteers from the groff@gnu.org mailing list split the chapters amongst themselves; it was quite fun. http://home.windstream.net/kollar/utp/ is the result.


Thanks. This is the best version by far. Is it also possible to generate a PDF outline with hyperlinks so it is easier to navigate the document?


That's a good point. Back when it was created, doing that with the toolchain wasn't so straightforward as I understand it is now; I'll raise it on the groff@gnu.org list.


You can generate PDF (without the hyperlinks) with standard psutils (e.g.: ps2pdf). A good PDF reader will offer text search, which is a bit of a help.

Otherwise, I'd have to investigate groff and text conversion utilities to see if the metadata are readily accessible. Interesting question, though I suspect it might take a slight rewrite.


PDF is provided on the book's page I gave above. You're correct, small mark-up changes are needed to make use of the PDF-linking macros now available that weren't then.


Thank you for taking the pains/effort. :-)


I see resources about Unix text processing utilities, about Bash, about readline shortcuts, etc. etc. submitted very often to HN.

Exactly how useful are they? Would you say they're among the most important skills a programmer could have? Or do we just have a disproportionately large number of sysadmins among HN readers? Isn't some general-purpose language like Python or Ruby almost always much better and much faster than these solutions? Isn't it worth it to invest your time in learning Python well instead of getting familiar with the fragmented and messy environment of modern Unix-land? Certainly, the learning curve is much steeper for Unix utilities than for, say, Python.


For many simple tasks, scripting languages still usually represent a substantial typing overhead compared to throwing together a pipeline of standard Unix tools on the command line.

The moment I see I'll need to do something many times, I'll consider writing a script, and then I'll often pick Ruby. But for one-off stuff, the command line is often faster once you get comfortable with a handful of Unix tools.

In fact, I find that even when I do something many times, the mental overhead of remembering "yet another script" is often high enough to make it faster to just re-compose the command line I want.

For example, I very frequently do some variation on "grep [some term] | sort | uniq -c | sort -n" to get a list sorted by number of occurrences of [some term], but the key part is "some variation", and that makes adding and remembering an alias less useful.
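As a self-contained illustration of that pipeline shape (the fruit names here are made-up sample input standing in for grep output):

```shell
# Count occurrences of each distinct line, least frequent first:
# sort groups duplicates, uniq -c counts them, sort -n orders by count.
printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' |
    sort | uniq -c | sort -n
```

In practice the printf is replaced by a grep over whatever log or file is at hand.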

Another big consideration is that these tools are present on all or most machines many of us use.

For larger applications, having to install other packages is often no big deal, but for example I don't want to find myself dealing with an emergency and suddenly having to pull down tons of packages to use Ruby because I'm not comfortable with the tools that are already on the machine.


A number of things make me more inclined to look at writeups on using the command line, and command-line tools, than those on higher-level languages like Python.

First, many of the things I want to automate are most naturally done at the command line. For these, I already know the commands I want to run, and just need to script the logic that glues them together. Second, I mostly want small scripts I can send to colleagues, and not have to worry about whether or not they have Python (or whatever) installed.

I do write scripts in Python and Ruby, but they tend to be longer, since they reflect tasks where the data have to be pulled apart and put back together in multiple ways (for example, a dependency-generator for some custom makefiles I maintain). This sort of task favors building up structures in memory, over pipelines, and I use the appropriate tool accordingly.

As for the question whether any of these tools are the most important skills a programmer could have, no, I don't think so. As a programmer, your most important skills are in the language you use all the time, the language your "deliverable" is written in. Most people probably don't use the unix tools as their primary programming platform. But those tools support and extend the environment in which we get our "real" programming done.

To borrow the woodworking analogy from (I believe) the Pragmatic Programmers, all those articles about the Unix tools are probably the equivalent of articles on keeping your chisels and saws sharp. No, the file you use to keep your chisel sharp isn't your most important tool; your chisel or saw is your most important tool. But the craft of keeping your most important tool sharp can be fun and rewarding in itself.

By the way, speed is never an issue with any of the scripts I'm likely to run. But if it was, I doubt Python would be faster, since those scripts involve a lot of calls on system resources.


I used to have that attitude, and Python is still my main language. But I do more and more stuff in shell now. Every project I write has a whole bunch of shell snippets now, which saves a ton of code and lowers the barrier to automation considerably.

If you're trying to write shell scripts in Python, they're going to be 3-5x longer. Bash is a higher level language than Python. Every tool has its place, and bash and Python are complementary.

Also I realize there is some useless ceremony in Python standard practice. Do you want to know what my Python test runner looks like now?

$ find . -name \*_test.py | sh -x -e

That's it... no BS. I don't know what people are using in the Python world these days but some of it has drifted toward "framework land".

Also, it's not very hard to make this parallel, whereas it is somewhat annoying in Python. And you don't have to worry about global variables polluting each other -- tests stay independent.
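A sketch of what the parallel variant might look like, assuming an xargs that supports the widespread (but non-POSIX) -P flag; the demo scripts created here are hypothetical stand-ins for the *_test.py files above:

```shell
# Demo setup: two tiny executable "test" scripts in a scratch directory.
dir=$(mktemp -d)
printf '#!/bin/sh\necho "a ok"\n' > "$dir/a_test.sh"
printf '#!/bin/sh\necho "b ok"\n' > "$dir/b_test.sh"
chmod +x "$dir"/*_test.sh

# Run each discovered test in its own process, two at a time.
# -print0/-0 keep odd file names intact; -n 1 gives one file per process.
find "$dir" -name '*_test.sh' -print0 | xargs -0 -n 1 -P 2 sh -c '"$0"'
```

Each file name lands in $0 of the per-file sh -c invocation, which simply executes it, so tests stay independent processes just as in the serial pipeline.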

I test a big part of my C/C++ code with shell scripts as well.

So basically I find it very helpful to think of yourself as writing shell utilities in Python. Python's not your world. It's part of your world.


Automating stuff at the command line is super important. If it's easier for you to do in Python, then that's fine. I use awk and bash for a lot of that stuff because it's what I happen to know.

Other people I know use perl. The important thing is to be able to automate anything non-trivial that you will do more than a single-digit number of times. One time I spent 30 minutes writing a script for a task that a dozen people were doing a dozen times a day. It maybe saved a minute each time, but that's over 2 hours in the first day we had it. It paid off in time for me in 3 days, and probably less than that in terms of "damnit this is boring".

For me, at least, Python has much higher friction to get from "here's the list of what I do" to "here's an automated version".

I do use Python for some of the more involved things (manipulating timestamps from bash is a pain), and I've converted some bash scripts into "real language" scripts when things started to get hairy, but most of the time, you get 90%+ of the time savings from the first 1% of effort.


Python scripting is an investment; eventually it becomes as frictionless as bash scripting, especially if you have a good shell-scripting library.

I usually use Fabric, SaltStack, or https://github.com/kennethreitz/envoy


Just yesterday I was able to write a one liner on the command line to analyze my skype usage. I wanted to see if it was making financial sense to keep skype alongside my prepaid wireless plan.

You can download your last six months of skype activity from skype.com and they are in the form:

    Date;Date;Item;Destination;Type;Rate;Duration;Amount;Currency
    "July 31, 2012 21:16";"2012-07-31T21:16:01+00:00";"+11234567890";"USA";"Call";0.000;00:00:10;0.000;USD
    "July 31, 2012 21:15";"2012-07-31T21:15:38+00:00";"+11234567890";"USA";"Call";0.000;01:17:02;0.000;USD

After 15 minutes or so, I came up with the following one liner:

cut -f7 -d";" call_history* | grep -v "Duration" | awk 'BEGIN { FS=":" } { s+=$1*60; s+=$2; if ($3 != 0) { s+=1 } } END {print s " minutes"}'

I could have done the same thing in perl or python in 5 minutes, but it was interesting to "program" only by hooking programs together to achieve the same thing.

After analyzing my skype logs, I found I used 1800 minutes. That would have cost me $180 with prepaid minutes, but only cost $30 with skype.


Another way would be all awk.

    awk -F\; '
        $7 != "Duration" {
            split($7, t, ":")
            s += t[1] * 60 + t[2] + (t[3] != 0)
        }
        END {print s + 0}
    '
Note the handling of s == "" in END.


The sed and dc combo is perhaps not as readable, but there was a time before the dawn of the One True Awk. :-)

    sed '1s/.*/0/; s/;[^;]*$//; s///; s/.*;//
        s/00$//; s/:..$/+1++/; s/:$/++/; s/:/ 60*/; $s/$/p/' |
    dc


Yeah, it's a good idea to learn Python well. And perl.

But... sitting here in this cubicle, in front of a PuTTY window on a production server for a large financial institution, manipulating text files, I don't have access to either of those. Ruby is not installed.

Most of the time cut, comm, paste, diff, tr, and ed|sed do my job. When they aren't enough, there is an old awk binary. So, it depends on the job and the tools you have. IMO every developer should learn the bare minimum about UTP utilities. It won't hurt.


Depends, for one thing, on how frequently you intend to use it. Consider variable names: it is often said that it is desirable to have nice, long, explanatory names. But again it depends. For variables that have big scopes, it is indeed important to have descriptive names, but for those that disappear after a line, it's OK to use i, j, k. In fact, short names in such cases increase comprehension rather than impede it. The reverse holds as well.

   > Certainly, the learning curve is much steeper for Unix
   > utilities than, say, Python
I am not certain about that at all. Yes, they may have a truckload of options, but I have never had to memorize them. Many point out that these tools don't strictly adhere to the "do one thing and do it well" philosophy, but they do to a satisfying degree of approximation. Coreutils, textutils, find, and xargs can go a really, really long way.
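For instance, a minimal, made-up example of the kind of find/xargs combination meant here:

```shell
# Demo: count total lines across all .log files under a directory,
# using only find, xargs, cat, and wc.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.log"
printf 'three\n' > "$dir/b.log"

# -print0/-0 make the pipeline safe for file names with spaces.
find "$dir" -name '*.log' -print0 | xargs -0 cat | wc -l
```

The same skeleton, with grep or sed in place of cat, covers a surprising share of everyday search-and-transform tasks.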


> Certainly, the learning curve is much steeper for Unix utilities than, say, Python.

Now what makes you think that? UNIX shell tools have a very consistent interface which is really pretty simple (pipes in, pipes out), they are well documented, and they're interactive and easy to play around with, with zero setup.

Also, the UNIX shell has been around for 30 years now and isn't going anywhere. So what's the best investment? Python will change more in the next five years.


I would say they are among the most important skills a programmer could have - consider the amount of time I have saved by being able to do fast manipulation with chains of things like sort, tr, sed, awk, head, tail, paste, strings, nm, hexdump, xargs, etc. and get diagnostic information for deeper issues.

it's like anything else; each tool is just a building block. the number of creative things you can do with very little effort by piping the output of one to the input of another is incredibly powerful. it's not always computationally efficient, but many times that doesn't matter.

Python is another great tool, but sometimes a machete is quicker and easier. Different tools for different situations.


Honestly, I think it's a little of column A and a little of column B. While sometimes it can be useful and faster to bust out a one line bash shortcut, I think the HN community is also a bit guilty of over-romanticizing Unix/Bash/<Insert non-visual & somewhat geeky tool/language>.


About non-visual stuff I can simply answer that I am more productive with a shell, R, ggplot2, and Python than with Excel, just because I know these better. This is my stack, along with Matlab and Mathematica, at the bank where I work now, both used basically at the CLI level. The majority of people in my division use Excel.

I also think of things in a very Unix way in my use of computers, largely because of 15 years of using Unix-like systems. It's the same when I need to code more than a couple of lines and do not have vim; my brain has become too used to it. I believe this will be the case for many people on HN.


Well, it's not romantic if it's actually practical. Bottom line, I think, is if you really get the shell, it is almost infinitely useful.


While I can't speak for other texts, UTP is my, and probably the, go-to book for learning troff. I owe the typesetting of my ebook to it. Since it occupies that niche area, it is still quite relevant and useful today.


somewhat curious... What was your book on?


Scanning his submissions, I think http://www.dreamincode.net/forums/blog/48/entry-4298-lessons... is relevant.


Ralph, thank you very very much for going out of your way to be helpful in this entire thread. Very much appreciated. :-)


It might be that at any given point in time, there are a lot of people who want or need to learn Unix. I already know Python, but I think there is value in learning the fundamentals of the Unix shell programs. I spent the first five years of my career on Windows, am now on Linux, and this book looks like a useful resource.


A very nice book available for free. It makes me very happy that I can learn new ways to do my work smarter.

One of the best-known advantages of Unix text processing tools is that, as long as you can reduce your problem to text, there exist very powerful, succinct, and quick solutions to even some very difficult problems.

Well, it takes some time to get a grip on how to work with Unix text processing utilities and tools like Perl. Once you are up to speed, you see how much work you can do so quickly with so little effort. In fact, the more you get into it, the more you realize how much useless code you have written over the years, when all you needed was a command with a few options.


So much of this is still useful and relevant. Kind of amazing how much the web relies upon these ancient (by our standards) tools. UNIX is one of the most incredible technology stories ever.


This book is an absolute treasure; even with its age, it is still one of the best descriptions of troff/nroff.

One thing that I do find interesting is that the book (on text processing, no less) is 680 pages x 30 lines x 80 columns (plus some minimal line art), which in theory is around 1.6 megabytes of data, yet weighs in at 28 MB in this PDF.

Regardless of the irony, it's a great book into which the authors clearly put blood, sweat, and tears.


26.7 MB is because it's a scanned book with a text layer on top. If it were retyped in, let's say, LaTeX, and output to PDF, it wouldn't take more than one tenth of the current size. Remember that scanning the book also serves an archival purpose.


The book has been re-typed, as troff, just like the original. See my comment elsewhere on this post.


No reason to downvote me (whoever it was). The PDF

http://oreilly.com/openbook/utp/UnixTextProcessing.pdf

has _scanned_ pages with text layer on top of it. That's the reason why it's so big.

I applaud the retyping effort, as it's always better to preserve the real content than images of it (even if OCRed), but I cannot say that I'm happy about troff being used for this purpose. It can be a matter of taste, but I don't like the way formatting is done in troff/nroff. That's why I never use it directly (e.g. I use ronn to convert markdown text to a man page, etc.).

But I understand it's done that way to preserve "the creation process" too, which is also appreciated. And the book is about troff/nroff, so dogfooding is present. ;)


I was even writing good old roff a few days ago. So this book is not even out of date.


>> I was even writing good old roff a few days ago

Just curious, what was it for? :)

PS: I had a groff based CV (2006-ish). Now it is a LaTeX based CV.


I still regularly use groff; letters, invoices, ad hoc formatting of one-off tasks. tbl(1) is nice and the overall speed makes it handy for producing PDF in the back-end.


I was going to ask why one would prefer groff to some TeX variant. Speed is a pretty good reason; I'm impressed at how slow TeX is even on modern hardware. I usually generate HTML unless I want something beautiful, and then I usually use ConTeXt. I'm pretty fond of classic Unix stuff so maybe I'll delve more into groff.


I've read _The TeXbook_ and other material but just find the TeX mark-up so noisy to parse compared with the troff style of `.cmd' at the start of the line (cmd is often short), with the odd bit of \s+2 embedded within the line, depending on personal preference.

troff and friends were developed on Unix in its early days by the originators of Unix and it shows in what a good fit they are to the environment and in their elegance; they are Unix programs. TeX was born outside of Unix; it runs on Unix.


Feel like answering a couple more questions? :) The kinds of questions I have are not "good" questions for S.O. I'm tempted to try redoing a project of mine from ConTeXt in troff just to see what it's like. Do you recommend any particular macro package? Without any other input I'd probably try the -me macros, just because they've been compared to Pascal (versus the FORTRAN of -ms and the PL/1 of -mm).

Also, do you find yourself writing your own macros much? I haven't the faintest idea what that would require with troff, but I rely on this with TeX quite a bit, mostly to elevate stylistic markup into semantic markup. My impression is that if you want semantic macros you use a macro package, and even then you probably freely intersperse non-semantic macros.

This project I've been working on for a while, uses LuaTeX so that I can connect to a local database, perform some queries and typeset them and their output. I imagine this kind of thing would not be difficult to do directly with a custom pipeline step using troff. Have you done that kind of thing before? If so, how unpleasant was it?

Thanks for talking with me about this.


You'd do well to ask these on the https://lists.gnu.org/mailman/listinfo/groff list for a wider set of opinions. It depends on the style and complexity of the document. -ms is simple enough that people like W. Richard Stevens would tweak it for their books. I understand the relative newcomer, -mom, is comprehensive, modern, and well supported by its author on the above list. I don't recognise the Pascal, etc., analogies. :-)

I do write my own macros. They can be just short-hands for a combination of others in the same way my ~/bin/l is exec ls -l "$@", or sometimes for a simple document I start with just troff and have some macros on top of that. Yes, any distinction over semantics is purely convention.

You may wish to read Kernighan's _Nroff/Troff User's Manual_, http://troff.org/54.pdf, otherwise known as CSTR #54. It's original troff, not groff, but as a succinct reference with elegant prose we often refer back to it. At the end is a tutorial introducing simple macros.

Integrating troff and friends in pipelines and scripts is easy. They take line-based text as input and produce it as output, only switching to binary for some output formats at the last hop. You can also run system(3) from within troff documents, e.g. to include the output of a command, but often that's not the easiest fit.

I recommend again the groff@gnu.org list; they're friendly, patient with newcomers, and interested in showing how they tackle the task at hand.


As it happens, the analogy is from the Unix Text Processing book, page 97: "Mark Horton writes us: I think of ms as the FORTRAN of nroff, mm as the PL/I, and me as the Pascal."

Thank you for taking the time to answer these questions! I will plough through some of this documentation and make my way over to the list.


+1 on speed and memory. Some table- and figure-heavy documents I've done have had internal memory overruns in LaTeX.

A lot of my forays into programming started with the Bell Labs books (with troff | pic | eqn ...) prominently on the copyright page. So, it was one of the first things I looked up when I got access to a UNIX terminal. Then I discovered that all the UNIX books by Stevens were done up with troff/groff. From there, it was all "steadily downhill" for me ;-)

P.S: If you already know (La/Con)TeX then groff is literally a walk in the park. And it is always good to know more than one way to do things imho. Good luck with it.


My knowledge of ConTeXt is pretty limited, but unlike LaTeX you don't need to be an expert to get good looking results. I find ConTeXt a lot more intuitive.

I've got a book on TeX coming soon because I want to get better at ConTeXt. ConTeXt is not shy about telling you that for better results you need to understand TeX and use it appropriately. I find it kind of absurd that I've been using TeX for so many years without actually understanding it.

I have been curious about roff since trying to use Plan 9, but in a pretty absent-minded way. They're quite unashamed of providing roff at the expense of TeX, and I believe all of its documentation is roff-formatted, including the technical reports, but I could be mistaken about that.


+1 for reminding me of tbl. :-)

EDIT: Wanted to say, tbl nailed it for me that whatever conveniences WYSIWYG formatting might offer, it is never a good substitute for typesetting (a strong opinion to this day -- I still prefer typesetting for documents that need to 'travel')

And yes, the letter(heads) came out rather nice too.


Would you use it for a 200-page product manual?


Yes, sure I would. It doesn't sound like a glossy DTP kind of document, and the document size wouldn't be a problem. Ask on the groff list with more of a description of the kind of content and what you might be worried about. Books like _The C Programming Language_, W. Richard Stevens' _Unix Network Programming_, and others are examples of what can be done. http://troff.org/pubs.html


Mind blown dude! Whole dictionaries use troff!


I have used LaTeX for 200+ page documents (various) in the past, so to your question, yes I would.

Handing it over to someone else to maintain is a different thing though. Products like InDesign or PageMaker or Word understand that this is what eventually happens in reality. So, the not-so-steep learning curves that WYSIWYG tools offer have won out in the end.

If the manual is written by techies, and maintained in-house like all those old Bell Labs manuals that used troff, then markup based typesetting is not such a pain, imho.


Thanks - I know LaTeX is used; I was asking him about troff :)


I was also using it in 2006 for my first CV and cover letter. This was at the beginning of my professional career. Now I am using MS Word. How my ideals have deteriorated …


:) Ha ha...

My first CV was in (gasp!) MS-Word. ;-)

Then I did a groff one for larks, and was a bit surprised at how well it was received (Wow! how "professional" it looks), and can they please have the "original doc file" for their own modifications. Sent them the source/text file only to receive the rather "sailory" emails that came back. :-D


I went through a period of having my CV in XML and using XSL to format it.

Then I recovered.

Sometimes it's better if ideals are replaced with pragmatism - I achieve a lot more than I used to.


Question for the groff experts out there:

XeLaTeX has finally moved LaTeX into the realm of directly embedding virtually any TrueType font into the final document. Literally 5-6 lines of LaTeX commands and you are done! I was amazed when I first did it. It was so convenient, when compared to messing with PFBs, then AFMs, and then... you get the idea, no?

Is there a similar groff mechanism to pull TTF/OTF fonts into the final PS/PDF document with similarly minimal effort? If so, could someone be kind enough to point me to a resource? Thanks.

P.S: I was searching, but could not home in on the right keywords to drive me to an answer.


I think using TrueType fonts with groff is fairly painless; the topic comes up on the groff@gnu.org list now and again but I don't pay much attention. The list is low volume and very friendly, with some real old Unix hands on there; feel free to ask, and say I sent you if you like. https://lists.gnu.org/mailman/listinfo/groff

Gunnar's Heirloom troff has TrueType support. "troff can access PostScript Type 1, OpenType, and TrueType fonts directly, that is, it can read font metrics from AFM, OpenType, or TrueType files, and can instruct its dpost post-processor to include glyph data from PFB, PFA, OpenType, and TrueType files into the output it generates". http://heirloom.sourceforge.net/doctools.html


Thanks. I will check both the list and heirloom. Hadn't heard of heirloom before. So, thanks again.


roff: still the normal way to generate an IETF Internet Standards Draft text document. This mechanism is showing its age though, especially regarding UTF-8 and width limitations.


I don't know about other systems, but on Plan 9, where UTF-8 was invented, troff handles it just fine:

http://plan9.bell-labs.com/magic/man2html/1/troff


groff does it with the help of its preconv(1), also invoked with groff's -k option.

    $ printf 'Hello ①②③\n' | preconv
    .lf 1 -
    Hello \[u2460]\[u2461]\[u2462]
    $ printf 'Hello ①②③\n' | groff -k -Tutf8 | grep .
    Hello ①②③
    $


Yeah, but UTF8's not acceptable to the IETF Secretariat.


Is there any advantage (speed?) in using something like Awk over something like Python?


Open up a terminal on Linux, OSX, or heck even Cygwin or MinGW on Windows. I'm willing to bet that awk is already pre-installed, even on a base or fresh install. Awk, like grep, sed and all the rest are pretty standard and very powerful. Not knocking Python, shell programming is just another paradigm, and the tools are pretty universally available.


Awk is part of the POSIX specification, so yes, it's going to be present on pretty much any Unix-like environment. Even those which don't aim at POSIX compliance will almost certainly have an awk interpreter. Busybox includes one, which means that many minimal / embedded systems will include awk by default (as they use busybox to provide core utilities).


It's awkward to embed Python in a pipeline, and awk can be quicker to write if the problem suits its domain. Execution speed may be slower than Python; it depends on the work needed and the awk used: gawk, mawk, Kernighan's One True awk, ...

    awk 'NR > 3 && !/foo/ {s += $(NF - 2)} END {print NR, s + 0}'




