nialv7 3 days ago

Should've called it Oreutils.

teo_zero 2 days ago

I like standards and abhor bloat, but I must admit there are GNU extensions so useful and so well known that it would be difficult to do without them. This probably happens when the POSIX specs are too strict or feature-poor to be of use even for medium-complexity tasks.

One example is "make": I'm afraid that a POSIX-only implementation wouldn't run most Makefiles out there!
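
For instance, here is a made-up Makefile fragment (file names invented, GNU-only bits flagged in the comments) wrapped in a small shell snippet; GNU make runs it happily, while a strictly POSIX make chokes on the conditionals:

  cat > Makefile.gnu <<'EOF'
  SRCS = $(wildcard *.c)      # $(wildcard) is a GNU make function, not POSIX
  ifeq ($(DEBUG),1)           # ifeq/else/endif conditionals are GNU-only
  CFLAGS = -O0 -g
  else
  CFLAGS = -O2
  endif
  $(info would build $(SRCS) with $(CFLAGS))
  all: ;
  EOF
  make -f Makefile.gnu        # fine with GNU make; a POSIX-only make errors out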

  • marcus0x62 2 days ago

    They acknowledge this. From the project's README.md:

    > Popular GNU options will be supported by virtue of the "don't break scripts" rule. Unpopular options will not be implemented, to prevent bloat.

rybosome 2 days ago

I’d love to see what performance benchmarks look like. The old ones were highly optimized, but perhaps for different challenges than today’s architectures present.

  • simonask 2 days ago

    Would definitely be interesting, but from a cursory look at the repository, it doesn't look like squeezing the last percentage points of performance has been a priority yet.

    Things that stand out:

    - The `awk` implementation uses the Pest parser generator (https://pest.rs/), which is known to not generate the fastest possible parsers, but is great for getting up and running.

    - They are using the `clap` crate for argument parsing, which is also known to not be the fastest, but again is very user friendly (for example, it does Unicode linebreaks in the output of `--help`). It's marginal, but for a tiny utility being invoked many times from a shell script, this can add up.

    It's very probably "fast enough", and it makes sense to prioritize like this at this point, but people shouldn't use this expecting a performance improvement right now.

    • jgarzik 2 days ago

      I wouldn't assume. The awk implementation likely stands up well for a 1-billion-row challenge, with its thoughtful bytecode-based design.

      Redditors ran some quick performance tests on parsing, also: https://www.reddit.com/r/rust/comments/1fd7qgl/comment/lmelo...

      • dundarious 2 days ago

        Parsing the awk input? While not an irrelevant concern, that is obviously on the low priority end when considering awk performance in general. My most often used awk program is just '{print $1}', but I use it on enormous files. The performance when operating on the enormous file is the concern wrt performance, not the initial parse of '{print $1}' or of command line arguments.

        I know you're just directly responding to the concerns of a parent comment though.

        • jgarzik 2 days ago

          The hope is that posixutils awk's modern bytecode-based design keeps performance high, theoretically higher than the ancient C-based awks.

          Inspired by Ray Gardner's "wak" awk implementation https://github.com/raygard/wak

          A volunteer benchmarking our awk on a 1-billion line text file would be welcome.
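
          Something along these lines would be a reasonable starting point (binary names are placeholders; substitute whatever the posixutils awk is actually installed as):

            seq 1000000000 > big.txt                         # ~1e9 lines, roughly 10 GB on disk
            time gawk   '{ print $1 }' big.txt > /dev/null   # GNU awk baseline
            time mawk   '{ print $1 }' big.txt > /dev/null   # another common baseline
            time pu-awk '{ print $1 }' big.txt > /dev/null   # the Rust awk under test (name assumed)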

          • dundarious 2 days ago

            Yes, that's the most important part to focus on. I have nothing to remark on there, as I haven't seen or gathered any numbers on it.

    • littlestymaar 2 days ago

      Yeah, and I don't think performance matters that much for these utilities (and AFAIK many of the originals haven't been particularly optimized for performance anyway).

      • dundarious 2 days ago

        Check the list: performance matters a great deal for many of these utilities, and the GNU project versions are often pretty well optimized (often best in class among the POSIX-compliant implementations that ship with an OS/distribution).

        I do not want a "performance will be looked at later" version of m4, awk, grep, cp, find, diff, sort, uniq, etc. in my personal or dev environments right now. I can understand using a not-yet-optimized version in a memory-safe language to compete with OpenBSD, but not for me right now.

        • littlestymaar a day ago

          You are already using “performance will be looked at later” versions of grep and find, as they are much slower than alternative implementations (ripgrep and fd are way faster), and these are the ones in your list for which perf is the most sensitive…

          The fact that performance doesn't really matter for these tools in the general case is the reason why most people still stick with those slow POSIX-compliant tools when much faster alternatives exist. (POSIX compliance only matters when a tool is used in existing scripts, not at all when it's used on the command line or in new scripts.)

          • burntsushi a day ago

            > You are already using “performance will be looked at later” versions of grep

            Are you referring to GNU grep? Because if so, this is wildly untrue. GNU grep has been very significantly optimized. There are huge piles of code, including an entire regex engine, in GNU grep specifically for running searches more quickly.

            Maybe something like busybox's or BSD's grep would satisfy "performance will be looked at later," but certainly not GNU grep. I can't tell which one you're referring to, but a generic "you are already using" seems to at least include GNU grep.

            • littlestymaar a day ago

              What I mean is not really that grep hasn't been optimized, but that by today's standards it can look as if it hadn't been: it's now far from the state of the art from most perspectives (in part because of you).

              And because there exist alternatives that are much faster than grep in most workloads without really endangering grep's prominence, it shows that performance isn't really a criterion in its use nowadays (computers are fast).

              My point as a whole is that as long as the implementation isn't catastrophically slow, any port is good even if it's slower than GNU's, because the set of workloads that both care about max performance and must be 100% compatible with the original is small enough.

              • burntsushi 21 hours ago

                I didn't really want to engage with your broader point. I do disagree with it, but only as a matter of degree, and that's hard to debate because it's perspective and opinion oriented. If performance wasn't as important as you seem to imply, it's very unlikely that ripgrep would have gotten popular, given that performance (and filtering) are its two most critical features. That is, that's what I hear from users the most: they like ripgrep because it's fast, or because it filters, or both.

                But on the smaller point, it is just factually untrue that GNU grep falls into the "performance will be looked at later" bucket. That was why I responded.

                This post is a classic about why GNU grep is fast, demonstrating that GNU grep definitely doesn't look like it wasn't optimized: https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

                And then before that, there's this: https://ridiculousfish.com/blog/posts/old-age-and-treachery....

                • littlestymaar 21 hours ago

                  > If performance wasn't as important as you seem to imply, it's very unlikely that ripgrep would have gotten popular, given that performance (and filtering) are its two most critical features.

                  I'm not saying performance isn't a feature, but that if someone wants performance they shouldn't use grep; they should use ripgrep instead.

                  There was a need for a high-performance grep replacement, and you filled it perfectly. But now that ripgrep exists, I don't think it makes much sense for a perfect grep clone to seek max performance, since it will definitely fall behind ripgrep no matter how hard its authors try (at least because they can't ignore files or restrict the kind of regular expressions they support by default).

                  • burntsushi 20 hours ago

                    I don't really share that philosophy personally. But I do think there is a balancing act. There's a big gulf between "not catastrophically slow" and "state of the art." :-)

          • kazinator a day ago

            POSIX compliance totally matters for interactive use, too, if POSIX is what you know.

            That's like saying English only matters if you're revising some existing paper written in English, but if you're writing something new or conversing with somebody face to face, then English doesn't matter at all.

            Sure, unless English is your most proficient, or perhaps only, language.

            The GNU project didn't have to implement Unix compatible utilities. They were starting with a blank slate and could have implemented anything. Why did they choose the compatible route? Because it matters. People were able to replace proprietary Unix programs with GNU programs without disrupting their workflows or having to learn anything (unless they were curious about GNU extensions).

            • burntsushi a day ago

              > POSIX compliance totally matters for interactive use, too, if POSIX is what you know.

              When you use grep on the CLI, do you specifically limit yourself to options specified by POSIX? How do you even know if you do or not?

              The `-r/--recursive` flag, for example, isn't even part of POSIX grep. Do you ever use it? What about `-w/--word-regexp`? Not in POSIX either. `-P/--perl-regexp` isn't. Neither is `-a/--text`.
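
              Concretely, these are everyday invocations (file names made up) that quietly rely on those flags rather than on anything POSIX guarantees:

                grep -r 'TODO' src/            # recursive search through a directory tree
                grep -w 'main' *.c             # match 'main' only as a whole word
                grep -P '\d+\.\d+' build.log   # Perl-compatible regex syntax
                grep -a 'panic' core.bin       # treat a binary file as if it were text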

              Maybe what grep is, is actually more than what POSIX says it is.

              • dundarious 14 hours ago

                You're absolutely right on the strict point of POSIX compliance. I think there's a point to be made, though, about the stability of the functional/CLI interface, aside from what's technically in the POSIX standards. ripgrep is pretty damn good on interface stability AFAIK, but the other tools that similarly don't try to position themselves as POSIX+/POSIXish compatible are a mixed bag. It's nice to just have the stable tools you know, and to have them practically everywhere (if you include WSL/msys2/Git for Windows, etc.). And similarly, you pretty much know what POSIX+ features you have on a GNU system (most Linuxes) or a BSD one (macOS, OpenBSD, etc.).

                I like having an excellent set of system tools from GNU/BSD, etc.; then I can install/use the SotA stuff -- and I'll still end up using both sets of tools all the time, even though one set is not the absolute best in class, because I don't have to worry about how to use sota-tool 1.2 on one system vs sota-tool 2.1 on another, when there may be important interface changes.

                And to provide the full context going back to my first comment: I won't consider using "performance will be looked at later" tools whose purported benefit is just the use of Rust, pretty much at all.

                • burntsushi an hour ago

                  For me, I just find it very dubious when folks point to POSIX as a point of stability. The reality is that POSIX is so barebones in a number of respects that most "POSIX tooling" actually has more features than what POSIX specifies. And sometimes the behavior of those features differs across implementations, with sed's `-i` flag being an infamous example. As I pointed out, the main problem is that, other than POSIX experts who have memorized the spec, most folks have no idea when they're crossing the line between "POSIX specified" and "extra feature."
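
                  The usual illustration (file name made up): the same in-place edit has to be spelled differently for GNU sed and BSD/macOS sed, because `-i` is an extension in the first place:

                    sed -i 's/foo/bar/' notes.txt      # GNU sed: in-place edit, backup suffix optional
                    sed -i '' 's/foo/bar/' notes.txt   # BSD/macOS sed: -i requires a (possibly empty) suffix argument
                    sed -i.bak 's/foo/bar/' notes.txt  # commonly cited as working with both, at the cost of notes.txt.bak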

                  I agree that stability is really the main important bit here. That is, is that script I wrote using your tool 5 years ago still going to work the same way today? And that is exactly my point. This discussion is totally different if we focus on what actually matters (stability) instead of some flawed idealistic approximation of it (POSIX) that only actually exists in theory. Because even the most spartan of tools, like busybox, implement features beyond what POSIX requires. And people use those features not because they are in POSIX, but because they are, in practice, in the real world, stable.

                  In other words, folks harping on POSIX as an end have confused it with what it really is: a means to an end.

            • littlestymaar a day ago

              > POSIX compliance totally matters for interactive use, too, if POSIX is what you know.

              Nobody knows the entire spec; implementing it in full only matters for compatibility with existing scripts, which may use any feature. For interactive use, keeping the same interface for the most common use cases is enough.

              Switching to non-POSIX tools is well worth the moderate effort if performance matters, but that's my point: performance doesn't really matter for these tools. Anything that's not orders of magnitude slower is good enough.

wmf 2 days ago

I didn't realize The Open Group still exists and is updating POSIX.

  • jgarzik 2 days ago

    Updated in 2024, no less! (2024 version still has UUCP though, heh)

    • 7bit 2 days ago

      I always wondered: is there an actual POSIX document or spec out there? The last time I researched it, it seemed that POSIX is a number of documents behind a massive paywall. Yet so many people seem to know what is and isn't POSIX compliant that it just seems unlikely that POSIX is locked behind a paywall.