Show HN: Pypipe – A Python command-line tool for pipeline processing (github.com/bugen)
213 points by bugen on Oct 23, 2023 | 45 comments
pypipe is a command-line tool for writing data pipelines in Python. When working with data processing in the terminal, I often find myself wanting to pass the output of commands to Python for further processing. In such cases, one can either write one-liners or create regular Python scripts and connect them through pipes. However, using pypipe makes this process more convenient and efficient.


  $ echo "pypipe" | ppp "line[::2]"
  ppp
This is an incredibly confusing first example!


It's pretty clear if you're intimately familiar with Python's slice syntax, but too clever by half otherwise. I've been coding Python since 2000 and can count on one hand the number of times I've used the step parameter in a slice.


90% of the time I remember running into it, it was just used to reverse a copy of the list. (i.e. 'a = l[::-1]')


Oh - no, I know that one (-: It's that the command ppp is also the output.


It's pretty clear if you're intimately familiar with the Unix command line, but too clever by half otherwise. I've been using Unix since 1994 and can count on zero fingers the number of times I've needed to slice input through a pipe like this. :-)

Also, just to make sure we're on the same page:

The command `ppp` (found somewhere in PATH, unless it's a shell function or an alias, but we believe it's a symlink to `pypipe.py`) receives `pypipe\n` on stdin, sets a variable named `line` to `pypipe`, uses the slice notation to convert that to `ppp` and then prints `ppp\n` to stdout.
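To make that walkthrough concrete, here is a minimal sketch of the same steps in plain Python. The function name `process` is made up for illustration, and this is not pypipe's actual code, just the behaviour described above:

```python
def process(raw: str) -> str:
    """Mimic what the comment describes for a single input line."""
    line = raw.rstrip("\n")   # "pypipe\n" -> "pypipe"
    result = line[::2]        # step of 2: characters at indexes 0, 2, 4
    return result + "\n"      # what ends up on stdout


print(process("pypipe\n"), end="")  # prints: ppp

# The reversal idiom mentioned earlier in the thread uses a step of -1.
print("pypipe"[::-1])  # prints: epipyp
```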


I think it's to show you where the name comes from. :)


It gets worse. I have to look up numpy comma/tuple slicing syntax every time I use it.


A bit unrelated, but one thing I absolutely love is the fact you can install it by copying a single file to a folder in your PATH. I have been trying to follow this approach for my Python scripts (standard library only, everything in one file) and I really enjoy the experience. Most of the features I need only require Python 3.8, and Ubuntu comes with Python pre-installed, so

  rsync
  chmod +x
"just works".


You can use shiv to make any script with deps a single file


Yes, I've used shiv and PEX in the past, and I love those. But that means "adding a build step". Also, as far as I understand, by default, those tools normally decompress the generated zip file on startup, but the decompressed artefacts have to be manually cleaned.

I think they are worth it for more complex apps (I actually have a draft blog post with some experiments I did using PEX + gunicorn), but for my use case, it's not worth the effort when I only need the standard library.


You can try https://github.com/jaraco/pip-run or https://github.com/PyAr/fades (or eventually pipx: https://github.com/pypa/pipx/issues/913). They don't require a build step.

For example,

  #! /usr/bin/env -S pip-run Jinja2==3.*
  
  from jinja2 import Environment
  
  env = Environment(autoescape=True)
  template = env.from_string("Hello, {{ name }}!")
  print(template.render(name="world"))
There is a downside. Because pip-run recreates the virtualenv every time, the script takes a second to start up. pipx will cache virtualenvs once the single-file script feature is released. I haven't used fades yet.

Edit: fades caches virtualenvs.

  #! /usr/bin/env fades

  from jinja2 import Environment  # fades Jinja2==3.*

  env = Environment(autoescape=True)
  template = env.from_string("Hello, {{ name }}!")
  print(template.render(name="world"))


If you do that, you may want to know about this magic to test code that's in a single file Python CLI program: https://linsomniac.com/post/2023-03-21-python_testing_a_cli_...

I like to set up at least some tests on my scripts so that I can reduce the number of times I push something out that is obviously broken. pre-commit can also help with preventing shipping things with syntax errors if you enable the "ast" check, which does a simple syntax check on the code.
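For anyone curious what a syntax-only check amounts to, here is a rough sketch using the standard library's `ast` module. The helper name `has_valid_syntax` is made up, and I'm assuming the pre-commit hook does something along these lines rather than quoting its source:

```python
import ast


def has_valid_syntax(source: str) -> bool:
    """Return True if `source` parses as Python, without executing it."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


print(has_valid_syntax("x = 1\n"))    # prints: True
print(has_valid_syntax("x = (1\n"))   # prints: False (unclosed parenthesis)
```

Note this only catches syntax errors. Something like a misspelled variable name still parses fine, which is why a few real tests are worth having on top.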


Thank you for all the comments and advice. I'm truly surprised by the response, it's beyond what I expected. It was here that I learned about other projects similar to pypipe for the first time. After checking them out, I now understand that pypipe's strength is in its simplicity. I plan to improve pypipe while keeping it simple, so anyone can easily understand how it works by reading the source code and make their own customizations.


Cool!

My tool of choice for such things is awk, still, it's good to have more alternatives


I couldn't dream of using awk for json data (ubiquitous nowadays). Of course there is jq and others. As The Pragmatic Programmer puts it, we have to take care to curate and master our tools, like a woodworker does with theirs.


This is an awesome tool, I love cmd tools that make it easier to manipulate and work with tabular data. I work with a lot of tabular data, mainly in s3, and I put together "s3head" for easily streaming s3 data into stdout:

https://github.com/dbragdon1/s3head

and I'm gonna have a good time piping the output from s3head into pypipe.


Can't you just use:

  aws s3 cp s3://YOUR_FILE - |


This way still seems to give you a weird broken-pipe error that I was seeing when looking up how to do this with the aws cli. I just tested your method vs s3head and s3head seems slightly faster for some reason, don't have the time right now to dive into why though...


Nice project!

It gives me strong Perl one-liner vibes: the `perl` command combined with the `-p` and `-e` flags lets you write super concise programs for bash pipelines.

Some examples https://learnbyexample.github.io/learn_perl_oneliners/one-li...


I’ve had dreams about making this sort of tool. I’m so thrilled to see this!!!!


Nice! My go-to for system scripting is bash that calls python for the things that just suck doing in bash. I didn't see a method to do it, but it would be great if this could clean up bash/python interop by giving an ergonomic interface to define custom python functions and call them.

Also since you really want to think of this as an extension of coreutils it would be great to offer this as a brew/apt package even if it's this simple. I just want to add it to my system package list and be able to depend on the command.


In the same spirit as the nushell project https://www.nushell.sh/


"To make it easier to type, it's recommended to create a symbolic link. ln -s pypipe.py ppp" Glad to see you didn't name it ppp.py


Pretty cool! I feel that this would be extremely helpful for me since at times I struggle remembering the incantations for xargs, awk, ... .


Why didn't I think of this? Very cool.


Cool!

I was going to ask how this differs in broad strokes from pz, but when I went to get the reference link found that pz hasn't been updated in two years, so that's one big difference.

https://github.com/CZ-NIC/pz


I was looking for something like this, will definitely try! I was always envious of perl being able to be easily incorporated into shell pipelines and wished python would support something like that.


I will check this out. There is also a similar tool here: https://github.com/zqqqqz2000/shshsh


This is awesome, great work bugen!

I've created a package in Wasmer [1] to showcase this tool (also, it will do the processing fully sandboxed thanks to Wasm!)... hope you all like it! (here's the PR [2])

  # Install Wasmer
  curl https://get.wasmer.io -sSfL | sh
  # Add ppp alias
  alias ppp="wasmer run syrusakbary/[email protected] -- "

And then, run it normally:

  $ cat staff.txt |ppp 'i, line.upper()'

[1] https://wasmer.io/

[2] https://github.com/bugen/pypipe/pull/2


wow that wasmer thing is _SLOW_.. we are talking about 57x slower! (granted most of this is likely startup delay). Here is a random benchmark with warmed-up cache:

    $ time cat /var/lib/dpkg/status | wasmer run syrusakbary/[email protected] -- 'i, line.upper()'  | wc -l
    39175
    real    0m5.761s
    user    1m15.071s
    sys     0m4.838s
vs regular python:

    $ time cat /var/lib/dpkg/status | python3 pypipe.py 'i, line.upper()'  | wc -l
    39175
    real    0m0.107s
    user    0m0.096s
    sys     0m0.026s
and the wasmer install procedure.. not a deb file in sight, adds itself to ~/.bashrc (of course...) and apparently requires two environment variables to even work.

Compare this to OP's instructions: (1) check out the repo (2) execute the file directly.

Not sure why anyone would want wasmer for simple command-line tools like these.


I have a feeling that in most use cases this is replacing grep and awk in a familiar way to Python programmers, especially the latter with its own grammar. Fun stuff!


Nice! How does it compare performance-wise with AWK?


A better analog would be perl -ne. It was only a matter of time before python got this.
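For readers who haven't used the Perl flags, the core idea is an implicit loop over input lines with your code run once per line. A rough Python sketch of that loop follows. The function name `each_line` and the variables `i` and `line` are choices made for this illustration, not a claim about how pypipe or perl implement it:

```python
def each_line(expression, lines):
    """Evaluate `expression` once per input line and yield each result.

    The expression sees two names: `i` (1-based line number, an assumption
    of this sketch) and `line` (the text without its trailing newline).
    """
    code = compile(expression, "<expr>", "eval")
    for i, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        yield eval(code, {"i": i, "line": line})


# Fixed sample data here; a real tool would pass sys.stdin instead.
for result in each_line("line.upper()", ["alice\n", "bob\n"]):
    print(result)
# prints:
# ALICE
# BOB
```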


This is great!

I've been making a lot of tools in this similar vein. I've been keeping them in my dotfiles.

I've got plt [0], a simple matplotlib templating language built with Python Lex Yacc for making quick plots from CSVs, e.g.,

  cat data.csv | plt '[a_version_count, b_version_count], date { plot 1px [solid blue, solid red] }' > plot.png
There's a plugin format so you can make extensions like bleep [1]:

  plt 'a_version_count, date { bleep blop blip green 10 } --py' > bleep_plotter.py
  cat data.csv | python3 bleep_plotter.py > bleep_plot.png
To create a plugin xyz, just call it "xyz_template.py" and put it in ~/dotfiles/plt. Outputs to Python code are optional but useful for minor adjustments.

(Does plt look familiar? Can you tell I just read the latest version of The Awk Programming language?)

Or I was reading The Unix Programming Environment (1982) and being inspired by the pick command, wired up electron to allow for STDIN/OUT/ARGV in the browser context, for what I'm calling elec [2]:

  elec textarea -x 300 -y 0 | elec pick -x 300 -y 600 | awk '{ print $0 " " $0 }'
Again, to create a plugin xyz, and in this case all elec commands are plugins, add "xyz.html" to ~/dotfiles/elec, as seen with the pick [3] plugin.

ANYWAYS, where I'm going with this: instead of,

  cat staff.xml | ppp custom -N xpath -O path='./Animal/Age'
How about?

  cat staff.xml | ppp xpath -O path='./Animal/Age'
Convention over configuration!

Again, this tool is great, it's already in my dotfiles and I've already used it at work this morning, so thank you!

[0] https://github.com/williamcotton/dotfiles/blob/master/bin/pl...

[1] https://github.com/williamcotton/dotfiles/blob/master/plt/bl...

[2] https://github.com/williamcotton/dotfiles/blob/master/bin/el...

[3] https://github.com/williamcotton/dotfiles/blob/master/elec/p...


Thank you for using pypipe!

> How about?
>
>   cat staff.xml | ppp xpath -O path='./Animal/Age'

I also wanted to allow custom commands like this, but I decided on the current format for a few reasons, including the ability to omit the default 'line' command from the arguments. For frequently used commands, please consider setting up aliases in your configuration files (e.g. ~/.profile).

  alias xpath='ppp custom -N xpath'


Reminds me of awk from long ago.


Looks fun! Like a nicer AWK


This looks interesting!

A tool in the same vein that I already use is pyp (`pypyp` on PyPI). This project, pypipe, has built-in record splitting and CSV support that pyp doesn't. CSV is lower-friction. pyp automatically determines what modules to import. It is very convenient and would be nice to have here. A version of `pypprint` (https://github.com/hauntsaninja/pyp/blob/9408446a41bfdc60e44...) may be useful, too.

The most famous Python command-line tool like this is probably Mario (https://github.com/python-mario/mario), which isn't maintained. pyp's readme compares Mario and some other alternatives: https://github.com/hauntsaninja/pyp/blob/9408446a41bfdc60e44.... pypipe is different from most for also having a feature like `--explain` in pyp (code generation).
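To give a feel for the auto-import idea, here is one way it could be sketched with the standard library: parse the expression, look for names that are neither supplied nor builtins, and try importing them as modules. This is a guess at the general technique and not pyp's actual implementation, and `eval_with_auto_import` is a made-up name:

```python
import ast
import builtins
import importlib


def eval_with_auto_import(expression, namespace=None):
    """Evaluate `expression`, importing any unknown top-level names as modules."""
    ns = dict(namespace or {})
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        # Only names that are being read, e.g. the `math` in `math.sqrt(16)`.
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            name = node.id
            if name in ns or hasattr(builtins, name):
                continue
            try:
                ns[name] = importlib.import_module(name)
            except ImportError:
                pass  # not a module; eval will raise NameError if it matters
    return eval(compile(tree, "<expr>", "eval"), ns)


print(eval_with_auto_import("math.sqrt(16)"))                    # prints: 4.0
print(eval_with_auto_import("json.dumps(line)", {"line": "a"}))  # prints: "a"
```

A sketch like this only handles plain `module.attr` usage. Submodules such as `os.path` happen to work because `os` imports them, but something like `xml.etree` would need extra handling.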


It's incredible to discover so many projects similar to pypipe! In particular, pyp seems truly magical. If I had known about pyp before, I might not have developed pypipe. However, the lack of prior knowledge about pyp has given pypipe its own unique identity. Learning from these earlier projects, I can see there's still room for growth in pypipe. Thank you for the valuable insights, especially regarding the potential implementation of the auto-import feature.


You're welcome!

> However, the lack of prior knowledge about pyp has given pypipe its own unique identity.

This is why I think it is not always bad to reinvent things. Reinventions can have their own advantages, often through a somewhat different focus. There can be value in creating (a prototype of) your own solution before you see how others have solved your problem. (Or even where others draw the line around your problem, which may be different from where you do and override it.) Reinventing and sharing something is also a reliable way to learn about prior art. :-)

However, you usually don't want to be the one doing the reinventing. :-) It is not the best use of one's time and resources. A better starting point is to know the state of the art well and have some disagreements with it.

Let me share a personal cautionary tale. It doesn't really apply to pypipe but may still be useful or interesting. I once took it too far with static site generators. In order to make it a learning experience, I deliberately avoided studying existing ones in depth. I wrote mine in a niche programming language with what I thought was a fresh perspective. (It kind of was. A big part of it was leaning heavily on an in-memory SQLite database.) I repeated the mistakes of early content management systems, like heavy indirection and too much code in templates. Soon my static site generator had users! And I was just realizing how flawed it was. Oh no. The users were few, but they used the generator for real things, like an event. Some knew about my wheel-reinventing approach and weren't dissuaded by it.

I found myself stuck with a subpar design to polish for the 1.0 release. When it was finished, I had to do a fairly difficult and only partial reworking for version 2.0. To encourage moving to the new version and not frustrate those with complex projects, I wrote a migration tool. The most serious projects based on the generator stayed on version 1.x.

So, I advise against going this far. It is probably best to do the research and learn about the state of the art first. I try to do it now.

Again, this is much less of a concern with pypipe. It doesn't create the same kind of user lock-in as a static site generator. At most, you will have to ask your users to upgrade their shell scripts when something changes.


https://github.com/alecthomas/pawk is also great, & auto imports any module that you use


pawk is awesome. What concise code! I absolutely adore the simplicity of pawk.


This is fun! I was curious about pulling something like this off in Golang, and whipped together a dirty PoC using Traefik's yaegi here [0]

[0] https://gist.github.com/leonjza/9d53b30a6b85ff837a27170a185a...


There is already a - much more powerful - tool for this: NimbleText. It has a UI, a CLI interface and even a web version. Also, check out NimbleSet as well.


Are you sure? NimbleText looks like a tool for text manipulation and code generation, Pypipe pipelines between python programs or commands.



