For Loops and Space in Names, take 2

Thu Dec 11 06:27:05 UTC 2008

On 10Dec2008 20:52, Robert Wuest <rwuestfc at wuest.org> wrote:
| On Thu, 2008-12-11 at 10:02 +1100, Cameron Simpson wrote:
| > Because the $1 is unquoted here you probably don't need the "+" in the
| > regexps. But you probably should quote $1 because of this:
| > 
| >   [/Users/cameron]fleet*> x='t*'
| >   [/Users/cameron]fleet*> echo "$x"
| >   t*
| >   [/Users/cameron]fleet*> echo $x
| >   test.txt them tmp
| 
| Done.  But I do need the + in this one: "s/[_]+/_/g"

Yes. I was thinking of spaces actually, and had not paid enough
attention to your other chars. But just so I can nit, you don't need the
[] there:-) But, for performance, notice that you have here:

  s/_+/_/g      # one or more underscores

which will also pointlessly replace "_" with "_". Maybe:

  s/__+/_/g     # two or more underscores!

In this instance, the performance is negligible essentially because you
are invoking sed once per filename. The compute cost to load an entire
executable, start it, have the executable itself parse your sed script
(which of course sed must do to run it) vastly outweighs the tiny cost
of replacing "_" with "_" a few times.

But in a wider context, like this:

  sed '.....' < really-enormous-data-file

then this kind of thing can matter.

| Also, need to reorder the expressions. I ran some tests and some thing
| like "a    +++,,    b" would translate to "a__b" rather than "a_b"

You could strip, translate then squash. Example lower down.

| Also, I now use set -f in the fixname function (to facilitate handling
| original names with a * in them), but I quoted it anyway. 
| > i.e. if $x is not quoted then glob characters in a filename will get expanded
| > (supposing you have a name like "t*", and some other files starting with "t").

That would work, provided you're not worried about spaces being
collapsed. Which regrettably you would be for the $1 part of the mv
command. Of course, if you've quoted things you're ok, but then you
don't need -f either.

| > 
| > |         sed -r \
| > |             -e "s/[ ]+/_/g" \
| > |             -e "s/[_]+/_/g" \
| > |             -e "s/'//g" \
| > |             -e "s/[+]//g" \
| > |             -e "s/,//g" \
| > |             -e "s/_-_/-/g" \
| > |             -e "s/\&/and/g"`;

You could go:

  sed -r "s/[',]+//g            # might be faster than s/[',]//g because of
                                # fewer replacement operations
                                # (bigger matches)
          s/[ _+][ _+]+/_/g     # bigger matches again; one pass instead
                                # of three
          s/_-_/_/g
          s/\&/and/g"

| > |     # only do rename if the name is changed
| > |     if [ \"$1\" != \"${newname}\" ]; then
| > 
| > You don't need to backslash the "s in this comparison.
| 
| Not sure why I did this. Hmmmm.

Probably leaked from the mv command you're printing.

| > 
| > |         echo mv \"$1\" \"${newname}\"
| > 
| > This is actually dangerous. By leaving the (expanded) $1 in double
| > quotes, it is subject to parameter and command substitution in the
| > shell that runs the commands you generate. For example, supposing I ship
| > you a file whole name contains:
| > 
| >   `rm -rf $HOME`
| > 
| > That _will_ get through your script, and _will_ run the rm command in
| > the subshell. In both the $1 and $newname parts.
| 
| I hadn't thought of that.  It definitaly does what you say if I make a
| file named `ls -l`
| 
| I want to remove ` anyway, now that I think about it, so added an s
| expression for that.  And $, *, and %.

Yah. My problem with that approach is that it requires exhaustive
knowledge of the shell. It is not a _defensive_ approach. A defensive
approach, instead of saying "these things are bad, remove them" says
"these things are good, remove everything else". It's a general
principle. For example, you might go:

  sed -r 's/\&/and/g            # get the special cases first
          s/[^a-zA-Z0-9]+/_/g   # clean up everything else
         '

Obviously the "good characters" list needs some work, but you see the
different approach?

| > But it gets better. Even though you remove all single quotes form
| > $newname, and are thus safe to emit:
| > 
| >   mv ..... '$newname'
| > 
| > the $1 is still unsafe, and cannot easily be made safe. (Actually,
| > bash's "printf %q" formatter may do the trick for you, and since you're
| > already in nonportable GNU sed land, you may as well step right in with
| > nonportable bash too:-)
| 
| I'm not sure what %q does.  It's not in the info or man documentation.
| I now use single quotes to avoid expansion.  That seems to work on a few
| test cases.

I gather (learnt this just the other day, since I'm a
bash-extensions-phobe because I want my scripts to be maximally
portable) that the %q format specifier for printf is a bash extension to
quote a string in a shell safe fashion. "Man bash" says:

  [...] and %q causes printf to output the corresponding argument in a format
  that can be reused as shell input [...]

So I can imagine using it in constructing "eval"able shell commands,
such as you might pass to "sh -c", "su -c" or "ssh". Eg:

  qfoo=`printf "%q" "$foo"`
  sh -c "mv $qfoo bah"

knowing that $qfoo is now "escaped" for use in the shell, and thus will
be an exact representation of "$foo" after the shell picks it up and
parses it. You could use it to generate your mv commands with echo
(although for that I would use printf again instead of echo, since echo
has a tendency to eat backslashes...)

My personal history for prepping a string for shell parsing has been to
replace all ' with '\'' and then to put ' on each end, thus:

  foo="john o'goody"            # what a human might write
  bah='john o'\''goody'         # what my escaper would generate

| > Personally, I would change the script mode. Instead of trying to emit
| > safe shell syntax for piping to "sh", I would instead give it two modes
| > of operation. Default is to recite commands and a -x option would make
| > it actually move files around.
| > 
| > Like this:
| > 
| >   setx()
| >   { ( set -x
| >       "$@"
| >     )
| >   }
| >   trace=echo
| > 
| >   # handle leading -x option
| >   if [ "x$1" = x-x ]
| >   then
| >     shift
| >     trace=setx
| >   fi
| > 
| > and then down where you emit the mv commands go:
| > 
| >   $trace mv "$1" "$newname"
| > 
| > which should be robust against weird names in $1 or $newname.
| 
| Done, or at least something similar
| 
| > 
| > BTW, have you worried about having mv move a "bad" filename onto an
| > already existing "good" name? A "mv -i" might avoid a little pain.
| 
| Done, with an if [ -a ....

Don't you mean -e?

[...snip...]
| > 
| > |     for i in *; do 
| > |         fixname "$i"
| > |     done
| > | else
| > |     while [ "z$1" != "z" ]; do
| > |         fixname "$1"
| > |         shift
| > |     done
| > 
| > This loop is better written:
| > 
| >   for i; do
| >     fixname "$i"
| >   done
| > 
| 
| That's cool.  I never knew this.  Thanks.  Done

You can also do this:

  [ $# -gt 0 ] || set -- *
  for i; do
    fixname "$i"
  done

and avoid the if/then i.e. only use only loop.

[...snip...]
| I wrote this 8 years ago and once it worked, I've not looked at it again
| until today.

Every so often I get to revisit my own old scripts and shudder at their
ineptness:-(

| #!/bin/bash
| # 
| # fixmp3names
| # Copyright (c) 2000, 2008 Robert Wuest
| # Permission is granted to use, modify and, distribute according
| # to the terms of the GPL
| #
| # With thanks to Cameron Simpson from the Fedora users list!
| #
| # This script fixes a lot of the anomalies in file names
| # usually weird stuff from mp3 files, but it's really generic 
| #
| # if called with no args it processes all files in the current directory
| # or works on the names passed on the command line
| # probably doesn't work across directories
| #
| # default just writes the mv commands to standard out
| # if you want it to actually do something, pipe the output to sh, as in:
| #   fixmp3names | sh
| # or use -x as:
| #   fixnames -x
| #
| # Version 2
| 
| IFS=$'\n'
| 
| 
| oper=f_echo
| 
| function f_mv()
| {   ( set -x
|     mv $1 $2
|     )
| }
| 
| function f_echo()
| {
|     echo mv \'$1\' \'$2\'
| }
| 
| function fixname() 
| {   ( set -f

I'm pretty sure -f prevents filename expansion (globbing), but not word
separation. So if $1 is "fred      nerk" and you go:

  ${oper} $1 ${newname}

it will call your function with 3 arguments: "fred" "nerk" and whatever
was in $newname. Watch:

  [/Users/cameron]fleet*> sh -c 'fred="fred    nerk"; set -f; echo $fred; echo "$fred"; set -- $fred; echo $#'
  fred nerk
  fred    nerk
  2

See how the unquoted $fred is still broken up into words, and echo gets
two arguments "fred" and "nerk", and writes them out with a single
space. Versus getting one argument from "$fred" and preserving the
spaces.

So you still need

|     newname=`echo "$1" | \

You might try:

  printf "%s\n" "$1"

instead of using echo. Make yourself a file called 'fred\c'. Echo can
(platform dependent, and bash _does_ behave this way) interpret escape
sequences.

Try going:

  echo 'fred\c' | od -c

You may get F R E D \ C \n (well, lower case; upper case for clarity here).
Or you may get F R E D (specificly, no trailing newline - handy for
prompting).

Other echoes don't do the trailing \c thing, but _do_ accept
a leading "-n" option for the same purpose. It's all very unreliable:-(
You can thank divergent development (AT&T SysV vs UCB BSD respectively).

Anyway, the printf incantation above avoids this issue.

|         sed -r \
|             -e "s/\\\\$//g" \
|             -e "s/\\o140//g" \
|             -e "s/,//g" \
|             -e "s/[+]//g" \
|             -e "s/'//g" \
|             -e "s/%//g" \
|             -e "s/\\*//g" \
|             -e "s/[ ]/_/g" \
|             -e "s/[_]+/_/g" \
|             -e "s/_-_/-/g" \
|             -e "s/\&/and/g"`;

See my earlier comments about a defensive approach to this.

|     # only do rename if the name is changed
|     if [ $1 != ${newname} ]; then

You need to quote here. The test (aka "[") command expects its arguments
to be distinct strings. So:

  if [ "$1" = "$newname" ]; then

|         # leave, with message, if destination name exists
|         if [ -a ${newname} ]; then

-e, surely? Have you tested this bit by hand?

|             echo "# '${newname}' exists, skipping '$1'"

If I'm writing a program I like errors to go to stderr, and to prefix
all error messages with the program name (very useful for reading
intermixed output later). So I'd be going:

  echo "$cmd: $newname exists, skipping $1" >&2

and set:

  cmd=`basename "$0"`

at the top of the script for this purpose.

|             return

It's usually worth treating shell functions like command, and arranging
to return a non-zero status on error. So I'd be going:

  return 1

here. It has no effect in your script as written, but it's useful in
other contexts.

|         fi
|         ${oper} $1 ${newname}
|     else
|         echo "# '$1' and '${newname}' are the same file"

Might be better to say "name unchanged" or something.

|     fi
|     )

Oh! This is in a subshell. You want "exit", not "return" then.

[...]
| if [ $# = 0 ]; then
|     for i in *; do 
|         fixname "$i"
|     done
| else
|     for i; do
|         fixname "$i"
|     done
| fi

See the single-loop suggestion earlier.

| exit

Might be an idea to track success/failure and exit with a meaningful
status. The approach I usually use goes like this:

  xit=0

  for i; do
    fixname "$i" || xit=1
  done

  exit $xit

Of course, that requires fixname to return such a status and so forth.

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Judging by my employee ID# my employer thinks I am a small filing cabinet,
so I dont think they give a care about my opinions.
        - Michael Jones <michaelj at nafohq.hp.com>