Text processing

Fri Feb 3 16:06:02 UTC 2006

Dan Track wrote:
> Hi
> 
> I've got the following output
> 
> Col1    Col2   Col3       Col5
> 1         000    001        Yes
> 2         000    001
> 3         000    001
> 4         Yes                 Yes
> 4         000    001
> 4         000    001
> 5         000    001
> 5         Yes    001
> 6         000    001        Yes
> 
> As you can see the column widths vary in size. What I need to do is to
> find out the number in Col1 that is associated with each of the "Yes"
> occurrences in Col5. How can I do this?
> I've tried the following
> cat file | tr -s ' ' ' ' | tr -s '\t' ' ' | cut -d ' ' -f 6
> 
> But I get a result like this
> 
> Col1 Col2 Col3 Col5
> 1 000 001 Yes
> 2 000 001
> 3 000 001
> 4 Yes Yes
> 4 000 001
> 4 000 001
> 5 000 001
> 5 Yes 001
> 6 000 001 Yes
> 
> As you can see one of the "Yes" statements has moved into the third
> column, so that's a wrong move.
> 
> Any help would be appreciated

I think the problem is that some of your columns are empty. For
instance, this row:

Col1    Col2   Col3       Col5
4         Yes                 Yes

appears the same as:

Col1    Col2   Col3       Col5
4       Yes    Yes

to most Unix text-processing tools, which treat any run of whitespace as a
single field separator, so an empty column leaves no trace.
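
To see the collapse for yourself, here is a minimal check that feeds the
row above to awk and counts the fields it finds. The variable names are
only for illustration:

```shell
# Row copied from the sample above; Col3 is blank in it.
row='4         Yes                 Yes'

# awk's default splitting treats each run of blanks as one separator,
# so the blank column is not counted as a field.
nf=$(printf '%s\n' "$row" | awk '{ print NF }')
echo "$nf"    # prints 3, not 4
```

The second "Yes" therefore ends up as field 3, which is the shift you
saw in your own output.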

If you're actually looking for lines where the last field is "Yes", you 
could just do:

$ awk '$NF == "Yes"' file

If all you want is the number in the first field, use:

$ awk '$NF == "Yes" { print $1 }' file
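
One caveat with the $NF test: it would also match a row whose only "Yes"
sits in an earlier column with nothing after it. As a rough sketch of a
position-based alternative, the following remembers where the "Col5"
header starts and accepts a trailing "Yes" only if it begins at or to the
right of that offset. It assumes the file is padded with spaces rather
than tabs and that Col5 values never start to the left of the header,
which holds for the sample in this thread but may not hold for your real
file. The row numbered 7 is made up to show the difference:

```shell
# Sample data from the thread, plus one hypothetical row (7) whose only
# value is a "Yes" in Col2.
data='Col1    Col2   Col3       Col5
1         000    001        Yes
2         000    001
4         Yes                 Yes
5         Yes    001
6         000    001        Yes
7         Yes'

result=$(printf '%s\n' "$data" | awk '
NR == 1 { col5 = index($0, "Col5"); next }   # offset of the Col5 header
# Accept a trailing "Yes" only if it starts at or beyond that offset.
match($0, /Yes[ \t]*$/) && RSTART >= col5 { print $1 }
')
echo "$result"    # prints 1, 4 and 6, one per line
```

With the same data, the $NF test would print 7 as well.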

Paul.