## Unix Tools

Unfortunately, it is impossible to cover everything you may need at any point in your future. The best I can do is get you familiar with the tools, explain why they are useful, and show you where to get more information.

The three Unix tools that I have used the most are `grep`, `sed`, and `awk`, in that order. Honestly, I've only started using `awk` in the last year! OK, so I find them useful; but what are they good for? (*Absolutely everything!*)

- `grep` is a search tool for finding lines in files that contain some string or expression.
- `sed` is a stream editor: it edits text line by line as the text passes through it.
- `awk` is a programmable tool for parsing and analysing text line by line.

Okay, but your favourite text editor has a find/replace tool, and maybe you can write *python* code to read a data file and do what calculations are necessary. Why use these tools?

Because of their simplicity, their ability to run on large files, and the ability to save a command and use it over and over again!

### First some online resources

Where should you go once you have completed this workshop, when you are using these tools and can't get them to do what you want?

- [Here](http://www.panix.com/~elflord/unix/grep.html) is a decent **`grep`** online tutorial
- [Here](http://www.grymoire.com/Unix/Sed.html) is how I learned **`sed`** (and ultimately `grep` too!)
- [Here](http://www.vectorsite.net/tsawk_1.html#m2) is everything I needed to learn **`awk`**
- [Here](http://www.grymoire.com/Unix/Regular.html) is a very good discussion of **regular expressions**, which are the backbone of `grep`, `sed`, and `awk`
- [Here](https://weblearn.ox.ac.uk/portal/hierarchy/mpls/physics/astrophysics/ctia/page/29eb3ad3-b4b5-4db9-96ad-04d3f6745e47) is a **weblearn forum** where we can discuss any questions or snags we run into when using these tools.

And remember! You can always do `man COMMAND` in Unix, and with any luck the manual will be clear enough.

First let's grab the needed files. If you haven't looked at the pre-unix tutorial, then first grab that file:

In [None]:
wget https://bitbucket.org/mlarichardson/introduction-to-unix/raw/953ea44e6ec21ee393611c4dc93db611666ea658/files_for_practice.tar.gz

Or click [Here](https://bitbucket.org/mlarichardson/introduction-to-unix/raw/953ea44e6ec21ee393611c4dc93db611666ea658/files_for_practice.tar.gz) and save it in the directory you wish to work in. Then:

In [None]:
tar -xvzf files_for_practice.tar.gz

Finally, [here](https://bitbucket.org/mlarichardson/introduction-to-unix/raw/953ea44e6ec21ee393611c4dc93db611666ea658/more_files.tgz) are some more files (actual data that I use and parse regularly):

In [None]:
wget https://bitbucket.org/mlarichardson/introduction-to-unix/raw/953ea44e6ec21ee393611c4dc93db611666ea658/more_files.tgz
tar -xvzf more_files.tgz

### `grep`

`grep` takes arguments like any other unix command:

`grep <-OPTIONS> <EXPRESSION> <FILES>`

Let's go (`cd`) into the `files_for_practice/random_files/` directory and try some simple commands (but run them one at a time!):

In [None]:
cd files_for_practice/random_files/
head users.txt
grep rob users.txt
grep -n rob users.txt
grep -n -A 5 rob users.txt
grep -n -B 5 rob users.txt
grep -n -C 5 rob users.txt
grep '^r' users.txt
grep 't$' users.txt
grep '^r' *
grep '^r.*[0-9]$' *
grep '[xz]' users.txt
grep -v '[aei]' users.txt
grep -v '[aei]' users.txt | grep '[0-9]'
grep -v '[aei]' users.txt | grep '[14-9]' > OU_Numbered.txt
cat OU_Numbered.txt
grep -e ra -e lu users.txt
grep '[a-z]\{11\}' users.txt
grep '\([a-z]\)\1.*\1' users.txt

That was a lot, but it covers basically every aspect of `grep` that I use.

#### Options:
- `-n` adds the line number to the output
- `-v` is a negation, selecting only lines that **DON'T** match the expression
- `-e` allows you to enter several search expressions, any of which will return the line.
- `-i` makes it case insensitive. Not useful here since the whole file is lower case.
- `-A #N` prints the next #N lines as well as the matching line. Remember this with "After"
- `-B #N` prints the previous #N lines as well as the matching line. Remember this with "Before"
- `-C #N` prints the surrounding #N lines (above and below) as well as the matching line. Remember this with "Context"
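
To see these flags side by side, here is a small self-contained sketch. The file `names.txt` and the names in it are invented for illustration (they are not the contents of `users.txt`), and the shell variables only exist to hold each result.

```shell
# Hypothetical sample data, invented for this sketch.
workdir=$(mktemp -d)
printf '%s\n' alice Robert bob3 carol rob42 dave > "$workdir/names.txt"

# -n prefixes each matching line with its line number.
numbered=$(grep -n rob "$workdir/names.txt")

# -i ignores case, so "Robert" now matches as well as "rob42".
anycase=$(grep -i rob "$workdir/names.txt")

# -v inverts the match: keep lines containing none of a, e, i.
inverted=$(grep -v '[aei]' "$workdir/names.txt")

# -e can be repeated; a line matching either expression is printed.
either=$(grep -e ali -e dav "$workdir/names.txt")

# -A 1 also prints one line after each matching line.
after=$(grep -A 1 carol "$workdir/names.txt")

printf '%s\n' "$numbered" "$anycase" "$inverted" "$either" "$after"
rm -rf "$workdir"
```

Storing each result in a variable is just a convenience here; at the terminal you would normally run the `grep` commands directly.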

#### Regular Expression:

What falls into the EXPRESSION argument is called a regular expression, and it follows many rules (see the link at the top). Almost all characters match themselves, with a few exceptions:
- `.` is a wildcard that matches any single character. To search for a literal "." enter `\.`
- `\` is an escape character. It is (usually) used to turn a *special* character into its literal version (there are some exceptions).
- `^` means the beginning of a line
- `$` means the end of a line
- `[...]` matches any single character listed inside the brackets, with "-" denoting a range of characters (e.g. `[a-z]` or `[0-9]`)
- `*` means repeat the previous character zero or more times
- `+` is the same as `*` but the character must be found at least once. Plain `grep` treats `+` as a literal character; use `grep -E` (or `\+` in GNU `grep`) to get this meaning.
- `\{N\}` means repeat the previous character exactly N times; `\{N,\}` means at least N times and `\{N,M\}` means between N and M times. Because `grep` matches anywhere in a line, a longer run also contains a run of N, so `\{N\}` on its own behaves like a minimum unless the rest of the pattern pins down both ends.
    - `[a-z]\{11\}` finds any lines with 11 (or more) consecutive letters
- `\( \)` saves whatever was matched between them. You typically include "[...]" inside of these.
    - `\([a-z]\)\1.*\1` finds any lines where a letter is doubled and the same letter appears again later in the line.
- `\1` means use a saved value. If you use multiple saved values you reference them with `\1`, `\2`, etc.
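
The rules above can be tried out on a tiny word list. This is only a sketch: `words.txt` and its contents are made up for the example, and each variable holds the lines that one pattern selects.

```shell
# Hypothetical word list, invented for this sketch.
workdir=$(mktemp -d)
printf '%s\n' robin tree a.b axb bookkeeper 2024 > "$workdir/words.txt"

# ^ anchors the match to the start of the line.
starts_with_r=$(grep '^r' "$workdir/words.txt")

# An escaped dot matches only a literal "."; a bare dot matches any character.
literal_dot=$(grep 'a\.b' "$workdir/words.txt")
any_char=$(grep 'a.b' "$workdir/words.txt")

# [0-9] is one digit; \{4\} asks for four of them in a row.
four_digits=$(grep '[0-9]\{4\}' "$workdir/words.txt")

# \([a-z]\)\1 is a doubled letter; .*\1 requires the same letter again later.
# "tree" has a doubled e but no later e; "bookkeeper" has "ee" followed by another e.
doubled_again=$(grep '\([a-z]\)\1.*\1' "$workdir/words.txt")

printf '%s\n' "$starts_with_r" "$literal_dot" "$any_char" "$four_digits" "$doubled_again"
rm -rf "$workdir"
```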


### Let's look at real data ...

Go up two directories and into the `more_files/` directory: 

In [None]:
cd ../../more_files
wc -l *

Let's find the largest Halo:

In [None]:
grep '[0-9]\{6\} part' Halo_File.txt

Looks like it's Halo 601. Let's see its details:

In [None]:
grep '^Halo *601 ' Halo_File.txt
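
As a sketch of why the six-digit pattern singles out the biggest halos, here is the same search on a few made-up lines. The line layout (`Halo ID has N particles`) is an assumption inferred from the commands in this tutorial, not copied from `Halo_File.txt`.

```shell
# Hypothetical halo lines; the layout is assumed, the numbers are invented.
workdir=$(mktemp -d)
cat > "$workdir/halos.txt" <<'EOF'
Halo 1 has 532 particles
Halo 2 has 120034 particles
Halo 3 has 98211 particles
Halo 4 has 1200345 particles
EOF

# Six digits followed by " part": matches counts with six OR MORE digits,
# because a seven-digit number also ends in six digits.
six_or_more=$(grep '[0-9]\{6\} part' "$workdir/halos.txt")

# A leading space pins down the start of the number, so only counts
# with exactly six digits match.
exactly_six=$(grep ' [0-9]\{6\} part' "$workdir/halos.txt")

printf '%s\n' "$six_or_more" "$exactly_six"
rm -rf "$workdir"
```

If the six-digit search returns nothing, drop to five digits; if it returns too many lines, go up to seven.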

We'll come back to this in a bit. For now, what if I want just the positions of all of the halos, to plot with some other plotter? I need to parse this file and extract only the necessary parts. `sed` or `awk` can come in handy here, and I'll start by looking at `sed`.

Let's first grab a few of the position lines (this recreates the already existing file "Head_of_Position.txt"), which we will use below.

In [None]:
grep pos Halo_File.txt | head > Head_of_Position.txt

### `sed`

`sed` is a stream editor. It allows you to change the strings, line by line, that are passed to it. It uses the same regular expression format as above for `grep`. I'm afraid I *only* use `sed` in ONE way (but a powerful one!), with the `s` command. There are more ways to use it, and you can use the resources above to read more about them.

The `s` command is for "substitution":

`sed 's/OLD_REG_EXP1/NEW_EXPRESSION/' `

The "/" delimiter is the usual choice, but it can be almost any single character!

Let's try a few things:

In [None]:
cat Head_of_Position.txt
sed 's/has pos/has Position =/' Head_of_Position.txt
sed 's/has pos//' Head_of_Position.txt
cat Head_of_Position.txt | sed 's/has pos//'
cat Head_of_Position.txt | sed 's/abcdefg/ABCDEFGH/'
cat Head_of_Position.txt | sed 's/^.*xc//'
cat Head_of_Position.txt | sed 's/-[xyz]c//'
cat Head_of_Position.txt | sed 's/-[xyz]c//g'
cat Head_of_Position.txt | sed 's/-\([xyz]\)c/\1-Centre/g'
cat Head_of_Position.txt | sed 'sBHalo *BH=B'

Notice that the arguments are always passed to `sed` completely within quotes (they need not be single quotes, but there is a difference which I will cover in a session on Bash scripting). Also, the expression always begins with *s*, to signify that you are doing a substitution.

Inside OLD_REG_EXP1 is the standard regular expression format, as seen in the `grep` discussion. Note that you can use a saved value from the OLD expression in the NEW expression.

By default, if there are multiple sub-strings on a line that `sed` can match, it will only replace the left-most one. By adding the `g` flag you can have it replace every match on the line.

Also remember that the delimiter does not need to be "/"; it can be almost any character. However, if the delimiter appears in any of the string arguments it must be escaped with a "\".
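
Here is a minimal sketch of those points on a single made-up line. The layout of `line` is an assumption based on the commands in this section, and the path in the last example is hypothetical.

```shell
# Assumed line layout; the values are invented.
line='Halo 7 has pos ( -xc 0.25 -yc 0.50 -zc 0.75 )'

# Without g, only the left-most match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/-[xyz]c//')

# With g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/-[xyz]c//g')

# \( \) saves the matched letter and \1 reuses it in the replacement.
renamed=$(printf '%s\n' "$line" | sed 's/-\([xyz]\)c/\1-Centre/g')

# With | as the delimiter, a literal / in the pattern needs no escaping.
shortened=$(printf '%s\n' '/home/user/data' | sed 's|/home/user|~|')

printf '%s\n' "$first_only" "$all_matches" "$renamed" "$shortened"
```

Choosing a delimiter that does not occur in the text is usually easier to read than escaping every "/".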

### Task: Make a comma-separated file with x,y,z,r,M

Now we are going to combine a few different commands to parse the Halo file into a nice and neat comma-separated values file with position, size, and mass.

These are on different lines of the file, so it isn't trivial.

First start with a header:

In [None]:
echo "x,y,z,r,m" > Halos.csv

Then let's grab the positions:

In [None]:
grep pos Halo_File.txt | head
grep pos Halo_File.txt | sed 's/^.*xc *//' | head
grep pos Halo_File.txt | sed 's/^.*xc *//' | sed 's/ -[yz]c */,/g' | head
grep pos Halo_File.txt | sed 's/^.*xc *//' | sed 's/ -[yz]c */,/g' | sed 's/ ) *$/,/' | head
grep pos Halo_File.txt | sed 's/^.*xc *//' | sed 's/ -[yz]c */,/g' | sed 's/ ) *$/,/' > tmp_pos
head tmp_pos

Now grab the sizes:

In [None]:
grep rvir Halo_File.txt | head
grep rvir Halo_File.txt | sed 's/^.*rvir *//' | head
grep rvir Halo_File.txt | sed 's/^.*rvir *//' | sed 's/ and.*$/,/' | head
grep rvir Halo_File.txt | sed 's/^.*rvir *//' | sed 's/ and.*$/,/' > tmp_r
head tmp_r

And finally the masses:

In [None]:
grep mvir Halo_File.txt | head
grep mvir Halo_File.txt | sed 's/^.*mvir *//' > tmp_m

Now we need to combine all of these files so that the first lines of each file are joined into a single first line, the second lines into a single second line, and so on. For this we use `paste`:

In [None]:
paste tmp_pos tmp_r tmp_m | head
paste -d '\0' tmp_pos tmp_r tmp_m | head
paste -d '\0' tmp_pos tmp_r tmp_m >> Halos.csv

**NOTE** I have used "`>>`" because I am **APPENDING** to the already existing "Halos.csv" file. 

You can remove the "tmp" files if you like (`rm -i tmp_*`). By default `paste` joins the corresponding lines with a tab character, which is fine for a csv file. However, I think no character is better, which we accomplish with the `-d` option for "delimiter", where `'\0'` stands for an empty delimiter. We could also have put in the comma with `paste` (`-d ,`), instead of adding it at the end of the above `sed` commands.
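
The effect of the delimiter is easiest to see on two-line files. This is a sketch with invented numbers; `bare_r` is a hypothetical extra file without trailing commas, used to show the `-d ,` alternative, and `'\0'` is the POSIX way of writing an empty delimiter for `paste`.

```shell
# Hypothetical stand-ins for tmp_pos, tmp_r and tmp_m; the values are invented.
workdir=$(mktemp -d)
printf '%s\n' '0.1,0.2,0.3,' '0.4,0.5,0.6,' > "$workdir/tmp_pos"
printf '%s\n' '0.05,' '0.07,' > "$workdir/tmp_r"
printf '%s\n' '1.0e12' '2.5e11' > "$workdir/tmp_m"

# Default: corresponding lines are joined with a tab.
tabbed=$(paste "$workdir/tmp_pos" "$workdir/tmp_r" "$workdir/tmp_m")

# -d '\0': corresponding lines are joined with nothing in between.
joined=$(paste -d '\0' "$workdir/tmp_pos" "$workdir/tmp_r" "$workdir/tmp_m")

# Alternative: leave the commas out of the files and let paste add them.
printf '%s\n' '0.05' '0.07' > "$workdir/bare_r"
with_commas=$(paste -d , "$workdir/bare_r" "$workdir/tmp_m")

# Header first with >, then append the rows with >>.
echo "x,y,z,r,m" > "$workdir/Halos.csv"
paste -d '\0' "$workdir/tmp_pos" "$workdir/tmp_r" "$workdir/tmp_m" >> "$workdir/Halos.csv"
csv=$(cat "$workdir/Halos.csv")

printf '%s\n' "$csv"
rm -rf "$workdir"
```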

### Time for `awk`?

`awk` is basically a glorified `grep` with the ability to do a lot more, very easily.

Try this:

In [None]:
awk '/pos/' Halo_File.txt | head
grep pos Halo_File.txt | head

So what's the difference? It's in being able to control what is printed!

In [None]:
awk '/pos/ {print $2}' Halo_File.txt | head
awk '/pos/ {print $7, $9, $11}' Halo_File.txt | head
awk '/pos/ {print $7","$9","$11}' Halo_File.txt | head

That is a lot easier than the `sed` sequence above! Each whitespace-separated column on a line is stored in a numbered variable, accessed with the "$" symbol (`$1` is the first column, `$2` the second, and so on). But we can actually do operations with these variables: 

In [None]:
awk '/pos/ {print sqrt($7^2 + $9^2 + $11^2)}' Halo_File.txt | head
awk '/mass/ {mass += $8} END {print mass}' Halo_File.txt | head
awk '$5~/123/' Halo_File.txt | head
awk 'BEGIN {minx=1e30;miny=1e30;minz=1e30} 
     /pos/ {if ($7<minx) minx=$7 ; if ($9<miny) miny=$9 ; if ($11<minz) minz=$11 ; } 
      END {print "mins are", minx, miny, minz}' Halo_File.txt

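`awk` programs have 3 sections: The beginning, specified with the "`BEGIN`" keyword, which runs before any input is read and is useful for initializations (not needed for summation variables, since unset variables start at zero). The middle, which runs on every line that matches its search string (if the search string is omitted `awk` works on every line). And the end, specified with "`END`", which runs after the last line to do any final work. 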
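
The three sections can be seen at work on two made-up lines. This is a sketch only: `halos.txt` and its values are invented, and the column layout is an assumption inferred from the field numbers used above.

```shell
# Hypothetical input; layout assumed, values invented.
workdir=$(mktemp -d)
cat > "$workdir/halos.txt" <<'EOF'
Halo 1 has pos ( -xc 3 -yc 4 -zc 0 )
Halo 2 has pos ( -xc 1 -yc 2 -zc 2 )
EOF

# Middle section only: print selected columns joined by commas.
columns=$(awk '/pos/ {print $7","$9","$11}' "$workdir/halos.txt")

# Arithmetic on columns: distance of each halo from the origin.
distance=$(awk '/pos/ {print sqrt($7^2 + $9^2 + $11^2)}' "$workdir/halos.txt")

# All three sections: BEGIN initialises, the middle updates, END reports.
smallest_x=$(awk 'BEGIN {minx = 1e30}
                  /pos/ {if ($7 < minx) minx = $7}
                  END   {print "min x is", minx}' "$workdir/halos.txt")

# A running sum needs no BEGIN, because unset variables start at zero.
total_x=$(awk '/pos/ {sum += $7} END {print sum}' "$workdir/halos.txt")

printf '%s\n' "$columns" "$distance" "$smallest_x" "$total_x"
rm -rf "$workdir"
```

The minimum needs a `BEGIN` because zero would be a wrong starting value for it, whereas zero is exactly the right starting value for a sum.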

Finally, we can put everything within the quotes into a file (without the quotes) called, say, AWK_FILE.txt and then call awk with the file:

In [None]:
awk -f AWK_FILE.txt Halo_File.txt

Here is an example `AWK_FILE.txt` that can do most of the above:

In [None]:
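     # Lines containing "Halo": note the halo id, then store whichever quantity the line holds.
     /Halo/ {id=$2; if ($5 == "particles") np[id]=$4 ;
                    if ($4 == "mass"     ) mass[id]=$5 ;
                    if ($4 == "radius"   ) r[id]=$7 ;
                    if ($4 == "pos"      ) {x[id]=$7 ; y[id]=$9 ; z[id]=$11 }
            }
     # Blank (or space-only) lines: print everything collected for the most recent halo.
     /^ *$/ {printf("%5d %6d %9.5f %8.6f %8.6f %8.6f %8.6f\n",id,np[id],mass[id],x[id],y[id],z[id],r[id])}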
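
To try the `-f` mechanism without the real data, here is a cut-down sketch in the same spirit. Everything in it is hypothetical: the file names `halos.txt` and `positions.awk`, the assumed line layout, and the numbers. Each halo's lines are followed by a blank line, which is what triggers the output.

```shell
# Hypothetical input: two halos, each followed by a blank line.
workdir=$(mktemp -d)
cat > "$workdir/halos.txt" <<'EOF'
Halo 1 has 120 particles
Halo 1 has pos ( -xc 3 -yc 4 -zc 0 )

Halo 2 has 45 particles
Halo 2 has pos ( -xc 1 -yc 2 -zc 2 )

EOF

# The awk program goes in its own file, without the surrounding quotes.
cat > "$workdir/positions.awk" <<'EOF'
# Collect values per halo id as the lines go by.
/Halo/ { id = $2
         if ($5 == "particles") np[id] = $4
         if ($4 == "pos") { x[id] = $7; y[id] = $9; z[id] = $11 }
       }
# A blank line ends a halo: print what was collected for it.
/^ *$/ { printf("%d,%d,%s,%s,%s\n", id, np[id], x[id], y[id], z[id]) }
EOF

rows=$(awk -f "$workdir/positions.awk" "$workdir/halos.txt")

printf '%s\n' "$rows"
rm -rf "$workdir"
```

Keeping the program in a file means it can be reused and edited like any other script, which is handy once it grows beyond a line or two.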


### One last tool : history

`history` lets you look through (some of) the commands you've previously run. If you have multiple terminals open, they may not share their histories with each other. Also, if a terminal is abruptly closed, you may lose its history. That said ...

In [None]:
history | tail
history | grep awk | tail
history | grep sed | tail