12. Txt Manipulation.md

Introduction

cat and echo

Working with Large and Compressed Files

sed and awk

File Manipulation Utilities

grep and strings

Miscellaneous Text Utilities

Knowledge Check (Verified Certificate track only)

Summary

Introduction

  • Display and append to file contents using **cat** and **echo**.
  • Edit and print file contents using **sed** and **awk**.
  • Search for patterns using **grep**.
  • Use multiple other utilities for file and text manipulation.

cat and echo

Command Line Tools for Manipulating Text Files

File manipulation operations let you browse through and parse text files, and extract data from them.

**cat** (concatenate)

  • Used to read and print files, as well as for simply viewing file contents

    **$ cat <filename>**

  • The main purpose of **cat** is often to combine (concatenate) multiple files together.

  • The **tac** command (**cat** spelled backwards) prints the lines of a file in reverse order. Each line remains the same, but the order of lines is inverted.

    **$ tac file**

    **$ tac file1 file2 > newfile**

| Command | Usage |
| --- | --- |
| **cat file1 file2** | Concatenate multiple files and display the output; i.e. the entire content of the first file is followed by that of the second file |
| **cat file1 file2 > newfile** | Combine multiple files and save the output into a new file |
| **cat file >> existingfile** | Append a file to the end of an existing file |
| **cat > file** | Any subsequent lines typed will go into the file, until CTRL-D is typed |
| **cat >> file** | Any subsequent lines are appended to the file, until CTRL-D is typed |

Using **cat** Interactively

  • **cat** can be used to read from standard input (such as the terminal window) if no files are specified. You can use the > operator to create and add lines into a new file, and the **>>** operator to append lines (or files) to an existing file.
  • To create a new file, at the command prompt type **cat > <filename>** and press the **Enter** key.
  • This command creates a new file and waits for you to type the text. When you have finished, press **CTRL-D** at the beginning of the next line to save the file and return to the prompt.
  • Another way to create a file at the terminal is **cat > <filename> << EOF**. A new file is created and you can type the required input. To exit, enter **EOF** at the beginning of a line.
    • Note that ==EOF== is case sensitive. One can also use another word, such as ==STOP==.
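The redirection forms above can be tried without typing interactively. The following is a minimal sketch, assuming a Linux system with GNU coreutils (for **tac**); the file names and contents are made up for illustration, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create two small files with here-documents. The terminator word is case
# sensitive and must appear alone at the start of a line.
cat > file1 << EOF
alpha
beta
EOF

cat > file2 << STOP
gamma
STOP

# Concatenate both files into a new file, then append file1 once more.
cat file1 file2 > newfile
cat file1 >> newfile

cat newfile    # alpha, beta, gamma, alpha, beta (one per line)
tac file1      # beta, then alpha
```

Because the here-document supplies the input, no **CTRL-D** is needed in this form.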

**echo** (displays (echoes) text)

**$ echo string**

  • **echo** can be used to display a string on standard output (i.e. the terminal) or to place in a new file (using the > operator) or append to an already existing file (using the **>>** operator).
  • The **-e** option, along with the following switches, is used to enable special character sequences, such as the newline character or horizontal tab:
    • **\n** represents newline
    • **\t** represents horizontal tab.
  • **echo** is particularly useful for viewing the values of environment variables (built-in shell variables).
    • For example, **echo $USER** will print the name of the user who has logged into the current terminal (some distributions also set **$USERNAME**).

| Command | Usage |
| --- | --- |
| **echo string > newfile** | The specified string is placed in a new file |
| **echo string >> existingfile** | The specified string is appended to the end of an already existing file |
| **echo $variable** | The contents of the specified environment variable are displayed |
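A short sketch of these forms follows. It assumes **bash**, whose built-in **echo** accepts **-e**; other shells may treat **-e** differently. The file name is hypothetical.

```shell
workdir=$(mktemp -d)
notes="$workdir/notes.txt"        # hypothetical file name

echo "first line" > "$notes"      # > creates (or overwrites) the file
echo "second line" >> "$notes"    # >> appends to it

# With -e, bash's echo interprets the \t and \n escape sequences.
echo -e "name:\tstudent\nshell:\tbash"

# Show the value of an environment variable.
echo "$HOME"

cat "$notes"
```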

Working with Large and Compressed Files

Working with Large Files

  • Directly opening a large file in an editor will cause issues, due to high memory utilization, as an editor will usually try to read the whole file into memory first.

  • Use **less** to view the contents of such a large file, scrolling up and down page by page, without the system having to place the entire file in memory before starting.

  • Viewing somefile can be done by typing either of the two following commands:

    **$ less somefile**

    **$ cat somefile | less**

**head**

  • **head** reads the first few lines of each named file (10 by default) and displays them on standard output.

  • The number of lines can be increased or decreased using **-n**:

    **$ head -n <number> <filename>**

**tail**

  • **tail** prints the last few lines of each named file on standard output. By default, it displays the last 10 lines.

  • **tail** is especially useful when you are troubleshooting an issue using log files, as you probably want to see the most recent lines of output.

  • The number of lines can be increased or decreased using **-n**:

    **$ tail -n <number> <filename>**

  • To continually monitor new output in a growing log file:

    **$ tail -f somefile.log**

    This command will continuously display any new lines of output in somefile.log as soon as they appear. Thus, it enables you to monitor any current activity that is being reported and recorded.
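As an illustration of **head** and **tail** with **-n**, here is a small sketch that builds its own 100-line sample file; the file name and its contents are invented for the example.

```shell
workdir=$(mktemp -d)
log="$workdir/sample.log"       # hypothetical log file

# Generate a 100-line file: "entry 1" ... "entry 100".
seq 1 100 | sed 's/^/entry /' > "$log"

first3=$(head -n 3 "$log")      # first three lines
last2=$(tail -n 2 "$log")       # last two lines

echo "$first3"
echo "$last2"

# tail -f "$log" would keep running and print new lines as they are appended;
# it is left commented out here because it never exits on its own.
```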

Viewing Compressed Files

  • When working with compressed files, many standard commands cannot be used directly.
  • For many commonly-used file and text manipulation programs, there is also a version especially designed to work directly with compressed files.
  • These associated utilities have the letter “z” prefixed to their name.
  • For example, we have utility programs such as **zcat**, **zless**, **zdiff** and **zgrep**.
  • Note that if you run **zless** on an uncompressed file, it will still work and ignore the decompression stage.
  • There are also equivalent utility programs for other compression methods besides **gzip**. For example, we have **bzcat** and **bzless** associated with **bzip2**, and **xzcat** and **xzless** associated with **xz**.

| Command | Description |
| --- | --- |
| **$ zcat compressed-file.txt.gz** | To view a compressed file |
| **$ zless somefile.gz** or **$ zmore somefile.gz** | To page through a compressed file |
| **$ zgrep -i less somefile.gz** | To search inside a compressed file |
| **$ zdiff file1.txt.gz file2.txt.gz** | To compare two compressed files |
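The sketch below compresses a small sample file and reads it back without unpacking it first. It assumes a Linux system with the **gzip** package installed (which provides **zcat** and **zgrep**); the file contents are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Less is more\nnothing here\nuse less often\n' > somefile
gzip somefile                 # produces somefile.gz and removes somefile

zcat somefile.gz              # print the decompressed contents

# Count the lines containing "less", ignoring case.
matches=$(zgrep -i -c less somefile.gz)
echo "$matches"
```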

sed and awk

**sed** (stream editor)

  • **sed** is a powerful text processing tool. It is used to modify the contents of a file or input stream, usually placing the contents into a new file or output stream.

  • **sed** can filter text, as well as perform substitutions in data streams.

  • Data from an input source/file (or stream) is taken and moved to a working space.

  • The entire list of operations/modifications is applied over the data in the working space and the final contents are moved to the standard output space (or stream).

    | Command | Usage |
    | --- | --- |
    | **sed -e command <filename>** | Specify editing commands at the command line, operate on file and put the output on standard out (e.g. the terminal) |
    | **sed -f scriptfile <filename>** | Specify a scriptfile containing sed commands, operate on file and put output on standard out |
    | **echo "I hate you" \| sed s/hate/love/** | Use sed to filter standard input, putting output on standard out |

    The **-e** option allows you to specify multiple editing commands simultaneously at the command line. It is unnecessary if you only have one operation invoked.

**sed** Basic Operations

  • **pattern** is the current string and **replace_string** is the new string:

    | Command | Usage |
    | --- | --- |
    | **sed s/pattern/replace_string/ file** | Substitute first string occurrence in every line |
    | **sed s/pattern/replace_string/g file** | Substitute all string occurrences in every line |
    | **sed 1,3s/pattern/replace_string/g file** | Substitute all string occurrences in a range of lines |
    | **sed -i s/pattern/replace_string/g file** | Save changes for string substitution in the same file |

  • Use the **-i** option with care, because the action is not reversible. It is always safer to use **sed** without the **-i** option and then replace the file yourself, as shown in the following example:

    **$ sed s/pattern/replace_string/g file1 > file2**

    The above command replaces all occurrences of pattern with replace_string in the text read from file1 and writes the result to file2, leaving file1 unchanged. The contents of file2 can be viewed with **cat file2**.

    If you approve, you can then overwrite the original file with mv file2 file1.
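The substitution forms above can be compared side by side on a small sample. This is a sketch only; the file contents are invented, and the commands are quoted so the shell passes them to **sed** unchanged.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'red red\nred green\nblue red\n' > file1

# First occurrence on each line only.
sed 's/red/RED/' file1

# Every occurrence, but only on lines 1 through 2.
sed '1,2s/red/RED/g' file1

# Two editing commands in one invocation with -e.
sed -e 's/red/RED/g' -e 's/blue/BLUE/' file1

# Write the result to a second file, inspect it, then replace the original.
sed 's/red/RED/g' file1 > file2
cat file2
mv file2 file1
```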

**awk**

  • **awk** is used to extract and then print specific contents of a file and is often used to construct reports.

  • It was created at Bell Labs in the 1970s and derived its name from the last names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan.

  • awk has the following features:

    • It is a powerful utility and interpreted programming language.
    • It is used to manipulate data files, and for retrieving and processing text.
    • It works well with fields (containing a single piece of data, essentially a column) and records (a collection of fields, essentially a line in a file).

    | Command | Usage |
    | --- | --- |
    | **awk 'command' file** | Specify a command directly at the command line |
    | **awk -f scriptfile file** | Specify a file that contains the script to be executed |

    **awk** Basic Operations

    • The input file is read one line at a time, and, for each line, **awk** matches the given pattern in the given order and performs the requested action.
    • The **-F** option allows you to specify a particular field separator character.
    • For example, the **/etc/passwd** file uses “**:**” to separate the fields, so the **-F:** option is used with the **/etc/passwd** file.
    • The command/action in **awk** needs to be surrounded with apostrophes (single quotes, **'**).

      | Command | Usage |
      | --- | --- |
      | **awk '{ print $0 }' /etc/passwd** | Print entire file |
      | **awk -F: '{ print $1 }' /etc/passwd** | Print the first field (column) of every line |
      | **awk -F: '{ print $1 $7 }' /etc/passwd** | Print the first and seventh fields of every line, joined with nothing between them; use **print $1, $7** to separate them with a space |

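To see how **-F** and **print** behave without depending on the local **/etc/passwd**, the sketch below uses a small stand-in file in the same colon-separated format. The file name and its three records are hypothetical.

```shell
workdir=$(mktemp -d)
sample="$workdir/passwd.sample"     # hypothetical stand-in for /etc/passwd

cat > "$sample" << 'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/sbin/nologin
student:x:1000:1000:Student:/home/student:/bin/bash
EOF

# First field only.
awk -F: '{ print $1 }' "$sample"

# A comma in print separates the fields with a space on output.
awk -F: '{ print $1, $7 }' "$sample"

# Without the comma the two fields are concatenated.
# The pattern NR == 1 limits the action to the first record.
joined=$(awk -F: 'NR == 1 { print $1 $7 }' "$sample")
echo "$joined"
```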
  • Lab 13.1: Using **sed**

    Search for all instances of the user command interpreter (shell) equal to **/sbin/nologin** in **/etc/passwd** and replace them with **/bin/bash**.

    • Solution

      To get output on standard out (terminal screen):

      **student:/tmp> sed s/'\/sbin\/nologin'/'\/bin\/bash'/g /etc/passwd**

      or to direct to a file:

      **student:/tmp> sed s/'\/sbin\/nologin'/'\/bin\/bash'/g /etc/passwd > passwd_new**

      Note that this is painful to read, because the forward slash (/) appears both inside the strings and as the delimiter between fields, so every slash in the strings has to be escaped. One can do instead:

      **student:/tmp> sed s:'/sbin/nologin':'/bin/bash':g /etc/passwd**

      where we have used the colon (:) as the delimiter instead. (You are free to choose your delimiting character!) In fact when doing this we do not even need the single quotes:

      **student:/tmp> sed s:/sbin/nologin:/bin/bash:g /etc/passwd**

      works just fine.

File Manipulation Utilities

sort

  • **sort** is used to rearrange the lines of a text file, in either ascending or descending order according to a sort key.

  • The default sort key is the order of the ASCII characters (i.e. essentially alphabetically).

    | Syntax | Usage |
    | --- | --- |
    | **sort <filename>** | Sort the lines in the specified file, according to the characters at the beginning of each line |
    | **cat file1 file2 \| sort** | Combine the two files, then sort the lines and display the output on the terminal |
    | **sort -r <filename>** | Sort the lines in reverse order |
    | **sort -k 3 <filename>** | Sort the lines by the 3rd field on each line instead of the beginning |

  • When used with the **-u** option, **sort** checks for unique values after sorting the records (lines). It is equivalent to running **uniq** (which we shall discuss) on the output of sort.

uniq

  • **uniq** removes duplicate consecutive lines in a text file and is useful for simplifying the text display. Because **uniq** requires that the duplicate entries must be consecutive, one often runs sort first and then pipes the output into **uniq**; if sort is used with the **-u** option, it can do all this in one step.

  • To remove duplicate entries from multiple files at once, use the following command:

    **sort file1 file2 | uniq > file3**

    or

    **sort -u file1 file2 > file3**

  • To count the number of duplicate entries, use the following command:

    **uniq -c filename**
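The two ways of removing duplicates can be checked against each other on a small sample, as in the sketch below; the file contents are made up. Note that **uniq -c** is given sorted input here, since it only counts consecutive repeats.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'pear\napple\npear\n' > file1
printf 'apple\nfig\n' > file2

sort file1 file2 | uniq > file3    # two steps
sort -u file1 file2 > file4        # one step, same result

cat file3                          # apple, fig, pear (one per line)

# Count how often each line occurs.
sort file1 file2 | uniq -c
```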

paste

  • **paste** accepts the following options:
    • **-d** delimiters, which specify a list of delimiters to be used instead of tabs for separating consecutive values on a single line. Each delimiter is used in turn; when the list has been exhausted, **paste** begins again at the first delimiter.
    • **-s**, which causes paste to append the data in series rather than in parallel; that is, in a horizontal rather than vertical fashion.

Using paste

  • **paste** can be used to combine fields from different files, as well as combine lines from multiple files.

  • For example, line one from file1 can be combined with line one of file2, line two from file1 can be combined with line two of file2, and so on.

  • To paste contents from two files one can do:

    **$ paste file1 file2**

  • The syntax to use a different delimiter is as follows:

    **$ paste -d, file1 file2**

  • Common delimiters are the space, the tab, '|', the comma, etc.
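A brief sketch of the parallel and serial modes follows, using two invented two-line files.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'alice\nbob\n' > names      # hypothetical sample files
printf '1001\n1002\n' > ids

paste names ids                    # tab-separated: alice<TAB>1001
paste -d, names ids                # comma-separated: alice,1001
paste -s -d, names                 # serial: all lines of one file on one line

combined=$(paste -d, names ids)
serial=$(paste -s -d, names)
```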

join

  • **join** is an enhanced version of **paste**. It first checks whether the files share common fields and then joins the lines in two files based on a common field.

Using join

  • To combine two files on a common field, at the command prompt type 

    **join file1 file2** and press the **Enter** key.
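As a sketch of what **join** produces, the example below joins two made-up files on their first field. It assumes both inputs are already sorted on that field, which **join** expects; lines whose key has no partner in the other file are left out of the result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Both files are sorted on their first field, which join uses by default.
printf '1001 alice\n1002 bob\n1003 carol\n' > users
printf '1001 /bin/bash\n1003 /bin/zsh\n' > shells

join users shells
# 1001 alice /bin/bash
# 1003 carol /bin/zsh
```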

split

  • **split** is used to break up (or split) a file into equal-sized segments for easier viewing and manipulation, and is generally used only on relatively large files.

  • By default, **split** breaks up a file into 1000-line segments.

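  • The original file remains unchanged, and a set of new files is created, each named with a common prefix followed by a generated suffix (**aa**, **ab**, **ac**, and so on).

  • By default, the prefix is **x**, so the pieces are named **xaa**, **xab**, and so on. To split a file into segments, use the command **split infile**.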

  • To split a file into segments using a different prefix, use the command 

    **split infile <Prefix>**.
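The naming of the pieces is easiest to see on a generated file. In this sketch the 2500-line input and the **part_** prefix are arbitrary choices for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 2500 > infile

split infile            # default: 1000 lines per piece, named xaa, xab, xac
split infile part_      # same pieces, named part_aa, part_ab, part_ac

ls
wc -l xaa xac           # 1000 lines in xaa, the remaining 500 in xac
```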

Regular Expressions and Search Patterns

  • Regular expressions are text strings used for matching a specific pattern, or to search for a specific location, such as the start or end of a line or a word. Regular expressions can contain both normal characters and so-called meta-characters, such as **\*** and **$**.

    | Search Patterns | Usage |
    | --- | --- |
    | **.** (dot) | Match any single character |
    | **a\|z** | Match a or z |
    | **$** | Match end of a line |
    | **^** | Match beginning of a line |
    | **\*** | Match preceding item 0 or more times |

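Each pattern in the table can be tried with **grep -c**, which prints the number of matching lines. This is a sketch on an invented word list; **-E** is used for the alternation because **|** is only a meta-character in extended regular expressions.

```shell
workdir=$(mktemp -d)
words="$workdir/words.txt"       # hypothetical sample file

printf 'tsar\nfirst\nstate\nbat\nbt\nbaat\n' > "$words"

grep -c '^ts' "$words"           # lines starting with "ts"              -> 1
grep -c 'st$' "$words"           # lines ending with "st"                -> 1
grep -c 'b.t' "$words"           # "b", any one character, then "t"      -> 1
grep -c 'ba*t' "$words"          # "b", zero or more "a", then "t"       -> 3
grep -E -c 'ts|st' "$words"      # lines containing "ts" or "st"         -> 3
```

The patterns are single-quoted so that the shell does not try to interpret **\***, **$** or **|** itself.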
  • Lab 13.2: Parsing Files with **awk** (and **sort** and **uniq**)

    Generate a column containing a unique list of all the shells used for users in /etc/passwd.

    You may need to consult the manual page for /etc/passwd as in:

    student:/tmp> man 5 passwd

    Which field in /etc/passwd holds the account’s default shell (user command interpreter)?

    How do you make a list of unique entries (with no repeats)?

    • Solution

      The field in /etc/passwd that holds the shell is number 7. To display the field holding the shell in /etc/passwd using awk and produce a unique list:

      **$ awk -F: '{print $7}' /etc/passwd | sort -u**

      or

      **$ awk -F: '{print $7}' /etc/passwd | sort | uniq**

      For example:

      **$ awk -F: '{print $7}' /etc/passwd | sort -u**

      /bin/bash
      /bin/sync
      /sbin/halt
      /sbin/nologin
      /sbin/shutdown

grep and strings

grep

  • **grep** is extensively used as a primary text searching tool.
  • It scans files for specified patterns and can be used with regular expressions, as well as simple strings.

    | Command | Usage |
    | --- | --- |
    | **grep [pattern] <filename>** | Search for a pattern in a file and print all matching lines |
    | **grep -v [pattern] <filename>** | Print all lines that do not match the pattern |
    | **grep [0-9] <filename>** | Print the lines that contain any of the digits 0 through 9 |
    | **grep -C 3 [pattern] <filename>** | Print each matching line together with its context: here, 3 lines above and 3 lines below the match |
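The options above can be explored on a short sample, as sketched below. The file is a made-up excerpt in the style of **/etc/services**, so the results do not depend on the local system.

```shell
workdir=$(mktemp -d)
svc="$workdir/services.sample"    # hypothetical excerpt in the style of /etc/services

cat > "$svc" << 'EOF'
ftp-data 20/tcp
ftp 21/tcp
ssh 22/tcp
tftp 69/udp
http 80/tcp
EOF

grep ftp "$svc"                      # lines containing "ftp"
grep -v tcp "$svc"                   # lines that do NOT contain "tcp"
grep -n ftp "$svc" | grep -v tcp     # ftp lines that are not tcp, with line numbers
grep -C 1 ssh "$svc"                 # the ssh line plus one line of context on each side
```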

strings

  • **strings** is used to extract all printable character strings found in the file or files given as arguments. It is useful in locating human-readable content embedded in binary files; for text files one can just use **grep**.

  • For example, to search for the string my_string in a spreadsheet:

    **$ strings book1.xls | grep my_string**

  • Lab 13.3: Using **grep**

    In the following we give some examples of things you can do with the **grep** command; your task is to experiment with these examples and extend them.

    1. Search for your username in file /etc/passwd .
    2. Find all entries in /etc/services that include the string ftp.
    3. Restrict to those that use the tcp protocol.
    4. Now restrict to those that do not use the tcp protocol, while printing out the line number.
    5. Get all lines that start with ts or end with st.
    • Solution
      1. **student:/tmp> grep student /etc/passwd**
      2. **student:/tmp> grep ftp /etc/services**
      3. **student:/tmp> grep ftp /etc/services | grep tcp**
      4. **student:/tmp> grep -n ftp /etc/services | grep -v tcp**
      5. **student:/tmp> grep -e ^ts -e st$ /etc/services**

Miscellaneous Text Utilities

tr

  • The **tr** utility is used to translate specified characters into other characters or to delete them. The general syntax is as follows:

    **$ tr [options] set1 [set2]**

  • The items in the square brackets are optional. 

  • **tr** requires at least one argument and accepts a maximum of two.

  • The first, designated **set1** , lists the characters in the text to be replaced or removed.

  • The second, **set2**, lists the characters that are to be substituted for the characters listed in the first argument.

  • Sometimes these sets need to be surrounded by apostrophes (single quotes, **'**) so that the shell does not interpret characters that have a special meaning to it. It is usually safe (and may be required) to use the single quotes around each of the sets.

    | Command | Usage |
    | --- | --- |
    | **tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ** | Convert lower case to upper case |
    | **tr '{}' '()' < inputfile > outputfile** | Translate braces into parentheses |
    | **echo "This is for testing" \| tr '[:space:]' '\t'** | Translate white-space to tabs |
    | **echo "This   is   for    testing" \| tr -s '[:space:]'** | Squeeze repetition of characters using -s |
    | **echo "the geek stuff" \| tr -d 't'** | Delete specified characters using -d option |
    | **echo "my username is 432234" \| tr -cd '[:digit:]'** | Delete everything except digits, by complementing the set with -c and deleting with -d |
    | **tr -cd '[:print:]' < file.txt** | Remove all non-printable characters from a file |
    | **tr -s '\n' ' ' < file.txt** | Join all the lines in a file into a single line |
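A few of these translations are sketched below with the results captured in variables, so each one can be inspected separately. The input strings are taken from or modeled on the table; the sets are single-quoted throughout.

```shell
upper=$(echo "hello world" | tr 'a-z' 'A-Z')                  # lower case to upper case
parens=$(echo "{a}{b}" | tr '{}' '()')                        # braces to parentheses
squeezed=$(echo "This   is   for    testing" | tr -s ' ')     # collapse repeated spaces
digits=$(echo "my username is 432234" | tr -cd '[:digit:]')   # keep only the digits
oneline=$(printf 'a\nb\nc\n' | tr '\n' ' ')                   # newlines to spaces

echo "$upper"       # HELLO WORLD
echo "$parens"      # (a)(b)
echo "$squeezed"    # This is for testing
echo "$digits"      # 432234
echo "$oneline"     # a b c
```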

tee

  • **tee** takes the output from any command, and, while sending it to standard output, it also saves it to a file.

  • In other words, it tees the output stream from the command: one stream is displayed on the standard output and the other is saved to a file.

  • For example, to list the contents of a directory on the screen and save the output to a file, at the command prompt type 

    **ls -l | tee newfile** 

    and press the Enter key.

    Typing **cat newfile** will then display the output of **ls -l**.
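The following sketch shows the same idea with fixed input, so the saved copy is predictable. The **-a** option, which is not covered above, makes **tee** append to the file instead of overwriting it.

```shell
workdir=$(mktemp -d)

# The text appears on the terminal and is also written to the file.
printf 'one\ntwo\n' | tee "$workdir/newfile"

# tee -a appends to the file instead of overwriting it.
echo three | tee -a "$workdir/newfile"

cat "$workdir/newfile"
```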

wc

  • **wc** (word count) counts the number of lines, words, and characters in a file or list of files.

    | Option | Description |
    | --- | --- |
    | **-l** | Displays the number of lines |
    | **-c** | Displays the number of bytes |
    | **-w** | Displays the number of words |
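A minimal sketch of the three options on a made-up two-line file follows. Reading the file through **<** keeps the file name out of the output, so only the number is captured.

```shell
workdir=$(mktemp -d)
sample="$workdir/sample.txt"      # hypothetical file

printf 'one two three\nfour five\n' > "$sample"

lines=$(wc -l < "$sample")        # 2 lines
words=$(wc -w < "$sample")        # 5 words
bytes=$(wc -c < "$sample")        # 24 bytes, counting both newlines

echo "$lines $words $bytes"
wc "$sample"                      # all three counts, followed by the file name
```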

cut

  • **cut** is used for manipulating column-based files and is designed to extract specific columns.

  • The default column separator is the **tab** character. A different delimiter can be given as a command option.

  • For example, to display the third column delimited by a blank space, at the command prompt type

    **ls -l | cut -d" " -f3**

    and press the Enter key.
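The **-d** and **-f** options can be sketched on fixed strings, which makes the selected fields easy to verify. The sample record follows the **/etc/passwd** format but is made up.

```shell
line="root:x:0:0:root:/root:/bin/bash"         # sample record in /etc/passwd format

user=$(echo "$line" | cut -d: -f1)             # first field
shell_path=$(echo "$line" | cut -d: -f7)       # seventh field
both=$(echo "$line" | cut -d: -f1,7)           # fields 1 and 7, still colon-separated

third=$(echo "alpha beta gamma" | cut -d" " -f3)   # third space-delimited column

echo "$user $shell_path $both $third"
```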

  • Lab 13.4: Using **tee**

    The tee utility is very useful for saving a copy of your output while you are watching it being generated.

    Execute a command such as doing a directory listing of the **/etc** directory:

    **student:/tmp> ls -l /etc**

    while both saving the output in a file and displaying it at your terminal.

    • Solution

      student:/tmp> ls -l /etc | tee /tmp/ls-output
      student:/tmp> less /tmp/ls-output
  • Lab 13.5: Using **wc**

    Using **wc** (word count), find out how many lines, words, and characters there are in all the files in /var/log that have the .log extension.

    • Solution

      **student:/tmp> wc /var/log/*.log**

Knowledge Check (Verified Certificate track only)

![[Screenshot_from_2022-06-28_11-35-21.png]]

![[Screenshot_from_2022-06-28_11-35-59.png]]

![[Screenshot_from_2022-06-28_11-37-33.png]]

![[Screenshot_from_2022-06-28_11-39-13.png]]

![[Screenshot_from_2022-06-28_11-45-43.png]]

![[Screenshot_from_2022-06-28_11-46-17.png]]

![[Screenshot_from_2022-06-28_11-40-45.png]]

![[Screenshot_from_2022-06-28_11-42-18.png]]

![[Screenshot_from_2022-06-28_11-43-36.png]]

![[Screenshot_from_2022-06-28_11-44-34.png]]

![[Screenshot_from_2022-06-28_11-46-59.png]]

![[Screenshot_from_2022-06-28_11-47-57.png]]

Summary

  • The command line often allows users to perform tasks more efficiently than the GUI.
  • **cat**, short for concatenate, is used to read, print, and combine files.
  • **echo** displays a line of text on standard output, or places it in a file when its output is redirected.
  • **sed** is a popular stream editor often used to filter and perform substitutions on files and text data streams.
  • **awk** is an interpreted programming language, typically used as a data extraction and reporting tool.
  • **sort** is used to sort text files and output streams in either ascending or descending order.
  • **uniq** eliminates consecutive duplicate entries in a text file.
  • **paste** combines fields from different files. It can also extract and combine lines from multiple sources.
  • **join** combines lines from two files based on a common field. It works only if files share a common field.
  • **split** breaks up a large file into equal-sized segments.
  • Regular expressions are text strings used for pattern matching. The pattern can be used to search for a specific location, such as the start or end of a line or a word.
  • **grep** searches text files and data streams for patterns and can be used with regular expressions.
  • **tr** translates characters, copies standard input to standard output, and handles special characters.
  • **tee** saves a copy of standard output to a file while still displaying it at the terminal.
  • **wc** (word count) displays the number of lines, words, and characters in a file or group of files.
  • **cut** extracts columns from a file.
  • **less** views files a page at a time and allows scrolling in both directions.
  • **head** displays the first few lines of a file or data stream on standard output. By default, it displays 10 lines.
  • **tail** displays the last few lines of a file or data stream on standard output. By default, it displays 10 lines.
  • **strings** extracts printable character strings from binary files.
  • The **z** command family is used to read and work with compressed files.