Useful Terminal Commands for work with data

18/10/2013 Ola Løvholm Comments 0 Comment

Please, see my terminal-tools tag for more articles on nifty terminal tools.

Enclose values in quotes, and add comma

Say you work with a large datafile where you get all values printed on consecutive lines, and you want to copy these lines into an array, list or other data structure where strings need to be quoted, and values needs to be separated by commas. Here is a sed script that may help you with that task.

sed 's/\(.*\)/"\1",/g' all_uuid.csv > all_ready.txt

What this CLI snippet does is to use the sed, stream editor, and pass in the s – substitute – argument followed first by a backslash delimiter and then by a eat all regular expression enclosed in a group. This reads the whole line and is finished by yet another backslash delimiter. What comes between the first and second delimiter is the input, and now comes the output. We write the initial quote mark, then a \1 which is referring to the regular expression group in the input, this is followed by a closing quote and a comma. We add the g-argument to continue this operation for all matches, in other words all lines. Then we pass in the file we want to alter, and sends in the > operator to print the output of this command into the file all_ready.txt.

Copy a specific number of lines from one file to another

While working with large data you want to try something new. The problem is that it takes too much time running the process on a too large data-selection. How can you cut down on the length without opening the text editor. With the head command you can print the beginning of a file, and by adding a number as argument you can define exactly how many lines to add. The lines are printed to stdout, but by using the greater than operator you can route this output into a new file.

head -50000 oldfile > newfile

The head command reads the file from the beginning and is usually reading a default number of lines. We pass in an argument asking for 50.000 lines then the file which we want to read, and pass in the greater-than operator to write the output into a new file.

Getting the total number of lines of a textfile

Working with data, and transfer from one system to another, you will often find that the data is saved onto flat files. Flat files can be comma separated (the well known CSV-files), tab-separated or separated by other means. The general pattern is that attributes of the data (the columns) are stored separated by a predefined delimiter, and the occurrences (the rows) are separated by a new line. For this command to work you need the format to be of this type. Use e.g. head to verify that occurences are separated by new lines. When that is done, run the command below, and if you have a header line in your flat file substract 1 from the number of lines. This can easily be compared with the result of count(*) on the SQL database after the data is imported.

wc -l filename

Getting the total number of files in a directory

In this example we use the command from above, but this time we pipe in the output of the directory list (ls) with the each file on a new line argument (-l). With piping we can send the result of one command as the argument for a second. This kind of chaining makes *nix systems very convenient as you can do complex stuff from combining simple commands. Anyhow, here the wc -l from above gets a list of files and directories in a directory and prints the count of these.

ls -l [directory-name] | wc -l

Illustrational image by Travis Isaacs, licensed under a Creative Commons attribution licence. Found on Flickr.

lovholm.net