Objective

  • Introduction to the first processing commands (cat, head, tail, wc, cut), the pipe principle and redirections.

  • Introduction to more complex commands (grep, tr, sort, uniq, curl, wget).

Exercises

Preamble

  1. Find all files in the CLDungeon directory (and its sub-directories) that contain the $> character sequence (grep).

    Answer
    $> pwd
    ..../CLDungeon
    $> grep -rl '$>' .
  2. For all files found in the previous question, replace the sequence $> with the sequence #> (grep + tr + redirections + copies, or sed).

    Answer
    $> grep -rl '$>' . | xargs -IX sed -i -e 's/\$>/#>/g' X
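Before running the in-place replacement over a whole directory, it can be reassuring to test the sed expression on a throwaway file first (the sample content below is made up):

```shell
# Fabricated one-line sample containing the $> sequence
printf 'first $> second $> third\n' > /tmp/sed-check.txt
# Escape the $ so sed treats it literally; g replaces every occurrence on a line
sed -i -e 's/\$>/#>/g' /tmp/sed-check.txt
cat /tmp/sed-check.txt   # prints: first #> second #> third
```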

Initial data processing

  • The following questions are to be answered on the command line and concern the CSV file provided, which you must download beforehand.
    The file is an extract from a NoSQL database that stores data from sensors installed in the Espace Fauriel building.

  • For each command, measure and record the time it takes to run (see the note on measuring command execution time in the Notes section).

  • You should also note down every command you use to answer the questions.

  1. Download the data file

    $> curl -u cps22023:cps22023 -O https://ci.mines-stetienne.fr/cps2/course/cls/assets/data/20200423-062902-metrics-daily.csv.gz

    or, if you don’t have curl on your system

    $> wget --user cps22023 --password cps22023 https://ci.mines-stetienne.fr/cps2/course/cls/assets/data/20200423-062902-metrics-daily.csv.gz
  2. What is the size of the downloaded file? (ls)

  3. Unzip this file (gunzip)

  4. Now, how big is it?

  5. What, a priori, is the file’s encoding (file)?
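As a minimal illustration of what file reports, here it is run on a throwaway text file (the name and content are made up):

```shell
printf 'hello\n' > note.txt   # plain ASCII content
file note.txt                 # reports the guessed type, e.g. "note.txt: ASCII text"
```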

  6. Without opening the file with a text editor, determine the number of lines and the number of records it contains (wc). You can also obtain the number of words and characters it contains and compare these with the result of question 4 (the size of the decompressed file).

    Answer
    $> wc -l 20200423-062902-metrics-daily.csv
    $> wc -w 20200423-062902-metrics-daily.csv
    $> wc -c 20200423-062902-metrics-daily.csv
  7. Discover and comment on the file structure (how the data is organized).

    Answer

    The file is organized with one record per line. Each record has 8 fields. Field 2 corresponds to the date of the record. Fields 3 to 5 correspond to values (humidity, luminosity, temperature). Field 6 is a device identifier. Field 7 is the room in which it is installed. Field 8 specifies that it is a sensor.
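A quick way to probe such a structure without an editor, demonstrated here on a fabricated two-line sample (the real header and field names may differ):

```shell
# Made-up sample mimicking a CSV with a header line and one record
cat > sample-structure.csv <<'EOF'
name,time,humidity,luminosity,temperature,device,location,type
metrics,1587621000000000000,45.2,,,sensor-01,Room-101,sensor
EOF
head -n 2 sample-structure.csv                              # header + first record
sed -n '2p' sample-structure.csv | awk -F',' '{print NF}'   # field count: 8
```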

  8. (@) How many entries are there for mobile sensors? A mobile sensor has a location which includes the string Mobile (grep, wc).

    Answer
    $> grep -i mobile 20200423-062902-metrics-daily.csv | wc -l
  9. (@) Give the unique identifiers of the sensors that emitted data stored in this file (cut, sort, uniq).

    Answer
    $> tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f 6 | sort -u
  10. (@) Give the unique identifiers of mobile sensors (grep, cut, sort, uniq).

    Answer
    $> tail -n+2 20200423-062902-metrics-daily.csv | grep -i mobile | cut -d',' -f 6 | sort -u
  11. The second field of each record (time) represents the date of the record. Can you guess what this representation is? Look it up and check your hypothesis (see the note on date representation in the Notes section).

    Answer
    Dates are represented in nanoseconds from Epoch.
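Such a timestamp can be converted by dividing by 10^9 to get back to seconds (the timestamp below is fabricated for illustration; date -d is GNU-specific):

```shell
NS=1587621000000000000                       # made-up nanosecond timestamp
date -u -d "@$((NS / 1000000000))" +"%F %R"  # prints: 2020-04-23 05:50
```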
  12. What is the date of the first and last record in the file (head, tail, cut, date)? You can propose two commands to obtain both values, but it’s possible to do it in a single command (sed).

    Answer
    First record
    $> DTNS=$(tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f2 | head -n1)
    $> N=$((${#DTNS} - 9))
    $> date --date="@$(echo ${DTNS} | cut -c-$N)"
    Last record
    $> DTNS=$(tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f2 | tail -n1 | cut -c-10)
    $> date --date="@${DTNS}"
    Or in a single shot
    $> sed -n -e '2p;$p' 20200423-062902-metrics-daily.csv | cut -d',' -f2 | cut -c-10 | xargs -IX date -d '@X' +"%F %R"
    
    or
    
    $> sed -n -e '2p;$p' 20200423-062902-metrics-daily.csv | cut -d',' -f2 | rev | cut -c10- | rev | sed 's/^/@/' | date -f - +"%F %R"
  13. Without modifying the original file, create the following new files:
    1) (@) a file containing only mobile sensor records (file name is mobiles.csv).
    2) a file containing records from non-mobile sensors (file name is fixes.csv).
    3) one file per sensor containing all data for that sensor (the file name is the sensor name, with the extension csv);
    These files must have the same format as the original file, without the name and type fields (grep, cut, redirections).

    Answer
    For each sensor
    $> grep <id-sensor> 20200423-062902-metrics-daily.csv > <id-sensor>.csv

    or, automatically with xargs (you have to go through a subshell to be able to use the replacement pattern with the redirection):

    $> tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f8 | sort -u | xargs -IC bash -c "grep 'C' 20200423-062902-metrics-daily.csv > 'C.csv'"
    mobiles.csv
    $> grep -i mobile 20200423-062902-metrics-daily.csv > mobiles.csv
    fixes.csv
    $> grep -vi mobile 20200423-062902-metrics-daily.csv > fixes.csv
  14. (*) From the original file, list the sensors that recorded humidity data (HMDT) and the sensors that recorded brightness data (LUMI).

    Answer
    HMDT
    $> tail -n+2 20200423-062902-metrics-daily.csv | egrep "^metrics,[[:digit:]]{1,},[^,]+,{6}" | cut -d',' -f9 | sort -u
    HMDT (alternative)
    $> tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f3,9 | grep -v "^," | cut -d',' -f2 | sort -u
    LUMI
    $> tail -n+2 20200423-062902-metrics-daily.csv | cut -d',' -f4,9 | grep -v "^," | cut -d',' -f2 | sort -u
  15. (difficult) For each of the mobiles.csv and fixes.csv files created above, propose a method (without writing the commands) to obtain one file for each day of readings.

    Answer

    The principle is to transform the file so that the dates are in YYYY-MM-DD format, then filter on these dates. You can draw inspiration from what was done in question 12.

    You’ll notice that this is very slow (reading the file line by line and converting each date is what takes the time).

    An alternative solution is to extract the dates into a temporary file ft1, then transform the dates in this temporary file into YYYY-MM-DD format.

    Next, you create another temporary file ft2, which is ft1 and the original file combined side by side (e.g., with paste).

    All that remains is to loop over the dates, filtering ft2 on each date to create the daily files.
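Although the question only asks for a method, the pasting and looping steps can be sketched on entirely fabricated miniature inputs (file names and layout are assumptions, not the real data):

```shell
# Fabricated inputs: one YYYY-MM-DD date per record, aligned line by line
printf '2020-04-23\n2020-04-23\n2020-04-24\n' > ft1.txt
printf 'a,1\nb,2\nc,3\n' > records.csv
# ft2 = dates pasted next to the original records
paste -d',' ft1.txt records.csv > ft2.csv
# one output file per distinct day, dropping the helper date column again
for d in $(cut -d',' -f1 ft2.csv | sort -u); do
  grep "^$d," ft2.csv | cut -d',' -f2- > "day-$d.csv"
done
cat day-2020-04-24.csv   # prints: c,3
```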

  16. (optional) Using your favorite programming language, create a program that answers the question marked with an asterisk (only the humidity part). Compare execution times.

Working remotely

Sometimes it’s necessary, or simply practical, to carry out processing on a remote machine. You’re going to try this out by testing your scripts on larger data sets, deposited on a remote machine.

The remote computer is actually a virtual machine made available to us by the IT department. The computer name is ens-srv-eftcl.emse.fr.

Connect

To access the remote machine, you need to be able to connect to it (i.e., have an account there — login and password). In this case, the machine uses the school’s LDAP directory to authenticate you.

Nowadays, the most common way of accessing remote resources is via the SSH protocol, using tools such as ssh on the command line or PuTTY in graphical mode.

Connecting to the remote computer using ssh command
$> ssh ens-srv-eftcl.emse.fr
luisgustavo.nardin@ens-srv-eftcl.emse.fr's password:
Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-100-generic x86_64)
...
...
luisgustavo.nardin@ens-srv-eftcl:~$

You’ll notice that the prompt is composed of the user name (luisgustavo.nardin), the machine name (ens-srv-eftcl) and the current directory (~). It ends with the $ character instead of the usual >.

Each user has a different home directory on the remote machine. But resources (disk and CPU) are shared!
Unfortunately, this machine is only accessible from the school (wired network, eduroam WiFi network and emse-invite WiFi network).

Copying files remotely

If you’ve written scripts to answer the above questions, you can copy them to the remote machine and then run them on the data there.

Copying is also done using the SSH protocol, but this time through the scp tool (secure copy).

Copying files to a remote computer
$> scp my_script.sh ens-srv-eftcl.emse.fr: (1)
$> scp other_script.py ens-srv-eftcl.emse.fr:Python/ (2)
1 Note the : at the end of the remote machine name
2 It is possible to copy directly into a pre-existing directory on the remote machine (here the Python directory in the user’s folder).

By default, copied files will be placed in your user directory on the remote machine. You may need to adjust their permissions to make them executable (chmod).
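Adjusting the rights typically means adding the execute bit with chmod, shown here on a stand-in file:

```shell
touch my_script.sh        # stand-in for the copied script
chmod u+x my_script.sh    # grant the owner execute permission
ls -l my_script.sh        # the mode now starts with -rwx for the owner
```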

Find data

Data files are stored in the /opt/bdat-shell directory. Here you’ll find the file you’ve already been working on, and another, larger file containing more data.

Resources on this machine are limited. You must not copy data files to your home directory or elsewhere.
Data directory contents
luisgustavo.nardin@ens-srv-eftcl:~$ ls -l /opt/bdat-shell
total 2984184
-rw-r--r-- 1 luisgustavo.nardin utilisa.^du^domaine  102429832 Feb 28 15:52 20200423-062902-metrics-daily.csv
-rw-r--r-- 1 luisgustavo.nardin utilisa.^du^domaine 2953363792 Feb 28 15:38 20200501-065201-metrics.csv
luisgustavo.nardin@ens-srv-eftcl:~$

Work!

You can now work remotely and test your commands on the data provided.

If you need to edit files, you can use the nano or vim editors. They work on the command line and do not require a graphical interface.

Redo the lab questions marked with (@), this time with the file 20200501-065201-metrics.csv as your data source. As your commands remain unchanged, you’ll be particularly interested in the execution times. Since the original file is also available remotely, you can compare execution times on the two files.

Notes

Date representation

In general, in the computer world, dates are given in seconds since January 1st 1970 (Epoch).

Dates in the file you’re manipulating are given in nanoseconds elapsed since Epoch.

With the date command you can obtain today’s date in many formats. You can also convert a given date into another format:

  • date +%s will give the current date in seconds from Epoch.

  • date -d '@1550827169' +"%F %R" will convert a date given in seconds from Epoch into YYYY-MM-DD HH:MM format (here 2019-02-22 10:19).

The date man page will tell you more if you want to go further (man).

This number is encoded on 32 bits. When will this be a problem?
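A hint at the answer, computed with date itself (the limit of a signed 32-bit counter is 2^31 − 1 seconds):

```shell
# Largest signed 32-bit value, interpreted as seconds since Epoch (UTC)
date -u -d "@$(( (1 << 31) - 1 ))" +"%F %T"   # prints: 2038-01-19 03:14:07
```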

If you’re using a Mac, the date command is not the same as GNU’s and doesn’t allow quite the same options as the one used here. For example, it is impossible to change the representation of all the dates contained in a file. It is possible, however, to change the representation of a given date with the -r option, as shown below.

$> date -r 1234567890 +"%F %R"
2009-02-14 00:31
$>

Measure command execution time

The time command lets you find out the execution time of another command.

It is used as a prefix to the command whose execution time you wish to measure:

$> time grep "^[0-9]\+,.*" 20200423-062902-metrics-daily.csv

real    0m32,973s
user    0m36,260s
sys     0m1,216s
$>
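Note that for a pipeline, what gets measured depends on how time is invoked: the shell’s time keyword covers the whole pipeline, whereas the external /usr/bin/time utility only covers the command it directly prefixes. One portable way to time a whole pipeline is to wrap it in a subshell (the sample file here is fabricated):

```shell
printf '1,a\n2,b\nx,c\n' > timing-sample.csv   # throwaway input
# wrap the pipeline so the whole thing is measured, not just grep
time sh -c 'grep "^[0-9]" timing-sample.csv | wc -l'
```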

Here’s an extract from the time(1) man page, which explains the meaning of each value. A thorough reading of this manual page is important for the correct measurement of compound commands (pipes).

The time utility shall invoke the utility named by the utility operand with arguments supplied as the argument operands and write a message to standard error that lists timing statistics for the utility. The message shall include the following information:

  • The elapsed (real) time between invocation of utility and its termination.

  • The User CPU time, equivalent to the sum of the tms_utime and tms_cutime fields returned by the times() function defined in the System Interfaces volume of POSIX.1‐2008 for the process in which utility is executed.

  • The System CPU time, equivalent to the sum of the tms_stime and tms_cstime fields returned by the times() function for the process in which utility is executed.

— time(1)
POSIX Programmer's Manual