Compare files ignoring a field or column using Process Substitution

Let's say the data contains multiple fields/columns separated by a space, a comma or some other delimiter, and we want to compare two files while ignoring a specific column. Let's divide the work into two smaller problems. The first is to ignore the given field/column.

If we simply want to ignore the first column, we can use one of the following cut constructs (the --complement option is specific to GNU cut).

cut -d',' -f 1 --complement datafile
cut -d',' -f 2- fileName.csv

If we want to ignore a specific column, we can use an awk script (ignoreField.awk) that takes the column number in the FieldToIgnore variable. This is much more general because you can specify which column to ignore, be it the first, the third or the last.

It can be used as follows.

awk -F',' -v FieldToIgnore=3 -f ignoreField.awk datafile
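
The ignoreField.awk script itself is not reproduced here, so the following is only a minimal sketch of what such a filter could look like, assuming it simply rebuilds each line from every field except the one named by FieldToIgnore. It is wrapped in a hypothetical shell function, ignore_field, so that it can be pasted into a terminal; neither the function name nor the awk body comes from the original script.

```shell
# Hypothetical stand-in for ignoreField.awk (an assumption, not the original script).
# Usage: ignore_field DELIMITER COLUMN < input
ignore_field() {
    awk -F"$1" -v OFS="$1" -v FieldToIgnore="$2" '
    {
        out = ""
        kept = 0
        for (i = 1; i <= NF; i++) {
            if (i == FieldToIgnore) continue      # skip the unwanted column
            out = out (kept++ ? OFS : "") $i      # re-join the rest with the delimiter
        }
        print out
    }'
}
```

For example, ignore_field , 3 < datafile would print datafile without its third column.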

The next part is to diff the output after ignoring (read: removing) the column. That is where process substitution comes in handy. Here are two examples.

# ignore 1st column from two csv datafiles while comparing
diff -u <(cut -d, -f 2- datafile1) <(cut -d, -f 2- datafile2)
# ignore column 3 from two csv datafiles while comparing
diff -u <(awk -F',' -v FieldToIgnore=3 -f ignoreField.awk datafile1) <(awk -F',' -v FieldToIgnore=3 -f ignoreField.awk datafile2)

So instead of giving diff two real files, we give it the output of two commands, which bash presents to diff as file names. The same approach can be used to pre-process files in other ways (e.g. ignore comments or empty lines, or compare two unsorted files).
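
As a rough sketch of that kind of pre-processing, the functions below drop comment lines and empty lines and sort what is left before diffing. The names normalize and diff_normalized are made up for this example, "#" is assumed to be the comment character, and bash is required for the <( ) syntax.

```shell
# Strip comment lines (assumed to start with '#') and blank lines, then sort,
# so that only the real content is compared and line order does not matter.
normalize() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | sort
}

# Compare two files after normalizing each one through process substitution.
diff_normalized() {
    diff -u <(normalize "$1") <(normalize "$2")
}
```

diff_normalized datafile1 datafile2 prints nothing and returns 0 when the two files differ only in comments, blank lines or line order.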

See below for more information on Process Substitution.
http://www.tldp.org/LDP/abs/html/process-sub.html
http://wiki.bash-hackers.org/syntax/expansion/proc_subst

postfix : Configure outgoing relay server

Update /etc/postfix/main.cf and set “relayhost” to the name of your outgoing/relaying mailhost. First make sure that the relay server accepts mail from your host.

e.g. if the outgoing relay is mailhost.xyzserver.com, the relayhost section of main.cf should look like the following.

# INTERNET OR INTRANET

# The relayhost parameter specifies the default host to send mail to
# when no entry is matched in the optional transport(5) table. When
# no relayhost is given, mail is routed directly to the destination.
#
# On an intranet, specify the organizational domain name. If your
# internal DNS uses no MX records, specify the name of the intranet
# gateway host instead.
#
# In the case of SMTP, specify a domain, host, host:port, [host]:port,
# [address] or [address]:port; the form [host] turns off MX lookups.
#
# If you're connected via UUCP, see also the default_transport parameter.
#
#relayhost = $mydomain
#relayhost = [gateway.my.domain]
#relayhost = [mailserver.isp.tld]
#relayhost = uucphost
#relayhost = [an.ip.add.ress]
relayhost = mailhost.xyzserver.com

After that, restart postfix.

service postfix restart
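
Because main.cf is full of commented-out relayhost examples, it can be worth double-checking which line is actually active. The helper below is only a sketch: active_relayhost is a made-up name, and it assumes that when several uncommented relayhost lines exist, the last one is the one that counts. On a live system, postconf relayhost should report the value postfix itself sees.

```shell
# Print the value of the last uncommented "relayhost = ..." line.
# $1 is the config file; it defaults to /etc/postfix/main.cf.
active_relayhost() {
    awk '/^relayhost[[:space:]]*=/ {
             value = $0
             sub(/^relayhost[[:space:]]*=[[:space:]]*/, "", value)   # drop "relayhost ="
             sub(/[[:space:]]+$/, "", value)                         # drop trailing blanks
         }
         END { print value }' "${1:-/etc/postfix/main.cf}"
}
```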

 

Find the process monopolizing the CPU without using “top”

Let's say you are on a system where top (or any similar tool) is not available. That may sound unbelievable, but such systems do exist. So how do you find the process eating up the most CPU? The humble ps command provides pcpu, the percentage of CPU used by a process. Here is how.

ps -eo pcpu,pid,ruser,args | sort -k1,1nr | less

This lists processes in descending order of pcpu, so the pid at the top is the one taking up the most CPU, shown with its ruser (real user) and args. The sort has to be numeric (the n flag); a plain text sort would place 9.0 above 10.0. So there you have it.
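
If only the top few entries are wanted, the pipeline can be wrapped in a small filter. This is a sketch under the assumption that the input is ps output with the CPU percentage in the first column; rank_by_cpu is a hypothetical name. It keeps the ps header line on top and prints the N busiest processes below it.

```shell
# Read ps output on stdin, keep the header line, then print the N lines
# with the highest value in column 1 (N defaults to 5).
rank_by_cpu() {
    IFS= read -r header && printf '%s\n' "$header"
    sort -k1,1nr | head -n "${1:-5}"
}
```

It would be used as: ps -eo pcpu,pid,ruser,args | rank_by_cpu 10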

bash : grep for pattern from certain location in the file

The syntax for using grep to search for a pattern in a file is well known. But there are times when one has to grep for the pattern starting from a certain location, or after a certain offset, in the file. For example, suppose we are searching a log file for a pattern that can appear multiple times. Each time we grep, we get all the matching lines from the top to the bottom of the file, and then we have to work out which lines are new since our last run. Using dd, the file can be sliced, and grep can then be applied to that slice.
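
A minimal sketch of the idea, assuming the offset is a byte count remembered from the previous run; grep_from_offset is a hypothetical name. Using bs=1 makes dd skip exactly that many bytes, which is simple but slow on very large files.

```shell
# grep_from_offset FILE OFFSET PATTERN
# Skip the first OFFSET bytes of FILE with dd, then grep only the remainder.
grep_from_offset() {
    dd if="$1" bs=1 skip="$2" 2>/dev/null | grep -e "$3"
}

# After each run, the current size can be stored as the next starting offset:
#   offset=$(($(wc -c < app.log)))
#   ... later ...
#   grep_from_offset app.log "$offset" ERROR
```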

bash : search multiple file patterns using single find command

One reader asked how find can be used to match several file patterns at once. For example, in a directory littered with logs and other files, how do I use a single find command to find all shell scripts, perl scripts and, say, php scripts? The simple answer is to use multiple -name arguments combined with -o (OR) and, if needed, -a (AND). Other find conditions (like -mtime, -type etc.) can be combined as well.
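
As an illustration, the sketch below assumes the scripts can be recognized by the extensions .sh, .pl and .php; find_scripts is a made-up name. The escaped parentheses group the ORed -name tests so that -type f applies to all of them.

```shell
# List regular files under a directory (default: current directory)
# whose names end in .sh, .pl or .php.
find_scripts() {
    find "${1:-.}" -type f \( -name '*.sh' -o -name '*.pl' -o -name '*.php' \)
}
```

Without the parentheses, -type f would bind only to the first -name test, because the implicit AND has higher precedence than -o.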

Howto rollover a file when size exceeds using unix find

Here is a one-liner that uses find to roll a file over to file.old when it exceeds a given size. Let's say we have a script in cron that runs and prints messages to a log file. Over time the log file will grow, and we want to roll it over to log.old. Many solutions work by finding the size and comparing it; here is one elegant solution in a single line. Thanks to my colleague Vlad, who gave me the idea of using find and -exec; I added the brace expansion from my own knowledge base. Bash expands {}{,.old} into the two arguments {} and {}.old before find runs, and GNU find substitutes the file name for {} even inside a longer argument, so each match is renamed to its own name plus .old. Happy sharing of knowledge.

find /var/log/ -name myapp.log -size +1M -exec mv {}{,.old} \;
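
Not every find replaces {} when it is part of a longer argument such as {}.old, and brace expansion is a bash feature. The variant below is a sketch that avoids both by handing the file name to a small sh -c command; the function name rollover and its three arguments are assumptions made for this example.

```shell
# rollover DIR NAME SIZE
# Rename DIR/.../NAME to NAME.old when it matches the find -size test SIZE (e.g. +1M).
rollover() {
    find "$1" -type f -name "$2" -size "$3" -exec sh -c 'mv "$1" "$1.old"' sh {} \;
}
```

For the log file above this would be called as: rollover /var/log myapp.log +1M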


Unix : find affected by current working directory

On many Unix variants, find first looks up the current working directory before proceeding with what it was asked to find. Ubuntu 10.10 and Debian Squeeze are not affected, and I did not check older versions, but Debian 5.0.6 (Lenny) is affected, and the list includes Solaris 10 and Solaris 11 Express.

It is very easy to fall into this pitfall if you have an automated package installation that invokes scripts to start applications at the end of the installation while also cleaning up the temporary directory the package was running from. I wasted a couple of hours going over all my scripts to understand what was going on. The ls command was working, but find was not able to get me the list of files to process from unrelated directories. So I ended up redirecting find’s standard error to standard output and voilà, the solution presented itself. That redirection should have been at the top of my list. It tells you “find : cannot get the current working directory”.

Why does it need that? I don’t know. Linux has had this fixed for some time now, but for some reason SunOS is still using the old find variant, including Solaris 11 Express, which is the latest version out. Maybe there are historical reasons. If anyone knows, please share.

So the solution was this: before invoking a command that will keep running and may need to call find, start it in a directory that will still exist after the package installation is complete, e.g. / or /tmp.
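
In a script, the same idea can be applied to a single find call. This is only a sketch: find_from_root is a hypothetical wrapper, and it assumes every path passed to find is absolute, since relative paths would be resolved against / instead of the caller's directory.

```shell
# Run find from / inside a subshell, so it never depends on a working
# directory that may already have been deleted. Pass absolute paths only.
find_from_root() {
    ( cd / && find "$@" )
}
```

Because the cd happens in a subshell, the caller's own working directory is left untouched.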

bash : Count number of recurrence of lines

Say we have a file or some data with many duplicate rows or entries, and we want to find out how many times each one is repeated, and maybe which one is repeated most often. Here is an elegant script that does that in a single line.

sort input.file | uniq -c | sort -n -r

Explanation:
The first sort orders the records in the file so that identical lines end up next to each other, which uniq requires. Then uniq -c counts how many times each record occurs. Finally, sort -n -r sorts the output of uniq -c numerically in reverse order, giving us the records from the most often repeated to the least.
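
A small worked example, with the pipeline wrapped in a function so it can also read from a pipe; the name count_lines and the sample data are made up for illustration. The exact width of the padding in front of each count depends on the uniq implementation.

```shell
# Count how often each line of stdin occurs, most frequent first.
count_lines() {
    sort | uniq -c | sort -n -r
}

# Sample run: "b" occurs three times, "a" twice, "c" once,
# so the output lists b (3), then a (2), then c (1).
printf 'b\na\nb\nc\nb\na\n' | count_lines
```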