Some thoughts about SBOM

A complete overview of software dependencies is a basic requirement for software projects nowadays. So I was wondering what we can achieve with on-board utilities on Linux when it comes to the software bill of materials (SBOM).

If you haven’t come into contact with SBOMs yet, you will. Since the log4j incident, in-depth knowledge of the software supply chain has become inevitable. Besides the obvious security reasons, software license management motivates us to look into this topic. Both open source and proprietary software projects can be affected by including code under incompatible licenses. Commercial and complex frameworks exist, but I’m interested in what can be done with on-board tools. This would allow us to do some basic SBOM analysis as a standard step for FOSS or small proprietary projects with a limited budget.

The rest of the article uses Debian 11 as its base. The idea is to get this into a CI flow, for example on a GitLab instance.

TL;DR

Analyse a binary

There are two simple ways to analyse a binary and get the required shared object files: ldd and objdump.

ldd

Security note: As the ldd manpage states, do not use ldd on unknown binaries!
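For binaries you do not trust, objdump, the second tool mentioned above, only reads the ELF headers and does not run anything. Here is a minimal sketch that pulls the directly required shared objects out of the output of objdump -p. The name needed_libs is a hypothetical helper, and unlike ldd this lists only direct dependencies by soname, without resolved paths.

```shell
# Sketch: list the NEEDED entries from "objdump -p" output on stdin.
# needed_libs is a hypothetical helper name, not an existing tool.
needed_libs() {
    # The dynamic section contains lines such as "  NEEDED   libc.so.6".
    awk '$1 == "NEEDED" { print $2 }'
}

# Usage:
#   objdump -p ./rklogger | needed_libs
```

The resulting sonames can be passed to apt-file in the same way as the paths reported by ldd.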
Let us have a look at another project I wrote some time ago, rklogger:

ldd rklogger
  linux-vdso.so.1 (0x00007fff0ddfe000)
  libboost_program_options.so.1.74.0 => /lib/x86_64-linux-gnu/libboost_program_options.so.1.74.0 (0x00007f8566b23000)
  libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8566956000)
  libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f856693c000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8566767000)
  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8566745000)
  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8566601000)
  /lib64/ld-linux-x86-64.so.2 (0x00007f8566c00000)

An even nicer, hierarchical output can be produced with lddtree from the pax-utils package:

rklogger => ./rklogger (interpreter => /lib64/ld-linux-x86-64.so.2)
    libboost_program_options.so.1.74.0 => /lib/x86_64-linux-gnu/libboost_program_options.so.1.74.0
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
            ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6

This gives us a list of files, so we can easily use apt-file to actually get the packages:

for i in $(ldd rklogger | sed -e 's/.*=> *//' -e 's/ (.*)$//'); do apt-file -l -a amd64 search "$i"; done | sort -u > packages.txt

We use sed to clean up the output and sort -u to remove duplicates.
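The clean-up step can also be written as a small filter that keeps only real file paths. This is a sketch under the assumption that the ldd output has the shape shown above; ldd_paths is a hypothetical helper name.

```shell
# Sketch: print only the file paths contained in ldd output on stdin.
# ldd_paths is a hypothetical helper name, not part of any package.
ldd_paths() {
    # "name => /path (0xaddr)" lines yield the resolved path.
    # A bare "/path (0xaddr)" line (the interpreter) yields its path.
    # linux-vdso and "name => not found" entries are skipped.
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\// { print $1 }'
}

# Usage (only on binaries you trust, see the security note):
#   ldd rklogger | ldd_paths | sort -u
```

Skipping unresolved entries avoids handing the words "not found" to apt-file. For rklogger, the apt-file loop writes the following to packages.txt: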

libboost-program-options1.74.0
libc6
libc6-amd64-cross
libc6-amd64-i386-cross
libc6-amd64-x32-cross
libgcc-s1
libstdc++6

This already looks pretty good, although we get some additional hits. The next step is to check the Debian copyright files.

for i in $(cat packages.txt); do grep "License:" "/usr/share/doc/$i/copyright"; done | sort -u

This gives us quite a long list.

License: Apache-2.0
License: Apache-2.0 and BSL-1.0
License: BSD2
License: BSD2 and BSL-1.0
License: BSD3_DEShaw
License: BSD3_DEShaw and BSL-1.0
License: BSD3_Google
License: BSL-1.0
License: BSL-1.0 and CrystalClear
License: BSL-1.0 and HP and SGI
License: BSL-1.0 and Jam
License: BSL-1.0 and Kempf
License: BSL-1.0 and MIT
License: BSL-1.0 and NIST
License: BSL-1.0 and OldBoost1
License: BSL-1.0 and OldBoost2
License: BSL-1.0 and Python
License: BSL-1.0 and SGI
License: BSL-1.0 and Zlib
License: Caramel
License: CrystalClear
License: HP
License: Jam
License: Kempf
License: MIT
License: NIST
License: OldBoost1
License: OldBoost2
License: OldBoost3
License: Python
License: SGI
License: Spencer
License: Zlib
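Many of these entries are combinations of the same few licenses. Here is a minimal sketch for collapsing the list into individual license names, assuming combinations are always joined by a plain " and " as in the output above. Other joiners such as "or" are not handled, and split_licenses is a hypothetical helper name.

```shell
# Sketch: reduce "License: A and B" lines on stdin to unique license names.
# split_licenses is a hypothetical helper name.
split_licenses() {
    sed -n 's/^License: *//p' |
        awk '{ n = split($0, parts, / and /)
               for (i = 1; i <= n; i++) print parts[i] }' |
        LC_ALL=C sort -u
}

# Usage:
#   for i in $(cat packages.txt); do cat "/usr/share/doc/$i/copyright"; done | split_licenses
```

This only shortens the overview. Which licenses apply together to which files is lost in the process.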

Technically one should parse the copyright files, since they contain restrictions that apply to individual files inside the package. Unfortunately, it looks as if they only cover the source code. If a package contains multiple object files, one can only guess which source files went into each of them.
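Such a parse could start from the machine-readable Debian copyright format, where each stanza pairs a Files: field with a License: field. This is a rough sketch, assuming the file follows that format and that Files: comes before License: within a stanza. The name files_to_license is a hypothetical helper.

```shell
# Sketch: print "license<TAB>file patterns" for every Files stanza of a
# machine-readable Debian copyright file read from stdin.
# files_to_license is a hypothetical helper name.
files_to_license() {
    awk '
        /^Files:/   { sub(/^Files:[ \t]*/, ""); files = $0; field = "files"; next }
        /^License:/ { sub(/^License:[ \t]*/, "")
                      if (files != "") print $0 "\t" files
                      field = "license"; next }
        # Continuation lines of a Files field start with whitespace.
        /^[ \t]+[^ \t]/ && field == "files" {
                      sub(/^[ \t]+/, "")
                      files = (files == "" ? $0 : files " " $0); next }
        # Any other field (for example Copyright) ends the Files field.
        /^[^ \t]/   { field = "other" }
        # An empty line ends the stanza.
        /^[ \t]*$/  { files = ""; field = "" }
    '
}

# Usage:
#   files_to_license < /usr/share/doc/libstdc++6/copyright
```

Stand-alone License: stanzas, which only carry the license text, are skipped because they have no Files: field.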

Analyse source code

To analyse the source code, we need to get the includes. One could just grep for them, but that might miss some information if alternative include directories are used or if some headers are only included depending on preprocessor conditions. One source for this information is the depend.make file that CMake generates for each target, which is used with licensecheck later in this section.
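The compiler itself can report the headers it actually opened. Here is a minimal sketch, assuming GCC is the compiler: gcc -M prints a make-style rule listing the source file and every header it pulls in, and -MM does the same without system headers. The hypothetical helper make_rule_deps turns such a rule into one path per line. Paths containing spaces are not handled.

```shell
# Sketch: turn a make-style dependency rule on stdin (for example the
# output of "gcc -M file.c") into one path per line.
# make_rule_deps is a hypothetical helper name.
make_rule_deps() {
    sed -e 's/\\$//' |               # drop line-continuation backslashes
        tr -s ' \t' '\n\n' |         # one word per line
        grep -v -e ':$' -e '^$'      # drop the "target:" word and empty lines
}

# Usage (the include directory and source path are placeholders):
#   gcc -M -I include src/main.cpp | make_rule_deps
```

The first path in the result is the source file itself, followed by the headers.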

The output could again be handled with apt-file and the copyright files, but one alternative is licensecheck. For example, for the object file XXXXX:

for i in $(sed -e 's/.*: //' CMakeFiles/XXXXX.dir/depend.make | tail -n +4); do licensecheck "$i"; done > liccheck.txt
sed -e 's/.*: //' liccheck.txt | sort | uniq -c

This is not the fastest tool, but it looks through all included source files and scans them for a license. This is what it looks like for the eigen benchmark in my benchmark repository:

     11 Boehm GC License Mozilla Public License 2.0
     19 BSD 3-clause "New" or "Revised" License
      1 GNU Lesser General Public License (modified-code-notice clause) GNU Lesser General Public License v2.1 or later
      1 GNU Lesser General Public License v2.1 or later
    227 Mozilla Public License 2.0
     14 *No copyright* Mozilla Public License 2.0
     11 *No copyright* UNKNOWN

Here uniq gives us an overview of how many included headers fall under which license.
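For the CI flow mentioned at the beginning, the same file can serve as a simple gate. This is a sketch, assuming liccheck.txt has one "path: license" line per file as produced above. The name flag_licenses and the example pattern are hypothetical and need to be adapted to the project's own policy.

```shell
# Sketch: print every "path: license" line on stdin whose license part
# matches a pattern, and return non-zero if there was at least one.
# flag_licenses and the default pattern are hypothetical.
flag_licenses() {
    # $1: extended regular expression of licenses that need a manual review
    awk -v pat="${1:-UNKNOWN}" '
        { lic = $0; sub(/^.*: /, "", lic) }   # same cut as the sed call above
        lic ~ pat { print; bad = 1 }
        END { exit bad + 0 }
    '
}

# Usage in a CI job script:
#   flag_licenses 'UNKNOWN|GNU' < liccheck.txt
```

A non-zero exit status of a script command fails a GitLab CI job, so the offending headers show up in the job log.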

To be continued…

Further work

Next steps would be: