Simple Graph Cleaner

In this tutorial we develop a graph cleaning utility that takes in a graph file and “cleans” it to be a simple graph. That is, it will symmetrize the graph, remove self loops and multiple edges, and output an edge list with only one direction (the “upper” triangle).

Building the Tool

For this example, we will not evaluate whether the input is already symmetric or not. Instead, we will perform an expensive approach of always symmetrizing it. We start by loading the input into a COO. We then use template parameter to indicate we want to symmetrize it and remove self loops.

COO<uint64_t, uint64_t, shared_ptr<uint64_t>,
        true, true, true> coo { argv[1] };

If we want the remove self loops and multi-edges from a directed graph, but not symmetrize it, we would replace the first two true values with false (template parameters 4 and 5 would become false).

Once we have this COO, we can convert it into a CSR. We will use the CSR to then detect and remove redundant, multiple edges.

CSR<uint64_t, uint64_t,
        shared_ptr<uint64_t>, shared_ptr<uint64_t>> csr { coo };

Next, we need to build a new CSR that does not have redundant edges. Here we use a provided method, new_csr_without_dups, that generates a new, clean CSR. Any new template parameters can be passed to this function for the resulting CSR to take.

auto clean = csr.new_csr_without_dups();

As a last step, we convert back to a new COO and write it out as an ascii file using write.

The full program is as follows:

#include <iostream>
#include <memory>
#include <string>
using namespace std;

#include "pigo.hpp"
using namespace pigo;

// Setup a simple timer
#include <omp.h>
double g_t_;
inline void tlog_init() { g_t_ = omp_get_wtime(); }
inline void tlog(string msg) {
    cerr << omp_get_wtime()-g_t_ << " " << msg << endl;
    tlog_init();
}

int main(int argc, char** argv) {
    // Open and load the file
    if (argc != 3) {
        cerr << "Usage: " << argv[0] << " in-file out-file" << endl;
        return 1;
    }

    tlog_init();

    COO<uint64_t, uint64_t, shared_ptr<uint64_t>,
            true, true, true> coo { argv[1] };
    tlog("read input into coo");

    CSR<uint64_t, uint64_t,
            shared_ptr<uint64_t>, shared_ptr<uint64_t>> csr { coo };
    tlog("built csr");

    coo.free();
    tlog("free read input coo");

    auto clean = csr.new_csr_without_dups();
    tlog("built clean csr");

    csr.free();
    tlog("free original csr");

    COO<uint64_t, uint64_t, shared_ptr<uint64_t>> out_coo { clean };
    tlog("built output coo");

    clean.free();
    tlog("free clean csr");

    out_coo.write(argv[2]);
    tlog("wrote output file");

    out_coo.free();
    tlog("free out coo");

    return 0;
}

Building

After ensuring pigo.hpp is in the same directory, this program can be compiled on Linux with g++ -std=c++11 -O3 -o clean_graph clean_graph.cpp -fopenmp and on Mac, after brew install libomp, with clang++ -std=c++11 -O3 -o clean_graph clean_graph.cpp -Xclang -fopenmp /usr/local/lib/libomp.dylib.

Evaluation

Suppose that you have a messy input file, e.g., the following file:

5 6
4 4
4 4
3 4
6 5
1 2
1 1
1 2
1 2
2 1
1 0
0 2
0 1

Note that this file contains self loops and multiple edges, some of which are present if the file is directed, and others are only present when it is undirected (e.g., 5 6 and 6 5).

After running ./clean_graph messy.el clean.el, the following file is saved:

0 1
0 2
1 2
3 4
5 6