First time I've heard of the disjoint-set data structure. For a DSF implementation, Wikipedia [0] gives the wildest asymptotic complexity I've seen, using the inverse Ackermann function! "Almost constant time" indeed.
But you have to run it on a sequence, or perhaps even more accurately, something that is putatively a sequence but may have an infinite loop in it. Tortoise & hare requires an unambiguous "next" function, which a graph does not have. Modifying it to not require that so it can run on a graph may be possible but that would constitute a different algorithm for sure, with different complexity analysis, etc.
It is useful for determining whether a given traversal of a graph has encountered a cycle, rather than determining whether the entire graph has any cycles. So different from what the article was about, but still a nice tool to have in the box.
The article isn’t interested in the “online” aspect, but in finding all (what turns out to be called) non-bridge edges, in order to visualize the loops.
The problem of finding cycles in (pseudo randomly generated) graphs is actually one of the first examples of an asymmetric Proof-of-Work puzzle [1]. Asymmetry means that while finding a cycle takes a lot of memory and effort, a solution can be verified instantly.
Several of the suggested algorithms were in fact implemented in the various Cuckoo Cycle solvers [2]. But by maintaining directed paths, my union-find based algorithm did identify the entire cycle and not just the cycle completing edge [1] (implemented in [3]).
This algorithm suffers from three problems though. One, it only finds one among a set of interconnected cycles. Second, it's rather slow, being latency bound by the random memory accesses. Third, it uses much more memory than needed.
So repeatedly trimming edges with only one endpoint is a much faster approach. And for the random graphs that the PoW generates, it can be done using only 1 bit per edge to record if it's been trimmed.
Once sufficiently trimmed, a loop tracing algorithm (like [4] for the Cuckatoo Cycle variant) can find all remaining (simple) cycles.
I initially took issue with "first asymmetric proof of work", thinking that in practice any NP-complete problem would surely be fine, even if we can’t prove that it can’t be solved quickly. But is it difficult to find NP-complete problems which can't also be quickly approximated? And I suppose it might also be hard to find NP-complete problems in a "Goldilocks zone" of difficulty to make for practical proofs of work, since their complexity increases so sharply.
For Proof-of-Work applications, at least in blockchains, you need puzzles that
* can be pseudo randomly generated from a seed (usually the hash of a partial blockchain header)
* have a known optimal solution method with a (nearly) constant running time that's at most about one second (for progress freeness this should be a small fraction of the target block time)
NP-hard problems rarely satisfy both these constraints.
Finding fixed length cycles in random (bipartite) graphs with billions of edges does.
Approximation doesn't help, since you can require optimal solutions, but yes, it's very difficult to come up with hard instances. After all, you need to know the correct answer, so you can't just randomly generate instances. In addition, SAT solvers are ridiculously good these days.
After trimming, you can also branch on edges: either it is in a cycle, in which you find one, otherwise you can remove it and further reduce the instance by trimming more edges.
I had to tackle a problem much like this one when analysing the polygons in planar graphs (analysis of planar chemical networks).
A fun approach I found was to use a Delaunay triangulation (the dual of a Voronoi diagram; draw the Voronoi polygons and join their centres if they share an edge). Then join the triangles into larger polygons by removing edges that are in the Delaunay triangulation but not in your original planar graph. Once you've got no more edges to remove you've got all the polygons in the graph and convenient triangles for rendering / analysis.
For planar graphs on the surface of a torus, like they faced here, I just ended up adding extra images of the central unit cell, and then removing the excess. Not elegant (lots of weird corner cases) but convenient for visualisation.
Unfortunately, as noted in your link, the SSSR solution is not unique. Also unfortunate, Daylight (the inventor of SMILES) added SSSR as an optional atom identifier in their SMARTS query specification without realizing this. Because OpenEye doesn't like SSSR, they actually leave out this feature in their implement of SMARTS, breaking from standard.
This looks like a dynamic programming problem taking space proportional to |ring length|*|key length|. By that analysis, graphs don't really come into it.
What's the graph-theoretical approach? Does it have nicer characteristics?
Funny I've had to do this occasionally and I always do the following: for each edge E, attempt to construct a clockwise loop from that edge by traversing each vertex and picking the edge that is "most clockwise" compared to the incoming edge. Converse for counter clockwise. This gives you, for each edge, the two loops the edge is a part of (if they exist).
It sounds horribly slow, but isn't because the loop finding process for one edge can mark all edges it traverses as either being on the same loop, or not having a loop, for the orientation its checking. These edges can be skipped henceforth.
This does not find all loops, only the smallest disjoint loops (as every edge is at most part of two disjoint loops). But basically results in a similar planar "face" partition the author describes.
Its not clear to me whether my approach does not work (perhaps the angle compares fail on a torus? I've never had to deal with torii), or simply isn't suited because it misses some loops, but I thought I'd share it anyway :-)
Notwithstanding the limitation of assuming the graph is embedded in a coordinate space where you can try to measure angles, I think this algorithm still doesn't work.
I may be missing a detail, but it seems to fail on a graph consisting of a loop with some extra edges branching off the loop toward its interior and some edges branching off toward the exterior. The left hand walk hits a dead end in one branch, and the right hand walk dead ends in the other, no matter where on the loop you start (except for some special cases).
Hmm I'm having a hard time visualizing your counter example, I'll try to think of it some more. An important detail I left out though is that in my problem sets edges do not cross, which I suppose simplifies the entire problem a lot :-)
Aha! Yes to fix these cases a bit of bookkeeping is needed: a protruding edge like that would force the "walking" to back up, because its a dead end indeed. However, it also immediately means such an edge is not part of any loops, can be marked as such, and ignored upon retracing the steps.
It's an edge case that should be handled, but does not prevent the general algo from working.
Ah yeah, if you're not trying to identify all loops but just check if some loops exist then that works. I think you can dispense with the "walking around taking clockwise/anticlockwise edges" part though. If you repeatedly do the "remove all nodes with only 1 edge going to them" step then you've already identified whether or not at least one loop exists.
Repeatedly removing those edges would help indeed! I am interested in finding the loops, but in the example provided the "dead end" edges do not contribute to the loop. Officially a loop is not allowed to contain the same vertex twice (besides start/end), otherwise any sequence of vertices could be considered a loop just by repeating them in reverse order.
However, indeed it remains true that it does not find all loops, only the smallest set of loops that cover all edges.
one of the more interesting reads ive had on here so firstly thanks for that! love all the explanations of the intermediate solutions. ive been working lately with graphs to deduce network topologies and finding loops is one of my problems. excited to learn about the final answer, and happy to see it involves spanning tree :') kind of saves me reading through how that protocol works (or should work) so i can get on with implementing it! saving me likely all sorts of wrong solutions <3 thanks!
Same. It might also be pertinent to the radicle protocol, which was posted a few days ago and which I'm still slogging through. This has led to one of the most satisfying wikipedia dives in a long time, and I haven't even finished the original article. The wikipedia entry for DFS is very well written. Intuitively I would have structured find via path splitting, so I'm glad that the article covered all three find variations. Then that led to MSTs, and Kruskal's algorithm, then radix search (still reading). This thread is the first time hearing about CLR, so I'm definitely going to add that to my reading list, alongside Purely Functional Data Structures.
I'm curious: If I were to share the same link again now, I'll be forwarded to the latest submission instead. Interestingly, this link was shared twice in the past few days (click the "past" link above - and, yes, one of the submitters was I). How, then, was it that neither of the recent submitters, including myself, were forwarded to the latest at the time, but instead new submissions were created?
Tarjan's SCC is elegant with a heavily imperative flavour; has anyone come up with an elegantly pure (yet still linear) functional variation?
EDIT: looks like Okasaki has a linear-time functional; I'm pragmatic enough that 2 passes are (even small constant passes would be) fine for me — my graphs are all tiny enough that I'm not about to store them on tape.
I'm not too deep into CS theory so I'm not sure if this is an answer to your question but... SCC algorithms are useful when building routeable road networks from source data (like OSM) in order to filter out isolated sub-networks (like an airfield). We use dotnet but have something that looks possibly similar to what you wrote(?):
The methods ReachableVertices and ReverseReachableVertices are effectively depth-first searches on the graph.
The challenge - in terms of performance - is actually not finding a single SCC but returning the full set of potentially thousands SCCs in large networks. I originally implemented an iterative (imperative) version of Gabow's "Path-Based" algorithm but it was too slow on continent-sized datasets. We did some experimentation with a parallel variant of Tarjan which didn't improve things much.
Eventually I settled for something "good enough" - randomly pick a vertex, calculate the SCC with the above code, return it if it's large enough (>1k vertices) otherwise filter it out, keep going until I've accounted for >90% of the graph. The remaining <10% gets filtered out. That's basically imperative (using a while loop) but could probably be functionalised.
EDIT I've taken a better look at the linked paper and what I wrote is definitely not an answer to your question. But I'll leave it in case anyone's interested.
If the graph is directed, do an SCC decomposition in linear time using a graph library and then any SCC with size more than one has at least a loop, trivially extracted by following any edges in the SCC.
If the graph is undirected, compute a spanning tree in linear time using a graph library and then any edge outside the tree forms a loop, again trivially extractable from the edge and the paths from its endpoints to their least common ancestor in the tree.
Since those algorithms are already asymptotically optimal, it's just an engineering problem of finding the solution with the best constant factor given the precise data structures and goal in question.
The most obvious way is to just run a DFS, if you ever come back to a node you’ve seen then you have a cycle (and by storing parent pointers you can get the cycle back). That’s like CS 101 level.
However in the article it’s actually a graph being modified. If you don’t want to run a full dfs every time, there are better ways. This problem is called “online cycle detection” and you can google some other algorithms.
You obviously didn’t read the article. It’s about undirected graphs, and about visualizing the loops, not just testing for their existence. The final solution is indeed based on spanning trees.
[0]:https://en.wikipedia.org/wiki/Disjoint-set_data_structure
https://en.wikipedia.org/wiki/Kruskal%27s_algorithm
[0] https://en.wikipedia.org/wiki/Cycle_detection#Floyd's_tortoi...
I always thought it's a linked list algo
Several of the suggested algorithms were in fact implemented in the various Cuckoo Cycle solvers [2]. But by maintaining directed paths, my union-find based algorithm did identify the entire cycle and not just the cycle completing edge [1] (implemented in [3]).
This algorithm suffers from three problems though. One, it only finds one among a set of interconnected cycles. Second, it's rather slow, being latency bound by the random memory accesses. Third, it uses much more memory than needed.
So repeatedly trimming edges with only one endpoint is a much faster approach. And for the random graphs that the PoW generates, it can be done using only 1 bit per edge to record if it's been trimmed.
Once sufficiently trimmed, a loop tracing algorithm (like [4] for the Cuckatoo Cycle variant) can find all remaining (simple) cycles.
[1] https://www.semanticscholar.org/paper/Cuckoo-Cycle%3A-A-Memo...
[2] https://github.com/tromp/cuckoo
[3] https://github.com/tromp/cuckoo/blob/master/src/cuckoo/cycle...
[4] https://github.com/tromp/cuckoo/blob/master/src/cuckatoo/gra...
* can be pseudo randomly generated from a seed (usually the hash of a partial blockchain header)
* have a known optimal solution method with a (nearly) constant running time that's at most about one second (for progress freeness this should be a small fraction of the target block time)
NP-hard problems rarely satisfy both these constraints. Finding fixed length cycles in random (bipartite) graphs with billions of edges does.
For planar graphs on the surface of a torus, like they faced here, I just ended up adding extra images of the central unit cell, and then removing the excess. Not elegant (lots of weird corner cases) but convenient for visualisation.
https://depth-first.com/articles/2020/08/31/a-smallest-set-o...
https://daylight.com/dayhtml/doc/theory/theory.smarts.html
BTW, there is a very clever algorithm for finding SSSR
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2765087/
What's the graph-theoretical approach? Does it have nicer characteristics?
It sounds horribly slow, but isn't because the loop finding process for one edge can mark all edges it traverses as either being on the same loop, or not having a loop, for the orientation its checking. These edges can be skipped henceforth.
This does not find all loops, only the smallest disjoint loops (as every edge is at most part of two disjoint loops). But basically results in a similar planar "face" partition the author describes.
Its not clear to me whether my approach does not work (perhaps the angle compares fail on a torus? I've never had to deal with torii), or simply isn't suited because it misses some loops, but I thought I'd share it anyway :-)
I may be missing a detail, but it seems to fail on a graph consisting of a loop with some extra edges branching off the loop toward its interior and some edges branching off toward the exterior. The left hand walk hits a dead end in one branch, and the right hand walk dead ends in the other, no matter where on the loop you start (except for some special cases).
Edit: strengthening the proposed counterexample
https://imgur.com/a/ihPrwrX
I think this is roughly the counterexample the person above was suggesting.
It's an edge case that should be handled, but does not prevent the general algo from working.
However, indeed it remains true that it does not find all loops, only the smallest set of loops that cover all edges.
you'd get the same pitfall of torus topology with 2 perpendicular loops
If you don‘t need to be efficient, any problem is relatively easy to solve.
> When in doubt, use brute force. —KLT
---
1: https://www.chiark.greenend.org.uk/~sgtatham/putty/
EDIT: looks like Okasaki has a linear-time functional; I'm pragmatic enough that 2 passes are (even small constant passes would be) fine for me — my graphs are all tiny enough that I'm not about to store them on tape.
EDIT2: my bad, not Okasaki himself, King & Launchbury: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
(but their sol'n requires laziness to avoid generating an infinite intermediate; has anyone done it linearly, functionally, and eagerly?)The challenge - in terms of performance - is actually not finding a single SCC but returning the full set of potentially thousands SCCs in large networks. I originally implemented an iterative (imperative) version of Gabow's "Path-Based" algorithm but it was too slow on continent-sized datasets. We did some experimentation with a parallel variant of Tarjan which didn't improve things much.
Eventually I settled for something "good enough" - randomly pick a vertex, calculate the SCC with the above code, return it if it's large enough (>1k vertices) otherwise filter it out, keep going until I've accounted for >90% of the graph. The remaining <10% gets filtered out. That's basically imperative (using a while loop) but could probably be functionalised.
-----------------------------------------------------
EDIT I've taken a better look at the linked paper and what I wrote is definitely not an answer to your question. But I'll leave it in case anyone's interested.
If the graph is directed, do an SCC decomposition in linear time using a graph library and then any SCC with size more than one has at least a loop, trivially extracted by following any edges in the SCC.
If the graph is undirected, compute a spanning tree in linear time using a graph library and then any edge outside the tree forms a loop, again trivially extractable from the edge and the paths from its endpoints to their least common ancestor in the tree.
Since those algorithms are already asymptotically optimal, it's just an engineering problem of finding the solution with the best constant factor given the precise data structures and goal in question.
However in the article it’s actually a graph being modified. If you don’t want to run a full dfs every time, there are better ways. This problem is called “online cycle detection” and you can google some other algorithms.
> Spews out computer science jargon that 99% of the world's population wouldn't even recognise, let alone understand.
Hmm.