Reducing redundancy in a collection of text files.

Wed Dec 28 01:00:40 UTC 2022

Okay, I have two related issues, one regarding comparing text files
and one regarding the contents of a single text file, and in both
cases, I'm mostly working with transcripts of conversations I had with
an AI language model that I'm trying to clean up.

For the first issue, mostly caused by sometimes saving a transcript at
a dozen points in the conversation, let's say we have two versions of
a file, A and B.

Ideally, B contains everything contained in A plus some extra content
not found in A. Since A has no unique content, it can be deleted
safely.

By extension, ideally, if I have a dozen versions of a given file, the
above would hold for every link in the chain, and I could just run wc
on the files and delete all but the longest one.

Problem is, I can't be sure A doesn't have contents not found in B,
and on top of that, the file names aren't always descriptive, so it
isn't obvious when I should even try comparing the contents of two
files.

I suspect diff has an option or set of options to detect when one or
both of a pair of files have unique contents, but diff's lack of batch
processing would make using it a bit of a pain, even if I only ran it
on the file pairs I already know to be similar.

Is there either a utility that will compare every pair of files in a
directory, looking for contents found in one but not the other and
deleting files with no unique content, or a way to have a bash script
loop through a directory with diff to do something similar?

Does something like

for file1 in *.txt file2 in *.txt; do
diff $file1 $file2
done

or nesting for loops of this sort even work in bash? I honestly don't
know, as I don't think I've ever written a script that had to loop
through a Cartesian product of input files instead of a single set.
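
For concreteness: the single-line, two-variable form above is not
valid bash syntax, but nesting one for loop inside another is. Below
is a minimal sketch of the pairwise comparison, not a definitive
tool. The function names is_subset and report_redundant are
hypothetical, and the sketch assumes that "no unique content" means
"every line of the first file appears verbatim somewhere in the
second", ignoring line order. Files that wrap the same paragraph
differently would therefore count as different. It only reports
candidates and deletes nothing.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit 0) if every line of file $1
# also appears, verbatim, somewhere in file $2.
is_subset() (
    awk '
        FILENAME == ARGV[1] { seen[$0]; next }   # remember lines of $2
        !($0 in seen)       { found = 1; exit }  # a line of $1 missing from $2
        END                 { exit found }
    ' "$2" "$1"
)

# Nested loops over every ordered pair of .txt files in directory $1.
report_redundant() (
    dir=$1
    for a in "$dir"/*.txt; do
        for b in "$dir"/*.txt; do
            # Skip comparing a file with itself.
            [ "$a" = "$b" ] && continue
            if is_subset "$a" "$b"; then
                printf '%s has no lines missing from %s\n' "$a" "$b"
            fi
        done
    done
)
```

Calling report_redundant . would list the candidates in the current
directory. Two files with the same set of lines are each reported as
redundant relative to the other, so only one of such a pair should be
deleted.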

The other issue is that the AI language model in question likes
repeating itself. I might get a dozen responses that are half new
and half quoting part of the previous response, leading to a dozen
copies of some paragraphs.

I know the uniq command can find and remove duplicate lines in a file,
but it only works if the duplicates are adjacent, and sorting the file
to make the duplicates adjacent would destroy any semblance of the
files having an order. Plus, I'm more interested in finding duplicates
at the paragraph level, not the line level, and while some of the
files only have line breaks at the end of each paragraph, others have
line breaks mid-paragraph.

Also, it would be nice if, instead of just deleting the duplicate
paragraphs, the tool I use to track them down replaced each duplicate
with a marker giving the starting line number of the original and the
first 40 or so characters of the paragraph. That would make it easier
to either move the paragraph to one of its later occurrences or
decide to keep some of the duplicates for one reason or another.
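
For concreteness, here is a minimal sketch of that marker behaviour
using awk, which can remember earlier paragraphs without sorting.
The function name mark_duplicate_paragraphs and the marker wording
are hypothetical. It assumes paragraphs are separated by blank lines
and treats two paragraphs as duplicates when they match after runs
of whitespace are collapsed, so mid-paragraph line breaks do not
matter. The line numbers in the markers refer to the input file, not
to the shortened output.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print file $1 with every repeated paragraph
# replaced by a one-line marker pointing at its first occurrence.
mark_duplicate_paragraphs() (
    awk '
    function flush(   key) {
        if (para == "") return
        key = para
        gsub(/[ \t\n]+/, " ", key)           # collapse whitespace and wrapping
        sub(/^ /, "", key); sub(/ $/, "", key)
        if (key in first)
            printf "[duplicate of paragraph at line %d: %s]\n",
                   first[key], substr(key, 1, 40)
        else {
            first[key] = start               # input line where it first began
            printf "%s", para
        }
        para = ""
    }
    /^[ \t]*$/ { flush(); print; next }      # a blank line ends a paragraph
    { if (para == "") start = NR; para = para $0 "\n" }
    END { flush() }
    ' "$1"
)
```

Something like mark_duplicate_paragraphs transcript.txt > cleaned.txt
(both file names are placeholders) leaves the original untouched, so
the markers can be reviewed before anything is thrown away.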

Does anyone know of any tools for locating repeated content in a file
without the limitations of uniq?

And for either issue, I would prefer a command line solution.