How to Remove Duplicate Lines From a Text List

Duplicate lines in text lists accumulate from copy-pasting, merging sources, and exporting data from systems that don't enforce uniqueness. Removing them before importing, processing, or sending the list prevents downstream errors and false counts.

Where duplicates come from

The most common sources:

  • Merging two lists that overlap (combining two email lists, two keyword lists, two ID exports)
  • Copying data from multiple pages of a paginated report
  • Exporting the same data twice with slightly different date ranges
  • Pulling data from a system that doesn't enforce uniqueness at the storage level
  • Entering data manually and adding the same item twice

Basic deduplication

In a sorted list, duplicates sit next to each other, which makes them easy to spot by eye. In an unsorted list they can be anywhere, which is why automated detection is faster and more reliable than manual review.
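
To illustrate the point about adjacency, here is a minimal sketch that lists every line occurring more than once. It assumes a POSIX shell with the standard sort and uniq utilities; list_duplicates is a hypothetical helper name, not part of any tool.

```shell
# List each line that occurs more than once (one copy per duplicated line).
# uniq only compares adjacent lines, so the input is sorted first.
list_duplicates() {
  sort | uniq -d
}

# Example with inline sample data; prints "apple" and "pear".
printf 'pear\napple\npear\nfig\napple\n' | list_duplicates
```

For a file, the same helper can be called as list_duplicates < input.txt.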

Deduplication can be case-sensitive or case-insensitive. "Alice" and "alice" are different strings but often represent the same record. Whether to treat them as duplicates depends on your data: email addresses are usually deduplicated case-insensitively, while code identifiers should usually stay case-sensitive.
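
One way to sketch case-insensitive matching while keeping the original order and casing is to compare on a lowercased copy of each line. This assumes a POSIX awk; dedupe_ignore_case is a hypothetical name chosen for illustration.

```shell
# Keep the first occurrence of each line, comparing case-insensitively.
# The lookup key is the lowercased line; the line itself is printed unchanged.
dedupe_ignore_case() {
  awk '!seen[tolower($0)]++'
}

# Example: the third line differs from the first only by case, so it is dropped.
printf 'Alice@example.com\nbob@example.com\nalice@example.com\n' | dedupe_ignore_case
```

The casing that survives is whichever variant appeared first in the input.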

Whitespace also matters. "Alice " (trailing space) and "Alice" look identical in a text editor but are different strings. Trim leading and trailing whitespace before deduplicating so that such lines are not missed as duplicates.
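
A minimal sketch of trimming before deduplicating, assuming POSIX sed and awk (trim_and_dedupe is a hypothetical helper name). The [[:space:]] class also matches carriage returns, so stray Windows line endings are stripped as well.

```shell
# Strip leading and trailing whitespace from every line,
# then keep the first occurrence of each resulting line.
trim_and_dedupe() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | awk '!seen[$0]++'
}

# Example: "Alice " and "Alice" collapse to one line, as do "  Bob" and "Bob".
printf 'Alice \nAlice\n  Bob\nBob\n' | trim_and_dedupe
```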

Ordering considerations

Deduplication keeps one instance of each duplicated line. Which instance is kept depends on the approach:

  • First occurrence: keep the first, discard subsequent. Best when insertion order matters and you trust the original source order.
  • Last occurrence: keep the last, discard earlier ones. Useful when the list is ordered by time and you want the most recent version of each record.
  • Sorted: sort first, then deduplicate. Produces alphabetically ordered output. Good for keyword lists, domain lists, and other order-independent data.
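
The first two strategies can be sketched in awk as follows. This is an illustration under the assumption of a POSIX awk, with hypothetical function names. The keep-last variant holds the whole input in memory, so it suits files that fit comfortably in RAM. Replacing $0 with a key expression such as tolower($0) in the array lookups applies the same logic to lines that match only by key.

```shell
# Keep the FIRST occurrence of each line, in original order.
dedupe_keep_first() {
  awk '!seen[$0]++'
}

# Keep the LAST occurrence of each line, in original order.
# Every line is stored, along with the line number where it last appeared.
dedupe_keep_last() {
  awk '{ line[NR] = $0; last[$0] = NR }
       END { for (i = 1; i <= NR; i++) if (last[line[i]] == i) print line[i] }'
}

# Example input: a b a c b
printf 'a\nb\na\nc\nb\n' | dedupe_keep_first   # prints a, b, c
printf 'a\nb\na\nc\nb\n' | dedupe_keep_last    # prints a, c, b
```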

Counting before and after: after deduplication, the count of remaining lines tells you how many unique items you had. If you started with 1,000 lines and end with 340, you had 660 duplicates, which might itself be useful information about your data quality.
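
The before-and-after arithmetic can be done in a single pass. The sketch below assumes a POSIX awk and exact, case-sensitive line matching; count_duplicates is a hypothetical helper name.

```shell
# Report total lines, unique lines, and the difference between them.
count_duplicates() {
  awk '{ total++; if (!seen[$0]++) unique++ }
       END { printf "total=%d unique=%d duplicates=%d\n", total, unique, total - unique }'
}

# Example: 5 lines, 3 distinct values.
printf 'a\nb\na\nc\nb\n' | count_duplicates   # prints total=5 unique=3 duplicates=2
```

For a file, call it as count_duplicates < input.txt.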

Command line options

For large files, command-line tools are faster than browser-based tools:

# Sort and deduplicate (Unix/macOS/Linux)
sort -u input.txt > output.txt

# Deduplicate without sorting (preserves original order)
awk '!seen[$0]++' input.txt > output.txt

# Case-insensitive deduplication (output is sorted, ignoring case)
sort -f -u input.txt > output.txt

For browser-based deduplication without writing code, a text sorter and deduplicator handles this in a few seconds: paste, click, copy the result.