How to Remove Duplicate Lines From a Text List
Duplicate lines in text lists accumulate from copy-pasting, merging sources, and exporting data from systems that don't enforce uniqueness. Removing them before importing, processing, or sending the list prevents downstream errors and inflated counts.
Where duplicates come from
The most common sources:
- Merging two lists that overlap (combining two email lists, two keyword lists, two ID exports)
- Copying data from multiple pages of a paginated report
- Exporting the same data twice with slightly different date ranges
- Data from a system that doesn't enforce uniqueness at the storage level
- Manual data entry where the same item was added twice
Basic deduplication
In a sorted list, duplicates sit next to each other, which makes them easy to spot by eye. In an unsorted list they can be anywhere, which is why automatic detection is faster and more reliable than manual review.
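Before removing anything, it can help to see which lines are actually repeated. The sketch below relies on that adjacency property: it sorts the input so identical lines end up next to each other, then asks uniq to print one copy of each line that occurs more than once. It assumes a POSIX shell with sort and uniq available; the function name list_duplicates is illustrative and does not come from any particular tool.

```shell
# list_duplicates: read lines on stdin, print one copy of every line
# that appears more than once.
# uniq only compares neighbouring lines, so the input is sorted first.
list_duplicates() {
    sort | uniq -d
}

# Example usage (input.txt is a placeholder file name):
#   list_duplicates < input.txt
```

Replacing `uniq -d` with `uniq -c` prints every distinct line with its occurrence count instead, which is useful for judging how heavily the list overlaps.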
Deduplication can be case-sensitive or case-insensitive. "Alice" and "alice" are different strings but often represent the same record. Whether to treat them as duplicates depends on your data: email addresses are usually deduplicated case-insensitively (the domain part is case-insensitive, and most mail providers treat the local part the same way), while code identifiers should usually be compared case-sensitively.
Whitespace also matters. "Alice " (trailing space) and "Alice" look identical in a text editor but are different strings. Trim leading and trailing whitespace before deduplicating so that such lines are recognized as duplicates.
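One possible way to combine both normalizations is to compare lines on a cleaned-up key while still printing the first version seen. The following is a minimal sketch, assuming a POSIX awk; the function name dedupe_normalized is made up for illustration. It trims spaces, tabs, and carriage returns from both ends, lowercases the result to build the comparison key, and prints the trimmed first occurrence with its original capitalization. Note that tolower in some awk implementations only folds ASCII letters, so non-English text may need a different tool.

```shell
# dedupe_normalized: read lines on stdin, write unique lines to stdout.
# Two lines count as duplicates if they are equal after trimming
# surrounding whitespace and lowercasing. The first occurrence wins.
dedupe_normalized() {
    awk '{
        line = $0
        sub(/^[ \t\r]+/, "", line)   # strip leading spaces, tabs, CRs
        sub(/[ \t\r]+$/, "", line)   # strip trailing spaces, tabs, CRs
        key = tolower(line)          # case-insensitive comparison key
        if (!(key in seen)) {
            seen[key] = 1
            print line               # trimmed, original case preserved
        }
    }'
}

# Example usage (file names are placeholders):
#   dedupe_normalized < input.txt > output.txt
```

For data that must stay case-sensitive, such as code identifiers, the same sketch works with `key = line` in place of the tolower call.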
Ordering considerations
Deduplication keeps one instance of each repeated line. Which instance is kept depends on the strategy:
- First occurrence: keep the first, discard subsequent. Best when insertion order matters and you trust the original source order.
- Last occurrence: keep the last, discard earlier ones. Useful when the list is ordered by time and you want the most recent version of each record.
- Sorted: sort first, then deduplicate. Produces alphabetically ordered output. Good for keyword lists, domain lists, and other order-independent data.
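The difference between keeping the first and the last occurrence is easiest to see side by side. The sketch below assumes a POSIX awk; the function names dedupe_keep_first and dedupe_keep_last are illustrative. The keep-last variant records the line number of the final appearance of each line and prints a line only when it is at that position, so it holds the whole input in memory and suits files that fit comfortably in RAM.

```shell
# Keep the first occurrence of each line; original order is preserved.
dedupe_keep_first() {
    awk '!seen[$0]++'
}

# Keep the last occurrence of each line; output follows the order in
# which those last occurrences appear in the input.
dedupe_keep_last() {
    awk '{ line[NR] = $0; last[$0] = NR }   # remember final position per line
         END {
             for (i = 1; i <= NR; i++)
                 if (last[line[i]] == i) print line[i]
         }'
}

# Example usage (file names are placeholders):
#   dedupe_keep_last < input.txt > output.txt
```

For the input lines b, a, b, c, a, the first variant yields b, a, c and the second yields b, c, a.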
Command line options
For large files, command-line tools are usually faster than browser-based tools, because they do not have to load the whole list into a page:
# Sort and deduplicate (Unix/macOS/Linux)
sort -u input.txt > output.txt
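# Deduplicate without sorting (preserves original order, keeps the first occurrence)
awk '!seen[$0]++' input.txt > output.txt
# Case-insensitive deduplication (output is sorted; which case variant is kept may vary by implementation)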
sort -f -u input.txt > output.txt
For deduplication in the browser without writing code, a text sorter and deduplicator handles this in a few seconds: paste the list, click, and copy the result.