一万多个 PDF 文件，有的损坏无法打开，有的加密了无法打开，如何用工具或命令扫描出非正常的 PDF 文件？

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

V2EX 提问指南

这是一个创建于 262 天前的主题，其中的信息可能已经有所发展或是发生改变。

一万多个 PDF 文件，有的损坏无法打开，有的加密了无法打开，如何用工具或命令扫描出非正常的 PDF 文件？
- 只要能使用 PDF 阅读软件打开不弹窗 PDF 文件已损坏的，就算正常文件。

10 条回复 • 2024-03-16 17:54:49 +08:00

Seanfuck

262 天前

有批量转图片的工具，能转就正常

Apol1oBelvedere

262 天前

@Seanfuck 希望有个直接检测而不需要耗时转换的办法。

lianyue

262 天前

chatgpt :

一万多个 PDF 文件在 document 目录里面有子目录，有的损坏无法打开，有的加密了无法打开，如何用用 golang 实现扫描出非正常的 PDF 文件

得到的答案：
https://go.dev/play/p/cLaB6i4M3nj

Mithril

262 天前

如果你用的阅读软件支持命令行调用，可以写个脚本循环打开看看能否调用成功就行了。

不支持的话，随便用什么编程语言，调用个 PDF 库尝试打开一下，然后随便读取点啥。能读出来就算打开成功。

hulooq

262 天前

gemini-pro 写的代码：

```python
import os
import PyPDF2

def scan_pdf_files(directory):
"""Scans a directory of PDF files and identifies non-normal PDF files.

Args:
directory: The directory to scan.

Returns:
A list of non-normal PDF files.
"""

non_normal_pdf_files = []

# Iterate over the files in the directory.

for file in os.listdir(directory):

# Check if the file is a PDF file.

if file.endswith(".pdf"):

# Try to open the PDF file.

try:
pdf_file = open(os.path.join(directory, file), "rb")
except:
# The PDF file could not be opened, so it is non-normal.

non_normal_pdf_files.append(file)
continue

# Try to read the PDF file.

try:
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
except:
# The PDF file could not be read, so it is non-normal.

non_normal_pdf_files.append(file)
continue

# Check if the PDF file is encrypted.

if pdf_reader.isEncrypted:
# The PDF file is encrypted, so it is non-normal.

non_normal_pdf_files.append(file)

# Return the list of non-normal PDF files.

return non_normal_pdf_files

if __name__ == "__main__":

# Get the directory to scan from the user.

directory = input("Enter the directory to scan: ")

# Scan the directory for non-normal PDF files.

non_normal_pdf_files = scan_pdf_files(directory)

# Print the list of non-normal PDF files to the console.

print("The following PDF files are non-normal:")

for file in non_normal_pdf_files:
print(file)

```

z4oSkDNGGC2svsix

262 天前

在一万多个 PDF 文件中快速找出损坏的文件可以通过以下步骤进行：

使用脚本或工具批量检测 PDF 文件的完整性：你可以编写脚本或使用现有的工具来批量检查 PDF 文件的完整性。例如，你可以使用 Python 编写一个脚本，利用 PyPDF2 库或其他 PDF 处理库来打开每个文件并检查其完整性。如果文件损坏，这些库通常会引发异常。

使用命令行工具：一些命令行工具可以帮助你批量检测 PDF 文件的完整性。例如，使用 PDFtk 工具的 dump_data 命令可以快速检查 PDF 文件的完整性。

利用 PDF 阅读器批量打开文件：有些 PDF 阅读器支持批量打开文件。你可以使用这样的阅读器来尝试批量打开所有文件，如果某些文件损坏，可能会显示错误或无法打开。

使用专业的文件恢复软件：有些文件恢复软件可以扫描整个文件系统并识别损坏的文件。尽管这种方法可能需要更长的时间，并且可能会找到其他类型的文件，但它是一种可行的方法。

利用文件系统搜索功能：有些操作系统提供高级的文件搜索功能，可以通过文件的大小、修改日期等属性来筛选文件。你可以使用这些功能来快速定位 PDF 文件，并手动检查其中是否存在损坏的文件。

无论你选择哪种方法，都建议在操作前备份所有文件，以免意外删除或修改了原始文件。同时，也要注意处理 PDF 文件时可能会消耗大量的系统资源和时间，尤其是在处理数量庞大的文件时。

z4oSkDNGGC2svsix

262 天前

https://trinket.io/python3/7c4aa4981c

deorth

262 天前 via Android

得加钱

gamexg

262 天前

会什么语言就用什么语言找下可用的 pdf 库，然后尝试打开并读取下内容试试
不要界面，1 小时内差不多搞定

Apol1oBelvedere

259 天前

谢谢以上的帮助，此问题总结心得如下：

1 、superuser.com 上有很多相同问题，提供了思路。

2 、采用开源工具 https://github.com/pdfcpu/pdfcpu ，一行命令可以检测完全无法打开的损坏的 PDF （通过关键字 corrupt 筛选错误），用损坏 pdf 测试可行：
for /r "C:\Path\to\Dir" %f in (*.pdf) do @(pdfcpu validate "%f" 2>&1 | findstr /C:"corrupt" > nul && echo "%f is corrupt")