Background
Batch embedding into the vector store was painfully slow: files could only be processed one at a time, each format (pdf, ppt, doc, txt, and so on) needs its own handling, and PDFs were by far the worst because they go through OCR. So I set out to optimize it.
Problem Analysis
There are quite a few painful spots in here.
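The setup looked roughly like this (a minimal sketch; the model name and data are placeholders, but note the module-level model load):

```python
import multiprocessing as mp
from sentence_transformers import SentenceTransformer

# Loaded at module level, i.e. above the entry-point guard.
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

def embed_batch(texts):
    return model.encode(texts)

if __name__ == '__main__':
    mp.set_start_method('spawn')  # CUDA contexts do not survive fork
    batches = [['chunk one'], ['chunk two'], ['chunk three']]
    with mp.Pool(2) as pool:
        vectors = pool.map(embed_batch, batches)
```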
To use CUDA across multiple processes, you have to use spawn. The Python 3 documentation on Contexts and start methods describes the three ways to start processes:
spawn
The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver. Available on POSIX and Windows platforms. The default on Windows and macOS.

fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic. Available on POSIX systems. Currently the default on POSIX except macOS.

Note: The default start method will change away from fork in Python 3.14. Code that requires fork should explicitly specify that via get_context() or set_start_method(). Changed in version 3.12: If Python is able to detect that your process has multiple threads, the os.fork() function that this start method calls internally will raise a DeprecationWarning. Use a different start method. See the os.fork() documentation for further explanation.

forkserver
When the program starts and selects the forkserver start method, a server process is spawned. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded unless system libraries or preloaded imports spawn threads as a side-effect so it is generally safe for it to use os.fork(). No unnecessary resources are inherited. Available on POSIX platforms which support passing file descriptors over Unix pipes such as Linux.
spawn leads to some bizarre behavior:

- With the code above, the model actually gets loaded multiple times. Unlike fork, which directly inherits the parent process's in-memory objects, spawn re-runs from scratch all the module-level code above main in every child.
- The call that starts the processes must sit below if __name__ == '__main__': (protect the entry point). The Python docs mention this under Safe importing of main module:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).

For example, using the spawn or forkserver start method running the following module would fail with a RuntimeError:

```python
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
```
Instead one should protect the "entry point" of the program by using if __name__ == '__main__': as follows:

```python
from multiprocessing import Process, freeze_support, set_start_method

def foo():
    print('hello')

if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
```
(The freeze_support() line can be omitted if the program will be run normally instead of frozen.)

This allows the newly spawned Python interpreter to safely import the module and then run the module's foo() function.

Similar restrictions apply if a pool or manager is created in the main module.
Which is a real pain, because it makes the script very hard to call as a module.
Why do we need to protect the entry point?
When starting a new process from the main process with the spawn method, we must protect the entry point. The reason is that our Python script will be automatically loaded and imported by the child process; this is required for our custom code and functions to execute in the new child process. If the entry point is not protected by the if-statement idiom that checks the top-level environment, the script will simply be executed again, instead of running the new child process as intended. Protecting the entry point ensures the program starts only once, and that the main process's tasks are performed only by the main process, never by the child processes.
None of this happens with fork. See PyTorch's MULTIPROCESSING BEST PRACTICES for GPUs, which frankly doesn't explain it all that clearly either. The first problem can still be solved with serialization and a queue, but I had no idea what to do about the second one.
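For the first problem, the queue workaround looks roughly like this (a sketch assuming sentence-transformers; the model name and batches are placeholders):

```python
import multiprocessing as mp

def worker(task_queue, result_queue):
    # Import and load inside the child, so the (unpicklable) model never
    # crosses a process boundary and is loaded exactly once per worker.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
    while True:
        item = task_queue.get()
        if item is None:              # sentinel: no more work
            break
        idx, texts = item
        result_queue.put((idx, model.encode(texts)))

if __name__ == '__main__':
    mp.set_start_method('spawn')
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs:
        p.start()
    batches = [['chunk one'], ['chunk two'], ['chunk three']]
    for i, b in enumerate(batches):
        tasks.put((i, b))
    for _ in procs:
        tasks.put(None)
    by_index = dict(results.get() for _ in batches)   # reassemble in order
    vectors = [by_index[i] for i in range(len(batches))]
    for p in procs:
        p.join()
```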
Light at the End of the Tunnel
Looking at the source of langchain's HuggingFaceEmbeddings, I found it even has a multi_process parameter for multi-process GPU encoding.
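Its embed_documents branches on that flag roughly like this (an abridged sketch of the source, not a verbatim quote):

```python
# inside langchain's HuggingFaceEmbeddings; self.client is a
# sentence_transformers.SentenceTransformer instance
def embed_documents(self, texts):
    import sentence_transformers
    texts = list(map(lambda x: x.replace("\n", " "), texts))
    if self.multi_process:
        pool = self.client.start_multi_process_pool()
        embeddings = self.client.encode_multi_process(texts, pool)
        sentence_transformers.SentenceTransformer.stop_multi_process_pool(pool)
    else:
        embeddings = self.client.encode(texts, **self.encode_kwargs)
    return embeddings.tolist()
```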
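Turning it on is just a constructor flag (a minimal usage sketch; the import path varies across langchain versions):

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    multi_process=True,   # spawns one encode worker per visible GPU
)
vectors = embeddings.embed_documents(['chunk one', 'chunk two'])
```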
That only speeds up one stage of the file-processing pipeline, the embedding encode. But there is something in here worth borrowing that can also fix the slow loaders.
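The pool it builds comes from sentence-transformers' start_multi_process_pool, which goes roughly like this (abridged, not verbatim):

```python
# inside sentence_transformers/SentenceTransformer.py
# (mp is torch.multiprocessing)
def start_multi_process_pool(self, target_devices=None):
    if target_devices is None:
        if torch.cuda.is_available():
            target_devices = ['cuda:{}'.format(i) for i in range(torch.cuda.device_count())]
        else:
            target_devices = ['cpu'] * 4

    ctx = mp.get_context('spawn')       # note: get_context, not set_start_method
    input_queue = ctx.Queue()
    output_queue = ctx.Queue()
    processes = []
    for device in target_devices:
        p = ctx.Process(
            target=SentenceTransformer._encode_multi_process_worker,
            args=(device, self, input_queue, output_queue),
            daemon=True,
        )
        p.start()
        processes.append(p)
    return {'input': input_queue, 'output': output_queue, 'processes': processes}
```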
Notice the use of get_context here: it may well solve the second problem above. I simply hadn't read the docs carefully enough at the time.
The set_start_method() function sets the global start method. It can be called only once, in the program's main module, never in a child process, and it determines how all subsequently created processes are started.

The get_context() function returns a context object for a particular start method. It can be called anywhere, is not constrained by the global start method, and takes a string argument naming the start method to use.
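A minimal contrast of the two (hypothetical hello example):

```python
import multiprocessing as mp

def hello():
    print('hello from child')

if __name__ == '__main__':
    # Option 1: flips the process-wide default; may be called only once,
    # and only from the main module.
    # mp.set_start_method('spawn')

    # Option 2: an isolated context object; callable from anywhere,
    # e.g. deep inside a library, without touching the global setting.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=hello)
    p.start()
    p.join()
```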
According to this issue, though, things can still go wrong: the freeze_support error still comes up.
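For reference, that is the familiar bootstrap error from multiprocessing, which reads roughly as follows:

```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork
        to start your child processes and you have forgotten
        to use the proper idiom in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
```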