Wiki source code of Confluence Import Process

Last modified by Vincent Massol on 2026/04/08 13:49

Confluence XML is implemented as an input filter stream. [[ConfluenceInputFilterStream>>https://github.com/xwiki-contrib/confluence/blob/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java]] is instantiated with [[input properties>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/filter/confluence/input/ConfluenceInputProperties.java]] describing what to import and how.

ConfluenceInputFilterStream then sets up [[ConfluenceXMLPackage>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/input/ConfluenceXMLPackage.java]], which extracts the Confluence package and indexes it.

ConfluenceXMLPackage is designed to handle huge export packages:

* the XML parsing is streamed
* instead of keeping everything in RAM, each object is written to its own Apache Commons Configuration properties file in a temporary directory (a database engine such as SQLite could probably be used instead, and might be even more efficient)
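
As a rough illustration of this approach, here is a minimal Python sketch (the actual implementation is Java and relies on Apache Commons Configuration). The ##<object>##, ##<id>## and ##<property>## element layout, the ##index_objects## function and the directory layout are all assumptions made for the example; they are not the real Confluence export schema nor the ConfluenceXMLPackage API.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def index_objects(xml_source, index_dir):
    """Stream-parse an export and write one properties file per object.

    Assumed (hypothetical) layout:
    <object class="Page"><id>42</id><property name="title">Home</property></object>
    Returns the number of objects written.
    """
    index_dir = Path(index_dir)
    count = 0
    # iterparse hands over each element as soon as its end tag has been read,
    # so objects can be processed one at a time while the file is being read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "object":
            continue
        object_dir = index_dir / elem.get("class", "unknown")
        object_dir.mkdir(parents=True, exist_ok=True)
        # Simplification: real properties files need escaping (newlines, '=', ...).
        lines = [f"{prop.get('name')}={prop.text or ''}" for prop in elem.findall("property")]
        target = object_dir / f"{elem.findtext('id')}.properties"
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        # Drop the children and text of the object we just wrote to disk.
        elem.clear()
        count += 1
    return count
```

With this layout, looking up an object later only requires reading one small file, for example ##Page/42.properties##, rather than keeping the parsed export in memory.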

ConfluenceInputFilterStream is built in the same spirit: it browses the content of the package and sends it as a stream of events through the filter stream API. If the output filter in use is also built this way, which is normally the case, the whole process is a pipeline that can handle huge imports.
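
The pipeline idea can be sketched as follows, in Python for brevity. The ##begin## and ##end## methods, the ##read## function signature and the ##CountingOutputFilter## class are hypothetical names chosen for the illustration; the real filter stream API has its own event methods.

```python
class CountingOutputFilter:
    """Hypothetical output filter: handles each event on arrival and keeps only counters."""

    def __init__(self):
        self.depth = 0
        self.max_depth = 0
        self.documents = 0

    def begin(self, kind, name):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if kind == "document":
            self.documents += 1

    def end(self, kind, name):
        self.depth -= 1


def read(spaces, listener):
    """Hypothetical input side: walks (space, pages) pairs and pushes events downstream.

    `pages` can be a generator, so page names are produced only when needed.
    """
    for space, pages in spaces:
        listener.begin("space", space)
        for page in pages:
            listener.begin("document", page)
            listener.end("document", page)
        listener.end("space", space)
```

Since neither side accumulates the events, memory use does not grow with the number of pages. An output filter that buffered every event before writing anything would break that property.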

More precisely, ConfluenceInputFilterStream:

* imports users
* imports groups
* browses spaces

For each space, it:

* imports the home page
* imports the orphans (pages that have no parent and are not the home page)
* imports the space blog
* imports the space templates
* imports permissions from the space permissions and the home page permissions
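
The orphan rule above fits in a few lines. This is only a sketch: the ##parents## mapping (page id to parent id, ##None## when the page has no parent) and both function names are hypothetical.

```python
def find_orphans(parents, home_page_id):
    """Return the ids of pages that have no parent and are not the home page."""
    return [
        page_id
        for page_id, parent_id in parents.items()
        if parent_id is None and page_id != home_page_id
    ]


def space_roots(parents, home_page_id):
    """Pages from which a space is walked: the home page first, then the orphans."""
    return [home_page_id] + find_orphans(parents, home_page_id)
```

For example, with ##parents = {1: None, 2: 1, 3: None, 4: 3}## and home page ##1##, page ##3## is the only orphan, and pages ##2## and ##4## are reached later as children.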

For each page, it:

* imports revisions
** imports page metadata (dates, author, title, ...)
** imports page permissions
** imports content, by instantiating the parser for the corresponding syntax (see [[details about the Confluence XHTML Parser>>doc:xwiki:documentation.extensions.dev.confluence.xhtml-parsing.WebHome]]). A page can also be in the old Confluence syntax, or contain plain text that must not be converted (for pages used as data storage)
** imports comments
** imports attachments
** imports labels as tags
* imports children (recursive operation)
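
The order of operations for a page can be sketched as a small recursive function. The data structures (##revisions## and ##children## as dictionaries keyed by page id) and the ##emit## callback are assumptions made for the example, not the actual Java code.

```python
# Steps performed for each revision, in the order listed above.
REVISION_STEPS = ("metadata", "permissions", "content", "comments", "attachments", "tags")


def import_page(page_id, revisions, children, emit):
    """Emit every step of every revision of a page, then recurse into its children."""
    for revision in revisions.get(page_id, ()):
        for step in REVISION_STEPS:
            emit((page_id, revision, step))
    for child_id in children.get(page_id, ()):
        import_page(child_id, revisions, children, emit)
```

A page is therefore fully sent before any of its children, which gives a depth-first walk of the page hierarchy.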

Finally, the Confluence package is closed and the property files are removed. By default, this cleanup is synchronous, but a parameter can make it asynchronous so that the job can end without waiting for the cleanup to finish. Another parameter disables the cleanup altogether, which is mostly useful for debugging: it lets you inspect the extracted properties, and skip the parsing phase if you run several imports of the same package.
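
The three cleanup behaviours can be sketched like this. The ##cleanup## argument and its ##"sync"##, ##"async"## and ##"none"## values are hypothetical names for this example; check the input properties for the actual parameter names and values.

```python
import shutil
import threading


def close_package(directory, cleanup="sync"):
    """Remove the extracted files according to the (hypothetical) cleanup mode.

    Returns the worker thread in "async" mode, None otherwise.
    """
    if cleanup == "none":
        # Keep the files, e.g. to inspect them or to reuse them for another import.
        return None
    if cleanup == "async":
        # Delete in the background so the caller does not wait for the removal.
        worker = threading.Thread(
            target=shutil.rmtree, args=(directory,), kwargs={"ignore_errors": True}
        )
        worker.start()
        return worker
    if cleanup == "sync":
        shutil.rmtree(directory, ignore_errors=True)
        return None
    raise ValueError(f"unknown cleanup mode: {cleanup!r}")
```

The asynchronous mode trades a faster end of job for a short period during which the temporary files still exist on disk.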

The starting point is the [[##read## method in ##ConfluenceInputFilterStream##>>https://github.com/xwiki-contrib/confluence/blob/ecd749e56f3671df9d08ca57912d49d4d38ff42a/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java#L268]]. This is probably where a new developer should start when hacking on Confluence XML.
