Wiki source code of Confluence Import Process

Last modified by Vincent Massol on 2026/04/08 13:49

Confluence XML is implemented as an input filter stream. [[ConfluenceInputFilterStream>>https://github.com/xwiki-contrib/confluence/blob/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java]] is instantiated with [[input properties>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/filter/confluence/input/ConfluenceInputProperties.java]] describing what to import and how.

ConfluenceInputFilterStream then sets up [[ConfluenceXMLPackage>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/input/ConfluenceXMLPackage.java]], which extracts the Confluence package and indexes it.

ConfluenceXMLPackage is designed to handle huge export packages:

* the XML parsing is streamed
* instead of keeping everything in RAM, each object is written to its own Apache Commons Configuration properties file in a temporary directory (a database engine such as SQLite could probably be used for this instead, and might be even more efficient)
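
The two points above can be illustrated with a minimal sketch. It is not the actual ConfluenceXMLPackage code: the class name ##StreamingIndexSketch##, the simplified XML layout (##<object id="...">## elements containing ##<property name="...">## children) and the use of ##java.util.Properties## in place of Apache Commons Configuration are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the real ConfluenceXMLPackage implementation.
class StreamingIndexSketch {
    /**
     * Streams over the XML and writes each <object> to its own .properties file in dir,
     * so that at most one object is held in memory at a time.
     * Returns the number of objects written.
     */
    static int index(String xml, Path dir) throws XMLStreamException, IOException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Properties current = null; // properties of the object being read, if any
        String id = null;          // id of the object being read
        int count = 0;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("object")) {
                    id = reader.getAttributeValue(null, "id");
                    current = new Properties();
                } else if (name.equals("property") && current != null) {
                    String key = reader.getAttributeValue(null, "name");
                    // getElementText() consumes the text up to the matching end tag.
                    current.setProperty(key, reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                && reader.getLocalName().equals("object") && current != null) {
                // Flush the finished object to disk and forget it.
                try (OutputStream out = Files.newOutputStream(dir.resolve(id + ".properties"))) {
                    current.store(out, null);
                }
                current = null;
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

Memory use in this sketch depends on the size of the largest single object, not on the size of the whole export, which is the property the list above describes.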

ConfluenceInputFilterStream is built in the same spirit: it browses the contents of the package and sends them as a stream of events through the filter stream API. If the output filter in use is also built this way, which is normally the case, the whole process is a pipeline that can handle huge imports.
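
The pipeline idea can be sketched as follows. This is only an illustration under assumptions: ##PageFilter##, ##read## and ##CountingFilter## are hypothetical names, far simpler than the real filter stream API. The point is that the input side pushes each event as soon as it is read, and the output side handles it immediately without buffering.

```java
import java.util.Iterator;

// Hypothetical sketch of a push-style pipeline; not the XWiki filter stream API.
class PipelineSketch {
    /** Hypothetical event listener, standing in for an output filter. */
    interface PageFilter {
        void beginPage(String title);
        void endPage(String title);
    }

    /** Input side: pulls pages one at a time and pushes events right away. */
    static void read(Iterator<String> titles, PageFilter filter) {
        while (titles.hasNext()) {
            String title = titles.next();
            filter.beginPage(title);
            filter.endPage(title);
        }
    }

    /** Output side: handles each event as it arrives and keeps only counters. */
    static class CountingFilter implements PageFilter {
        int open;
        int finished;

        @Override
        public void beginPage(String title) {
            open++;
        }

        @Override
        public void endPage(String title) {
            open--;
            finished++;
        }
    }
}
```

Because neither side keeps the pages it has already handled, the number of pages does not affect memory use in this sketch.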

More precisely, ConfluenceInputFilterStream:

* imports users
* imports groups
* browses spaces

For each space, it:

* imports the home page
* imports the orphans (pages that have no parent and are not the home page)
* imports the space blog
* imports the space templates
* imports permissions from the space permissions and the home page permissions

For each page, it:

* imports revisions
** imports page metadata (dates, author, title, ...)
** imports page permissions
** imports content, by instantiating the parser for the corresponding syntax (see [[details about the Confluence XHTML Parser>>doc:xwiki:documentation.extensions.dev.confluence.xhtml-parsing.WebHome]]); a page can also be in the old Confluence syntax, or contain plain text that is not to be converted (for pages used as data storage)
** imports comments
** imports attachments
** imports labels as tags
* imports children (recursive operation)
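
The traversal described above can be sketched as a small recursive walk. The types ##Page## and ##Space## and the event strings are hypothetical stand-ins, and the sketch assumes the steps run in the order they are listed. The real code reads these objects from the package rather than from memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the import order; not the real ConfluenceInputFilterStream code.
class ImportOrderSketch {
    /** Hypothetical page: a title, a number of revisions and child pages. */
    static final class Page {
        final String title;
        final int revisions;
        final List<Page> children;

        Page(String title, int revisions, List<Page> children) {
            this.title = title;
            this.revisions = revisions;
            this.children = children;
        }
    }

    /** Hypothetical space: a key, a home page and orphan pages. */
    static final class Space {
        final String key;
        final Page home;
        final List<Page> orphans;

        Space(String key, Page home, List<Page> orphans) {
            this.key = key;
            this.home = home;
            this.orphans = orphans;
        }
    }

    /** Records the revisions of a page, then recurses into its children. */
    static void importPage(Page page, List<String> events) {
        for (int rev = 1; rev <= page.revisions; rev++) {
            // Metadata, permissions, content, comments, attachments and labels
            // would be sent here, once per revision.
            events.add("revision " + page.title + " v" + rev);
        }
        for (Page child : page.children) {
            importPage(child, events);
        }
    }

    /** Walks users, groups, then each space, and returns the events in order. */
    static List<String> run(List<Space> spaces) {
        List<String> events = new ArrayList<>();
        events.add("users");
        events.add("groups");
        for (Space space : spaces) {
            events.add("space " + space.key);
            importPage(space.home, events);
            for (Page orphan : space.orphans) {
                importPage(orphan, events);
            }
            events.add("blog " + space.key);
            events.add("templates " + space.key);
            events.add("permissions " + space.key);
        }
        return events;
    }
}
```

Treating orphans as extra roots next to the home page is what lets the same recursive ##importPage## reach every page of a space in this sketch.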

Finally, the Confluence package is closed and the properties files are removed. By default this cleanup is synchronous, but a parameter can make it asynchronous so that the job can end without waiting for the cleanup to finish. Another parameter skips the cleanup altogether. This is mostly useful for debugging: it lets you inspect the extracted properties, and skip the parsing phase if you run several imports of the same package.
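
The three cleanup behaviours can be sketched as below. The names ##CleanupSketch##, ##Mode##, ##SYNC##, ##ASYNC## and ##NONE## are assumptions for illustration and may not match the actual parameter names or values. The sketch only shows the difference between deleting before returning, deleting in the background, and not deleting at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

// Hypothetical sketch of the cleanup modes; names do not come from the real code.
class CleanupSketch {
    enum Mode { SYNC, ASYNC, NONE }

    /** Deletes a directory tree, children before parents. */
    static void deleteTree(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Closes the package directory according to the mode.
     * The returned future completes once the cleanup, if any, is done.
     */
    static CompletableFuture<Void> close(Path dir, Mode mode) {
        switch (mode) {
            case SYNC:
                // The caller waits for the deletion before continuing.
                deleteTree(dir);
                return CompletableFuture.completedFuture(null);
            case ASYNC:
                // The caller continues immediately; deletion runs in the background.
                return CompletableFuture.runAsync(() -> deleteTree(dir));
            default:
                // NONE: keep the files, for example to inspect them while debugging.
                return CompletableFuture.completedFuture(null);
        }
    }
}
```

With ##Mode.NONE## the directory is left in place, so a later run could in principle reuse it instead of parsing the package again.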

The starting point is the [[##read## method in ##ConfluenceInputFilterStream##>>https://github.com/xwiki-contrib/confluence/blob/ecd749e56f3671df9d08ca57912d49d4d38ff42a/confluence-xml/src/main/java/org/xwiki/contrib/confluence/filter/internal/input/ConfluenceInputFilterStream.java#L268]]. This is probably where a new developer should start when hacking on Confluence XML.
