Wiki source code of Explore Confluence Exports

Last modified by Raphaël Jakse on 2026/03/24 10:18

You might need to explore a Confluence package to understand a bug, or to figure out how to implement a new feature. For instance, we needed to figure out how to implement space template conversion, and noticed that templates are actually PageTemplate objects that look like Page objects.

Exploring a Confluence package can be challenging, mainly because of its size: it can be huge. We also don't have much dedicated tooling to explore Confluence packages, so we'll have to make do with standard tooling.

= Transferring data =

Your Confluence package could be living on a remote server, and transferring a huge file might be your first challenge. Here are a few strategies you can use to reduce the amount of data you need to transfer.

First, the Confluence package is a zip file whose size might be mostly taken up by attachments that don't compress well. Often enough, you are only interested in the entities.xml file (or, more rarely, the exportDescriptor.properties file). You can check their sizes with ##unzip -l## (name the file you are interested in, or else you'll be spammed with info about all the files in the archive):

{{code language="none"}}
unzip -l 5.9.zip entities.xml
Archive:  5.9.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   288084  2016-05-10 17:04   entities.xml
---------                     -------
   288084                     1 file

unzip -l 5.9.zip exportDescriptor.properties
Archive:  5.9.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      569  2016-05-10 17:04   exportDescriptor.properties
---------                     -------
      569                     1 file
{{/code}}

If the file is small enough, you can print it in full with ##unzip -p##:

{{code language="none"}}
unzip -p 5.9.zip exportDescriptor.properties
{{/code}}

{{code language="none"}}
#Tue May 10 17:04:43 CEST 2016
buildNumber=6169
ao.data.version.com.atlassian.mywork.mywork-confluence-host-plugin=4.0.1
ao.data.version.min.com.atlassian.confluence.plugins.confluence-space-ia=5.0
exportType=all
createdByBuildNumber=6442
backupAttachments=true
defaultUsersGroup=confluence-users
ao.data.list=com.atlassian.mywork.mywork-confluence-host-plugin, com.atlassian.confluence.plugins.confluence-space-ia
ao.data.version.min.com.atlassian.mywork.mywork-confluence-host-plugin=1.1.30
ao.data.version.com.atlassian.confluence.plugins.confluence-space-ia=13.0.4
{{/code}}
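
When only one or two values of the descriptor matter, they can be picked out of the stream directly. The following is a minimal sketch, assuming the plain ##key=value## layout shown above; ##descriptor_value## is a hypothetical helper name, not an existing tool.

{{code language="shell"}}
# Print the value of one key from a .properties stream read on stdin.
# descriptor_value is a hypothetical helper, not part of unzip.
descriptor_value() {
    # -F= makes $1 the text before the first "="; lines starting with "#" are comments
    awk -F= -v key="$1" '
        /^#/ { next }
        $1 == key { sub(/^[^=]*=/, ""); print; exit }
    '
}
{{/code}}

With the descriptor shown above, ##unzip -p 5.9.zip exportDescriptor.properties | descriptor_value buildNumber## should print ##6169##.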

The entities.xml file can be huge, though. Here, you have a few options:

* grep the relevant parts, especially if you are on a restricted connection, or if you are using [[Admin Tools>>https://extensions.xwiki.org/xwiki/bin/view/Extension/Admin%20Tools%20Application]] to run commands (see next section)
* create an archive containing only entities.xml

Although huge, entities.xml is plain XML text and should compress very well, which might make it practical to transfer the full file:

{{code language="none"}}
unzip -p 5.9.zip entities.xml | gzip > 5.9.entities.xml.gz
{{/code}}
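
Before starting a transfer, it can help to know how small the compressed file will actually be. The following is a minimal sketch that relies only on ##gzip## and ##wc##; ##gzipped_size## is a hypothetical helper name, and ##user@host## in the comment is a placeholder.

{{code language="shell"}}
# Print the size in bytes that stdin would have once gzip-compressed,
# without writing anything to disk. gzipped_size is a hypothetical helper.
gzipped_size() {
    # tr removes the padding some wc implementations put around the number
    gzip -c | wc -c | tr -d ' '
}

# Local usage:
#   unzip -p 5.9.zip entities.xml | gzipped_size
# Remote variant, compressing on the server and streaming the result over SSH:
#   ssh user@host 'unzip -p 5.9.zip entities.xml | gzip' > 5.9.entities.xml.gz
{{/code}}

Comparing the printed number with the Length column of ##unzip -l## gives an idea of how much the transfer shrinks.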

= Grep the relevant parts =

If you have access to a Confluence package from the command line (locally, on a remote SSH server, or through the Admin Tools extension), grepping the relevant parts can be a very efficient way to get what you need.

For instance, if you need the body content of a page, or other things related to a particular page, you can grep its page id (which you can find in the ConfluencePageClass object of a migrated document):

{{code language="none"}}
unzip -p 5.9.zip entities.xml | grep -C5 196616
{{/code}}

{{code language="none"}}
<property name="bodyType">2</property>
</object>
<object class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">458755</id>
<property name="body"><![CDATA[<p><br />You can start a discussion by simply leaving a comment on a page, like this one.</p><p>Why not give it a try?</p><p>Go to the bottom of this page and start typing in the comment area. When you're finished just press save!&nbsp;</p><p>Don't just confine your comments to the bottom of the page - highlight some text on the page to add an inline comment like this:</p><p><ac:image><ri:attachment ri:filename="Step8-01.png" /></ac:image></p><p><strong>Hint:</strong> You can mention another user in a page or&nbsp; comment by typing @ and then the user's name. <br />The user will be notified that you mentioned them.</p><h1 style="text-align: center;"><span style="color: rgb(51,51,51);"><br /></span></h1><h1 style="text-align: center;"><span style="color: rgb(51,51,51);"><br /></span></h1><h1 style="text-align: center;"><ac:link><ri:page ri:content-title="Learn the wonders of autoconvert (step 7 of 9)" /><ac:link-body><ac:image ac:height="40" ac:width="106"><ri:attachment ri:filename="prev.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link>&nbsp;<ac:link><ri:page ri:content-title="Welcome to Confluence" /><ac:link-body><ac:image><ri:attachment ri:filename="home.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link>&nbsp;<ac:link><ri:page ri:content-title="Share your page with a team member (step 9 of 9)" /><ac:link-body><ac:image><ri:attachment ri:filename="next.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link></h1><p><span style="color: rgb(51,51,51);"><br /></span></p>]]></property>
<property name="content" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="bodyType">2</property>
</object>
<object class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">458753</id>
--
<id name="id">426027</id>
<property name="destinationPageTitle"><![CDATA[Learn the wonders of autoconvert (step 7 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[learn the wonders of autoconvert (step 7 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<id name="id">426032</id>
<property name="destinationPageTitle"><![CDATA[Share your page with a team member (step 9 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[share your page with a team member (step 9 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
<id name="id">426033</id>
<property name="destinationPageTitle"><![CDATA[Welcome to Confluence]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[welcome to confluence]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
<id name="id">426034</id>
<property name="destinationPageTitle"><![CDATA[Tell people what you think in a comment (step 8 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[tell people what you think in a comment (step 8 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<id name="id">426023</id>
<property name="destinationPageTitle"><![CDATA[Let's edit this page (step 3 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[let's edit this page (step 3 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<property name="version">1</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
<property name="versionComment"><![CDATA[]]></property>
<property name="originalVersionId"/><property name="contentStatus"><![CDATA[current]]></property>
<property name="containerContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
</object>
<object class="Attachment" package="com.atlassian.confluence.pages">
<id name="id">196624</id>
<property name="space" class="Space" package="com.atlassian.confluence.spaces"><id name="id">360449</id>
--
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196618</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196617</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196620</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196611</id>
</element>
--
<element class="Attachment" package="com.atlassian.confluence.pages"><id name="id">196634</id>
</element>
</collection>
</object>
<object class="Page" package="com.atlassian.confluence.pages">
<id name="id">196616</id>
<property name="position">7</property>
<property name="parent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196614</id>
</property>
<collection name="ancestors" class="java.util.List"><element class="Page" package="com.atlassian.confluence.pages"><id name="id">196614</id>
</element>
{{/code}}

The ##-C## (context) flag of grep sets how many lines before and after each match are output. This example returns a lot of noise, but our body content is there. ##-C5## is usually good enough because the body content is usually all on one line, and a BodyContent object has few properties and usually references the owning content in its ##content## property.

Other grep options of interest:

* ##-A## to specify the number of lines you want after the match
* ##-B## to specify the number of lines you want before the match
* ##-F## to match the pattern as a fixed string rather than as the basic regular expression grep uses by default
* ##-E## or ##-P## to use more powerful regex engines (extended and Perl-compatible regular expressions, respectively)
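
A bare number such as 196616 also matches any longer id that merely contains it. The following is a minimal sketch of a stricter search, assuming ids are serialized as ##<id name="id">196616</id>## like in the output above; ##id_refs## and ##id_ref_count## are hypothetical helper names.

{{code language="shell"}}
# Print the lines that reference exactly the given content id.
# Matching the whole <id> element as a fixed string (-F) avoids hits on
# longer ids such as 1966160. id_refs is a hypothetical helper name.
id_refs() {
    id=$1
    file=${2:-entities.xml}
    # -n prefixes each match with its line number in the file
    grep -nF "<id name=\"id\">$id</id>" "$file"
}

# Count the matching lines instead of printing them (hypothetical helper).
id_ref_count() {
    grep -cF "<id name=\"id\">$1</id>" "${2:-entities.xml}"
}
{{/code}}

Both helpers read an already extracted file, for example one produced with ##unzip -p 5.9.zip entities.xml > entities.xml##. The ##-A##, ##-B## and ##-C## options can be added to ##id_refs## in the same way as in the command above.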

= Open entities.xml in a text editor that doesn't break on huge content =

Many editors will crash when trying to open a huge entities.xml file. We've successfully opened multi-hundred megabyte files with KWrite, the simple version of Kate, the text editor from KDE. This makes it quite comfortable to work with the Confluence export. Of course, some operations are (very) slow. A few tips to survive:

* You'll be performing a lot of searches. If your editor searches as you type, don't type your search string in the search box: type it elsewhere and then copy-paste it so only one search is performed.
* Search and replace can work, including regex-based search and replace, but some operations may freeze your editor. Be extra careful not to use regexes that are expensive to evaluate.
* Some replace operations might be very expensive as well, depending on how your editor is implemented. For example, inserting or removing lines many times may freeze your editor.
* Replace All might be far less painful than validating each replacement and repeatedly waiting for the UI to respond.
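
When an editor struggles with the whole file, it may be enough to open only the objects of one class. The following is a minimal sketch, assuming each object starts on a line containing its ##<object class=## opening tag and ends on a line containing ##</object>##, as in the grep output shown earlier; ##extract_objects## is a hypothetical helper name.

{{code language="shell"}}
# Print every <object> of the given class, from its opening line to its
# closing </object> line. extract_objects is a hypothetical helper.
extract_objects() {
    class=$1
    file=${2:-entities.xml}
    # The closing quote in the pattern keeps "Page" from matching "PageTemplate"
    awk -v start="<object class=\"$class\"" '
        index($0, start)               { keep = 1 }
        keep                           { print }
        keep && index($0, "</object>") { keep = 0 }
    ' "$file"
}
{{/code}}

For example, ##extract_objects Page entities.xml > pages.xml## keeps only the Page objects. The result has no root element, so it is meant for reading and searching, not for importing.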

= Open entities.xml in a Web browser =

If your entities.xml is big (from a few kilobytes to a few megabytes) but not huge (empirically, under 20 MB), you can consider opening it in Firefox.

This will allow you to query or clean up the file using the JavaScript DOM API, which is quite comfortable as well. For instance:

Get all Page objects:

{{code language="javascript"}}
document.querySelectorAll("object.Page")
{{/code}}

Remove all ConfluenceBandanaRecord objects:

{{code language="javascript"}}
// forEach is borrowed from Array.prototype so that it works on the NodeList
[].forEach.call(document.querySelectorAll("object.ConfluenceBandanaRecord"), o => o.remove())
{{/code}}

{{info}}
Tip: you can use this to clean up a Confluence package to build a test case, by removing all the irrelevant objects (for example, space permissions). Then, you can save the result. You'll probably have to do some manual cleanup after saving, but this is a very efficient way to build a test case from an actual Confluence export.
{{/info}}

Print the titles of all current pages:

{{code language="javascript"}}
[].filter.call(
    document.querySelectorAll("object.Page"),
    // keep only the pages whose contentStatus property is "current"
    o => o.querySelector("[name='contentStatus']").textContent == 'current'
).map(o => o.querySelector("[name='title']").textContent).join("\n")
{{/code}}
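
For exports too large for the browser, roughly the same listing can be produced on the command line. The following is a minimal sketch, assuming the ##title## property is serialized on a single line as a CDATA section, like the other string properties in the grep output shown earlier. That layout is an assumption about the export, and ##current_page_titles## is a hypothetical helper name.

{{code language="shell"}}
# Print the title of every Page object whose contentStatus is "current".
# current_page_titles is a hypothetical helper; the one-line CDATA layout of
# the title property is an assumption about the export.
current_page_titles() {
    awk '
        index($0, "<object class=\"Page\"") { in_page = 1; title = ""; current = 0 }
        in_page {
            start = index($0, "<property name=\"title\"><![CDATA[")
            if (start) {
                rest = substr($0, start + 32)    # skip the 32-character opening
                stop = index(rest, "]]></property>")
                if (stop) title = substr(rest, 1, stop - 1)
            }
            if (index($0, "<property name=\"contentStatus\"><![CDATA[current]]>")) current = 1
        }
        in_page && index($0, "</object>") {
            if (current) print title
            in_page = 0
        }
    ' "${1:-entities.xml}"
}
{{/code}}

Unlike the DOM version, this relies on the line layout of the file rather than on its XML structure, so the result is worth spot-checking against a few known pages.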

= Use XQuery (and XSL) =

If your entities.xml is big but not huge, you can use XQuery, a query language tailored for XML, to get information out of your package quite efficiently. Tools for this include [[Xidel>>https://www.videlibri.de/xidel.html]] (written in Pascal, likely packaged in your distribution's repositories if you are using one of the big Linux distros) and [[Xee>>https://github.com/Paligo/xee]] (written in Rust, which you can download from its releases page).

{{warning}}
While XQuery might seem appealing, it doesn't currently seem suitable for parsing huge files. Some queries could theoretically be evaluated without putting the whole XML tree in memory, but neither Xidel nor Xee seems to attempt this, so you'll need enough free RAM to hold the whole XML tree of entities.xml. The evaluation itself can also be slow, and you are often best served by grep or a suitable text editor.
{{/warning}}

With this out of the way, if you are indeed able to use XQuery, it is nice because it allows you to script and document your operations and make them reproducible. It lets you automate a lot of things that you would otherwise need to do manually and repeatedly, which can be tedious.

We will not explain XQuery here; a dedicated tutorial is more suitable for this. However, the following examples will help you get started with XQuery on Confluence packages.

{{info}}
These examples are for Xidel. Put the query in some ##query.xquery## file next to the ##entities.xml## file you want to query and invoke ##xidel## like this:

{{code language="none"}}
xidel entities.xml --extract-kind=xquery --extract-file query.xquery --output-format=xml
{{/code}}
{{/info}}

{{info}}
If you want to use Xee instead, run ##xee repl## (consider using rlwrap as well so you have history and more comfortable command editing inside the Xee REPL), and then type ##!load entities.xml##. You can then input your XQuery queries. The examples below will need to be adjusted a bit.
{{/info}}

Get the object of a page with a specific id:

{{code language="xquery"}}
(
  for $x in doc("entities.xml")/hibernate-generic/object
  where $x/id = "71150445"
  return $x
)[position() = 1]
{{/code}}

Get the ids of the historical versions of a page:

{{code language="xquery"}}
for $x in doc("entities.xml")/hibernate-generic/object
where $x/id = "71150445"
return $x/collection[@name='historicalVersions']//id/text()
{{/code}}

Or:

{{code language="xquery"}}
for $x in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return $x
{{/code}}

Or, to get the version number of each historical version instead of its id:

{{code language="xquery"}}
for $revId in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return (
  for $x in doc("entities.xml")/hibernate-generic/object[id/text()=$revId]
  return data($x/property[@name="version"])
)
{{/code}}

You can structure the output a bit:

{{code language="xquery"}}
for $revId in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return (
  for $x in doc("entities.xml")/hibernate-generic/object[id/text()=$revId]
  return <page><version>{data($x/property[@name="version"]/text())}</version><id>{data($x/id/text())}</id></page>
)
{{/code}}

{{info}}
XSL would be another way to query and manipulate your entities.xml files. We haven't explored this yet.
{{/info}}
