levalencia committed on
Commit
924cb7d
·
1 Parent(s): 26b3eb7

Add extraction strategy selection and unique indices handling in app.py and FieldMapperAgent. Enhance Planner to accommodate new parameters for execution plans. Update AzureDIService to process tables and save extracted content in both markdown and JSON formats for improved traceability.

logs/di_content/di_content_20250605_114836.txt ADDED
@@ -0,0 +1,1608 @@
1
+ # ARGX DISCOVERY: EVALUATION OF LIABILITIES IN VH AND VL REGION OF ONE CONSTRUCT
2
+
3
+ Argenx - P3016_R010_v00
4
+
5
+
6
+ <figure>
7
+
8
+ RIC
9
+
10
+ biologics
11
+
12
+ </figure>
13
+
14
+
15
+ <!-- PageBreak -->
16
+
17
+
18
+ <figure>
19
+ </figure>
20
+
21
+
22
+ <!-- PageHeader="Table of contents" -->
23
+
24
+ · Project information
+ · Scope
+ · Test samples
+ · Method
+ · Results
+   · P018_3D6 VHO VL6 _hIgG1_LALAPG_F405L-FJB_2024-04-18_002
+ · Conclusions
40
+
41
+
42
+ <figure>
43
+
44
+ RIC
45
+
46
+ </figure>
47
+
48
+
49
+ <!-- PageFooter="biologics" -->
50
+ <!-- PageNumber="2" -->
51
+ <!-- PageBreak -->
52
+
53
+
54
+ <figure>
55
+ </figure>
56
+
57
+
58
+ ## Project information
59
+
67
+
68
+
69
+ <table>
70
+ <tr>
71
+ <td>Sales quote:</td>
72
+ <td>SQ20202722</td>
73
+ </tr>
74
+ <tr>
75
+ <td>Project code:</td>
76
+ <td>P3016</td>
77
+ </tr>
78
+ <tr>
79
+ <td>LNB number:</td>
80
+ <td>2023.040</td>
81
+ </tr>
82
+ <tr>
83
+ <td>Project responsible:</td>
84
+ <td>Nathan Cardon</td>
85
+ </tr>
86
+ <tr>
87
+ <td>Report name:</td>
88
+ <td>P3016_R010_v01</td>
89
+ </tr>
90
+ </table>
91
+
92
+
93
+ <figure>
94
+
95
+ RIC
96
+
97
+ </figure>
98
+
99
+
100
+ <!-- PageFooter="biologics" -->
101
+ <!-- PageNumber="3" -->
102
+ <!-- PageBreak -->
103
+
104
+
105
+ <figure>
106
+ </figure>
107
+
108
+
109
+ ## Scope
110
+
111
+ This report describes the results of a liability assessment of one argenx discovery construct. Non-stressed and temperature-stressed samples stored for multiple weeks at 37℃ were evaluated by reduced protein RPLC-UV-MS and peptide map analysis. The focus of both analyses was on the variable regions of the construct.
116
+
117
+
118
+ <figure>
119
+
120
+ RIC
121
+
122
+ </figure>
123
+
124
+
125
+ <!-- PageFooter="biologics" -->
126
+ <!-- PageNumber="4" -->
127
+ <!-- PageBreak -->
128
+
129
+
130
+ <figure>
131
+ </figure>
132
+
133
+
134
+ ### Test samples
135
+
136
+ The samples used in this study are listed below.
137
+
138
+
139
+ <table>
140
+ <tr>
141
+ <th>Construct</th>
142
+ <th>Stress condition</th>
143
+ <th>Concentration (mg/mL)</th>
144
+ </tr>
145
+ <tr>
146
+ <td rowspan="3">P018_3D6 VHO VL6 _hIgG1_LALAPG_F405L-FJB_2024-04-18_002</td>
147
+ <td>T0W</td>
148
+ <td>1.00</td>
149
+ </tr>
150
+ <tr>
151
+ <td>T2W_37℃</td>
152
+ <td>1.00</td>
153
+ </tr>
154
+ <tr>
155
+ <td>T4W_37℃</td>
156
+ <td>1.00</td>
157
+ </tr>
158
+ </table>
159
+
160
+
161
+ <figure>
162
+
163
+ RIC
164
+
165
+ </figure>
166
+
167
+
168
+ <!-- PageFooter="biologics" -->
169
+ <!-- PageNumber="5" -->
170
+ <!-- PageBreak -->
171
+
172
+
173
+ <figure>
174
+ </figure>
175
+
176
+
177
+ ## Method: Reduced protein analysis by RPLC-UV-MS
178
+
179
+ · The samples were reduced by incubation with DTT under denaturing conditions. The samples were analyzed using a C8 RPLC column on a 1290 Infinity UHPLC system coupled to a 6540 Q-TOF mass spectrometer (both from Agilent Technologies).
+
+ · RPLC was performed with trifluoroacetic acid (TFA) as ion-pairing additive, and with H2O and acetonitrile as mobile phases.
+
+ · Data acquisition and processing were performed with BioConfirm MassHunter 7.0 (Agilent Technologies). UV 280 nm and MS data were acquired simultaneously.
188
+
189
+
190
+ <figure>
191
+
192
+ RIC
193
+
194
+ </figure>
195
+
196
+
197
+ <!-- PageFooter="biologics" -->
198
+ <!-- PageNumber="6" -->
199
+ <!-- PageBreak -->
200
+
201
+
202
+ <figure>
203
+ </figure>
204
+
205
+
206
+ ## Method: Peptide map analysis in reducing conditions
207
+
208
+ · Prior to digestion, the samples were reduced using dithiothreitol (DTT) and alkylated with iodoacetamide (IAA). The samples were digested using trypsin as protease.
+
+ · The digests were analyzed by RPLC-MS using a C18 RPLC column. RPLC was performed with formic acid (FA) as additive, and with H2O and acetonitrile as mobile phases. The analyses were performed on a 1290 Infinity UHPLC system (Agilent Technologies) coupled to a 6545 Q-TOF mass spectrometer (Agilent Technologies) operated in MS and MS/MS mode.
+
+ · Data processing was performed using BioConfirm 10.0 and MassHunter 7.0 (Agilent Technologies).
+
+ · Measured signals were matched onto the sequence. Identification was based primarily on MS-only data. The enzyme specified was trypsin (C-terminal cleavage at lysine or arginine) and 0-2 missed cleavages were allowed. N-terminal cyclization (pyroglutamate from E/Q), D isomerization, N/Q deamidation, M/W oxidation and N-glycosylation were considered as variable modifications, while cysteine carbamidomethylation (sample preparation related) was considered as a fixed modification. Peak areas from extracted ion chromatograms (EICs) were used for quantifying modifications.
+
+ · Note that peptide map data processing was focused on the variable regions of the constructs.
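
The digestion rule stated in this method (cleavage C-terminal to lysine or arginine, 0-2 missed cleavages) can be illustrated with a small in-silico digest. This is only a sketch and not the BioConfirm processing used for the report. The names `tryptic_digest` and `HC_1_125` are hypothetical, and the sequence is the first 125 heavy-chain residues, transcribed so that positions agree with the peptide tables in the results (for example NTLYLQMNSLR at HC(77-87)).

```python
import re

# First 125 residues of the heavy chain (assumed transcription, see lead-in).
HC_1_125 = (
    "EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSLSWVRQAPGKGLEWVSTIKARRGTTLYADSVKDRFTISR"
    "DNSKNTLYLQMNSLRAEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK"
)


def tryptic_digest(sequence, max_missed=2):
    """Return (start, end, peptide) tuples, 1-based and inclusive.

    Cleaves after every K or R; no proline exception is applied,
    matching the simple rule quoted in the method text.
    """
    # Split after each K/R; the lookbehind keeps the K/R on the left piece.
    pieces = [p for p in re.split(r"(?<=[KR])", sequence) if p]

    # 1-based start position of each fully cleaved piece.
    starts = []
    position = 1
    for piece in pieces:
        starts.append(position)
        position += len(piece)

    peptides = []
    for i in range(len(pieces)):
        # Join 1..(max_missed + 1) consecutive pieces.
        for missed in range(max_missed + 1):
            j = i + missed
            if j >= len(pieces):
                break
            peptide = "".join(pieces[i:j + 1])
            peptides.append((starts[i], starts[i] + len(peptide) - 1, peptide))
    return peptides


if __name__ == "__main__":
    for start, end, peptide in tryptic_digest(HC_1_125, max_missed=0):
        print(f"HC({start}-{end}) {peptide}")
```

Under this simple rule, the HC(88-125) peptide reported in the results appears as a peptide with one missed cleavage at K98.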
230
+
231
+
232
+ <figure>
233
+
234
+ RIC
235
+
236
+ </figure>
237
+
238
+
239
+ <!-- PageFooter="biologics" -->
240
+ <!-- PageNumber="7" -->
241
+ <!-- PageBreak -->
242
+
243
+ P018_3D6 VHO VL6_HIGG1_LALAPG
244
+
245
+ <!-- PageBreak -->
246
+
247
+
248
+ <figure>
249
+ </figure>
250
+
251
+
252
+ ## AA sequence of P018_3D6 VHO VL6 _hIgG1_LALAPG
253
+
254
+ AA sequence of VH (MW (G0F): 50306.86 Da):
255
+
256
+ EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSLSWVRQAPGKGLEWVSTIKARRGTTLYADSVKDRFTISR
257
+ DNSKNTLYLQMNSLRAEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTKGPSVFPLAPSSKSTSGGT
258
+ AALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKV
259
+ DKKVEPKSCDKTHTCPPCPAPEAAGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDG
260
+ VEVHNAKTKPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALGAPIEKTISKAKGQPREPQVYTLP
261
+ PSRDELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFLLYSKLTVDKSRWQQGNVF
262
+ SCSVMHEALHNHYTQKSLSLSPGK
263
+
264
+ AA sequence of VL (MW: 23376.19Da):
265
+
266
+ DIQMTQSPSSLSASVGDRVTITCQASQSISSYLAWYQQKPGKAPKLLIYGGSRLQTGVPSRFSGSGSGTDFT
267
+ LTISSLQPEDFATYYCQQDYSWPLTFGQGTKVEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAK
268
+ VQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC
269
+
270
+
271
+ <figure>
272
+
273
+ RIC
274
+
275
+ <!-- PageFooter="biologics" -->
276
+
277
+ </figure>
278
+
279
+
280
+ <table>
281
+ <tr>
282
+ <td>Blue:</td>
283
+ <td>VH and VL</td>
284
+ </tr>
285
+ <tr>
286
+ <td>Blue:</td>
287
+ <td>CDR</td>
288
+ </tr>
289
+ <tr>
290
+ <td>Green:</td>
291
+ <td>N-glycosylation site</td>
292
+ </tr>
293
+ </table>
294
+
295
+
296
+ <!-- PageNumber="9" -->
297
+ <!-- PageBreak -->
298
+
299
+
300
+ <figure>
301
+ </figure>
302
+
303
+
304
+ ## Results: reduced protein RPLC-UV-MS
305
+
306
+
307
+ ### Overlays of the UV 214 nm profiles
311
+
312
+
313
+ <figure>
+
+ UV 214 nm overlay: DAD1 - A:Sig=214.0,8.0 Ref=360.0,100.0 P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d
+
+ Traces: T0W, T2W_37℃, T4W_37℃
+
+ Main peaks: LC, HC
+
+ Peak annotations (in the order extracted from the figure):
+ a: SG clip HC (E1-S140)
+ GG clip HC (E1-G141)
+ 1: LC + deamidation
+ 2: RG clip HC (G51-G450)
+ b: LC + deamidation
+ NL clip HC (L104-G450)
+ c: PK clip HC E1-P221
+ 3: HC isomer
+ LC
+ 4: unknown
+ d & e: LC
+ 5: HC
+ LC + oxidation (1x, 2x)
+ 6: HC + Deamidation
+ f: LC + LC glycation (1x, 2x)
+ 7: HC + PyroE
+ DR clip LC (R18-C224)
+ DP clip HC (E1-D274)
+ g: LC
+ 8: CD clip HC (E1-C224)
+ 9: Thioether HC+LC, (73651.6 Da)
+ DG Clip HC: (E1-D284)
+
+ Response Units (×10^3) vs. Acquisition Time (min), 6 to 22 min
+
+ </figure>
514
+
515
+ *T0 chromatogram aligned to facilitate data interpretation.
516
+
517
+
518
+ <figure>
519
+
520
+ RIC
521
+
522
+ </figure>
523
+
524
+
525
+ <!-- PageFooter="biologics" -->
526
+
527
+ Color legend:
528
+ Black: clear identity on RPLC-UV-MS
529
+
530
+ Blue: modification with clear trend visible on RPLC-UV-MS
531
+ Blue: modification in variable region also confirmed in
532
+ peptide mapping
533
+
534
+ Grey: most likely method induced modification
535
+
536
+ <!-- PageNumber="10" -->
537
+ <!-- PageBreak -->
538
+
539
+
540
+ <figure>
541
+ </figure>
542
+
543
+
544
+ ## Results: reduced protein RPLC-UV-MS
545
+
546
+
547
+ ### Deconvoluted spectra of HC and LC
548
+
549
+ LC: +ESI Scan (rt: 13.0-13.7 min, 45 scans) Frag=200.0V P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d Deconvoluted (Isotope Width=0.0)
+
+ Theoretical mass LC: 23376.2 Da; labeled peak: 23377.04
+
+ Other labeled masses: 15585.42, 18295.12, 28052.48, 34559.61
+
+ HC: +ESI Scan (rt: 14.3-15.3 min, 58 scans) Frag=200.0V P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d Deconvoluted (Isotope Width=0.0)
+
+ <figure>
+
+ Theoretical mass HC (G0F): 50306.9 Da; labeled G0F peak: 50308.16
+
+ Other labeled masses: 53210.42, 57837.17
+
+ Counts vs. Deconvoluted Mass (amu), 12000 to 58000
+
+ </figure>
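
The agreement between the labeled deconvoluted masses and the theoretical masses can be expressed as a mass error. The snippet below is a minimal sketch: the function name `mass_error` is hypothetical, the theoretical values are the ones given in the AA sequence section (23376.19 Da for LC, 50306.86 Da for HC with G0F), and the ppm formula is the usual (observed - theoretical) / theoretical × 10^6, which the report itself does not state.

```python
def mass_error(observed, theoretical):
    """Return (absolute error in Da, relative error in ppm)."""
    delta = observed - theoretical
    ppm = delta / theoretical * 1e6
    return delta, ppm


if __name__ == "__main__":
    # Labeled deconvoluted masses vs. theoretical masses from the report.
    for name, observed, theoretical in [
        ("LC", 23377.04, 23376.19),
        ("HC (G0F)", 50308.16, 50306.86),
    ]:
        delta, ppm = mass_error(observed, theoretical)
        print(f"{name}: {delta:+.2f} Da ({ppm:+.1f} ppm)")
```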
625
+
626
+
627
+ <figure>
628
+
629
+ RIC
630
+
631
+ </figure>
632
+
633
+
634
+ <!-- PageFooter="biologics" -->
635
+ <!-- PageNumber="11" -->
636
+ <!-- PageBreak -->
637
+
638
+
639
+ <figure>
640
+ </figure>
641
+
642
+
643
+ ## Results: tryptic peptide map
644
+
645
+
646
+ ### N-terminal and C-terminal modifications for HC and LC
647
+
648
+
649
+ <table>
650
+ <tr>
651
+ <th colspan="6">Relative quantification by EIC (%)</th>
652
+ </tr>
653
+ <tr>
654
+ <th>AA Sequence</th>
655
+ <th>Seq Loc</th>
656
+ <th>Modification</th>
657
+ <th>T0</th>
658
+ <th>T2W_37℃</th>
659
+ <th>T4W_37℃</th>
660
+ </tr>
661
+ <tr>
662
+ <td rowspan="2">EVQLLESGGGLVQPGGSLR</td>
663
+ <td rowspan="2">HC(1-19)</td>
664
+ <td></td>
665
+ <td>99.0</td>
666
+ <td>96.1</td>
667
+ <td>93.0</td>
668
+ </tr>
669
+ <tr>
670
+ <td>PyroE</td>
671
+ <td>1.0</td>
672
+ <td>3.9</td>
673
+ <td>7.0</td>
674
+ </tr>
675
+ </table>
676
+
677
+
678
+ ### Isomerization events in the variable parts of HC and LC
679
+
680
+
681
+ <table>
682
+ <tr>
683
+ <th colspan="6">Relative quantification by EIC (%)</th>
684
+ </tr>
685
+ <tr>
686
+ <th>AA Sequence</th>
687
+ <th>Seq Loc</th>
688
+ <th>Modification</th>
689
+ <th>T0</th>
690
+ <th>T2W_37℃</th>
691
+ <th>T4W_37℃</th>
692
+ </tr>
693
+ <tr>
694
+ <td rowspan="2">AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
695
+ <td rowspan="2">HC(88-125)</td>
696
+ <td></td>
697
+ <td>99.9</td>
698
+ <td>99.7</td>
699
+ <td>99.5</td>
700
+ </tr>
701
+ <tr>
702
+ <td>Isomerization</td>
703
+ <td>0.1</td>
704
+ <td>0.3</td>
705
+ <td>0.5</td>
706
+ </tr>
707
+ </table>
708
+
709
+
710
+ For information purposes the following color code is applied: values showing an increase or decrease of between 1.0% and 10.0% compared to the T0 sample are in blue and underlined, whereas values showing an increase of more than 10.0% compared to T0 are in red, bold and underlined.
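
The color-code rule above can be written as a small classifier. This is a sketch under one assumption that the report does not spell out: the thresholds are read as differences in percentage points between a stressed value and the T0 value. The function name `color_code` and the returned labels are hypothetical.

```python
def color_code(t0, stressed):
    """Classify a relative-quantification value against its T0 value.

    Assumes the 1.0% and 10.0% thresholds are percentage-point differences.
    """
    # Table values carry one decimal, so round away float noise.
    delta = round(stressed - t0, 1)
    if delta > 10.0:
        return "red"   # increase of more than 10.0%
    if 1.0 <= abs(delta) <= 10.0:
        return "blue"  # increase or decrease between 1.0% and 10.0%
    # The rule does not mention decreases above 10.0%; they fall through here.
    return "none"


if __name__ == "__main__":
    # PyroE on HC(1-19): T0 = 1.0, T2W = 3.9, T4W = 7.0 (values from the table).
    print(color_code(1.0, 3.9), color_code(1.0, 7.0))
```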
713
+
714
+
715
+ <figure>
716
+
717
+ RIC
718
+
719
+ </figure>
720
+
721
+
722
+ <!-- PageFooter="biologics" -->
723
+ <!-- PageNumber="12" -->
724
+ <!-- PageBreak -->
725
+
726
+
727
+ <figure>
728
+ </figure>
729
+
730
+
731
+ ## Results: tryptic peptide map
732
+
733
+
734
+ ### M/W oxidations in variable parts of HC and LC
735
+
736
+
737
+ <table>
738
+ <tr>
739
+ <th colspan="6">Relative quantification by EIC (%)</th>
740
+ </tr>
741
+ <tr>
742
+ <th>AA Sequence</th>
743
+ <th>Seq Loc</th>
744
+ <th>Modification</th>
745
+ <th>T0</th>
746
+ <th>T2W_37℃</th>
747
+ <th>T4W_37℃</th>
748
+ </tr>
749
+ <tr>
750
+ <td rowspan="2">NTLYLQMNSLR</td>
751
+ <td rowspan="2">HC(77-87)</td>
752
+ <td></td>
753
+ <td>99.5</td>
754
+ <td>99.5</td>
755
+ <td>99.5</td>
756
+ </tr>
757
+ <tr>
758
+ <td>Oxidation</td>
759
+ <td>0.5</td>
760
+ <td>0.5</td>
761
+ <td>0.5</td>
762
+ </tr>
763
+ <tr>
764
+ <td rowspan="2">DIQMTQSPSSLSASVGDR</td>
765
+ <td rowspan="2">LC(1-18)</td>
766
+ <td></td>
767
+ <td>99.5</td>
768
+ <td>99.5</td>
769
+ <td>99.5</td>
770
+ </tr>
771
+ <tr>
772
+ <td>Oxidation</td>
773
+ <td>0.5</td>
774
+ <td>0.5</td>
775
+ <td>0.5</td>
776
+ </tr>
777
+ </table>
778
+
779
+
783
+
784
+
785
+ <figure>
786
+
787
+ RIC
788
+
789
+ </figure>
790
+
791
+
792
+ <!-- PageFooter="biologics" -->
793
+ <!-- PageNumber="13" -->
794
+ <!-- PageBreak -->
795
+
796
+
797
+ <figure>
798
+ </figure>
799
+
800
+
801
+ ## Results: tryptic peptide map
802
+
803
+
804
+ ### N/Q deamidation in variable parts of HC and LC
805
+
806
+
807
+ <table>
808
+ <tr>
809
+ <th colspan="6">Relative quantification by EIC (%)</th>
810
+ </tr>
811
+ <tr>
812
+ <th>AA Sequence</th>
813
+ <th>Seq Loc</th>
814
+ <th>Modification</th>
815
+ <th>T0</th>
816
+ <th>T2W_37℃</th>
817
+ <th>T4W_37℃</th>
818
+ </tr>
819
+ <tr>
820
+ <td rowspan="2">EVQLLESGGGLVQPGGSLR</td>
821
+ <td rowspan="2">HC(1-19)</td>
822
+ <td></td>
823
+ <td>99.6</td>
824
+ <td>99.6</td>
825
+ <td>99.5</td>
826
+ </tr>
827
+ <tr>
828
+ <td>Deamidation</td>
829
+ <td>0.4</td>
830
+ <td>0.4</td>
831
+ <td>0.5</td>
832
+ </tr>
833
+ <tr>
834
+ <td rowspan="2">NTLYLQMNSLR*</td>
835
+ <td rowspan="2">HC(77-87)</td>
836
+ <td></td>
837
+ <td>99.0</td>
838
+ <td>98.8</td>
839
+ <td>98.6</td>
840
+ </tr>
841
+ <tr>
842
+ <td>Deamidation</td>
843
+ <td>1.0</td>
844
+ <td>1.2</td>
845
+ <td>1.4</td>
846
+ </tr>
847
+ <tr>
848
+ <td rowspan="2">AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
849
+ <td rowspan="2">HC(88-125)</td>
850
+ <td></td>
851
+ <td>98.7</td>
852
+ <td>96.1</td>
853
+ <td>93.4</td>
854
+ </tr>
855
+ <tr>
856
+ <td>Deamidation</td>
857
+ <td>1.3</td>
858
+ <td>3.9</td>
859
+ <td>6.6</td>
860
+ </tr>
861
+ <tr>
862
+ <td rowspan="2">VTITCQASQSISSYLAWYQQKPGK</td>
863
+ <td rowspan="2">LC(19-42)</td>
864
+ <td></td>
865
+ <td>99.9</td>
866
+ <td>99.8</td>
867
+ <td>99.5</td>
868
+ </tr>
869
+ <tr>
870
+ <td>Deamidation</td>
871
+ <td>0.1</td>
872
+ <td>0.2</td>
873
+ <td>0.5</td>
874
+ </tr>
875
+ </table>
876
+
877
+ *MS/MS confirmed the deamidation site as N84, see next slides.
878
+
879
+
883
+
884
+
885
+ <figure>
886
+
887
+ RIC
888
+
889
+ </figure>
890
+
891
+
892
+ <!-- PageFooter="biologics" -->
893
+ <!-- PageNumber="14" -->
894
+ <!-- PageBreak -->
895
+
896
+
897
+ <figure>
898
+ </figure>
899
+
900
+
901
+ ## Results: tryptic peptide map
902
+
904
+
905
+
906
+ ### HC(77-87) deamidation site confirmation via MS/MS
907
+
908
+ N77:
909
+
910
+
911
+ <table>
912
+ <tr>
+ <th>b</th>
+ <th colspan="3">y</th>
+ </tr>
+ <tr>
+ <td>--- 1</td>
+ <td>N(+0.984016)</td>
+ <td>11</td>
+ <td>---</td>
+ </tr>
923
+ <tr>
924
+ <td>217.0819 2</td>
925
+ <td>T</td>
926
+ <td>10</td>
927
+ <td>1238.6562</td>
928
+ </tr>
929
+ <tr>
930
+ <td>330.1660 3</td>
931
+ <td>L</td>
932
+ <td>9</td>
933
+ <td>1137.6085</td>
934
+ </tr>
935
+ <tr>
936
+ <td>493.2293 4</td>
937
+ <td>Y</td>
938
+ <td>8</td>
939
+ <td>1024.5244</td>
940
+ </tr>
941
+ <tr>
942
+ <td>606.3134 5</td>
943
+ <td>L</td>
944
+ <td>7</td>
945
+ <td>861.4611</td>
946
+ </tr>
947
+ <tr>
948
+ <td>734.3719 6</td>
949
+ <td>Q</td>
950
+ <td>6</td>
951
+ <td>748.3770</td>
952
+ </tr>
953
+ <tr>
954
+ <td>865.4124 7</td>
955
+ <td>M</td>
956
+ <td>5</td>
957
+ <td>620.3185</td>
958
+ </tr>
959
+ <tr>
960
+ <td>979.4553 8</td>
961
+ <td>N</td>
962
+ <td>4</td>
963
+ <td>489.2780</td>
964
+ </tr>
965
+ <tr>
966
+ <td>1066.4874 9</td>
967
+ <td>S</td>
968
+ <td>3</td>
969
+ <td>375.2350</td>
970
+ </tr>
971
+ <tr>
972
+ <td>1179.5714 10</td>
973
+ <td>L</td>
974
+ <td>2</td>
975
+ <td>288.2030</td>
976
+ </tr>
977
+ <tr>
978
+ <td>--- 11</td>
979
+ <td>R</td>
980
+ <td>1</td>
981
+ <td>175.1190</td>
982
+ </tr>
983
+ </table>
984
+
985
+
986
+ <figure>
987
+
988
+ N84:
989
+
990
+ Visible
991
+ when
992
+ zooming
993
+
994
+ </figure>
995
+
996
+
997
+ <table>
998
+ <tr>
999
+ <th>b</th>
1000
+ <th colspan="3">y</th>
1001
+ </tr>
1002
+ <tr>
1003
+ <td>--- 1</td>
1004
+ <td>N</td>
1005
+ <td>11</td>
1006
+ <td>---</td>
1007
+ </tr>
1008
+ <tr>
1009
+ <td>216.0979 2</td>
1010
+ <td>T</td>
1011
+ <td>10</td>
1012
+ <td>1239.6402</td>
1013
+ </tr>
1014
+ <tr>
1015
+ <td>329.1819 3</td>
1016
+ <td>L</td>
1017
+ <td>9</td>
1018
+ <td>1138.5925</td>
1019
+ </tr>
1020
+ <tr>
1021
+ <td>492.2453 4</td>
1022
+ <td>Y</td>
1023
+ <td>8</td>
1024
+ <td>1025.5084</td>
1025
+ </tr>
1026
+ <tr>
1027
+ <td>605.3293 5</td>
1028
+ <td>L</td>
1029
+ <td>7</td>
1030
+ <td>862.4451</td>
1031
+ </tr>
1032
+ <tr>
1033
+ <td>733.3879 6</td>
1034
+ <td>Q</td>
1035
+ <td>6</td>
1036
+ <td>749.3611</td>
1037
+ </tr>
1038
+ <tr>
1039
+ <td>864.4284 7</td>
1040
+ <td>M</td>
1041
+ <td>5</td>
1042
+ <td>621.3025</td>
1043
+ </tr>
1044
+ <tr>
1045
+ <td>979.4553 8</td>
1046
+ <td>N(+0.984016)</td>
1047
+ <td>4</td>
1048
+ <td>490.2620</td>
1049
+ </tr>
1050
+ <tr>
1051
+ <td>1066.4874 9</td>
1052
+ <td>S</td>
1053
+ <td>3</td>
1054
+ <td>375.2350</td>
1055
+ </tr>
1056
+ <tr>
1057
+ <td>1179.5714 10</td>
1058
+ <td>L</td>
1059
+ <td>2</td>
1060
+ <td>288.2030</td>
1061
+ </tr>
1062
+ <tr>
1063
+ <td>--- 11</td>
1064
+ <td>R</td>
1065
+ <td>1</td>
1066
+ <td>175.1190</td>
1067
+ </tr>
1068
+ </table>
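
The two fragment tables can be reproduced from first principles, which also shows why the MS/MS data distinguish the two candidate sites: placing the +0.984016 Da deamidation shift on N77 moves the b ions, while placing it on N84 moves y4 and higher y ions. The sketch below uses standard monoisotopic residue masses and singly charged b/y ions; the function name `fragment_ions` and the position-keyed `mods` argument are hypothetical and not part of any vendor software.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
DEAMIDATION = 0.984016  # mass shift used in the tables above


def fragment_ions(peptide, mods=None):
    """Return ({i: b_i}, {i: y_i}) m/z values for singly charged ions.

    mods maps a 1-based position in the peptide to a mass shift in Da.
    """
    mods = mods or {}
    masses = [MONO[aa] + mods.get(i, 0.0)
              for i, aa in enumerate(peptide, start=1)]
    n = len(masses)
    b = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    y = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y


if __name__ == "__main__":
    peptide = "NTLYLQMNSLR"  # HC(77-87); N77 is position 1, N84 is position 8
    for label, pos in [("N77", 1), ("N84", 8)]:
        b, y = fragment_ions(peptide, {pos: DEAMIDATION})
        print(label, f"b2={b[2]:.4f}", f"y4={y[4]:.4f}", f"y10={y[10]:.4f}")
```

Comparing the printed values with the labeled peaks in the spectrum (for example 216.0979 and 1138.5930) is one way to check which table the observed fragments follow.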
1069
+
1070
+
1071
+ +ESI Product Ion (rt: 20.9 min) Frag=175.0V [email protected] (677.3463[z=2] -> ** ) P3016_PM_09OCT2024_R_MS2_P018_3D6_2024-04-18_002_T4w-2.d
1072
+
1073
+
1074
+ <figure>
+
+ MS/MS product ion spectrum; labeled fragment m/z values: 216.0979, 329.1827, 375.2330, 447.2224, 492.2425, 621.3030, 749.3614, 813.6398, 862.4438, 1025.5063, 1138.5930
+
+ Counts (x10^3) vs. Mass-to-Charge (m/z), 150 to 1150
+
+ </figure>
1155
+
1156
+
1157
+ <figure>
1158
+
1159
+ RIC
1160
+
1161
+ biologics
1162
+
1163
+ </figure>
1164
+
1165
+
1166
+ <!-- PageFooter="MS/MS data confirms N84 as the deamidation site." -->
1167
+ <!-- PageNumber="15" -->
1168
+ <!-- PageBreak -->
1169
+
1170
+
1171
+ <figure>
1172
+ </figure>
1173
+
1174
+
1175
+ ## Results: tryptic peptide map
1176
+
1177
+
1178
+ ### Thioether bond
1179
+
1180
+
1181
+ <table>
1182
+ <tr>
1183
+ <th colspan="6">Relative quantification by EIC (%)</th>
1184
+ </tr>
1185
+ <tr>
1186
+ <th>AA Sequence</th>
1187
+ <th>Seq Loc</th>
1188
+ <th>Modification</th>
1189
+ <th>T0</th>
1190
+ <th>T2W_37℃</th>
1191
+ <th>T4W_37℃</th>
1192
+ </tr>
1193
+ <tr>
1194
+ <td>SCDK</td>
1195
+ <td>HC(223-226)</td>
1196
+ <td></td>
1197
+ <td rowspan="2">99.8</td>
1198
+ <td rowspan="2">99.5</td>
1199
+ <td rowspan="2">98.8</td>
1200
+ </tr>
1201
+ <tr>
1202
+ <td>SFNRGEC</td>
1203
+ <td>LC(208-214)</td>
1204
+ <td></td>
1205
+ </tr>
1206
+ <tr>
1207
+ <td>SFNRGEC - SCDK</td>
1208
+ <td>HC(223-226) + LC(208-214)</td>
1209
+ <td>Thioether</td>
1210
+ <td>0.2</td>
1211
+ <td>0.5</td>
1212
+ <td>1.2</td>
1213
+ </tr>
1214
+ </table>
1215
+
1216
+
1220
+
1221
+
1222
+ <figure>
1223
+
1224
+ RIC
1225
+
1226
+ </figure>
1227
+
1228
+
1229
+ <!-- PageFooter="biologics" -->
1230
+ <!-- PageNumber="16" -->
1231
+ <!-- PageBreak -->
1232
+
1233
+
1234
+ <figure>
1235
+ </figure>
1236
+
1237
+
1238
+ ## Results: tryptic peptide map
1239
+
1240
+
1241
+ ### Confirmation of clipping events via peptide mapping
1242
+
1243
+
1244
+ <table>
1245
+ <tr>
1246
+ <th colspan="6">Relative quantification by EIC (%)*</th>
1247
+ </tr>
1248
+ <tr>
1249
+ <th>AA Sequence</th>
1250
+ <th>Seq Loc</th>
1251
+ <th>Modification</th>
1252
+ <th>T0</th>
1253
+ <th>T2W_37℃</th>
1254
+ <th>T4W_37℃</th>
1255
+ </tr>
1256
+ <tr>
1257
+ <td>AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
1258
+ <td rowspan="3">HC(88-125)</td>
1259
+ <td></td>
1260
+ <td>98.6</td>
1261
+ <td>98.1</td>
1262
+ <td>96.8</td>
1263
+ </tr>
1264
+ <tr>
1265
+ <td>AEDTAVYYCAKPLYSN</td>
1266
+ <td>Clipping</td>
1267
+ <td>0.3</td>
1268
+ <td>0.4</td>
1269
+ <td>0.6</td>
1270
+ </tr>
1271
+ <tr>
1272
+ <td>LAGDFGSWGQGTTVTVSSASTK</td>
1273
+ <td>Clipping</td>
1274
+ <td>1.1</td>
1275
+ <td>1.5</td>
1276
+ <td>2.6</td>
1277
+ </tr>
1278
+ <tr>
1279
+ <td>STSGGTAALGCLVK</td>
1280
+ <td rowspan="3">HC(138-151)</td>
1281
+ <td></td>
1282
+ <td>99.9</td>
1283
+ <td>99.3</td>
1284
+ <td>98.7</td>
1285
+ </tr>
1286
+ <tr>
1287
+ <td>GGTAALGCLVK</td>
1288
+ <td>Clipping</td>
1289
+ <td>0.0</td>
1290
+ <td>0.2</td>
1291
+ <td>0.3</td>
1292
+ </tr>
1293
+ <tr>
1294
+ <td>GTAALGCLVK</td>
1295
+ <td>Clipping_2</td>
1296
+ <td>0.1</td>
1297
+ <td>0.5</td>
1298
+ <td>0.9</td>
1299
+ </tr>
1300
+ <tr>
1301
+ <td>SCDKTHTCPPCPAPEAAGGPSVFLFPPKPK</td>
1302
+ <td rowspan="2">HC(223-252)</td>
1303
+ <td></td>
1304
+ <td>99.6</td>
1305
+ <td>99.4</td>
1306
+ <td>99.0</td>
1307
+ </tr>
1308
+ <tr>
1309
+ <td>DKTHTCPPCPAPEAAGGPSVFLFPPKPK</td>
1310
+ <td>Clipping</td>
1311
+ <td>0.4</td>
1312
+ <td>0.6</td>
1313
+ <td>1.0</td>
1314
+ </tr>
1315
+ </table>
1316
+
1317
+ *Note that due to the differences in ionization efficiency between large and small peptides, the quantification of clipping events via peptide mapping is most likely an overestimation. Nevertheless, peptide mapping allows clipping events to be confirmed and possible trends to be discerned.
1320
+
1321
+
1325
+
1326
+
1327
+ <figure>
1328
+
1329
+ RIC
1330
+
1331
+ </figure>
1332
+
1333
+
1334
+ <!-- PageFooter="biologics" -->
1335
+ <!-- PageNumber="17" -->
1336
+ <!-- PageBreak -->
1337
+
1338
+
1339
+ <figure>
1340
+ </figure>
1341
+
1342
+
1343
+ ## P018_3D6 VHO VL6 _hIgG1_LALAPG: Conclusions
1344
+
1345
+ Some minor liabilities were found in the variable parts of the P018_3D6 VHO VL6 _hIgG1_LALAPG construct:
+
+ · On RPLC-UV-MS, some minor clipping events that increased during the temperature stress were observed. These clipping events were confirmed via peptide mapping. The most apparent one is the clipping between N103 and L104 in CDR3 of the HC, which increased slightly during temperature stress from 1.4% to 3.2%.
+
+ · A clear deamidation that increased during stress was found on peptide HC(88-125), AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK (at N103), where deamidation increased from 1.3% to 6.6% after 4 weeks at 37℃. This asparagine is part of the HC CDR3 domain. Some other minor deamidation events were also observed.
+
+ · Cyclization of the N-terminal glutamate on the HC increased up to 7.0% during temperature stress (observed in both RPLC-UV-MS and peptide map).
+
+ · Only minor oxidation events were identified in the variable regions of both the heavy and light chain.
+
+ · Both RPLC-UV-MS and peptide mapping confirmed the presence of a thioether-bonded HC+LC.
+
+ · Light chain glycation was clearly observed on RPLC-UV-MS (1x and 2x).
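
The 1.4% to 3.2% figure quoted for the N103/L104 clipping appears to match the sum of the two clipped fragments of HC(88-125) in the peptide mapping table (AEDTAVYYCAKPLYSN and LAGDFGSWGQGTTVTVSSASTK). That reading is an inference from the numbers, not something the report states. A minimal check, with hypothetical variable names:

```python
# Relative quantification by EIC (%) for the two clipped fragments of
# HC(88-125), copied from the clipping table: (N-terminal, C-terminal).
clipped_fragments = {
    "T0": (0.3, 1.1),
    "T2W_37C": (0.4, 1.5),
    "T4W_37C": (0.6, 2.6),
}

# Sum both fragments per time point; round to the one decimal used in the table.
totals = {
    condition: round(sum(values), 1)
    for condition, values in clipped_fragments.items()
}

if __name__ == "__main__":
    for condition, total in totals.items():
        print(f"{condition}: {total:.1f}%")
```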
1366
+
1367
+
1368
+ <figure>
1369
+
1370
+ RIC
1371
+
1372
+ </figure>
1373
+
1374
+
1375
+ <!-- PageFooter="biologics" -->
1376
+ <!-- PageNumber="18" -->
1377
+ <!-- PageBreak -->
1378
+
1379
+
1380
+ <figure>
1381
+ </figure>
1382
+
1383
+
1384
+ ## P018_3D6 VHO VL6 _hIgG1_LALAPG: Conclusions
1385
+
1386
+
1387
+ ### Graphical summary (T0->T4W)
1388
+
1389
+ Oxidation: 0.5%
1390
+
1391
+ Deamidation: 1.0 -> 1.4%
1392
+
1393
+ Deamidation: 1.3 -> 6.6%
1394
+
1395
+ PyroE: 1.0 -> 7.0%
1396
+
1398
+ VH
1399
+
1400
+ EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSLSWVRQAPGKGLEWVSTIKARRGTTLY
+ ADSVKDRFTISRDNSKNTLYLQM83N84SLRAEDTAVYYCAKPLYSN103L104AGDFGSWGQ
+ GTTVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTF
+ PAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHTCPPCPA
1404
+ PEAAGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAKT
1405
+ KPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALGAPIEKTISKAKGQPREPQVY
1406
+ TLPPSRDELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFLLYSKLT
1407
+ VDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLSPGK
1408
+
1409
+ Clipping: 1.4 -> 3.2%
1410
+
1411
+ G0F most abundant
1412
+
1414
+ VL
1415
+
1416
+ Oxidation: 0.5%
1417
+ DIQM4TQSPSSLSASVGDRVTITCQASQSISSYLAWYQQKPGKAPKLLIYGGSRLQTGVPS
+ RFSGSGSGTDFTLTISSLQPEDFATYYCQQDYSWPLTFGQGTKVEIKRTVAAPSVFIFPPSD
1419
+ EQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLS
1420
+ KADYEKHKVYACEVTHQGLSSPVTKSFNRGEC
1421
+
1422
+
1423
+ <table>
1424
+ <tr>
1425
+ <td>Blue:</td>
1426
+ <td>VH and VL</td>
1427
+ </tr>
1428
+ <tr>
1429
+ <td>Blue:</td>
1430
+ <td>CDR</td>
1431
+ </tr>
1432
+ <tr>
1433
+ <td>Green:</td>
1434
+ <td>N-glycosylation site</td>
1435
+ </tr>
1436
+ </table>
1437
+
1438
+
1439
+ <figure>
1440
+
1441
+ RIC
1442
+
1443
+ </figure>
1444
+
1445
+
1446
+ <!-- PageFooter="biologics" -->
1447
+
1451
+
1452
+ <!-- PageNumber="19" -->
1453
+ <!-- PageBreak -->
1454
+
1455
+
1456
+ <figure>
1457
+ </figure>
1458
+
1459
+
1460
+ ### Author section RIC
1461
+
1462
+
1463
+ #### Author
1464
+
1465
+ Nathan Cardon
1466
+ Senior research associate | Project Responsible
1467
+
1468
+
1469
+ <figure>
1470
+
1471
+ RIC
1472
+
1473
+ </figure>
1474
+
1475
+
1476
+ <!-- PageFooter="biologics" -->
1477
+ <!-- PageNumber="20" -->
1478
+ <!-- PageBreak -->
1479
+
1480
+
1481
+ <figure>
1482
+ </figure>
1483
+
1484
+
1485
+ ### Signature section RIC
1486
+
1487
+
1488
+ #### Reviewer
1489
+
1490
+
1491
+ <table>
1492
+ <tr>
1493
+ <td>Mabelle Meersseman</td>
1494
+ <td>Date:</td>
1495
+ </tr>
1496
+ <tr>
1497
+ <td>Group Leader</td>
1498
+ <td>Signature:</td>
1499
+ </tr>
1500
+ <tr>
1501
+ <td>Approver</td>
1502
+ <td></td>
1503
+ </tr>
1504
+ <tr>
1505
+ <td>Koen Sandra Ph.D.</td>
1506
+ <td>Date:</td>
1507
+ </tr>
1508
+ <tr>
1509
+ <td>CEO</td>
1510
+ <td>Signature:</td>
1511
+ </tr>
1512
+ </table>
1513
+
1514
+
1515
+ <figure>
1516
+
1517
+ RIC
1518
+
1519
+ </figure>
1520
+
1521
+
1522
+ <!-- PageFooter="biologics" -->
1523
+ <!-- PageNumber="21" -->
1524
+ <!-- PageBreak -->
1525
+
1526
+
1527
+ <figure>
1528
+ </figure>
1529
+
1530
+
1531
+ ### Signature section client
1532
+
1533
+ Approver
1534
+
1535
+ Name:
1536
+
1537
+ Date:
1538
+
1539
+ Signature:
1540
+
1541
+
1542
+ <figure>
1543
+
1544
+ RIC
1545
+
1546
+ </figure>
1547
+
1548
+
1549
+ <!-- PageFooter="biologics" -->
1550
+ <!-- PageNumber="22" -->
1551
+ <!-- PageBreak -->
1552
+
1553
+
1554
+ <figure>
1555
+ </figure>
1556
+
1557
+
1558
+ ### Version control
1559
+
1560
+
1561
+ <table>
1562
+ <tr>
1563
+ <th>Version</th>
1564
+ <th>Date of issue</th>
1565
+ <th>Reason for version update</th>
1566
+ </tr>
1567
+ <tr>
1568
+ <td>00</td>
1569
+ <td>20 November 2024</td>
1570
+ <td>Draft</td>
1571
+ </tr>
1572
+ <tr>
1573
+ <td></td>
1574
+ <td></td>
1575
+ <td></td>
1576
+ </tr>
1577
+ <tr>
1578
+ <td></td>
1579
+ <td></td>
1580
+ <td></td>
1581
+ </tr>
1582
+ </table>
1583
+
1584
+
1585
+ <figure>
1586
+
1587
+ RIC
1588
+
1589
+ </figure>
1590
+
1591
+
1592
+ <!-- PageFooter="biologics" -->
1593
+ <!-- PageNumber="23" -->
1594
+ <!-- PageBreak -->
1595
+
1596
+
1597
+ <figure>
1598
+
1599
+ RIC
1600
+
1601
+ biologics
1602
+
1603
+ </figure>
1604
+
1605
+
1606
+ YOUR MOLECULE. OUR ANALYTICS. NO SECRETS.
1607
+
1608
+ <!-- PageFooter="www.RIC-biologics.com" -->
src/agents/__pycache__/field_mapper_agent.cpython-312.pyc CHANGED
Binary files a/src/agents/__pycache__/field_mapper_agent.cpython-312.pyc and b/src/agents/__pycache__/field_mapper_agent.cpython-312.pyc differ
 
src/agents/field_mapper_agent.py CHANGED
@@ -266,9 +266,82 @@ class FieldMapperAgent(BaseAgent):
266
  self.logger.error(f"Error extracting field value from page: {str(e)}", exc_info=True)
267
  return None
268
 
269
  def execute(self, ctx: Dict[str, Any]): # noqa: D401
270
  field = ctx.get("current_field")
271
- self.logger.info(f"Starting field mapping for: {field}")
 
272
 
273
  # Store context for use in extraction methods
274
  self.ctx = ctx
@@ -286,9 +359,6 @@ class FieldMapperAgent(BaseAgent):
286
  text = ctx["text"]
287
  self.logger.info(f"Using text from direct context (length: {len(text)})")
288
 
289
- if not field:
290
- self.logger.warning("No field provided in context")
291
- return None
292
  if not text:
293
  self.logger.warning("No text content found in context or index")
294
  return None
@@ -297,28 +367,43 @@ class FieldMapperAgent(BaseAgent):
297
  if "document_context" not in ctx:
298
  ctx["document_context"] = self._infer_document_context(text)
299
 
300
- self.logger.info(f"Processing field: {field}")
301
  self.logger.info(f"Using document context: {ctx['document_context']}")
302
 
303
- # Process entire document at once
304
- self.logger.info("Processing entire document...")
305
- value = self._extract_field_value_from_page(field, text, ctx["document_context"])
306
- if value:
307
- return value
308
-
309
- # If no value found, try the search-based approach as fallback
310
- self.logger.warning("No value found in document analysis, falling back to search-based approach")
311
-
312
- if index and "embeddings" in index:
313
- self.logger.info("Using semantic search with embeddings")
314
- search_query = f"{field} in {ctx['document_context']}"
315
- similar_chunks = self._find_similar_chunks_search(search_query, index)
316
 
317
- if similar_chunks:
318
- self.logger.info(f"Found {len(similar_chunks)} relevant chunks, attempting value extraction")
319
- value = self._extract_field_value_search(field, similar_chunks, ctx["document_context"])
320
- if value:
321
- return value
322
-
323
- self.logger.warning(f"No candidate found for field: {field}")
324
- return f"<no candidate for {field}>"
 
266
  self.logger.error(f"Error extracting field value from page: {str(e)}", exc_info=True)
267
  return None
268
 
269
+ def _extract_with_unique_indices(self, text: str, context: str, unique_indices: List[str], fields_to_extract: List[str]) -> Optional[str]:
270
+ """Extract values using unique indices strategy."""
271
+ self.logger.info(f"Using unique indices strategy with indices: {unique_indices}")
272
+ self.logger.info(f"Fields to extract: {fields_to_extract}")
273
+
274
+ # Get filename from context if available
275
+ filename = self.ctx.get("pdf_meta", {}).get("filename", "")
276
+ filename_context = f"\nDocument filename: {filename}" if filename else ""
277
+
278
+ prompt = f"""You are an expert in {context}
279
+
280
+ Your task is to extract information from the document based on unique combinations of indices and their corresponding fields.
281
+
282
+ Unique Indices to look for: {', '.join(unique_indices)}
283
+ Fields to extract for each combination: {', '.join(fields_to_extract)}{filename_context}
284
+
285
+ Consider the following document:
286
+ {text}
287
+
288
+ Instructions:
289
+ 1. First, identify all unique combinations of the specified indices in the document
290
+ 2. For each unique combination found, extract the values for all specified fields
291
+ 3. Return the data in a tabular format where:
292
+ - Each row represents a unique combination
293
+ - Each column represents a field value
294
+ 4. Return ONLY the JSON value, no explanations
295
+ 5. Format the response as a valid JSON object with arrays for each field
296
+ 6. Keep the structure flat - do not nest values
297
+
298
+ Example response format:
299
+ {{
300
+ "index1": ["value1", "value2", "value3"],
301
+ "index2": ["value4", "value5", "value6"],
302
+ "field1": ["value7", "value8", "value9"],
303
+ "field2": ["value10", "value11", "value12"]
304
+ }}
305
+
306
+ Field values:"""
307
+
308
+ try:
309
+ self.logger.info("Calling LLM for unique indices extraction")
310
+
311
+ # Get cost tracker from context
312
+ cost_tracker = self.ctx.get("cost_tracker") if hasattr(self, 'ctx') else None
313
+
314
+ value = self.llm.responses(
315
+ prompt, temperature=0.0,
316
+ ctx={"cost_tracker": cost_tracker} if cost_tracker else None,
317
+ description="Unique Indices Field Extraction"
318
+ )
319
+
320
+ # Log cost tracking results if available
321
+ if cost_tracker:
322
+ self.logger.info(f"Unique indices extraction costs - Input tokens: {cost_tracker.llm_input_tokens}, Output tokens: {cost_tracker.llm_output_tokens}")
323
+ self.logger.info(f"Unique indices extraction cost: ${cost_tracker.calculate_current_file_costs()['openai']['total_cost']:.4f}")
324
+
325
+ if value and value.lower() not in ["none", "null", "n/a"]:
326
+ try:
327
+ json_value = json.loads(value)
328
+ self.logger.info(f"Successfully extracted values: {json.dumps(json_value, indent=2)}")
329
+ return json.dumps(json_value, indent=2)
330
+ except json.JSONDecodeError:
331
+ self.logger.error("Failed to parse LLM response as JSON")
332
+ return None
333
+ else:
334
+ self.logger.warning("LLM returned no valid value")
335
+ return None
336
+
337
+ except Exception as e:
338
+ self.logger.error(f"Error in unique indices extraction: {str(e)}", exc_info=True)
339
+ return None
340
+
341
  def execute(self, ctx: Dict[str, Any]): # noqa: D401
342
  field = ctx.get("current_field")
343
+ strategy = ctx.get("strategy", "original") # Default to original strategy
344
+ self.logger.info(f"Starting field mapping for: {field} using strategy: {strategy}")
345
 
346
  # Store context for use in extraction methods
347
  self.ctx = ctx
 
359
  text = ctx["text"]
360
  self.logger.info(f"Using text from direct context (length: {len(text)})")
361
 
 
 
 
362
  if not text:
363
  self.logger.warning("No text content found in context or index")
364
  return None
 
367
  if "document_context" not in ctx:
368
  ctx["document_context"] = self._infer_document_context(text)
369
 
 
370
  self.logger.info(f"Using document context: {ctx['document_context']}")
371
 
372
+ # Process based on selected strategy
373
+ if strategy == "unique_indices":
374
+ unique_indices = ctx.get("unique_indices", [])
375
+ fields_to_extract = ctx.get("fields_to_extract", [])
376
+
377
+ if not unique_indices or not fields_to_extract:
378
+ self.logger.warning("Missing unique indices or fields to extract")
379
+ return None
380
+
381
+ return self._extract_with_unique_indices(text, ctx["document_context"], unique_indices, fields_to_extract)
382
+ else:
383
+ # Original strategy
384
+ if not field:
385
+ self.logger.warning("No field provided in context")
386
+ return None
387
+
388
+ self.logger.info(f"Processing field: {field}")
389
+ self.logger.info("Processing entire document...")
390
+ value = self._extract_field_value_from_page(field, text, ctx["document_context"])
391
+ if value:
392
+ return value
393
+
394
+ # If no value found, try the search-based approach as fallback
395
+ self.logger.warning("No value found in document analysis, falling back to search-based approach")
396
+
397
+ if index and "embeddings" in index:
398
+ self.logger.info("Using semantic search with embeddings")
399
+ search_query = f"{field} in {ctx['document_context']}"
400
+ similar_chunks = self._find_similar_chunks_search(search_query, index)
401
+
402
+ if similar_chunks:
403
+ self.logger.info(f"Found {len(similar_chunks)} relevant chunks, attempting value extraction")
404
+ value = self._extract_field_value_search(field, similar_chunks, ctx["document_context"])
405
+ if value:
406
+ return value
407
 
408
+ self.logger.warning(f"No candidate found for field: {field}")
409
+ return f"<no candidate for {field}>"
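The prompt in `_extract_with_unique_indices` asks the model for a flat JSON object holding one array per index or field. A minimal sketch of how such a columnar payload could be checked and turned into rows follows. The helper name `columns_to_rows` is hypothetical and not part of this commit, and it assumes every column is meant to have the same length.

```python
import json
from typing import Any, Dict, List


def columns_to_rows(raw: str) -> List[Dict[str, Any]]:
    """Turn {"col": [v1, v2, ...], ...} into one dict per row.

    Hypothetical helper: assumes the LLM followed the flat, columnar
    format requested in the prompt.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of column arrays")
    if any(not isinstance(values, list) for values in data.values()):
        raise ValueError("every column must be an array")
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("every column must have the same length")
    n_rows = lengths.pop() if lengths else 0
    # Row i takes the i-th entry of every column.
    return [{name: values[i] for name, values in data.items()} for i in range(n_rows)]
```

Rejecting ragged columns early is a design choice: a row-wise table built from arrays of unequal length would silently misalign index values and field values.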
 
 
 
 
 
 
src/app.py CHANGED
@@ -238,6 +238,23 @@ else: # page == "Execution"
238
  fields_str = st.text_input("Fields (comma‑separated)", "Protein Lot, Chain, Residue")
239
  desc_blob = st.text_area("Field descriptions / rules (YAML, optional)")
240
 
241
  def flatten_json_response(json_data, fields):
242
  """Flatten the nested JSON response into a tabular structure with dynamic columns."""
243
  logger = logging.getLogger(__name__)
@@ -327,6 +344,8 @@ else: # page == "Execution"
327
  doc_preview=preview,
328
  fields=field_list,
329
  field_descs=field_descs,
 
 
330
  )
331
 
332
  # Add a visual separator
 
238
  fields_str = st.text_input("Fields (comma‑separated)", "Protein Lot, Chain, Residue")
239
  desc_blob = st.text_area("Field descriptions / rules (YAML, optional)")
240
 
241
+ # Add strategy selector
242
+ strategy = st.radio(
243
+ "Select Extraction Strategy",
244
+ ["Original Strategy", "Unique Indices Strategy"],
245
+ help="Original Strategy: extract each field separately from the document text. Unique Indices Strategy: extract all fields in one pass, returning one row per unique combination of index values."
246
+ )
247
+
248
+ # Add unique indices input if Unique Indices Strategy is selected
249
+ unique_indices = None
250
+ if strategy == "Unique Indices Strategy":
251
+ unique_indices_str = st.text_input(
252
+ "Unique Fields (comma-separated)",
253
+ help="Enter the field names that uniquely identify each record (e.g., 'timepoint, Modification, peptide')"
254
+ )
255
+ if unique_indices_str:
256
+ unique_indices = [idx.strip() for idx in unique_indices_str.split(",") if idx.strip()]
257
+
258
  def flatten_json_response(json_data, fields):
259
  """Flatten the nested JSON response into a tabular structure with dynamic columns."""
260
  logger = logging.getLogger(__name__)
 
344
  doc_preview=preview,
345
  fields=field_list,
346
  field_descs=field_descs,
347
+ strategy=strategy,
348
+ unique_indices=unique_indices
349
  )
350
 
351
  # Add a visual separator
src/orchestrator/__pycache__/planner.cpython-312.pyc CHANGED
Binary files a/src/orchestrator/__pycache__/planner.cpython-312.pyc and b/src/orchestrator/__pycache__/planner.cpython-312.pyc differ
 
src/orchestrator/planner.py CHANGED
@@ -37,6 +37,8 @@ class Planner:
37
  fields: List[str],
38
  doc_preview: str | None = None,
39
  field_descs: Dict | None = None,
 
 
40
  ) -> Dict[str, Any]:
41
  """Return a JSON dict representing the execution plan."""
42
 
@@ -45,9 +47,14 @@ class Planner:
45
  "doc_preview": doc_preview or "",
46
  "fields": fields,
47
  "field_descriptions": field_descs or {},
 
 
48
  }
49
 
50
  logger.info(f"Building plan for fields: {fields}")
 
 
 
51
  logger.debug(f"User context: {user_context}")
52
 
53
  prompt = self.prompt_template.format_json(**user_context)
@@ -71,8 +78,11 @@ class Planner:
71
  # ensure minimal structure exists
72
  if "steps" in plan and "fields" in plan:
73
  logger.info("Plan successfully generated with required structure")
74
- # Add pdf_meta to the plan
75
  plan["pdf_meta"] = pdf_meta
 
 
 
76
  return plan
77
  else:
78
  missing_keys = []
@@ -93,7 +103,7 @@ class Planner:
93
 
94
  # ---------- fallback static plan ----------
95
  logger.info("Falling back to static plan")
96
- return self._static_plan(fields)
97
 
98
  # --------------------------------------------------
99
  @staticmethod
@@ -115,6 +125,8 @@ class Planner:
115
  field_descriptions = kwargs.get("field_descriptions", {})
116
  doc_preview = kwargs.get("doc_preview", "")
117
  pdf_meta = kwargs.get("pdf_meta", {})
 
 
118
 
119
  # Create a formatted string with the actual values
120
  formatted = self.s
@@ -128,6 +140,10 @@ class Planner:
128
  formatted = formatted.replace("a few kB of raw text from the uploaded document", f"document preview: {doc_preview[:1000]}...")
129
  if pdf_meta:
130
  formatted = formatted.replace("pdf_meta / field_descriptions for extra context", f"document metadata: {json.dumps(pdf_meta)}")
 
 
 
 
131
 
132
  return formatted
133
 
@@ -135,7 +151,7 @@ class Planner:
135
 
136
  # --------------------------------------------------
137
  @staticmethod
138
- def _static_plan(fields: List[str]) -> Dict[str, Any]:
139
  """Return a hard-coded plan to guarantee offline functionality."""
140
  logger.info("Generating static fallback plan")
141
  steps = [
@@ -148,4 +164,12 @@ class Planner:
148
  ],
149
  },
150
  ]
151
- return {"steps": steps, "fields": fields, "pdf_meta": {}} # Include empty pdf_meta in static plan
 
 
 
 
 
 
 
 
 
37
  fields: List[str],
38
  doc_preview: str | None = None,
39
  field_descs: Dict | None = None,
40
+ strategy: str = "Original Strategy",
41
+ unique_indices: List[str] | None = None,
42
  ) -> Dict[str, Any]:
43
  """Return a JSON dict representing the execution plan."""
44
 
 
47
  "doc_preview": doc_preview or "",
48
  "fields": fields,
49
  "field_descriptions": field_descs or {},
50
+ "strategy": strategy,
51
+ "unique_indices": unique_indices or [],
52
  }
53
 
54
  logger.info(f"Building plan for fields: {fields}")
55
+ logger.info(f"Using strategy: {strategy}")
56
+ if unique_indices:
57
+ logger.info(f"Unique indices: {unique_indices}")
58
  logger.debug(f"User context: {user_context}")
59
 
60
  prompt = self.prompt_template.format_json(**user_context)
 
78
  # ensure minimal structure exists
79
  if "steps" in plan and "fields" in plan:
80
  logger.info("Plan successfully generated with required structure")
81
+ # Add pdf_meta and strategy info to the plan
82
  plan["pdf_meta"] = pdf_meta
83
+ plan["strategy"] = strategy
84
+ if unique_indices:
85
+ plan["unique_indices"] = unique_indices
86
  return plan
87
  else:
88
  missing_keys = []
 
103
 
104
  # ---------- fallback static plan ----------
105
  logger.info("Falling back to static plan")
106
+ return self._static_plan(fields, strategy, unique_indices)
107
 
108
  # --------------------------------------------------
109
  @staticmethod
 
125
  field_descriptions = kwargs.get("field_descriptions", {})
126
  doc_preview = kwargs.get("doc_preview", "")
127
  pdf_meta = kwargs.get("pdf_meta", {})
128
+ strategy = kwargs.get("strategy", "Original Strategy")
129
+ unique_indices = kwargs.get("unique_indices", [])
130
 
131
  # Create a formatted string with the actual values
132
  formatted = self.s
 
140
  formatted = formatted.replace("a few kB of raw text from the uploaded document", f"document preview: {doc_preview[:1000]}...")
141
  if pdf_meta:
142
  formatted = formatted.replace("pdf_meta / field_descriptions for extra context", f"document metadata: {json.dumps(pdf_meta)}")
143
+ if strategy:
144
+ formatted = formatted.replace("strategy for extraction", f"extraction strategy: {strategy}")
145
+ if unique_indices:
146
+ formatted = formatted.replace("unique indices for extraction", f"unique indices: {json.dumps(unique_indices)}")
147
 
148
  return formatted
149
 
 
151
 
152
  # --------------------------------------------------
153
  @staticmethod
154
+ def _static_plan(fields: List[str], strategy: str = "Original Strategy", unique_indices: List[str] | None = None) -> Dict[str, Any]:
155
  """Return a hard-coded plan to guarantee offline functionality."""
156
  logger.info("Generating static fallback plan")
157
  steps = [
 
164
  ],
165
  },
166
  ]
167
+ plan = {
168
+ "steps": steps,
169
+ "fields": fields,
170
+ "pdf_meta": {},
171
+ "strategy": strategy
172
+ }
173
+ if unique_indices:
174
+ plan["unique_indices"] = unique_indices
175
+ return plan
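In this diff, `app.py` passes the radio button label (for example "Unique Indices Strategy") to the planner, while `FieldMapperAgent.execute` compares `ctx["strategy"]` against the key `"unique_indices"`. Whether another layer translates between the two is not visible here. A minimal sketch of such a translation, assuming it is needed, could look like this. `STRATEGY_KEYS` and `normalize_strategy` are hypothetical names.

```python
from typing import Optional

# Hypothetical mapping from the UI labels shown in app.py to the
# internal keys that FieldMapperAgent.execute checks.
STRATEGY_KEYS = {
    "Original Strategy": "original",
    "Unique Indices Strategy": "unique_indices",
}


def normalize_strategy(label: Optional[str]) -> str:
    """Return the internal strategy key for a UI label or an existing key."""
    if not label:
        return "original"
    if label in STRATEGY_KEYS.values():
        return label  # already an internal key
    # Unknown labels fall back to the original strategy.
    return STRATEGY_KEYS.get(label, "original")
```

Calling this once where the plan is turned into an agent context would keep the UI wording free to change without touching the agent's comparison.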
src/services/azure_di_service.py CHANGED
@@ -15,47 +15,98 @@ class AzureDIService:
15
  self.log_dir = Path("logs/di_content")
16
  self.log_dir.mkdir(parents=True, exist_ok=True)
17
 
18
  def extract_tables(self, pdf_bytes: bytes):
19
  try:
20
  self.logger.info("Starting document analysis with Azure Document Intelligence")
21
 
22
- # Analyze the entire document at once
23
- #poller = self.client.begin_analyze_document("prebuilt-layout", body=pdf_bytes)
24
-
25
- poller = self.client.begin_analyze_document(
26
  "prebuilt-layout",
27
  body=pdf_bytes,
28
- content_type="application/octet-stream",
29
  output_content_format=DocumentContentFormat.MARKDOWN
30
  )
31
- result = poller.result()
32
 
33
- # Log the raw result structure
34
- self.logger.info("Inspecting Azure DI result structure:")
35
- self.logger.info(f"Result type: {type(result)}")
36
- self.logger.info(f"Result attributes: {dir(result)}")
37
 
38
- # Check if content exists and log its type
39
- if hasattr(result, "content"):
40
- self.logger.info(f"Content type: {type(result.content)}")
41
- self.logger.info(f"Content preview: {result.content[:500]}")
42
-
43
- # Save content to timestamped file
44
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
45
- log_file = self.log_dir / f"di_content_{timestamp}.txt"
46
- with open(log_file, "w", encoding="utf-8") as f:
47
- f.write(result.content)
48
- self.logger.info(f"Saved DI content to {log_file}")
49
 
50
- # Check if tables exist and log their structure
51
- if hasattr(result, "tables"):
52
- self.logger.info(f"Number of tables: {len(result.tables)}")
53
- if result.tables:
54
- self.logger.info(f"First table structure: {dir(result.tables[0])}")
55
- self.logger.info(f"First table cells: {[cell.content for cell in result.tables[0].cells]}")
 
 
56
 
57
- # For now, return empty result until we understand the structure
58
- return {"text": result.content}
 
 
59
 
60
  except HttpResponseError as e:
61
  self.logger.error(f"Azure Document Intelligence API error: {str(e)}")
 
15
  self.log_dir = Path("logs/di_content")
16
  self.log_dir.mkdir(parents=True, exist_ok=True)
17
 
18
+ def _process_table(self, table):
19
+ """Process a table to properly handle rowspans and return expanded rows."""
20
+ if not getattr(table, 'cells', None):
21
+ return []
22
+
23
+ # Get table dimensions
24
+ rows = max(cell.row_index for cell in table.cells) + 1
25
+ cols = max(cell.column_index for cell in table.cells) + 1
26
+
27
+ # Initialize the expanded table
28
+ expanded_table = []
29
+ for _ in range(rows):
30
+ expanded_table.append([None] * cols)
31
+
32
+ # First pass: fill in all cells
33
+ for cell in table.cells:
34
+ expanded_table[cell.row_index][cell.column_index] = cell.content
35
+
36
+ # Second pass: handle rowspans
37
+ for cell in table.cells:
38
+ if (getattr(cell, 'row_span', None) or 1) > 1:
39
+ # Copy the content to all spanned rows
40
+ for i in range(1, cell.row_span):
41
+ if cell.row_index + i < rows:
42
+ expanded_table[cell.row_index + i][cell.column_index] = cell.content
43
+
44
+ # Convert to list of dictionaries
45
+ headers = expanded_table[0]
46
+ result = []
47
+ for row in expanded_table[1:]:
48
+ row_dict = {}
49
+ for i, value in enumerate(row):
50
+ if i < len(headers):
51
+ row_dict[headers[i]] = value
52
+ result.append(row_dict)
53
+
54
+ return result
55
+
56
  def extract_tables(self, pdf_bytes: bytes):
57
  try:
58
  self.logger.info("Starting document analysis with Azure Document Intelligence")
59
 
60
+ # First call: Get markdown format for document context
61
+ markdown_poller = self.client.begin_analyze_document(
 
 
62
  "prebuilt-layout",
63
  body=pdf_bytes,
64
+ content_type="application/octet-stream",
65
  output_content_format=DocumentContentFormat.MARKDOWN
66
  )
67
+ markdown_result = markdown_poller.result()
68
+
69
+ # Second call: structured result for table processing (the content format only affects `content`; DocumentContentFormat offers TEXT and MARKDOWN)
70
+ json_poller = self.client.begin_analyze_document(
71
+ "prebuilt-layout",
72
+ body=pdf_bytes,
73
+ content_type="application/octet-stream",
74
+ output_content_format=DocumentContentFormat.TEXT
75
+ )
76
+ json_result = json_poller.result()
77
+
78
+ # Process tables from JSON result
79
+ tables_data = []
80
+ if hasattr(json_result, "tables"):
81
+ self.logger.info(f"Number of tables: {len(json_result.tables)}")
82
+ for table in json_result.tables:
83
+ processed_table = self._process_table(table)
84
+ tables_data.extend(processed_table)
85
+ self.logger.info(f"Processed table with {len(processed_table)} rows")
86
 
87
+ # Save both markdown and JSON content for debugging
88
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
 
 
89
 
90
+ # Save markdown content
91
+ markdown_log = self.log_dir / f"di_content_{timestamp}_markdown.txt"
92
+ with open(markdown_log, "w", encoding="utf-8") as f:
93
+ if hasattr(markdown_result, "content"):
94
+ f.write(markdown_result.content)
95
+ self.logger.info(f"Saved markdown content to {markdown_log}")
 
 
 
 
 
96
 
97
+ # Save JSON content
98
+ json_log = self.log_dir / f"di_content_{timestamp}_json.txt"
99
+ with open(json_log, "w", encoding="utf-8") as f:
100
+ if hasattr(json_result, "content"):
101
+ f.write(json_result.content)
102
+ else:
103
+ f.write(json.dumps(json_result.as_dict(), indent=2))
104
+ self.logger.info(f"Saved JSON content to {json_log}")
105
 
106
+ return {
107
+ "text": markdown_result.content if hasattr(markdown_result, "content") else "",
108
+ "tables": tables_data
109
+ }
110
 
111
  except HttpResponseError as e:
112
  self.logger.error(f"Azure Document Intelligence API error: {str(e)}")
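`_process_table` above copies a cell's content down its row span but leaves column spans untouched. The following sketch shows one way to expand both directions into a rectangular grid. The `Cell` dataclass is a hypothetical stand-in for the SDK's table cell objects, and treating a missing span as 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell returned by the service."""
    row_index: int
    column_index: int
    content: str
    row_span: Optional[int] = None     # None is treated as a span of 1
    column_span: Optional[int] = None  # None is treated as a span of 1


def expand_grid(cells: List[Cell]) -> List[List[Optional[str]]]:
    """Return a rectangular grid with spanned cells repeated in every slot."""
    if not cells:
        return []
    n_rows = max(c.row_index + (c.row_span or 1) for c in cells)
    n_cols = max(c.column_index + (c.column_span or 1) for c in cells)
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        # Copy the content into every row and column the cell covers.
        for r in range(c.row_index, c.row_index + (c.row_span or 1)):
            for k in range(c.column_index, c.column_index + (c.column_span or 1)):
                grid[r][k] = c.content
    return grid
```

Returning an empty list for a table without cells also avoids calling `max()` on an empty sequence.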
src/ui/strategy_selector.py ADDED
@@ -0,0 +1,38 @@
1
+ """Strategy selection UI components for Streamlit."""
2
+ import streamlit as st
3
+ from typing import Dict, Any, List, Tuple
4
+
5
+ def render_strategy_selector() -> Tuple[str, Dict[str, Any]]:
6
+ """Render strategy selection UI and return selected strategy and parameters."""
7
+ strategy = st.radio(
8
+ "Select Extraction Strategy",
9
+ ["Original Strategy", "Unique Indices Strategy"],
10
+ help="Choose how to extract information from the document"
11
+ )
12
+
13
+ params = {}
14
+
15
+ if strategy == "Original Strategy":
16
+ params["strategy"] = "original"
17
+ params["current_field"] = st.text_input(
18
+ "Field to Extract",
19
+ help="Enter the field name to extract from the document"
20
+ )
21
+ else:
22
+ params["strategy"] = "unique_indices"
23
+
24
+ # Get unique indices
25
+ indices_input = st.text_area(
26
+ "Unique Indices",
27
+ help="Enter comma-separated list of indices to look for (e.g., 'peptide, modification, timepoint')"
28
+ )
29
+ params["unique_indices"] = [idx.strip() for idx in indices_input.split(",") if idx.strip()]
30
+
31
+ # Get fields to extract
32
+ fields_input = st.text_area(
33
+ "Fields to Extract",
34
+ help="Enter comma-separated list of fields to extract for each combination"
35
+ )
36
+ params["fields_to_extract"] = [field.strip() for field in fields_input.split(",") if field.strip()]
37
+
38
+ return strategy, params
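Both `app.py` and `render_strategy_selector` split comma-separated input with the same list comprehension. A small shared helper could hold that logic in one place. The sketch below is an assumption-laden illustration: the name `parse_csv_list` is hypothetical, and the case-insensitive de-duplication goes beyond what the commit does.

```python
from typing import List, Optional


def parse_csv_list(raw: Optional[str]) -> List[str]:
    """Split comma-separated input into trimmed, non-empty names.

    Duplicates are dropped case-insensitively, keeping the first spelling
    (an added assumption, not behavior from the commit).
    """
    seen = set()
    items: List[str] = []
    for part in (raw or "").split(","):
        name = part.strip()
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            items.append(name)
    return items
```

Because the helper has no Streamlit dependency, it can be tested without rendering any UI.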