Spaces: levalencia/doctorecord


levalencia committed on Jun 5, 2025

Commit 924cb7d (1 parent: 26b3eb7)

Add extraction strategy selection and unique indices handling in app.py and FieldMapperAgent. Enhance Planner to accommodate new parameters for execution plans. Update AzureDIService to process tables and save extracted content in both markdown and JSON formats for improved traceability.

Files changed (8)

logs/di_content/di_content_20250605_114836.txt +1608 -0
src/agents/__pycache__/field_mapper_agent.cpython-312.pyc +0 -0
src/agents/field_mapper_agent.py +111 -26
src/app.py +19 -0
src/orchestrator/__pycache__/planner.cpython-312.pyc +0 -0
src/orchestrator/planner.py +28 -4
src/services/azure_di_service.py +80 -29
src/ui/strategy_selector.py +38 -0

logs/di_content/di_content_20250605_114836.txt ADDED

	@@ -0,0 +1,1608 @@

+# ARGX DISCOVERY: EVALUATION OF LIABILITIES IN VH AND VL REGION OF ONE CONSTRUCT
+Argenx - P3016_R010_v00
+<figure>
+RIC
+biologics
+</figure>
+<!-- PageBreak -->
+<!-- PageHeader="Table of contents" -->
+- Project information
+- Scope
+- Test samples
+- Method
+- Results
+  · P018_3D6 VHO VL6 _hIgG1_LALAPG_F405L-FJB_2024-04-18_002
+- Conclusions
+<!-- PageNumber="2" -->
+<!-- PageBreak -->
+## Project information
+<table>
+<tr>
+<td>Sales quote:</td>
+<td>SQ20202722</td>
+</tr>
+<tr>
+<td>Project code:</td>
+<td>P3016</td>
+</tr>
+<tr>
+<td>LNB number:</td>
+<td>2023.040</td>
+</tr>
+<tr>
+<td>Project responsible:</td>
+<td>Nathan Cardon</td>
+</tr>
+<tr>
+<td>Report name:</td>
+<td>P3016_R010_v01</td>
+</tr>
+</table>
+<!-- PageNumber="3" -->
+<!-- PageBreak -->
+## Scope
+This report describes the results of a liability assessment of one argenx
+discovery construct. Non-stressed and temperature-stressed samples
+stored for multiple weeks at 37℃ were evaluated by reduced protein
+RPLC-UV-MS and peptide map analysis. The focus of both analyses was on the
+variable regions of the different constructs.
+<!-- PageNumber="4" -->
+<!-- PageBreak -->
+### Test samples
+The samples used in this study are listed below.
+<table>
+<tr>
+<th>Construct</th>
+<th>Stress condition</th>
+<th>Concentration (mg/mL)</th>
+</tr>
+<tr>
+<td rowspan="3">P018_3D6 VHO VL6 _hIgG1_LALAPG_F405L-FJB_2024-04-18_002</td>
+<td>T0W</td>
+<td>1.00</td>
+</tr>
+<tr>
+<td>T2W_37℃</td>
+<td>1.00</td>
+</tr>
+<tr>
+<td>T4W_37℃</td>
+<td>1.00</td>
+</tr>
+</table>
+<!-- PageNumber="5" -->
+<!-- PageBreak -->
+## Method: Reduced protein analysis by RPLC-UV-MS
+- The samples were reduced by incubation with DTT while in denaturing conditions. The
+samples were analyzed using a C8 RPLC column on a 1290 Infinity UHPLC system coupled
+to a 6540 Q-TOF mass spectrometer (both from Agilent Technologies).
+- RPLC was performed with trifluoroacetic acid (TFA) as ion pairing additive, and with H2O
+and acetonitrile as mobile phases.
+- Data acquisition and processing were performed with BioConfirm MassHunter 7.0
+(Agilent Technologies). UV 280 nm and MS data were acquired simultaneously.
+<!-- PageNumber="6" -->
+<!-- PageBreak -->
+## Method: Peptide map analysis in reducing conditions
+- Prior to digestion, the samples were reduced using dithiothreitol (DTT) and alkylated with
+iodoacetamide (IAA). The samples were digested using trypsin as protease.
+- The digests were analyzed on RPLC-MS using a C18 RPLC column. RPLC was performed
+with formic acid (FA) as additive, and with H2O and acetonitrile as mobile phases. The
+analyses were performed on a 1290 Infinity UHPLC system (Agilent Technologies) coupled
+to a 6545 Q-TOF Mass Spectrometer (Agilent Technologies) operated in MS and MS/MS
+mode.
+- Data processing was performed using BioConfirm 10.0 and MassHunter 7.0 (Agilent
+Technologies).
+- Measured signals were matched onto the sequence. Identification was based primarily
+on MS-only data. Enzyme specified was trypsin (C-terminal cleavage at lysine or arginine)
+and 0-2 missed cleavages were allowed. N-terminal cyclization (pyroglutamate from E/Q),
+D isomerization, N/Q deamidation, M/W oxidation and N-glycosylation were considered
+as variable modifications while cysteine carbamidomethylation (sample preparation
+related) was considered as fixed modification. Peak areas from extracted ion
+chromatograms (EICs) were used for quantifying modifications.
+- Note that peptide map data processing was focused on the variable regions of the
+constructs.
+<!-- PageNumber="7" -->
+<!-- PageBreak -->
+P018_3D6 VHO VL6_HIGG1_LALAPG
+<!-- PageBreak -->
+## AA sequence of P018_3D6 VHO VL6 _hIgG1_LALAPG
+AA sequence of VH (MW (G0F): 50306.86Da):
+EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSLSWVRQAPGKGLEWVSTIKARRGTTLYADSVKDRFTISR
+DNSKNTLYLQMNSLRAEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTKGPSVFPLAPSSKSTSGGT
+AALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKV
+DKKVEPKSCDKTHTCPPCPAPEAAGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDG
+VEVHNAKTKPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALGAPIEKTISKAKGQPREPQVYTLP
+PSRDELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFLLYSKLTVDKSRWQQGNVF
+SCSVMHEALHNHYTQKSLSLSPGK
+AA sequence of VL (MW: 23376.19Da):
+DIQMTQSPSSLSASVGDRVTITCQASQSISSYLAWYQQKPGKAPKLLIYGGSRLQTGVPSRFSGSGSGTDFT
+LTISSLQPEDFATYYCQQDYSWPLTFGQGTKVEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAK
+VQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC
+<table>
+<tr>
+<td>Blue:</td>
+<td>VH and VL</td>
+</tr>
+<tr>
+<td>Blue:</td>
+<td>CDR</td>
+</tr>
+<tr>
+<td>Green:</td>
+<td>N-glycosylation site</td>
+</tr>
+</table>
+<!-- PageNumber="9" -->
+<!-- PageBreak -->
+## Results: reduced protein RPLC-UV-MS
+### Overlays of the UV214nm profiles
+<figure>
+Overlay of UV 214 nm chromatograms for T0W, T2W_37℃ and T4W_37℃ (DAD1 - A:Sig=214.0,8.0 Ref=360.0,100.0 P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d).
+Peak labels on the traces: HC, LC, 1-9 and a-g.
+Peak annotations, in extraction order:
+a: SG clip HC (E1-S140)
+GG clip HC (E1-G141)
+1: LC + deamidation
+2: RG clip HC (G51-G450)
+b: LC + deamidation
+NL clip HC (L104-G450)
+c: PK clip HC E1-P221
+3: HC isomer
+LC
+4: unknown
+d & e: LC
+5: HC
+LC + oxidation (1x, 2x)
+6: HC + Deamidation
+f: LC + LC glycation (1x, 2x)
+7: HC + PyroE
+DR clip LC (R18-C224)
+DP clip HC (E1-D274)
+g: LC
+8: CD clip HC (E1-C224)
+9: Thioether HC+LC, (73651.6 Da)
+DG Clip HC: (E1-D284)
+Axes: Response Units (×10^3, 0 to 1) vs. Acquisition Time (min, 6 to 22)
+</figure>
+*T0 chromatogram aligned to facilitate data interpretation.
+Color legend:
+Black: clear identity on RPLC-UV-MS
+Blue: modification with clear trend visible on RPLC-UV-MS
+Blue: modification in variable region also confirmed in
+peptide mapping
+Grey: most likely method induced modification
+<!-- PageNumber="10" -->
+<!-- PageBreak -->
+## Results: reduced protein RPLC-UV-MS
+### Deconvoluted spectra of HC and LC
+<figure>
+LC spectrum: +ESI Scan (rt: 13.0-13.7 min, 45 scans) Frag=200.0V P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d Deconvoluted (Isotope Width=0.0)
+Main peak: 23377.04 (Theoretical mass LC: 23376.2Da). Other labelled masses: 15585.42, 18295.12, 28052.48, 34559.61. Counts scale: x10^6.
+HC spectrum: +ESI Scan (rt: 14.3-15.3 min, 58 scans) Frag=200.0V P3016_10OCT24_rRPLC_UV_MS_P018_3D6_VHO_VL_T0.d Deconvoluted (Isotope Width=0.0)
+Main peak: G0F 50308.16 (Theoretical mass HC (G0F): 50306.9Da). Other labelled masses: 53210.42, 57837.17. Counts scale: x10^5.
+Axes: Counts vs. Deconvoluted Mass (amu), 12000 to 58000
+</figure>
+<!-- PageNumber="11" -->
+<!-- PageBreak -->
+## Results: tryptic peptide map
+### N-terminal and C-terminal modifications for HC and LC
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td rowspan="2">EVQLLESGGGLVQPGGSLR</td>
+<td rowspan="2">HC(1-19)</td>
+<td></td>
+<td>99.0</td>
+<td>96.1</td>
+<td>93.0</td>
+</tr>
+<tr>
+<td>PyroE</td>
+<td>1.0</td>
+<td>3.9</td>
+<td>7.0</td>
+</tr>
+</table>
+### Isomerization events in the variable parts of HC and LC
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td rowspan="2">AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
+<td rowspan="2">HC(88-125)</td>
+<td></td>
+<td>99.9</td>
+<td>99.7</td>
+<td>99.5</td>
+</tr>
+<tr>
+<td>Isomerization</td>
+<td>0.1</td>
+<td>0.3</td>
+<td>0.5</td>
+</tr>
+</table>
+For information purposes the following color code is applied: Values showing an increase or decrease
+between 1.0% and 10.0% compared to the T0 sample are in blue and underlined, whereas values showing an
+increase of more than 10.0% compared to T0 are in red, bold and underlined.
+<!-- PageNumber="12" -->
+<!-- PageBreak -->
+## Results: tryptic peptide map
+### M/W oxidations in variable parts of HC and LC
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td rowspan="2">NTLYLQMNSLR</td>
+<td rowspan="2">HC(77-87)</td>
+<td></td>
+<td>99.5</td>
+<td>99.5</td>
+<td>99.5</td>
+</tr>
+<tr>
+<td>Oxidation</td>
+<td>0.5</td>
+<td>0.5</td>
+<td>0.5</td>
+</tr>
+<tr>
+<td rowspan="2">DIQMTQSPSSLSASVGDR</td>
+<td rowspan="2">LC(1-18)</td>
+<td></td>
+<td>99.5</td>
+<td>99.5</td>
+<td>99.5</td>
+</tr>
+<tr>
+<td>Oxidation</td>
+<td>0.5</td>
+<td>0.5</td>
+<td>0.5</td>
+</tr>
+</table>
+<!-- PageNumber="13" -->
+<!-- PageBreak -->
+## Results: tryptic peptide map
+### N/Q deamidation in variable parts of HC and LC
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td rowspan="2">EVQLLESGGGLVQPGGSLR</td>
+<td rowspan="2">HC(1-19)</td>
+<td></td>
+<td>99.6</td>
+<td>99.6</td>
+<td>99.5</td>
+</tr>
+<tr>
+<td>Deamidation</td>
+<td>0.4</td>
+<td>0.4</td>
+<td>0.5</td>
+</tr>
+<tr>
+<td rowspan="2">NTLYLQMNSLR*</td>
+<td rowspan="2">HC(77-87)</td>
+<td></td>
+<td>99.0</td>
+<td>98.8</td>
+<td>98.6</td>
+</tr>
+<tr>
+<td>Deamidation</td>
+<td>1.0</td>
+<td>1.2</td>
+<td>1.4</td>
+</tr>
+<tr>
+<td rowspan="2">AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
+<td rowspan="2">HC(88-125)</td>
+<td></td>
+<td>98.7</td>
+<td>96.1</td>
+<td>93.4</td>
+</tr>
+<tr>
+<td>Deamidation</td>
+<td>1.3</td>
+<td>3.9</td>
+<td>6.6</td>
+</tr>
+<tr>
+<td rowspan="2">VTITCQASQSISSYLAWYQQKPGK</td>
+<td rowspan="2">LC(19-42)</td>
+<td></td>
+<td>99.9</td>
+<td>99.8</td>
+<td>99.5</td>
+</tr>
+<tr>
+<td>Deamidation</td>
+<td>0.1</td>
+<td>0.2</td>
+<td>0.5</td>
+</tr>
+</table>
+*MS/MS confirmed the deamidation site as N84, see next slides.
+<!-- PageNumber="14" -->
+<!-- PageBreak -->
+## Results: tryptic peptide map
+### HC(77-87) deamidation site confirmation via MS/MS
+N77:
+<table>
+<tr>
+<th>b</th>
+<th colspan="3">y</th>
+</tr>
+<tr>
+<td>--- 1</td>
+<td>N(+0.984016)</td>
+<td>11</td>
+<td>---</td>
+</tr>
+<tr>
+<td>217.0819 2</td>
+<td>T</td>
+<td>10</td>
+<td>1238.6562</td>
+</tr>
+<tr>
+<td>330.1660 3</td>
+<td>L</td>
+<td>9</td>
+<td>1137.6085</td>
+</tr>
+<tr>
+<td>493.2293 4</td>
+<td>Y</td>
+<td>8</td>
+<td>1024.5244</td>
+</tr>
+<tr>
+<td>606.3134 5</td>
+<td>L</td>
+<td>7</td>
+<td>861.4611</td>
+</tr>
+<tr>
+<td>734.3719 6</td>
+<td>Q</td>
+<td>6</td>
+<td>748.3770</td>
+</tr>
+<tr>
+<td>865.4124 7</td>
+<td>M</td>
+<td>5</td>
+<td>620.3185</td>
+</tr>
+<tr>
+<td>979.4553 8</td>
+<td>N</td>
+<td>4</td>
+<td>489.2780</td>
+</tr>
+<tr>
+<td>1066.4874 9</td>
+<td>S</td>
+<td>3</td>
+<td>375.2350</td>
+</tr>
+<tr>
+<td>1179.5714 10</td>
+<td>L</td>
+<td>2</td>
+<td>288.2030</td>
+</tr>
+<tr>
+<td>--- 11</td>
+<td>R</td>
+<td>1</td>
+<td>175.1190</td>
+</tr>
+</table>
+N84:
+<figure>
+Visible when zooming
+</figure>
+<table>
+<tr>
+<th>b</th>
+<th colspan="3">y</th>
+</tr>
+<tr>
+<td>--- 1</td>
+<td>N</td>
+<td>11</td>
+<td>---</td>
+</tr>
+<tr>
+<td>216.0979 2</td>
+<td>T</td>
+<td>10</td>
+<td>1239.6402</td>
+</tr>
+<tr>
+<td>329.1819 3</td>
+<td>L</td>
+<td>9</td>
+<td>1138.5925</td>
+</tr>
+<tr>
+<td>492.2453 4</td>
+<td>Y</td>
+<td>8</td>
+<td>1025.5084</td>
+</tr>
+<tr>
+<td>605.3293 5</td>
+<td>L</td>
+<td>7</td>
+<td>862.4451</td>
+</tr>
+<tr>
+<td>733.3879 6</td>
+<td>Q</td>
+<td>6</td>
+<td>749.3611</td>
+</tr>
+<tr>
+<td>864.4284 7</td>
+<td>M</td>
+<td>5</td>
+<td>621.3025</td>
+</tr>
+<tr>
+<td>979.4553 8</td>
+<td>N(+0.984016)</td>
+<td>4</td>
+<td>490.2620</td>
+</tr>
+<tr>
+<td>1066.4874 9</td>
+<td>S</td>
+<td>3</td>
+<td>375.2350</td>
+</tr>
+<tr>
+<td>1179.5714 10</td>
+<td>L</td>
+<td>2</td>
+<td>288.2030</td>
+</tr>
+<tr>
+<td>--- 11</td>
+<td>R</td>
+<td>1</td>
+<td>175.1190</td>
+</tr>
+</table>
+<figure>
+MS/MS spectrum: +ESI Product Ion (rt: 20.9 min) Frag=175.0V [email protected] (677.3463[z=2] -> ** ) P3016_PM_09OCT2024_R_MS2_P018_3D6_2024-04-18_002_T4w-2.d
+Labelled fragment m/z: 216.0979, 329.1827, 375.2330, 447.2224, 492.2425, 621.3030, 749.3614, 813.6398, 862.4438, 1025.5063, 1138.5930
+Axes: Counts (x10^3) vs. Mass-to-Charge (m/z), 150 to 1150
+</figure>
+<!-- PageFooter="MS/MS data confirms N84 as the deamidation site." -->
+<!-- PageNumber="15" -->
+<!-- PageBreak -->
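The two fragment tables above differ only in where the +0.984016 Da deamidation shift is placed. A minimal sketch, assuming standard monoisotopic residue masses and singly protonated b and y ions, recomputes those values; the function name `fragment_ions` is illustrative and not taken from the report or the repository.

```python
# Monoisotopic residue masses (Da) for the residues present in NTLYLQMNSLR.
RESIDUE_MASS = {
    "N": 114.04293, "T": 101.04768, "L": 113.08406, "Y": 163.06333,
    "Q": 128.05858, "M": 131.04049, "S": 87.03203, "R": 156.10111,
}
PROTON = 1.00728
WATER = 18.01056
DEAMIDATION = 0.984016  # mass shift shown in the tables above


def fragment_ions(peptide, deamidated_pos=None):
    """Return ({i: b_i}, {i: y_i}) for singly charged ions (1-based positions)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if deamidated_pos is not None:
        masses[deamidated_pos - 1] += DEAMIDATION
    n = len(masses)
    # b_i sums the first i residues; y_i sums the last i residues plus water.
    b = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    y = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y


# Position 1 of the peptide is N77, position 8 is N84.
b_n77, y_n77 = fragment_ions("NTLYLQMNSLR", deamidated_pos=1)
b_n84, y_n84 = fragment_ions("NTLYLQMNSLR", deamidated_pos=8)
print(round(b_n77[2], 4), round(y_n77[4], 4))
print(round(b_n84[2], 4), round(y_n84[4], 4))
```

Under these assumptions the b2 and y4 ions separate the two candidate sites: b2 carries the shift only for N77, and y4 carries it only for N84.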
+## Results: tryptic peptide map
+### Thioether bond
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td>SCDK</td>
+<td>HC(223-226)</td>
+<td></td>
+<td rowspan="2">99.8</td>
+<td rowspan="2">99.5</td>
+<td rowspan="2">98.8</td>
+</tr>
+<tr>
+<td>SFNRGEC</td>
+<td>LC(208-214)</td>
+<td></td>
+</tr>
+<tr>
+<td>SFNRGEC - SCDK</td>
+<td>HC(223-226) + LC(208-214)</td>
+<td>Thioether</td>
+<td>0.2</td>
+<td>0.5</td>
+<td>1.2</td>
+</tr>
+</table>
+<!-- PageNumber="16" -->
+<!-- PageBreak -->
+## Results: tryptic peptide map
+### Confirmation of clipping events via peptide mapping
+<table>
+<tr>
+<th colspan="6">Relative quantification by EIC (%)*</th>
+</tr>
+<tr>
+<th>AA Sequence</th>
+<th>Seq Loc</th>
+<th>Modification</th>
+<th>T0</th>
+<th>T2W_37℃</th>
+<th>T4W_37℃</th>
+</tr>
+<tr>
+<td>AEDTAVYYCAKPLYSNLAGDFGSWGQGTTVTVSSASTK</td>
+<td rowspan="3">HC(88-125)</td>
+<td></td>
+<td>98.6</td>
+<td>98.1</td>
+<td>96.8</td>
+</tr>
+<tr>
+<td>AEDTAVYYCAKPLYSN</td>
+<td>Clipping</td>
+<td>0.3</td>
+<td>0.4</td>
+<td>0.6</td>
+</tr>
+<tr>
+<td>LAGDFGSWGQGTTVTVSSASTK</td>
+<td>Clipping</td>
+<td>1.1</td>
+<td>1.5</td>
+<td>2.6</td>
+</tr>
+<tr>
+<td>STSGGTAALGCLVK</td>
+<td rowspan="3">HC(138-151)</td>
+<td></td>
+<td>99.9</td>
+<td>99.3</td>
+<td>98.7</td>
+</tr>
+<tr>
+<td>GGTAALGCLVK</td>
+<td>Clipping</td>
+<td>0.0</td>
+<td>0.2</td>
+<td>0.3</td>
+</tr>
+<tr>
+<td>GTAALGCLVK</td>
+<td>Clipping_2</td>
+<td>0.1</td>
+<td>0.5</td>
+<td>0.9</td>
+</tr>
+<tr>
+<td>SCDKTHTCPPCPAPEAAGGPSVFLFPPKPK</td>
+<td rowspan="2">HC(223-252)</td>
+<td></td>
+<td>99.6</td>
+<td>99.4</td>
+<td>99.0</td>
+</tr>
+<tr>
+<td>DKTHTCPPCPAPEAAGGPSVFLFPPKPK</td>
+<td>Clipping</td>
+<td>0.4</td>
+<td>0.6</td>
+<td>1.0</td>
+</tr>
+</table>
+*Note that due to the differences in ionization efficiency between large and small peptides, the quantification of clipping
+events via peptide mapping is most likely an overestimation. Nevertheless, peptide mapping makes it possible to confirm
+clipping events and to discern possible trends.
+<!-- PageNumber="17" -->
+<!-- PageBreak -->
+## P018_3D6 VHO VL6 _hIgG1_LALAPG: Conclusions
+- Some minor liabilities were found in the variable parts of the
+P018_3D6 VHO VL6 _hIgG1_LALAPG construct:
+· On RPLC-UV-MS, some minor clipping events that increased during the temperature stress were
+observed. These clippings were further confirmed via peptide mapping. The most apparent clipping is
+between N103 and L104 in CDR3 of the HC. This clipping increased slightly during temperature
+stress, from 1.4% to 3.2%.
+· A clear deamidation that increased during stress was found on peptide HC(88-125),
+AEDTAVYYCAKPLYSN103LAGDFGSWGQGTTVTVSSASTK, where deamidation increased from 1.3% to 6.6%
+after 4 weeks at 37℃. This asparagine is part of the HC CDR3 domain. Some other minor deamidation
+events were also observed.
+· Cyclization of the N-terminal glutamate on the HC increased up to 7.0% during temperature stress
+(observed in both RPLC-UV-MS and peptide map).
+· Only minor oxidation events were identified in the variable regions of both the heavy and light chain.
+· Both RPLC-UV-MS and peptide mapping confirmed the presence of a thioether-bound HC+LC.
+· Light chain glycation was clearly observed on RPLC-UV-MS (1x and 2x).
+<!-- PageNumber="18" -->
+<!-- PageBreak -->
+## P018_3D6 VHO VL6 _hIgG1_LALAPG: Conclusions
+### Graphical summary (T0->T4W)
+Oxidation: 0.5%
+Deamidation: 1.0 -> 1.4%
+Deamidation: 1.3 -> 6.6%
+PyroE: 1.0 -> 7.0%
+VH
+EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSLSWVRQAPGKGLEWVSTIKARRGTTLY
+ADSVKDRFTISRDNSKNTLYLQM83N84SLRAEDTAVYYCAKPLYSN103L104AGDFGSWGQ
+GTTVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTF
+PAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHTCPPCPA
+PEAAGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAKT
+KPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALGAPIEKTISKAKGQPREPQVY
+TLPPSRDELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFLLYSKLT
+VDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLSPGK
+Clipping: 1.4 -> 3.2%
+G0F most abundant
+VL
+Oxidation: 0.5%
+DIQM4TQSPSSLSASVGDRVTITCQASQSISSYLAWYQQKPGKAPKLLIYGGSRLQTGVPS
+RFSGSGSGTDFTLTISSLQPEDFATYYCQQDYSWPLTFGQGTKVEIKRTVAAPSVFIFPPSD
+EQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLS
+KADYEKHKVYACEVTHQGLSSPVTKSFNRGEC
+<table>
+<tr>
+<td>Blue:</td>
+<td>VH and VL</td>
+</tr>
+<tr>
+<td>Blue:</td>
+<td>CDR</td>
+</tr>
+<tr>
+<td>Green:</td>
+<td>N-glycosylation site</td>
+</tr>
+</table>
+<!-- PageNumber="19" -->
+<!-- PageBreak -->
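The color rule stated in the report footnotes can be written as a small classifier. A minimal sketch, assuming the thresholds refer to the absolute change in percentage points relative to T0 (an interpretation, since the report does not spell this out); `flag_change` is a hypothetical name.

```python
def flag_change(t0: float, value: float) -> str:
    """Classify a relative-quantification value against its T0 reference.

    Assumption: thresholds are percentage points of change versus T0.
    The report only defines red for increases, so a decrease of more than
    10.0 is left unflagged here.
    """
    delta = round(value - t0, 6)  # round away float noise before comparing
    if delta > 10.0:
        return "red"   # increase of more than 10.0%
    if 1.0 <= abs(delta) <= 10.0:
        return "blue"  # increase or decrease between 1.0% and 10.0%
    return "none"


# PyroE on HC(1-19): T0 = 1.0, T2W = 3.9, T4W = 7.0
print([flag_change(1.0, v) for v in (3.9, 7.0)])
# Oxidation on HC(77-87): unchanged at 0.5
print(flag_change(0.5, 0.5))
```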
+### Author section RIC
+#### Author
+Nathan Cardon
+Senior research associate | Project Responsible
+<!-- PageNumber="20" -->
+<!-- PageBreak -->
+### Signature section RIC
+#### Reviewer
+<table>
+<tr>
+<td>Mabelle Meersseman</td>
+<td>Date:</td>
+</tr>
+<tr>
+<td>Group Leader</td>
+<td>Signature:</td>
+</tr>
+<tr>
+<td>Approver</td>
+<td></td>
+</tr>
+<tr>
+<td>Koen Sandra Ph.D.</td>
+<td>Date:</td>
+</tr>
+<tr>
+<td>CEO</td>
+<td>Signature:</td>
+</tr>
+</table>
+<!-- PageNumber="21" -->
+<!-- PageBreak -->
+### Signature section client
+Approver
+Name:
+Date:
+Signature:
+<!-- PageNumber="22" -->
+<!-- PageBreak -->
+### Version control
+<table>
+<tr>
+<th>Version</th>
+<th>Date of issue</th>
+<th>Reason for version update</th>
+</tr>
+<tr>
+<td>00</td>
+<td>20 November 2024</td>
+<td>Draft</td>
+</tr>
+<tr>
+<td></td>
+<td></td>
+<td></td>
+</tr>
+<tr>
+<td></td>
+<td></td>
+<td></td>
+</tr>
+</table>
+<!-- PageNumber="23" -->
+<!-- PageBreak -->
+<figure>
+RIC
+biologics
+</figure>
+YOUR MOLECULE. OUR ANALYTICS. NO SECRETS.
+<!-- PageFooter="www.RIC-biologics.com" -->
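The extracted content above stores its tables as inline HTML. A minimal sketch, using only the Python standard library, of how such tables could be pulled out into rows of cell text; `TableExtractor` and `extract_tables` are hypothetical names, and this is not the AzureDIService implementation from the commit. It ignores `rowspan`/`colspan` and nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(di_markdown: str):
    """Return all tables found in a Document Intelligence markdown string."""
    parser = TableExtractor()
    parser.feed(di_markdown)
    parser.close()
    return parser.tables


sample = """
## Project information
<table>
<tr>
<td>Project code:</td>
<td>P3016</td>
</tr>
<tr>
<td>Report name:</td>
<td>P3016_R010_v01</td>
</tr>
</table>
"""
tables = extract_tables(sample)
print(tables)
```

Cells that span several rows, such as the peptide sequence cells in the quantification tables, would need extra handling to be repeated on each row they cover.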

src/agents/__pycache__/field_mapper_agent.cpython-312.pyc CHANGED

Binary files a/src/agents/__pycache__/field_mapper_agent.cpython-312.pyc and b/src/agents/__pycache__/field_mapper_agent.cpython-312.pyc differ

src/agents/field_mapper_agent.py CHANGED

@@ -266,9 +266,82 @@ class FieldMapperAgent(BaseAgent):
             self.logger.error(f"Error extracting field value from page: {str(e)}", exc_info=True)
             return None
     def execute(self, ctx: Dict[str, Any]):  # noqa: D401
         field = ctx.get("current_field")
-        self.logger.info(f"Starting field mapping for: {field}")
         # Store context for use in extraction methods
         self.ctx = ctx
@@ -286,9 +359,6 @@ class FieldMapperAgent(BaseAgent):
             text = ctx["text"]
             self.logger.info(f"Using text from direct context (length: {len(text)})")
-        if not field:
-            self.logger.warning("No field provided in context")
-            return None
         if not text:
             self.logger.warning("No text content found in context or index")
             return None
@@ -297,28 +367,43 @@ class FieldMapperAgent(BaseAgent):
         if "document_context" not in ctx:
             ctx["document_context"] = self._infer_document_context(text)
-        self.logger.info(f"Processing field: {field}")
         self.logger.info(f"Using document context: {ctx['document_context']}")
-        # Process entire document at once
-        self.logger.info("Processing entire document...")
-        value = self._extract_field_value_from_page(field, text, ctx["document_context"])
-        if value:
-            return value
-        # If no value found, try the search-based approach as fallback
-        self.logger.warning("No value found in document analysis, falling back to search-based approach")
-        if index and "embeddings" in index:
-            self.logger.info("Using semantic search with embeddings")
-            search_query = f"{field} in {ctx['document_context']}"
-            similar_chunks = self._find_similar_chunks_search(search_query, index)
-            if similar_chunks:
-                self.logger.info(f"Found {len(similar_chunks)} relevant chunks, attempting value extraction")
-                value = self._extract_field_value_search(field, similar_chunks, ctx["document_context"])
-                if value:
-                    return value
-        self.logger.warning(f"No candidate found for field: {field}")
-        return f"<no candidate for {field}>"

             self.logger.error(f"Error extracting field value from page: {str(e)}", exc_info=True)
             return None
+    def _extract_with_unique_indices(self, text: str, context: str, unique_indices: List[str], fields_to_extract: List[str]) -> Optional[str]:
+        """Extract values using unique indices strategy."""
+        self.logger.info(f"Using unique indices strategy with indices: {unique_indices}")
+        self.logger.info(f"Fields to extract: {fields_to_extract}")
+        # Get filename from context if available
+        filename = self.ctx.get("pdf_meta", {}).get("filename", "")
+        filename_context = f"\nDocument filename: {filename}" if filename else ""
+        prompt = f"""You are an expert in {context}
+        Your task is to extract information from the document based on unique combinations of indices and their corresponding fields.
+        Unique Indices to look for: {', '.join(unique_indices)}
+        Fields to extract for each combination: {', '.join(fields_to_extract)}{filename_context}
+        Consider the following document:
+        {text}
+        Instructions:
+        1. First, identify all unique combinations of the specified indices in the document
+        2. For each unique combination found, extract the values for all specified fields
+        3. Return the data in a tabular format where:
+           - Each row represents a unique combination
+           - Each column represents a field value
+        4. Return ONLY the JSON value, no explanations
+        5. Format the response as a valid JSON object with arrays for each field
+        6. Keep the structure flat - do not nest values
+        Example response format:
+        {{
+            "index1": ["value1", "value2", "value3"],
+            "index2": ["value4", "value5", "value6"],
+            "field1": ["value7", "value8", "value9"],
+            "field2": ["value10", "value11", "value12"]
+        }}
+        Field values:"""
+        try:
+            self.logger.info("Calling LLM for unique indices extraction")
+            # Get cost tracker from context
+            cost_tracker = self.ctx.get("cost_tracker") if hasattr(self, 'ctx') else None
+            value = self.llm.responses(
+                prompt, temperature=0.0,
+                ctx={"cost_tracker": cost_tracker} if cost_tracker else None,
+                description="Unique Indices Field Extraction"
+            )
+            # Log cost tracking results if available
+            if cost_tracker:
+                self.logger.info(f"Unique indices extraction costs - Input tokens: {cost_tracker.llm_input_tokens}, Output tokens: {cost_tracker.llm_output_tokens}")
+                self.logger.info(f"Unique indices extraction cost: ${cost_tracker.calculate_current_file_costs()['openai']['total_cost']:.4f}")
+            if value and value.lower() not in ["none", "null", "n/a"]:
+                try:
+                    json_value = json.loads(value)
+                    self.logger.info(f"Successfully extracted values: {json.dumps(json_value, indent=2)}")
+                    return json.dumps(json_value, indent=2)
+                except json.JSONDecodeError:
+                    self.logger.error("Failed to parse LLM response as JSON")
+                    return None
+            else:
+                self.logger.warning("LLM returned no valid value")
+                return None
+        except Exception as e:
+            self.logger.error(f"Error in unique indices extraction: {str(e)}", exc_info=True)
+            return None
     def execute(self, ctx: Dict[str, Any]):  # noqa: D401
         field = ctx.get("current_field")
+        strategy = ctx.get("strategy", "original")  # Default to original strategy
+        self.logger.info(f"Starting field mapping for: {field} using strategy: {strategy}")
         # Store context for use in extraction methods
         self.ctx = ctx
             text = ctx["text"]
             self.logger.info(f"Using text from direct context (length: {len(text)})")
         if not text:
             self.logger.warning("No text content found in context or index")
             return None
         if "document_context" not in ctx:
             ctx["document_context"] = self._infer_document_context(text)
         self.logger.info(f"Using document context: {ctx['document_context']}")
+        # Process based on selected strategy
+        if strategy == "unique_indices":
+            unique_indices = ctx.get("unique_indices", [])
+            fields_to_extract = ctx.get("fields_to_extract", [])
+            if not unique_indices or not fields_to_extract:
+                self.logger.warning("Missing unique indices or fields to extract")
+                return None
+            return self._extract_with_unique_indices(text, ctx["document_context"], unique_indices, fields_to_extract)
+        else:
+            # Original strategy
+            if not field:
+                self.logger.warning("No field provided in context")
+                return None
+            self.logger.info(f"Processing field: {field}")
+            self.logger.info("Processing entire document...")
+            value = self._extract_field_value_from_page(field, text, ctx["document_context"])
+            if value:
+                return value
+            # If no value found, try the search-based approach as fallback
+            self.logger.warning("No value found in document analysis, falling back to search-based approach")
+            if index and "embeddings" in index:
+                self.logger.info("Using semantic search with embeddings")
+                search_query = f"{field} in {ctx['document_context']}"
+                similar_chunks = self._find_similar_chunks_search(search_query, index)
+                if similar_chunks:
+                    self.logger.info(f"Found {len(similar_chunks)} relevant chunks, attempting value extraction")
+                    value = self._extract_field_value_search(field, similar_chunks, ctx["document_context"])
+                    if value:
+                        return value
+            self.logger.warning(f"No candidate found for field: {field}")
+            return f"<no candidate for {field}>"
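The unique indices prompt asks the model for a flat JSON object with one array per field. A minimal sketch of how a caller might validate that shape and turn it into one record per combination; `columns_to_records` is a hypothetical helper that is not part of this commit, and the code-fence stripping is an assumption about how some models format their replies.

```python
import json
from typing import Any, Dict, List, Optional

FENCE = "`" * 3  # Markdown code-fence marker


def columns_to_records(raw: str) -> Optional[List[Dict[str, Any]]]:
    """Turn {"col": [v1, v2], ...} into [{"col": v1, ...}, {"col": v2, ...}].

    Returns None when the text is not a JSON object of equal-length lists.
    """
    text = raw.strip()
    # Assumption: the reply may be wrapped in a Markdown code fence.
    if text.startswith(FENCE):
        lines = text.splitlines()
        end = len(lines) - 1 if lines[-1].strip() == FENCE else len(lines)
        text = "\n".join(lines[1:end])
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not data:
        return None
    columns = list(data.values())
    if not all(isinstance(col, list) for col in columns):
        return None
    if len({len(col) for col in columns}) != 1:
        return None  # ragged columns cannot be aligned into rows
    keys = list(data.keys())
    return [dict(zip(keys, row)) for row in zip(*columns)]


reply = '{"timepoint": ["T0", "T4W"], "Modification": ["PyroE", "PyroE"], "value": [1.0, 7.0]}'
records = columns_to_records(reply)
print(records)
```

Rejecting ragged columns early keeps misaligned values from being paired silently when the response is later flattened into a table.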

src/app.py CHANGED

@@ -238,6 +238,23 @@ else:  # page == "Execution"
     fields_str = st.text_input("Fields (comma‑separated)", "Protein Lot, Chain, Residue")
     desc_blob = st.text_area("Field descriptions / rules (YAML, optional)")
     def flatten_json_response(json_data, fields):
         """Flatten the nested JSON response into a tabular structure with dynamic columns."""
         logger = logging.getLogger(__name__)
@@ -327,6 +344,8 @@ else:  # page == "Execution"
                     doc_preview=preview,
                     fields=field_list,
                     field_descs=field_descs,
                 )
                 # Add a visual separator

     fields_str = st.text_input("Fields (comma‑separated)", "Protein Lot, Chain, Residue")
     desc_blob = st.text_area("Field descriptions / rules (YAML, optional)")
+    # Add strategy selector
+    strategy = st.radio(
+        "Select Extraction Strategy",
+        ["Original Strategy", "Unique Indices Strategy"],
+        help="Original Strategy: Process document page by page. Unique Indices Strategy: Process entire document at once using unique indices."
+    )
+    # Add unique indices input if Unique Indices Strategy is selected
+    unique_indices = None
+    if strategy == "Unique Indices Strategy":
+        unique_indices_str = st.text_input(
+            "Unique Fields (comma-separated)",
+            help="Enter the field names that uniquely identify each record (e.g., 'timepoint, Modification, peptide')"
+        )
+        if unique_indices_str:
+            unique_indices = [idx.strip() for idx in unique_indices_str.split(",") if idx.strip()]
     def flatten_json_response(json_data, fields):
         """Flatten the nested JSON response into a tabular structure with dynamic columns."""
         logger = logging.getLogger(__name__)
                     doc_preview=preview,
                     fields=field_list,
                     field_descs=field_descs,
+                    strategy=strategy,
+                    unique_indices=unique_indices
                 )
                 # Add a visual separator
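Reviewer note: the unique-indices input is parsed inline in `app.py`. A small sketch of the same parsing as a testable helper, assuming that an input containing only separators should be treated like no input at all (the function name `parse_unique_indices` is hypothetical; the commit itself would produce an empty list in that case):

```python
from typing import List, Optional


def parse_unique_indices(raw: Optional[str]) -> Optional[List[str]]:
    """Split a comma-separated string into trimmed, non-empty field names."""
    if not raw:
        return None
    names = [part.strip() for part in raw.split(",") if part.strip()]
    # Collapse "only commas/whitespace" to None so callers have one "unset" value.
    return names or None
```

Returning `None` for both empty cases keeps the `unique_indices or []` handling in the planner unchanged.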

src/orchestrator/__pycache__/planner.cpython-312.pyc CHANGED

Binary files a/src/orchestrator/__pycache__/planner.cpython-312.pyc and b/src/orchestrator/__pycache__/planner.cpython-312.pyc differ

src/orchestrator/planner.py CHANGED

@@ -37,6 +37,8 @@ class Planner:
         fields: List[str],
         doc_preview: str | None = None,
         field_descs: Dict | None = None,
     ) -> Dict[str, Any]:
         """Return a JSON dict representing the execution plan."""
@@ -45,9 +47,14 @@ class Planner:
             "doc_preview": doc_preview or "",
             "fields": fields,
             "field_descriptions": field_descs or {},
         }
         logger.info(f"Building plan for fields: {fields}")
         logger.debug(f"User context: {user_context}")
         prompt = self.prompt_template.format_json(**user_context)
@@ -71,8 +78,11 @@ class Planner:
                 # ensure minimal structure exists
                 if "steps" in plan and "fields" in plan:
                     logger.info("Plan successfully generated with required structure")
-                    # Add pdf_meta to the plan
                     plan["pdf_meta"] = pdf_meta
                     return plan
                 else:
                     missing_keys = []
@@ -93,7 +103,7 @@ class Planner:
         # ---------- fallback static plan ----------
         logger.info("Falling back to static plan")
-        return self._static_plan(fields)
     # --------------------------------------------------
     @staticmethod
@@ -115,6 +125,8 @@ class Planner:
                 field_descriptions = kwargs.get("field_descriptions", {})
                 doc_preview = kwargs.get("doc_preview", "")
                 pdf_meta = kwargs.get("pdf_meta", {})
                 # Create a formatted string with the actual values
                 formatted = self.s
@@ -128,6 +140,10 @@ class Planner:
                     formatted = formatted.replace("a few kB of raw text from the uploaded document", f"document preview: {doc_preview[:1000]}...")
                 if pdf_meta:
                     formatted = formatted.replace("pdf_meta / field_descriptions for extra context", f"document metadata: {json.dumps(pdf_meta)}")
                 return formatted
@@ -135,7 +151,7 @@ class Planner:
     # --------------------------------------------------
     @staticmethod
-    def _static_plan(fields: List[str]) -> Dict[str, Any]:
         """Return a hard-coded plan to guarantee offline functionality."""
         logger.info("Generating static fallback plan")
         steps = [
@@ -148,4 +164,12 @@ class Planner:
                 ],
             },
         ]
-        return {"steps": steps, "fields": fields, "pdf_meta": {}}  # Include empty pdf_meta in static plan

         fields: List[str],
         doc_preview: str | None = None,
         field_descs: Dict | None = None,
+        strategy: str = "Original Strategy",
+        unique_indices: List[str] | None = None,
     ) -> Dict[str, Any]:
         """Return a JSON dict representing the execution plan."""
             "doc_preview": doc_preview or "",
             "fields": fields,
             "field_descriptions": field_descs or {},
+            "strategy": strategy,
+            "unique_indices": unique_indices or [],
         }
         logger.info(f"Building plan for fields: {fields}")
+        logger.info(f"Using strategy: {strategy}")
+        if unique_indices:
+            logger.info(f"Unique indices: {unique_indices}")
         logger.debug(f"User context: {user_context}")
         prompt = self.prompt_template.format_json(**user_context)
                 # ensure minimal structure exists
                 if "steps" in plan and "fields" in plan:
                     logger.info("Plan successfully generated with required structure")
+                    # Add pdf_meta and strategy info to the plan
                     plan["pdf_meta"] = pdf_meta
+                    plan["strategy"] = strategy
+                    if unique_indices:
+                        plan["unique_indices"] = unique_indices
                     return plan
                 else:
                     missing_keys = []
         # ---------- fallback static plan ----------
         logger.info("Falling back to static plan")
+        return self._static_plan(fields, strategy, unique_indices)
     # --------------------------------------------------
     @staticmethod
                 field_descriptions = kwargs.get("field_descriptions", {})
                 doc_preview = kwargs.get("doc_preview", "")
                 pdf_meta = kwargs.get("pdf_meta", {})
+                strategy = kwargs.get("strategy", "Original Strategy")
+                unique_indices = kwargs.get("unique_indices", [])
                 # Create a formatted string with the actual values
                 formatted = self.s
                     formatted = formatted.replace("a few kB of raw text from the uploaded document", f"document preview: {doc_preview[:1000]}...")
                 if pdf_meta:
                     formatted = formatted.replace("pdf_meta / field_descriptions for extra context", f"document metadata: {json.dumps(pdf_meta)}")
+                if strategy:
+                    formatted = formatted.replace("strategy for extraction", f"extraction strategy: {strategy}")
+                if unique_indices:
+                    formatted = formatted.replace("unique indices for extraction", f"unique indices: {json.dumps(unique_indices)}")
                 return formatted
     # --------------------------------------------------
     @staticmethod
+    def _static_plan(fields: List[str], strategy: str = "Original Strategy", unique_indices: List[str] | None = None) -> Dict[str, Any]:
         """Return a hard-coded plan to guarantee offline functionality."""
         logger.info("Generating static fallback plan")
         steps = [
                 ],
             },
         ]
+        plan = {
+            "steps": steps,
+            "fields": fields,
+            "pdf_meta": {},
+            "strategy": strategy
+        }
+        if unique_indices:
+            plan["unique_indices"] = unique_indices
+        return plan
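Reviewer note: both the LLM-generated plan and the static fallback now attach `strategy` and, when present, `unique_indices`. A minimal sketch of that shared step as one function, so the two code paths cannot drift apart (the helper name `attach_strategy` is hypothetical and not part of this commit):

```python
from typing import Any, Dict, List, Optional


def attach_strategy(
    plan: Dict[str, Any],
    strategy: str = "Original Strategy",
    unique_indices: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of the plan with strategy metadata added."""
    enriched = dict(plan)  # shallow copy; the caller's dict is left untouched
    enriched["strategy"] = strategy
    if unique_indices:
        # Key is only present when indices were supplied, as in the diff above.
        enriched["unique_indices"] = list(unique_indices)
    return enriched
```

Consumers of the plan should therefore read the indices with `plan.get("unique_indices", [])` rather than indexing the key directly.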

src/services/azure_di_service.py CHANGED

@@ -15,47 +15,98 @@ class AzureDIService:
         self.log_dir = Path("logs/di_content")
         self.log_dir.mkdir(parents=True, exist_ok=True)
     def extract_tables(self, pdf_bytes: bytes):
         try:
             self.logger.info("Starting document analysis with Azure Document Intelligence")
-            # Analyze the entire document at once
-            #poller = self.client.begin_analyze_document("prebuilt-layout", body=pdf_bytes)
-            poller = self.client.begin_analyze_document(
                 "prebuilt-layout",
                 body=pdf_bytes,
-                content_type="application/octet-stream",
                 output_content_format=DocumentContentFormat.MARKDOWN
             )
-            result = poller.result()
-            # Log the raw result structure
-            self.logger.info("Inspecting Azure DI result structure:")
-            self.logger.info(f"Result type: {type(result)}")
-            self.logger.info(f"Result attributes: {dir(result)}")
-            # Check if content exists and log its type
-            if hasattr(result, "content"):
-                self.logger.info(f"Content type: {type(result.content)}")
-                self.logger.info(f"Content preview: {result.content[:500]}")
-                # Save content to timestamped file
-                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                log_file = self.log_dir / f"di_content_{timestamp}.txt"
-                with open(log_file, "w", encoding="utf-8") as f:
-                    f.write(result.content)
-                self.logger.info(f"Saved DI content to {log_file}")
-            # Check if tables exist and log their structure
-            if hasattr(result, "tables"):
-                self.logger.info(f"Number of tables: {len(result.tables)}")
-                if result.tables:
-                    self.logger.info(f"First table structure: {dir(result.tables[0])}")
-                    self.logger.info(f"First table cells: {[cell.content for cell in result.tables[0].cells]}")
-            # For now, return empty result until we understand the structure
-            return {"text": result.content}
         except HttpResponseError as e:
             self.logger.error(f"Azure Document Intelligence API error: {str(e)}")

         self.log_dir = Path("logs/di_content")
         self.log_dir.mkdir(parents=True, exist_ok=True)
+    def _process_table(self, table):
+        """Process a table to properly handle rowspans and return expanded rows."""
+        if not getattr(table, 'cells', None):  # also covers an empty cell list, where max() below would raise
+            return []
+        # Get table dimensions
+        rows = max(cell.row_index for cell in table.cells) + 1
+        cols = max(cell.column_index for cell in table.cells) + 1
+        # Initialize the expanded table
+        expanded_table = []
+        for _ in range(rows):
+            expanded_table.append([None] * cols)
+        # First pass: fill in all cells
+        for cell in table.cells:
+            expanded_table[cell.row_index][cell.column_index] = cell.content
+        # Second pass: handle rowspans
+        for cell in table.cells:
+            if (getattr(cell, 'row_span', None) or 1) > 1:  # row_span may be missing or None
+                # Copy the content to all spanned rows
+                for i in range(1, cell.row_span):
+                    if cell.row_index + i < rows:
+                        expanded_table[cell.row_index + i][cell.column_index] = cell.content
+        # Convert to list of dictionaries
+        headers = expanded_table[0]
+        result = []
+        for row in expanded_table[1:]:
+            row_dict = {}
+            for i, value in enumerate(row):
+                if i < len(headers):
+                    row_dict[headers[i]] = value
+            result.append(row_dict)
+        return result
     def extract_tables(self, pdf_bytes: bytes):
         try:
             self.logger.info("Starting document analysis with Azure Document Intelligence")
+            # First call: Get markdown format for document context
+            markdown_poller = self.client.begin_analyze_document(
                 "prebuilt-layout",
                 body=pdf_bytes,
+                content_type="application/octet-stream",
                 output_content_format=DocumentContentFormat.MARKDOWN
             )
+            markdown_result = markdown_poller.result()
+            # Second call: plain-text content; tables are read from result.tables.
+            # DocumentContentFormat defines only TEXT and MARKDOWN (there is no JSON member).
+            json_poller = self.client.begin_analyze_document(
+                "prebuilt-layout",
+                body=pdf_bytes,
+                content_type="application/octet-stream",
+                output_content_format=DocumentContentFormat.TEXT
+            )
+            json_result = json_poller.result()
+            # Process tables from JSON result
+            tables_data = []
+            if hasattr(json_result, "tables"):
+                self.logger.info(f"Number of tables: {len(json_result.tables)}")
+                for table in json_result.tables:
+                    processed_table = self._process_table(table)
+                    tables_data.extend(processed_table)
+                    self.logger.info(f"Processed table with {len(processed_table)} rows")
+            # Save both markdown and JSON content for debugging
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            # Save markdown content
+            markdown_log = self.log_dir / f"di_content_{timestamp}_markdown.txt"
+            with open(markdown_log, "w", encoding="utf-8") as f:
+                if hasattr(markdown_result, "content"):
+                    f.write(markdown_result.content)
+            self.logger.info(f"Saved markdown content to {markdown_log}")
+            # Save JSON content
+            json_log = self.log_dir / f"di_content_{timestamp}_json.txt"
+            with open(json_log, "w", encoding="utf-8") as f:
+                if hasattr(json_result, "content"):
+                    f.write(json_result.content)
+                else:
+                    f.write(json.dumps(json_result.to_dict(), indent=2))
+            self.logger.info(f"Saved JSON content to {json_log}")
+            return {
+                "text": markdown_result.content if hasattr(markdown_result, "content") else "",
+                "tables": tables_data
+            }
         except HttpResponseError as e:
             self.logger.error(f"Azure Document Intelligence API error: {str(e)}")
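Reviewer note: `_process_table` expands row-spanned cells so that every data row carries the spanned value, then keys each row by the header row. A self-contained sketch of that technique, assuming a stand-in `Cell` dataclass that mirrors only the attributes the method reads (both `Cell` and `expand_rowspans` are hypothetical names, not the Azure SDK types):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cell:
    """Stand-in for a table cell; mirrors only the attributes used here."""
    row_index: int
    column_index: int
    content: str
    row_span: Optional[int] = 1


def expand_rowspans(cells: List[Cell]) -> List[Dict[Optional[str], Optional[str]]]:
    """Copy row-spanned cell content into each covered row, then key rows by header."""
    if not cells:
        return []
    n_rows = max(c.row_index for c in cells) + 1
    n_cols = max(c.column_index for c in cells) + 1
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        span = c.row_span or 1  # treat a missing span as a single row
        for offset in range(span):
            row = c.row_index + offset
            if row < n_rows:
                grid[row][c.column_index] = c.content
    headers = grid[0]  # the first row is assumed to be the header row
    return [dict(zip(headers, row)) for row in grid[1:]]
```

As in the committed method, duplicate or empty header cells collapse into a single key, and column spans are not expanded.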

src/ui/strategy_selector.py ADDED

	@@ -0,0 +1,38 @@

+"""Strategy selection UI components for Streamlit."""
+import streamlit as st
+from typing import Dict, Any, List, Tuple
+def render_strategy_selector() -> Tuple[str, Dict[str, Any]]:
+    """Render strategy selection UI and return selected strategy and parameters."""
+    strategy = st.radio(
+        "Select Extraction Strategy",
+        ["Original Strategy", "Unique Indices Strategy"],
+        help="Choose how to extract information from the document"
+    )
+    params = {}
+    if strategy == "Original Strategy":
+        params["strategy"] = "original"
+        params["current_field"] = st.text_input(
+            "Field to Extract",
+            help="Enter the field name to extract from the document"
+        )
+    else:
+        params["strategy"] = "unique_indices"
+        # Get unique indices
+        indices_input = st.text_area(
+            "Unique Indices",
+            help="Enter comma-separated list of indices to look for (e.g., 'peptide, modification, timepoint')"
+        )
+        params["unique_indices"] = [idx.strip() for idx in indices_input.split(",") if idx.strip()]
+        # Get fields to extract
+        fields_input = st.text_area(
+            "Fields to Extract",
+            help="Enter comma-separated list of fields to extract for each combination"
+        )
+        params["fields_to_extract"] = [field.strip() for field in fields_input.split(",") if field.strip()]
+    return strategy, params
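Reviewer note: `render_strategy_selector` returns the display label ("Original Strategy") as its first value but stores a short key ("original") in `params["strategy"]`, while `app.py` and the planner pass the display label around. A minimal sketch of an explicit mapping between the two, assuming these two labels are the only options (the names `STRATEGY_KEYS` and `strategy_key` are hypothetical and not part of this commit):

```python
from typing import Dict

# Display label shown in the radio widget -> short internal key.
STRATEGY_KEYS: Dict[str, str] = {
    "Original Strategy": "original",
    "Unique Indices Strategy": "unique_indices",
}


def strategy_key(label: str) -> str:
    """Translate a UI label into its internal key, failing loudly on unknown labels."""
    try:
        return STRATEGY_KEYS[label]
    except KeyError:
        raise ValueError(f"Unknown strategy label: {label!r}") from None
```

Keeping the mapping in one place would make it harder for a string comparison such as `strategy == "Unique Indices Strategy"` to silently miss the short key.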