Spreadsheet parser consolidation (temporary planning doc for PR #2592)¶
Temporary planning document. Delete once the consolidation in this PR (or the follow-up PR it scopes out) has landed and a decision record has been captured in
/docs/decisions/.
Why this document exists¶
PR #2591's preview-environment testing surfaced an import failure (issue
narrative) whose root cause is that the screening-value
tolerance introduced by PR #2509 (shipped in the April 2026 user-guide
release as "more flexible
screening values") was applied to one of three parser paths. The other
two paths still call (ScreeningDecision) int.Parse(columnStringValue), which
throws FormatException on any non-numeric input. Rather than silently fan the
fix out to the two laggard paths, this doc names the duplication so we can
decide whether to consolidate now or later.
User-facing symptom¶
On the syrf_pr_2591 preview database, a systematic-search upload named
TestWithSecreening to the "Screening In Progress" project transitioned to the
Error state at 2026-04-24T16:21:08. The PM service log shows:
System.FormatException: The input string 'Include' was not in a correct format.
at System.Int32.Parse(String s)
at SpreadsheetParseImplementation.<>c__DisplayClass22_1.<ParseStudiesAsync>b__2
(in SpreadsheetParseImplementation.cs:line 168)
The user-facing impression: "your fix for tolerant screening values isn't actually working on this branch." Reality: the fix never landed on the search-import code path in the first place.
The three parsers¶
All three live under
src/libs/project-management/SyRF.ProjectManagement.Core/Services/ParserImplementations/.
They all contain a ScreeningImportSettings?.UserColumnMap.ForEach(...) block
that interprets a per-row column value as a ScreeningDecision.
1. SpreadsheetParseImplementation.cs — legacy search-import¶
- Directory:
ParserImplementations/(alongside Endnote/Pubmed XML parsers). - Interface:
IStudyParseImplementation(resolved byParserRegistryon the basis ofLibraryType). - Used by:
StudyReferenceFileParser.ParseStudiesAsync— the production code path for systematic-search CSV imports (RIS, SyRF library, EndNote export, etc.). - Screening-value parse:
(ScreeningDecision) int.Parse(columnStringValue)at line 170. Throws on non-numeric input. This is the code path the user hit. - Status in this PR: to be converted to
ScreeningDecisionParser.TryParse.
2. NewParser/StudySpreadsheetRecordProcessor.cs — dead code¶
- Directory:
ParserImplementations/NewParser/. - Interface:
IRecordProcessor<Study>(new-style: reader + processor split viaNewSpreadsheetReader). - Used by: nothing in production. The only callers outside the file itself
are its own test suite (
StudySpreadsheetRecordProcessorTests). - Screening-value parse:
(ScreeningDecision) int.Parse(GetValueFromColumn(recordFields, u.Key))at line 166. Same bug, just not reachable from production. - Status in this PR: to be converted to
ScreeningDecisionParser.TryParseas well, so that if/when we wire it up it inherits the tolerant behaviour. (Not deleting the file here — see "Consolidation options" below.)
3. NewParser/StudyUpdateRecordProcessor.cs — bulk-study-update¶
- Directory:
ParserImplementations/NewParser/. - Interface:
IRecordProcessor<StudyUpdate>. - Used by:
StudyReferenceFileParser.ProcessStreamToUpdates— the production code path for bulk study update CSVs. - Screening-value parse: already uses
ScreeningDecisionParser.TryParsesince PR #2509 (2026-04-07). Surfaces a structured "accepted values" error instead of an unhandled exception. - Status in this PR: unchanged (reference implementation).
Why there are two "spreadsheet" parsers¶
The NewParser/ directory is an in-flight but incomplete migration from a
monolithic IStudyParseImplementation per file format (spreadsheet/endnote/pubmed)
to a cleaner reader-plus-processor architecture
(IRecordReader<TIn> + IRecordProcessor<TIn, TOut>). Evidence:
- The new architecture only has a production consumer for the StudyUpdate processor (the bulk-update flow, added as part of that feature).
StudySpreadsheetRecordProcessorandStudyEndnoteRecordProcessorwere written as planned replacements forSpreadsheetParseImplementationandEndnoteXmlParseImplementation, and both have their own test suites — butStudyReferenceFileParser.ParseStudiesAsyncstill uses the legacyIStudyParseImplementationcollection via.First(spi => spi.HandlesTypes.ContainsKey(searchImportJob.LibraryType)).- Several static field references in
StudySpreadsheetRecordProcessorreach back into the legacy file (SpreadsheetParseImplementation.ReferenceType,SpreadsheetParseImplementation.PublicationName, etc.) because the header constants were never extracted into a sharedSpreadsheetFieldNamesclass.
So today the picture is:
Search import ──► IStudyParseImplementation ──► SpreadsheetParseImplementation
└──► EndnoteXmlParseImplementation
└──► PubmedXmlParseImplementation
Bulk update ──► NewSpreadsheetReader ──► StudyUpdateRecordProcessor
[Orphan] NewSpreadsheetReader ──► StudySpreadsheetRecordProcessor
[Orphan] NewEndnoteReader ──► StudyEndnoteRecordProcessor
Consolidation options¶
Option A — minimal: fan out the tolerant parser, leave structure alone (this PR)¶
- Replace
int.ParsewithScreeningDecisionParser.TryParsein bothSpreadsheetParseImplementation.csandStudySpreadsheetRecordProcessor.cs. - Add per-row
ParseErrorentries (withScreeningDecisionParser.AcceptedValuesDescriptionin the message) when a value can't be parsed, matching the patternStudyUpdateRecordProcessoralready uses. - Extend tests in
SpreadsheetParseImplementationTests.csandStudySpreadsheetRecordProcessorTests.csto cover the same screening-value matrix the bulk-update tests already cover. - Update the April 2026 changelog on
pr2567to note that tolerant screening values now apply to search-library uploads as well as bulk updates.
Pros: closes the user-facing bug in a single small PR; easy to review; low regression risk. Cons: the three copies of the same per-row screening block remain in three files.
Option B — full migration (follow-up PR, scoped here)¶
- Migrate
StudyReferenceFileParser.ParseStudiesAsyncoffIStudyParseImplementationonto theNewSpreadsheetReader/NewEndnoteReader+StudySpreadsheetRecordProcessor/StudyEndnoteRecordProcessorpair. - Extract shared field-name constants from
SpreadsheetParseImplementationinto aSpreadsheetFieldNamesstatic class soStudySpreadsheetRecordProcessorno longer reaches back into the legacy file. - Delete
SpreadsheetParseImplementation,EndnoteXmlParseImplementation, and theIStudyParseImplementation/ParserRegistryscaffolding. - Folding the shared "convert UserColumnMap rows into Screenings" logic into
a single private helper (on
Studyor a small domain service) so it isn't re-implemented by future processors.
Pros: deletes ~280 LOC of genuinely duplicated code; makes the PubMed XML
parser the odd one out (which it already is) so the abstraction has one
shape, not two. Cons: non-trivial migration — the legacy parser has
several behaviours (quoting, throttled IProgress<T> reporting, tracked
agreement thresholds) that need careful translation; ParserRegistry has
opinions about LibraryType that the new-parser path doesn't replicate
today. This is not safe to do under the same merge window as #2591.
Option C — delete dead code, accept the duplication for now¶
- Delete
NewParser/StudySpreadsheetRecordProcessor.csandNewParser/StudyEndnoteRecordProcessor.cs(and their tests) on the grounds that they have no production consumer and no near-term plan to acquire one. - Apply tolerant parsing to only
SpreadsheetParseImplementation.cs.
Pros: closes the immediate bug and removes a pile of dead code in one shot. Cons: any future re-migration to the reader/processor split (Option B) has to re-derive those processors from scratch. Also conflates a bug fix with a deletion decision we haven't explicitly agreed to.
Decision for this PR¶
Option A, with Option B scoped as a follow-up.
Deleting the two NewParser/Study{Spreadsheet,Endnote}RecordProcessor files
(Option C) is tempting given they've sat unused since their introduction, but
they do encode a reasonable interface decomposition for a migration we'll
probably still want to do. Keeping them — and keeping them in sync with
ScreeningDecisionParser — costs a small amount of code duplication for the
duration of this PR; replacing them if we did delete them would cost much
more.
Test plan for Option A (this PR)¶
- Unit: extend
SpreadsheetParseImplementationTestswith tolerant-value parsing cases (1/0, include/exclude, yes/no, true/false, case variants, invalid value produces a structuredParseError, whitespace trimmed). - Unit: mirror those cases in
StudySpreadsheetRecordProcessorTests. - Regression: ensure the existing happy-path tests (integer
0/1values) still pass end-to-end. - Manual verification in the PR-2592 preview: re-upload
TestWithSecreening(with text decisions) and confirm it now reaches theCompletedstate; the resultingpmStudydocuments have the correctIncluded/Excludedscreenings attached.
Follow-ups (not in this PR)¶
- Option B migration — open a separate PR once #2591 and #2592 have merged. Reference this planning doc in the PR description so the historical context isn't lost.
- User-guide note — extend the April 2026 release notes on
pr2567(under "Bulk study update → More flexible screening values") to cover search-library uploads once this PR lands. - ADR — once Option B lands, replace this planning doc with a decision
record under
/docs/decisions/capturing the reader/processor split.