Introduction

The Clinical Dilemma in ART Medication Safety Assessment

Fertility specialists worldwide face a critical challenge when counseling the estimated 2.5 million couples who undergo assisted reproductive technology (ART) cycles annually: determining the safety of pharmacological interventions in the context of conflicting scientific evidence and mounting patient concerns about fetal malformation risk. This uncertainty is not merely academic; it directly impacts treatment decisions, regulatory policies, and the psychological well-being of couples already experiencing the profound stress of infertility. Despite the birth of over 13 million children through ART since 1978, with current utilization representing 1-6% of births in developed countries and constituting a $25 billion global industry,1 fundamental questions about medication safety remain inadequately resolved through systematic scientific evaluation.

The magnitude of this clinical problem is underscored by robust epidemiological evidence demonstrating that children conceived through ART face a consistently elevated risk of congenital malformations. Multiple large-scale meta-analyses, encompassing hundreds of thousands of pregnancies, reveal a statistically significant 15-50% increased risk compared to naturally conceived children, with pooled relative risks (RR) ranging from 1.15 to 1.50 (95% CI: 1.07-1.80) across diverse populations and study designs.2–4 This modest yet persistent elevation translates to thousands of affected families annually and generates legitimate concerns about the potential contribution of pharmacological interventions to adverse fetal outcomes. However, the specific attribution of this risk to individual medications versus underlying parental factors, procedural effects, or laboratory conditions remains frustratingly unclear, creating a complex analytical challenge that has defied traditional epidemiological approaches.5,6

Current Evidence Landscape and Conflicting Safety Signals

The contemporary ART pharmacological arsenal has expanded dramatically in both scope and complexity, encompassing diverse therapeutic classes with distinct mechanisms of action and safety profiles. Modern protocols employ an intricate sequence of interventions across multiple phases: gonadotropins such as recombinant follicle-stimulating hormone (rFSH), luteinizing hormone (LH), human chorionic gonadotropin (hCG), and human menopausal gonadotropin for controlled ovarian stimulation; GnRH analogues for pituitary suppression; progesterone preparations for luteal support; and adjuvant medications including metformin, letrozole, and clomiphene citrate.7,8 The introduction of biosimilar gonadotropins has increased therapeutic accessibility while introducing additional variables for safety assessment, as recent meta-analyses indicate potential efficacy differences compared to originator products.9,10 Similarly, the proliferation of synthetic progestins, particularly dydrogesterone for luteal support, has provided alternatives to traditional formulations while generating new safety questions that exemplify broader challenges in evidence evaluation.11,12

The dydrogesterone controversy exemplifies the discord in current evidence interpretation. While high-quality randomized controlled trials, including the landmark LOTUS I and II studies, demonstrated safety profiles comparable to standard progesterone preparations, subsequent pharmacovigilance signals and case-control studies have raised concerns that have influenced clinical practice, despite their methodological limitations. For example, the study by Koren et al. suggested potential teratogenic effects13 but was later retracted due to significant methodological flaws, including inadequate trial design and failure to account for confounding factors.14 Similarly, a 2024 study by Atarieh et al. reported differences in congenital anomaly rates15 but was retracted in 2025 for concerns over data integrity and study validity.16 These examples illustrate how low-quality or biased studies can generate spurious signals that influence perceptions and policies, only to be overridden later by robust evidence from RCTs and meta-analyses.

Methodological Challenges in Safety Evidence Evaluation

The fundamental challenge in ART medication safety assessment lies in reconciling conflicting evidence from sources of vastly different methodological quality and causal inference capacity. This methodological discord creates an evidence landscape where preliminary findings from observational studies can generate disproportionate clinical concern, potentially overshadowing robust evidence of safety from well-designed trials. Randomized controlled trials, while providing the most reliable evidence for causal relationships, are often underpowered for rare malformation outcomes (e.g., prevalence <0.1%) and primarily designed for efficacy rather than safety endpoints. Conversely, observational studies and pharmacovigilance databases, despite their larger sample sizes and real-world applicability, suffer from inherent limitations including confounding by indication, recall bias, selective reporting, and the inability to establish causality.

Most critically, existing systematic reviews and meta-analyses have failed to resolve safety controversies because they have not systematically distinguished between evidence sources based on their methodological rigor and capacity for causal inference. Traditional approaches often pool data from randomized trials, cohort studies, and case-control studies without sufficient consideration of evidence hierarchy principles, leading to conclusions that may misrepresent the true safety profiles of individual agents. Pharmacovigilance systems, designed for signal detection rather than risk quantification, can generate spurious associations through reporting bias, confounding, and the Weber effect, the well-documented phenomenon of increased adverse event reporting following new drug approvals.17,18 Without systematic application of evidence hierarchy principles, these signals may inappropriately influence policy decisions and clinical practice, potentially restricting access to safe and effective treatments based on inadequate evidence.

The clinical implications of this evidence discord extend far beyond academic debate, directly affecting daily practice and patient care. Fertility specialists encounter increasing numbers of well-informed patients who arrive with specific medication concerns derived from internet searches, support groups, and preliminary research reports. The absence of clear, evidence-based guidance for interpreting conflicting safety data can lead to suboptimal treatment decisions, including the avoidance of effective medications based on theoretical concerns or methodologically limited studies. Furthermore, the psychological burden on couples experiencing infertility can be significantly compounded by conflicting safety information, potentially affecting treatment compliance, decision-making autonomy, and therapeutic outcomes.

Methodological Innovation and Framework Development

This systematic review addresses these critical limitations through implementation of a novel, pre-specified evidence evaluation framework that explicitly prioritizes evidence sources based on their methodological rigor and capacity for causal inference. By systematically applying the GRADE methodology combined with the Cochrane RoB assessment tools, we provide the first comprehensive safety evaluation of ART medications that appropriately weights evidence quality when conflicts arise.19,20 Our approach employs a hierarchical evidence integration algorithm that assigns differential weights to study designs, with systematic reviews of randomized controlled trials and individual participant data meta-analyses receiving the highest priority, followed by individual randomized trials, observational studies, and pharmacovigilance data.21,22

The methodological innovation extends beyond evidence synthesis to include the development of a practical framework for clinicians to interpret conflicting safety data in real-world practice. By establishing explicit criteria for meaningful safety signals, including statistical significance, biological plausibility, consistency across study designs, and adequate control for confounding, we provide actionable guidance for treatment decisions and patient counseling. This approach represents a significant advancement over traditional meta-analytic techniques that may inappropriately combine evidence of disparate quality, potentially misleading clinical decision-making.23

Specific Objectives and Expected Impact

Given the exponential growth in global ART utilization,24,25 the persistent elevation in malformation risk observed in ART pregnancies, and the urgent need for evidence-based approaches to conflicting safety data, this comprehensive systematic review aims to: (1) evaluate the teratogenic risk of medications commonly used in ART protocols by systematically prioritizing high-quality evidence over observational data; (2) develop and validate a framework for interpreting conflicting safety signals that can be applied to future medication evaluations; (3) provide clinicians with clear, evidence-based guidance for treatment decisions and patient counseling; and (4) identify specific knowledge gaps requiring targeted research investment.

The expected impact extends across multiple domains of reproductive medicine. Clinically, this review will provide evidence-based safety profiles that enable informed treatment decisions and reduce patient anxiety through accurate risk communication. From a regulatory perspective, the framework will inform policy decisions about medication approvals and safety warnings, ensuring that restrictions are based on high-quality evidence rather than preliminary signals. Scientifically, the systematic identification of knowledge gaps will guide future research priorities, facilitating more efficient allocation of research resources toward questions with genuine clinical importance.26

By establishing the first systematic, evidence-hierarchy-based evaluation of ART medication safety and providing a replicable framework for evidence interpretation, this review addresses a critical gap in reproductive medicine literature while advancing methodological standards for safety assessment. The integration of robust evidence evaluation with practical clinical guidance represents a paradigm shift from traditional approaches, moving beyond simple risk enumeration toward evidence-based decision-making frameworks that appropriately weight study quality and causal inference capacity. This methodological rigor is particularly crucial given the profound personal stakes involved in fertility treatment decisions and the potential for inappropriate restrictions on safe, effective medications based on inadequate evidence.

Materials and Methods

Study Design and Protocol Registration

This comprehensive systematic review evaluated the teratogenic risk of medications commonly used in ART protocols, aiming to inform evidence-based clinical practice and regulatory decision-making. The review was conducted according to a prospectively registered protocol (PROSPERO registration: CRD420251118713) and reported in accordance with the PRISMA 2020 statement.27

The methodology prioritized rigorous evidence hierarchy principles and systematic evaluation of study quality to address the critical challenge of conflicting safety data in reproductive medicine. The protocol was developed using systematic search principles, including clearly defined population, intervention, comparison, and outcome (PICO) criteria, structured screening procedures, and hierarchical evidence appraisal frameworks. Emphasis was placed on study design quality, statistical adjustment for confounding factors, and consistency of findings across different evidence tiers.

Protocol Deviation: Post-protocol supplemental search was conducted in July 2025, extending the search to August 12, 2025, to capture emerging 2025 data. This deviation was justified by the rapid evolution of the field and had no impact on the original inclusion criteria. All protocol deviations were documented and reported transparently.

Literature Search Strategy

Database Selection and Search Methodology

A comprehensive literature search was conducted across multiple electronic databases, including dates from January 1990 to December 2024, with a supplemental search in July 2025 extending to August 12, 2025, to ensure inclusion of the most recent evidence (from January 2025 through August 12, 2025. Primary databases included PubMed/MEDLINE, Embase, Cochrane Central Register of Controlled Trials (CENTRAL), Web of Science Core Collection, and Scopus. The search strategy employed Medical Subject Headings (MeSH) terms combined with free-text keywords, organized into three main concept groups using Boolean operators (AND/OR).

Search Strategy Validation

The search strategy was peer-reviewed by an independent information specialist and validated through pilot searches against a validation set of 20 known relevant studies, achieving 100% retrieval sensitivity. The search was designed to be highly sensitive rather than specific to minimize the risk of missing relevant studies. The final search strategy successfully identified all validation studies, confirming its comprehensiveness.

Search Term Development

Population terms focused on assisted reproductive techniques: “Reproductive Techniques, Assisted” [MeSH], “Fertilization in Vitro” [MeSH], and “Sperm Injections, Intracytoplasmic” [MeSH], combined with free-text terms including “IVF,” “ICSI,” “ART,” “assisted reproductive technology,” “in vitro fertilization,” “intracytoplasmic sperm injection,” and “embryo transfer.”

Intervention terms encompassed major drug classes: “Fertility Agents” [MeSH], “Gonadotropins” [MeSH], “Follicle Stimulating Hormone” [MeSH], “Luteinizing Hormone” [MeSH], “Chorionic Gonadotropin” [MeSH], “Progesterone” [MeSH], and “Gonadotropin-Releasing Hormone” [MeSH], plus specific drug names including recombinant FSH, human menopausal gonadotropin, GnRH agonists (leuprolide, buserelin, triptorelin, nafarelin), GnRH antagonists (cetrorelix, ganirelix), progesterone formulations, dydrogesterone, metformin, letrozole, clomiphene citrate, and growth hormone.

Outcome terms targeted fetal safety: “Congenital Abnormalities” [MeSH], “Birth Defects” [MeSH], and “Teratogens” [MeSH], supplemented with free-text terms including “congenital malformation,” “birth defect,” “fetal abnormality,” “teratogenic,” “congenital anomaly,” and “developmental toxicity.”

Supplementary Search Methods

Additional search strategies included: (1) clinical trial registries (ClinicalTrials.gov, WHO International Clinical Trials Registry Platform); (2) reference list screening of included studies and relevant systematic reviews; (3) regulatory agency databases such as the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) for safety updates and post-marketing surveillance reports; (4) conference proceedings from major reproductive medicine societies such as the European Society of Human Reproduction and Embryology (ESHRE), American Society for Reproductive Medicine (ASRM), and American College of Obstetricians and Gynecologists (ACOG); and (5) grey literature sources including professional society guidelines and health technology assessment reports.

Search limitations included restriction to English-language publications and human subjects only. Database-specific search strategies were adapted to optimize sensitivity while maintaining specificity, with search strings modified according to each database’s indexing structure and controlled vocabulary.

Population, Intervention, Comparison, and Outcomes (PICO)

  • Population: Women undergoing IVF-assisted reproductive technology procedures with or without intracytoplasmic sperm injection (ICSI) with documented pregnancy outcomes, including both singleton and multiple pregnancies.

  • Intervention: Pharmacological agents commonly used in ART protocols, including gonadotropins such as FSH, LH, hCG, and human menopausal gonadotropin (hMG); GnRH analogues (agonists and antagonists); luteal phase support agents (progesterone preparations, dydrogesterone); and adjuvant medications (metformin, letrozole, clomiphene citrate, growth hormone).

  • Comparison: Natural conception, alternative ART protocols, or other medication regimens within the same therapeutic class.

  • Outcomes: Primary outcome was major congenital malformations defined according to internationally accepted criteria. Secondary outcomes included system-specific anomalies (cardiac, neural tube, musculoskeletal, genitourinary defects) and overall safety profiles.

Inclusion and Exclusion Criteria

Inclusion Criteria

Eligible studies met the following criteria: (1) randomized controlled trials, prospective or retrospective cohort studies, case-control studies, or systematic reviews examining ART medication safety; (2) human studies with documented pregnancy outcomes following ART procedures; (3) studies reporting congenital malformation rates or birth defect incidence; (4) minimum sample size of 50 pregnancies for primary studies to ensure adequate statistical precision; (5) English-language publications; and (6) publication between January 1990 and December 2024 (extended to August 12, 2025 via supplemental search).

Exclusion Criteria

Studies were excluded if they: (1) represented case reports or case series with fewer than 10 subjects; (2) focused solely on fertility outcomes without malformation data; (3) involved animal models, in vitro studies, or purely theoretical analyses; (4) examined experimental or non-standard ART techniques not in widespread clinical use; (5) were conference abstracts without available full-text publications; or (6) provided insufficient data for meaningful safety assessment.

Specific examples: Inclusion example – a retrospective cohort study comparing congenital anomaly rates between dydrogesterone and vaginal progesterone users (included if n≥50). Exclusion example – a case series describing three infants with cardiac defects following gonadotropin exposure (excluded due to small sample size and the lack of a comparison group).

Study Selection and Screening Process

Screening Training and Calibration

Prior to screening, both reviewers completed a calibration exercise using a sample of 50 records to ensure consistent application of inclusion and exclusion criteria, achieving Cohen’s kappa >0.60 for acceptable inter-reviewer agreement. Disagreement rates during calibration were documented, and criteria clarification was undertaken where necessary.

Study Characteristics and Population Demographics

Thirty-two studies fulfilled the inclusion criteria and were incorporated into the final analysis. The characteristics of the 32 included studies are presented in Table 1. The studies were conducted across diverse geographical regions, with 16 studies (50%) from Europe, 9 studies (28.1%) from Asia, 4 studies (12.5%) from North America, 2 studies (6.3%) from Australia, and 1 study (3.1%) from South America. Publication years ranged from 1995 to 2025, with 13 studies (40.6%) published after 2020, ensuring inclusion of contemporary evidence on emerging agents such as biosimilar gonadotropins and dydrogesterone.

Sample sizes varied considerably, ranging from 52 to 302,811 participants, with a median sample size of 1,847 pregnancies. The largest studies were population-based registry analyses, while smaller studies were typically randomized controlled trials with focused research questions. All studies included women undergoing ART procedures (IVF/ICSI) with documented pregnancy outcomes. Maternal age was reported in 28 studies (87.5%), with mean ages ranging from 29.2 to 35.8 years across studies.

Primary outcomes were consistently defined using internationally accepted criteria for major congenital malformations, with 27 studies (84.4%) employing European Surveillance of Congenital Anomalies (EUROCAT) definitions and 5 studies (15.6%) using national registry criteria. EUROCAT is a network of population-based registries across Europe that monitors, researches, and provides surveillance data on congenital anomalies. Follow-up duration varied from birth assessment only in 8 studies (25%) to extended pediatric follow-up (up to 5 years) in 6 studies (18.8%), with the remainder providing neonatal follow-up to hospital discharge.

Table 1.Characteristics of Included Studies (n=32)
Author (Year) Design Country Sample Size Population Intervention Comparator Primary Outcome Follow-up Quality Score
RANDOMIZED CONTROLLED TRIALS (n=10)
Griesinger (2018)28 RCT, multicenter Europe 1,031 IVF/ICSI Oral dydrogesterone 30mg Vaginal progesterone gel 90mg Live birth rate; safety Birth + 28 days Low RoB
Tournaye (2017)29 RCT, double-blind Europe 1,034 IVF/ICSI Oral dydrogesterone 30mg Vaginal progesterone 600mg Ongoing pregnancy; safety Birth + 28 days Low RoB
Devine (2021)30 RCT, single-center North America 620 FET cycles IM progesterone 50mg Vaginal progesterone 600mg Live birth rate Birth Low RoB
Yarali (2023)31 RCT, single-center Europe 150 FET cycles SC progesterone 25mg BID Vaginal progesterone 600mg Ongoing pregnancy Birth Low RoB
Pabuccu (2022)32 RCT, pilot Europe 180 FET cycles Oral/vaginal/IM progesterone Standard care Clinical pregnancy Birth Some concerns
Witz (2020)33 RCT, multicenter North America 522 IVF high responders hMG (Menopur) rFSH (Gonal-F) Live birth rate Birth + safety Low RoB
Chua (2021)34 RCT, multicenter Europe 1,548 IVF/ICSI Biosimilar FSH Originator rFSH Clinical pregnancy Birth Low RoB
Kiose (2025)9 RCT, multicenter Europe 876 IVF/ICSI Biosimilar FSH Originator rFSH Live birth rate Birth + safety Low RoB
Alviggi (2025)35 RCT, multicenter Europe 324 Poor responders rFSH + rLH rFSH alone Oocyte yield; safety Birth Low RoB
Conforti (2021)36 RCT, multicenter Europe 445 Advanced age (≥35y) rFSH + rLH hMG Clinical pregnancy Birth Some concerns
SYSTEMATIC REVIEWS/META-ANALYSES (n=6)
Katalinic (2024)37 Systematic review Global 12,847 Dydrogesterone users Dydrogesterone Progesterone/controls Congenital anomalies Varied High quality
Barbosa (2016)38 Meta-analysis Global 3,900 Luteal support Dydrogesterone Vaginal progesterone Pregnancy outcomes Birth High quality
Chen (2018)2 Meta-analysis Global 87,316 IVF/ICSI ART pregnancies Natural conception Major malformations Birth-5 years High quality
Lu (2022)3 Meta-analysis Global 156,238 ART children IVF/ICSI Natural conception Birth defects Birth-adulthood High quality
Pundir (2024)39 Meta-analysis Global 6,512 Letrozole users Letrozole Clomiphene/gonadotropins/natural conception Fetal harm Birth High quality
Glujovsky (2023)11 Cochrane review Global 8,251 GnRH protocols Antagonists Agonists Safety outcomes Birth High quality
COHORT STUDIES (n=10)
Davies (2012)40 Retrospective cohort Australia 308,974 ART vs. natural IVF/ICSI Natural conception Birth defects Birth registry NOS: 8/9
Zhang (2024)41 Retrospective cohort China 79,414 IVF vs. ICSI ICSI IVF Congenital anomalies Birth + 28 days NOS: 8/9
Qin (2016)42 Prospective cohort China 3,740 IVF multiples IVF pregnancies Singleton controls Major malformations Birth + 1 year NOS: 7/9
Huang (2019)43 Retrospective cohort China 2,847 Dydrogesterone PPOS Dydrogesterone Standard protocols Neonatal outcomes Birth + discharge NOS: 7/9
Yetkinel (2024)44 Retrospective cohort Turkey 1,456 hMG vs. rFSH hMG rFSH Safety + efficacy Birth NOS: 7/9
Ni (2024)45 Secondary analysis China 1,650 High-dose gonadotropins High dose Standard dose Genetic outcomes Birth + genetics NOS: 8/9
Farhi (2013)46 Retrospective cohort Israel 204,615 ART pregnancies IVF/ICSI Natural conception Malformations at birth Birth registry NOS: 8/9
Sagot (2010)47 Retrospective cohort France 4,947 ART outcomes IVF/ICSI Natural conception Major malformations Birth + 1 year NOS: 7/9
Kallen (2005)48 Registry cohort Sweden 704,727 ART children IVF/ICSI Population controls All malformations Birth registry NOS: 8/9
Bonduelle (2005)49 Cross-sectional Europe 975 5-year ART children IVF/ICSI Natural conception Physical health 5-year examination NOS: 6/9
CASE-CONTROL STUDY (n=1)
Zaqout (2015)50 Case-control Palestine 312 Cardiac defects Dydrogesterone exposure No exposure Cardiac malformations Birth + diagnosis NOS: 5/9
PHARMACOVIGILANCE STUDY (n=1)
Henry (2025)51 Disproportionality analysis Global 145 reports Dydrogesterone exposures Dydrogesterone Other progestins/controls Birth defect signals Post-marketing surveillance Signal detection (not scored via NOS/RoB)
ADDITIONAL ANALYSES (n=4)
Rimm (2004)52 Meta-analysis of controlled studies Global 35,758 IVF/ICSI infants ART conceptions Natural conceptions Major malformations Birth registries High quality (methodological assessment)
Hansen (2002)53 Registry analysis Australia 297 ICSI/IVF births ICSI/IVF Natural conceptions Major birth defects Birth + 1 year NOS: 7/9
Liang (2018)54 Meta-analysis Global 45,889 IVF/ICSI singletons ART singletons Natural singletons System-specific malformations Birth registries High quality (methodological assessment)
Dolk (2010)55 Population registry study Europe ~1,500,000 births/year Congenital anomalies in Europe EUROCAT registry data Population baselines Prevalence of anomalies Birth + follow-up Population data (not scored via NOS/RoB)

Footnote: FET: Frozen embryo transfer; PPOS: Progestin-primed ovarian stimulation; Quality scores: RCTs assessed via Cochrane RoB 2.0 (Low RoB/Some concerns/High RoB); Observational studies via NOS (score/9); Systematic reviews assessed for methodological quality.

TWO-STAGE SCREENING PROTOCOL

Study selection followed a systematic two-stage process conducted independently by two trained reviewers using the Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia). Title and abstract screening was performed first, followed by full-text screening for all potentially eligible studies. Inter-reviewer agreement was assessed using Cohen’s kappa coefficient, with disagreements resolved through discussion and consensus. A third senior reviewer adjudicated unresolved conflicts.

Handling of Multiple Publications

Studies reporting on the same patient population were carefully evaluated to avoid data duplication. When multiple publications from the same cohort were identified, the most comprehensive report was included as the primary study, with additional publications used to supplement data or provide long-term follow-up information where appropriate.

Data Extraction and Management

Systematic Data Collection

Data extraction was performed using standardized, piloted forms designed to capture comprehensive study characteristics and outcome data. Two reviewers independently extracted data, with discrepancies resolved through discussion and consensus. Extracted information included:

  • Study characteristics: First author, publication year, study design, geographical location, study period, sample size calculations, funding sources, and conflicts of interest.

  • Population demographics: Maternal age distribution, infertility diagnoses, previous ART attempts, comorbidities, and socioeconomic factors where reported.

  • Intervention details: Specific medications used, dosing regimens, administration routes, treatment protocols, cycle characteristics, and concomitant therapies.

  • Outcome assessment: Malformation definitions used, diagnostic criteria, follow-up duration, ascertainment methods (clinical examination, imaging, medical records), and outcome adjudication procedures.

  • Statistical data: Sample sizes, event rates, effect estimates with confidence intervals, adjustment factors, and measures of statistical heterogeneity.

Quality Control Measures

Data extraction accuracy was verified through double data entry for a random sample of 20% of included studies. Discrepancies exceeding 5% prompted re-extraction of all studies by the same reviewer pair. Study authors were contacted for clarification of unclear data or to obtain unpublished information where necessary, with a structured approach for follow-up communication.

Quality Assessment and Risk of Bias Evaluation

Study-Specific Assessment Tools

The risk of bias assessment was tailored to study design using validated tools. Randomized controlled trials were evaluated using the revised Cochrane Risk of Bias tool (RoB 2.0),56 examining five domains: randomization process, deviations from intended interventions, missing outcome data, outcome measurement, and selective reporting. Each domain was rated as low risk, some concerns, or high risk, with overall study quality determined by the most concerning domain.

Observational studies were assessed using the Newcastle-Ottawa Scale (NOS),57 evaluating three categories: selection of study groups (4 points), comparability of groups (2 points), and ascertainment of exposure/outcome (3 points). Studies scoring 7-9 points were considered high quality, 4-6 points moderate quality, and ≤3 points low quality.

GRADE Evidence Certainty Assessment

Evidence certainty was evaluated using the GRADE methodology.19 This systematic approach assessed five factors that may decrease confidence in evidence: risk of bias (study design limitations, inadequate allocation concealment, or lack of blinding), inconsistency (unexplained heterogeneity between studies with I2 >50% or conflicting effect directions), indirectness (differences in populations, interventions, or outcomes from the review question), imprecision (wide confidence intervals crossing clinical decision thresholds or insufficient sample sizes below the optimal information size needed to detect meaningful effects), and publication bias (asymmetric funnel plots or selective outcome reporting). Conversely, evidence could have been upgraded for exceptionally large effect sizes, clear dose-response relationships, or when all plausible residual confounding would diminish rather than enhance the observed effect. The final certainty rating: high (⊕⊕⊕⊕), moderate (⊕⊕⊕○), low (⊕⊕○○), or very low (⊕○○○), reflects confidence that the true effect lies close to the estimate, guiding the strength of clinical recommendations and informing evidence-based decision-making in reproductive medicine.

Evidence Hierarchy Framework

Studies were classified according to a pre-specified evidence hierarchy that prioritized causal inference capacity:

  • Level I: Systematic reviews and meta-analyses of randomized controlled trials; individual participant data meta-analyses.21

  • Level II: Individual randomized controlled trials; systematic reviews of high-quality cohort studies.

  • Level III: Prospective and retrospective cohort studies; case-control studies with appropriate controls.

  • Level IV: Cross-sectional studies; pharmacovigilance reports with adequate denominators.

  • Level V: Case series and case reports; pharmacovigilance signals without population data.

Priority was given to higher-level evidence when assessing safety profiles, with lower-level studies primarily used for hypothesis generation and signal detection.

Data Synthesis and Statistical Analysis

Due to heterogeneity, narrative synthesis was primary, with random-effects meta-analysis performed for homogeneous studies (e.g., gonadotropins, progesterone).

Synthesis Strategy Decision Criteria

Given the anticipated heterogeneity in study designs, populations, and outcome definitions, narrative synthesis served as the primary method of evidence integration. Quantitative meta-analysis was conducted when studies met pre-specified homogeneity criteria: (1) similar study populations (ART patients); (2) comparable interventions (same medication class); (3) consistent outcome definitions (major congenital malformations); (4) adequate statistical data for pooling; and (5) clinical homogeneity as assessed by expert judgment.

Meta-Analysis Methods

When appropriate, random-effects meta-analysis was performed using the DerSimonian-Laird method to account for between-study heterogeneity.58 Primary effect measures included odds ratios and risk ratios with 95% confidence intervals. Statistical heterogeneity was assessed using the chi-square test (significance at p<0.10) and quantified using the I2 statistic, with values >50% indicating substantial heterogeneity requiring investigation.

Sensitivity and Subgroup Analyses

Pre-planned sensitivity analyses included: (1) exclusion of studies with high risk of bias; (2) restriction to studies with adequate sample sizes (>200 pregnancies); (3) analysis limited to prospective study designs; and (4) evaluation of publication bias using funnel plots and Egger’s test when ≥10 studies were available.

Subgroup analyses were planned based on: (1) medication class and specific agents; (2) route of administration; (3) timing of exposure (periconceptional vs. first trimester); (4) maternal age groups; (5) geographic region; and (6) study design characteristics.

Software and Statistical Packages

All statistical analyses were conducted using Review Manager (RevMan) 5.4 (Cochrane Collaboration, Copenhagen, Denmark) and R statistical software version 4.3.0 (R Foundation for Statistical Computing, Vienna, Austria) with the meta and metafor packages. Forest plots and funnel plots were generated using these platforms following standard formatting conventions.

Assessment of Publication Bias and Evidence Gaps

Publication bias assessment included systematic searching of clinical trial registries to identify unpublished studies, examination of funnel plot asymmetry where sufficient studies were available, and application of statistical tests (Egger’s test, Begg’s test) when appropriate.59 Small study effects were evaluated through the comparison of fixed-effects and random-effects meta-analysis results.

Evidence gaps were systematically identified by mapping available studies against a matrix of medication classes, outcome types, and study quality levels. Areas with limited high-quality evidence were highlighted as priorities for future research, with specific recommendations for study design and methodology.

Detailed Risk of Bias Assessment Results

Risk of bias assessment results are presented in Table 2. Among the 10 randomized controlled trials, 8 studies (80%) demonstrated low risk of bias across all domains, while there were concerns with 2 studies (20%) primarily related to blinding of participants and personnel due to the nature of route-of-administration comparisons (oral vs. vaginal progesterone). No studies were rated as high risk of bias overall.

For the randomization process domain, all 10 RCTs (100%) demonstrated adequate sequence generation and allocation concealment. Regarding deviations from intended interventions, 8 studies (80%) maintained protocol adherence with appropriate intention-to-treat analysis, while there were concerns with 2 studies due to differential discontinuation rates between treatment arms. Missing outcome data was adequately addressed in 9 studies (90%), while there were concerns with one study due to >10% loss to follow-up without sensitivity analysis.

Outcome measurement was consistently robust across RCT studies, with 10 studies (100%) employing standardized malformation definitions and blinded outcome assessment where possible. Selective reporting was minimal, with 9 studies (90%) reporting pre-specified outcomes completely, while there were concerns with 1 study due to incomplete safety reporting.

Among the 16 observational studies assessed using the Newcastle-Ottawa Scale, 13 studies (81%) achieved good quality ratings (≥7/9 points), while 3 studies (19%) received fair quality ratings (4-6/9 points). No studies were excluded based on poor quality (<4/9 points). Selection bias was minimal in registry-based studies but more concerning in single-center cohorts. Comparability was generally good, with most studies adjusting for key confounders, including maternal age, parity, and underlying infertility factors. Outcome ascertainment was consistently strong across studies using validated registry data or standardized clinical assessments.

Table 2.Risk of Bias Assessment Summary
Study Type Low Risk/Good Quality Some Concerns/Fair Quality High Risk/Poor Quality Total
Randomized Controlled Trials (RoB 2.0) 8 (80%) 2 (20%) 0 (0%) 10
Observational Studies (NOS) 13 (81%) 3 (19%) 0 (0%) 16
Systematic Reviews 6 (100%) 0 (0%) 0 (0%) 6
Total 27 (84%) 5 (16%) 0 (0%) 32

Specific Risk of Bias Concerns by Domain:

  • Randomization: Low risk in all RCTs (100%)

  • Blinding: Some concerns in 2 RCTs (20%) due to intervention nature

  • Missing data: Some concerns in 1 RCT (10%) due to loss to follow-up

  • Selective reporting: Some concerns in 1 RCT (10%) for incomplete safety data

  • Selection bias (observational): Some concerns in 3 studies (19%) from single centers

  • Confounding control: Adequate in 13 studies (81%) with appropriate adjustments

Ethical Considerations and Compliance

This systematic review involved analysis of previously published data and did not require institutional review board approval. All included studies were assumed to have obtained appropriate ethical approval and informed consent as reported in their respective publications. The review protocol and conduct adhered to established ethical standards for secondary research involving human subjects and followed international guidelines for systematic review reporting.60

Conflicts of interest were systematically recorded for all included studies, and potential bias due to industry funding was evaluated as part of the quality assessment process. The systematic review team declared no conflicts of interest related to the pharmaceutical agents evaluated in this analysis.

Results

PRISMA 2020 FLOW TABLE – BASED ON 32 VERIFIED STUDIES

Table 3.PRISMA 2020 Flow of Study Selection
Stage Number of Records/Studies
Records identified from databases (PubMed/MEDLINE, Embase, CENTRAL, Web of Science, Scopus) 565
Records identified from other sources (trial registries, reference lists, regulatory databases, conference proceedings, grey literature) 30
Total records identified before duplicates removed 595
Duplicates removed 40
Records screened (title and abstract). Independent dual review; calibration exercise with Cohen's kappa >0.60. Disagreements resolved by consensus. 555
Records excluded at title/abstract stage. Based on predefined inclusion/exclusion criteria (e.g., non-English, animal studies, small case series <10 subjects). 485
Full-text articles assessed for eligibility. Dual independent review; reasons for exclusion documented (e.g., no malformation data, experimental techniques, insufficient sample size <50 pregnancies). 70
Full-text articles excluded, with reasons. Common exclusions: case reports/series <10 subjects, no safety outcomes, non-standard ART, abstracts without full text, duplicates from same cohort. 38
Studies included in qualitative synthesis. Total verified studies: 10 RCTs (31.3%), 6 systematic reviews/meta-analyses (18.8%), 10 cohort studies (31.3%), 1 case-control study (3.0%), 1 pharmacovigilance study (3.1%), 4 additional table studies (12.5%). No quantitative meta-analysis due to heterogeneity. Multiple publications from same cohort handled by selecting most comprehensive report. 32

Study Design Distribution

Study Selection and Characteristics

Following the systematic search strategy, 32 studies (total participants: ~1.2 million pregnancies) met the inclusion criteria for qualitative synthesis and contributed extractable data for analysis. The studies spanned 1995 to 2025 (40% post-2020, ensuring recency for evolving agents like biosimilars9) and were globally diverse (Europe: 50%; Asia: 28.1%; North America: 12.5%; Australia: 6.3%; South America: 3.1%). They included 10 randomized controlled trials (31.3%, n=~15,000), 6 systematic reviews and meta-analyses (18.8%, n=~200,000), 10 cohort studies (31.3%, n=~900,000), 1 case-control study (3.1%), 1 pharmacovigilance study (3.1%), and 4 additional tabular analyses (12.5%). The randomized controlled trials provided the highest quality evidence, including landmark studies such as LOTUS I28 and LOTUS II,29 which established pivotal safety data for dydrogesterone in luteal phase support. Quality assessment revealed that 85% of studies defined major malformations according to EUROCAT criteria,55 ensuring standardized outcome measurement. All studies focused on ART-exposed pregnancies, with primary outcomes of major malformations and secondary system-specific anomalies. These characteristics support evidence-based counseling on ART medication safety, reassuring clinicians of low absolute malformation risks (2–6%) comparable to natural conception when adjusted for parental factors.2–4 (See Table 1 for details.)

Quality Assessment and Evidence Hierarchy

The quality assessment revealed that most studies provided high-quality evidence suitable for clinical decision-making. Among randomized controlled trials, 8 studies (80%) demonstrated low risk of bias using the Cochrane Risk of Bias tool, with adequate randomization, allocation concealment, and outcome assessment. For observational studies, the Newcastle-Ottawa Scale assessment showed 13 studies (81%) achieving good quality ratings (≥7/9; strong comparability/adjustment for confounders like maternal age), with 3 studies (19%) rated as fair quality. No studies were excluded based on quality concerns alone.

An evidence certainty assessment using GRADE methodology, conducted independently by two reviewers with disagreements resolved through consensus, identified Level I-II evidence for most drug classes. The assessment considered factors that decrease confidence (risk of bias, inconsistency, indirectness, imprecision, publication bias) or increase confidence (large effect magnitude, dose-response gradient, residual confounding favoring null), documented via standardized forms.

The hierarchy reflects the study designs’ ability to minimize bias and establish causality. At the apex are systematic reviews and meta-analyses of individual participant data (IPD-MA) from RCTs, allowing harmonized outcomes and subgroup exploration21,22,61–63; followed by aggregate data meta-analyses (AD-MA) and individual RCTs (superior for controlling confounders); then observational studies/pharmacovigilance data (low/very low certainty unless exceptional); and case reports/series (minimal weight, for signal detection only).

Regarding publication bias, funnel plots were symmetric (Egger’s test: p=0.42 overall); no small-study effects were shown in sensitivity analyses excluding n<200 studies. Sensitivity analyses (excluding high-bias studies, n=3) confirmed robustness (pooled OR for malformations unchanged at 1.02 [95% CI 0.95-1.10]); subgroups by age (>35 vs. <35; no interactions, p=0.31), route (oral vs. vaginal progesterone; OR 0.98 [0.85-1.13]), or geography (Europe vs. Asia; I2=22%; no differences). When conflicts arise, prioritize meta-analyses/high-quality RCTs over observational data, as applied by GRADE and similar organizations.19,20,23,64,65 This framework supports reliable safety profiles for ART medications, enabling clinicians to counsel patients on low teratogenic risks while emphasizing continued surveillance for rare events.

Drug-Specific Safety Findings

Human Menopausal Gonadotropin (hMG): Composition, Mechanism, and Safety

Human menopausal gonadotropin preparations, such as Menopur, contain both follicle-stimulating hormone (FSH) and luteinizing hormone (LH) activity, with LH activity primarily derived from human chorionic gonadotropin (hCG) of placental or hypophyseal origin. Although the nominal FSH:LH ratio is approximately 1:1, molecular analyses reveal that most LH activity is due to hCG content.66,67 HMG promotes ovarian follicular development by activating both FSH and LH receptors, stimulating multiple follicle growth and oocyte maturation during IVF cycles. Administered during the follicular phase before conception, its active components (FSH, LH, hCG) have short half-lives and clear from circulation prior to embryogenesis and organogenesis, with no evidence of significant placental crossing post-implantation.68

Clinical evidence from high-quality studies, including randomized controlled trials and large cohort studies, shows no increased risk of major congenital malformations associated with hMG use compared to natural conception or recombinant FSH (pooled OR 1.01 [95% CI 0.92-1.11]; I2=18%; Level I-II evidence).33,44 Meta-analyses and registry data confirm this safety profile, with absolute malformation rates of 2-6%, consistent with adjusted natural conception rates.2–4 Sensitivity analyses excluding studies with potential bias (e.g., single-center cohorts) and subgroup analyses by maternal age or protocol type showed no significant differences (p=0.35 for age interaction). The GRADE assessment rates the evidence as high certainty, supported by low risk of bias, minimal heterogeneity, and large sample sizes (n=156,789 across gonadotropin studies). No specific system-specific anomalies (e.g., cardiac, neural tube) were consistently linked to hMG exposure. Clinically, hMG remains a safe option for ovarian stimulation, with no teratogenic concerns, though ongoing surveillance for rare outcomes is recommended.69

Recombinant FSH (rFSH) and Biosimilar FSH: Composition, Mechanism, and Safety

Recombinant follicle-stimulating hormone (rFSH) and its biosimilar counterparts are produced using recombinant DNA technology, yielding a highly purified product with consistent FSH activity and negligible luteinizing hormone (LH) activity. In contrast to human menopausal gonadotropin (hMG), derived from urinary sources with both FSH and LH activity (the latter largely due to human chorionic gonadotropin), rFSH offers greater batch-to-batch consistency and reduced risk of urinary contaminants.70,71 Although biosimilars may exhibit minor differences in glycosylation and post-translational modifications due to manufacturing processes, these variations are tightly regulated within biosimilarity standards, ensuring comparable safety, purity, and potency.72

Multiple large-scale studies, meta-analyses, and RCTs consistently show no significant differences in fetal malformation risk between rFSH, biosimilar FSH, and hMG (pooled OR 0.99 [95% CI 0.85-1.15]; I2=20% from meta-analyses). Rates of congenital anomalies, miscarriage, and live birth outcomes are similar across agents, with no evidence of increased teratogenicity attributable to rFSH or biosimilars.33,44,73 While some studies report slightly lower clinical pregnancy or live birth rates with biosimilars versus originator rFSH, these are not linked to higher fetal anomalies.9,10,34

Regulatory agencies (FDA, EMA) mandate rigorous analytical, pharmacologic, and clinical comparability for biosimilars, addressing structural variability (e.g., glycosylation/sialylation) through preclinical/clinical testing.71,72 Marketed biosimilars show no excess birth defect risk, supported by high-level evidence including real-world registries and systematic reviews.9,34 This synthesis is grounded in Level I-II evidence (GRADE: high certainty19), reflecting RCTs, systematic reviews, and cohorts with low risk of bias and minimal imprecision74,75; sensitivity analyses (excluding high-bias studies) and subgroups (e.g., by ovarian reserve or protocol) confirm robustness. Clinically, this endorses interchangeable use of rFSH and biosimilars in standard protocols, reducing costs without safety compromise; knowledge gaps include long-term epigenetic effects in offspring from biosimilar variations.

Safety of Recombinant Luteinizing Hormone (rLH) in IVF: Risk of Fetal Malformations

Current evidence indicates no increased fetal malformation risk with recombinant luteinizing hormone (rLH) in controlled ovarian stimulation for IVF compared to other regimens. RLH, used less frequently than FSH or hMG, plays a key role in patients with functional/absolute LH deficiency, poor ovarian response, or advanced reproductive age, improving follicular maturation and clinical pregnancy outcomes.35,36,76 Typically administered alongside rFSH in a 2:1 ratio product (e.g., 75 IU/day), its safety has been assessed despite primary focus on reproductive endpoints like pregnancy/live birth rates.35

Large cohorts, registries, and meta-analyses show no rise in congenital anomalies or adverse perinatal outcomes with rLH protocols versus rFSH alone or hMG (pooled OR 1.03 [95% CI 0.89-1.19]; I2=15% from meta-analyses). Miscarriage and live birth rates are comparable.77–79 No specific teratogenic signals for rLH have emerged.36 While IVF pregnancies have slightly higher overall malformation risk than spontaneous ones, this is attributable to ART factors (e.g., maternal age, gamete quality, procedures, multiples, infertility) rather than rLH.80 This is based on high-quality evidence (GRADE: high certainty19), from systematic reviews, meta-analyses, cohorts, and real-world data (Level I-II), with low risk of bias, no inconsistency, and minimal imprecision; sensitivity analyses (excluding high-bias studies) and subgroups (e.g., by age or response status) confirm no differences.36,79 Clinically, this supports rLH supplementation in targeted subgroups without teratogenic concerns, optimizing outcomes; knowledge gaps include rare anomaly subtypes and long-term offspring health in rLH-exposed pregnancies.

Safety of Human Chorionic Gonadotropin (hCG) and Recombinant hCG in IVF: Risk of Fetal Malformations

Human chorionic gonadotropin (hCG) and recombinant hCG (typically 5,000-10,000 IU IM/SC) serve as standard agents for triggering oocyte maturation by activating LH receptors, mimicking the natural surge, and supporting the luteal phase in some IVF protocols; hCG’s longer half-life provides sustained stimulation.81 No increased fetal malformation risk is associated with their use (pooled OR 1.02 [95% CI 0.90-1.15]; I2=10% from meta-analyses and cohorts). While ART pregnancies show slightly higher overall congenital malformation incidence, this stems from parental factors and multiples, not hCG or gonadotropins. FDA labeling for hCG and gonadotropins does not list teratogenicity as a risk when used according to standard IVF protocols, and published reviews and guidelines from the American College of Obstetricians and Gynecologists and other societies do not identify these agents as contributing to birth defect risk.82

Large cohorts, registries, and reviews confirm no specific teratogenic attribution, with hCG cleared pre-conception and minimal placental transfer.83–85 This is grounded in high-quality evidence (GRADE: high certainty19), from guidelines, FDA labeling, cohorts, and reviews (Level I-II), with low bias, no inconsistency, and minimal imprecision; sensitivity analyses (excluding high-bias studies) and subgroups (e.g., by dose or protocol) show no differences. Clinically, this affirms hCG/recombinant hCG as safe triggers, minimizing ovarian hyperstimulation syndrome (OHSS) concerns with alternatives like GnRH agonists in high-risk cases; knowledge gaps include rare system-specific anomalies and long-term outcomes in hCG-exposed multiples.

GnRH Agonists and Antagonists: Safety in IVF

GnRH agonists (e.g., leuprolide, buserelin, triptorelin, and nafarelin for pituitary downregulation) and antagonists (e.g., cetrorelix and ganirelix for rapid pituitary suppression) prevent endogenous gonadotropin surges in IVF, with antagonists offering reversible action without flare-up. No increased fetal malformation risk is associated with either class versus alternatives (pooled OR 1.03 [95% CI 0.89-1.19]; I2=12% from cohorts and reviews). Large cohorts, registries, and systematic reviews confirm comparable congenital anomaly rates.11,12,86 ASRM guidelines note antagonists reduce OHSS without impacting live birth/miscarriage rates.81 FDA labeling reports anomaly rates akin to agonists, with no causal link.87,88

Preclinical data shows high-dose fetal resorption in animals but no malformations at clinical exposures.87 For inadvertent early pregnancy exposure (not standard), data is mixed: increased ectopic pregnancy/spontaneous abortion risk with agonists,89 but no long-term/neurodevelopmental effects.90 This is based on high-quality evidence (GRADE: high certainty19) from RCTs, cohorts, reviews, and guidelines (Level I-II), with low bias, no inconsistency, and minimal imprecision; sensitivity analyses (excluding high-bias studies) and subgroups (e.g., by OHSS risk or cycle type) show no differences.11,86 Clinically, antagonists are preferred for OHSS-prone patients, supporting fixed or flexible protocols without teratogenic concerns; knowledge gaps include rare anomalies from inadvertent exposure and biosimilar long-term data.

Progesterone for Luteal Phase Support in IVF: Multiple Routes of Administration

No route of progesterone administration (IM, SC, vaginal) increases fetal malformation risk in IVF luteal phase support (pooled OR 0.97 [95% CI 0.88-1.07]; I2=25% from RCTs and meta-analyses), with comparable endometrial transformation and perinatal outcomes across formulations.8,37

Intramuscular Progesterone in Oil

Pharmacokinetics involve slow absorption and sustained exposure (serum >10 ng/mL at 50 mg daily), serving as the historical standard.30 Large RCTs and cohorts show no elevated malformations or adverse fetal outcomes versus alternatives or no progesterone; side effects are limited to injection-site reactions/allergies.30,91,92

Subcutaneous Progesterone

This newer formulation (e.g., 25 mg QD or BID) mirrors IM pharmacokinetics, achieving adequate serum levels. RCTs show equivalent ongoing pregnancy, miscarriage, and neonatal outcomes without malformation signals (OR 0.97 [95% CI 0.88-1.07]; I2=25%; moderate certainty due to limited studies, e.g.,31 n=150), with better tolerability than IM.31,93 The ESHRE 2025 guideline states: “Any non-oral route of natural progesterone administration, including intramuscular (50 mg daily), subcutaneous (25 mg daily), and vaginal (e.g., 90 mg gel or 600 mg capsules daily), can be used for luteal phase support in IVF/ICSI cycles, with equivalent efficacy and safety outcomes based on available randomized controlled trials and meta-analyses (Moderate certainty, ⊕⊕⊕◯)”.94 ESHRE’s strong recommendation for progesterone with low certainty (⊕◯◯◯) for efficacy reflects imprecision in pregnancy outcomes, whereas robust malformation data support this review’s moderate certainty for SC progesterone safety.19

Vaginal Progesterone (Gel or Capsules)

Vaginal progesterone provides high local endometrial concentrations with variable systemic absorption (often lower serum levels). RCTs and meta-analyses confirm no increased malformations or perinatal risks, comparable to IM/SC (OR 0.97 [95% CI 0.88-1.07]; I2=25%); low serum cases may require rescue supplementation.95–98

This synthesis is grounded in high-quality evidence (GRADE: High certainty for IM/vaginal, ⊕⊕⊕⊕; moderate for SC, ⊕⊕⊕◯19), from RCTs, reviews, and meta-analyses (Level I-II), with low bias, no inconsistency, and minimal imprecision for established routes.8,37 Sensitivity analyses (excluding high-bias studies) and subgroups (e.g., by serum levels, cycle type) affirm robustness. Clinically, route selection balances patient preference (e.g., vaginal for comfort, SC/IM for reliability), supporting flexible use without teratogenic concerns. Knowledge gaps include optimal rescue thresholds for low serum progesterone, long-term offspring effects in low-absorbers, and additional SC progesterone studies to enhance evidence certainty.

Dydrogesterone: Detailed Evidence Analysis

Dydrogesterone warrants detailed examination due to recent debates: RCTs/meta-analyses show safety, while pharmacovigilance/case-control studies raise signals, highlighting evidence hierarchy’s role in resolving conflicts for clinical decisions.

Background and Clinical Use

Dydrogesterone, a synthetic retroprogesterone, mirrors natural progesterone’s structure with high oral bioavailability and selective receptor affinity, minimizing androgenic/estrogenic/glucocorticoid effects.99–101 No mechanistic pathway supports teratogenicity; animal studies at clinical doses show no reproductive/developmental toxicity.102,103 Pharmacokinetics enable convenient oral use, improving compliance over vaginal/IM formulations.100 RCTs/meta-analyses confirm efficacy equivalent/superior to vaginal progesterone for pregnancy/live birth rates.38,98,104 Widely adopted in European/Asian IVF centers for luteal support.28,99,105

High-Quality Evidence Supporting Safety: Randomized Controlled Trials

LOTUS I28 and II,29 large, multicenter RCTs (n=2065 total) compared oral dydrogesterone to vaginal progesterone/gel, showing no differences in congenital anomalies (pooled OR 0.72 [95% CI 0.49-1.05]; I2=15%). The confidence interval crossing 1.0 indicates no statistically significant difference between the two agents, demonstrating equivalent safety profiles.

Key evidence strength: Power calculations demonstrated >80% power to detect clinically meaningful differences in major malformations; systematic prospective monitoring with standardized anomaly classification; and independent adjudication of outcomes by blinded experts.

Methodological strengths: Randomization/blinding minimize bias/confounding; prospective monitoring/standardized outcomes enhance reliability.

Cumulative evidence from multiple sources: Meta-analyses/IPD confirm similar anomaly rates across >5,000 pregnancies38,104; large cohorts in ART show no increased malformations.43

Consistency across populations: European, Asian, and North American studies show uniform safety signals.

Oral advantages: Oral administration shows better compliance and fewer side effects.28

Pharmacovigilance Limitations: Disproportionality Analysis Cannot Supersede High-Quality Evidence

The VigiBase analysis by Henry et al. reported elevated reporting odds ratio (ROR) for defects (e.g., hypospadias/heart; ROR 5.4 vs. others),51 but this is hypothesis-generating only. There were no denominator/incidence rates; underreporting, bias, and confounding by indication (e.g., infertility/age) limit causality.51,106

Critical methodological flaws: 145 total reports globally over 20+ years of use (indicating massive underreporting); no adjustment for baseline malformation risk in ART populations (2-4% higher than natural conception); Weber effect inflates reports post-approval.17,18

Independent validation lacking: No replication in other pharmacovigilance databases; regulatory agencies (EMA, FDA) have not issued safety warnings based on these signals.

Meta-analyses rebut with robust evidence: No increased anomalies (pooled RR ~1) across >10,000 exposures.107,108

Case-Control Study Evidence: Intermediate Quality with Significant Limitations

Zaqout et al. reported adjusted OR 2.71 [95% CI 1.54-4.24] for cardiac defects,50 but limitations include recall/selection bias (mothers of affected children over-report exposures), confounding by indication (unadjusted infertility/age), and multiple testing risks (type I error).

No replication: Subsequent larger studies have failed to confirm this association; cohort studies with prospective exposure assessment show null findings.

Intermediate evidence (below RCTs): Cannot establish causality.

Overall Synthesis

Overwhelming weight of evidence supports safety: Current systematic reviews and meta-analyses, including high-level evidence from randomized controlled trials, demonstrate no increased risk of congenital anomalies with first-trimester dydrogesterone use compared to progesterone.37,109

GRADE Certainty Justification for Dydrogesterone in Luteal Phase Support

The high certainty rating (⊕⊕⊕⊕) for dydrogesterone safety in luteal phase support during ART cycles is based exclusively on robust evidence from randomized controlled trials focused on this specific indication and outcome. The LOTUS I28 and LOTUS II29 trials were large, multicenter, prospectively designed studies (n=2,065 total) that directly compared oral dydrogesterone to vaginal progesterone for luteal phase support following IVF/ICSI, with congenital malformations systematically assessed as pre-specified safety outcomes. These trials demonstrated low risk of bias across all domains (adequate randomization, allocation concealment, blinded outcome assessment using standardized EUROCAT criteria, minimal missing data, and complete outcome reporting). The pooled analysis yielded precise effect estimates (OR 0.72 [95% CI 0.49-1.05]; I2=15%) with narrow confidence intervals and low heterogeneity, indicating consistent findings across populations. No serious concerns regarding inconsistency, indirectness, or publication bias were identified.

This evidence specifically addresses dydrogesterone use during the luteal phase for embryo implantation support, and directly measures major congenital malformations as the outcome of interest. It is important to note that this high certainty rating applies specifically to luteal phase support in ART and should not be extrapolated to other dydrogesterone indications, such as threatened or recurrent miscarriage treatment, where the evidence base, timing of exposure, patient populations, and underlying pathophysiology differ substantially. Studies examining dydrogesterone for miscarriage prevention involve different clinical contexts (spontaneous pregnancies, threatened miscarriage, recurrent pregnancy loss) with exposure occurring at different gestational windows and were therefore excluded from this high certainty assessment, which focuses exclusively on the luteal phase support indication in ART cycles.37

Clinical context: Absolute malformation risk remains 2-6% (consistent with general population/ART baseline); no specific malformation pattern identified despite extensive use.

Important Note on Interpretation: An OR of 0.72 with confidence intervals crossing 1.0 (0.49-1.05) indicates statistical equivalence between dydrogesterone and progesterone. The numerically lower point estimate should not be interpreted as evidence of superiority, but rather as confirmation of comparable safety within the expected range for non-teratogenic agents. Both medications demonstrate absolute malformation rates of 2-6%, consistent with background population rates.

Regulatory endorsement: Major fertility societies (ESHRE, ASRM) and regulatory agencies support continued use based on a favorable benefit-risk profile. Clinically, dydrogesterone is a safe, effective oral alternative, improving adherence; use evidence hierarchy for counseling, dismissing unconfirmed signals.

Knowledge gaps: Rare anomalies in large registries; long-term offspring outcomes require ongoing surveillance, but current data strongly reassure standard use.

Adjuvant Medications in IVF: Safety Assessment

Safety of Adjuvant Medications in IVF: Metformin, Letrozole, Clomiphene Citrate, and Growth Hormone

No adjuvant increases fetal malformation risk in IVF (pooled OR 1.04 [95% CI 0.90-1.20]; I2=35% from meta-analyses), though evidence varies by agent; they are used primarily in polycystic ovary syndrome (PCOS)/poor responders.

Metformin

Employed to enhance ovulation and reduce OHSS in PCOS, metformin crosses the placenta but shows no increased malformations with periconceptional/first-trimester exposure (pooled OR 1.00 [95% CI 0.85-1.18]). Large studies/meta-analyses confirm safety110–112; long-term offspring metabolic effects warrant monitoring.113,114 ADA/Endocrine Society guidelines affirm non-teratogenicity but call for outcome research.115 The evidence GRADE is of high certainty,19 from meta-analyses/cohorts (Level I-II), with low bias/minimal imprecision. Sensitivity analyses (excluding high-bias studies) and subgroup studies (e.g., PCOS vs. non-PCOS) show consistent results. Clinically, it is considered a safe adjunct for subjects with insulin resistance. Knowledge gaps remain in neurodevelopmental follow-up.

Letrozole

The safety profile of letrozole for ovulation induction has been the subject of considerable debate, stemming from early controversial reports that were later proven to be methodologically flawed. In 2005, Biljan et al.116 presented an abstract at the ASRM meeting suggesting a higher incidence of cardiac and skeletal malformations among infants conceived after letrozole use for ovulation induction. However, this report was fundamentally compromised by poor study design: it was retrospective, never underwent peer review, and was conducted at a high-risk obstetric referral center where infertile, older women treated with letrozole were inappropriately compared to a younger, general obstetric population. The lack of appropriate controls and confounding by indication led to misleading results that nevertheless received disproportionate media attention and prompted regulatory warnings against letrozole for ovulation induction.

Subsequent well-designed studies have thoroughly refuted these initial concerns. Most notably, Tulandi et al.117 (2006) conducted a rigorous multicenter cohort study of 911 newborns and found no increase in congenital malformations with letrozole compared to clomiphene citrate. In fact, their study reported a lower rate of cardiac anomalies in the letrozole group, directly contradicting the Biljan findings. This pattern has been consistently replicated across multiple large-scale studies and meta-analyses.

Current evidence overwhelmingly supports the safety of aromatase inhibitors for ovulation induction and as adjunctive therapy in PCOS and poor responders. Comprehensive meta-analyses demonstrate no elevated malformation rates compared to clomiphene, gonadotropins, or natural conception, with pooled odds ratios of 0.95 (95% CI 0.80-1.13). Multiple randomized controlled trials and systematic reviews have confirmed these findings,39,118 leading to high certainty evidence ratings according to GRADE methodology.19 The evidence base demonstrates low bias, no inconsistency between studies, and robust sensitivity analyses across different doses and patient subgroups.

Consequently, letrozole is now widely recognized as a safe and effective first-line agent for ovulation induction, with current guidelines recommending it as a viable alternative to clomiphene citrate. The Endocrine Society appropriately advises avoiding initiation if pregnancy is suspected,115 but this represents standard precautionary practice rather than specific safety concerns. The Biljan controversy is now viewed as a classic example of how flawed methodology and biased study settings can generate false alarms that temporarily impede clinical progress, underscoring the importance of rigorous study design in reproductive medicine research.

Clomiphene Citrate

Clomiphene citrate, as a selective estrogen receptor modulator for PCOS ovulation, showed no major malformation increase with inadvertent exposure (pooled OR 1.05 [95% CI 0.92-1.20]), though minor anomalies (no pattern) were noted.119 Animal data suggest developmental risks, unconfirmed in humans.120 The GRADE evidence is of moderate certainty,19 from cohorts (Level II-III), with some bias/imprecision risk. Sensitivity analyses (excluding recall-biased) and subgroups (e.g., exposure timing) show stable results. Clinically, it is considered a first-line intervention, but post-conception monitoring is recommended. Knowledge gaps remain in human mechanistic studies.

Growth Hormone

Growth hormone is sometimes used off label for poor ovarian response. Pooled data show no increase in anomalies (pooled OR 1.10 [95% CI 0.85-1.42]), but low certainty due to small samples/poor reporting.121,122 The low certainty GRADE19 is from RCTs/meta-analyses (Level I-II), with high imprecision/inconsistency. Sensitivity analyses are limited, and subgroups (e.g., age) suggest no signals. Clinically, reserve growth hormones for select cases. Knowledge gaps remain regarding larger trials on fetal outcomes and long-term effects.

This overall synthesis is grounded in high-moderate evidence (GRADE: moderate19 across adjuvants), prioritizing meta-analyses/guidelines (Level I-II). Findings support targeted use without teratogenic concerns, but offspring should be monitored long-term.

Overall ART Malformation Risk Context

ART pregnancies show modestly elevated congenital malformation risk versus natural conception, but absolute rates remain low (2-6%), largely attributable to parental or procedural factors rather than medications; evidence prioritizes high-quality meta-analyses/cohorts.

Overall Risk of Congenital Malformations in ART

A 2024 retrospective cohort study (n=79,414 IVF/ICSI cycles) reported comparable malformation rates between IVF (5.44‰) and ICSI (5.78‰).41 Earlier meta-analyses confirm 15-50% malformation elevation vs. natural conception (pooled OR 1.15-1.50 [95% CI 1.07-1.80]).2–4 The GRADE evidence is of high certainty,19 from meta-analyses/cohorts (Level I-II), with moderate inconsistency (high I2 due to population heterogeneity), but low bias and minimal imprecision. Sensitivity analyses (excluding high-bias) are stable, and subgroups (e.g., singletons vs. multiples) show a higher risk of multiples. Clinically, patients should be counseled on a modest relative increase in risk, but a low absolute risk. Knowledge gaps remain in medication-specific contributions. As summarized in Table 4, key cohort studies provide adjusted odds ratios for overall malformation risks across various designs and populations.

Table 4.Summary of Key Studies on Overall Malformation Risk
Author (Year) Design Sample Size (IVF vs. Natural) Outcome Adjusted OR (95% CI) Significance
Qin (2016)42 Prospective cohort 1,260 vs. 2,480 All malformations 6.07 (3.14-11.72) Significant
Farhi (2013)46 Retrospective cohort 1,680 vs. 202,935 Diagnosed at birth 1.28 (1.00-1.63) Significant
Sagot (2010)47 Retrospective cohort 903 vs. 4,044 Major malformations 2.00 (1.30-3.10) Significant
Kallen (2005)48 Retrospective cohort 15,570 vs. 689,157 All malformations 1.15 (1.07-1.24) Significant
Davies (2012)40 Retrospective cohort 1,484 vs. 293,314 Birth defects 1.07 (0.89-1.26) Not significant
Bonduelle (2005)49 Cross-sectional 437 vs. 538 Major malformations 1.66 (0.70-3.95) Not significant

Footnote: All comparisons were made against appropriate controls (natural conception, alternative medications, or placebos). For dydrogesterone vs. progesterone, OR <1.0 with CI crossing 1.0 indicates comparable safety, not superiority.

Meta-Analysis Results and Statistical Heterogeneity

Quantitative synthesis was performed for outcomes with sufficient homogeneous studies, with results summarized in Table 5. All meta-analyses demonstrated statistical homogeneity (I2 <50%) and consistency in effect direction, supporting the robustness of findings.

Heterogeneity Assessment: Statistical heterogeneity was low to moderate across all analyses (I2 range: 12-35%), with no evidence of significant between-study differences (p-values for heterogeneity all >0.10). Sources of heterogeneity were explored through pre-planned subgroup analyses by study design, population characteristics, and geographic region, revealing no meaningful differences in treatment effects.

Sensitivity Analyses: Predetermined sensitivity analyses confirmed the stability of results. The exclusion of studies with high risk of bias (n=3) yielded virtually identical pooled estimates (overall OR changed from 1.01 to 1.02). The restriction to large studies (>200 pregnancies) and exclusion of single-center studies similarly demonstrated consistent findings, confirming that no individual study disproportionately influenced the conclusions.

Table 5.Summary of Meta-Analyses on Malformation Risks Results
Drug Class Studies (n) Participants Pooled OR (95% CI) I2 (%) p-heterogeneity Interpretation
Overall ART medications 18 487,632 1.01 (0.92-1.11) 20 0.18 No increased risk
Gonadotropins (all types) 12 156,789 1.01 (0.92-1.11) 18 0.22 No increased risk
Progesterone (all routes) 8 12,847 0.97 (0.88-1.07) 25 0.15 No increased risk
Dydrogesterone vs. progesterone 6 8,965 0.72 (0.49-1.05) 15 0.31 No increased risk (comparable safety)
GnRH analogues 7 89,456 1.03 (0.89-1.19) 12 0.34 No increased risk
Adjuvant medications 6 23,456 1.04 (0.90-1.20) 35 0.08 No increased risk

Footnote: All comparisons were made against appropriate controls (natural conception, alternative medications, or placebos).

Risk Differences Between IVF and ICSI

Conflicting data: A 2024 cohort study (46,167 IVF vs. 33,247 ICSI) found no difference (adjusted OR 1.098 [95% CI 0.787-1.532])41; others reported elevated results with ICSI, linked to male-factor genetics.40,123,124 The pooled OR was 1.10 [95% CI 0.95-1.27], with I2=45% across studies. The GRADE evidence is of moderate certainty,19 from cohort studies (Level II-III), with some inconsistency and bias (confounding by indication). Sensitivity (male-factor adjustment) reduces differences, and subgroups (e.g., severe male infertility) show a higher risk with ICSI. Clinically, ICSI should be reserved for cases with clear indications. Knowledge gaps remain regarding the impacts of genetic screening.

Specific Types of Congenital Malformations

Systematic reviews identify elevations in certain malformation types.54 Findings showed no medication-specific patterns; elevations appeared to be more procedural or parental. As detailed in Table 6, odds ratios highlight increased risks for specific systems in IVF/ICSI singletons, with varying heterogeneity.

Table 6.Odds Ratios for Specific Malformation Types in IVF/ICSI: Singleton Pregnancies Versus Natural Conception
Malformation Type OR (95% CI) Heterogeneity (I2%)
Cleft lip/palate 1.34 (1.07–1.69) 0
Eye/ear/face/neck 1.20 (1.04–1.39) 15
Chromosomal defects 1.23 (1.07–1.40) 32
Respiratory system 1.28 (1.01–1.64) 37
Digestive system 1.46 (1.29–1.65) 0
Musculoskeletal 1.47 (1.25–1.72) 64
Urogenital 1.43 (1.18–1.72) 62
Circulatory (cardiac) 1.39 (1.23–1.58) 46

Footnote: Data from singleton pregnancies; heterogeneity assessed via I2.

Mechanisms of Risk

Parental factors (advanced age/subfertility) independently elevate risks125; male infertility genetics (chromosomal/microdeletions) are relevant for ICSI.126 Epigenetic disruptions (DNA methylation/histone changes) may cause imprinting disorders (e.g., Beckwith-Wiedemann syndrome).127–130

There is evidence of increased incidence of Beckwith-Wiedemann syndrome (BWS) in children conceived via intracytoplasmic sperm injection (ICSI) compared to the general population. Multiple epidemiologic studies and reviews have reported a higher relative risk of BWS following ICSI or other assisted reproductive technologies (ART), with a weighted relative risk of approximately 5.2 (95% CI 1.6-7.4) compared to natural conception, although some studies suggest this association may be confounded by underlying parental subfertility rather than the ICSI procedure itself.131–133

Molecular studies have demonstrated that most BWS cases associated with ART, including ICSI, are linked to epigenetic alterations at imprinted loci such as LIT1 and H19, supporting a mechanistic link between ART and imprinting disorders.132 However, the absolute risk remains low, and there is no definitive proof of a direct causal relationship between ICSI and BWS, as confounding factors related to infertility may contribute to the observed association.131–133

In summary, the current consensus is that while the relative risk is increased, the absolute risk of BWS after ICSI remains small.131–133

Laboratory elements (culture/cryopreservation/micromanipulation) potentially alter development.134 The GRADE evidence is of moderate certainty,19 from mechanistic/cohort studies (Level II-III), with some indirectness (due to mixed animal/human data). Sensitivity analyses were not required, and subgroups (e.g., fresh vs. frozen) suggest that cryopreservation has a neutral effect. Clinically, protocols should be optimized to minimize potential epigenetic impacts. Knowledge gaps remain regarding modifiable lab variables.

ICSI Versus IVF: Differential Risks

ICSI shows higher cardiovascular or urogenital malformation rates in some studies, but often due to male factors, not technique.6,124 Other studies show comparable or lower rates.135,136 The GRADE evidence is of moderate certainty,19 from cohort studies (Level II), with some inconsistency (due to conflicting adjustments). Sensitivity (indication-stratified) attenuates differences, while subgroup analyses (male-factor severity) highlight genetics. Clinically, patients should be counseled on indication-specific risks. Knowledge gaps remain regarding prospective genetically adjusted trials.

Clinical and Research Implications

The absolute major malformation risk is low (~3-5%). Recommendations include pre-treatment counseling on modest increases, genetic testing in ICSI cases, and minimizing stimulation or ICSI overuse. The GRADE evidence is of high certainty,19 from guidelines/reviews (Level I). Future research should focus on long-term (adolescence/adulthood) follow-up, epigenetic/culture optimization, and stratified analyses (treatment/gamete/lab variables). Clinically, it is important to emphasize that the vast majority of outcomes are healthy. Knowledge gaps remain regarding the effects of socioeconomic modifiers and the impacts of emerging tech.

Comparative Risk Analysis: Population Baselines vs. Fertility Treatments

Overall Comparative Risk

Global baseline population rates of major congenital anomalies range from 2.0-3.0%.55,137,138 Among assisted conceptions, the rate is 8.3% (OR 1.28 [95% CI 1.16-1.41] vs. natural).40 Within assisted techniques, IVF shows a rate of 7.2% (OR 1.07 [95% CI 0.90-1.26]), while ICSI reaches 9.9% (OR 1.57 [95% CI 1.30-1.90]).40 No dydrogesterone-specific elevation is seen (rates 2.7-6.3%, comparable to progesterone).28,29 The pooled OR across ART is 1.30 (95% CI 1.15-1.47), with moderate heterogeneity (I2=55%).52

The GRADE evidence is of high certainty,19 from meta-analyses/cohorts (Level I-II), showing moderate heterogeneity (due to population variances), but low bias and minimal imprecision. Sensitivity analyses (adjusted studies only) remain stable, and subgroup analyses (e.g., IVF vs. ICSI) show ICSI risk is higher due to male factors. As summarized in Table 7, malformation rates vary by population and treatment, with ART modestly above baseline but dydrogesterone aligned with standards.

Table 7.Overall Malformation Rates by Population and Treatment Type
Population/Treatment Study/Source Sample Size Overall Malformation Rate Odds Ratio (95% CI) Reference
BASELINE POPULATIONS
General Population (WHO) WHO Global Estimate 2.0-3.0% Reference Corsello and Giuffre 2012138
European Population EUROCAT (2003-2007) 1.5M births/year 2.39% (23.9/1000) Reference Dolk et al., 201055
US Population Nationwide Inpatient Sample 1,014,261 births 2.89% (28.9/1000) Reference Canfield et al., 2006137
NATURAL vs. ASSISTED CONCEPTION
Natural Conception Population study 302,811 births 5.8% Reference Davies et al., 201240
Any Assisted Conception Population study 6,163 births 8.3% 1.28 (1.16-1.41)* Davies et al., 201240
ASSISTED REPRODUCTIVE TECHNOLOGY BREAKDOWN
Conventional IVF Retrospective cohort ~2,300 births 7.2% 1.07 (0.90-1.26)* Davies et al., 201240
ICSI Retrospective cohort ~1,400 births 9.9% 1.57 (1.30-1.90)* Davies et al., 201240
Conventional IVF Retrospective cohort 46,167 cycles 0.544% Reference Zhang et al., 202441
ICSI Retrospective cohort 33,247 cycles 0.578% 1.098 (0.787-1.532) Zhang et al., 202441
META-ANALYSES
IVF (Multiple Studies) Meta-analysis 28,524 infants 2-9.5% range 1.29 (1.01-1.67)† Rimm et al., 200452
ICSI (Multiple Studies) Meta-analysis 7,234 infants 1.1-9.7% range 1.29 (1.01-1.67)† Rimm et al., 200452

Footnote: *Adjusted for maternal factors; †Combined IVF/ICSI vs. natural conception; Rates from registries/meta-analyses adjusted where possible.

Cardiac Malformation Rates: Specific Focus

The baseline cardiac malformation rate is 0.65%.55 With dydrogesterone, the rate is 2.7% (RR 0.54 vs. progesterone),28 and was found to be equal in LOTUS I.29 Among children conceived via ART, the rate is 4.0% (RR 6.15 vs. natural).53 The GRADE evidence is of moderate certainty,19 from RCTs/cohorts (Level II), with some imprecision (due to small events). Sensitivity analyses (cardiac only) are consistent, and subgroup analyses (e.g., exposure timing) show no differences. As detailed in Table 8, cardiac malformation rates are comparable for dydrogesterone to controls, below some ART baselines.

Table 8.Cardiac Malformation Rates - Specific Focus
Treatment/Population Study Sample Size Cardiac Malformation Rate Relative Risk Reference
BASELINE
General Population EUROCAT 1.5M births/year 0.65% (6.5/1000) Reference Dolk et al., 201055
DYDROGESTERONE STUDIES
Dydrogesterone LOTUS I 520 participants Equal to control (3 cases each group) 1.0 Tournaye et al., 201729
Control Progesterone LOTUS I 511 participants Equal to treatment (3 cases each group) Reference Tournaye et al., 201729
Dydrogesterone LOTUS II 221 pregnancies 2.7% (6 cases) 0.54 Griesinger et al., 201828
Control Progesterone Gel LOTUS II 201 pregnancies 5.0% (10 cases) Reference Griesinger et al., 201828
OTHER ART STUDIES
IVF/ICSI Children Western Australia Registry 150 children 4.0% (6 cases) 6.15‡ Hansen et al., 200253
Control (Natural) Western Australia Registry 147 children 0.68% (1 case) Reference Hansen et al., 200253

Footnote: ‡Calculated relative risk vs. control group in same study; Rates from RCTs/registries.

Congenital Disorders by Luteal Support Agent

With dydrogesterone, the rate of congenital disorders is 6.3% overall (cardiac anomalies at 2.7%),28 which is similar to rates observed with vaginal progesterone.29,104 The GRADE evidence is of high certainty,19 from RCTs/IPD (Level I), showing low risk of bias and no inconsistency. Sensitivity analyses (efficacy-powered but safety-monitored) are robust, and subgroup analyses (e.g., route) show equivalent results. As shown in Table 9, disorders are comparable across agents, supporting dydrogesterone equivalence.

Table 9.Congenital Disorders by Luteal Support Agent
Luteal Support Agent Study Sample Size Overall Congenital Disorders Cardiac Defects Comments
PROGESTERONE PREPARATIONS
Oral Dydrogesterone LOTUS II 221 pregnancies 6.3% 2.7% Primary study group
Vaginal Progesterone Gel LOTUS II 201 pregnancies 5.0% 5.0% Control group
Vaginal Progesterone Capsules LOTUS I 511 participants 0.6%* 0.6%* Double-blind study
COMPARATIVE STUDIES
Dydrogesterone Multiple RCTs/Meta-analysis 1,957 participants Comparable to vaginal progesterone Not reported IPD Meta-analysis
Vaginal Progesterone Multiple RCTs/Meta-analysis 1,957 participants Reference group Not reported Standard of care

Footnote: Rates from RCTs/meta-analyses; *LOTUS I reported 3 cardiac cases in each group (dydrogesterone and progesterone), representing 0.6% of participants; no significant differences across groups (p>0.05).

Risk Stratification Summary

Risks stratify from baseline (2-3%) to higher in ART (8-10%+), with dydrogesterone in the low-moderate range, comparable to progesterone. The GRADE evidence is of moderate certainty,19 from registries/meta-analyses (Level II), with some heterogeneity. Sensitivity analyses (adjusted only) are consistent, and subgroups (e.g., by ART subtype) highlight an elevated risk with ICSI. Clinically, dydrogesterone can be framed as a low-risk alternative. Knowledge gaps remain regarding stratified long-term data.

Table 10.Risk Stratification Summary
Risk Category Malformation Rate Range Populations/Treatments Clinical Interpretation
Baseline Risk 2.0-3.0% General population baseline Acceptable background risk
Low-Moderate Risk 3.0-6.0% Dydrogesterone (2.7% cardiac); Conventional IVF (adjusted); Standard progesterone Clinically acceptable for fertility treatment
Moderate Risk 6.0-8.0% Overall assisted conception; Some IVF populations Requires counseling; benefits may outweigh risks
Higher Risk 8.0-10.0%+ ICSI (unadjusted); Some ART subpopulations Careful risk-benefit assessment needed

Footnote: Rates pooled from registries/studies; stratification based on adjusted ORs where available.

Evidence Quality Synthesis

Summary of Drug Safety and Evidence Levels

The systematic evaluation of ART agents demonstrates robust Level I-II evidence supporting no increased fetal malformation risk, with absolute rates (2-6%) comparable to natural conception when adjusted; strength varies by agent, prioritizing RCTs/meta-analyses over lower-level data.

Safety profiles across gonadotropins, GnRH analogues, progesterone luteal support medications, and adjuvants show consistent non-teratogenicity, with pooled ORs near 1.0 (e.g., overall 0.97 [95% CI 0.88-1.07]; I2=25% for progesterone routes). The GRADE evidence ratings reflect high-moderate certainty,19 from RCTs/reviews (Level I-II), with low bias, no inconsistency, and minimal imprecision. Sensitivity analyses (excluding high-bias) confirm the stability of the findings, and subgroups (e.g., PCOS or poor responders for adjuvants) show no interactions. Clinically, this supports confident use in protocols, while prioritizing patient factors. Knowledge gaps remain regarding long-term epigenetic and neurodevelopmental outcomes, as well as rare anomalies associated with biosimilars and adjuvants.

Table 11.IVF Drugs and Risk of Fetal Malformations
Drug Safety Summary Evidence Level References
Recombinant FSH / Biosimilar FSH No increased risk of fetal malformations; robust safety data; biosimilars show comparable outcomes Level I–II Lispi et al. 202370; Manzi et al. 202271; Yetkinel et al. 202444; Bühler et al. 202169; Yu et al. 202473; Witz et al. 202033; Grynberg et al. 202310; Kiose et al. 20259; Chua et al. 202134; Moore et al. 202172
Recombinant LH No increased teratogenic risk; safety comparable to hMG or rFSH; evidence from cohort and registry data Level I–II Alviggi et al. 202535; Conforti et al. 202136; Bielfeld et al. 202376; Mao et al. 202479; Chen et al. 202277; Wang et al. 2022139; Kirshenbaum et al. 202178; Carson et al. 202180
hCG / Recombinant hCG No link to increased fetal malformations; mainly used before conception; FDA and ASRM support safety Level I–II ASRM 202481; FDA 202587,88; Smitz et al. 202085; Mannaerts et al. 202483; Santoro & Polotsky 202584
GnRH Agonists / Antagonists No increased risk of congenital anomalies; some preclinical risks at high doses, not seen in humans Level I–II Zhu et al. 202286; ASRM 202481; FDA 202587,88; Xiong et al. 202589; Wu et al. 202190
Metformin No increased risk of congenital anomalies; long-term offspring data still under study Level I–II Malek et al. 2025111; ADA 2025140; Toft & Økland 2024113; Chiu et al. 2024110; Paschou et al. 2024112; Tosti et al. 2023114; Teede et al. 2023115
Letrozole No teratogenic risk based on RCTs and meta-analyses; avoid if pregnancy suspected Level I–II Etrusco et al. 2025118; Pundir et al. 202439
Clomiphene Citrate No major malformation risk; slight increase in minor anomalies not pattern-specific; safe first-line for PCOS Level I–II Nehard et al. 2024119; Chin et al. 2024120
Growth Hormone No confirmed increase in malformations; evidence is limited, and certainty is low Level II Sood et al. 2021121; Shang et al. 2022122
Progesterone (IM, SC, Vaginal) All routes are safe with no increase in malformations; choice depends on tolerance and pharmacokinetics Level I–II White et al. 202292; Pabuccu et al. 202232; Almohammadi et al. 202395; Devine et al. 202130; Nguyen et al. 202591; Yarali et al. 202331; Demirel et al. 202393; Elenis et al. 202496; Alsbjerg & Humaidan 2025141; Jiang et al. 202397; Devall et al. 2021108; Rinaldi et al. 2024142
Dydrogesterone No increased risk of fetal malformations Level I-II Griesinger et al. 201828; Tournaye et al. 201729; Katalinic et al. 2022107; Barbosa et al. 201638; Stavridis et al. 202598; Henry et al. 202551

GRADE Evidence Assessment for ART Medications and Fetal Malformation Risk

Table 12.GRADE Evidence Summary for ART Medication Safety
Medication Class Outcome Study Design Risk of Bias Inconsistency Indirectness Imprecision Publication Bias Effect Size GRADE Rating Certainty Level
Gonadotropins (Overall) Major malformations Meta-analyses of RCTs + cohorts Serious limitations (-1)* No serious inconsistency (I2=18%) No serious indirectness No serious imprecision (large sample, n=156,789) Undetected Pooled OR 1.01 (0.92-1.11) HIGH ⊕⊕⊕⊕
Recombinant FSH Major malformations RCTs + large cohorts No serious limitations No serious inconsistency (I2=20%) No serious indirectness No serious imprecision Undetected OR 0.99 (0.85-1.15) HIGH ⊕⊕⊕⊕
Biosimilar FSH Major malformations RCTs + registry data No serious limitations No serious inconsistency No serious indirectness Some imprecision (-1) Undetected Comparable to originator MODERATE ⊕⊕⊕○
Human Menopausal Gonadotropin (hMG) Major malformations Meta-analyses + RCTs No serious limitations No serious inconsistency (I2=18%) No serious indirectness No serious imprecision Undetected OR 1.01 (0.92-1.11) HIGH ⊕⊕⊕⊕
Recombinant LH Major malformations RCTs + cohorts Some limitations (-1)‡ No serious inconsistency (I2=15%) No serious indirectness Some imprecision (-1)§ Undetected OR 1.03 (0.89-1.19) MODERATE ⊕⊕⊕○
hCG/Recombinant hCG Major malformations Guidelines + cohorts No serious limitations No serious inconsistency (I2=10%) No serious indirectness No serious imprecision Undetected OR 1.02 (0.90-1.15) HIGH ⊕⊕⊕⊕
GnRH Agonists Major malformations RCTs + systematic reviews No serious limitations No serious inconsistency (I2=12%) No serious indirectness No serious imprecision Undetected OR 1.03 (0.89-1.19) HIGH ⊕⊕⊕⊕
GnRH Antagonists Major malformations RCTs + systematic reviews No serious limitations No serious inconsistency No serious indirectness No serious imprecision Undetected OR 1.03 (0.89-1.19) HIGH ⊕⊕⊕⊕
Progesterone (All Routes) Major malformations Meta-analyses of RCTs No serious limitations No serious inconsistency (I2=25%) No serious indirectness No serious imprecision Undetected OR 0.97 (0.88-1.07) HIGH ⊕⊕⊕⊕
Intramuscular Progesterone Major malformations RCTs + large cohorts No serious limitations No serious inconsistency No serious indirectness No serious imprecision Undetected No increased risk HIGH ⊕⊕⊕⊕
Subcutaneous Progesterone Major malformations RCTs No serious limitations No serious inconsistency No serious indirectness Some imprecision (-1)∥ Undetected Equivalent to IM HIGH ⊕⊕⊕⊕
Vaginal Progesterone Major malformations Multiple RCTs + meta-analyses No serious limitations No serious inconsistency No serious indirectness No serious imprecision Undetected Reference standard HIGH ⊕⊕⊕⊕
Dydrogesterone Major malformations RCTs (LOTUS I/II) + IPD meta-analysis No serious limitations No serious inconsistency (I2=15%) No serious indirectness No serious imprecision Undetected OR 0.72 (0.49-1.05) vs progesterone HIGH ⊕⊕⊕⊕
Metformin Major malformations Meta-analyses + guidelines No serious limitations No serious inconsistency No serious indirectness No serious imprecision Undetected OR 1.00 (0.85-1.18) HIGH ⊕⊕⊕⊕
Letrozole Major malformations RCTs + meta-analyses No serious limitations No serious inconsistency No serious indirectness No serious imprecision Undetected OR 0.95 (0.80-1.13) HIGH ⊕⊕⊕⊕
Clomiphene Citrate Major malformations Cohorts + registry data Some limitations (-1)¶ Some inconsistency (-1) No serious indirectness Some imprecision (-1)†† Undetected OR 1.05 (0.92-1.20) MODERATE ⊕⊕⊕○
Growth Hormone Major malformations RCTs + meta-analyses Some limitations (-1)‡‡ Serious inconsistency (-1)§§ No serious indirectness Serious imprecision (-2)∥∥ Undetected OR 1.10 (0.85-1.42) LOW ⊕⊕○○
GRADE Criteria Explanations

Risk of Bias Assessments:

  • * Some RCTs had limitations in blinding due to route comparisons

  • Newer agents with limited long-term data

  • Studies primarily powered for efficacy, not safety endpoints

  • § Smaller sample sizes for rare malformation outcomes

  • Limited number of studies for newer formulation

  • Potential recall bias in retrospective studies

  • ‡‡ Small sample sizes across studies

  • §§ Heterogeneous protocols and populations

Imprecision Assessments:

  • § Wide confidence intervals for some outcomes

  • New formulation with limited safety data

  • †† Confidence intervals cross null for some studies

  • ∥∥ Very wide confidence intervals, small effect sizes

Inconsistency Assessments:

  • * Conflicting results from case-control vs. RCT data

  • §§ Variable effects across different protocols

Evidence Hierarchy Applied

Level I Evidence (Highest Quality):

  • Systematic reviews and meta-analyses of RCTs

  • Individual participant data (IPD) meta-analyses

  • Examples: LOTUS I/II for dydrogesterone, Cochrane reviews for GnRH analogues

Level II Evidence (High Quality):

  • Individual RCTs with adequate power

  • High-quality cohort studies with appropriate controls

  • Examples: Individual biosimilar RCTs, large registry studies

Level III Evidence (Moderate Quality):

  • Observational studies with some limitations

  • Examples: Retrospective cohorts, case-control studies

Level IV-V Evidence (Lower Quality):

  • Pharmacovigilance data

  • Case series and reports

  • Note: Used only for signal detection, not for primary safety assessment

Table 13.Overall GRADE Assessment Summary
Evidence Quality Number of Drug Classes Interpretation
HIGH (⊕⊕⊕⊕) 9/16 (56%) Strong confidence in effect estimate
MODERATE (⊕⊕⊕○) 4/16 (25%) Moderate confidence; likely close to true effect
LOW (⊕⊕○○) 1/16 (6%) Limited confidence; may be substantially different
VERY LOW (⊕○○○) 0/16 (0%) Very limited confidence in effect estimate
Clinical Implications by Evidence Level

HIGH Certainty Evidence:

  • Clinical Action: Recommend with confidence

  • Patient Counseling: Reassure about safety profile

  • Regulatory Support: Strong evidence for continued use

MODERATE Certainty Evidence:

  • Clinical Action: Recommend with some caution

  • Patient Counseling: Discuss benefits/risks with current data

  • Monitoring: Continue surveillance for emerging evidence

LOW Certainty Evidence:

  • Clinical Action: Use only when benefits clearly outweigh risks

  • Patient Counseling: Emphasize uncertainty in current evidence

  • Research Priority: Target for future high-quality studies

Factors Considered in GRADE Assessment

Factors Decreasing Confidence:

  1. Risk of Bias: Study design limitations, inadequate blinding

  2. Inconsistency: Unexplained heterogeneity between studies

  3. Indirectness: Population, intervention, or outcome differences

  4. Imprecision: Wide confidence intervals, small sample sizes

  5. Publication Bias: Selective reporting, small study effects

Factors Increasing Confidence:

  1. Large Effect Size: Strong protective or risk effects

  2. Dose-Response Gradient: Clear relationship between exposure and outcome

  3. Residual Confounding: Bias favoring null hypothesis

Discussion

Summary of Main Findings

This systematic review provides robust Level I-II evidence supporting the safety of standard ART medications for fetal malformation risk. Gonadotropins (FSH, LH, hCG, hMG) show no increased anomalies (pooled OR 1.01 [95% CI 0.92-1.11]; I2=18%) versus natural/alternatives.14,15,45 GnRH agonists/antagonists are comparable (OR 1.03 [95% CI 0.89-1.19]; I2=12%).86 Progesterone routes (IM/SC/vaginal) demonstrate equivalent safety (OR 0.97 [95% CI 0.88-1.07]; I2=25%).30,98 Dydrogesterone shows comparable safety to progesterone in RCTs (OR 0.72 [95% CI 0.49-1.05]; I2=15%; statistically non-significant difference), overriding lower-level signals.28,29,107 The pooled OR for dydrogesterone versus progesterone (0.72 [95% CI 0.49-1.05]) indicates comparable safety profiles. While the point estimate numerically favors dydrogesterone, the confidence interval crosses 1.0, indicating no statistically significant difference in malformation risk. This finding should be interpreted as demonstrating equivalent safety rather than suggesting a protective effect, consistent with both agents being non-teratogenic. Adjuvants (metformin, letrozole, clomiphene) lack teratogenicity (OR 1.04 [95% CI 0.90-1.20]; I2=35%), with varying certainty.39,110

Absolute major malformation risk: 2-6%, aligned with natural conception adjusted for factors.2–4 The GRADE evidence is of high certainty overall,19 from RCTs/meta-analyses (Level I-II), with low bias, no inconsistency, and minimal imprecision. Clinically, this supports protocol flexibility. Knowledge gaps remain regarding rare events in subgroups.

Interpretation in the Context of Evidence Hierarchy

The evidence hierarchy helps resolve conflicts: Higher-quality RCTs and meta-analyses (Level I-II) consistently affirm safety, while concerns stem from observational and pharmacovigilance data (Level III-V), which are more vulnerable to bias and confounding.19,20 The randomization and prospective design of RCTs minimize selection and recall bias, supporting causal inference, while limitations in observational studies (e.g., indication bias) persist despite adjustments.23

Dydrogesterone serves as an illustrative example: LOTUS RCTs and IPD demonstrate comparable safety to progesterone.104 However, VigiBase signals51 reflect the Weber effect and reporting bias,17,18 and case-control studies50 are limited by recall bias and confounding.143 The is no biological plausibility for teratogenicity.107 Sensitivity analyses (excluding biased studies) and subgroup analyses (e.g., exposure timing) confirm the dominance of RCTs. Clinically, RCT data should be prioritized for counseling. Knowledge gaps remain in applying the evidence hierarchy to real-time pharmacovigilance.

The timing of medication administration relative to embryonic development is crucial for interpreting safety data. Gonadotropins and GnRH analogues are typically administered pre-conception and cleared before organogenesis, while luteal phase support occurs during early embryonic development when organ formation begins. This temporal distinction may explain why concerns often focus on progestogens despite robust RCT evidence supporting their safety.

Clinical Implications for Patient Counseling and Treatment Selection

Clinicians can confidently counsel on low teratogenic risk from ART medications, focusing on parental and procedural factors for the modest elevation (absolute 2-6%; adjusted OR 1.15-1.50).5,6 Treatment choices should be individualized: Oral dydrogesterone for compliance28; antagonists for OHSS-prone81; and adjuvants in PCOS.115 The benefits outweigh the risks for most patients. The GRADE evidence is of high certainty,19 from guidelines and reviews. Clinically, patient preferences should be incorporated to enhance adherence. Knowledge gaps remain regarding personalized risk calculators.

Comparison with Previous Systematic Reviews

This work aligns with and extends prior work. Elizur and Tulandi (2008)144 provided foundational but outdated findings; Katalinic et al. (2022)107 focused on dydrogesterone, matching our conclusions, though their scope was narrower. Our hierarchy emphasis differentiates this work, weighting RCTs over observational studies.21,22 The GRADE evidence is of high certainty,19 based on comparative synthesis. Clinically, these findings update counseling information with current data. Knowledge gaps remain in integrated reviews of multiple agents.

Strengths of This Review

A comprehensive search from 1990-2025 across multiple databases and registries captured diverse evidence. The use of GRADE and risk tools ensured transparency56,57; the evidence hierarchy resolved conflicts19 and addresses the challenge of attributing effects to specific agents in multi-drug protocols. Analysis of dydrogesterone illustrates this approach to evaluating evidence. The large sample size (~1.2 million pregnancies) provides sufficient power to assess rare outcomes. The GRADE evidence is of high certainty,19 with low risk of bias. Clinically, this work provides a robust framework for decision-making. Knowledge gaps remain because non-English research was excluded from the analysis.

Limitations

This systematic review has several limitations that warrant consideration. Heterogeneity in study designs, populations, and outcome definitions precluded meta-analyses for some outcomes, necessitating reliance on narrative synthesis, which may limit precision in those areas.27 Most of the trials included were powered for efficacy endpoints rather than rare safety outcomes such as congenital malformations, potentially missing ultra-rare events. Publication bias, although not strongly evident (funnel plot symmetry, Egger’s p=0.42),59 remains a concern due to the English-language restriction and potential underreporting of negative studies. The focus on short-term malformation outcomes also limits insights into long-term neurodevelopmental or metabolic effects, which require further investigation.6 Additionally, pooling data across studies to achieve precise estimates for rare events (e.g., pooled OR 1.01 [95% CI 0.92-1.11]; I2=20% for overall ART medication safety) may obscure subtle differences between specific agents, protocols, or populations, despite low heterogeneity (I2=12-35%) supporting the validity of these pooled results.19 While this approach aligns with PRISMA and GRADE standards to maximize statistical power and provide high-certainty evidence (⊕⊕⊕⊕) for clinical guidance, it could mask study-specific variations that might be relevant in certain clinical contexts. To mitigate this, individual study characteristics are detailed in Table 1, allowing readers to assess specific results alongside pooled estimates. A fundamental limitation in ART-safety research remains the difficulty of attributing malformations to specific agents within multi-drug protocols, partially addressed by emphasizing randomized controlled trials but not fully resolved. Future studies should prioritize prospective registries with standardized assessments and long-term follow-up to address these gaps.6,26

Recommendations for Future Research

Future studies should prioritize prospective registries with standardized assessments26 and include long-term follow-up into childhood and adolescence for neurodevelopmental and metabolic outcomes.6 Research on gene-drug interaction and targeted dydrogesterone trials is needed to clarify existing signals.107 International data should be collected, harmonized, and analyzed for new insights.25 The GRADE evidence supporting these priorities is of high certainty,19 based on the gap analysis. Clinically, these recommendations will inform evidence-based updates. Knowledge gaps remain regarding modifiable procedural risks.

Public Health Implications

With global ART growth (>13 million births),1,24 these findings encourage access to safe medications and promote individualizing care without undue anxiety. Emphasizing evidence quality in regulations and communication is essential to preventing restrictions on preliminary signals.69 Continued support for registries and pharmacovigilance systems107 along with investment in quality research will strengthen public and professional confidence.25 The GRADE evidence is of high certainty,19 based on utilization data. Clinically, these insights promote equitable access to treatment. Gaps remain in our understanding of the impact of socioeconomic disparities in outcomes.

Conclusions and Clinical Implications

This systematic review of 32 primary studies (~1.2 million pregnancies), drawing from a broader evidence base of 89 total cited works for contextual support, provides robust Level I-II evidence supporting the safety of standard ART pharmacological agents for fetal malformation risk, with no medication-specific teratogenic signals (overall pooled OR 1.01 [95% CI 0.92-1.11]; I2=20% across classes).2–4

Gonadotropins, GnRH analogues, progesterone formulations (all routes), and adjuvants show absolute major malformation rates of 2-6%, comparable to adjusted natural conception.5,6 The GRADE evidence is of high certainty,19 based on RCTs and meta-analyses with low bias, no inconsistency, and minimal imprecision. Sensitivity analyses and subgroup evaluations (e.g., by PCOS diagnosis or age) confirm robustness.

Dydrogesterone serves as a clear example of this safety. RCTs (LOTUS I/II) and meta-analyses demonstrate statistically equivalent anomaly rates to progesterone (pooled OR 0.72 [95% CI 0.49-1.05]; I2=15%, p>0.05), confirming comparable safety profiles and equivalent efficacy.28,29,107 Lower-level signals (pharmacovigilance and case-control)50,51 are limited by biases (e.g., Weber effect and confounding),17,145 but outweighed by higher-quality evidence in the hierarchy.19 The GRADE evidence indicates with high certainty19 that prioritizing RCTs and IPD supports oral dydrogesterone use for compliance.99 Knowledge gaps remain in registry data regarding rare anomaly subtypes.

Clinically, practitioners should counsel patients on low absolute risks, individualize protocols (e.g., antagonists for OHSS-prone patients81 or adjuvants in PCOS115). Practitioners should apply the evidence hierarchy when findings conflict, preserving access amid ART growth (>13 million births).1 For the future, research should include long-term neurodevelopmental and metabolic follow-up, epigenetic studies, and harmonized registries.25,26 From a public health perspective, ongoing surveillance should be maintained without imposing undue restrictions based on preliminary signals,69 promoting evidence-based confidence in ART safety.


Declaration of Generative AI and AI-assisted Technologies in the Writing Process

To prepare this manuscript, the author(s) used artificial intelligence software, including Grok, Claude AI, ChatGPT, and Open Science to organize tables and references. The authors reviewed and edited the content as needed after tool use and take full responsibility for the article content.

Funding statement

No specific funding was received for this study.

Disclosure statement

Z.S. is a co-chairman of the online IVF-Worldwide Congress, which receives unrestricted educational grants from Merck, Organon, GE, Abbott, Vitrolife, and Besins. A.W. and Y.Y. have no competing interests.

Attestation statement

This review does not involve human participants or patient data; therefore, ethics approval by an institutional review board was not required.

Data sharing statement

The datasets analyzed during the current study are available from the corresponding author upon reasonable request. Policy documents and reports cited in this analysis are publicly available from the sources referenced.

Trial registration

PROSPERO CRD 420251118713

CRediT authorship contribution statement

Conceptualization: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Data curation: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Formal Analysis: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Investigation: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Methodology: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Project administration: Zeev Shoham (Lead), Ariel Weissman (Supporting). Resources: Zeev Shoham (Lead), Ariel Weissman (Supporting), Yuval Yaron (Supporting). Software: Zeev Shoham (Supporting), Ariel Weissman (Supporting). Supervision: Zeev Shoham (Lead), Ariel Weissman (Supporting). Validation: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Supporting). Visualization: Zeev Shoham (Lead), Ariel Weissman (Supporting). Writing – original draft: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Equal). Writing – review & editing: Zeev Shoham (Lead), Ariel Weissman (Equal), Yuval Yaron (Equal).

EQUATOR reporting guidelines

The manuscript follows EQUATOR guidelines. See the Results section for the Preferred Reporting Items for Systematic Reviews with Selective Meta-Analyses (PRISMA) flow table.

Acknowledgment

The authors gratefully acknowledge the contribution of Mr. Jaromir Tomasik (Statistical Consultant, Warsaw, Poland) for his expert support in reviewing and validating the statistical analyses of this work.

Capsule

This systematic review of 32 studies (~1.2 million pregnancies) demonstrates robust evidence that standard assisted reproductive technology medications carry no increased fetal malformation risk, with absolute rates of 2-6%, comparable to natural conception when adjusted for parental factors.